Distributed Generic Information Retrieval
Initial Specification and Implementation
Requirements and Assumptions
As participating members of a TDWG subgroup, we aim to contribute to the
specification of a protocol for retrieving
structured data from multiple, heterogeneous databases. In this project,
we intend to inform the protocol specification by developing the software
that implements the protocol. The purpose of this document is to record
the understood requirements and assumptions we have for the protocol and
- Protocol
- the definition of message formats (i.e. requests and responses) by
which components must communicate
- Provider
- an application that makes structured data available to compliant portals
- Provider Metadata
- a structured description of a provider's database, including things
such as: description of the database, URI to federation schema, supported
columns, supported operation types, and summary data (such as geographic
and/or taxonomic index info in our case)
- Portal
- an application that communicates with multiple providers and performs
operations to retrieve and integrate data (and metadata); functions
as the point of access for users
- Registry
- a centralized (public) repository of available providers (and perhaps
- Project Purpose & Goals
The initial purpose and scope of this project is to support distributed
data retrieval across a loosely coupled federation(s) of biological collection
databases. Many such databases exist (perhaps > 1,000) and a growing
subset (> 100) have been made publicly available via the Web. Several
client-server systems exist that allow a user to query several databases
at once, but the protocols, semantics and software are all tightly coupled
in each of these systems. There is no standard and/or unified method to
do distributed queries. This project hopes to establish an open standard
and lay the groundwork for a generic protocol capable of supporting many
communities, without regard to discipline or domain (data semantics). Our
design goals are:
- to use open protocols and standards, such as HTTP, XML, and UDDI, to
leverage existing and emerging IT infrastructure;
- to de-couple the protocol, software and semantics; [Portal and provider
software can be developed independently. We expect each portal to cater
toward different (sub)communities and data integration functions (e.g.,
collection data with geographic layers). Different implementations of
providers and portals may be targeted at different operating systems.]
- to automate the establishment of a new provider as much as possible;
automatable tasks include installation of provider software, testing,
and registration of the provider in a centralized, global registry.
- Protocol and Components
- Assumes each community will provide both a federation schema and
summary (metadata) schema specified using XML-Schema.
- Assumes the federation schema defines a core set of columns that
must be supported by providers. (The core subset for a collection
object record is currently: ProviderIP, InstitutionCode, CollectionCode,
- Must define format of message to obtain metadata information.
- Must define format of message to return metadata information.
- Must define the core set of query operations (current required operation:
exact match) and should define the set of optional query operations.
- Must define syntax to represent query (as in collections db query).
- Must define the core set of result set option types (e.g. count?,
brief, full).
- May allow for the ability to support a variety of federation schemas
(e.g. by taxonomic discipline).
- Must define format of message that encapsulates an entire request
for a search (including the target, query options and query).
- Must define format of query results.
- Must define format of error messages and return codes. This is expected
to be an enumerated list of error codes/responses.
- Must define format of message that encapsulates an entire response
from a search (including the result set, error messages, and return
- May define format of messages to perform heartbeat checks (i.e.
status) on providers.
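As an illustration of the requirement that one message encapsulate an entire search request (target, query options and query), the following is a minimal sketch in Python. The element and attribute names (request, target, options, resultType, start, limit, exactMatch) are assumptions made for this example only; the protocol has not yet defined a message vocabulary.

```python
import xml.etree.ElementTree as ET


def build_search_request(target, column, value, result_type="brief",
                         start=0, limit=10):
    """Build one complete search request: target, query options, and query.

    All element and attribute names are hypothetical placeholders.
    """
    request = ET.Element("request", {"type": "search"})
    ET.SubElement(request, "target").text = target
    options = ET.SubElement(request, "options")
    ET.SubElement(options, "resultType").text = result_type
    ET.SubElement(options, "start").text = str(start)
    ET.SubElement(options, "limit").text = str(limit)
    query = ET.SubElement(request, "query")
    # Exact match is the only operation currently listed as required.
    match = ET.SubElement(query, "exactMatch", {"column": column})
    match.text = value
    return ET.tostring(request, encoding="unicode")


def parse_search_request(xml_text):
    """Read the same hypothetical message back into a plain dictionary."""
    root = ET.fromstring(xml_text)
    match = root.find("query/exactMatch")
    return {
        "target": root.findtext("target"),
        "result_type": root.findtext("options/resultType"),
        "start": int(root.findtext("options/start")),
        "limit": int(root.findtext("options/limit")),
        "column": match.get("column"),
        "value": match.text,
    }
```

Keeping the target, options and query in a single message means a provider never has to remember anything from an earlier request in order to answer the current one.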
- Must always communicate with providers via complete messages formatted
according to the protocol.
- Must issue identical request to each provider being queried.
- Should gracefully handle incomplete response messages as a result
of catastrophic failure at the provider level.
- Must be able to handle error conditions returned from providers.
- May timeout requests to providers.
- May request provider metadata information.
- May communicate with registry to discover providers.
- May limit which providers are queried according to metadata.
- May perform heartbeat/status checks on providers.
- Should provide a user interface of some form or another.
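The portal behaviour listed above (identical request to each provider, optional timeouts, graceful handling of errors and incomplete responses) might be sketched as follows. This is only an illustration: each provider is represented by a plain Python callable standing in for an HTTP exchange, and the outcome labels ("ok", "incomplete", "error", "timeout") are assumptions, not part of any defined protocol.

```python
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor, wait


def query_providers(providers, request, timeout_seconds=5.0):
    """Send the identical request to every provider and classify outcomes.

    `providers` maps a provider name to a callable that takes the request
    text and returns the response text (a stand-in for an HTTP POST).
    """
    results = {}
    pool = ThreadPoolExecutor(max_workers=max(1, len(providers)))
    futures = {pool.submit(send, request): name
               for name, send in providers.items()}
    done, not_done = wait(list(futures), timeout=timeout_seconds)
    for future in done:
        name = futures[future]
        error = future.exception()
        if error is not None:
            # The provider (or the transport) raised an error.
            results[name] = ("error", str(error))
            continue
        text = future.result()
        try:
            ET.fromstring(text)  # a complete message must be well-formed XML
            results[name] = ("ok", text)
        except ET.ParseError:
            # For example, a response cut short by a provider failure.
            results[name] = ("incomplete", text)
    for future in not_done:
        future.cancel()
        results[futures[future]] = ("timeout", None)
    # Do not block on providers that are still running.
    pool.shutdown(wait=False)
    return results
```

One slow or failed provider does not prevent the portal from integrating the responses that did arrive; how the portal reports partial results to the user is left open here.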
- Must always communicate with portal via complete messages formatted
according to the protocol.
- Must accept N concurrent requests at any given time.
- May support N collections databases, each of a different type and
with different offerings.
- Must support the core subset of federation schema columns, and optionally
may support additional columns.
- Must support the core subset of operations, and optionally may support
- Must maintain metadata information for retrieval and must timestamp
that data so that changes/updates can be provided to the portal.
- Must be able to return metadata information in the appropriate response
- Must be able to translate query into appropriate form to select
results from a federation schema compliant database.
- Must be able to return results from an operation in the appropriate
- Must treat a result set as an indexed array, such that consecutive
requests will page through results. May return re-indexed data if an
insert or delete occurs between requests.
- Must communicate error conditions back to caller in appropriate
- Recommended to be registered as a provider in the registry (may
be automated or manual).
- May respond to heartbeat/status checks.
- May timeout requests from portals and issue the appropriate error
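Two of the provider requirements above, translating a query into a form suitable for the local database and treating the result set as an indexed array, can be illustrated with a small sketch. It assumes a relational table reachable through Python's sqlite3 module; the table name, the local column names and the COLUMN_MAP mapping are hypothetical.

```python
import sqlite3

# Hypothetical mapping from federation schema columns to local columns.
COLUMN_MAP = {
    "InstitutionCode": "inst_code",
    "CollectionCode": "coll_code",
}


def translate_exact_match(column, value, start=0, limit=10, table="specimens"):
    """Translate an exact-match query into parameterized SQL with paging.

    `table` is provider configuration, not user input, so it is
    interpolated directly; the query value is always passed as a parameter.
    """
    if column not in COLUMN_MAP:
        raise ValueError(f"unsupported column: {column}")
    sql = (f"SELECT * FROM {table} WHERE {COLUMN_MAP[column]} = ? "
           "ORDER BY rowid LIMIT ? OFFSET ?")
    return sql, (value, limit, start)


def fetch_page(conn, column, value, start, limit):
    """Return the slice [start, start + limit) of the matching records."""
    sql, params = translate_exact_match(column, value, start, limit)
    return conn.execute(sql, params).fetchall()
```

Because each request carries its own start index and limit, the provider can page through results without keeping any state between requests. The slice is recomputed on every request, so an insert or delete between two requests can shift the indexes.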
- Must be able to store name, access, and services information on
- Must respond to requests for provider information.
- Must be available to anyone to find providers.
- May restrict who can register as a provider (for security)?
- May also allow for registration of portals for universal discovery.
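The registry requirements above can be made concrete with a small in-memory stand-in. This is not a UDDI client and makes no claim about any UDDI interface; the class names, fields and the optional allow-list of registrants are assumptions introduced only to show the store, find and restrict behaviours.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class ProviderEntry:
    """Name, access and services information for one provider."""
    name: str
    access_url: str
    services: List[str] = field(default_factory=list)


class Registry:
    """Hypothetical in-memory registry; lookups are open to anyone."""

    def __init__(self, allowed_registrants: Optional[Set[str]] = None):
        # None means registration is open; otherwise only listed
        # registrants may add providers.
        self._allowed = allowed_registrants
        self._entries: Dict[str, ProviderEntry] = {}

    def register(self, registrant: str, entry: ProviderEntry) -> None:
        if self._allowed is not None and registrant not in self._allowed:
            raise PermissionError(f"{registrant} may not register providers")
        self._entries[entry.name] = entry

    def find(self, service: Optional[str] = None) -> List[ProviderEntry]:
        """Return all providers, or only those offering `service`."""
        entries = sorted(self._entries.values(), key=lambda e: e.name)
        if service is None:
            return entries
        return [e for e in entries if service in e.services]
```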
- The Protocol will make use of XML over HTTP for at least portal <->
- A federation schema(s) will exist that providers will conform to (at
least a core subset of) for the purposes of standardized queries.
- Federation schema(s) will define data types and units of measure for
- Federation schema(s) will be represented as XML Schema as well as one
or more relations (tables and/or views).
- Federation schemas will be versioned and may exhibit attributes of inheritance,
where additional attributes beyond a generic set can be added for particular
disciplines. For example, the attribute "Depth" isn't used in bird databases
but is commonly used in fish databases.
- Because the protocol may be enhanced over time, it will be versioned,
likely in conjunction with the federation schema.
- Collections data will likely reside in tables/views on relational databases
and will be accessed via DBI/ODBC/JDBC.
- The registry will likely be an existing UDDI registry, such as those
offered freely by IBM and/or Microsoft.
- Interfacing with a UDDI registry is already well defined.
- Collections can be typed (i.e. classified) with 1 to N defined types.
- Providers are not required to maintain any notion of state.
- A provider will be further divided into service-based components; those
suggested for design are the "Query Translation Component" and the "Metadata
Component". Ideally, all such components will have well-defined APIs so
that each component can exist under several implementations and be
pluggable.
- A portal will be further divided into service-based components, such
as the "Query Engine Component". Ideally, all such components will have
well-defined APIs so that each component can exist under several
implementations and be pluggable.
- Access control will not be specified in this version.
- UI requirements are purposely excluded to allow for flexibility and
diverse portal offerings.
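The assumption that federation schemas are versioned and may inherit from a generic set of attributes can be sketched as a small data structure. The class, its fields and the version strings are hypothetical, and the core column list below contains only the columns named in this document, which may not be the complete core set.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class FederationSchema:
    """Hypothetical model of a versioned schema with optional inheritance."""
    name: str
    version: str
    columns: Tuple[str, ...]
    parent: Optional["FederationSchema"] = None

    def all_columns(self) -> Tuple[str, ...]:
        """Inherited columns first, then the columns this schema adds."""
        inherited = self.parent.all_columns() if self.parent else ()
        return inherited + self.columns


# Partial core set, as far as it is listed in this document.
CORE = FederationSchema(
    "core", "1.0", ("ProviderIP", "InstitutionCode", "CollectionCode"))
# A discipline-specific extension: "Depth" applies to fish, not birds.
FISH = FederationSchema("fish", "1.0", ("Depth",), parent=CORE)


def supports_core(schema, supported_columns):
    """Check that a provider supports at least the root (core) columns."""
    root = schema
    while root.parent is not None:
        root = root.parent
    return set(root.columns) <= set(supported_columns)
```

Under this sketch a fish provider that supports only the core columns still conforms, since only the core subset is mandatory; "Depth" is an optional extension that a portal can discover from the provider's metadata.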
- Questions/Issues/Open Items
- What is the defined metadata schema?
- What will we call this thing?
- What are the logging requirements?
- What are the monitoring requirements?
- What are the configuration requirements?
- What are all the error codes to be enumerated?