DiGIR
Distributed Generic Information Retrieval
Initial Specification and Implementation

Requirements and Assumptions

Revision 0.92

Introduction
As participating members of a TDWG subgroup, our goal is to contribute to the specification of a protocol for retrieving structured data from multiple, heterogeneous databases. In this project, we intend to inform the protocol specification by developing the software that implements the protocol. The purpose of this document is to record the understood requirements and assumptions we have for the protocol and initial implementations.

Key Terminology:

Protocol

the definition of message formats (i.e. requests and responses) by which components must communicate

Provider

an application that makes structured data available to compliant portals

Provider Metadata

a structured description of a provider's database, including things such as: description of the database, URI to federation schema, supported columns, supported operation types, and summary data (such as geographic and/or taxonomic index info in our case)

Portal

an application that communicates with multiple providers and performs operations to retrieve and integrate data (and metadata); functions as the point of access for users

Registry

a centralized (public) repository of available providers (and perhaps portals)
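
As an illustration of the Provider Metadata term above, the following is a minimal sketch of how a portal might hold that description in memory. The class name, field names, and the "equals" operation label are assumptions made for this example; the actual metadata schema is still an open item.

```python
from dataclasses import dataclass, field


@dataclass
class ProviderMetadata:
    """Hypothetical in-memory form of one provider's metadata."""
    description: str                      # description of the database
    federation_schema_uri: str            # URI to the federation schema
    supported_columns: list = field(default_factory=list)
    supported_operations: list = field(default_factory=list)
    summary: dict = field(default_factory=dict)  # e.g. geographic/taxonomic index info

    def supports(self, column, operation):
        """True if the provider advertises both the column and the operation."""
        return (column in self.supported_columns
                and operation in self.supported_operations)


# Example values are invented for illustration only.
meta = ProviderMetadata(
    description="Fish collection",
    federation_schema_uri="http://example.org/schema.xsd",
    supported_columns=["InstitutionCode", "CatalogNumber"],
    supported_operations=["equals"],
)
```

A portal could call `meta.supports("Depth", "equals")` to decide whether a provider is worth querying at all.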
Project Purpose & Goals
The initial purpose and scope of this project is to support distributed data retrieval across one or more loosely coupled federations of biological collection databases. Many such databases exist (perhaps > 1,000) and a growing subset (> 100) have been made publicly available via the Web. Several client-server systems exist that allow a user to query several databases at once, but the protocols, semantics and software are all tightly coupled in each of these systems. There is no standard or unified method for performing distributed queries. This project hopes to establish an open standard and lay the groundwork for a generic protocol, capable of supporting many communities, without regard to discipline or domain (data semantics). Our design goals include:
- to use open protocols and standards, such as HTTP, XML, and UDDI to leverage existing and emerging IT infrastructure;
- to de-couple the protocol, software and semantics; [Portal and provider software can be developed independently. We expect each portal to cater to different (sub)communities and data integration functions (e.g., collection data with geographic layers). Different implementations of providers and portals may be targeted for different operating systems.]
- to automate the establishment of a new provider as much as possible; automatable tasks include installation of provider software, testing, and registration of the provider in a centralized, global registry.
Protocol and Components
1. Protocol
  1. Assumes each community will provide both a federation schema and summary (metadata) schema specified using XML-Schema.
  2. Assumes the federation schema defines a core set of columns that must be supported by providers. (The core subset for a collection object record is currently: ProviderIP, InstitutionCode, CollectionCode, CatalogNumber, LastEditedTimeStamp?)
  3. Must define format of message to obtain metadata information.
  4. Must define format of message to return metadata information.
  5. Must define the core set of query operations (current required operation: exact match) and should define the set of optional query operations.
  6. Must define syntax to represent query (as in collections db query).
  7. Must define the core set of result set option types (e.g. count?, brief, full).
  8. May support a variety of federation schemas (e.g. by taxonomic discipline).
  9. Must define format of message that encapsulates an entire request for a search (including the target, query options and query).
  10. Must define format of query results.
  11. Must define format of error messages and return codes. This is expected to be an enumerated list of error codes/responses.
  12. Must define format of message that encapsulates an entire response from a search (including the result set, error messages, and return codes).
  13. May define format of messages to perform heartbeat checks (i.e. status) on providers.
2. Portal
  1. Must always communicate with providers via complete messages formatted according to the protocol.
  2. Must issue an identical request to each provider being queried.
  3. Should gracefully handle incomplete response messages as a result of catastrophic failure at the provider level.
  4. Must be able to handle error conditions returned from providers.
  5. May timeout requests to providers.
  6. May request provider metadata information.
  7. May communicate with registry to discover providers.
  8. May limit which providers are queried according to metadata.
  9. May perform heartbeat/status checks on providers.
  10. Should provide a user interface of some form or another.
3. Provider
  1. Must always communicate with portal via complete messages formatted according to the protocol.
  2. Must accept multiple (N) requests at any given time.
  3. May support multiple (N) collections databases, each of varying type with different offerings.
  4. Must support the core subset of federation schema columns, and optionally may support additional columns.
  5. Must support the core subset of operations, and optionally may support additional operations.
  6. Must maintain metadata information for retrieval and must timestamp that data in order to provide changes/updates to portal.
  7. Must be able to return metadata information in the appropriate response format.
  8. Must be able to translate query into appropriate form to select results from a federation schema compliant database.
  9. Must be able to return results from an operation in the appropriate response format.
  10. Must treat a result set as an indexed array, such that consecutive requests will page through results. May return re-indexed data if records are inserted or deleted between requests.
  11. Must communicate error conditions back to caller in appropriate form.
  12. Recommended to be registered as a provider in the registry (may be automated or manual).
  13. May respond to heartbeat/status checks.
  14. May timeout requests from portals and issue the appropriate error response.
4. Registry
  1. Must be able to store name, access, and services information on providers.
  2. Must respond to requests for provider information.
  3. Must be available to anyone to find providers.
  4. May restrict who can register as a provider (for security)?
  5. May also allow for registration of portals for universal discovery.
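
To make the message requirements above more concrete, here is a minimal sketch of a search request that carries the target, query options, and an exact-match query, and of a reader for a response that carries a return code, error messages, and a result set. Every element and attribute name (`request`, `target`, `options`, `equals`, `returnCode`, `resultSet`, `record`) is an assumption for illustration; the protocol has not yet defined these formats.

```python
import xml.etree.ElementTree as ET


def build_search_request(target, column, value,
                         result_type="brief", start=0, limit=10):
    """Build one complete request: target, query options, and query.

    All element and attribute names here are hypothetical.
    """
    request = ET.Element("request")
    ET.SubElement(request, "target").text = target
    options = ET.SubElement(request, "options")
    options.set("resultType", result_type)   # e.g. brief or full
    options.set("start", str(start))         # index of first record wanted
    options.set("limit", str(limit))         # maximum records to return
    query = ET.SubElement(request, "query")
    equals = ET.SubElement(query, "equals")  # the required exact-match operation
    equals.set("column", column)
    equals.text = value
    return ET.tostring(request, encoding="unicode")


def parse_search_response(xml_text):
    """Return (return_code, error_messages, records) from one response."""
    response = ET.fromstring(xml_text)
    code = response.get("returnCode")
    errors = [e.text for e in response.findall("error")]
    records = [
        {column.tag: column.text for column in record}
        for record in response.findall("resultSet/record")
    ]
    return code, errors, records
```

Because the request is a single self-contained string, a portal can send the same text to every provider it queries, and the `start` and `limit` options let consecutive requests step through an indexed result set.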
Assumptions
1. The Protocol will make use of XML over HTTP for at least portal <-> provider communication.
2. One or more federation schemas will exist, and providers will conform to at least a core subset of a schema for the purposes of standardized queries.
3. Federation schema(s) will define data types and units of measure for each column.
4. Federation schema(s) will be represented as XML Schema as well as one or more relations (tables and/or views).
5. Federation schemas will be versioned and may exhibit attributes of inheritance, where additional attributes beyond a generic set can be added for particular disciplines. For example, the attribute "Depth" isn't used in bird databases, but is commonly used in fish databases.
6. Possible enhancements to the protocol imply that the protocol will be versioned. Likely, the protocol will be versioned in conjunction with the federation schema.
7. Collections data will likely reside in tables/views on relational databases and will be accessed via DBI/ODBC/JDBC.
8. The registry will likely be an existing UDDI registry, such as those offered freely by IBM and/or Microsoft.
9. Interfacing with a UDDI registry is already well defined.
10. Collections can be typed (i.e. classified) with one or more (1 to N) defined types.
11. It is not required for providers to maintain any notion of state.
12. A provider will be further divided into service-based components; suggested ones for design are the "Query Translation Component" and the "Metadata Component". Ideally, all such components will have well defined APIs such that each component could exist under several implementations and will be pluggable.
13. A portal will be further divided into service-based components, such as the "Query Engine Component". Ideally, all such components will have well defined APIs such that each component could exist under several implementations and will be pluggable.
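
The pluggable-component and stateless-provider assumptions above can be sketched together. The example below defines a hypothetical interface for the "Query Translation Component" and one implementation that turns an exact-match query on a federation schema column into parameterized SQL. The class names, the column mapping, and the use of SQLite as a stand-in for a DBI/ODBC/JDBC source are all assumptions for illustration.

```python
import sqlite3
from abc import ABC, abstractmethod


class QueryTranslator(ABC):
    """Hypothetical API for a pluggable Query Translation Component."""

    @abstractmethod
    def translate(self, column, value, start, limit):
        """Return (sql, params) for an exact match on a federation column."""


class SqlExactMatchTranslator(QueryTranslator):
    def __init__(self, table, column_map, order_by):
        # table, column_map values, and order_by come from trusted provider
        # configuration, never from the request itself.
        self.table = table
        self.column_map = column_map  # federation column -> local column
        self.order_by = order_by      # stable order so paging is repeatable

    def translate(self, column, value, start, limit):
        if column not in self.column_map:
            raise KeyError(f"unsupported column: {column}")
        local = self.column_map[column]
        sql = (f"SELECT * FROM {self.table} WHERE {local} = ? "
               f"ORDER BY {self.order_by} LIMIT ? OFFSET ?")
        return sql, (value, limit, start)


# Usage against an in-memory database with invented sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE specimens (inst TEXT, catno INTEGER)")
conn.executemany("INSERT INTO specimens VALUES (?, ?)",
                 [("KU", 1), ("KU", 2), ("KU", 3), ("MVZ", 4)])
translator = SqlExactMatchTranslator(
    "specimens",
    {"InstitutionCode": "inst", "CatalogNumber": "catno"},
    order_by="catno",
)
sql, params = translator.translate("InstitutionCode", "KU", start=1, limit=2)
rows = conn.execute(sql, params).fetchall()
```

Since each request carries its own `start` and `limit`, the provider keeps no state between requests, and a different `QueryTranslator` implementation could be swapped in for a non-SQL data source without changing the caller.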
Exclusions
1. Access control will not be specified in this version.
2. UI requirements are purposely excluded to allow for flexibility and diverse portal offerings.
Questions/Issues/Open Items
1. What is the defined metadata schema?
2. What will we call this thing?
3. What are the logging requirements?
4. What are the monitoring requirements?
5. What are the configuration requirements?
6. What are all the error codes to be enumerated?