DiGIR Schema Readme

Explanation, Usage, and Tips

$Id: schemaReadme.html,v 1.3 2002/03/08 20:55:30 peejinator Exp $

DiGIR request and response formats are specified via a number of XML Schema documents. The primary structure of a request or response is specified in the protocol XML Schema (digir.xsd). For the latest version(s) see:

http://digir.sourceforge.net/prot

The data elements available for searching and returning in a request or response, along with their types, are specified in a federation (or content) schema. A current federation schema, and an excellent example, is the one representing the Darwin Core V2 dataset (darwin2.xsd). For the latest version(s) see:

http://digir.sourceforge.net/fed/

Additionally, predefined record structures for responses can be defined; some examples (brief and full) exist at the above location.

Understanding the DiGIR Schema

Pulling together the DiGIR schema involved making use of a variety of XML Schema features. Let's take a look at some of the less straightforward techniques here:

In the schema you will see a number of abstract elements for various kinds of data. For example:
- searchableData
- alphaSearchableData
- numericSearchableData
- returnableData
- alphaReturnableData
- numericReturnableData
- searchableReturnableData
- alphaSearchableReturnableData
- numericSearchableReturnableData
Because these elements are abstract, they cannot be instantiated. They are provided to be used as substitutionGroups when content data elements are defined in a federation schema, and they act as placeholders in the protocol schema within the filter structure (particularly within COPs).

Data defined within a federation schema can be searchable, returnable, or both. Because XML Schema does not allow an element to belong to more than one substitutionGroup, each combination had to be created as its own element. One might ask why we did not create types and a complex inheritance tree to achieve the sort of typing we desired. The answer is twofold: 1) instance documents would not be as clean without the use of substitutionGroups and 2) XML Schema does not support multiple inheritance.

One can think of the substitutionGroups used here as acting like tag interfaces in Java. Specifically, Java has an interface called Serializable, which objects implement to mark themselves as serializable. Although there is no contract about a supported implementation here, we have, in effect, tagged a data element as searchable, returnable, or both by specifying its substitutionGroup as one of the above. This is an important technique which one might find used elsewhere within the protocol schema.
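
The tagging idea can be sketched as follows. This is only an illustration: the real declaration of the abstract head in digir.xsd may carry more detail, and the digir prefix is assumed to be bound to the protocol schema's namespace.

```xml
<!-- Protocol schema (sketch): an abstract head that can never appear in an instance -->
<xsd:element name="alphaSearchableReturnableData" abstract="true"/>

<!-- Federation schema: "tag" Collector by joining that substitutionGroup -->
<xsd:element name="Collector" type="digir:stringData"
             substitutionGroup="digir:alphaSearchableReturnableData" nillable="true"/>

<!-- Wherever a content model references the head, an instance may carry Collector instead -->
<xsd:element ref="digir:alphaSearchableReturnableData"/>
```

Because the head is abstract, only its concrete members (such as Collector) can appear in an instance document.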

One other note regarding these substitutionGroups concerns the alpha and numeric versions of each. Clearly this is not an exhaustive type set, so I believe these will be somewhat short-lived. The idea behind having "typed" substitutionGroups is to allow each comparison operator (COP) to deal only with data of the sort valid for that particular operator. Consider the LIKE operator, which can only accept alphanumeric data. If no typing were done here, the advantage of parser validation on an instance document would be lost, and validation would have to be done at the provider itself. It's a nice idea but short-lived: as the datasets become more complex and as providers support more (or less) diverse sets of operators for each data element, these two generalized types (alpha and numeric) will not buy much. Eventually, each provider will specify in its own metadata the data elements it supports and the operators supported for each element. At that point, the alpha and numeric versions can be removed and one can just deal with data in a general sense.
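
As a rough illustration of what the typed substitutionGroups buy, consider the request fragments below. The element names filter and like, their casing, and the numeric element darwin:Latitude are assumptions made for this sketch and are not taken from the protocol schema.

```xml
<!-- Would validate: Collector is tagged as alpha searchable data -->
<filter>
   <like>
      <darwin:Collector>Smith</darwin:Collector>
   </like>
</filter>

<!-- Would be rejected by a validating parser if Latitude is tagged as numeric data,
     because like only accepts members of an alpha substitutionGroup -->
<filter>
   <like>
      <darwin:Latitude>45</darwin:Latitude>
   </like>
</filter>
```

The point is that the mismatch is caught when the request is parsed, before any provider code has to inspect it.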

Similar to the data elements above, we've defined the comparison operators (COPs) and made each of them part of substitutionGroups within the larger filter structure (LOP). However, each COP does extend from a COP (or multiCOP) type, and this was important to ensure a common, similar structure for each one (i.e. each contains only one element and also the appropriate attribute).

Also, a few other placeholder elements are defined, to be substituted or used in a request. These include:
- record: to be used as the container in custom (user defined) record structures
- requiredList: to be used as a flag (via substitutionGroup) for an element within a federation schema that is simply a sequence of the data elements that are required for each record within a result set

For the purposes of the multiCOP, a listType (with alpha and numeric versions) is defined. This is necessary to limit the elements within a multiCOP (such as IN) to N occurrences of any one element.
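
A multiCOP in a request might look roughly like the fragment below. The element names in and list are assumptions for illustration; only the idea of N occurrences of one element comes from the text above.

```xml
<!-- Hypothetical IN operator: several values, all for the same element -->
<in>
   <list>
      <darwin:Collector>Smith</darwin:Collector>
      <darwin:Collector>Jones</darwin:Collector>
   </list>
</in>
```

Mixing different elements inside one list (for example a Collector next to a CatalogNumber) is what the listType is meant to rule out.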

A perhaps odd thing one might notice about the schema is the redefinition of many built-in datatypes (string, decimal, dateTime) as complexTypes. This was necessary to add an attribute to each type (currently through the attributeGroup dataAttributes). The advantage is that each element in a federation schema can simply type itself without having to define an additional attribute. Not all types have been defined in this manner, so more may need to be added in the future. There is also a complexData type defined, from which any non-simple content (i.e. nested) data would extend in order to inherit the attribute. Examples can be found below.
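
A minimal sketch of how a built-in type can be wrapped this way is shown here. The type and default chosen for the attribute are assumptions, and the actual definitions in digir.xsd may differ.

```xml
<!-- Sketch only: the attribute's type and default are assumed -->
<xsd:attributeGroup name="dataAttributes">
   <xsd:attribute name="supported" type="xsd:boolean" default="true"/>
</xsd:attributeGroup>

<!-- xsd:string wrapped in a complexType so that it can carry the attribute -->
<xsd:complexType name="stringData">
   <xsd:simpleContent>
      <xsd:extension base="xsd:string">
         <xsd:attributeGroup ref="digir:dataAttributes"/>
      </xsd:extension>
   </xsd:simpleContent>
</xsd:complexType>
```

A simple type cannot carry attributes in XML Schema, which is why the wrapper has to be a complexType with simpleContent.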

Using the DiGIR Schema to create a Federation (content) schema

Given how the protocol schema is defined, all federation schemas must follow certain definition techniques. Here are some tips on creating a federation schema for use with the DiGIR protocol.

All data elements must be of, or extend from, the data types in the protocol schema (i.e. complexData, stringData, decimalData, nonNegativeIntegerData, etc.). This is because the protocol schema defines data elements with an additional attribute called supported. The supported attribute is used in result data to signal that an element was not supported by a provider. For example, here is how one would specify a simple typed data element, like Collector, in a federation schema:
```
<xsd:element name="Collector" type="digir:stringData" substitutionGroup="digir:alphaSearchableReturnableData" nillable="true"/>
```
And here is how a Collector column that is not supported would look in the results:
```
<darwin:Collector supported="false" xsi:nil="true"/>
```
It is important to note that supported and null are two separate and distinct conditions.
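
To make the distinction concrete, here are the three states a Collector element could take in result data. The value J. Smith is purely illustrative.

```xml
<!-- Supported, with a value -->
<darwin:Collector>J. Smith</darwin:Collector>

<!-- Supported, but the value is null for this record -->
<darwin:Collector xsi:nil="true"/>

<!-- Not supported by this provider -->
<darwin:Collector supported="false" xsi:nil="true"/>
```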

The complexData type in the protocol schema is provided so one can specify ComplexContent data in a federation schema. ComplexContent is content that has child elements (i.e. has a nested structure or represents some sort of grouping). Although such data is not present in the first version of the Darwin Core v. 2.0 schema, it is envisioned that it can and will exist in the future. An example of using complexData to create a grouping/nested element such as BoundingBox (which is currently flat data) is:


```
<xsd:group name="BoundingBoxPoints">
   <xsd:sequence>
      <xsd:element name="minLatitude" type="digir:decimalData" nillable="true"/>
      <xsd:element name="maxLatitude" type="digir:decimalData" nillable="true"/>
      <xsd:element name="minLongitude" type="digir:decimalData" nillable="true"/>
      <xsd:element name="maxLongitude" type="digir:decimalData" nillable="true"/>
   </xsd:sequence>
</xsd:group>
<xsd:element name="BoundingBox" substitutionGroup="digir:searchableData" nillable="true">
   <xsd:complexType>
      <xsd:complexContent>
         <xsd:restriction base="digir:complexData">
            <xsd:sequence>
               <xsd:group ref="BoundingBoxPoints" minOccurs="0"/>
            </xsd:sequence>
         </xsd:restriction>
      </xsd:complexContent>
   </xsd:complexType>
</xsd:element>
```

(Note the additional use of a group here. The benefit is that now our schema will allow instance documents to have all the elements of the group or none of the elements of the group, but no partial mix of the points will be valid.)
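
The all-or-none behaviour can be illustrated with two instance fragments. This sketch assumes that the federation schema uses the darwin prefix and that locally declared elements are namespace-qualified (elementFormDefault="qualified"); the coordinate values are made up.

```xml
<!-- Valid: none of the group's elements are present -->
<darwin:BoundingBox/>

<!-- Invalid: only part of the group is present -->
<darwin:BoundingBox>
   <darwin:minLatitude>10.5</darwin:minLatitude>
   <darwin:maxLatitude>12.0</darwin:maxLatitude>
</darwin:BoundingBox>
```

The second fragment fails because the group is a sequence of four required elements that is optional only as a whole.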

All data elements must be defined with the nillable attribute set to true (i.e. nillable="true"). In result data, if a column value is null, the element should be returned with xsi:nil="true" and left empty. Also, as seen above, an element that is not supported in results must be marked with xsi:nil="true" so that it can be left empty. For example, here is how one would specify a nillable data element:
```
<xsd:element name="Collector" type="digir:stringData" substitutionGroup="digir:alphaSearchableReturnableData" nillable="true"/>
```
And how the element looks as null in the results:
```
<darwin:Collector xsi:nil="true"/>
```

All data elements must define a substitutionGroup that tags them as searchable, returnable, or both. For example:
```
<xsd:element name="Collector" type="digir:stringData" substitutionGroup="digir:alphaSearchableReturnableData" nillable="true"/>
```

Data elements that are meant to be instantiated must not be abstract, and no abstract element should be used as a searchable concept.

Required elements (returnable data elements that are required in all result sets) are noted by creating an abstract element with the substitutionGroup digir:requiredList. Because the element is abstract, it can never be instantiated; the substitutionGroup flags it as the requiredList. An example is:


```
<xsd:element name="requiredList" substitutionGroup="digir:requiredList" abstract="true">
   <xsd:complexType>
      <xsd:sequence>
         <xsd:element ref="DateLastModified"/>
         <xsd:element ref="InstitutionCode"/>
         <xsd:element ref="CollectionCode"/>
         <xsd:element ref="CatalogNumber"/>
         <xsd:element ref="ScientificName"/>
      </xsd:sequence>
   </xsd:complexType>
</xsd:element>
```

Every federation schema should define a requiredList in this way, but the list (sequence) may be empty.
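
For a federation with no required elements, the declaration might look like the following sketch, which uses the same element name as the example above.

```xml
<!-- A requiredList that names no required elements -->
<xsd:element name="requiredList" substitutionGroup="digir:requiredList" abstract="true">
   <xsd:complexType>
      <xsd:sequence/>
   </xsd:complexType>
</xsd:element>
```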

Schema Versioning Process

When dealing with any evolving specification, it is important to version it appropriately. Given the distributed nature of this project, a clear process should be followed. The suggested process is:

Use the DiGIR CVS repository on the SourceForge site. The schema documents, for both the protocol and federation schemas, as well as some examples, are in the xml module in CVS. All edits to a schema should be made in a checked-out copy of the document, and commits should be made regularly. Keeping a document checked out with uncommitted changes for an extended period is not recommended, as that can introduce conflicts in a multi-developer environment.
Once you are satisfied with your changes and they have been tested, you are ready to "release" the schema. After the schema(s) are committed to CVS, tag (via cvs tag) the schema, as well as any dependent schemas, with the release name (e.g. beta1, release1.0, etc.). All dependent schemas should use the same tag.
After tagging the files, create a directory with the same name as the tag used (e.g. beta1, release1.0, etc.) in the htdocs (webspace) area of the DiGIR SourceForge site under the appropriate directory (prot for protocol schemas, fed for federation schemas).
Then scp (secure copy) the tagged versions of the schema(s) to the new directory or directories. Snapshots of all released versions will reside here. This location acts as the schemaLocation and should be used as such in other schema or instance documents. Do not remove previous versions of the schemas, as other people may still be referencing them.
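
Once a release is published this way, another schema could pin itself to that snapshot along the following lines. The release name beta1 follows the examples above, the URL is only what the directory layout described here would produce, and the namespace value is a placeholder rather than the real DiGIR protocol namespace.

```xml
<!-- Sketch: import the protocol schema from a released snapshot.
     The namespace below is a placeholder; use the targetNamespace declared in digir.xsd. -->
<xsd:import namespace="urn:example:digir-protocol"
            schemaLocation="http://digir.sourceforge.net/prot/beta1/digir.xsd"/>
```

Pointing at a tagged directory rather than at a moving latest copy means the referencing document keeps validating against the same schema when newer releases appear.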