The ZeeRex DTD Commentary

24th November 2003

Author: Rob Sanderson <azaroth@liverpool.ac.uk>

1. Structure
2. serverInfo
3. databaseInfo
4. metaInfo
5. indexInfo
        5.1. index
        5.2. indexInfo for Z39.50
        5.3. indexInfo for SRW
6. recordInfo
7. schemaInfo
8. configInfo

1. Structure

The ZeeRex DTD has a very simple structure. The main tag, explain, has six sections within it, only the first of which is required. The sections are:

serverInfo: The basic details required to start a network connection to the described server. This includes such things as the hostname, port, database name and any authentication required.
databaseInfo: This section contains full text information describing the database, its history, who is responsible for it and so forth.
metaInfo: The metaInfo section contains information about the record, such as when it was created and when and where it was aggregated from.
indexInfo: The details regarding how one can interact with the server are recorded here. This information is recorded in the form of 'indexes' and sort keywords.
One of the following, depending on the protocol:
- recordInfo: This section contains elements describing the various record syntaxes and any elementsets that may be available for those record syntaxes under Z39.50.
- schemaInfo: This alternative section describes the XML schemas used in SRW for retrieval and sorting.
configInfo: The data in this section concerns the configuration of the server, such as which protocol features it supports and any default values.

The following describes each of the above divisions in turn, but first there are two attributes on the explain tag that bear discussion.

The authoritative attribute contains either true or false, defaulting to false if not present, and describes whether or not the record should be treated as the final authority on the described database. This may only be set to true if the following conditions are met:

The author of the record has full access to the configuration of the database that they are describing. This implies that databases described by remotely probing them for supported features cannot be authoritative.
The record has not been harvested from another server. Aggregated records may be out of date, and hence cannot be authoritative. If the host server is acting as the sole distribution agent for another server, i.e. as a proxy IR-Explain---1 server for it, then the record may be marked as authoritative, as there is no other location where the record may be more up to date.

The second attribute, id, is to allow for a unique identifier to be assigned to the record. This identifier may then be used for retrieval purposes. The identifier may be modified on aggregation from one server to another.
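
The defaulting rule for the authoritative attribute can be sketched in a few lines of Python. This is only an illustration using the standard library's xml.etree.ElementTree; the function name is_authoritative and the two sample records (including their id values) are hypothetical and not part of ZeeRex.

```python
import xml.etree.ElementTree as ET


def is_authoritative(explain_xml: str) -> bool:
    """Return True only if the explain tag carries authoritative="true"."""
    root = ET.fromstring(explain_xml)
    # A missing attribute falls back to the documented default, "false".
    return root.get("authoritative", "false") == "true"


# Hypothetical records: the id values are made up for illustration.
marked = '<explain authoritative="true" id="example-1"/>'
unmarked = '<explain id="example-2"/>'

print(is_authoritative(marked))    # True
print(is_authoritative(unmarked))  # False
```

An aggregator that harvests a record would be expected to clear the flag before redistributing it, since a harvested copy cannot be authoritative.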

2. serverInfo

The serverInfo element contains the information necessary to start a connection to the described server. It has four attributes and four possible subelements, host, port, database, and authentication.

The protocol attribute on serverInfo is used to record the protocol that should be used to connect to the server. The default value is "Z39.50", but "SRW", "SRU" and "SRW/U" are also possible values. "SRW/U" means that both versions of the web service are available at the same URL endpoint. The version attribute may be used to further specify the highest version of the protocol supported. The transport attribute records the protocol used for transporting SRW or SRU messages. Normally this will be http, the default, but it may be changed to allow https or different protocols. Finally, the wsdl attribute can be included to specify the URL to a wsdl file which describes the server.

The host element contains the address of the server which hosts the record. This address should be given as a name which will resolve to the correct IP address. Only if the server does not have a symbolic name should the numeric IP address be given.

The port on the server to connect to should be given in the port element, in numeric form.

The database element should contain the name of the Z39.50 database which the record describes. For other protocols such as SRW and SRU, this element may be used to contain the remainder of the URL to the server without any preceding '/' character.
The database element may also have two attributes: numRecs, giving the number of records in the database, and lastUpdate, giving the time that the database was last updated. As with all dates in the ZeeRex record, this should be given in the ISO 8601 format (YYYY-MM-DD hh:mm:ss).

The last element in serverInfo is the optional authentication element. If this element is present, but empty, then it implies that authentication is required to connect to the server, but there is no publicly available login. If it contains a string, then this is the token to give in order to authenticate. Otherwise it may contain three elements:

user: The username to supply.
group: The group to supply.
password: The password to supply.

An example serverInfo section might be:

<serverInfo protocol="Z39.50">
  <host>gondolin.hist.liv.ac.uk</host>
  <port>210</port>
  <database>IR-Explain---1</database>
  <authentication>
    <user>azaroth</user>
    <password>squirrelfish</password>
  </authentication>
</serverInfo>
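
A client reading this section has to apply the attribute defaults and distinguish the possible forms of the authentication element. The following Python sketch shows one way this might be done with the standard library. The function name read_server_info, the shape of the returned dictionary, and the string used for the "required, no public login" case are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

SERVER_INFO = """
<serverInfo protocol="Z39.50">
  <host>gondolin.hist.liv.ac.uk</host>
  <port>210</port>
  <database>IR-Explain---1</database>
  <authentication>
    <user>azaroth</user>
    <password>squirrelfish</password>
  </authentication>
</serverInfo>
"""


def read_server_info(xml_text: str) -> dict:
    """Collect connection details from a serverInfo element (hypothetical helper)."""
    info = ET.fromstring(xml_text)
    port_text = info.findtext("port")
    result = {
        # protocol defaults to "Z39.50" and transport to "http" when absent.
        "protocol": info.get("protocol", "Z39.50"),
        "transport": info.get("transport", "http"),
        "host": info.findtext("host", "").strip(),
        "port": int(port_text) if port_text else None,
        "database": info.findtext("database", "").strip(),
    }
    auth = info.find("authentication")
    if auth is None:
        result["auth"] = None  # no authentication element in the record
    elif len(auth) > 0:
        # user, group and password subelements
        result["auth"] = {child.tag: (child.text or "").strip() for child in auth}
    elif (auth.text or "").strip():
        result["auth"] = auth.text.strip()  # a single token string
    else:
        # present but empty: authentication is required, no public login
        result["auth"] = "required, no public login"
    return result


server = read_server_info(SERVER_INFO)
print(server["host"], server["port"], server["database"])
```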

3. databaseInfo

The databaseInfo section contains the full text descriptions of various aspects of the database. All of the elements in this section are repeatable and may have the following attributes:

lang: This is the language of the text within the element. This should be set to allow clients to present the appropriate text to the user based on language preferences. This should be recorded using the two letter code for the language, as defined in RFC 1766.
primary: This attribute, if set to true, denotes that this should be considered the default text to provide if no particular language is requested. For example, the record author might set this on the English version of a text which is also represented in the record in Spanish and German.

The title element in this context represents the title of the database.

description should contain a description of why this database might be of interest. Anything which does not fit under the other fields may also be put into this element. Along with title, this is one of the two fields permissible in the 'B' element set used for Friends and Neighbours; if present, they must appear in this order. The remaining fields, on the other hand, are not part of the 'B' element set and may appear in any order.

The author element should contain the name of the person or organisation to be credited with the creation of the database. On the other hand, the contact element is used to record information on a contact person for the database. This should include at least a name and some form of address, either electronic or postal.

extent is used to describe the completeness of the database, or the range of material that is included in it. For example a database which contained all the emails sent to a mailing list would be considered complete. If this database only maintained a smaller subset of the emails, then it should be noted in this element.

Any information which is considered useful regarding the history of the database may be recorded in the history element. This might include the sponsors for its creation, or significant moments in its history.

The langUsage element is used to record the languages used in the database records (as opposed to the ZeeRex record). If it is wished that this be searchable, then the codes attribute should contain the two-letter language codes separated by spaces.

If there are any restrictions on the usage or availability of the database or its contents then these should be recorded in the restrictions element. For example it might contain information regarding the copyright status of the records, or an indication that the database is only available between certain hours.

If the database concerns particular subjects from a controlled vocabulary then these may be recorded using subject elements within the subjects wrapper. These subjects might be drawn from the Library of Congress Subject Headings or another appropriate thesaurus.

The implementation element contains information concerning the underlying software. It has version and identifier attributes which may be used to identify particular releases. It may contain one or more title elements containing a human readable title to describe the server.

Finally, one or more links to other resources can be recorded, each in a link element within a links wrapper. The type of link is given in the type attribute, for which several values have been defined though more would be welcomed.

The list of types below may be added to at any point; additions are not limited to new versions. Types will not be removed except at version boundaries. The same applies to the configInfo types further on.

Type	Description
www	URL to a native web interface to the database
z39.50	URL to a z39.50 interface
srw	URL to an SRW interface
sru	URL to an SRU interface
oai	URL to an OAI interface
rss	URL to an RSS news feed for the database
icon	URL to a graphical icon for the server

An example databaseInfo section might look something like:


<databaseInfo>
  <title lang="en" primary="true">The Science Fiction Foundation Collection</title>
  <description lang="en" primary="true">
     A database containing bibliographic records describing the books
     and articles in the Science Fiction Foundation's collection held
     at the University of Liverpool.
  </description>
  <author> Andy Sawyer </author>
  <contact> Rob Sanderson, azaroth@liv.ac.uk</contact>
  <langUsage codes="en fr ru">
     The records are in English, French and Russian.
  </langUsage>
  ...
</databaseInfo>
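
The lang and primary attributes suggest a simple selection rule for clients: use the text in the requested language if there is one, otherwise the text marked as primary. A minimal Python sketch of that rule follows. The helper pick_text is hypothetical, the French title is invented for the example, and the final fallback to the first element is an assumption that the commentary does not specify.

```python
import xml.etree.ElementTree as ET

DATABASE_INFO = """
<databaseInfo>
  <title lang="en" primary="true">The Science Fiction Foundation Collection</title>
  <title lang="fr">La collection de la Science Fiction Foundation</title>
</databaseInfo>
"""


def pick_text(parent, tag, preferred_lang=None):
    """Choose the text of one of the repeated `tag` elements under `parent`."""
    candidates = parent.findall(tag)
    # 1. An element in the requested language wins.
    if preferred_lang is not None:
        for element in candidates:
            if element.get("lang") == preferred_lang:
                return (element.text or "").strip()
    # 2. Otherwise fall back to the element flagged primary="true".
    for element in candidates:
        if element.get("primary") == "true":
            return (element.text or "").strip()
    # 3. Last resort (an assumption, not stated in the commentary): the first one.
    return (candidates[0].text or "").strip() if candidates else None


db = ET.fromstring(DATABASE_INFO)
print(pick_text(db, "title", "fr"))
print(pick_text(db, "title", "de"))  # no German title, so the primary text is used
```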

4. metaInfo

This section is quite short, containing a maximum of three elements. These elements describe the essential pieces of information concerning the ZeeRex record itself, as opposed to the database.

The dateModified element contains the date at which the record was created or last modified. This should be updated every time the record is changed by the owner. (An aggregator changing the authoritative attribute does not constitute a change which should be recorded in this element.)

The aggregatedFrom and dateAggregated elements should be present if the record has been harvested from another source. The latter, as the name implies, should contain the date on which the aggregation last took place. The aggregatedFrom element should contain enough information for a third party to retrieve the original, authoritative record. The contents should be in the form of a URL, using the z39.50r specification for Z39.50 servers or the appropriate form for other protocols.
For information about z39.50r URLs, see RFC 2056 (Denenberg, Kunze and Lynch. RFC 2056: Uniform Resource Locators for Z39.50. November 1996, available at http://lcweb.loc.gov/z3950/agency/defns/rfc2056.html).

An example:

<metaInfo>
  <dateModified>2002-03-29 19:00:00</dateModified>
  <aggregatedFrom> z39.50r://gondolin.hist.liv.ac.uk:210/IR-Explain---1?id=ghlau-1;esn=F;rs=XML </aggregatedFrom>
  <dateAggregated>2002-03-30 06:30:00</dateAggregated>
</metaInfo>
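
As a rough illustration of how an aggregator might use these elements, the Python sketch below parses the dates in the form shown in the example and reports whether the record was harvested. The function name read_meta_info and the keys of the returned dictionary are hypothetical. The date format string matches the examples in this commentary; records that use another ISO 8601 variant would need different handling.

```python
from datetime import datetime
import xml.etree.ElementTree as ET

META_INFO = """
<metaInfo>
  <dateModified>2002-03-29 19:00:00</dateModified>
  <aggregatedFrom> z39.50r://gondolin.hist.liv.ac.uk:210/IR-Explain---1?id=ghlau-1;esn=F;rs=XML </aggregatedFrom>
  <dateAggregated>2002-03-30 06:30:00</dateAggregated>
</metaInfo>
"""

# The date form used in the examples in this commentary.
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def read_meta_info(xml_text: str) -> dict:
    """Extract the dates and the aggregation source from a metaInfo element."""
    meta = ET.fromstring(xml_text)

    def date_of(tag):
        text = meta.findtext(tag)
        return datetime.strptime(text.strip(), DATE_FORMAT) if text else None

    source = meta.findtext("aggregatedFrom")
    return {
        "modified": date_of("dateModified"),
        "aggregated": date_of("dateAggregated"),
        "source": source.strip() if source else None,
        # The presence of aggregatedFrom marks the record as harvested.
        "harvested": source is not None,
    }


meta = read_meta_info(META_INFO)
print(meta["harvested"], meta["aggregated"] - meta["modified"])
```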

5. indexInfo

This element is where the features of the database are recorded. In order not to repeat the confusion of 'term lists', interacting with a database is done via 'indexes' which are represented by attribute combinations for Z39.50. The indexInfo element may contain three different elements, set, index and sortKeyword, repeated as many times as needed.

5.1. index

An index represents a single type of search, scan or non-keyword sort that can be performed. This allows for different searches to be given specific titles, using the title element, and for multiple maps to be assigned to a single type of search if there is more than one way to do the same search.

The element has four attributes, the first three being search, scan and sort. These are true/false flags and record whether this particular request is allowed on the index described. If a flag is not present, then the implication is that the creator of the record did not know whether the function was available or not, such as might be the case for a remotely discovered server. The last attribute, id, is similar in function to the same attribute on the top level tag, but applies to the index.

Within the index element may appear one or more titles, which should be used when presenting the index as an option to a user. The protocol level information occurs within one or more map elements. If more than one is given, then these are to be considered alternative ways of accessing exactly the same information. For example, in Z39.50 one might wish to have both BIB1 and BIB2 attribute combinations available for clients which support one or the other. The map element may also have the primary attribute, with the same semantics as elsewhere: the mapping marked as primary should be used unless the client has a particular reason not to.

Each index can also have its own configInfo section, as described below. In this case, the information applies only to this index. This would be used, for example, to say that a particular index supports something which the rest did not.

It is worth describing the rest of the section separately for both Z39.50 and SRW.

5.2. indexInfo for Z39.50

One or more attr elements may occur within each map.
The attr element is used to record a single attribute. The type of it is given in the numeric form in the type attribute of the element, and the value is given as its contents. If the attribute set is not BIB1, then it should be given in the set attribute. An example or two is in order.

This would represent BIB1's USE attribute 1003, e.g. Personal Author.

<map> 
  <attr type="1">1003</attr>
</map>

This might then be represented in BIB2 as:

<map> 
  <attr type="1" set="1.2.840.10003.3.12">3</attr>
  <attr type="2" set="1.2.840.10003.3.18">3</attr>
  <attr type="12" set="1.2.840.10003.3.18">aut</attr>
</map>

A single index might contain both of these mappings to specify that either may be used.

indexInfo may also contain the set element. This may contain zero or more title elements, and has two required attributes, name and identifier. For Z39.50, this can be used to declare a short name for an attribute set to be then later used in the set attributes of attr as described above rather than the full OID every time. For example:

<set name="xd" identifier="1.2.840.10003.3.12"/>
<set name="bib2" identifier="1.2.840.10003.3.18"/>
...
<map> 
  <attr type="1" set="xd">3</attr>
  <attr type="2" set="bib2">3</attr>
  <attr type="12" set="bib2">aut</attr>
</map>

One may also record that the server will accept specific keywords on which to sort result sets. This is the 'sort-field-designator' part of a Sort Request, as opposed to a set of attributes which may be described in an index with the sort attribute set to true. Each sortKeyword element should contain a single keyword which the server will accept to sort upon.

<indexInfo>
  <set name="xd" identifier="1.2.840.10003.3.12"/>
  <set name="bib2" identifier="1.2.840.10003.3.18"/>
  <index search="true" scan="true" sort="true" id="ghlau-mail-1">
    <title lang="en" primary="true">Author (keyword)</title>
    <map primary="true">
      <attr type="1">1003</attr>
    </map>
    <map> 
      <attr type="1" set="xd">3</attr>
      <attr type="2" set="bib2">3</attr>
      <attr type="12" set="bib2">aut</attr>
    </map>
  </index>
  <sortKeyword> private </sortKeyword>        
</indexInfo>
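
To show how the short names declared in set elements relate to the set attributes on attr, the following Python sketch expands every map into (attribute set OID, type, value) triples. The function name resolve_maps and its return shape are hypothetical. The BIB1 OID 1.2.840.10003.3.1 is used when no set attribute is given, following the rule above. Passing an undeclared set value through unchanged, on the grounds that it may already be a full OID, is an assumption.

```python
import xml.etree.ElementTree as ET

BIB1_OID = "1.2.840.10003.3.1"  # attribute set assumed when attr has no set

INDEX_INFO = """
<indexInfo>
  <set name="xd" identifier="1.2.840.10003.3.12"/>
  <set name="bib2" identifier="1.2.840.10003.3.18"/>
  <index search="true" scan="true" sort="true" id="ghlau-mail-1">
    <title lang="en" primary="true">Author (keyword)</title>
    <map primary="true">
      <attr type="1">1003</attr>
    </map>
    <map>
      <attr type="1" set="xd">3</attr>
      <attr type="2" set="bib2">3</attr>
      <attr type="12" set="bib2">aut</attr>
    </map>
  </index>
  <sortKeyword> private </sortKeyword>
</indexInfo>
"""


def resolve_maps(index_info_xml: str) -> dict:
    """Map each index id to its list of maps, with set names expanded to OIDs."""
    info = ET.fromstring(index_info_xml)
    short_names = {s.get("name"): s.get("identifier") for s in info.findall("set")}
    resolved = {}
    for index in info.findall("index"):
        maps = []
        for mapping in index.findall("map"):
            attrs = []
            for attr in mapping.findall("attr"):
                declared = attr.get("set")
                if declared is None:
                    oid = BIB1_OID
                else:
                    # Unknown names pass through: they may already be full OIDs.
                    oid = short_names.get(declared, declared)
                attrs.append((oid, int(attr.get("type")), (attr.text or "").strip()))
            maps.append(attrs)
        resolved[index.get("id")] = maps
    return resolved


for attrs in resolve_maps(INDEX_INFO)["ghlau-mail-1"]:
    print(attrs)
```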

5.3. indexInfo for SRW

SRW has two essential components for searching, context sets and indexes. Each searchable access point has one or more names associated with a set. As more than one name can be associated with the same set, the sets are declared at the beginning of the section.

The set element may contain zero or more title elements, and declares the short form used in CQL in the name attribute and the identifying URI in the identifier attribute.

The index tag is much the same as its use for Z39.50, except that the sort attribute is not used as sorting in SRW is done via XPath. Each name and set combination is listed in a name within a map element. The name element has a set attribute, which contains the short name of the context set. Each index can have more than one mapping, if there are multiple ways to reach it.

Like Z39.50, an index may have one or more titles, distinguished by language.


<indexInfo>
  <set identifier="http://www.loc.gov/zing/cql/dc-indexes/v1.0/" name="dc"/>
  <set identifier="http://www.loc.gov/zing/cql/bath-indexes/v1.0/" name="bath"/>

  <index>
    <title lang="en">Book Title</title>
    <map><name set="dc">title</name></map>
    <map><name set="bath">title</name></map>
  </index>
</indexInfo>
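
For SRW, a client can combine each name with the short name of its context set to obtain the qualified index name used in CQL, such as dc.title. The Python sketch below illustrates this. The function name cql_names is hypothetical, and the pairing of each qualified name with the URI of its context set is just one convenient way to present the result.

```python
import xml.etree.ElementTree as ET

SRW_INDEX_INFO = """
<indexInfo>
  <set identifier="http://www.loc.gov/zing/cql/dc-indexes/v1.0/" name="dc"/>
  <set identifier="http://www.loc.gov/zing/cql/bath-indexes/v1.0/" name="bath"/>

  <index>
    <title lang="en">Book Title</title>
    <map><name set="dc">title</name></map>
    <map><name set="bath">title</name></map>
  </index>
</indexInfo>
"""


def cql_names(index_info_xml: str) -> list:
    """Return (qualified CQL index name, context set URI) pairs."""
    info = ET.fromstring(index_info_xml)
    # Short name -> identifying URI, as declared by the set elements.
    sets = {s.get("name"): s.get("identifier") for s in info.findall("set")}
    names = []
    for index in info.findall("index"):
        for name in index.findall("map/name"):
            short = name.get("set")
            names.append((f"{short}.{(name.text or '').strip()}", sets.get(short)))
    return names


for qualified, uri in cql_names(SRW_INDEX_INFO):
    print(qualified, uri)
```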

6. recordInfo

This section concerns how records may be retrieved from a Z39.50 database. It consists of a list of one or more recordSyntax elements, which may contain any number of elementSet elements.

Each recordSyntax has an identifier attribute which should contain the OID for the record syntax as defined at http://lcweb.loc.gov/z3950/agency/defns/oids.html#5.

Inside a recordSyntax element may appear any number of elementSets, each representing a particular element set that the server supports for that record syntax. The name to use for the elementset is recorded in the name attribute.

Titles may be given to element sets by using the title element within the elementSet element. As for all text intended for users, this is repeatable and has the lang and primary attributes.

An example recordInfo section:

<recordInfo>
  <recordSyntax identifier="1.2.840.10003.5.109.10">
    <elementSet name="F">
      <title lang="en" primary="true">Full XML Record</title>
    </elementSet>
  </recordSyntax>
</recordInfo>

7. schemaInfo

This element records the XML schemas in use by an SRW server, both for sorting and retrieval.

Each schema is recorded in a schema tag. It has several attributes which record how it can be used and the identifying information about the schema. If the sort attribute is true, then it can be used for sorting. Likewise, the retrieve attribute governs whether it can be used for retrieval. In the same way as the set element in indexInfo, it also has an identifier attribute which contains an identifying URI. The final attribute is location, which is a URL to a copy of the schema itself, for validation and information purposes.

Inside the schema element may be any number of language differentiated title tags.


<schemaInfo>
  <schema identifier="http://www.loc.gov/zing/srw/dcschema/v1.0/"
    location="http://www.loc.gov/zing/srw/dc.xsd" 
    sort="false" retrieve="true">
    <title lang="en">Dublin Core</title>
  </schema>
</schemaInfo>
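
A client might use the sort and retrieve flags to decide which schemas it may request. The Python sketch below lists the identifiers of the schemas usable for a given purpose. The function name schemas_for is hypothetical, and treating a missing flag as "not usable" is an assumption, since the commentary does not state a default.

```python
import xml.etree.ElementTree as ET

SCHEMA_INFO = """
<schemaInfo>
  <schema identifier="http://www.loc.gov/zing/srw/dcschema/v1.0/"
    location="http://www.loc.gov/zing/srw/dc.xsd"
    sort="false" retrieve="true">
    <title lang="en">Dublin Core</title>
  </schema>
</schemaInfo>
"""


def schemas_for(xml_text: str, purpose: str) -> list:
    """Return identifiers of schemas whose `purpose` flag is "true"."""
    if purpose not in ("sort", "retrieve"):
        raise ValueError("purpose must be 'sort' or 'retrieve'")
    root = ET.fromstring(xml_text)
    # Assumption: only an explicit "true" counts; a missing flag is not usable.
    return [
        schema.get("identifier")
        for schema in root.findall("schema")
        if schema.get(purpose) == "true"
    ]


print(schemas_for(SCHEMA_INFO, "retrieve"))
print(schemas_for(SCHEMA_INFO, "sort"))
```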

8. configInfo

This section contains configuration information about how the server is set up. It has three possible tags within it, each being repeatable as many times as required. Each has a type attribute to say what sort of configuration option it is.

The default tag contains a default value set in the server that may be overridden by a specific request. For example in SRW there are a lot of default values, such as the index, the context set for indexes and the record schema.

If the configuration option is not something that can be changed, then the setting tag is used. Another example from SRW or Z39.50 is the maximum number of records that can be retrieved at once.

Finally if the information is just that the server supports a particular feature of the protocol, then the supports element is used. Some examples of this are proximity, sort requests, or record element selection.

Type	Description
numberOfRecords	The default number of records that a server will return at once
contextSet	The default context set
index	The default index
relation	The default relation
sortSchema	The default schema used in sorting, in short name form
retrieveSchema	The default schema used for retrieved records
resultSetTTL	Default number of seconds that a result set will be maintained for
stylesheet	Default stylesheet URL, or if stylesheets are supported
recordPacking	Default record packing returned (string or xml)
numberOfTerms	Default number of terms to be returned in scan
maximumRecords	The maximum number of records that a server will return at once
proximity	Does the server support proximity (Empty)
resultSets	Does the server support result sets (Empty)
relation	A relation supported by the server or index
relationModifier	A relation modifier supported by the server or index
booleanModifier	A boolean modifier supported by the server or index
sort	Does the server support sort
sortModifier	A supported parameter for sort (ascending, missingValue, caseSensitive)
maskingCharacter	A masking character supported (* or ?)
anchoring	Is anchoring supported? (^ character)
emptyTerm	Are empty terms supported (Empty)
proximityModifier	A proximity modifier supported by the server or index (relation, distance, unit, ordering)
extraSearchData	A type of extraRequestData available in the searchRetrieveRequest. The extra*Data fields are represented as two space separated words, the first the identifier for the extension and the second the individual element name from the extension. If there is only one word, then it is the extension id and all elements from within are supported.
extraScanData	A type of extraRequestData available in the scanRequest
extraExplainData	A type of extraRequestData available in the explainRequest
profile	The URI identifier of a supported profile
recordXPath	Is XPath retrieval supported?
scan	Is the Scan operation supported?
version	An older version of the protocol supported, as opposed to the one listed in serverInfo/@version


<configInfo>
  <default type="numberOfRecords">1</default>
  <setting type="maximumRecords">10</setting>
  <supports type="proximity"/>
  <supports type="relationModifier">stem</supports>
</configInfo>
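
The three tags can be read into a simple lookup structure, as in the Python sketch below. The function name read_config and the nested dictionary layout are hypothetical. Values are collected into lists because a type such as relationModifier may be repeated, and empty elements such as proximity are stored as empty strings.

```python
import xml.etree.ElementTree as ET

CONFIG_INFO = """
<configInfo>
  <default type="numberOfRecords">1</default>
  <setting type="maximumRecords">10</setting>
  <supports type="proximity"/>
  <supports type="relationModifier">stem</supports>
</configInfo>
"""


def read_config(xml_text: str) -> dict:
    """Group configInfo entries by tag name, then by their type attribute."""
    config = {"default": {}, "setting": {}, "supports": {}}
    for element in ET.fromstring(xml_text):
        if element.tag in config:
            value = (element.text or "").strip()
            # A type may repeat, so each type maps to a list of values.
            config[element.tag].setdefault(element.get("type"), []).append(value)
    return config


config = read_config(CONFIG_INFO)
maximum = int(config["setting"]["maximumRecords"][0])
requested = 50  # hypothetical number of records a client would like
print(min(requested, maximum))
print("proximity" in config["supports"])
```

Here the client caps its request at the server's fixed maximumRecords setting rather than at the overridable numberOfRecords default.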

Feedback to <mike@indexdata.com> is welcome!