In Part 1 of “Where the Problems Lie” I focused on some issues that I see with the set of technologies and standards that I have lumped, for simplicity’s sake, under the heading “MARC”. In this post I am passing along issues that my OCLC colleague Jean Godby ran into with her work to crosswalk different bibliographic metadata formats (e.g., MARC21, ONIX, Dublin Core) from one to another.
As you might imagine, doing this well requires both intimate knowledge of the data being captured in the various standards and a very detailed and painstaking process of determining where those elements need to end up — and how. Therefore, the issues below are often quite specific and backed up with evidence.
- Some critical information is represented redundantly. Example: A description of an e-book is spread across the 245, 008, and 300 fields, but is concisely represented in a more modern standard such as ONIX.
- Some MARC fields are ambiguous. Example: The MARC 300 field has an ‘extent’ sense when it appears in a record that describes a sound recording. But it has a ‘page count’ sense in a record that describes a printed book. The Crosswalk has to make an unreliable check for data in a free-text field to disambiguate the two senses. When distinctions exist in 5xx fields, they may not be recoverable at all.
- Many MARC free-text fields have formatting requirements. Example: ISBD formatting rules for titles require that only the first word be capitalized. Since this convention is not widely used outside the library community, it must be taken out when a MARC record is translated to a non-MARC standard.
- Punctuation in free-text fields is sometimes meaningful, sometimes not. Example: In a 100 field, the comma separates parts of a structured name and indicates an inverted presentation order. In a 500 field, the comma is just another character in a stream of text.
- Some MARC fields are coded with hidden assumptions. Example: A MARC record for a printed book has no explicit mention of the author’s role or of the physical format of the work. But a MARC record that describes a musical score identifies the material type in a code in the 008 field and the contributor’s role in a $4 subfield.
- MARC data elements are semantically complex and built up from many components. In most contemporary non-MARC standards, the elements are semantically simpler and resemble the words in a dictionary that make up a bibliographic description, such as title, contributor, personal name, and subject. Example: MARC 245 $a is not a title. It is an AACR2-defined access point that may contain a concise bibliographic description with title, author, author’s role, and physical description, producing another one-to-many mapping.
- MARC has a “long tail.” In other words, the standard is large, but MARC tag usage studies (1, 2) show that most of the specification is rarely used. In fact, many of the most widely used fields are the 5xx notes, despite many opportunities for representing a description in more explicitly coded data. This can be interpreted as evidence that MARC is no longer the best fit for bibliographic data, perhaps because new concepts are not represented and some existing concepts are not modeled effectively.
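
To make two of the points above concrete, here is a minimal Python sketch of the kind of guesswork a crosswalk is forced into. It is not Jean Godby's code or any OCLC implementation. The function names, the regular expressions, and the sample values are all hypothetical, chosen only to illustrate the ambiguity of the 300 field and the field-dependent meaning of a comma.

```python
import re

# Hypothetical patterns: telltale words in the free text of a 300 $a value.
# Real records vary far more than this, which is why the check is unreliable.
SOUND_PATTERN = re.compile(r"\b(sound|audio)\s+(discs?|cassettes?|files?)\b", re.IGNORECASE)
PAGE_PATTERN = re.compile(r"\b\d+\s*(p\.|pages?|leaves)", re.IGNORECASE)


def classify_extent(field_300a: str) -> str:
    """Guess which sense a 300 $a value carries by inspecting its text."""
    if SOUND_PATTERN.search(field_300a):
        return "extent"
    if PAGE_PATTERN.search(field_300a):
        return "page_count"
    return "unknown"  # nothing recognizable; the sense cannot be recovered


def read_text(tag: str, value: str) -> dict:
    """Interpret a comma according to the field it appears in."""
    if tag == "100" and "," in value:
        # In a 100 field the first comma separates the inverted name parts.
        family, given = value.split(",", 1)
        return {"family": family.strip(), "given": given.strip().rstrip(".,")}
    # Elsewhere (for example a 500 note) the comma is ordinary text.
    return {"text": value}


print(classify_extent("xii, 345 p. ;"))
print(classify_extent("1 sound disc (45 min.)"))
print(read_text("100", "Austen, Jane,"))
print(read_text("500", "Includes index, maps, and notes."))
```

Even in this toy form, the classification depends on cataloger wording rather than on coded data, so any value that falls outside the patterns drops into "unknown".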