In Part 1 of “Where the Problems Lie” I focused on some issues that I see with the set of technologies and standards that I have lumped, for simplicity’s sake, under the heading “MARC”. In this post I am passing along issues that my OCLC colleague Jean Godby ran into with her work to crosswalk different bibliographic metadata formats (e.g., MARC21, ONIX, Dublin Core) from one to another.
As you might imagine, doing this well requires both intimate knowledge of the data being captured in the various standards and a very detailed and painstaking process of determining where those elements need to end up — and how. Therefore, the issues below are often quite specific and backed up with evidence.
- Some critical information is represented redundantly. Example: A description of an e-book is spread across the 245, 008, and 300 fields, but is concisely represented in a more modern standard such as ONIX.
- Some MARC fields are ambiguous. Example: The MARC 300 field has an ‘extent’ sense when it appears in a record that describes a sound recording. But it has a ‘page count’ sense in a record that describes a printed book. The Crosswalk has to make an unreliable check for data in a free-text field to disambiguate the two senses. When distinctions exist in 5xx fields, they may not be recoverable at all.
- Many MARC free-text fields have formatting requirements. Example: ISBD formatting rules for titles require that only the first word be capitalized. Since this convention is not widely used outside the library community, it must be undone when a MARC record is translated to a non-MARC standard.
- Punctuation in free-text fields is sometimes meaningful, sometimes not. Example: In a 100 field, the comma separates parts of a structured name and indicates an inverted presentation order. In a 500 field, the comma is just another character in a stream of text.
- Some MARC fields are coded with hidden assumptions. Example: A MARC record that describes the author of a printed book has no explicit mention of the author’s role or the physical format of the work. But a MARC record that describes a musical score identifies the material type in a code in the 008 field and the contributor’s role in a $4 field.
- MARC data elements are semantically complex and built up from many components. In most contemporary non-MARC standards, the elements are semantically simpler and resemble the words in a dictionary that make up a bibliographic description, such as title, contributor, personal name, and subject. Example: MARC 245 $a is not simply a title. It is an AACR2-defined access point that may contain a concise bibliographic description, with title, author, author’s role, and physical description run together, producing another one-to-many mapping.
- MARC has a “long tail.” In other words, the standard is large, but MARC tag usage studies (1, 2) show that most of the specification is rarely used. In fact, many of the most widely used fields are the 5xx notes, despite many opportunities for representing a description in more explicitly coded data. This can be interpreted as evidence that MARC is no longer the best fit for bibliographic data, perhaps because new concepts are not represented and some existing concepts are not modeled effectively.
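Two of the issues above, the ambiguous 300 field and the sometimes-meaningful comma, can be illustrated with a small sketch. This is not OCLC's Crosswalk code; the field values, record types, and output keys are hypothetical, and the heuristics are deliberately simplistic to show why such checks are unreliable:

```python
import re

def describe_300(field_300: str, record_type: str) -> dict:
    """Interpret the free-text 300 field differently by record type.

    The same field carries a 'page count' sense for printed books and an
    'extent' sense for sound recordings, so a crosswalk must guess from
    context with a heuristic, unreliable check on free text.
    """
    if record_type == "book":
        # e.g. "viii, 245 p. ; 23 cm." -> take the digits before "p"
        match = re.search(r"(\d+)\s*p", field_300)
        return {"pageCount": int(match.group(1))} if match else {}
    if record_type == "sound recording":
        # e.g. "2 sound discs (90 min.)" -> keep the whole extent string
        return {"extent": field_300.rstrip(". ")}
    return {"note": field_300}  # fall back to an untyped note

def parse_100_name(field_100a: str) -> dict:
    """In a 100 $a, the comma is structural: 'Surname, Forename'.

    The same character in a 500 note would carry no meaning at all.
    """
    if "," in field_100a:
        surname, forename = (p.strip() for p in field_100a.split(",", 1))
        return {"surname": surname, "forename": forename.rstrip(",. ")}
    return {"name": field_100a}
```

For example, `describe_300("viii, 245 p. ; 23 cm.", "book")` yields a page count, while the identical field text in a record of another type would be read entirely differently, which is exactly the ambiguity the Crosswalk has to fight.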
When mixing and matching data records, it seems that we simply have too many choices to find the ideal way to describe, record, and recall records.
When we have too many choices, picking one scoop of ice cream from 50 flavors can leave us less satisfied than choosing from only 5, where we are more likely to land on an ideal, or at least good enough, flavor (or descriptor framework).
When we have too many data choices from the start, the cost of so many options can become daunting, and we miss the simple “good enough” in pursuit of a “nearly perfect” but not always repeatable description and standard.
Few people, even programmers in the tech world, write excellent, correct queries on the first try. Query creation really is an art, and the same is true for record descriptions: in many ways we are all artists of the craft of query and record creation.
I didn’t understand this topic:
“Some MARC fields are coded with hidden assumptions. Example: A MARC record that describes the author of a printed book has no explicit mention of the author’s role or the physical format of the work. But a MARC record that describes a musical score identifies the material type in a code in the 008 field and the contributor’s role in a $4 field.”
What are the “hidden assumptions”? Could you give another example for this case?
Thanks!
The hidden assumptions are that the person named has the role of “author” and that the item being described has a physical format of “book”. Those things are not explicitly coded, so they are hidden assumptions.