See also: Millennium Character Map Code Tables
Best Practices
- Make diacritic and special character edits in OCLC whenever possible.
- When forced to use the Millennium Character Maps, use only those character maps that have been identified as containing characters within the MARC21 Unicode character repertoire.
- Remember that even the Millennium Character Maps that contain characters within the MARC21 Unicode character repertoire also contain characters that are not acceptable for use.
- Use the companion document, the Character Map Code Table, to determine which characters may be used.
- Unicode values may be entered directly into the MARC record rather than using the Millennium Character Maps.
- Unicode values are entered as the hexadecimal code point prefaced by “u” and enclosed in curly brackets. For example: {u0101} (see the sketch following this list).
- Whether using a Millennium Character Map or entering a Unicode value directly, the diacritic value must immediately follow the character it modifies.
- Diacritic display should be checked in public mode rather than edit mode, as edit mode fonts don’t support all diacritic modifications correctly.
- View>Show Codes can be used to confirm diacritic coding and placement, or to correct garbled values.
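For illustration only, the following Python sketch (not a Millennium or OCLC feature; the helper name expand_curly_unicode is hypothetical) shows how a curly-bracket value such as {u0101} corresponds to an actual Unicode character:

    # Illustrative only: expand Millennium-style {uXXXX} tokens into the
    # Unicode characters they represent.
    import re

    def expand_curly_unicode(text):
        # Replace each {uXXXX} token with the character for that hexadecimal code point.
        return re.sub(r'\{u([0-9A-Fa-f]{4,6})\}',
                      lambda m: chr(int(m.group(1), 16)),
                      text)

    print(expand_curly_unicode('{u0101}'))  # prints the letter a with macron (ā)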
Background
MARC records in Millennium are stored as Unicode. Although Millennium allows for the input and display of Unicode values outside of the MARC21 Unicode character repertoire (or OCLC ALA set), only values inside the MARC21 Unicode character repertoire should be used. Characters outside of this repertoire may cause problems for other utilities when translating and displaying these values.
Library of Congress
The original restriction of the MARC 21 Unicode character repertoire to the MARC-8 repertoire is no longer practicable because of the increased availability of Unicode-encoded data sources that are not bound by such a limitation. Through a variety of techniques, the most common being copy-and-paste, non-MARC-8 characters can and do get introduced into MARC 21 records. Frequently these characters will escape detection when a record is created, or even when used locally, but they may impede the effectiveness of the data interchange that is the primary purpose of MARC 21. Characters such as the single quotation marks and apostrophe, which were compressed to a single character in ASCII because of space limitations, are among the most common to be encountered accidentally. Data in European languages are likely to contain precomposed Latin characters. Users of CJK data may discover characters from the Halfwidth and Fullwidth Forms block (FF00 to FFEF (hex)).
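To illustrate the kinds of characters described above, here is a minimal Python sketch (not an LC, OCLC, or Millennium tool; the function name suspicious_characters is hypothetical) that flags characters from the Halfwidth and Fullwidth Forms block and precomposed Latin characters:

    # Illustrative only: flag two kinds of characters commonly introduced
    # by copy-and-paste: Halfwidth/Fullwidth forms and precomposed letters.
    import unicodedata

    def suspicious_characters(text):
        flagged = []
        for ch in text:
            if 0xFF00 <= ord(ch) <= 0xFFEF:
                flagged.append((ch, 'Halfwidth and Fullwidth Forms block'))
            else:
                decomp = unicodedata.decomposition(ch)
                if decomp and not decomp.startswith('<'):
                    flagged.append((ch, 'precomposed; decomposes to ' + decomp))
        return flagged

    print(suspicious_characters('café'))   # é is precomposed (0065 0301)
    print(suspicious_characters('ＡＢＣ'))  # fullwidth Latin letters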
It is infeasible to identify a particular collection of Unicode characters to be prohibited from MARC 21 records. But creators of MARC 21 records should take into account the capabilities of their likely exchange partners as they choose to expand their working repertoire. For limited distributions, agreements among exchange partners can support aggressive repertoire expansion. For general distribution, a more conservative approach is warranted. Such an approach would minimize or avoid entirely the use of certain types of characters.
The Library of Congress does not prohibit the use of any particular Unicode character, but it cautions libraries that characters outside of the traditional repertoire can create problems with data exchange.
Migration
GLADIS stored data as MARC-8. When we migrated to Millennium, GLADIS data was translated and loaded as Unicode. In preparation for future cataloging in OCLC, we set up OCLC Connexion parameters so that records cataloged and downloaded via OCLC’s interactive export process would also load as Unicode (UTF-8).
Unfortunately, during the migration process, some characters did not translate as they should have. The reasons for this are:
- Innovative’s translation tables were not complete at the time our records from GLADIS were loaded. Although most diacritics loaded correctly, some extended Latin values were not translated correctly and will need to be corrected as encountered; there is no automated fix for this problem.
- Not all data exported from GLADIS went through the translation process at UCB’s end. Only bibliographic data was translated; any GLADIS holdings data, including call numbers and summary statements, that contained diacritics did not translate correctly. These will need to be corrected, but there may be automated fixes depending upon the problem.
- Many letters and modifying diacritics were translated by Innovative to precomposed values (letter + diacritic) rather than the separate values (letter, combining diacritic). For example: {u0161} (Latin small letter s with caron) rather than {u0073}{u030C} (the letter s followed by the combining caron); see the sketch following this list. This does not cause a display or indexing issue, though it remains to be seen whether it will be problematic for data exchange. If it does cause problems, we will need to fix them with an automated process.
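The difference between the precomposed and decomposed forms can be seen with standard Unicode normalization; this minimal Python sketch is a general illustration, not a description of Innovative’s translation process:

    # Illustrative only: the precomposed š and the decomposed s + combining
    # caron are different code point sequences but canonically equivalent.
    import unicodedata

    precomposed = '\u0161'         # Latin small letter s with caron (š)
    decomposed = '\u0073\u030C'    # letter s followed by combining caron

    print(precomposed == decomposed)                                # False
    print(unicodedata.normalize('NFD', precomposed) == decomposed)  # True
    print(unicodedata.normalize('NFC', decomposed) == precomposed)  # True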
Staff may clean up problems as they are encountered, or report badly translated data to the OskiCat Helpdesk for resolution if the problem appears to be global.
Millennium
It is strongly recommended that staff input or correct diacritics in OCLC, save, and then export to Millennium. This is preferable to doing the work in Millennium for several reasons:
- The Millennium character maps are vast, complex, and difficult to read. Unless one is well-versed in the use of hexadecimal values and their diacritic equivalents, it would be ill-advised to use this method for diacritic correction.
- OCLC allows input only of the MARC21 Unicode character repertoire (or OCLC ALA set), so invalid diacritic values cannot be entered in error.
- Since OCLC master records are used for display purposes in Next Generation Melvyl (WorldCat Local), it is very important that errors be corrected in the master record, not just in our local Millennium record.
That said, there will still be situations where one must correct data in Millennium, for example, where a diacritic did not translate properly from a GLADIS SUM field.
Millennium Character Maps
Millennium character maps can be found in the cataloging module under Tools>Character Map. From here, there is a pull-down menu with a list of all character maps.
Millennium character maps containing MARC21 Unicode character repertoire values include:
- Basic Latin
- Combining Diacritical Marks
- Combining Half Marks
- Currency Symbols (for Euro sign only)
- Latin 1 Supplement
- Extended Latin-A
- Extended Latin-B
- Letterlike Symbols (for script lower case L and musical copyright only)
- Miscellaneous Symbols (for musical flat and musical sharp only)
- Spacing Modifier Letters (for prime, double prime, ayn, and alif only)
- Superscripts and Subscripts
It is extremely important to note that not all values in these maps should be used! The Millennium Character Map Code Table must be consulted in order to determine which ones may be used.
All other character maps and character values should never be used, as their use may severely compromise our ability to exchange data with other utilities.
Each character map, or code chart, gives the hexadecimal value for each diacritic or special character. Read the left-hand column first, followed by the top row. For example, row 011 plus column 1 gives Unicode hexadecimal value 0111 (see the sketch below).
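As a quick check of a value read from a chart, the hexadecimal code can be converted to its character with a general-purpose Unicode tool; this Python sketch is an illustration only, not a Millennium feature:

    # Illustrative only: confirm which character a chart value refers to.
    import unicodedata

    code = int('0111', 16)  # row 011 + column 1
    print(chr(code), unicodedata.name(chr(code)))
    # đ LATIN SMALL LETTER D WITH STROKE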
The diacritic value should always be input following the character it is to modify.
[NOTE: Be sure to review any diacritic in the public display, since not all Millennium editor fonts will correctly combine base characters with modifying diacritics for an accurate view.]
[NOTE: See Character Map Code Table for details on which diacritics and special characters may be found in which tables, and what their hexadecimal equivalents are.]
Unicode “Curly Bracket” Values
An alternative to using the character map is to enter the diacritic directly as its hexadecimal equivalent. This is done as:
{u<code>}
The code should be prefaced by “u” to indicate that it is a Unicode value, and then enclosed in curly brackets. Again, the hexadecimal value should be entered after the character it modifies. For example:
franc{u0327}ais displays as: français
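The same combining behavior can be reproduced with standard Unicode normalization; this Python sketch is a general illustration, and Millennium’s internal handling may differ:

    # Illustrative only: a combining cedilla (U+0327) entered after the
    # letter c combines with it; NFC normalization yields the precomposed ç.
    import unicodedata

    typed = 'franc\u0327ais'      # c followed by combining cedilla
    print(typed)                  # displays as français
    print(unicodedata.normalize('NFC', typed) == 'fran\u00E7ais')  # True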
Records with diacritics may be viewed with their Unicode values (diacritics and special characters only) in the cataloging module under View>Show Codes. This option also allows one to review diacritic problems or garbled displays.
Legacy Data
Author: Jim Lake
Approval Group: Cataloging and Metadata Council
Update Group: Cataloging & Metadata Council
Last updated date: 03/18/12
Archived Comments
Tue, 06/28/2011 – 15:38 — Charis Takaro
Changed reference to an appendix (no longer treated as such) to the separate table: http://www.lib.berkeley.edu/asktico/data-definition/millennium-character-map-code-tables