THE EMILLE PROJECT: encoding

A key technical issue unearthed by MILLE relates to the representation of South Asian writing systems in fonts mapped to a QWERTY keyboard. Such fonts are by far the most common means by which electronic texts of South Asian languages are produced. For example, Punjabi can be represented using two dominant, yet mutually incompatible, 8-bit fonts (the Gurbani Lippi and Anandapur fonts), and Singhalese is encoded by more than three competing and mutually incompatible fonts (see Jayalal, 1997; Mohanrajah, 1998; Prathepan, 1998). This is a significant impediment to the goal of creating an infrastructure to enable minority language engineering: we need a stable platform on which corpora can be developed and exploited by language engineers. Thus, the project had a pressing need to develop software to map these many font-based representations of the writing systems used by South Asian languages to a common standard: in the case of EMILLE, Unicode.

Mapping these fonts to Unicode is no trivial matter. By way of example, consider the case of conjuncts in Hindi (where two or more characters combine to form one character). Unicode-compliant software, such as editors, often has limitations in character representation or mapping. For example, UniEdit 1.4 does not implement Unicode's conjunct rendering rules at all, while Global Writer 98 does use the rendering rules, but requires the elements of the conjunct to appear in a specific left-to-right order. This order can conflict with the order that an 8-bit font solution uses to encode conjunct characters. Simply mapping one 8-bit character to one 16-bit Unicode (or 32-bit ISO 10646) character can result in conjuncts being deleted from converted texts. A solution which deals elegantly with such issues as conjunct characters was therefore necessary.
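The loss of conjuncts under one-to-one mapping can be illustrated with a minimal sketch. It assumes a hypothetical 8-bit font in which a single byte holds a ready-made glyph for the conjunct formed from KA and SSA; the byte values, table names and the convert function are invented for illustration and do not describe any of the fonts named above. Unicode has no single code point for this conjunct, which is instead spelled as consonant + virama + consonant, so a table that allows only one code point per byte has nowhere to put it.

```python
# Hypothetical 8-bit font: all byte assignments below are invented for illustration.
KA = "\u0915"      # DEVANAGARI LETTER KA
SSA = "\u0937"     # DEVANAGARI LETTER SSA
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA

# One byte -> one code point. The conjunct glyph (byte 0xD5 in this made-up font)
# has no single Unicode code point, so it cannot be entered in this table.
ONE_TO_ONE = {0x6B: KA, 0x53: SSA}

# One byte -> a sequence of code points. The conjunct is spelled out as
# consonant + virama + consonant.
ONE_TO_MANY = {**ONE_TO_ONE, 0xD5: KA + VIRAMA + SSA}


def convert(data: bytes, table: dict) -> str:
    """Map each byte through the table, silently dropping unmapped bytes."""
    return "".join(table.get(byte, "") for byte in data)


font_text = bytes([0x6B, 0xD5])          # KA followed by the conjunct glyph
naive = convert(font_text, ONE_TO_ONE)   # the conjunct is lost
fixed = convert(font_text, ONE_TO_MANY)  # the conjunct survives as a sequence
```

Allowing one-to-many output solves only part of the problem: it does not address cases where the font stores the elements in a different order from the one Unicode-aware software expects.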

The solution implemented in the Unicodify software suite, developed at Lancaster (see Hardie, forthcoming; Baker et al., 2004), is to develop a mapping algorithm for each font. That is, for each character in the font, a mapping rule is written that captures all the variant Unicode correspondences of that character, checking the local context to decide upon the appropriate Unicode output for that instance of the character. In a single pass, then, the mapping algorithm transforms font-based encodings to well-formed, standardised Unicode. Using this approach we were able to rapidly harmonise the data we collected as Unicode text.
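The general idea of one context-checking rule per font character can be sketched as follows. This is not the Unicodify code: the function name unicodify_sketch, the rule signature and every byte assignment are assumptions made for illustration. The sketch assumes a hypothetical font that stores the vowel sign I before its consonant (the order in which it is displayed), whereas Unicode stores it after the consonant, and that has a half-form glyph which corresponds to consonant + virama.

```python
from typing import Callable, Dict, Tuple

KA = "\u0915"      # DEVANAGARI LETTER KA
TA = "\u0924"      # DEVANAGARI LETTER TA
I_SIGN = "\u093F"  # DEVANAGARI VOWEL SIGN I
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA

# A rule inspects the data at position i and returns
# (Unicode output, number of font bytes consumed).
Rule = Callable[[bytes, int], Tuple[str, int]]

# Hypothetical byte assignments for full consonants.
CONSONANTS: Dict[int, str] = {0x6B: KA, 0x74: TA}


def consonant_rule(data: bytes, i: int) -> Tuple[str, int]:
    return CONSONANTS[data[i]], 1


def i_sign_rule(data: bytes, i: int) -> Tuple[str, int]:
    # Context check: if a consonant follows, emit consonant + sign, since
    # Unicode stores the sign after the consonant. Simplified: a real rule
    # would also have to look past half-forms to the end of a cluster.
    nxt = data[i + 1] if i + 1 < len(data) else None
    if nxt in CONSONANTS:
        return CONSONANTS[nxt] + I_SIGN, 2
    return I_SIGN, 1


def half_ka_rule(data: bytes, i: int) -> Tuple[str, int]:
    # A half-form glyph corresponds to consonant + virama.
    return KA + VIRAMA, 1


RULES: Dict[int, Rule] = {
    0x6B: consonant_rule,
    0x74: consonant_rule,
    0x66: i_sign_rule,   # hypothetical byte for the vowel sign I glyph
    0x4B: half_ka_rule,  # hypothetical byte for the half-form of KA
}


def unicodify_sketch(data: bytes) -> str:
    """Single left-to-right pass applying one rule per font character."""
    out = []
    i = 0
    while i < len(data):
        rule = RULES.get(data[i])
        if rule is None:
            out.append("\uFFFD")  # flag unmapped bytes instead of dropping them
            i += 1
        else:
            text, used = rule(data, i)
            out.append(text)
            i += used
    return "".join(out)
```

Under these assumptions, unicodify_sketch(bytes([0x66, 0x6B])) yields KA followed by the vowel sign I, and unicodify_sketch(bytes([0x4B, 0x74])) yields KA, virama, TA. Emitting U+FFFD for unmapped bytes is a design choice of the sketch: it keeps gaps in the rule set visible in the output.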

Encoding proved a challenge; the structured markup of the corpus texts, by contrast, was adapted from the Corpus Encoding Standard's cesDoc document type. Although this was implemented as SGML, we took care to minimise the occurrence of features that would be illegal in XML, so that the corpus markup is to a substantial degree compatible with the new generation of Unicode-aware, XML-compliant software.