Encoding

A key technical issue unearthed by MILLE relates to the representation of South Asian writing systems in fonts mapped to a QWERTY keyboard. This is by far the most common way in which electronic texts of South Asian languages are produced. For example, Punjabi can be represented using two dominant, yet mutually incompatible, 8-bit fonts (the Gurbani Lippi and Anandapur fonts), and Singhalese is encoded by more than three competing and mutually incompatible fonts (see Jayalal, 1997; Mohanrajah, 1998; Pratheepan, 1998). This is a significant impediment to the goal of creating an infrastructure to enable minority language engineering: we need a stable platform on which corpora can be developed and exploited by language engineers. Thus, the project had a pressing need to develop software to map these many font-based representations of the writing systems used by South Asian languages to a common standard: in the case of EMILLE, Unicode.

Mapping these fonts to Unicode is no trivial matter. By way of example, consider the case of conjuncts in Hindi (where two or more characters combine to form one character). Unicode-compliant software, such as editors, often has limitations in character rendering or mapping. For example, UniEdit 1.4 does not implement Unicode's conjunct rendering rules at all, while Global Writer 98 does use the rendering rules, but requires the elements of the conjunct to appear in a specific left-to-right order. This order can conflict with the order that an 8-bit font solution uses to encode conjunct characters. Simply mapping each 8-bit character to a single 16-bit Unicode (or 32-bit ISO 10646) character can result in conjuncts being deleted from converted texts. A solution that deals elegantly with issues such as conjunct characters was therefore necessary.
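The loss of conjuncts under a one-to-one mapping can be illustrated with a minimal Python sketch. The byte values and the `naive_convert` function are invented for this illustration and are not taken from any real font or from the project's software. The sketch assumes a legacy font that stores a half-form consonant as a single glyph, whereas Unicode represents it as a consonant followed by the virama (U+094D).

```python
# Hypothetical byte values, invented for illustration; not from any real font.
KA, MA, VIRAMA = "\u0915", "\u092E", "\u094D"

# One byte -> one code point. The half-form glyph 0x4B corresponds to the
# two-code-point sequence KA + VIRAMA, so a table of this shape cannot list it.
ONE_TO_ONE = {0x6B: KA, 0x6D: MA}


def naive_convert(data: bytes) -> str:
    # Bytes missing from the table are silently dropped.
    return "".join(ONE_TO_ONE.get(b, "") for b in data)


legacy = bytes([0x4B, 0x6D])        # half-KA glyph followed by full MA
converted = naive_convert(legacy)   # only MA survives the conversion
expected = KA + VIRAMA + MA         # logical Unicode sequence for the conjunct
```

Here `converted` contains only the final consonant, so the conjunct has disappeared from the output.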

The solution implemented in the Unicodify software suite, developed at Lancaster (see Hardie, forthcoming; Baker et al., 2004), is to develop a mapping algorithm for each font. That is, for each character in the font, a mapping rule is written that captures all the variant Unicode correspondences of that character, checking the local context to decide upon the appropriate Unicode output for that instance of the character. In a single pass, then, the mapping algorithm transforms font-based encodings to well-formed, standardised Unicode. Using this approach we were able to rapidly harmonise the data we collected as Unicode text.
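The idea of context-sensitive mapping rules applied in a single pass can be sketched roughly as follows. This is not the Unicodify implementation: the byte values, the rule set and the `convert` function are all hypothetical, chosen only to show how a rule might inspect local context. The sketch assumes a Devanagari font in which half-forms are single glyphs, the short-i vowel sign is stored before its consonant, and the reph mark is stored after it, while Unicode expects logical order.

```python
# All byte values below are hypothetical, invented for this sketch;
# they are not taken from any real 8-bit font.
KA, MA, RA = "\u0915", "\u092E", "\u0930"
VIRAMA = "\u094D"
VOWEL_SIGN_I = "\u093F"

FULL = {0x6B: KA, 0x6D: MA, 0x72: RA}   # full consonant glyphs
HALF = {0x4B: KA, 0x4D: MA}             # half-form glyphs (consonant + virama)
PRE_I = 0x66                            # short-i sign, stored before its consonant
REPH = 0x5A                             # reph mark, stored after its consonant


def convert(data: bytes) -> str:
    out = []               # Unicode code points in logical order
    cluster_start = None   # index in `out` where the current or last cluster begins
    in_cluster = False     # True while half-forms await their final consonant
    pending_i = False      # True once a pre-base short-i has been seen

    for b in data:
        if b == PRE_I:
            # Hold the vowel sign until the consonant cluster is complete.
            pending_i = True
        elif b in HALF:
            if not in_cluster:
                cluster_start = len(out)
                in_cluster = True
            out += [HALF[b], VIRAMA]
        elif b in FULL:
            if not in_cluster:
                cluster_start = len(out)
            out.append(FULL[b])
            in_cluster = False
            if pending_i:
                out.append(VOWEL_SIGN_I)
                pending_i = False
        elif b == REPH:
            if cluster_start is None:
                raise ValueError("reph with no preceding consonant")
            # Logically, RA + VIRAMA precedes the cluster the reph sits on.
            out[cluster_start:cluster_start] = [RA, VIRAMA]
        else:
            # Pass other bytes (spaces, punctuation) through unchanged.
            out.append(chr(b))
            cluster_start = None
            in_cluster = False
    return "".join(out)
```

For example, `convert(bytes([0x6B, 0x6D, 0x5A]))` moves the reph in front of the consonant it belongs to, and `convert(bytes([0x66, 0x4B, 0x6D]))` places the short-i sign after the whole conjunct. A real font would need many more rules, but each would follow the same pattern of checking neighbouring characters before choosing the output.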

While character encoding proved a challenge, the structural markup of the corpus texts was adapted from the Corpus Encoding Standard's cesDoc document type. Although this was implemented as SGML, we took care to minimise the occurrence of features that would be illegal in XML, so that the corpus markup is to a substantial degree compatible with the new generation of Unicode-aware, XML-compliant software.

References

Baker, P, Hardie, A, McEnery, A, Xiao, R, Bontcheva, K, Cunningham, H, Gaizauskas, R, Hamza, O, Maynard, D, Tablan, V, Ursu, C, Jayaram, BD and Leisher, M (2004) Corpus linguistics and South Asian languages: corpus creation and tool development. In: Literary and Linguistic Computing 19(4): 509-524.

Hardie, A (forthcoming) From legacy encodings to Unicode: the graphical and logical principles in the scripts of South Asia.

Hardie, A (2003) Developing a tagset for automated part-of-speech tagging in Urdu. In: Archer, D, Rayson, P, Wilson, A and McEnery, T (eds.) Proceedings of the Corpus Linguistics 2003 conference. UCREL Technical Papers Volume 16. Department of Linguistics, Lancaster University.

Hardie, A (2004) The computational analysis of morphosyntactic categories in Urdu. PhD thesis, University of Lancaster.

Hardie, A (2005) Automated part-of-speech analysis of Urdu: conceptual and technical issues. In: Yadava, Y, Bhattarai, G, Lohani, RR, Prasain, B and Parajuli, K (eds.) Contemporary issues in Nepalese linguistics. Kathmandu: Linguistic Society of Nepal.

Jayalal, D.A. (1997) Towards a Grammar Checker for Singhala, Unpublished MSc. Dissertation, Dept. Computer Science, University of Colombo.

Mohanrajah, S. (1998) Language Assistant – Towards Automatically Translating Text from English to Tamil in a Controlled Environment, Unpublished MSc. Dissertation, Dept. Computer Science, University of Colombo.

Pratheepan, B. (1998) Document Conversion Using Character Code Mapping, Unpublished BSc. Dissertation, Dept. Computer Science, University of Colombo.

Sperberg-McQueen, M. & Burnard, L. (1994) Guidelines for Electronic Text Encoding and Interchange, Text Encoding Initiative, Chicago.
