Goal 1 - extend a Language Engineering architecture
The project established an LE architecture within which LE for minority languages can take place. To be truly generic platforms, LE architectures cannot be limited to specific languages or writing systems; the EPSRC workshop on LE architectures (“A Workshop on Language Processing Architectures and the Use and Distribution of Language Resources”, EPSRC ref. GR/M44545) concluded that LE architectures need to expand beyond their current focus on European languages. To this end, EMILLE extended GATE (the General Architecture for Text Engineering) to be fully Unicode-compliant so that it may act as a framework for exploiting the EMILLE corpora. By doing so we established a framework within which language processing tools for non-South Asian languages can be recycled in the context of South Asian language processing (see goal 3 below).
GATE was extended at Sheffield to meet the needs of EMILLE. GATE was first released in 1996 and has since had a wide take-up in language processing laboratories around the world (Cunningham, Gaizauskas, Humphreys, and Wilks, 1999). The system is a domain-specific software architecture and development environment that supports researchers in natural language processing and computational linguistics, and developers who are producing and delivering LE systems. It has been used for a wide variety of applications including information extraction (Gaizauskas and Wilks, 1998) and sense tagging (Cunningham, Stevenson, and Wilks, 1998). A new version of the system has been developed which extends the established principles of version 1 to support research into language resources (Cunningham, Peters, McCauley, Bontcheva, and Wilks, 1998). This version takes advantage of new developments on the Internet, and is a distributed, Java-based system. This version of GATE was further extended within the EMILLE project.
Corpus validation tools were incorporated within GATE, and basic tools were developed to allow for the rapid development of corpus headers and mark-up. With corpus building and validation tools in place, GATE is an architecture within which TEI-conformant corpus texts can be developed and validated.
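To illustrate what validating a corpus text can involve, the following is a minimal sketch in Python. It is not the GATE validation tooling: it only checks that a text is well-formed XML and that a few elements are present. The list of required elements, the function name `check_corpus_text` and the sample document are assumptions made for illustration, and the sketch assumes the markup uses no XML namespace. Full TEI conformance would need validation against the relevant DTD or schema.

```python
import xml.etree.ElementTree as ET

# Assumption: a handful of TEI-style elements we expect in every corpus text.
# This list is illustrative only; it is not the EMILLE header specification.
REQUIRED_PATHS = [
    "teiHeader",
    "teiHeader/fileDesc/titleStmt/title",
    "text/body",
]


def check_corpus_text(xml_source):
    """Return a list of problems found in one corpus text (empty if none)."""
    try:
        root = ET.fromstring(xml_source)
    except ET.ParseError as err:
        # A text that is not well-formed cannot be checked any further.
        return ["not well-formed: {}".format(err)]

    problems = []
    for path in REQUIRED_PATHS:
        # Paths are resolved relative to the root element.
        if root.find(path) is None:
            problems.append("missing element: {}".format(path))
    return problems


# Hypothetical sample document, invented for this sketch.
SAMPLE = """<TEI.2>
  <teiHeader>
    <fileDesc><titleStmt><title>Sample text</title></titleStmt></fileDesc>
  </teiHeader>
  <text><body><p>Sample paragraph.</p></body></text>
</TEI.2>"""

if __name__ == "__main__":
    print(check_corpus_text(SAMPLE))
    print(check_corpus_text("<TEI.2><text/></TEI.2>"))
```

Returning a list of problems rather than raising on the first one lets a whole batch of texts be checked and reported in a single pass.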
See also the GATE website.
Goal 2 - develop corpora
EMILLE generated large written corpora for Bengali, Gujarati, Hindi, Panjabi, Singhalese, Tamil and Urdu. These were the South Asian languages identified as most wanted by the LE community in the Baker & McEnery (1999) survey. Other languages were added to this set as the project developed. For those languages with a UK community large enough to sustain spoken corpus collection (Bengali, Gujarati, Hindi, Panjabi and Urdu) EMILLE also produced spoken corpora.
For the monolingual written corpora, we did not attempt to balance the genres and data types as has been done for corpora such as the BNC. Our model of corpus building had to be opportunistic to maximise the amount of electronic data available to us. The monolingual texts were gathered both from the Indian subcontinent and from South Asian language communities worldwide. Contacts established on the MILLE project, such as Lake House printers in Sri Lanka, the Dept. of Health, the Sikh Parliament in Birmingham and community newspapers in the UK, were further developed while gathering the data. Additionally, over the course of the project we established further contacts in order to widen the range of data suppliers contributing to the corpora.
Most significantly, the EMILLE partners entered into an agreement with the Central Institute of Indian Languages, Mysore, to incorporate the CIIL's pre-existing text corpora into the EMILLE/CIIL Monolingual Written Corpora. This involved standardising text encoding and markup between the data collections, and situating the texts in an integrated corpus structure. Combining our data collections in this way has had two major benefits. Firstly, the CIIL's data covers a variety of print genres, including scientific writing, fiction, and educational texts, whereas the EMILLE data contained a great deal of news data, harvested particularly from the World Wide Web, and relatively minor amounts of other text types. Thus, the genre spread of the final joint corpora is significantly better than could have been achieved without this collaboration. Secondly, this collaboration allowed us to cover a wider range of languages than originally planned: in addition to Bengali, Gujarati, Hindi, Panjabi, Singhalese, Tamil and Urdu, the joint corpora include data in Assamese, Kannada, Malayalam, Marathi, Oriya and Telugu, bringing the total number of languages to fourteen.
In terms of the encoding of the corpora, we limited ourselves to those header items and text elements viewed as essential in the Baker et al (1998) review of the corpus encoding needs of language engineers, with the exception that we also encoded country of origin for each text in the corpus.
While the transcriptions in the spoken corpus have full structured markup, we did not work on the time-alignment of the sound files with the transcription during this project, as automated time aligners for South Asian languages are not available. However, we anticipate that the corpora we have created will be of great utility to any research centre that may in the future wish to work on this area.
The spoken texts have been transcribed in the native scripts of the languages; while Romanisation has previously been common in transcribed speech in the South Asian languages, we were able to avoid this entirely. Throughout, analysts employed for the purpose carried out regular random checks on the standard of the transcriptions produced. Quality assurance at this level was cyclical.
The metadata gathered to accompany each transcription was limited to age and gender (and in a minority of cases, occupation). These are objectively verifiable categories. Categories such as social class, which appear attractive, are subjective and unreliable. We also aimed to make the production of the corpus as cost-effective as possible, and therefore limited the metadata we encoded.
Goal 3 - develop basic LE tools
The production of the corpus itself required the development of a tool capable of mapping the bewildering array of 8-bit encodings used for South Asian scripts into the standard Unicode character set. Since the correspondences involved were both complex and contextually conditioned, a powerful and intelligent piece of software was needed. The Unicodify tool was developed at Lancaster in 2002/2003 to serve this purpose (see Hardie, forthcoming; Baker et al 2004). Tools for analysing and exploiting the EMILLE corpora were also developed. For one language, namely Urdu, we developed an automated part-of-speech tagger (see Hardie 2003, 2004, 2005) which was subsequently used to tag the whole Urdu corpus. Finally, the project has developed existing alignment software to sentence-align the parallel corpora within EMILLE.
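To give a sense of why such correspondences are contextually conditioned, the following is a minimal sketch in Python; it is not the Unicodify implementation. Many 8-bit Devanagari fonts store text in visual order, so the short vowel sign I is typed before the consonant it is displayed to the left of, whereas Unicode stores it after that consonant. A converter therefore has to reorder as well as map. The byte values in the table and the function name `legacy_to_unicode` are invented for illustration, and the sketch handles single consonants only, not conjunct clusters.

```python
# Hypothetical 8-bit font table: byte value -> Unicode character.
# The byte assignments are invented for this sketch; real legacy fonts
# each use their own (mutually incompatible) assignments.
LEGACY_TO_UNICODE = {
    0x6B: "\u0915",  # DEVANAGARI LETTER KA
    0x64: "\u0926",  # DEVANAGARI LETTER DA
    0x6E: "\u0928",  # DEVANAGARI LETTER NA
    0x49: "\u0940",  # DEVANAGARI VOWEL SIGN II (follows its consonant)
    0x66: "\u093F",  # DEVANAGARI VOWEL SIGN I (typed before its consonant)
}

# Vowel signs that legacy visual-order text places before the consonant.
PREBASE_VOWEL_SIGNS = {"\u093F"}


def legacy_to_unicode(data):
    """Map legacy bytes to Unicode, moving pre-base vowel signs after
    the next mapped character (simplification: no conjunct handling)."""
    out = []
    pending = None  # a pre-base vowel sign waiting for the next character
    for byte in data:
        char = LEGACY_TO_UNICODE.get(byte)
        if char is None:
            raise ValueError("unmapped byte 0x{:02X}".format(byte))
        if char in PREBASE_VOWEL_SIGNS:
            pending = char  # hold it back until the next character is seen
            continue
        out.append(char)
        if pending is not None:
            out.append(pending)  # logical order: base character, then sign
            pending = None
    if pending is not None:
        out.append(pending)  # trailing sign with nothing to attach to
    return "".join(out)


if __name__ == "__main__":
    # Visual order: sign I, DA, NA; logical order: DA, sign I, NA.
    print(legacy_to_unicode(bytes([0x66, 0x64, 0x6E])))
```

A plain byte-for-character lookup would leave the vowel sign in front of its consonant, which is not how Unicode stores Devanagari text; the one-character lookahead is the simplest form of the context sensitivity the paragraph above refers to.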
Baker, J.P., Burnard, L., McEnery, A.M. & Wilson, A. (1998) ‘Techniques for the Evaluation of Language Corpora: a report from the front’, Proceedings of the First International Conference on Language Resources and Evaluation, Granada, Spain.
Baker, P, Hardie, A, McEnery, A, Xiao, R, Bontcheva, K, Cunningham, H, Gaizauskas, R, Hamza, O, Maynard, D, Tablan, V, Ursu, C, Jayaram, BD and Leisher, M (2004) Corpus linguistics and South Asian languages: corpus creation and tool development. In: Literary and Linguistic Computing 19(4): 509-524.
Baker, P. & McEnery, A. (1998) Needs of language-engineering communities; corpus building and translation resources. MILLE working paper 7, Lancaster University.
Cunningham, H., Gaizauskas, R.G., Humphreys, K. and Wilks, Y. (1999) "Experience with a Language Engineering Architecture: Three Years of GATE", Proceedings of the AISB'99 Workshop on Reference Architectures and Data Standards for NLP, Edinburgh.
Cunningham, H., Peters, W., McCauley, C., Bontcheva, K. and Wilks, Y. (1998) "A Level Playing Field for Language Resource Evaluation", Workshop on Distributing and Accessing Lexical Resources at Conference on Language Resources Evaluation, Granada, Spain.
Cunningham, H., Stevenson, M. and Wilks, Y. (1998) "Implementing a Sense Tagger within a General Architecture for Language Engineering", Proceedings of the Third Conference on New Methods in Language Engineering (NeMLaP-3), Sydney, Australia.
Gaizauskas, R. and Wilks, Y. (1998) "Information Extraction: Beyond Document Retrieval", Journal of Documentation, 1.
Hardie, A (forthcoming) From legacy encodings to Unicode: the graphical and logical principles in the scripts of South Asia.
Hardie, A (2003) Developing a tagset for automated part-of-speech tagging in Urdu. In: Archer, D, Rayson, P, Wilson, A, and McEnery, T (eds.) (2003) Proceedings of the Corpus Linguistics 2003 conference. UCREL Technical Papers Volume 16. Department of Linguistics, Lancaster University.
Hardie, A. (2004) The computational analysis of morphosyntactic categories in Urdu. PhD thesis, University of Lancaster.
Hardie, A. (2005) Automated part-of-speech analysis of Urdu: conceptual and technical issues. In: Yadava, Y, Bhattarai, G, Lohani, RR, Prasain, B and Parajuli, K (eds.) Contemporary issues in Nepalese linguistics. Kathmandu: Linguistic Society of Nepal.
Langlais, P., Simard, M., & Véronis, J. (1998) Methods and practical issues in evaluating alignment techniques. Proceedings of 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Montréal, Canada.
Lie, M., Baker, P., McEnery, A. & Sebba, M. (1999) “Building a Corpus of Spoken Sylheti”, in N. Ostler (ed) The Proceedings of the 3rd Conference of the Foundation for Endangered Languages. Foundation for Endangered Languages, Bath.
Masica, C.P. (1991) The Indo-Aryan Languages, Cambridge University Press, Cambridge.
McEnery, A. (1999) Final Report on MILLEFT, Report to EPSRC, Lancaster University.