Goal 1 - extend a Language Engineering architecture
The project established an LE architecture within which minority LE can take place. To be truly generic platforms, LE architectures cannot be limited to specific languages or writing systems; the EPSRC workshop on LE architectures (“A Workshop on Language Processing Architectures and the Use and Distribution of Language Resources”, EPSRC ref. GR/M44545) concluded that LE architectures need to expand beyond their current focus on European languages. To this end, EMILLE extended GATE (the General Architecture for Text Engineering) to be fully Unicode-compliant so that it may act as a framework for exploiting the EMILLE corpora. By doing so we established a framework within which language processing tools for non-South Asian languages can be recycled in the context of South Asian language processing (see goal 3 below).

GATE was extended at Sheffield to meet the needs of EMILLE. GATE was first released in 1996 and has since had a wide take-up in language processing laboratories around the world (Cunningham, Gaizauskas, Humphreys, and Wilks, 1999). The system is a domain-specific software architecture and development environment that supports researchers in natural language processing and computational linguistics, as well as developers who are producing and delivering LE systems. It has been used for a wide variety of applications including information extraction (Gaizauskas and Wilks, 1998) and sense tagging (Cunningham, Stevenson, and Wilks, 1998). A new version of the system has been developed which extends the established principles of version 1 to support research into language resources (Cunningham, Peters, McCauley, Bontcheva, and Wilks, 1998). This version takes advantage of new developments on the Internet, and is a distributed, Java-based system. This version of GATE was further extended within the EMILLE project.

Corpus validation tools were incorporated within GATE, and basic tools were developed to allow for the rapid development of corpus headers and mark-up. With corpus building and validation tools in place, GATE is an architecture within which TEI-conformant corpus texts can be developed and validated.
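As a rough illustration of what validating a corpus text can involve, the sketch below checks that a file is well-formed XML and that a handful of header elements are present. It is not the GATE validation tooling: the function name, the list of required elements and the sample document are all assumptions made for this example, and a real validator would check against the full TEI scheme.

```python
import xml.etree.ElementTree as ET

# Hypothetical set of header elements to require. The header fields actually
# used in the corpus are not listed here, so these paths are assumptions.
REQUIRED_HEADER_PATHS = [
    "teiHeader/fileDesc/titleStmt/title",
    "teiHeader/fileDesc/sourceDesc",
    "teiHeader/profileDesc/langUsage/language",
]


def validate_corpus_text(xml_string):
    """Return a list of problems found; an empty list means the checks passed."""
    try:
        root = ET.fromstring(xml_string)
    except ET.ParseError as err:
        # A text that is not well-formed cannot be checked any further.
        return [f"not well-formed XML: {err}"]

    problems = []
    for path in REQUIRED_HEADER_PATHS:
        if root.find(path) is None:
            problems.append(f"missing header element: {path}")
    if root.find("text") is None:
        problems.append("missing <text> element")
    return problems


# Illustrative document only; not taken from the corpus.
SAMPLE = """<TEI>
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample text</title></titleStmt>
      <sourceDesc><p>Hypothetical source</p></sourceDesc>
    </fileDesc>
    <profileDesc>
      <langUsage><language ident="hi">Hindi</language></langUsage>
    </profileDesc>
  </teiHeader>
  <text><body><p>...</p></body></text>
</TEI>"""

if __name__ == "__main__":
    print(validate_corpus_text(SAMPLE))
```

Returning a list of problems rather than raising on the first one lets a batch run report every fault in a text at once, which suits checking many files in a single pass.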

See also the GATE website.

Goal 2 - develop corpora

EMILLE generated large written corpora for Bengali, Gujarati, Hindi, Panjabi, Singhalese, Tamil and Urdu. These were the South Asian languages indicated as being those most wanted by the LE community in the Baker & McEnery (1999) survey. Other languages were added to this set as the project developed. For those languages with a UK community large enough to sustain spoken corpus collection (Bengali, Gujarati, Hindi, Panjabi and Urdu) EMILLE also produced spoken corpora.

Written Data
Alongside 94 million words of monolingual written data, the corpus contains 200,000 words of parallel text in English, Bengali, Gujarati, Hindi, Punjabi and Urdu. We chose a figure of 200,000 words because a corpus of this size, produced by the MULTEXT project, proved an adequate basis for the largest comparative evaluation exercise for alignment tools yet undertaken (Langlais et al 1998). While other work relying on parallel corpus data may require larger corpora, we believe that parallel corpora of 200,000 words are clearly useful to those needing to exploit such corpora. The primary donor for the parallel corpus was the UK government.

For the monolingual written corpora, we did not attempt to balance the genres and data types as has been done for corpora such as the BNC. Our model of corpus building had to be opportunistic to maximise the amount of electronic data available to us. The monolingual texts were gathered both from the Indian subcontinent and from South Asian language communities worldwide. Contacts established on the MILLE project, such as Lake House printers in Sri Lanka, The Dept. of Health, the Sikh Parliament in Birmingham and community newspapers in the UK, were further developed while gathering the data. Additionally, over the course of the project we successfully established further contacts in order to widen the range of data suppliers contributing to the corpora. Most significantly, the EMILLE partners entered into an agreement with the Central Institute of Indian Languages, Mysore, to incorporate the CIIL's pre-existing text corpora into the EMILLE/CIIL Monolingual Written Corpora. This involved standardising text encoding and markup between the data collections, and situating the texts in an integrated corpus structure. Combining our data collections in this way has had two major benefits. Firstly, the CIIL's data covers a variety of print genres, including scientific writing, fiction, and educational texts. By contrast, the EMILLE data contained a great deal of news data, harvested particularly from the World Wide Web, and relatively minor amounts of other text types. Thus, the genre spread of the final joint corpora is significantly better than could have been achieved without this collaboration. Secondly, this collaboration allowed us to look at a wider range of languages than originally planned: in addition to Bengali, Gujarati, Hindi, Punjabi, Singhalese, Tamil and Urdu, the joint corpora include data in Assamese, Kannada, Malayalam, Marathi, Oriya and Telugu, bringing the total number of languages to fourteen.

In terms of the encoding of the corpora, we limited ourselves to those header items and text elements viewed as essential in the Baker et al (1998) review of the corpus encoding needs of language engineers, with the exception that we also encoded country of origin for each text in the corpus.

Spoken Data
The spoken corpus data was largely gathered from the BBC's domestic South Asian language radio broadcasts, primarily the BBC Asian Network. Ideally, a spoken corpus would consist at least in part, and perhaps mostly, of recordings of spontaneous, naturally occurring speech, sampled demographically. However, our early efforts to gather this kind of spoken data were generally fruitless, due to an extreme degree of difficulty in recruiting informants. The results of our pilot exercise in demographically-sampled data were clear: members of UK-based South Asian language communities are not comfortable allowing their speech to be recorded and transcribed. Radio broadcasts were a highly attractive alternative: as only public speech is involved, there are fewer confidentiality issues to negotiate; and since the shows in question often incorporate interviews and phone-ins as well as the DJ's voice-over, a portion of the speech is of the spontaneous, naturalistic type that would normally be captured by informants with personal recorders.

While the transcriptions in the spoken corpus have full structured markup, we did not work on the time-alignment of the sound files with the transcription during this project, as automated time aligners for South Asian languages are not available. However, we anticipate that the corpora we have created will be of great utility to any research centre that may in the future wish to work on this area.

The spoken texts have been transcribed in the native scripts of the languages; while Romanisation has previously been common in transcribed speech in the South Asian languages, we were able to avoid this entirely. Throughout, analysts employed for the purpose carried out regular random checks on the standard of the transcriptions produced. Quality assurance at this level was cyclical.

The metadata gathered to accompany each transcription was limited to age and gender (and in a minority of cases, occupation). These are objectively verifiable categories. Categories such as social class, which appear attractive, are subjective and unreliable. Also, we aimed to make the production of the corpus as cost effective as possible - we therefore limited the metadata we encoded.

This website is the project's primary information point. The Department of Linguistics at Lancaster University has undertaken to maintain the website beyond the life of the EMILLE project. ELRA (an EMILLE partner) has taken responsibility for distribution of the project resources on CD. The corpus is accompanied by comprehensive documentation, giving details such as the sources from which individual corpus texts were gathered. This manual is available to download from this website, as well as being published with the corpus.


Legal Issues
We used copyright release forms, which the MILLE project had adapted from the existing BNC release forms, to gain permission to include individual texts in the corpus. Over the course of the project, permission was successfully obtained for the vast majority of the texts we wished to include. We had a good head start in this process thanks to MILLE, which had already established data donor agreements (see McEnery, 1999). Copyright clearance will be an on-going sub-task of work package two of the project.

Goal 3 - develop basic LE tools

The production of the corpus itself required the development of a tool capable of mapping the bewildering array of 8-bit encodings used for South Asian scripts into the standard Unicode character set. Since the correspondences involved were both complex and contextually conditioned, a powerful and intelligent piece of software was needed. The Unicodify tool was developed at Lancaster in 2002/2003 to serve this purpose (see Hardie, forthcoming; Baker et al 2004). Additionally, tools for analysing and exploiting the EMILLE corpora were developed. For one language, namely Urdu, we developed an automated part-of-speech tagger (see Hardie 2003, 2004, 2005) which was subsequently used to tag the whole Urdu corpus. Finally, the project adapted existing alignment software to sentence-align the parallel corpora within EMILLE.
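To give a flavour of why such correspondences are contextually conditioned, the sketch below converts a made-up 8-bit Devanagari encoding into Unicode. It is not the Unicodify algorithm: the byte values in the table are invented for illustration, and only one contextual rule is handled. That rule reflects a genuine difference between the two kinds of encoding: a font-based encoding typically stores the vowel sign I before the consonant it is drawn to the left of, whereas Unicode stores it after the consonant (or consonant cluster) in logical order.

```python
# Hypothetical byte-to-character table for an imaginary legacy font encoding.
# Real legacy encodings differ from font to font; these values are invented.
LEGACY_TO_UNICODE = {
    0x6B: "\u0915",  # DEVANAGARI LETTER KA
    0x74: "\u0924",  # DEVANAGARI LETTER TA
    0x62: "\u092C",  # DEVANAGARI LETTER BA
    0x61: "\u093E",  # DEVANAGARI VOWEL SIGN AA
    0x66: "\u093F",  # DEVANAGARI VOWEL SIGN I (drawn to the left of its consonant)
    0x64: "\u094D",  # DEVANAGARI SIGN VIRAMA
}

VOWEL_SIGN_I = "\u093F"
VIRAMA = "\u094D"


def is_consonant(ch):
    """True for Devanagari consonant letters KA (U+0915) to HA (U+0939)."""
    return "\u0915" <= ch <= "\u0939"


def legacy_to_unicode(data):
    """Map legacy bytes to Unicode, moving vowel sign I into logical order."""
    chars = []
    for byte in data:
        if byte not in LEGACY_TO_UNICODE:
            raise ValueError(f"no mapping for byte 0x{byte:02X}")
        chars.append(LEGACY_TO_UNICODE[byte])

    out = []
    i = 0
    while i < len(chars):
        ch = chars[i]
        if ch == VOWEL_SIGN_I and i + 1 < len(chars) and is_consonant(chars[i + 1]):
            # Collect the consonant cluster that follows the vowel sign:
            # consonant (virama consonant)*
            j = i + 1
            cluster = [chars[j]]
            j += 1
            while j + 1 < len(chars) and chars[j] == VIRAMA and is_consonant(chars[j + 1]):
                cluster.extend(chars[j:j + 2])
                j += 2
            # Unicode order: the whole cluster first, then the vowel sign.
            out.extend(cluster)
            out.append(ch)
            i = j
        else:
            out.append(ch)
            i += 1
    return "".join(out)


if __name__ == "__main__":
    # Visual order "i, ka, ta, aa, ba" becomes logical order "ka, i, ta, aa, ba".
    print(legacy_to_unicode(bytes([0x66, 0x6B, 0x74, 0x61, 0x62])))
```

A plain one-to-one lookup would leave the vowel sign in the wrong position, which is why the conversion needs a second pass that inspects neighbouring characters. A production tool would need many more such rules for each script and source font.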


Baker, J.P., Burnard, L., McEnery, A.M. & Wilson, A. (1998) ‘Techniques for the Evaluation of Language Corpora: a report from the front’, Proceedings of the First International Conference on Language Resources and Evaluation, Granada, Spain.

Baker, P, Hardie, A, McEnery, A, Xiao, R, Bontcheva, K, Cunningham, H, Gaizauskas, R, Hamza, O, Maynard, D, Tablan, V, Ursu, C, Jayaram, BD and Leisher, M (2004) Corpus linguistics and South Asian languages: corpus creation and tool development. In: Literary and Linguistic Computing 19(4): 509-524.

Baker, P. & McEnery, A. (1998) Needs of language-engineering communities; corpus building and translation resources. MILLE working paper 7, Lancaster University.

Cunningham, H., Gaizauskas, R.G., Humphreys, K. and Wilks, Y. (1999) "Experience with a Language Engineering Architecture: Three Years of GATE", Proceedings of the AISB'99 Workshop on Reference Architectures and Data Standards for NLP, Edinburgh.

Cunningham, H., Peters, W., McCauley, C., Bontcheva, K. and Wilks, Y. (1998) "A Level Playing Field for Language Resource Evaluation", Workshop on Distributing and Accessing Lexical Resources at Conference on Language Resources Evaluation, Granada, Spain.

Cunningham, H., Stevenson, M. and Wilks, Y. (1998) "Implementing a Sense Tagger within a General Architecture for Language Engineering", Proceedings of the Third Conference on New Methods in Language Engineering (NeMLaP-3), Sydney, Australia.

Gaizauskas, R. and Wilks, Y. (1998) "Information Extraction: Beyond Document Retrieval", Journal of Documentation, 1.

Hardie, A (forthcoming) From legacy encodings to Unicode: the graphical and logical principles in the scripts of South Asia.

Hardie, A (2003) Developing a tagset for automated part-of-speech tagging in Urdu. In: Archer, D, Rayson, P, Wilson, A, and McEnery, T (eds.) (2003) Proceedings of the Corpus Linguistics 2003 conference. UCREL Technical Papers Volume 16. Department of Linguistics, Lancaster University.

Hardie, A. (2004) The computational analysis of morphosyntactic categories in Urdu. PhD thesis, University of Lancaster.

Hardie, A. (2005) Automated part-of-speech analysis of Urdu: conceptual and technical issues. In: Yadava, Y, Bhattarai, G, Lohani, RR, Prasain, B and Parajuli, K (eds.) Contemporary issues in Nepalese linguistics. Kathmandu: Linguistic Society of Nepal.

Langlais, P., Simard, M., & Véronis, J. (1998) Methods and practical issues in evaluating alignment techniques. Proceedings of 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Montréal, Canada.

Lie, M., Baker, P., McEnery, A. & Sebba, M. (1999) “Building a Corpus of Spoken Sylheti”, in N. Ostler (ed) The Proceedings of the 3rd Conference of the Foundation for Endangered Languages. Foundation for Endangered Languages, Bath.

Masica, C.P. (1991) The Indo-Aryan Languages, Cambridge University Press, Cambridge.

McEnery, A. (1999) Final Report on MILLEFT, Report to EPSRC, Lancaster University.
