Certain sections of the EMILLE Corpora can now (as of 2015) be accessed via Lancaster's CQPweb server: https://cqpweb.lancs.ac.uk/
To use these datasets, create a CQPweb account (free to all) and then enter the corpus interface from the homepage.
Note that while these datasets can be analysed online, they cannot be downloaded in full via CQPweb. For access to full-text downloads, see below.
ELRA / ELDA
The EMILLE Corpora were released in August 2003. They are distributed via ELRA / ELDA.
The data is available in two forms. The full EMILLE/CIIL Corpus is made available, for free, for research use only. See:
Resource page for W0037 on ELRA catalogue (or catalogue search)
The EMILLE Lancaster Corpus is a 59-million-word subset of the full data, available for commercial exploitation at a cost. See:
Resource page for W0038 on ELRA catalogue (or catalogue search)
For more information and pricing details, see the ELDA catalogue here:
For information on how to order, see the following page on the ELRA site:
A beta version of the corpus, consisting of a restricted sample of the data, was released in 2003. We do not advise using this version, as it contains some encoding errors that were identified after the beta release.
Downloads from this site
You can download the corpus manual from this site.
You can also download a copy of the part-of-speech tagset used to tag the Urdu corpus.
You can also download the Unicodify software used to convert the encoding of the written corpus.
Finally, if you are having difficulty viewing any of the texts in the corpus, you may wish to make use of the GATE Unicode viewer. This software, which runs in the Java environment, is a simplified extract from the GATE architecture.