To see an example of the file header used in EMILLE, click here.
Of particular note is how the header encodes information about the speakers and setting of the transmitted speech.
Each speaker in the corpus receives a unique code. The first letter of the code corresponds to the primary language spoken in that section of the corpus, e.g. P = Panjabi. The three-digit number that follows is simply an identifier.
Sex can be f(emale), m(ale) or u(nknown).
Age is given one of the following codes:
0 = 0-14
1 = 15-24
2 = 25-34
3 = 35-44
4 = 45-59
5 = 60+
Occupation is based on the following codes:
0 = professional
1 = clerical
2 = service
3 = manual
4 = student
5 = not working
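The speaker codes, sex values, age codes and occupation codes described above can be expanded mechanically. The following is a minimal sketch in Python, not part of the EMILLE tools; the function name decode_speaker and the dictionary names are hypothetical, and the first letter of the speaker code is returned as-is because only P = Panjabi is stated here.

```python
import re

# Lookup tables copied from the code lists in this section.
SEXES = {"f": "female", "m": "male", "u": "unknown"}
AGE_BANDS = {0: "0-14", 1: "15-24", 2: "25-34", 3: "35-44", 4: "45-59", 5: "60+"}
OCCUPATIONS = {
    0: "professional",
    1: "clerical",
    2: "service",
    3: "manual",
    4: "student",
    5: "not working",
}


def decode_speaker(code, sex, age, occupation):
    """Expand the header codes for one speaker into readable values.

    Hypothetical helper: assumes a speaker code is one capital letter
    followed by exactly three digits, e.g. P001.
    """
    match = re.fullmatch(r"([A-Z])(\d{3})", code)
    if match is None:
        raise ValueError(f"not a speaker code: {code!r}")
    return {
        "language_letter": match.group(1),  # e.g. P = Panjabi
        "identifier": match.group(2),
        "sex": SEXES[sex],
        "age_band": AGE_BANDS[age],
        "occupation": OCCUPATIONS[occupation],
    }
```

For example, decode_speaker("P001", "f", 2, 4) describes a female speaker aged 25-34 who is a student.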
who: supplies the identifiers of the participants at this setting.
<name type="city">…. This can be <name type="town"> when the interaction occurs in a town rather than a city, for example.
<name type="region">UK: North</name>. UK Regions are split into North, Mid and South.
Encoding Spoken Texts
Utterances are transcribed using the <u> element:
<u id="1" who="P001">how are you ?</u>
For the sake of the transcription an utterance lasts as long as one person is talking.
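As an illustration of how such utterances might be read back out of a transcription, here is a small Python sketch. It assumes utterances are never nested and that attribute values are always in double quotes; parse_utterances is a hypothetical name, and a real SGML parser would be more robust than these regular expressions.

```python
import re

# Matches one <u ...>...</u> element; \b stops it matching <unclear>.
U_PATTERN = re.compile(r"<u\b([^>]*)>(.*?)</u>", re.DOTALL)
# Matches one name="value" attribute pair.
ATTR_PATTERN = re.compile(r'(\w+)="([^"]*)"')


def parse_utterances(text):
    """Return one dict per <u> element: its attributes plus its text."""
    utterances = []
    for attrs, body in U_PATTERN.findall(text):
        record = dict(ATTR_PATTERN.findall(attrs))
        record["text"] = body.strip()
        utterances.append(record)
    return utterances
```

Applied to the example above, this gives a single record with id "1", who "P001" and the text "how are you ?".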
<vocal who="P001" desc="cough">
<vocal who="P001" desc="laugh">how are you</vocal>
<event desc="sound of coffee being stirred">
Vocalisations do not require closers, although if the transcriber wants to show that they continue while speech (or other events) are happening, the closer element can be used to show the duration of the event. Where multiple vocalisations occur at the same time, the id=x attribute can be used:
<vocal desc="cough" id="01">
<vocal who="P001" desc="laugh" id="02">how are you</vocal id="01">
<u who="P001">i'm </vocal id="02">fine</u>
Use the following standardised values for vocalisations: laugh, cough, sneeze, hiccup, yawn, sigh, whistle.
<pause dur="2">
Pauses are timed in seconds (this will be noted in the header). The dur attribute can also be used, if required, in other non-lexical elements such as coughs, e.g. <vocal id=u7 who=P2 dur=2 coughs>.
Questions are marked with question marks. Spaces are used on either side of full stops and question marks. Utterances do not have to end in a full stop if there is no discernible pause between the end of one utterance and the beginning of another.
<u who="P002">is it ? i’m not . going home</u>
Capitalisation is used for proper names only. Commas, quotation marks, colons, semi-colons, ampersands, percentage signs, signs for money or brackets are not to be used within spoken texts or within the elements. In English “OK” is written as ok and I as i. Exclamation words may be used to show emphasis.
For abbreviated names, capital letters are used, but a space is placed between each letter. So BT (British Telecom) becomes B T . Full stops are not used for acronyms.
Numbers are written out as words when they appear in conversation as quantifiers (e.g. I had two apples). Telephone numbers, times, etc. can be written as numerals, but with spaces between single digits, e.g. 0 1 9 1 23 24 24.
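Some of the punctuation conventions above can be checked automatically. The following Python sketch is one possible check, not an official validator: check_utterance_text is a hypothetical name, the set of money signs is an assumption, and single quotation marks are not checked because they cannot be told apart from apostrophes in words such as don't.

```python
import re

# Characters that should not appear in spoken text. The money signs
# listed here are an assumption; extend the set as needed.
FORBIDDEN = set(',":;&%()[]{}£$€')


def check_utterance_text(text):
    """Return a list of convention problems found in one utterance."""
    issues = []
    found = sorted(set(text) & FORBIDDEN)
    if found:
        issues.append("forbidden characters: " + " ".join(found))
    # A full stop or question mark must have a space (or the edge of
    # the text) on both sides.
    if re.search(r"\S[.?]|[.?]\S", text):
        issues.append("full stop or question mark without surrounding spaces")
    return issues
```

An utterance such as "is it ? i'm not . going home" passes with no issues, while "how are you?" is reported because the question mark is attached to the preceding word.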
<foreign lang="eng">how are you ? </foreign>
The foreign element is generally not needed to note proper names.
The three letter code for each language is taken from the ISO 639-2 standard which can be found in full at: http://sunrise.eng.monash.edu.au/sunrise/html4/oldtut/ISO6392.HTM
The relevant languages and their codes are:
ara Arabic
ben Bengali
eng English
guj Gujarati
hin Hindi
pan Panjabi
sin Singhalese
tam Tamil
urd Urdu
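The code list above can be used to look up the language of each foreign element. This is a minimal Python sketch under the assumption that foreign elements are not nested and carry only a lang attribute; foreign_spans is a hypothetical name.

```python
import re

# Three-letter codes and language names as listed in this section.
LANGUAGE_CODES = {
    "ara": "Arabic",
    "ben": "Bengali",
    "eng": "English",
    "guj": "Gujarati",
    "hin": "Hindi",
    "pan": "Panjabi",
    "sin": "Singhalese",
    "tam": "Tamil",
    "urd": "Urdu",
}

FOREIGN = re.compile(r'<foreign lang="([^"]*)">(.*?)</foreign>', re.DOTALL)


def foreign_spans(text):
    """Return (code, language name, text) for each <foreign> element.

    The language name is None when the code is not in the list above.
    """
    return [
        (code, LANGUAGE_CODES.get(code), body.strip())
        for code, body in FOREIGN.findall(text)
    ]
```

A code that is not in the list yields None as the language name, which makes unexpected codes easy to spot.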
<unclear cause="passing truck">i don't</unclear>
Or where it is impossible to determine what is being said, the <omit> element can be used:
<omit extent="3 syllables" cause="passing truck">
Cases of false starts, repetition and truncated words are included in the transcription but marked as being editorially “deleted” by the speaker:
<del type="truncation">y</del>yes
<del type="repetition">i i</del>i don't know
<del type="false start">i</del> you're crazy
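Because the deleted material stays in the transcription, a reading without the false starts, repetitions and truncations can be recovered by dropping the del elements. The Python sketch below shows one way to do this; strip_deletions is a hypothetical name and the code assumes del elements are not nested.

```python
import re

DEL = re.compile(r'<del type="([^"]*)">(.*?)</del>', re.DOTALL)


def strip_deletions(text):
    """Return (text without <del> elements, list of (type, deleted text))."""
    deleted = DEL.findall(text)
    cleaned = DEL.sub("", text)
    # Collapse any whitespace left behind by the removed elements.
    cleaned = " ".join(cleaned.split())
    return cleaned, deleted
```

For the three examples above this gives "yes", "i don't know" and "you're crazy", together with a record of what was removed and why.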
<vocal desc="er">
Don’t capitalise. A question mark can be added to show a questioning noise:
<vocal desc="mm?">
The following sounds are acceptable:
er erm mm mhm um uh ah aha oh ooh urgh argh eek uh-oh
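The desc values of vocal elements can be checked against this list and against the standardised vocalisation values given earlier (laugh, cough, sneeze, hiccup, yawn, sigh, whistle). The Python sketch below is one possible check, with the hypothetical name check_vocal_descs; it assumes a trailing question mark is only allowed on the sounds in the list above.

```python
import re

# Standardised vocalisation values and acceptable sounds from this page.
STANDARD_VOCALS = {"laugh", "cough", "sneeze", "hiccup", "yawn", "sigh", "whistle"}
FILLER_SOUNDS = {
    "er", "erm", "mm", "mhm", "um", "uh", "ah", "aha",
    "oh", "ooh", "urgh", "argh", "eek", "uh-oh",
}

VOCAL_DESC = re.compile(r'<vocal\b[^>]*\bdesc="([^"]*)"')


def check_vocal_descs(text):
    """Return the desc values of <vocal> elements found in neither list."""
    problems = []
    for desc in VOCAL_DESC.findall(text):
        if desc in STANDARD_VOCALS:
            continue
        # A question mark may mark a questioning noise, e.g. mm?
        base = desc[:-1] if desc.endswith("?") else desc
        if base in FILLER_SOUNDS:
            continue
        problems.append(desc)
    return problems
```

Capitalised values such as "Er" are reported, since the sounds are not to be capitalised.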
<u who="P001">hi . how<anchor id="1">are you<anchor id="2"></u>
<u who="P002"><anchor synch="1">i'm<anchor synch="2"> fine</u>
In the above example, the words “are you” and “i’m” are spoken simultaneously.
The next set of utterances with overlap in this conversation will use anchors with id=3 and id=4.
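The overlapping stretches can be recovered by pairing the text between two id anchors with the text between the matching synch anchors. The Python sketch below assumes, as in the example, that anchors come in consecutive pairs within an utterance; the names overlap_segments and find_overlaps are hypothetical.

```python
import re


def overlap_segments(utterance, attr):
    """Map (start, end) anchor numbers to the text between the two anchors.

    attr is "id" for the first speaker and "synch" for the second.
    """
    pattern = re.compile(r'<anchor %s="(\d+)">' % re.escape(attr))
    # Splitting on a pattern with one group gives
    # [text, number, text, number, text, ...].
    pieces = pattern.split(utterance)
    segments = {}
    for i in range(1, len(pieces) - 2, 4):
        start, between, end = pieces[i], pieces[i + 1], pieces[i + 2]
        segments[(start, end)] = between.strip()
    return segments


def find_overlaps(first_utterance, second_utterance):
    """Pair up the simultaneous speech of two utterances."""
    first = overlap_segments(first_utterance, "id")
    second = overlap_segments(second_utterance, "synch")
    return {key: (first[key], second[key]) for key in first if key in second}
```

For the two utterances above, find_overlaps pairs "are you" with "i'm" under the anchor pair 1 and 2.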