Guidelines for encoding spoken data

In designing the encoding scheme for the transcription of spoken data, we took into account a number of findings based on existing resources on spoken corpus markup:

The LIDES (Language Interaction Data Exchange) scheme for encoding multilingual language data.
The TEI (Text Encoding Initiative) scheme. http://www.uic.edu/orgs/tei/p3/doc/p3.html
Findings based on encoding spoken language data in the MILLE (Minority Languages Engineering) project. See MILLE working paper 4.
The results of a questionnaire on corpus encoding preferences sent to language engineering communities. See Baker, J. P., Burnard, L., McEnery, A. & Wilson, A. (1998). Techniques for the Evaluation of Language Corpora: a report from the front, Proceedings of the First International Conference on Language Resources and Evaluation, Spain.
The results of a questionnaire on corpus encoding preferences sent to multilingual language-engineering communities. See MILLE working paper 7.
The findings of LE-EAGLES-WP4-3.2 Integrated Resources Working Group Survey and guidelines for the representation and annotation of dialogue http://www.ling.lancs.ac.uk/eagles/delivera/wp4aug1.html
Standards used in existing spoken corpora such as the BNC and the "100" corpus of telephone interactions.

It was decided to use the TEI scheme for markup of spoken texts. The decision to encode particular features was based upon the elements deemed to be “essential” in the MILLE questionnaire and the “highest priority” and “recommended” findings of LE-EAGLES-WP4-3.2.

The Header

To see an example of the file header used in EMILLE, click here.

Of particular note is how the header encodes information about the speakers and setting of the transcribed speech.

Describing Speakers

<person id="P001" sex="F" age="0">
<occupation code="0">
</person>

Each speaker in the corpus receives a unique code. The first letter of the code corresponds to the main language spoken in that section of the corpus, e.g. P = Panjabi. The three-digit number that follows is simply an identifier.

Sex can be f(emale), m(ale) or u(nknown).

Age is given one of the following codes:

Age   0-14  15-24  25-34  35-44  45-59  60+
Code  0     1      2      3      4      5

Occupation is based on the following codes:
0 = professional
1 = clerical
2 = service
3 = manual
4 = student
5 = not working
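
As an illustration of how the age and occupation codes above fit together, here is a minimal Python sketch that builds a <person> entry. The function names age_code and person_element are hypothetical helpers, not part of any EMILLE tool, and the sketch assumes ages are given in whole years.

```python
# Age bands and codes copied from the table above: (lowest age, highest age, code).
AGE_BANDS = [(0, 14, 0), (15, 24, 1), (25, 34, 2), (35, 44, 3), (45, 59, 4)]

# Occupation codes copied from the list above.
OCCUPATION_CODES = {
    "professional": 0,
    "clerical": 1,
    "service": 2,
    "manual": 3,
    "student": 4,
    "not working": 5,
}


def age_code(age_in_years):
    """Return the age code for an age in whole years (assumption: integer ages)."""
    if age_in_years < 0:
        raise ValueError("age cannot be negative")
    for low, high, code in AGE_BANDS:
        if low <= age_in_years <= high:
            return code
    return 5  # 60 and over


def person_element(speaker_id, sex, age_in_years, occupation):
    """Build a <person> entry laid out like the example in this section."""
    return (
        f'<person id="{speaker_id}" sex="{sex}" age="{age_code(age_in_years)}">\n'
        f'<occupation code="{OCCUPATION_CODES[occupation]}">\n'
        f"</person>"
    )


print(person_element("P001", "F", 10, "professional"))
```

Keeping the codes in one table each means a change to the bands only has to be made in one place.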

The Setting

<settingDesc>
<setting who="P001 P002">
<name type="city">Newcastle</name>
<name type="region">UK: North</name>
<date value="21-01-1999">
<locale>living room of a suburban home</locale>
<activity>chatting</activity>
</setting>
</settingDesc>

who: supplies the identifiers of the participants at this setting.
<name type="city">: this can be <name type="town"> when the interaction occurs in a town rather than a city, for example.

<name type="region">UK: North</name>. UK Regions are split into North, Mid and South.

Encoding Spoken Texts

Utterances

Utterances are transcribed using the <u> element:
<u id="1" who="P001">how are you ?</u>

For the sake of the transcription an utterance lasts as long as one person is talking.

Paralinguistic Information

Paralinguistic information (coughing, other noises, etc.) is annotated with the <event> and <vocal> elements.

<vocal who="P001" desc="cough">
<vocal who="P001" desc="laugh">how are you</vocal>
<event desc="sound of coffee being stirred">

Vocalisations do not require closing tags. However, if the transcriber wants to show that a vocalisation continues while speech (or another event) is happening, a closing tag can be used to show its duration. Where multiple vocalisations occur at the same time, the id attribute is used to tell them apart, and each closing tag repeats the id of the vocalisation it ends:

<vocal desc="cough" id="1">
<vocal who="P001" desc="laugh" id="2">how are you</vocal id="1">
<u who="P001">i'm </vocal id="2">fine</u>

Use the following standardised values for vocalisations: laugh, cough, sneeze, hiccup, yawn, sigh, whistle.

Punctuation, Capitalisation and Pauses

A short pause or hesitation (under 1 second in length) can be shown by using a full stop. Longer pauses are annotated in the <pause dur=x> format.

<pause dur="2">

Pauses are timed in seconds (this will be noted in the header). The dur attribute can also be used, if required, in other non-lexical elements such as coughs, e.g. <vocal id="u7" who="P2" dur="2" desc="cough">.
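
The pause rule can be sketched in Python as follows. The function name pause_markup is hypothetical, and rounding the duration to whole seconds is an assumption, since the guidelines do not say how fractions of a second are recorded.

```python
def pause_markup(seconds):
    """Return the transcription mark for a pause lasting `seconds` seconds."""
    if seconds < 1:
        # Short pause or hesitation: a full stop.
        # The caller is expected to put a space on either side of it.
        return "."
    # Longer pause: a <pause> element timed in seconds.
    # Assumption: the duration is rounded to whole seconds.
    return f'<pause dur="{round(seconds)}">'


print(pause_markup(0.4))
print(pause_markup(2))
```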

Questions are marked with question marks. Spaces are used on either side of full stops and question marks. Utterances do not have to end in a full stop if there is no discernible pause between the end of one utterance and the beginning of another.

<u who="P002">is it ? i’m not . going home</u>

Capitalisation is used for proper names only. Commas, quotation marks, colons, semi-colons, ampersands, percentage signs, signs for money or brackets are not to be used within spoken texts or within the elements. In English “OK” is written as ok and I as i. Exclamation words may be used to show emphasis.

For abbreviated names, capital letters are used, but a space is placed between each letter. So BT (British Telecom) becomes B T . Full stops are not used for acronyms.

Numbers are written out as words when they appear in conversation as quantifiers (e.g. I had two apples). Telephone numbers, times, etc. can be written as numerals, but with spaces between the individual numbers, e.g. 0 1 9 1 23 24 24.
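
A transcriber could check the punctuation rules above mechanically. The following Python sketch reports characters that should not appear in the spoken text of a line. The function name forbidden_characters is hypothetical, the particular set of money signs is an assumption, and markup tags are skipped because attribute values need quotation marks.

```python
import re

# Characters the guidelines exclude from spoken text.
# Assumption: "signs for money" is taken to mean $, £ and €.
FORBIDDEN = set(',":;&%$£€()[]{}“”')

# Matches any markup tag, so that attribute quotes are not reported.
TAG = re.compile(r"<[^>]*>")


def forbidden_characters(line):
    """Return the forbidden characters found outside tags, in order of first use."""
    text = TAG.sub("", line)
    found = []
    for ch in text:
        if ch in FORBIDDEN and ch not in found:
            found.append(ch)
    return found


print(forbidden_characters('<u who="P001">yes, 50% (maybe)</u>'))
```

Apostrophes are deliberately not in the set, because forms such as i'm and don't are used in the examples in these guidelines.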

Foreign Utterances

The <foreign> element is used to denote a piece of dialogue which does not occur in the default language e.g.

<foreign lang="eng">how are you ? </foreign>

The foreign element is generally not needed to note proper names.

The three-letter code for each language is taken from the ISO 639-2 standard, which can be found in full at: http://sunrise.eng.monash.edu.au/sunrise/html4/oldtut/ISO6392.HTM

The relevant languages and their codes are:

ara Arabic
ben Bengali
eng English
guj Gujarati
hin Hindi
pan Panjabi
sin Singhalese
tam Tamil
urd Urdu
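
The code list above can be used to guard against mistyped language codes. This is a minimal Python sketch in which foreign is a hypothetical helper name. It wraps a stretch of dialogue in a <foreign> element and rejects any code that is not in the list.

```python
# Language codes copied from the list above.
LANGUAGE_CODES = {
    "ara": "Arabic",
    "ben": "Bengali",
    "eng": "English",
    "guj": "Gujarati",
    "hin": "Hindi",
    "pan": "Panjabi",
    "sin": "Singhalese",
    "tam": "Tamil",
    "urd": "Urdu",
}


def foreign(text, lang):
    """Wrap `text` in a <foreign> element, checking the language code first."""
    if lang not in LANGUAGE_CODES:
        raise ValueError(f"unknown language code: {lang}")
    return f'<foreign lang="{lang}">{text}</foreign>'


print(foreign("how are you ? ", "eng"))
```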

Unclear Utterances

A "best guess" can be made by using the <unclear> element with an optional ‘cause’ attribute:

<unclear cause="passing truck">i don't</unclear>

Or where it is impossible to determine what is being said, the <omit> element can be used:

<omit extent="3 syllables" cause="passing truck">

Cases of false starts, repetition and truncated words are included in the transcription but marked as being editorially “deleted” by the speaker:

<del type="truncation">y</del>yes
<del type="repetition">i i</del>i don't know
<del type="false start">i</del> you're crazy
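
Because the deleted material stays in the transcription, a reading without it can be recovered when needed. The Python sketch below shows one way to do that. The function name without_deletions is hypothetical, and the pattern assumes that <del> elements are never nested.

```python
import re

# A <del> element with its content and any whitespace that follows it.
# Assumption: <del> elements are not nested inside one another.
DEL = re.compile(r'<del type="[^"]*">.*?</del>\s*')


def without_deletions(text):
    """Return `text` with every <del> element and its content removed."""
    return DEL.sub("", text)


print(without_deletions('<del type="false start">i</del> you\'re crazy'))
```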

Back-channelling, Exclamations etc.

As it is difficult to assign these sounds to a particular language, it is recommended that the <vocal> element is used for them:

<vocal desc="er">

Do not capitalise them. A question mark can be added to show a questioning noise:

<vocal desc="mm?">

The following sounds are acceptable:

er erm mm mhm um uh ah aha oh ooh urgh argh eek uh-oh
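
The two closed lists in these guidelines (standardised vocalisations and acceptable back-channel sounds) lend themselves to an automatic check of desc values. This Python sketch uses a hypothetical function name, is_valid_vocal_desc, and assumes that the optional question mark applies only to the back-channel sounds.

```python
# Standardised vocalisations listed under Paralinguistic Information.
VOCALISATIONS = {"laugh", "cough", "sneeze", "hiccup", "yawn", "sigh", "whistle"}

# Acceptable back-channel and exclamation sounds listed above.
BACKCHANNELS = set("er erm mm mhm um uh ah aha oh ooh urgh argh eek uh-oh".split())


def is_valid_vocal_desc(desc):
    """Return True if `desc` is an accepted value for <vocal desc=...>."""
    if desc in VOCALISATIONS:
        return True
    # Assumption: only back-channel sounds may carry a trailing question mark.
    base = desc[:-1] if desc.endswith("?") else desc
    return base in BACKCHANNELS


print(is_valid_vocal_desc("mm?"))
print(is_valid_vocal_desc("Er"))
```

The check is case-sensitive, so a capitalised value such as Er is rejected.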

Overlap

Overlap is indicated by using the <anchor> element, with the id and synch attributes:

<u who="P001">hi . how<anchor id="1">are you<anchor id="2"></u>
<u who="P002"><anchor synch="1">i'm<anchor synch="2"> fine</u>

In the above example, the words “are you” and “i’m” are spoken simultaneously.

The next set of utterances with overlap in this conversation will use anchors with id=3 and id=4.
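
The anchor numbering can be sketched in Python as follows. The function overlap_pair and its parameter names are hypothetical. It takes the first free anchor id and returns the two utterances, using that id and the next one for the start and end of the overlap.

```python
def overlap_pair(first_id, who_a, before_a, overlap_a, who_b, overlap_b, after_b):
    """Return two <u> elements in which overlap_a and overlap_b are simultaneous."""
    start, end = first_id, first_id + 1
    # First speaker: anchors with id mark where the overlap starts and ends.
    u_a = (
        f'<u who="{who_a}">{before_a}<anchor id="{start}">'
        f'{overlap_a}<anchor id="{end}"></u>'
    )
    # Second speaker: anchors with synch point back to those ids.
    u_b = (
        f'<u who="{who_b}"><anchor synch="{start}">'
        f'{overlap_b}<anchor synch="{end}">{after_b}</u>'
    )
    return u_a, u_b


for utterance in overlap_pair(1, "P001", "hi . how", "are you", "P002", "i'm", " fine"):
    print(utterance)
```

Calling the function again with first_id=3 gives the anchors for the next overlap in the same conversation.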
