We have reached the final stage of preparing the ChroMat corpus. Although we will keep recording and transcribing Augustýn for about another year, we feel it is time to start concrete planning of a new project: a corpus of children with developmental language disorder.
We lend cameras and dictaphones to families and record the communication of toddlers and their parents.
In each family, we make long-term recordings at regular intervals to get an overall picture of the child's language development, as well as how adults speak to the child.
These recordings are then transcribed and anonymized, which means that the children are given pseudonyms and other sensitive data are removed or replaced with pseudo-data.
We publish the anonymized transcripts so that any interested researcher can analyze them. We also continue to work on newer versions of the corpora (e.g. with morphological annotation) and on their analysis.
For transcription, we use the CHAT system, a standard format for transcribing spoken communication that was developed specifically for child speech, first published in 1990 and updated regularly since then (MacWhinney, 2000). The full CHAT manual is very extensive and is explicitly designed to let each researcher select only the elements needed for the task at hand. When transcribing for CoCzeFLA, we therefore work with our customized version of the Manual (in Czech only: v3.1 – the most recent; v2.0; v1.0) and focus specifically on annotating (1) speech dysfluencies (repetitions, false starts, self-corrections), (2) interjections, and (3) morphological innovations, errors, and idiosyncratic words. In addition to the basic CHAT system, we use several custom codes: codes for special uses of interjections as verb phrases (Uděláš ham? = Will you do yum?), nominal phrases (Podívej, támhle je hafhaf. = Look, there’s a bow-wow.) and modifiers (Pozor, ta pánev je auau! = Watch out, that pan is ow ow!), and a code for the occurrence of foreign words (Podívej, to je bunny. = Look, that’s králík.). These phenomena are coded manually by the transcriber. By contrast, lemmatization and morphological tagging of the corpora are not the transcribers' task; they are performed separately (see below).
For transcribers and other interested researchers, we offer a brief list of the applied codes: CHAT for CoCzeFLA. It covers both the codes used in the current transcription work for the ChroMat corpus and the codes as they were used in the already published versions of the Chroma corpus.
The age of the children is always given in the format Y;MM(.DD).
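The age format can be read programmatically. The following is a minimal sketch, assuming that Y is the number of years, MM a two-digit number of months (00 to 11) and the optional DD a two-digit number of days. The names `ChildAge`, `parse_age` and `age_in_months` are hypothetical helpers for illustration and are not part of CLAN or of the CoCzeFLA tooling.

```python
import re
from typing import NamedTuple, Optional


class ChildAge(NamedTuple):
    """Age parsed from a Y;MM(.DD) string (hypothetical helper type)."""
    years: int
    months: int
    days: Optional[int]  # None when the .DD part is absent


# Assumption: months and days are always written with two digits.
AGE_RE = re.compile(r"^(\d+);(\d{2})(?:\.(\d{2}))?$")


def parse_age(text: str) -> ChildAge:
    """Parse an age such as '2;03.15' or '1;11'."""
    match = AGE_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a Y;MM(.DD) age: {text!r}")
    years = int(match.group(1))
    months = int(match.group(2))
    days = int(match.group(3)) if match.group(3) is not None else None
    if months > 11:
        raise ValueError(f"months out of range in {text!r}")
    return ChildAge(years, months, days)


def age_in_months(age: ChildAge) -> int:
    """Completed months of age, ignoring the day part."""
    return age.years * 12 + age.months
```

For example, `parse_age("2;03.15")` yields 2 years, 3 months and 15 days, and `age_in_months` turns this into 27 completed months, which is convenient for grouping recordings by age.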
Every recording is transcribed by a trained transcriber, who then hands it over to another person for checking. The checked version is returned to the original transcriber, who re-reads it, cleans it up and creates the final version. Finally, the transcript goes through a formal automatic check in the CLAN program, which works with the CHAT format and has a special function for this purpose.
References:
MacWhinney, B. (2000). The CHILDES project: Tools for analyzing talk. 3rd edition. Mahwah, NJ: Lawrence Erlbaum Associates.
Since the 2023.04 version of the Chroma corpus, CoCzeFLA corpora have been morphologically annotated. They thus allow automatic searches by lemma, part of speech, and specific morphological categories. The annotation is based on an automatic analysis performed by the MorphoDiTa tool. The output from MorphoDiTa is subsequently transformed into the %mor tier, which complies with the CHAT transcription format and can be analyzed with the CLAN software. The format of the %mor tier is shown below in an example of an annotated transcript from Julie (the Chroma corpus). In the version Chroma 2023.07, the automatic annotation of the children’s lines was checked manually in full, while the adults’ lines keep the automatic annotation for now. This can lead to annotation inconsistencies: e.g., hodiny ‘clock’ in the example below is annotated as n:pt (plurale tantum) on the child’s line, but not on the mother’s line.
Detailed information on the morphological annotation and its format can be found in the User Manual (in Czech only).
*CHI: +< je tam hodiny [*] . ‘there is a clock .’
%mor: v|být-3&SG&ind&pres&akt&impf-err adv:pro|tam n:pt|hodiny-1&PL&F .
%pho: je tam hodiny .
%err: je = jsou .
(…)
*MOT: jsou tam hodiny . ‘there is a clock .’
%mor: v:aux/cop|být-3&PL&ind&pres&akt&impf adv:pro|tam n|hodina-1&PL&F .
%pho: jsou tam hodiny .
%com: she corrects CHI .
*CHI: jsou tam . ‘there it is .’
%mor: v|být-3&PL&ind&pres&akt&impf adv:pro|tam .
%pho: sou tam .
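The %mor lines in the example above can be split into their parts with a few lines of Python. This is a minimal sketch that assumes only what the example shows: each word is written as `pos|lemma`, optionally followed by `-` and a list of features joined by `&`, with an optional final `-err` marker, and punctuation carries no `|`. It also assumes that lemmas contain no hyphen. The names `MorToken` and `parse_mor_tier` are hypothetical, and the full format is described in the User Manual, not here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MorToken:
    """One word of a %mor tier (hypothetical helper type)."""
    pos: str                      # e.g. "n:pt" or "v:aux/cop"
    lemma: str                    # e.g. "hodiny"
    features: List[str] = field(default_factory=list)
    error: bool = False           # True when the word carries "-err"


def parse_mor_tier(line: str) -> List[MorToken]:
    """Split a %mor line into tokens, skipping punctuation."""
    body = line.strip()
    if body.startswith("%mor:"):
        body = body[len("%mor:"):]
    tokens = []
    for item in body.split():
        if "|" not in item:
            continue  # punctuation such as "."
        pos, rest = item.split("|", 1)
        # Assumption: "-" only separates lemma, features and "err".
        parts = rest.split("-")
        lemma, suffixes = parts[0], parts[1:]
        features = []
        for suffix in suffixes:
            if suffix != "err":
                features.extend(suffix.split("&"))
        tokens.append(MorToken(pos, lemma, features, "err" in suffixes))
    return tokens


chi = parse_mor_tier("%mor: v|být-3&SG&ind&pres&akt&impf-err adv:pro|tam n:pt|hodiny-1&PL&F .")
mot = parse_mor_tier("%mor: v:aux/cop|být-3&PL&ind&pres&akt&impf adv:pro|tam n|hodina-1&PL&F .")
# The same word form is tagged differently on the two lines.
print(chi[2].pos, chi[2].lemma)
print(mot[2].pos, mot[2].lemma)
```

Comparing the third token of the child's and the mother's line makes the inconsistency mentioned above visible: the child's line has `n:pt` with the lemma `hodiny`, the mother's line has `n` with the lemma `hodina`.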
The core team of CoCzeFLA consists of Anna Chromá and Klára Matiasovitsová. Associate Professor Filip Smolík has served as an experienced consultant since the beginning of the project. In the key area of transcription and revision, the project relies on a team of undergraduate students, who are rewarded with a scholarship from the Faculty of Arts at Charles University and some of whom complete their mandatory internships with us.