Corpora of Czech as the First Language in Acquisition

CoCzeFLA

latest news

We offer the possibility of an internship! We are looking for interns and fellows, especially for manual checks of automatic morphological annotations from the Chroma corpus, and possibly for transcription for the emerging ChroMat corpus. Contact us 🙂

For a new recording, we are seeking a family with one firstborn child. The child should be between 12 and 24 months old and just starting to speak (producing only one-word utterances). If you are interested in joining our recording sessions, feel free to get in touch! Please keep in mind that we are looking only for a Czech-speaking child.

What Do We Do?

We lend cameras and dictaphones to families and record the communication of toddlers and their parents.

In each family, we make long-term recordings at regular intervals to get an overall picture of the child's language development, as well as how adults speak to the child.

These recordings are then transcribed and anonymized, which means that the children are given pseudonyms and other sensitive data are removed or replaced with pseudo-data.
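
The replacement step described above can be illustrated with a minimal sketch. It assumes that sensitive strings and their pseudo-data counterparts are kept in a simple lookup table; the function name `pseudonymize` and the example names are hypothetical and do not describe the project's actual tooling.

```python
import re


def pseudonymize(text: str, replacements: dict[str, str]) -> str:
    """Replace each sensitive string with its pseudo-data counterpart."""
    if not replacements:
        return text
    # Try longer keys first so that a longer name wins over a shorter
    # name it contains.
    keys = sorted(replacements, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keys) + r")\b"
    )
    return pattern.sub(lambda m: replacements[m.group(0)], text)


# Hypothetical mapping; both names are invented for illustration.
mapping = {"Honzík": "Vítek"}
print(pseudonymize("Honzík má banán .", mapping))
```

Exact matching of this kind misses inflected forms, which are common for Czech names, so a lookup table alone would not be sufficient in practice and manual review would still be needed.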

We publish the anonymized transcripts so that any interested researcher can analyze them. We also continue to work on newer versions of the corpora (e.g. with morphological annotation) and on their analysis.

How Do We Transcribe?

For transcription, we use the CHAT system, a standard format for transcribing spoken communication that was developed specifically for children’s speech, first published in 1990, and updated regularly since then (MacWhinney, 2000). The full CHAT manual is very extensive and is explicitly designed to let researchers select only the elements they need for their task. When transcribing for CoCzeFLA, we therefore work with our customized version of the Manual (v3.0, the most recent; v2.0; v1.0) and focus specifically on annotating (1) speech dysfluencies (repetitions, false starts, self-corrections), (2) interjections, and (3) morphological innovations, errors, and idiosyncratic expressions.

In addition to the basic CHAT system, we use several custom codes: codes for special uses of interjections in the predicate function (Uděláš ham? = Will you do yum?), in the nominal function (Podívej, támhle je hafhaf. = Look, there’s a bow-wow.) and in the attributive function (Pozor, ta pánev je auau! = Watch out, that pan is ow ow!), and a code for the occurrence of foreign words (Podívej, to je bunny. = Look, that’s králík.). All of these phenomena are annotated manually by the transcriber. In contrast, lemmatization and the subsequent morphological annotation of the corpora are not the transcribers' task; they require a lot of additional work, and creating them is currently the main focus of the START project.
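
Once such codes are in place, they can be counted automatically. The following is a minimal sketch, not the project's own tooling. The `@i` and `[*]` markers are taken from the published examples on this page; `[/]` (repetition) and `[//]` (retracing) are assumed from general CHAT conventions and may differ from the customized CoCzeFLA manual. The custom codes for interjection functions and foreign words are not named on this page and are therefore not covered.

```python
import re
from collections import Counter

# Marker patterns. "@i" and "[*]" appear in the examples on this page;
# "[/]" and "[//]" are assumptions based on general CHAT conventions.
CODE_PATTERNS = {
    "interjection": re.compile(r"\w+@i\b"),
    "error": re.compile(r"\[\*\]"),
    "repetition": re.compile(r"\[/\]"),
    "retracing": re.compile(r"\[//\]"),
}


def count_codes(main_line: str) -> Counter:
    """Count the annotation markers found on one main tier line."""
    counts = Counter()
    for name, pattern in CODE_PATTERNS.items():
        hits = len(pattern.findall(main_line))
        if hits:
            counts[name] = hits
    return counts


print(count_codes("*CHI: ňam@i, dobrý ."))
print(count_codes("*CHI: ty [*] ."))
```

The error pattern matches only the bare `[*]` form shown in the examples; error markers that carry additional content inside the brackets would need a wider pattern.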

For transcribers and other interested researchers, we offer a brief list of the codes used [the document is currently available in Czech only]. It contains both the codes used in the current transcription work for ChroMat and the codes as they are used in the already published versions of the ChroMat corpus.

Every recording is transcribed by a trained person, who then hands the transcript over to a second person for checking. The checked version is returned to the original transcriber, who re-reads it, cleans it up and creates the final version. Finally, the transcript goes through a formal automatic check in the CLAN program, which works with the CHAT format and has a dedicated function for this purpose.
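
To give a rough idea of what a formal check looks at, here is a toy pre-check. It is a sketch under stated assumptions and not a substitute for the check in CLAN: it only verifies that a transcript contains `@Begin`, ends with `@End`, and that every main tier line (starting with `*`) ends with `.`, `?` or `!`. The function name `precheck` is hypothetical.

```python
def precheck(transcript: str) -> list[str]:
    """Return a list of formal problems found in a CHAT-like transcript."""
    problems = []
    # Keep original line numbers, but ignore empty lines.
    numbered = [
        (number, line.strip())
        for number, line in enumerate(transcript.splitlines(), start=1)
        if line.strip()
    ]
    contents = [line for _, line in numbered]

    if "@Begin" not in contents:
        problems.append("missing @Begin header")
    if not contents or contents[-1] != "@End":
        problems.append("last line is not @End")

    for number, line in numbered:
        # Main tier lines start with "*" and must end with a terminator.
        if line.startswith("*") and not line.endswith((".", "?", "!")):
            problems.append(f"line {number}: main tier lacks a terminator")
    return problems


sample = "@Begin\n*CHI: jé !\n%eng: oh !\n*MOT: děkuju .\n@End\n"
print(precheck(sample))
```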

References:
MacWhinney, B. (2000). The CHILDES project: Tools for analyzing talk. 3rd edition. Mahwah, NJ: Lawrence Erlbaum Associates. 

Example from the ChroMat corpus
(Felix, 2 years old)
MAIN LINES ONLY

@Begin
*CHI: jé !
%eng: oh !
*MOT: jé .
%eng: oh .
*MOT: teď to pasuje, viď ?
%eng: it fits now, doesn't it ?
*CHI: pasuje to .
%eng: it fits .
*MOT: dáš mi taky ?
%eng: can I have one too ?
*CHI: na .
%eng: here .
*CHI: ňam, dobrý .
%eng: yum, good .
*MOT: děkuju .
%eng: thanks .
*CHI: ty .
%eng: you .
*CHI: dobrý .
%eng: good .
@End

Example from the ChroMat corpus
(Felix, 2 years old)
MAIN LINES INCL. CODES

@Begin
*CHI: jé@i ! # interjection
%eng: oh !
*MOT: jé@i . # interjection
%eng: oh .
*MOT: teď to pasuje, viď ?
%eng: it fits now, doesn't it ?
*CHI: pasuje to .
%eng: it fits .
*MOT: dáš mi taky ?
%eng: can I have one too ?
*CHI: na@i . # interjection
%eng: here .
*CHI: ňam@i, dobrý . # interjection
%eng: yum, good .
*MOT: děkuju .
%eng: thanks .
*CHI: ty [*] . # error (err)
%eng: you .
%err: reversal (2sg = 1sg) . # error comment: CHI uses "you" to refer to himself
*CHI: dobrý .
%eng: good .
@End

Example from the ChroMat corpus
(Felix, 2 years old)
FULL TRANSCRIPT

@Begin
@Situation: they put together a puzzle .
@Comment: CHI correctly attaches one puzzle piece to another .

*CHI: jé@i ! # interjection
%pho: jé .
*MOT: jé@i . # interjection
%pho: jé .
%com: emulates CHI .
*MOT: teď to pasuje, viď ?
%pho: teť to pasuje viť .
*CHI: pasuje to .
%pho: pašuje to .
%com: gets up, picks fruit out of the box .
*MOT: dáš mi taky ?
%pho: dáš mi taky .
%com: moves the cloud to the other images .
*CHI: na@i . # interjection
%pho: na .
%com: puts a piece he wanted to eat himself in MOT's mouth .
*CHI: ňam@i, dobrý . # interjection
%pho: ňam dobý .
%com: returns to the box .
*MOT: děkuju .
%pho: děkuju .
%com: smiles .
*CHI: ty [*] . # error (err)
%pho: ty .
%err: reversal (2sg = 1sg) . # error comment: CHI uses "you" to refer to himself
%com: picks fruit for himself, eats a piece of banana .
*CHI: dobrý .
%pho: dobý .
%com: bounces at the knees .
@End
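
The tier structure shown in the examples above (headers starting with `@`, main lines starting with `*`, dependent tiers starting with `%`) can be read into a simple data structure. The following is a minimal sketch, assuming the layout shown on this page; the names `Utterance` and `parse_chat` are hypothetical, and the trailing `# ...` notes are treated as page annotations and dropped, which is an assumption about these examples rather than a CHAT rule.

```python
from dataclasses import dataclass, field


@dataclass
class Utterance:
    speaker: str
    text: str
    tiers: dict[str, str] = field(default_factory=dict)


def parse_chat(transcript: str) -> tuple[dict[str, str], list[Utterance]]:
    """Group each main line with the dependent tiers that follow it."""
    headers: dict[str, str] = {}
    utterances: list[Utterance] = []
    for raw in transcript.splitlines():
        # Drop the explanatory "# ..." notes used in the examples here.
        line = raw.split(" # ", 1)[0].strip()
        if not line:
            continue
        if line.startswith("@"):
            name, _, value = line[1:].partition(":")
            headers[name] = value.strip()  # a repeated header overwrites
        elif line.startswith("*"):
            speaker, _, text = line[1:].partition(":")
            utterances.append(Utterance(speaker, text.strip()))
        elif line.startswith("%") and utterances:
            tier, _, value = line[1:].partition(":")
            utterances[-1].tiers[tier] = value.strip()
    return headers, utterances


sample = """@Begin
*CHI: ty [*] . # error (err)
%pho: ty .
%err: reversal (2sg = 1sg) . # error comment
@End
"""
headers, utterances = parse_chat(sample)
for utterance in utterances:
    print(utterance.speaker, utterance.text, utterance.tiers)
```

Keeping dependent tiers in a dictionary per utterance makes it easy to select, for example, only the `%pho` or `%err` information for a given speaker.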

About us

The core team of CoCzeFLA is made up of Anna Chromá and Klára Matiasovitsová, postgraduate students at the Faculty of Arts of Charles University. Associate Professor Filip Smolík has served as an experienced consultant since the beginning of the project. In the key area of transcription and revision, the project can currently rely on two experienced, long-term collaborators who are undergraduate students at the Faculty of Arts (Markéta Baslová and Kateřina Šimková) and on three skilled student newcomers (Jan Pinc, Leona Straková and Štěpánka Tvrdíková, trained at the turn of 2021/22). Until recently, Jolana Kohoutková was also part of the team of transcribers; she now does only a limited amount of revising and is otherwise a co-investigator of Klára’s START project. Two other PhD students, Jakub Sláma and Petra Čechová, are also involved in this project.