The Corpus of Spoken Bosnian

The corpus aims to collect a representative sample of colloquial Bosnian. In the first round, we are recording 100 speakers from all over Bosnia-Herzegovina. The corpus design is balanced for age, gender, education and region. Speakers record themselves over the course of 24 hours. All recordings are split into communicative events and annotated with extensive metadata about the speaker and the situation.

Situation: Topic, aim, situation type, type of interlocutor, formality, preparedness, familiarity, time of day, duration, place, locale, channel, interactivity, frequency, language(s), non-linguistic activity
Speaker: Age, gender, education, birthplace, place of residence, mother tongue, faith, nationality, use of language(s), use of BCMS media, writing proficiency, language of partner, origin of partner, nationality of partner, faith of partner, nationality of parents, language of parents, faith of parents, income class
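
As an illustration only, the two metadata groups above could be represented as one record per communicative event that points to its speaker. The following Python sketch is not the corpus's actual data format: the class names, field names, identifiers, the unit of duration and all example values are assumptions made for this example, and only a subset of the listed fields is shown.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Speaker:
    """Subset of the speaker metadata listed above (hypothetical field names)."""
    speaker_id: str          # hypothetical identifier, not part of the listed metadata
    age: int
    gender: str
    education: str
    place_of_residence: str
    mother_tongue: str


@dataclass
class CommunicativeEvent:
    """Subset of the situation metadata listed above (hypothetical field names)."""
    speaker: Speaker
    topic: str
    situation_type: str
    channel: str
    duration_seconds: float  # the unit is an assumption
    languages: List[str] = field(default_factory=list)


def total_duration(events: List[CommunicativeEvent]) -> float:
    """Sum the durations of a list of communicative events, in seconds."""
    return sum(event.duration_seconds for event in events)


# Invented placeholder values, for illustration only.
example_speaker = Speaker(
    speaker_id="S001",
    age=30,
    gender="female",
    education="secondary",
    place_of_residence="Sarajevo",
    mother_tongue="Bosnian",
)
example_events = [
    CommunicativeEvent(example_speaker, "shopping", "service encounter",
                       "face-to-face", 90.0, ["Bosnian"]),
    CommunicativeEvent(example_speaker, "family news", "private conversation",
                       "telephone", 210.0, ["Bosnian"]),
]
```

With records of this shape, the recordings of one speaker's 24 hours can be filtered or grouped by any situation or speaker attribute, for example by calling total_duration(example_events) to get the recorded time for that speaker.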

Morphosyntactic annotation and lemmatization with the MULTEXT-East tagset

Univerzitet u Sarajevu, Humboldt-Universität zu Berlin

Halid Bulić, Azra Hodžić-Čavkić, Ismail Palić


Technical realisation and support:
Megan Nagel, Philipp Wasserscheidt