The Corpus of Spoken Bosnian
The corpus aims to collect a representative sample of colloquial Bosnian. In the first round, we are recording 100 speakers from all over Bosnia-Herzegovina. The corpus design is balanced by age, gender, education and region. Speakers record themselves over the course of 24 hours. All recordings are split into communicative events and furnished with extensive metadata about the speaker and the situation.
Situation: Topic, aim, situation type, type of collocutor, formality, preparedness, familiarity, time of day, duration, place, locale, channel, interactivity, frequency, language(s), non-linguistic activity
Speaker: Age, gender, education, birthplace, place of residence, mother tongue, faith, nationality, use of language(s), use of BCMS media, writing proficiency, language of partner, origin of partner, nationality of partner, faith of partner, nationality of parents, language of parents, faith of parents, income class
Morpho-syntactic annotation and lemmatization with the MULTEXT-East tagset
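
To make the structure described above more concrete, the following is a minimal sketch of how one communicative event with its metadata and annotated tokens might be represented. It is not the project's actual data format: the class names, the field selection (only a handful of the situation and speaker fields listed above) and all example values are assumptions made up for illustration, and the MSD string is only meant to show the general shape of a MULTEXT-East tag.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Speaker:
    # Hypothetical subset of the speaker metadata; the corpus records more fields.
    age: int
    gender: str
    education: str
    place_of_residence: str


@dataclass
class Situation:
    # Hypothetical subset of the situation metadata.
    topic: str
    situation_type: str
    formality: str
    duration_seconds: float


@dataclass
class Token:
    form: str   # word form as transcribed
    lemma: str  # dictionary form assigned by lemmatization
    msd: str    # morphosyntactic description (MULTEXT-East style tag)


@dataclass
class CommunicativeEvent:
    speaker: Speaker
    situation: Situation
    tokens: List[Token] = field(default_factory=list)

    def lemmas(self) -> List[str]:
        """Return the lemma of every token, in order."""
        return [t.lemma for t in self.tokens]


# Invented example values, not taken from the corpus.
event = CommunicativeEvent(
    speaker=Speaker(age=34, gender="female", education="tertiary",
                    place_of_residence="Sarajevo"),
    situation=Situation(topic="shopping", situation_type="conversation",
                        formality="informal", duration_seconds=95.0),
    tokens=[Token(form="kuće", lemma="kuća", msd="Ncfsg")],
)
```

Keeping speaker, situation and tokens in one record per event would make it straightforward to filter events by any metadata field before looking at the annotation, though the real corpus may organise this differently.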
Univerzitet u Sarajevu, Humboldt-Universität zu Berlin
Halid Bulić, Azra Hodžić-Čavkić, Ismail Palić
Technical realisation and support:
Megan Nagel, Philipp Wasserscheidt