Newspaper corpora

The newspaper corpora contain texts from various newspapers and news portals. The aim is to collect a chronologically representative sample of newspaper texts in order to analyse their language, style and topics. The corpora are tagged according to the MULTEXT-East specifications.

Hungarian 2,184,200 2018 vertical
Bosnian 493,735 2018 vertical
Montenegrin 3,389,533 2018 vertical
Albanian (Kosovo) 2,116,001 2018 vertical

Data preparation
Philipp Wasserscheidt