The newspaper corpora contain texts from various newspapers and news portals. The aim is to collect a chronologically representative sample of newspaper texts in order to analyse their language, style and topics. The corpora are tagged according to the MULTEXT-East specifications.
| File | Size | Year | Format |
| --- | --- | --- | --- |
| Hungarian | 2,184,200 | 2018 | vertical |
| Bosnian | 493,735 | 2018 | vertical |
| Montenegrin | 3,389,533 | 2018 | vertical |
| Albanian (Kosovo) | 2,116,001 | 2018 | vertical |
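For readers unfamiliar with the vertical format, the following is a minimal sketch of how such a file could be read in Python. It assumes the common convention of one token per line with tab-separated attributes and XML-like structural lines such as `<doc>` and `<s>`. The column order (word form, MSD tag, lemma), the function name `iter_tokens` and the sample sentence are illustrative assumptions, not taken from the corpus files themselves.

```python
import io


def iter_tokens(lines):
    """Yield one tuple of attributes per token line of a vertical file.

    Assumption: structural markup occupies a whole line and looks like
    an XML tag (<doc ...>, <s>, </s>); every other non-empty line is a
    token whose attributes are separated by tabs.
    """
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("<") and line.endswith(">"):
            continue  # skip structural markup
        yield tuple(line.split("\t"))


# Invented sample; the columns (word, MSD tag, lemma) are an assumption.
sample = (
    '<doc id="example">\n'
    "<s>\n"
    "Novine\tNcfpn\tnovina\n"
    "pišu\tVmr3p\tpisati\n"
    ".\tZ\t.\n"
    "</s>\n"
    "</doc>\n"
)

tokens = list(iter_tokens(io.StringIO(sample)))
for token in tokens:
    print(token)
```

For a real corpus file, the same function can be given an open file handle instead of the `io.StringIO` sample. The number and order of columns should be checked against the file before relying on them.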
Data preparation: Philipp Wasserscheidt