Lancaster-Oslo-Bergen Corpus
The Lancaster-Oslo/Bergen (LOB) Corpus is a one-million-word collection of British English texts. It was compiled in the 1970s in a collaboration between the University of Lancaster, the University of Oslo, and the Norwegian Computing Centre for the Humanities, Bergen, as a British counterpart to the Brown Corpus, which Henry Kučera and W. Nelson Francis compiled for American English in the 1960s.
Its composition was designed to match the original Brown Corpus as closely as possible in size and genre, using documents by British authors published in the UK in 1961.[1] Both corpora consist of 500 samples of about 2,000 words each, distributed across the following genres:
Label | Text category | Brown Corpus (samples) | LOB Corpus (samples) |
---|---|---|---|
A | Press: reportage | 44 | 44 |
B | Press: editorial | 27 | 27 |
C | Press: reviews | 17 | 17 |
D | Religion | 17 | 17 |
E | Skills, trades and hobbies | 36 | 38 |
F | Popular lore | 48 | 44 |
G | Belles lettres, biography, essays | 75 | 77 |
H | Miscellaneous (documents, reports, etc.) | 30 | 30 |
J | Learned and scientific writings | 80 | 80 |
K | General fiction | 29 | 29 |
L | Mystery and detective fiction | 24 | 24 |
M | Science fiction | 6 | 6 |
N | Adventure and western fiction | 29 | 29 |
P | Romance and love story | 29 | 29 |
R | Humour | 9 | 9 |
Total | | 500 | 500 |
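The totals and the one-million-word figure can be checked with a few lines of arithmetic. The sketch below copies the sample counts from the table and assumes exactly 2,000 words per sample, which is only an approximation; the variable names are illustrative and not part of any corpus distribution.

```python
# Sample counts per genre label, copied from the table above: (Brown, LOB).
SAMPLES = {
    "A": (44, 44), "B": (27, 27), "C": (17, 17), "D": (17, 17),
    "E": (36, 38), "F": (48, 44), "G": (75, 77), "H": (30, 30),
    "J": (80, 80), "K": (29, 29), "L": (24, 24), "M": (6, 6),
    "N": (29, 29), "P": (29, 29), "R": (9, 9),
}

# Assumption: every sample has exactly 2,000 words (the article says "about").
WORDS_PER_SAMPLE = 2000

brown_total = sum(brown for brown, _ in SAMPLES.values())
lob_total = sum(lob for _, lob in SAMPLES.values())
approx_words = lob_total * WORDS_PER_SAMPLE

# Genres where the two corpora do not have the same number of samples.
differing = {label: counts for label, counts in SAMPLES.items()
             if counts[0] != counts[1]}

print(brown_total, lob_total)  # 500 500
print(approx_words)            # 1000000
print(sorted(differing))       # ['E', 'F', 'G']
```

Only categories E, F and G differ between the two corpora, and the differences cancel out so that both still total 500 samples.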
The corpus has also been tagged: a part-of-speech category has been assigned to every word.[2]
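As a rough illustration of what a tagged corpus looks like to a program, the sketch below reads text in which each word carries its tag after an underscore. Both the `word_TAG` layout and the tag names (`DET`, `NOUN`, and so on) are assumptions made for this example, and the sentence is invented; the actual file format and tagset of the tagged LOB Corpus are described in its manual and may differ.

```python
from collections import Counter


def parse_tagged(line: str) -> list[tuple[str, str]]:
    """Split whitespace-separated 'word_TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # Split on the last underscore so words containing '_' stay intact.
        word, sep, tag = token.rpartition("_")
        if not sep:
            raise ValueError(f"token has no tag: {token!r}")
        pairs.append((word, tag))
    return pairs


# Invented sentence with illustrative tags (not taken from the corpus).
sample = "the_DET corpus_NOUN contains_VERB British_ADJ texts_NOUN"

pairs = parse_tagged(sample)
tag_counts = Counter(tag for _, tag in pairs)

print(pairs[0])            # ('the', 'DET')
print(tag_counts["NOUN"])  # 2
```

With every word paired with a category in this way, counts such as the number of nouns per genre become simple tallies over the tags.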
References
External links
- LOB Corpus Manual
- LOB Corpus from the Oxford Text Archive