Deutsche Romane des 19. Jahrhunderts (DE19): A nineteenth-century reference corpus of German novels for contrastive analysis

We are excited to introduce Deutsche Romane des 19. Jahrhunderts (DE19), a 4.5-million-word corpus of nineteenth-century German novels designed to be comparable to the 19th Century Reference Corpus of English (EN19). DE19 is available to download now and will soon be accessible via the CLiC web app (Mahlberg et al., 2020). 

Supporting contrastive concordance reading

We have created DE19 as part of the Reading Concordances in the 21st Century project. One of the aims of the project is to broaden resources for tool-independent concordancing from the traditional English context into German. Since a lot of concordancing is already being done with English literary corpora, we wanted to see what could be done to support future contrastive study with German narrative fiction.

Creating truly comparable corpora is not straightforward. Generally, comparable corpora are collections of texts in different languages that are similar in genre, domain, and sampling period (McEnery & Hardie, 2012). We have previously created a general reference corpus of nineteenth-century novels, the 19th Century Reference Corpus of English (19C), which is one of the CLiC corpora. This corpus was originally created to provide contextual information for the study of Dickens’s novels (which is why there are no texts by Dickens in the corpus!). As we have used the sampling criteria of this corpus to inform the design of DE19, we will refer to 19C as “EN19” here.

What is in the DE19 corpus?

EN19 consists of 14 canonical novels by eight female authors (2.53 million words) and 24 canonical novels by ten male authors (1.98 million words). All texts were published in the period 1800–1899, the longest is 310,000 words, and no more than three texts by the same author are included.

        

Some of the authors in EN19: Charlotte Brontë, Oscar Wilde, Mary Shelley, Thomas Hardy (Images from WikiCommons).

To find comparable texts for DE19, we consulted the German Novel Corpus (Konle et al., 2021), which is part of a large, comparable corpus of literature in twelve languages called the European Literary Text Collection (ELTeC) (Odebrecht et al., 2021). The German Novel Corpus has a balanced sample of texts by key authors active in the period 1800–1899, collected from the Deutsches Textarchiv and TextGrid repositories.

It seems that fewer German novels published in the nineteenth century are available (at least in digital form) than English ones, and that the predominance of male authors writing fiction at the time was significantly greater. We also found that German novels tended to be longer on average, so we chose to balance gender representation by number of authors and texts rather than by text length (while matching the upper limit in EN19). In terms of societal impact, more novels by male authors than by female authors have high numbers of reprints (Konle et al., 2021). We therefore balanced our corpus by including fewer texts by male authors with low circulation figures.

Some of the authors in DE19: Theodor Fontane, Fanny Lewald, Gustav Freytag, Hedwig Dohm (Images from WikiCommons).

Our final corpus includes 13 novels by nine female authors (1.24 million words) and 20 novels by thirteen male authors (3.24 million words). No more than three texts were written by the same author, and the longest text is 286,000 words. The DE19 and EN19 corpora are both available to download for free from the mahlberg-lab/corpora GitHub repository. Technical information about text conversion and annotation is also available there.
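As a quick check on how the two corpora compare, the composition figures quoted in this post can be tabulated in a few lines of Python. This is only a sketch: the numbers are copied from the text above, and the dictionary layout and the `summarise` helper are our own illustrative choices, not part of the corpus release.

```python
# Figures copied from this post: (novels, authors, million words).
# The data structure and function name are hypothetical, for illustration only.
CORPORA = {
    "EN19": {"female": (14, 8, 2.53), "male": (24, 10, 1.98)},
    "DE19": {"female": (13, 9, 1.24), "male": (20, 13, 3.24)},
}


def summarise(name):
    """Return totals and the share of novels by female authors for one corpus."""
    groups = CORPORA[name]
    novels = sum(g[0] for g in groups.values())
    authors = sum(g[1] for g in groups.values())
    # Round to two decimals to avoid floating-point noise in the sum.
    million_words = round(sum(g[2] for g in groups.values()), 2)
    female_novel_share = round(groups["female"][0] / novels, 2)
    return {
        "novels": novels,
        "authors": authors,
        "million_words": million_words,
        "female_novel_share": female_novel_share,
    }


for corpus in CORPORA:
    print(corpus, summarise(corpus))
```

On these figures the two corpora are close in overall size and in the proportion of novels by female authors, which is the sense in which DE19 is balanced by number of texts and authors rather than by word count.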

What will users of DE19 be able to do in CLiC? 

The CLiC web app is a user-friendly tool that supports researchers and students in the computer-assisted close reading of narrative fiction. All CLiC corpora have been marked up so that users can search subsets of the text: they can choose to look only at the speech of characters, or run searches just on narration. DE19 has been annotated in the same way. Students and researchers will therefore be able to use the functions of CLiC both to analyse German narrative fiction and to compare features of narrative fiction in English and German.
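To give a rough idea of what searching within a subset means, here is a minimal Python sketch of a concordance search run separately on speech and on narration. It rests on strong assumptions: direct speech is taken to be anything between German quotation marks („ … “), the sample sentence is invented, and the function names `split_subsets` and `kwic` are hypothetical. It does not reproduce the annotation used in CLiC or in DE19.

```python
import re

# Assumption: direct speech is enclosed in German quotation marks („ … “).
# This is a simplification for illustration, not the CLiC annotation scheme.
QUOTE_RE = re.compile(r"„[^“]*“")


def split_subsets(text):
    """Split a text into (speech, narration) using the quotation-mark heuristic."""
    speech = " ".join(m.group(0) for m in QUOTE_RE.finditer(text))
    narration = QUOTE_RE.sub(" ", text)
    return speech, narration


def kwic(text, node, width=3):
    """Return (left context, node, right context) for each hit of the node word."""
    tokens = re.findall(r"\w+", text)
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == node.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


# Invented sample sentence, not taken from the corpus.
sample = "Er hob die Hand. „Gib mir deine Hand“, sagte sie leise."
speech, narration = split_subsets(sample)

print("Speech:   ", kwic(speech, "Hand"))
print("Narration:", kwic(narration, "Hand"))
```

Running the same search on each subset separately is what lets a reader see whether a word such as a body part noun patterns differently in characters' speech and in narration.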

We are already using DE19 and EN19 to compare patterns of body part nouns in English and German fiction; watch this space for updates! What applications does DE19 have in your research? If you are interested in comparative study of English and German literature, we would love to hear from you! 

Texts in DE19

Arnim, B. (1840). Die Günderode.
(1844). Clemens Brentanos Frühlingskranz.
Boy-Ed, I. (1889). Fanny Förster.
Dahn, F. (1876). Kampf um Rom.
Dohm, H. (1896). Sibilla Dalmar.
(1899). Schicksale einer Seele.
Ebner-Eschenbach, M. (1876). Bozena.
Fontane, T. (1878). Vor dem Sturm.
(1897/98). Der Stechlin.
François, L. (1871). Die letzte Reckenburgerin.
Freytag, G. (1855). Soll und Haben.
(1864). Die verlorene Handschrift.
Ganghofer, L. (1895). Schloß Hubertus.
Hahn-Hahn, I. G. (1846). Sibylle.
Kurz, H. (1855). Der Sonnenwirt.
Lewald, F. (1843). Jenny.
Marlitt, E. (1871/72). Das Heideprinzeßchen.
(1866/67). Goldelse.
(1868). Das Geheimnis der alten Mamsell.
May, K. (1898). Am Jenseits.
Meyer, C. F. (1874). Jürg Jenatsch.
Polenz, W. (1895). Der Büttnerbauer.
Raabe, W. (1869/70). Der Schüdderump.
(1863/64). Der Hungerpastor.
(1863). Die Leute aus dem Walde, ihre Sterne, Wege und Schicksale.
Schopenhauer, A. (1845). Anna.
Spielhagen, F. (1861). Problematische Naturen. Erste Abtheilung.
(1869). Hammer und Amboß.
Stifter, A. (1841). Die Mappe meines Urgroßvaters.
(1855, 1865–1867). Witiko.
Tieck, L. (1840). Vittoria Accorombona.
Vischer, F. T. (1879). Auch Einer 1.
(1879). Auch Einer 2.

Mahlberg, M., Stockwell, P., Wiegand, V., & Lentin, J. (2020). CLiC 2.1. Corpus Linguistics in Context. https://clic.bham.ac.uk

McEnery, T., & Hardie, A. (2012). Corpus linguistics: Method, theory and practice. Cambridge University Press. 

Konle, L., Jannidis, F., Odebrecht, C., & Burnard, L. (2021). German Novel Corpus (ELTeC-deu) (version 1.0.0). [Data set]. 

Odebrecht, C., Burnard, L., & Schöch, C. (Eds.). (2021). ELTeC: European Literary Text Collection (version 1.1.0). COST Action Distant Reading for European Literary History (CA16204). 

Please cite this post as follows: Finlayson, N., Evert, S., Mahlberg, M., & Piperski, A. (2024). Deutsche Romane des 19. Jahrhunderts (DE19): A nineteenth-century reference corpus of German novels for contrastive analysis [Blog post]. Reading Concordances in the 21st Century Blog, University of Birmingham. Retrieved from https://blog.bham.ac.uk/rc21/2024/04/26/deutsche-romane-des-19-jahrhunderts-de19-a-nineteenth-century-reference-corpus-of-german-novels-for-contrastive-analysis/
