28th November 2017

The GLARE 19th Century Children’s Literature Corpus in CLiC

Anna Čermáková is a Marie Sklodowska-Curie Fellow on the EU-funded GLARE Project [“Exploring Gender in Children’s Literature from a Cognitive Corpus Stylistic Perspective”] at the University of Birmingham. She is a member of the CLiC Dickens advisory board and her main research interests are in corpus linguistics, particularly corpus stylistics. She is also interested in literary translation, contrastive corpus-based linguistics and lexicology.

There is a new addition to the CLiC corpora family: the GLARE 19th Century Children’s Literature corpus (ChiLit). The GLARE project, which looks at gender representation in children’s literature and its development over time, starts with an examination of the 19th century, and the corpus that has been collected for the project is now hosted in CLiC (thank you CLiC!). It amounts to 4.4 million words, so it is similar in size to the CLiC 19th Century Reference Corpus (4.5 million words).

Unlike the 19th Century Reference Corpus, which contains 29 books, ChiLit is made up of 71 books. Some of them are very short, like the texts by Beatrix Potter, which average only about 1,000 words. At the other extreme, the longest book is C. M. Yonge’s The Daisy Chain, or Aspirations, which reaches nearly 300,000 words, followed by The Heir of Redclyffe with well over 230,000 words. Excluding these two extremes, the average book in ChiLit is about 62,000 words, which is still less than half of the average novel in the 19th Century Reference Corpus (over 155,000 words). Interestingly, the average length in the corpus of Dickens’s fiction is much higher: 255,700 words.

In ChiLit, there are 35 books by women writers and 36 books by male writers. Some authors are represented by more than one book, so there are only 38 different writers (14 female, 24 male). The women writers take up slightly less than half of the words in the corpus, i.e. about 44% (1.9 million words) as opposed to men at 56% (2.5 million words).

ChiLit gives us a snapshot of the Golden Age of English children’s literature, i.e. children’s fiction written in the 19th century. The earliest book included in the corpus is The Rival Crusoes; Or, The Ship Wreck by Agnes Strickland, published in 1826, and the latest books are from 1911: Peter and Wendy (Peter Pan) by J. M. Barrie and The Secret Garden by F. H. Burnett.

The selection of books for ChiLit was primarily guided by Children’s Literature. An Anthology 1801–1902 compiled by Peter Hunt (2001, Oxford) (henceforth Anthology). The Anthology is based on four principles of compilation, three of which are also relevant for the selection of titles in ChiLit. The first principle aims to cover “a reasonable representation of what was written for and read widely by English-speaking children in the nineteenth century”. The second principle aims at choosing “historically significant, or good examples of their kind” (particularly in terms of the newly emerging genres, e.g. fantasy or school story). The third principle ensures that the books selected are “readable today” with the “heaviest emphasis … on books that ‘entertain’ rather than instruct” (Hunt 2001: xv-xvi). In line with the third principle, all the books selected for ChiLit have also been recently reprinted (at least once after 2010).

The Anthology excludes Beatrix Potter’s books “because they are both readily available (in a bewildering number of editions and variants) and because they were designed as complete experience, of a certain size, shape and texture” (Hunt 2001: xvi). However, we decided to include some of her texts in the ChiLit corpus (because they are such classics!), while being aware that in this case the text without its accompanying pictures may pose problems. The issue of illustrations in children’s books is a complex one and it is debatable to what degree some of the texts of children’s fiction can be analysed without their illustrations. More broadly, this is one of the general issues in corpus linguistics as a discipline or a method of analysis.

John Tenniel’s illustration ‘Alice sitting at a mad tea-party’ for ‘Alice’s Adventures in Wonderland’

In line with the GLARE project objectives, ChiLit (unlike the Anthology) contains only books by British authors (or those with a British background); therefore, classics such as Little Women by Louisa May Alcott are not included. In addition, translations (or retellings of myths, legends and other folklore texts), nursery rhymes, classic fairy tales and poetry in general are not included in ChiLit. Apart from the Anthology, the selection of books for ChiLit was further guided by another reference work: Children’s Literature. An Illustrated History edited by Peter Hunt (1995, Oxford).

Some books that we know today as children’s books were at the time of their publication in the 19th century also read by adults. This is partly due to the fact that books were often published in various magazines and some of the magazines would have been aimed at the whole family. Also, quite a few authors were writing for both audiences: e.g. Oscar Wilde, William Makepeace Thackeray, or George MacDonald. As Hunt puts it: “The definition of children’s literature is an immensely complex and variable one, and generally rests upon authorial intention (however deduced), or the reader ‘implied’ in the text (however deduced), rather than a factual examination of which books were or are marketed for, adopted by, or imposed upon children. As if that were not tricky enough, as childhood changes, books that were once for adults are read by children and vice versa” (Hunt 2001: xvi). In selecting texts for his Anthology, Hunt emphasises “books written for children, not to children, or about children” (Hunt 2001: xvi); for example, he does not include Kenneth Grahame’s The Golden Age and Dream Days because they were initially written for adults. They are included in ChiLit because they were also widely read by children. On the other hand, Hunt (2001) (and ChiLit) includes e.g. Black Beauty because it is so “overwhelmingly regarded as being children’s book” (Hunt 2001: xvi), even though it was initially aimed at an adult audience.

Decisions about what to include and what to leave out were not always straightforward, and selections are always subjective to a degree. Ultimately, however, the final selection depended on the availability of the texts on the Project Gutenberg website. We also had to exclude some texts for technical reasons. All the texts in ChiLit are annotated for quotes, non-quotes, and suspensions in the same way as the other CLiC corpora (see documentation). I would like to thank Viola Wiegand, Anthony Hennessey and Jamie Lentin for their indispensable help with the corpus compilation, annotation and other necessary treatment.

[Update, February 2018: For more information on the ChiLit corpus, also see a post about the ChiLit corpus on the GLARE blog (Čermáková, 2018).]

Please cite this blog post as follows: Čermáková, A. (2017, November 28). The GLARE 19th Century Children’s Literature Corpus in CLiC [Blog post]. Retrieved from https://blog.bham.ac.uk/clic-dickens/2017/11/28/the-glare-19th-century-childrens-literature-corpus-in-clic/