This post illustrates different ways to add information to your data and builds on the tools discussed in my previous post. Corpus annotation makes it possible to retrieve specific data systematically. It might be a bit overwhelming at first, but just take some time to read the About pages of the programme websites.
I prefer to use Notepad++ to annotate my datasets; it is the programme shown in the examples below. It is also possible to prepare and annotate your datasets in other text processing software (Word or the Notepad application already available on your device). However, I favour Notepad++ because you can open different datasets in tabs. If you want to change a tag in all files, you open all your data in tabs and change the tag in one go with a single find-and-replace, instead of changing it in every file manually.
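If you would rather script the "change a tag in all files in one go" step than do it in an editor, a minimal Python sketch could look like the one below. It is only an illustration: the function names, the folder layout and the .txt suffix are my own assumptions, and you should work on a copy of your data, because the files are overwritten.

```python
import re
from pathlib import Path


def rename_tag(text, old, new):
    """Rename opening and closing tags <old> / </old> to <new> / </new>."""
    # The lookahead makes sure we match the whole tag name, so renaming
    # "corr" does not touch a tag such as <correction>.
    pattern = re.compile(r"<(/?)" + re.escape(old) + r"(?=[\s>/])")
    return pattern.sub(lambda m: "<" + m.group(1) + new, text)


def rename_tag_in_folder(folder, old, new, suffix=".txt"):
    """Apply rename_tag to every file in folder; return how many files changed."""
    changed = 0
    for path in sorted(Path(folder).glob("*" + suffix)):
        original = path.read_text(encoding="utf-8")
        updated = rename_tag(original, old, new)
        if updated != original:
            path.write_text(updated, encoding="utf-8")
            changed += 1
    return changed


if __name__ == "__main__":
    # Harmless demonstration on a string; no files are touched here.
    print(rename_tag("<strike>fi</strike> <corr>if</corr> he had", "strike", "del"))
```

Calling rename_tag_in_folder("my_letters", "strike", "del") would then rewrite every .txt file in the (hypothetical) folder my_letters.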
Annotating means adding (linguistic) information to your data. The most basic method is to add metadata (data about your data) manually. The metadata describes, for example, whether the letter is an in-letter or an out-letter, to whom it was sent, and when it was sent. In concordancing programmes it is possible to search for letters from, for instance, a specific year, provided this is defined in the metadata. Metadata can be defined with XML. The tags are customisable; you choose the tag names you deem necessary for the classes you want to distinguish. In XML, elements are nested inside other elements; always open an element with <word> and close it with </word>. You can add information on a lot of other things as well, such as the publication source.
Another interesting type of metadata is editorial metadata. These are, for example, additions, omissions, corrections and strikethroughs in the text itself. Annotating this information makes it easier to retrieve corrections in a text, as you only have to search for the assigned tag. For instance: <strike>fi</strike> <corr>if</corr> he had. In a concordancing programme you can then systematically search for all the items containing <corr>.
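To make the idea of nested metadata and editorial tags a bit more concrete, here is a small Python sketch using the standard library XML parser. The element names (letter, meta, type, recipient, year, text) and the sample values are invented for illustration; only the <strike> and <corr> tags come from the example above. Your own tag names can be whatever suits your data.

```python
import xml.etree.ElementTree as ET

# A made-up letter: metadata in <meta>, editorial tags inside <text>.
SAMPLE = """<letter>
  <meta>
    <type>out-letter</type>
    <recipient>A. Example</recipient>
    <year>1790</year>
  </meta>
  <text><strike>fi</strike> <corr>if</corr> he had</text>
</letter>"""


def corrections(xml_string, year=None):
    """Return the text of every <corr> element, optionally only for one year."""
    root = ET.fromstring(xml_string)
    # Use the metadata as a filter: skip letters from other years.
    if year is not None and root.findtext("meta/year") != str(year):
        return []
    return [element.text for element in root.iter("corr")]


if __name__ == "__main__":
    print(corrections(SAMPLE, year=1790))
```

This mirrors what a concordancing programme does when you restrict a search to a year and then look for everything tagged <corr>.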
The automatic approach to corpus annotation includes part-of-speech (POS) annotation and semantic annotation. These tags are inline with the rest of your data: each tag is attached to its word, as in was_VBDZ. The most widespread POS tagger is CLAWS (Constituent Likelihood Automatic Word-tagging System), which comes with the C5 tagset (simple tags) and the C7 tagset (more fine-grained tags than C5, as well as punctuation mark tags). A free online version of the tagger is available, as are the keys to C5 and C7. Another free and accurate POS tagger is TagAnt. In both programmes you either paste the text or open the file, and the programme automatically annotates POS in your data. If your data has a lot of spelling variation, you might want to use VARD or MorphAdorner first to normalise the spelling (both programmes also work on Early Modern variation) in order to make the POS tagger more accurate.
When your data contains POS tagging, it is possible to search for specific grammatical features (look at the keys first!). For example, the passive construction consists of a form of be (in C7 these tags all begin with VB, for instance VB0, VBDR, VBDZ, VBG, VBI and VBN, so you search with a wildcard: *_VB*) followed by the past participle of a lexical verb (search for *_VVN). You can do this with other constructions as well.
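The wildcard search for passives can also be sketched outside a concordancer. The Python snippet below assumes that your tagged text uses the word_TAG format with C7 tags and that the form of be is directly followed by the past participle; the sample sentence and the function name are my own. It is a rough approximation, not a full passive detector: anything with a word in between (was not sent, was quickly sent) is missed.

```python
import re

# A token tagged VB... (a form of "be") directly followed by a token tagged VVN.
# [^\s_]+ is the word itself; the tag follows the underscore.
PASSIVE = re.compile(r"(?<!\S)([^\s_]+)_VB\w*\s+([^\s_]+)_VVN(?!\S)")


def find_passives(tagged_text):
    """Return (be-form, participle) pairs found in word_TAG formatted text."""
    return [(m.group(1), m.group(2)) for m in PASSIVE.finditer(tagged_text)]


# Invented sample sentence, tagged by hand with C7 tags.
SAMPLE = "The_AT letter_NN1 was_VBDZ sent_VVN by_II him_PPHO1"

if __name__ == "__main__":
    print(find_passives(SAMPLE))
```

To catch passives with intervening adverbs or negation, you would have to allow optional tokens between the two parts of the pattern, which is easy to add once you know which tags you want to permit there.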
Lastly, you can tag your dataset for semantic domains. This works more or less the same way as POS tagging. Semantic domains can reveal information about ideologies or might help you analyse politeness within the Brown and Levinson framework. The tagset can be consulted online for free, and so can the key to the tags.
I deliberately tried to keep this post as short as possible. If anything is unclear or if you have any other questions or comments please ask them below.
Garside, Roger, Geoffrey Leech & Anthony McEnery (eds.). 1997. Corpus Annotation: Linguistic Information from Computer Text Corpora. London: Longman.
Rayson, Paul. 2008. From key words to key semantic domains. International Journal of Corpus Linguistics 13(4). pp. 519–549.