Welcome to Module 3 of our course "Natural Language Processing for Digital Humanities", more precisely to the first part of the third module. Today we will be dealing with corpus analyses that can be interesting for problems in the field of Digital Humanities. A short overview of the topics I would like to talk about today: first, we will try to get to know the corpus-linguistic perspective a little better. Then we will define what a corpus is in the proper sense and learn some basic notions and terminology regarding corpus search and analysis. Finally, we will get to know four very important analytical categories that are frequently used.

What is this corpus-linguistic perspective? What is so special or different about it? Our everyday lives are full of questions that are, of course, also interesting from a linguistic perspective. For example: how right-wing is a party? Here is the example of the AfD, a far-right party in Germany. You can see in this report from the "Tagesschau" news that attempts are made to determine where on the political spectrum the party can be located. Would it be possible to analyse the language use, or the written documents, of members of such a party in order to find out where on the spectrum between left-wing and right-wing it should be placed? Here you see a somewhat older image of a couple of mountaineers, and you can see that they look very different from modern mountaineers. What has actually changed in the way we talk about mountaineering and report on such events? What cultural significance does that have? It can be interesting here to have an appropriate corpus in order to see how the language used to talk about mountaineering has changed, and to draw conclusions about changes in culture.

Faced with such questions, you can sit down in a comfortable armchair, think about them and arrive at an answer purely from your own intuition. Or you can work with data, i.e. proceed empirically: you can run corpus queries on a computer to find something out about these issues. These two positions are ironically described as the "armchair linguist" and the "corpus linguist", a tension between two very different approaches. We are, of course, more on the corpus linguist's side, and I would like to show you in the following what makes this position so special.

First of all, we need to know what a corpus actually is. Broadly speaking, a corpus is a collection of written or spoken utterances; it doesn't have to be text but can also be spoken language. If we deal with spoken language, the material needs to be transcribed. Typically, corpus data is available in electronic, machine-readable form. A corpus consists of text data that has been processed accordingly, and also metadata, i.e. data that gives us further information about author, title, publication date, etc. In linguistic corpora we can also find so-called annotations, i.e. information that has been added to enrich the text data, for example information about the part of speech: for each token in the corpus we then know its part-of-speech tag and, additionally, its base form (lemma). Even further annotations are possible.
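To make this idea of annotation a little more concrete, here is a minimal sketch of how part-of-speech tags and base forms could be added to a German sentence using the spaCy library. This is only an illustration, not the pipeline behind the corpora discussed in this module, and it assumes that spaCy and its small German model de_core_news_sm are installed.

```python
# Minimal sketch: token-level annotation with POS tags and lemmas (base forms).
# Assumes: pip install spacy && python -m spacy download de_core_news_sm
import spacy

nlp = spacy.load("de_core_news_sm")
doc = nlp("Die Freiheit wird im Bundestag kontrovers diskutiert.")

for token in doc:
    # token.text   = the word form as it appears in the text
    # token.pos_   = coarse part-of-speech tag
    # token.lemma_ = the base form (lemma)
    print(f"{token.text}\t{token.pos_}\t{token.lemma_}")
```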
With a corpus-linguistic perspective we start from certain premises, which I would also like to mention briefly. It is very important to understand that corpus analysis is the analysis of language use. We are not interested in the system behind language, which can be formulated in a grammar or a lexicon; we are dealing with actual language use. The basis for such analyses are corpora that contain authentic language use. I don't want to work with data that has been generated explicitly for a certain analysis but with data that was produced for completely different purposes: newspaper articles, blogs, books, etc. It is important to note that a corpus is only a sample taken from authentic language use. We cannot analyse the entirety of language use; our corpus always represents a small part of it. And the choice of corpus determines the population about which I can make any statements.

Digital data, nicely prepared in a database: are corpora just modern card indexes from which one can pull out nice pieces of evidence for a certain phenomenon? No, according to Perkuhn and Belica (2006), for example: corpora are not merely card indexes in electronic form, collections of evidence, but tools offering completely new analytical methods, a perspective of its own in linguistic research, a corpus-linguistic perspective. What does that mean? If we read texts out of our own interest, we will notice certain things in them, for example certain wordings or the use of a word in a specific sense, and we can then try to interpret these observations and draw conclusions. In corpus linguistics, however, we don't want to work with single documents but with as much text data as possible. Using statistical methods we can find interesting, noticeable patterns, patterns we only notice because we have an overview of the large amount of text, an amount of text that I could not cope with if I had to read all the texts myself.

Let's take a look at some important basic notions in the context of corpus analysis. An example: let's do some analyses in the "Deutsches Referenzkorpus" (German Reference Corpus). The "Deutsches Referenzkorpus" is one of the largest publicly available corpora of German, hosted at the Institute for the German Language (IDS) in Mannheim. We access the corpus via the COSMAS II interface, which can be used directly in the web browser. I will quickly show you how to do corpus analysis using the German Reference Corpus. On the ids-mannheim.de website, the COSMAS II interface is provided as an online application through which the corpus can be accessed. You need to register, but the account is free of charge. As soon as you have logged in to COSMAS II, you first need to select an archive. I select the archive for written language, which contains the entire corpus collection of the German Reference Corpus. In addition, you can see a list of predefined corpora: the entire collection is subdivided into different corpora. You can also create your own virtual corpora as subcorpora and use them for your own research interests. I select "W-öffentlich" (W-public), which contains the entire public corpus collection. And here you can see the search window where you can start your corpus queries. In the simplest case you can just type in a single word, for example "Freiheit" (freedom). Of course, there are also more complex query options; here you see some examples. If you want to run more complex corpus queries, the documentation provides you with the syntax rules. In this case, I am just looking for the word "Freiheit" in this corpus.
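If you would like to try something similar on a plain-text corpus of your own, outside COSMAS II, a few lines of Python are enough. The following sketch prints every hit of a search word together with a little left and right context; the file name korpus.txt is purely hypothetical.

```python
# Minimal sketch: search a plain-text corpus for a word and show each hit in context.
# "korpus.txt" is a hypothetical file name; any plain-text file will do.
import re

def concordance(path, query, window=5):
    with open(path, encoding="utf-8") as f:
        tokens = re.findall(r"\w+", f.read())
    query = query.lower()
    for i, tok in enumerate(tokens):
        if tok.lower() == query:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            print(f"{left:>40}  [{tok}]  {right}")

concordance("korpus.txt", "Freiheit")
```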
In an intermediate step you also get an overview of the different word forms that were found for "Freiheit". We want to look at the results now; the query is being processed at the moment. Of course, there are quite a lot of hits for this lemma. First of all, I get an overview of all hits in the form of a source overview, which means that for each source we can see the number of hits we got. The so-called KWIC view (keyword in context) is widely used as an overview tool in corpus-linguistic analyses. In this view, you can see the word I typed in marked in red, with its immediate context to the left and to the right. Go ahead and try out some corpus-linguistic searches yourself! There are several tutorials for doing so; on bubenhofer.com/korpuslinguistik you can find one that shows you how to use the COSMAS II interface.

We have seen that it is possible to work with subcorpora in the German Reference Corpus, so-called virtual corpora, which allow you to define your own sets of text data to work with. When querying the corpus we can use a query syntax for anything from simple to very complex corpus queries; we might want to know, for example, which adjectives typically occur together with "freedom". We have also seen two different ways of displaying the results: the distribution over the individual sources, and the KWIC (keyword in context) view that conventional corpus query tools usually offer.

If we get huge, really long KWIC lists and have to scroll through them, we won't enjoy that. We want to summarise the results so that we can draw interesting conclusions without having to scroll through all the hits and read them one by one. That's why there are several important categories for a compact analysis of the results. I will show you four of them, but of course there are many more.

Here you can see a network with nodes and edges. The node in the middle is left open: there is apparently a word missing that is connected to all the other words in the network. Take a close look; maybe you have an idea which word should be placed in the middle of the network? So, did you guess correctly? It is remarkable: we see typical uses of the word "diskutieren" (to discuss). And by looking at the surrounding words in the nodes we can understand how the word "discuss" is usually used in language. Something can be "intensively discussed" or "seriously discussed", and discussions can be "heated" or "controversial". What we see here are so-called collocations: co-occurrences of words that are statistically significant. The term "collocation" was coined by John Rupert Firth in the context of British contextualism, with the famous dictum: "You shall know a word by the company it keeps." If we know which words another word typically co-occurs with, then we also know something about the semantics of this word.

The calculation of such collocations is not very complicated. The question is basically: which words can be found in the neighbourhood of my search term? If we take all instances of "discuss", which words are typically found in its immediate context? It works like this: we list all words that co-occur within a defined context window around the search term; usually we take 5 words to the left and 5 words to the right as the context window. Then we count the frequencies of all these co-occurring lexemes: the frequencies with which they co-occur with our search term, but also their overall frequencies in the entire corpus.
Next, we apply a statistical significance test to check whether the two words co-occur more frequently than we would expect given an even distribution over the corpus. Let's assume that we are interested in the use of the word "Passagier" (passenger) and we want to know which words typically co-occur with it. In the results, I will certainly get words like "der" (the), as in "der Passagier" (the passenger). Maybe we will also find "reist" (travels), as in "der Passagier reist" (the passenger travels). Or we may find "blinder Passagier" (stowaway). What decides which of these collocations are statistically significant is the individual frequency of the words involved. The German article "der" (the definite masculine article) most likely occurs very many times throughout the corpus, so it is not surprising if "der" also co-occurs with "Passagier". For the verb "reist" (travels) we can assume that it occurs less frequently than the article; among other contexts, it co-occurs with "Passagier", but also with other words. For the adjective "blind" we can assume that it occurs rarely in the entire corpus, and when it does occur, it most likely co-occurs with "Passagier". It could co-occur with other words, but we see that it doesn't (in this corpus). That's why this collocation is statistically significant.

I want to show you what this looks like in practice when calculating collocations in the German Reference Corpus. The COSMAS II interface allows you to calculate collocations, and there is also a database where you can access predefined collocation profiles. On the current slide you can see the co-occurrence database of the Institute for the German Language, which also relies on the data of the German Reference Corpus; it is a collection of predefined collocation profiles. There are different traditions of calling this phenomenon "collocation" or "co-occurrence", but we won't go into detail here; for now, we treat co-occurrences as collocations. If we want to see the collocation profile of a certain word, we can simply click on the item, for example "freedom". We get a list of all significant collocations (the co-occurring words) for the term "freedom", and we see that the most significant collocation is "democracy". Then we see "equality", "fairness", "peace", "brotherhood" and so on; the list is ordered by the log-likelihood score (the significance measure) in descending order. If we scroll down this list, we will eventually reach those collocations that are less significant for the co-occurrence with "freedom". What is special about this calculation is that not only the primary collocation (democracy) is computed but all cases, resulting in secondary and tertiary collocations of "freedom": the words "democracy" and "freedom" also co-occur with the word "Volkspartei" (people's party), for example. We can also see typical syntagmatic patterns of how the lexemes co-occur. The words "freedom" and "democracy" appear together 2895 times in the corpus, and in 64% of the cases we see a syntagmatic pattern such as "for freedom and democracy".

We have seen what such collocation profiles look like, but what insights can we gain from them? Collocation analyses show the typical use of a word in language and thus summarise thousands of lines of results in the KWIC view. They quickly tell us which words typically co-occur with our search term within a certain context window, and from that we can learn something about the semantics of this word.
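To illustrate the procedure described above, here is a small sketch of a window-based collocation analysis with a simple log-likelihood score. It is a deliberately simplified version of what tools like the Mannheim co-occurrence database compute, and it assumes that the corpus is already available as a plain list of tokens.

```python
# Minimal sketch: window-based collocation candidates scored with a simple
# log-likelihood ratio (a simplification of real co-occurrence analysis).
import math
from collections import Counter

def log_likelihood(o11, f_node, f_coll, n):
    """G2 score from a 2x2 contingency table for node/collocate co-occurrence."""
    o12 = max(f_coll - o11, 0)               # collocate without node
    o21 = max(f_node - o11, 0)               # node without collocate
    o22 = max(n - f_node - f_coll + o11, 0)  # neither
    total = o11 + o12 + o21 + o22
    g2 = 0.0
    for obs, row, col in [(o11, o11 + o12, o11 + o21),
                          (o12, o11 + o12, o12 + o22),
                          (o21, o21 + o22, o11 + o21),
                          (o22, o21 + o22, o12 + o22)]:
        expected = row * col / total
        if obs > 0 and expected > 0:
            g2 += obs * math.log(obs / expected)
    return 2 * g2

def collocations(tokens, node, window=5):
    freqs = Counter(tokens)   # overall frequencies in the entire corpus
    cooc = Counter()          # frequencies within the +/- window around the node
    for i, tok in enumerate(tokens):
        if tok == node:
            cooc.update(tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window])
    n = len(tokens)
    scored = {w: log_likelihood(c, freqs[node], freqs[w], n)
              for w, c in cooc.items() if w != node}
    return sorted(scored.items(), key=lambda x: x[1], reverse=True)

# Usage (with a toy token list):
# tokens = open("korpus.txt", encoding="utf-8").read().split()
# print(collocations(tokens, "Passagier")[:10])
```

Sorting the scored candidates in descending order mirrors the ranked collocation profile we just saw, with the most significant co-occurrence partners at the top.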
We can also look at different patterns of use to see whether different types of use of a certain lexeme differ from each other. You can do synchronic analyses and compare the corpora of different parties with respect to how certain terms are used in their speeches or writing, or you can take a diachronic approach and see whether and how the use of certain words has changed over time. We can also compare the collocation profiles of different words with each other. We can formulate the hypothesis: if the collocation profiles of two words are very similar, these words are used in similar contexts and might be synonyms. Or maybe we are interested in slight differences between collocation profiles in order to find out the semantic differences in the use of certain lexemes, for example: how similar are the terms "migrant", "foreigner" and "refugee"? This idea that we learn something about a word's meaning by looking at its use goes back to Wittgenstein, who said that the meaning of a word is best understood as its use in language. This dictum is the foundation of a usage-based approach to semantics as a way of accessing the meaning of words: the meaning of a word can thus be equated with the typical patterns of use that we observe in a corpus.

A related category of analysis are "multiword units" or "n-grams". Here, too, many terminological distinctions can be made; for the sake of simplicity, we will lump them together. An "n-gram" is basically a sequence of linguistic units, for example character n-grams like e-n-t or h-e-i-t. But n-grams can also be composed of entire words, like "heute Abend" (this evening) or "freies Land" (free country). Of course, such combinations can be of any length n, including tetragrams like "auf der einen Seite" (on the one side) or even longer sequences like "sehr geehrte Damen und Herren" (dear ladies and gentlemen). According to their length, these n-grams are called unigrams, bigrams, trigrams, and so on. In the context of corpus analyses we usually use n-grams in the form of entire words, i.e. multiword units, sequences of n words. This is basically an extension of the binary collocation concept to patterns containing more than two lexemes.

Here you see a small example of how to interpret such n-grams. In this example, we wanted to know which wordings are typical of the different parties in the German Bundestag. We see the n-grams typical of the Green Party compared to the other parties, i.e. the n-grams that occur more frequently in the speeches of the Green Party than in those of the other German parties. You see constructions such as "nichts anderes als eine" (nothing but a), "schon bezeichnend, dass" (it is very revealing that), "es ist schon interessant" (it is quite interesting) and so on. We see immediately that these forms have rhetorical potential and take on certain rhetorical functions: to accuse someone of a cover-up or to reveal certain connections. These are speech acts that are presumably quite typical of opposition parties.

The third category of corpus-linguistic analysis that I want to mention here deals with the distribution of a certain phenomenon over the different parts of a corpus, i.e. over groups of texts that can be defined on the basis of the available metadata. Let's assume that we want to know whether certain lexemes are distributed differently across the political parties, i.e. whether a certain word is used more or less frequently by a certain party or its members.
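Before we look at distributions in more detail, here is a minimal sketch of how word n-grams of the kind just discussed can be extracted and counted; the example sentence is invented.

```python
# Minimal sketch: extracting and counting word n-grams from a token list.
from collections import Counter

def ngrams(tokens, n):
    # all contiguous sequences of n tokens
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "sehr geehrte Damen und Herren , heute Abend diskutieren wir kontrovers".split()

bigrams = Counter(ngrams(tokens, 2))
tetragrams = Counter(ngrams(tokens, 4))

print(bigrams.most_common(3))     # the three most frequent bigrams
print(tetragrams.most_common(3))  # the three most frequent tetragrams
```

Run over a whole sub-corpus (for example all speeches of one party) and compared with the counts from another sub-corpus, such lists lead directly to the kind of "typical n-grams" shown on the slide.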
Or we can take a diachronic perspective and ask whether the use of a certain word increases or decreases over time. For distribution analyses we first need to compute absolute frequencies for each sub-corpus; in this way we can see how often a certain expression occurs in each sub-corpus. Of course, it is difficult or even impossible to compare such absolute values with each other, since the individual sub-corpora can be of different sizes. That's why we often work with relative frequencies and state the frequency per 1,000 (or per 1 million) words rather than as a percentage. We basically pretend that the sub-corpus contains one million words and convert the absolute frequency into a number per one million words. By this means, we are able to compare the frequencies with each other.

Here you see an example from a corpus of the German Bundestag, 17th legislative period, separated by party. Again we entered the search term "freedom", and we can now see how often the term is mentioned by the individual parties. We get 199 hits for the Green Party and 306 hits for the SPD. It is important to know that these numbers are absolute values. In the last row we computed the relative frequency, stating how often the word occurs per one million words. We can see that the Green Party uses the word 80 times per million words, the SPD 62 times. The corpus of the Green Party is smaller than that of the SPD; that's why the 199 absolute hits turn into a higher number than the SPD's when measured per one million words.

Let's move on to category 4, the last of these important categories for corpus analysis. We are talking about "keywords" that can be extracted from a corpus; we can also say that we want to compute the "keyness" of certain words. The underlying idea is the following: which words, word forms or base forms are typical of a certain corpus in comparison with another corpus? A practical example is the typical vocabulary of the Green Party in comparison with the other parties. The operationalisation of this idea is again relatively simple. We compute the frequencies of all words in the study corpus and in the reference corpus and thus get two values for each word: first, its frequency in the study corpus, i.e. the corpus of the Green Party; second, its frequency in the reference corpus, i.e. the corpus of all other parties. Now a significance test can be applied to measure how significant the difference between these two frequencies is, i.e. whether, assuming an even distribution, a word occurs unexpectedly often in the corpus of one party, for example.

Here is another practical example: a selection of the most typical nouns of three parties in the German Bundestag. We can see that the Green Party talks about its well-known topics such as "nuclear power plant", "nuclear energy", "climate protection" and "lifetime extension", while the CDU/CSU rather talks about "atomic energy" and, in the context of the economy, "competitiveness". The Green Party uses the term "war", which is rarely used by the governing party, and we also see typical nouns like "bank", "employees", "pension", "salary", "scandal", etc. From this, various conclusions could be drawn; we won't do that at this point.
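The keyness computation itself can again be sketched in a few lines: for every word we compare its frequency in the study corpus with its frequency in the reference corpus, using the same log-likelihood idea as for the collocations, and we can normalise counts to frequencies per million words. The token lists are assumed to be given, and the corpus size in the final comment is only a back-of-the-envelope figure derived from the 199 hits and roughly 80 per million words mentioned above, not the real size of the Bundestag corpus.

```python
# Minimal sketch: keyword (keyness) analysis of a study corpus against a reference corpus.
import math
from collections import Counter

def keyness(a, b, n1, n2):
    """Log-likelihood score for a word occurring a times in the study corpus
    (n1 tokens) and b times in the reference corpus (n2 tokens)."""
    e1 = n1 * (a + b) / (n1 + n2)   # expected frequency in the study corpus
    e2 = n2 * (a + b) / (n1 + n2)   # expected frequency in the reference corpus
    score = 0.0
    if a > 0:
        score += a * math.log(a / e1)
    if b > 0:
        score += b * math.log(b / e2)
    return 2 * score

def keywords(study_tokens, reference_tokens, top=20):
    study, ref = Counter(study_tokens), Counter(reference_tokens)
    n1, n2 = len(study_tokens), len(reference_tokens)
    scored = {w: keyness(study[w], ref[w], n1, n2)
              for w in study
              if study[w] / n1 > ref[w] / n2}   # keep words over-represented in the study corpus
    return sorted(scored.items(), key=lambda x: x[1], reverse=True)[:top]

def per_million(count, corpus_size):
    """Relative frequency: occurrences per one million words."""
    return count / corpus_size * 1_000_000

# e.g. 199 hits in a corpus of roughly 2.5 million words:
# per_million(199, 2_500_000)  ->  about 80 occurrences per million words
```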
We have now reached the end of this module, and I would like to summarise the most important points. On the one hand, we have learned about the typical corpus-linguistic perspective: corpus analysis as the analysis of language use, and quantitative analysis of large amounts of text, rather than of individual documents, to reveal interesting tendencies and patterns. And we have learned about the most important categories of analysis: collocations as typical binary combinations of words; multiword units and n-grams; distributions in a corpus, to analyse how a certain phenomenon is spread over different sub-corpora; and finally keywords, which can be used to extract the expressions that are particularly typical of a certain sub-corpus. Thank you very much for your attention!