Hello and welcome to Module 6a: "Challenges in Multilingual Text Analysis". This module is divided into two parts: we will first focus on language identification, and in the second part on automatic sentence and word alignment.

Let's start with the topic of automatic language identification and look at some examples. Here you see a German text taken from the yearbooks of the Swiss Alpine Club. In the middle of this text there is an insertion: the title of an article in English. We want to identify the English part automatically so that we can treat it separately and search for it systematically. In the Swiss Alpine Club corpus, several language combinations can be found.

Second example: a German text ending with an expression in Swiss German dialect. Such parts should also be marked so that they can be identified and treated separately. And a third example: a French text with direct speech by an alpine guide, containing an exclamation in German that makes the expression more authentic.

First of all, we can note that automatic language identification works well for languages with a standardised written form, for example German, English, French, and Italian. It is very difficult for languages that lack standardised grammar and spelling rules, such as Rhaeto-Romanic or text in Swiss German dialect.

We can define automatic language identification as follows: it assigns a language label, a language classification, to a sentence, and thus answers the question: "To which language does this sentence belong?"

The motivation for automatic language identification is the following: the tools for the analysis and annotation of a sentence, for example part-of-speech tagging, are language-dependent. That is why we need to know the language of each sentence. We distinguish between two methods: language identification using character sequences, and language identification using the comparison of dictionaries.
Automatic language identification leads to a chicken-and-egg problem: tokenisation, i.e. the splitting of a sentence into single units, and the splitting of a text into single sentences, depend on the language and on abbreviation lists for that language. Conversely, language identification requires that sentence boundary detection has already been done. In practice, we therefore first run an overall tokenisation and sentence boundary detection, then apply language identification, and finally correct and adapt the tokenisation.

Automatic language identification typically uses statistics over character sequences, usually character trigrams, i.e. sequences of three letters. The advantage of this approach: it is a very robust method that can also be applied when there are spelling errors. It works quickly, and we do not need any dictionaries. The disadvantage: this approach only works reliably for sentences with 40 or more characters. For shorter sentences within a text, we can pragmatically assign the language of the entire text, or assume that the short sentence is written in the same language as the previous sentence. Alternatively, we can use dictionary-based language identification.

Language identification using statistics of character sequences can be done with ready-made programs. If you want to try it out on your corpus, you can use the programs "lingua-ident" or "langid.py" (a Python program), which can identify several languages (90 languages in the case of langid.py) and assign the correct label to each sentence. It is also good to know that these approaches to automatic language identification are used in Microsoft Word and other systems as well. And if we look at the most frequent trigrams in a text, we see that this works fairly well. In this case, we computed the most frequent trigrams of the book "Tom Sawyer".
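The trigram idea described above can be sketched in a few lines of Python. This is a minimal illustration only, not the actual mechanism used by lingua-ident or langid.py (which rely on trained probabilistic models); the toy overlap score and the tiny training texts are assumptions made for the example.

```python
from collections import Counter

def char_trigrams(text):
    """Count character trigrams; padding spaces mark word boundaries,
    which is why some displayed trigrams show only two letters."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def guess_language(sentence, profiles, top_n=100):
    """Pick the profile sharing the most frequent trigrams with the
    sentence. A toy scoring scheme for illustration; real tools use
    rank-order statistics or probabilistic models."""
    sent_grams = {g for g, _ in char_trigrams(sentence).most_common(top_n)}
    best, best_overlap = None, -1
    for lang, profile in profiles.items():
        prof_grams = {g for g, _ in profile.most_common(top_n)}
        overlap = len(sent_grams & prof_grams)
        if overlap > best_overlap:
            best, best_overlap = lang, overlap
    return best

# Tiny demonstration profiles; real profiles would be trained on
# whole books, such as the two versions of "Tom Sawyer".
profiles = {
    "en": char_trigrams("the boy and the fence were there and then the man said"),
    "de": char_trigrams("der Junge und der Zaun waren dort und dann sagte der Mann"),
}
print(guess_language("the fence and the man", profiles))  # en
print(guess_language("der Mann und der Zaun", profiles))  # de
```

Even with these tiny profiles, the overlap of frequent trigrams separates the two languages, which is why the method is robust against spelling errors but needs sentences of a certain minimum length.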
Let's take a closer look: there are two different languages on the left and on the right of the current slide. You will quickly see that the trigrams on the left come from the English version of "Tom Sawyer", while the trigrams on the right correspond to its German translation. Note that where only two letters are shown, a blank space was counted before or after the sequence.

Language identification by comparison of dictionaries is an alternative that can be used for closely related languages that differ only slightly in their typical character sequences, such as Swedish and Norwegian. For languages without spelling rules and conventions, or for short sentences, these regular letter sequences cannot be trained very well.

An example from our corpus: for sentences that were identified as German by statistical language identification, we can verify with a special word list whether a specific sentence is actually Swiss German dialect. For this purpose we collected words like "isch", "chli", "nöd" and "guet", which are typical of texts in Swiss German dialect and never occur in Standard German text. If such words appear, we can conclude that the sentence is written in Swiss German dialect.

A quick overview: language identification works in three steps. First, we check whether a sentence contains more than 40 characters. If yes, we automatically distinguish between sentences in English, German, French, and Italian. Additionally, for all German sentences, we check whether they are Swiss German dialect or Standard German.

And some more information about the topic of code-switching. Here you see an extract from a sentence in the yearbook of 1925 (Text+Berg corpus): "Und ich finde es very nice and delightful, einen Vortrag halten zu dürfen." (roughly: "And I find it very nice and delightful to be allowed to give a talk.") We have a German sentence that includes an insertion of an English word sequence, a quotation marked with quotation marks.
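The dialect check described above can be sketched as a simple word-list lookup. This is a minimal sketch: the function name is invented for illustration, the marker list contains only the handful of examples from the lecture, and the label "gsw" (the ISO code for Swiss German) is an assumption about how such sentences might be tagged.

```python
# Marker words typical of Swiss German dialect that never occur in
# Standard German text (only the examples named in the lecture).
SWISS_GERMAN_MARKERS = {"isch", "chli", "nöd", "guet"}

def refine_german_label(sentence):
    """Re-label a sentence already identified as German: if it contains
    a dialect marker word, call it Swiss German ("gsw"), else keep "de"."""
    tokens = {t.strip('.,!?";:').lower() for t in sentence.split()}
    return "gsw" if tokens & SWISS_GERMAN_MARKERS else "de"

print(refine_german_label("Das isch e chli komisch gsi."))        # gsw
print(refine_german_label("Das ist ein wenig komisch gewesen."))  # de
```

The check runs only on sentences the trigram method has already labelled as German, which matches the three-step pipeline described above.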
You can also see the part-of-speech tags that were assigned: in this case "proper noun", and "foreign word" for "and". These tags are wrong and match neither German nor English. Of course, it also wasn't possible to assign a lemma. That is why it is important to identify this part of the sentence as English, and it is exactly this code-switching part that we want to identify in the sentence.

After that, this part of the sentence should look like this: we assign the POS tag "foreign word" and a dummy lemma, "@fn@" in this case. In addition, the entire block is marked as "foreign-language material", with information about the language involved, in this case English.

How can we do that? We developed a small algorithm for this purpose. If a sentence contains a pair of quotation marks; if, between these quotation marks, tokens with unknown lemmas can be found; if at least three tokens lie outside the quotation marks, i.e. the sentence is not entirely contained within the quotation marks; and if the sequence of tokens between the quotation marks is longer than 15 characters and receives a different label after language identification, then we can say that it is code-switching. We evaluated this approach and can report fairly good results for analysing code-switching in a corpus in this way.

Let's summarise: in the first part of module 6a, we talked about automatic language identification. We discussed that language identification is absolutely required for further analyses and annotations of a text. Language identification works robustly and reliably using a comparison of character trigrams. In special cases, word lists are applied in order to guarantee correct language identification.
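The conditions of the small algorithm above can be sketched as one function. This is a simplified illustration, not the course's actual implementation: the language identifier and the unknown-lemma test are passed in as caller-supplied stand-ins, and the function name and return format are invented for the example.

```python
def detect_code_switch(tokens, sentence_lang, identify_language, lemma_known):
    """Heuristic from the lecture, sketched:
    - the sentence contains a pair of quotation marks,
    - the quoted span contains tokens with unknown lemmas,
    - at least three tokens lie outside the quotation marks,
    - the quoted span is longer than 15 characters, and
    - it is identified as a different language than the sentence.
    Returns (start, end, language) of the quoted span, or None."""
    QUOTES = {'"', '\u201c', '\u201d', '\u00ab', '\u00bb'}
    quote_positions = [i for i, t in enumerate(tokens) if t in QUOTES]
    if len(quote_positions) < 2:
        return None
    start, end = quote_positions[0], quote_positions[1]
    quoted = tokens[start + 1:end]
    outside = tokens[:start] + tokens[end + 1:]
    span_text = " ".join(quoted)
    if (quoted
            and any(not lemma_known(t) for t in quoted)
            and len(outside) >= 3
            and len(span_text) > 15
            and identify_language(span_text) != sentence_lang):
        return (start + 1, end, identify_language(span_text))
    return None

# The lecture's example sentence, tokenised, with quotation marks added
# around the English insertion for illustration:
tokens = ['Und', 'ich', 'finde', 'es', '"', 'very', 'nice', 'and',
          'delightful', '"', ',', 'einen', 'Vortrag', 'halten',
          'zu', 'dürfen', '.']
result = detect_code_switch(
    tokens, 'de',
    identify_language=lambda span: 'en',  # stand-in identifier
    lemma_known=lambda t: t not in {'very', 'nice', 'delightful'})
print(result)  # (5, 9, 'en')
```

The returned span would then receive the "foreign word" POS tag, the dummy lemma "@fn@", and the "foreign-language material" mark with the identified language.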
Code-switching is the spontaneous switching between languages within a sentence; this is the definition we use for the purposes of this online course, and it reflects the way we proceeded in our corpus. Thank you for your attention, and I'm looking forward to seeing you in the second part of module 6. Thank you very much!