Hi. You've already heard about suffix arrays from pile in the first two modules, and you know that they can be very useful for the better matching of very long strings like billion nucleotide genomes. In this module, we will go through a real algorithmic challenge of building and storing suffix arrays efficiently. What is a suffix array? Basically, the problem of construction of a suffix array is a problem of sorting all of the suffixes of a string S in lexicographic order. What is the lexicographic order? First, we have some alphabet. Typically it's English alphabet, but for the genomes for example, is alphabet of just four different nucleotides: a, c, g, and t. The alphabet is the set of all characters which can occur in the strings that we consider, and we assume that the alphabet is somehow sorted. For example, the English alphabet is sorted from a to z, then we can define the lexicographic order on strings. The formal definition is on the slide, but the basic meaning is that a string S is lexicographically smaller than string T either if S is a prefix of T or if they have some common prefix potentially empty such that the next character in S after this common prefix is smaller than the corresponding character in T. This is the definition. Examples of lexicographical order include ab is less than bc in terms of lexicographic order because their common prefix is empty. The next character in the first string is a, and the next character in the second string is b, and a is smaller than b in the lexicographic order of the alphabet. Another example is abc is less than abd because there is a common prefix of ab and the next character in the first string is c which is smaller than the next character in the second string which is d. The last example is abc is less than abcd because abc is a prefix of abcd, and in this case our first string is less than the second string. Now let's see an example of a suffix array. We have a string S which is equal to ababaa, and below we'll list all the suffixes, and they're listed from the first one starting in position zero to the last one starting in the last position of the string. They're sorted by decreasing line. What happens if we sort them in lexicographic order? Then we get first the suffix a; it is the smallest one, then aa. Aa is bigger because a is a prefix of aa. The next one is abaa. It is bigger than aa because they have common prefix a, and then the next character in aa is character a and the second character in aba is b which is bigger. Then goes ababaa. They have a common prefix of three characters with the previous suffix. Those are aba, and then the next character is a in the first case and b in the second case. Then goes baa. It starts with letter b which is bigger than the first character of the previous suffix, and the last one is babaa. It has common prefix ba with the previous one and then goes b and in the previous suffix, and then goes a which is smaller. This is indeed a correct sorted order of all the suffixes in lexicographic order. This is what we want to build. While we will build suffix array, we want to get rid of the inconvenient part of a rule that if S is a prefix of T then S is smaller than T. To avoid this special case, we will append a special character dollar to all the strings we will consider. This special character dollar is a special character that is considered smaller than all the other characters, all the characters in the alphabet. In this case if S initially it was a prefix of T, then string S$ will differ from string T$ in the position equal to the length of the initial string S, and in this position a new string S will contain dollar and the new string T will contain some character from the alphabet. S$ is less than all the characters in the alphabet, the new string S will be smaller by definition than the new string T. However, we won't need to consider the special rule of our prefix anymore because of strings that were initially prefixes of others are now just regularly less than no strings in lexicographic order without applying rule of our prefix. Example of adding dollar to the end of the string. We have initially string ababaa. We add a dollar to the end of the string, and then we sort the suffixes of the new string taking into account the fact that dollar is less than every other character, and we get this order on the slide. What happens if we remove all the dollars back? We get exactly the same order of all the suffixes of the initial string. This is because the order of the strings doesn't change after adding a dollar which is smaller than the other characters to the end of each string. Basically by adding dollar to the string initially, we can sort the suffixes of the new string and forget about the prefix rule in the definition of lexicographic order. Now how to store suffix arrays is an interesting question because the total length of all the suffixes of a string S is actually quadratic in terms of the length of the string S. That is too much especially for strings like genomes which consists of billions of characters and storing all of the suffixes themselves with a billion squared memory is totally infeasible. Instead of storing the suffixes themselves, we will store just the order of the suffixes. The order can be given by the sequence of integer numbers which are the positions of these suffixes where they start in the initial string. That will be just proportional to the length of the string, not the square of the length of the string. This is an efficient way to store the suffix array given that we have built it. Suffix array later in this lecture is just the order of the suffixes, not the array of the suffixes itself. For example, if S is the string ababaa$; we already added dollar in the end, then suffixes are numbered by their starting positions. For example, the suffix which is the whole string is numbered with zero because it starts in position 0, and suffix abaa$ is numbered with 2 because it starts in position 2. Now let's build the suffix array as the order. The smallest suffix is dollar and it is at position 6, so the first integer in the order is six. Then goes a$ which is at position 5, then aa$ at position 4, and then the next one goes abaa$ at position 2. The next one is at position 0, the whole string, and then the baa$ at position 3, and the last one is babaa$ at position 1. The suffix array is 6, 5, 4, 2, 0, 3, 1 and such kind of array is what we want to build as the suffix array of our string. Now you know how to store a suffix array efficiently, but how to actually build it in the first place. If you want to now, see you in the next videos.