Hello. I haven't seen you for a long time, since we were working on the change problem. But today we'll work on a completely different topic called String Algorithms. String Algorithms are everywhere. Every time you spell check your documents, or Google something, you execute sophisticated String Algorithms. But today, we walk about very different application of String Algorithms. Sam Berns gave a fantastic pep talk when he was 16. He was talking about his life and a year later, he died. Sam was suffering from a rare genetic disease called progeria. Children with this progeria often will be above average intelligence, look old already at the age ten and usually die in their teen years. But for many years biologists have no clue of what causes progeria. But in 2003, they figure out that progeria is caused by single mutation, on chromosome one. To understand what pattern matching has to do with progeria, we need to learn something about genome sequencing. When my children were young, that's how I was explaining them how genome sequencing work. I was using an example of the newspaper problem. Take many copies of the identical New York Times newspaper, then set them on a pile of dynamite. Don't do it at home, and then wait until explosion is over, and collect the remaining pieces. Of course many pieces will burn in the explosion, but some pieces will remain. And so your goal is to reconstruct the content of the New York Times. A natural way to solve the newspaper problem, is to consider it as an overlapping puzzle. Look at different pieces and try to mix them together like this. And then slowly but surely, hopefully you'll be able to assemble the whole gem. And that's roughly how the human genome was assembled in 2000. And here, Bill Clinton is congratulating Craig Venter, one of the leaders of the Human Genome Project, on completion of this $3 billion mega science project. We don't need to know much about genomes for the rest of these talks. The only thing we probably need to know is a genome is simply a long strand in A, C, G, T alphabet. I will try to explain how the newspaper problem translates into genome sequencing. They start from millions of identical copies of a genome. Then they break the genome at random positions using molecular scissors. These molecular scissors don't look quite like this one shown in this picture. Then we generate short Substrings of the genome called reads, using modern sequencing machines. Of course, during this generation some reads are lost. And the only thing left is to the assemble the genome from millions, or maybe even billions of tiny pieces. The largest jigsaw puzzle humans ever attempted to solve. Today, I won't be able to tell you about algorithms for genome assembly. But if you are interested in learning about this algorithm, you can attend or Bionformatics specialization at Coursera. Or read the book, Bioinformatics Algorithms. Assembling human genome was a challenging $3 billion project and afterwards the era of genome sequencing began. But in the first ten years after sequencing human genome, biologists were able to sequence only about ten other mammalian genome because it was still difficult. However, five, six years ago, so-called next-generation sequencing revolution happened. And today, biologists sequence thousands of genomes every year. Why do biologists sequence thousands of species? There are many applications. For example, the next big science sequencing project after the human genome was mouse genome. Because we can learn lot about Human biology and diseases from mouse gene. An important application in agriculture, for example by sequencing rice genome, biologists are able to develop new high yield crops of different plants like rice. And there are many hundreds and hundreds of other applications. But recently, in addition to sequencing of many species, there is also much effort on sequencing millions of personal genomes. The things that make us different are mutations. However, there are surprisingly few mutations that distinguish my genome from your genome. Roughly one mutation per thousand nucleotides. However, this mutation make big difference. They account for height or they account for more than 7,000 known genetic diseases. Five years ago, the era of personalized genomics started and Nicholas Volker is a foster child of personalized genomics. He was so sick that he went through thousands of surgery, but doctors still were not able to figure out what is wrong with this kid. However, after he is genome sequence can reveal a mutation in a gene linked to defect in the immune system. Doctors applied immunotherapy and Nicholas Volker is now a healthy child. However, sequence in personal genomes, from scratch, still remains difficult even today. What biologists do today however, they do so called reference base human genome sequences. Let's start from Craig Venter genome assembled in 2000, call it reference genome. And then let's start sequencing my genome by generating all reads from my genome. Here's some of the three perfectly match to genome, but some of them don't. And based on these reads that do not match, we will be able to figure out what is my genome. For example, we can find a mutation of T into C and deletion of T in my genome as compared to. It brings us to a number of computational problems. The easiest one is the exact pattern matching. Given a String pattern and a String text, we want to find all positions and texts where pattern appears as a Substring. But our goal is to find mutations, and therefore we want also to solve Approximate Pattern Matching Problem. Where input is a string Pattern, a string Text, and an integer d. And we want to find all positions in Text where Pattern appears as a Substring with at most d mismatches. I think you already have some good ideas on how to solve this rather simple problem. But think about this, even if you have fast algorithms for solving this problem, would you be able to solve the next problem? To answer the question where do billions of reads generated for me, match the reference genome from. And this leads us to Multiple Pattern Matching Problem given a set of strings, Patterns and a string Text, find all positioning texts where a string from Patterns appears as a substring.