Hi! Welcome to “Bioinformatics: Introduction and Methods”! I’m Liping Wei from the Center for Bioinformatics at Peking University. In the fall of 2013 we offered this MOOC on Coursera for the first time. Over 18,000 students from over 100 countries enrolled, and we had a great time interacting with them. We hope that we can have a great time with you as well! Today let’s begin with an “Introduction and History” of bioinformatics. The first unit is “What is bioinformatics?”

Bioinformatics is a discipline that is near and dear to my heart. Let me start by sharing what inspired me to study and work in bioinformatics. I have always been fascinated by the life around us. What fascinates me the most is how a tiny fertilized cell can grow into a lovely baby in just ten months in her mother’s womb, and how that baby gradually grows into a lovely young adult. This is Dr. Wenbo Chen, who recently graduated with her Ph.D. from my lab. What is even more fascinating is that most of the instructions for the development of a fertilized cell into a baby are written in the 23 pairs of chromosomes located in the nucleus of the cell, which are so small that they are visible only under a microscope. The human genome that these chromosomes carry is like a “manual of life”: it encodes which tissues and organs should be made, and when and where.

Of course, this “manual of life” is not complete or absolute. Besides the nuclear genome, our mitochondria also contain DNA. Epigenetic information such as DNA methylation, histone modification, and nucleosome positioning is important for development. The environment and gene-environment interactions shape us as we grow, and stochastic events during development also play a role. Still, it is undeniable that our genome encodes many of the secrets of our lives.
What is even more amazing is that the human genome is made of stretches of just four simple nucleotides: A, T, C, and G (adenine, thymine, cytosine, and guanine). This is a small segment of human chromosome 1. The genome sequence looks so simple, yet think about what secrets are hidden in this Book of Life written in a four-letter alphabet! This is a great mystery to me.

The human genome has about 3.1 billion base pairs. About 2.9% of it consists of genes that encode proteins. Where are these genes in the genome? Alternative splicing is common in higher organisms, and one gene can be spliced and translated into several, or even several thousand, different proteins. To give an extreme example, the Dscam1 gene in Drosophila is spliced into over 38,000 different isoforms! How do we predict the proteins that a gene encodes? The other 97.1% of the human genome used to be called “junk DNA”. Now, however, we know that it actually encodes important regulatory elements: instructions on which proteins to make, and when, where, and how much. Where and what are these regulatory elements? What regulatory networks do they form to function together?

Not only humans, but all life on Earth except RNA viruses has a genome made from these four nucleotides A, T, C, and G. These simple-looking sequences hide the mystery of life, waiting for us to uncover it. Sequencing the genomes of other animals, plants, and microbes allows us to reconstruct how species evolved and what the Tree of Life looks like. New high-throughput technologies bring unprecedented opportunities for life science research. They allow us to obtain data never seen before, to study questions impossible to study before, and to discover phenomena unimaginable before. Now we can sequence not only one person’s genome, but also many different people’s genomes.
This allows us to study the genetic differences between people at genome scale, as well as many other questions in population genetics, such as the early migration of human beings across continents. At the same time, we can also study the genetic differences between patients and healthy controls to find the genetic mutations responsible for some diseases. Shown below is a pedigree of a childhood neurological disorder that my lab recently studied. This seven-year-old boy has severe intellectual disability, is unable to walk, and has had epilepsy ever since he was an infant. He has a twin brother with the same symptoms. He also had a maternal uncle who had the same symptoms and passed away from epilepsy during childhood. His maternal grandmother reported two third-trimester miscarriages of two other sons; the fetuses had no hair or eyebrows. A common physical feature of all the affected individuals in this family is that they have no hair or eyebrows at birth, and scaly skin over large portions of the body. In the end we found that just one point mutation in one gene on the boy’s X chromosome caused all these symptoms.

Every one of us carries mutations, but most mutations do not cause disease. How can we find this one disease-causing mutation among the 3.1 billion base pairs of the human genome? How can we distinguish disease-causing mutations from neutral mutations? These are all important questions that fascinate me. The fundamental question is: how do we decode this Manual of Life? This seemingly simple Book of Life of just four letters is actually not easy to read. I once did a calculation for the undergrads in my lab. If you print out the human genome at 100 characters per line and 50 lines per page, you’ll fill about 600,000 pages, which stack up to over 200 feet, taller than the Life Science building at Peking University!
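These back-of-envelope figures, along with the reading-time and growth-rate estimates that come next, are easy to check yourself. Here is a quick Python sketch of the arithmetic, using the lecture’s own assumptions (100 characters per line, 50 lines per page, one nucleotide read per second):

```python
import math

GENOME_BP = 3.1e9  # approximate length of the human genome in base pairs

# Printing the genome: 100 characters per line, 50 lines per page
pages = GENOME_BP / (100 * 50)          # roughly 620,000 pages

# Reading the genome: one nucleotide per second, no eating or sleeping
years_to_read = GENOME_BP / (60 * 60 * 24 * 365)   # just under a century

# SRA growth: 10 trillion -> 1,000 trillion nucleotides over 3 years (36 months)
fold = 1000e12 / 10e12                  # a 100-fold increase
doubling_months = 36 * math.log(2) / math.log(fold)  # about 5.4 months

print(f"{pages:,.0f} pages, {years_to_read:.0f} years to read, "
      f"doubling every {doubling_months:.1f} months")
```

The printout comes to roughly 620,000 pages, which the lecture rounds to 600,000; the reading time works out to just under 100 years; and a 100-fold increase over 36 months corresponds to a doubling time of about five months.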
If you read one nucleotide per second without eating or sleeping, you’d need about 100 years to finish reading the human genome. And we don’t want merely to “read” it; we want to understand it and decode the mysteries hidden in it. In addition to the human genome, over 1,000 trillion base pairs from over 165,000 species had been sequenced by the end of 2013. Just reading them would take 30 million years.

Life science data is not only big in size but also growing exponentially. This figure shows the number of nucleotides sequenced and stored in the GenBank database over the years. Our modeling showed that the number of nucleotides has been increasing exponentially since 1982, doubling every 20 months. What does this mean? It means that the data deposited into GenBank over the next 20 months will be as much as all the data accumulated in human history to date. At the same time, the cost of sequencing has been decreasing every year. This reminds us of Moore’s Law.

The past few years have seen even steeper growth of sequencing data, because next-generation sequencing technologies have been developed and used more and more widely in all areas of the life sciences. Next-generation sequencing platforms such as the Illumina HiSeq can sequence one person’s genome in a day for less than 3,000 dollars. This figure shows the growth of the Sequence Read Archive (SRA) database. From 2010 to 2013, the number of nucleotides sequenced increased from 10 trillion to 1,000 trillion: 100 times in just three years, doubling every five months. That is to say, just imagine it: the amount of next-generation sequencing data generated in the next five months will be as much as all the data generated so far in all of human history. Just think what exciting discoveries these never-before-seen data may contain! Besides sequencing data, other high-throughput technologies such as mass spectrometry and yeast two-hybrid
have also generated huge amounts of a wide variety of proteomic, metabolomic, and protein-interaction data. No wonder biological data is now widely considered “Big Data”.

With great opportunities come great challenges. High-throughput technologies usually have a higher error rate than the corresponding low-throughput technologies. For instance, the per-nucleotide error rate of one run of next-generation sequencing is about 10-100 times that of Sanger sequencing. How can we fish the signal out of the noise? The sequencing data generated by one set of experiments in my lab is often too large to be stored on a laptop or desktop, and cannot be opened with desktop software. The storage and search of big data require advanced, ontology-based database systems. The huge amount of data requires efficient methods; the exponential growth requires scalable methods; the low signal-to-noise ratio requires accurate methods; and handling multiple types of data requires integrative methods. These are significant technical challenges. But it depends on how you look at it: technical challenges can also mean opportunities for technical innovation. The birth and growth of the field of bioinformatics has been driven by these opportunities and challenges.

As you may realize by now, the love story between life science and computer science was inevitable, and its result is the birth and growth of the field of bioinformatics. What, then, is the definition of bioinformatics? You may already have a definition in mind. Please pause here for a moment and think about it yourself before continuing. Bioinformatics can be defined as an interdisciplinary field that develops and applies computer and computational technologies to study biomedical questions. It has two roles. As a technology, bioinformatics is a powerful way to manage, search, and analyze big data in the life sciences.
As a methodology, bioinformatics is a top-down, holistic, data-driven, genome-wide, systems approach that generates new hypotheses, finds new patterns, and discovers new functional elements. It complements traditional experimental biology; a seamless combination of computational and experimental methods should be the best way to study a biological question. Bioinformatics is truly interdisciplinary: it studies questions in biology and medicine while developing and applying methods from computer science, mathematics, statistics, and physics. It overlaps with medical/clinical informatics, systems biology, and synthetic biology. The “bio-” in bioinformatics signifies the biological questions it studies, many of which can be grouped under the conceptual framework from genotype to phenotype and the Central Dogma.

Some examples may help you understand better. Take sequence alignment: are the sequences of two genes or proteins similar? Could they be homologous? How can we find the closest homologue of the gene I’m studying in the vast databases? Can I use the known function of a well-studied gene to guide the study of my gene of interest? Given DNA and genome sequences, how can we find the genes in a vast genome? How can we compare the similarities and differences between two whole genomes and reconstruct their evolutionary history? What are the syntenic regions between two genomes? How can we identify which genomic regions are methylated? At the level of RNA expression, we often want to know which genes are differentially expressed between two organs or tissues, between tumor and normal tissue, or between two developmental stages. At the level of proteins, how do we identify the expressed proteins from mass spectrometry data? Proteins do not exist in a linear form in nature; instead, they fold into beautiful three-dimensional structures. How can we predict the three-dimensional structure of a protein from its sequence?
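To make the first of these examples concrete, here is a minimal sketch of global pairwise alignment using Needleman-Wunsch dynamic programming. The scoring scheme (match +1, mismatch -1, gap -2) and the two toy sequences are illustrative assumptions; real tools use substitution matrices and affine gap penalties:

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning the prefixes a[:i] and b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            score[i][j] = max(diag, score[i-1][j] + gap, score[i][j-1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                score[i][j] == score[i-1][j-1]
                + (match if a[i-1] == b[j-1] else mismatch)):
            out_a.append(a[i-1]); out_b.append(b[j-1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i-1][j] + gap:
            out_a.append(a[i-1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j-1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))

s, x, y = needleman_wunsch("GATTACA", "GCATGCU")
print(s)
print(x)
print(y)
```

This quadratic-time dynamic program is the conceptual core behind practical alignment tools, which add heuristics to scale to whole databases and genomes.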
Molecules do not exist or function in isolation. Instead, they form complex molecular networks. How can we construct protein-protein interaction networks, transcriptional regulation networks, and metabolic and signalling pathways? What are the dynamic features of these networks? Furthermore, with these data and studies, how can we simulate a virtual cell? Last but not least, how do we compare different people’s genomes and use population genetics approaches to study the evolution and migration of the human species? How do we compare the genomes of the ill and the healthy, using human genetics approaches, to identify the gene mutations that cause diseases? I hope you can see from these examples that there are many interesting and important questions in the life sciences awaiting our bioinformatics investigation.

Seeing bioinformatics from another angle, the “-informatics” in bioinformatics signifies the information and computational methods used to manage, search, and analyze the data. The theme runs along the axis from data to discovery. First, the storage, indexing, and searching of terabytes to petabytes of big data require advanced databases, with ontologies defined to standardize the data format. Just imagine it: your laptop probably has a few hundred gigabytes of hard disk, yet just one set of our experiments often generates a terabyte or more of data, too large to fit on your laptop. Managing such big data requires advanced database systems. To facilitate future analysis, metadata about the details of the experimental conditions should be stored in addition to the data itself. Second, the analysis of noisy big data requires the development of many algorithms, software tools, and web servers; these make up a big portion of bioinformatics research. A tradition in bioinformatics, from the early days to today, has been to make most algorithms and software open source.
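As a tiny illustration of how the molecular networks mentioned above can be represented computationally, here is a sketch that builds an undirected graph from a list of pairwise interactions. The protein names and pairs are illustrative placeholders, not curated data:

```python
from collections import defaultdict

# Illustrative pairwise interactions (placeholder data, not a curated set)
interactions = [
    ("TP53", "MDM2"), ("TP53", "EP300"), ("TP53", "ATM"),
    ("MDM2", "UBE2D1"), ("EP300", "CREBBP"),
]

# An undirected network as an adjacency map: protein -> set of partners
network = defaultdict(set)
for a, b in interactions:
    network[a].add(b)
    network[b].add(a)

# Degree (number of partners) is the simplest notion of a network "hub"
degree = {p: len(partners) for p, partners in network.items()}
hub = max(degree, key=degree.get)
print(hub, sorted(network[hub]))  # the most connected protein and its partners
```

Real analyses use dedicated graph libraries and curated interaction databases, but the underlying adjacency-map idea is the same.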
This open-source tradition has contributed significantly to broader life science research, and it has helped advance bioinformatics itself as well. With these tools, we can do a great deal of data mining to discover new patterns and phenomena in the life sciences. Finally, by integrating data and tools, we can build predictive models of biological systems. As high-throughput technologies are used more and more widely in the life sciences, there will continue to be numerous opportunities for technological innovation in bioinformatics. By now I hope you have a basic idea of what bioinformatics is. Here I have listed a few summary questions for you to think about. If you have any questions, ideas, or suggestions, please discuss them with other students and with us on the online forum. In the next unit, I will give a brief introduction to the history of bioinformatics. See you then!