Hi, I'm Liliana Florea, and today we'll continue our discussion of Command Line Tools for Genomic Data Science with a module on Sequences & Genomic Features. We'll look at how they are generated and represented, and then at some of the UNIX commands that you can use to manipulate them. So, let's get started. First, I would like to give you a very quick primer on molecular biology, and we'll start by talking about genomes and genes. A genome represents the totality of genetic material within a cell. We can think about it as being a very, very long sequence. However, while we are thinking about it in a one-dimensional way, the genome is organized into chromosomes, and therefore it has a three-dimensional structure. The human genome, for instance, is packaged into 46 chromosomes: 22 pairs of autosomes and the two sex chromosomes, X and Y. Back to the genome. Along the genome are important features, and the most important, perhaps, are genes. You can think about them as beads along a string, and they're located on both strands of the double-stranded genome. So for instance, if you're looking at this representation here, you're going to see blue blocks and red blocks, where blue represents genes located on the forward strand and red represents genes located on the reverse strand. Now if you zoom in further, genes are much more complicated than simple beads. What you can see at the bottom is that genes are actually made up of alternating types of blocks. There are two types of blocks: informative ones, which we call exons, and in between them spacers, which we call introns. You will see on the next slide why we call the exons informative. So as you can see, the genes reside, or are located, or are housed if you wish, along the genome. However, genes also have a body of their own.
So they have to take a form of their own, and they do so during the process of gene expression, which is what I will be explaining next. Gene expression is essentially a complex process that starts with a gene represented on the genome and, through a variety of steps, leads to the production of a protein. I will be talking about eukaryotic genes, and specifically about protein-coding genes, with the understanding that there are other classes of genes that are non-coding and do not lead to a protein product. So what does this process look like? We start with a gene at the top and its location on the genome. The first step is transcription. During transcription, the molecular machinery creates a copy of every base from the beginning of the gene, at the initiation site, to the end of the gene, as a single-stranded RNA molecule, and we call that a pre-mRNA. Now, before being translated into the protein, this pre-mRNA undergoes a number of modifications that give rise to another type of molecule. The first of those modifications is capping. Then the end of the molecule is chopped off and replaced with a poly-A tail. But the most important of these modifications, perhaps, is splicing. During splicing, those informative blocks that we were talking about, the exons, get connected together, whereas the introns, those spacers, are removed, or spliced out. The result is an mRNA, a so-called messenger RNA molecule, that contains the concatenation of the exons in the order in which they occur on the genome. Everything so far has happened in the nucleus. This molecule is then exported into the cytoplasm, where it is translated into a protein. This is the process by and large. However, genes have variations. So, for instance, it's quite possible that while one messenger RNA form is expressed in the brain, another form is expressed in the liver.
Or we might have different representations, or different variations, of the gene output depending on the developmental stage, or depending on the disease versus normal condition. So in this particular case, it is possible that by selecting different combinations of exons from among the original ones, we put together different mRNA products, which correspondingly lead to different protein products. We call all these embodiments, or all these variations, transcripts, or splice variants, or alternatively spliced transcripts. We will use these two terms, gene and transcript, in the following discussion. So, if you're thinking about how to put all of this together into a molecular and cellular context, you can think of the cell at any given time, with its organelles and its nucleus, and there's always exactly one copy of the genome in the nucleus. Then there are a number of expressed genes. You see these blue wiggly lines, the mRNAs, and each gene might have multiple copies of an mRNA in the nucleus as well as in the cytoplasm. The same holds for proteins: every gene might have a number of protein molecules being produced, both in the nucleus and in the cytoplasm. And the major questions that we need to address as computational biologists are: what are the genes, what are the proteins, and what are their sequences? Additionally, if we can, we want to quantify their abundance, and lastly, to determine their role in the cell. We will start with how we represent the genes and the sequences and how we identify these sequences, starting with the following slides.
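As a brief aside, the transcription and splicing steps described above can be made concrete with a small Python sketch. This is only an illustration under simplifying assumptions, not part of the lecture material: the toy sequences, the coordinates, and the function names `transcribe` and `splice` are all made up for the example, the gene is assumed to lie on the forward strand, and capping and the poly-A tail are not modeled.

```python
# Toy model of transcription and splicing (illustrative assumptions only).
# Coordinates are 0-based and half-open; the gene is on the forward strand.

# Hypothetical exon and intron sequences, invented for this example.
EXONS = ["ATGGCC", "GAGCTG", "TTTTAA"]
INTRONS = ["GTAAGTTTCAG", "GTGAGTCCAG"]

# Lay the gene out on a tiny "genome": exon, intron, exon, intron, exon.
gene = EXONS[0] + INTRONS[0] + EXONS[1] + INTRONS[1] + EXONS[2]
genome = "GGG" + gene + "CCC"
gene_start, gene_end = 3, 3 + len(gene)


def transcribe(genome_seq, start, end):
    """Copy every base from start to end as RNA (T becomes U)."""
    return genome_seq[start:end].replace("T", "U")


def splice(pre_mrna, exon_coords):
    """Join the chosen exons in genomic order; everything else is dropped."""
    return "".join(pre_mrna[s:e] for s, e in sorted(exon_coords))


# Exon positions relative to the start of the pre-mRNA.
exon_coords = [(0, 6), (17, 23), (33, 39)]

pre_mrna = transcribe(genome, gene_start, gene_end)

# One transcript keeps all three exons; an alternative one skips exon 2.
full_length = splice(pre_mrna, exon_coords)
exon2_skipped = splice(pre_mrna, [exon_coords[0], exon_coords[2]])

print(full_length)
print(exon2_skipped)
```

Selecting a different subset of exon coordinates from the same pre-mRNA yields a different transcript, which is the idea behind alternatively spliced transcripts of one gene.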