We ran three proteome comparison jobs: one using a whole genome as the reference, one using a FASTA file of genes that we were interested in, which are the lipopolysaccharide genes from Brucella, and a third using a feature group, which was the proteins, the genes from this paper, just to show you the differences and what each looks like. I know you're excited to see that job. I'm excited too. Let's go look at it. There are several ways you can view a job in PATRIC after you've run it. You've seen me click on the job monitor many, many times. But let's go up into workspaces, click down here, and click on my jobs. I can also navigate to it if I go into my home workspace, but this just gets me there quicker. I click on my jobs, and these are the three jobs. Let's see what each of them looks like. To be able to see something, you need to highlight the row. When you click on that, this vertical green bar gives you a couple of downstream actions that you can take. If you have a problem with the job, you can report an issue with it. If you want to view it, you click on the "View" icon, which is what I'm going to do right now. Okay, this is going to give me all the information I want about this particular job. I hope it's all the information you want. It's a lot of information. Up here are hyperlinks to some of the files that we have here, and the ability to download different things from the job. Across the top is the breadcrumb. This is the name of my job. It tells me it's in this folder in the home directory of my private workspace. This is the job ID. This job started on this date at this time. It ended on this date at this time, and it was fast: six minutes, 51 seconds. One of the things I like about the proteome comparison jobs is that they turn out results pretty quickly, unless there is a long queue. PATRIC operates on a first-come, first-served basis. Even I do not get a special place in the queue. I have to wait along with everybody. Parameters. 
If you click down here, this will show you everything that you selected when you submitted the job. We have a number of files here that I want to describe to you. Let's start at the bottom with this reference genome text file. I highlight that, and let's click the "View" icon. This just shows me the genes in the reference genome that I used, what they were, and where they are in that genome. This tells me that this is the accession number for the contig. This gene started on that contig at that location. It ended at that location. And this is the particular gene. There are 3,000-some genes, and there should be one row for each of those here. Let me go back here. We also have that data for the comparison genomes. It's tied in here, and let's look at it here. This is telling me, against that reference, which genes had the hits and how strong each hit was. This is the accession number, and the start and stop location. This is the percent identity of this particular gene in comparison to that one. I'm sure you're thinking right away, "Well, that's so very helpful, to have the reference genes in one file and the comparison genes in another file. How am I supposed to see anything? Doesn't sound helpful at all." Normally I would agree with you, but don't worry, we have a solution and you're going to love it. First, I want to show you this diagram, which is the Circos final HTML. Let's view that. What I'm about to show you later is the detail behind this visualization, but this tells you everything you would want to recognize visually about the proteome comparison job. Up on the top is the color key, which distinguishes between bidirectional and unidirectional best hits. The bidirectional are more vibrant and the unidirectional are less vibrant. It shows you the tracks from outside to inside. This is our reference genome. On this track it is BLASTed against itself in this job, so it should have 100 percent similarity. Next we have the Brucella melitensis Rev 1 strain. 
You can see that there are really strong hits here, but some are less strong. Then we have Brucella canis, an isolate from dogs. Now notice that there are some blank spots here, and when you see blanks here, it might be an indication that this genome is missing those genes. However, when we submitted this job, we said we wanted to have 70 percent query coverage, 70 percent sequence identity, and an E-value of 1e-70. We're setting the bar high for these genes. There might be genes in that genome that would BLAST against it if we weren't being so picky, but that's the way I played that game, and these are the results. When you publish, you need to describe exactly what you did. You also need to verify things, so you need to go looking to see if those things were actually missing. Next is the ovis genome, which also has a big blank. Now you might ask, "Well, look here, there's a blank here on Brucella melitensis. Does that mean it's missing a gene?" No. What it's trying to do is reflect the position and the order of the genes on the chromosome, and when you see a blank in the reference genome, like right here, it's an indication that there's just an intergenic region there where no genes are called. The next one is the FASTA file for those LPS genes, then the feature group. As I said earlier, these genes and these files came from this genome, so the hits should be strong. Now you may be wondering what is going on here, and here, with this break: it starts at 0.1 and then it starts at 0.1 again. What's going on? Brucella melitensis 16M is a closed, complete genome. When something is closed and complete, it means that for each of the chromosomes and plasmids that an organism has, we have a contig that represents it, and that within each contig every nucleotide is an A, T, C, or G and there are no N calls. An N means the sequencer couldn't distinguish and determine what the base was. 
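To make the cutoffs from this job concrete, here is a minimal Python sketch of the kind of filter they describe. It is an illustration only, not PATRIC's own code: the `BlastHit` record, its field names, and the `passes_cutoffs` function are made-up names. Reading the "minus seventy" E-value setting as 1e-70 and treating the cutoffs as inclusive are both assumptions.

```python
from dataclasses import dataclass


@dataclass
class BlastHit:
    """One hit of a reference protein in a comparison genome (hypothetical record)."""
    gene: str
    percent_identity: float  # 0-100
    query_coverage: float    # 0-100
    evalue: float


def passes_cutoffs(hit, min_identity=70.0, min_coverage=70.0, max_evalue=1e-70):
    """Return True only if the hit meets all three cutoffs (assumed inclusive)."""
    return (
        hit.percent_identity >= min_identity
        and hit.query_coverage >= min_coverage
        and hit.evalue <= max_evalue
    )


# Made-up hits: geneB fails on identity, geneC fails on E-value.
hits = [
    BlastHit("geneA", 99.0, 100.0, 1e-120),
    BlastHit("geneB", 65.0, 95.0, 1e-90),
    BlastHit("geneC", 85.0, 80.0, 1e-20),
]
kept = [h.gene for h in hits if passes_cutoffs(h)]
print(kept)
```

A hit that fails any one of the three cutoffs would show up as a blank on the Circos track, which is why a blank by itself is not proof that the gene is absent from that genome.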
The way this is visualized, this big part that takes up two-thirds of the circle is chromosome one, the large circular chromosome in Brucella. The smaller one is chromosome two, the smaller circular chromosome. Some people have told me when they have multichromosomal genomes, "I don't want it to look like this. I want to see the individual chromosomes." The way I would do that is to submit each of those chromosomes individually to the annotation service, and then run proteome comparison on that. That's what that looks like. Let's go back to the jobs folder. We just looked at the HTML. This SVG is a scalable vector graphics file, publication quality, that you could submit with a paper if you wanted to publish on this, and please cite PATRIC if you do. I look at every paper that cites PATRIC. I also tweet and post on Facebook about it, so that gives you some incentive: I'm trying to help you get your paper recognized as well. I see a lot of these in published papers that cite PATRIC. There are a couple of other text documents in here: the karyotype, the large tiles, and the legend. The karyotype is just talking about the two chromosomes in Brucella melitensis. The large tiles is another boring document; I don't think it's very useful. The legend shows you the color key. But the most important one to me: you have the text and the Excel file of the genome comparison. This genome comparison file takes the reference and each of the comparison genomes and combines them in the most awesome Excel file that perhaps the world has ever seen. Let's look at it. I highlight the row, and this one I have to download. I click "Download," and it asks, should I open this with Excel? You bet, because I'm a biologist and I love Excel. Come here, you, don't be shy. Don't be shy, little Excel file, be proud of your beauty. We are going to show these people what you look like, because it is amazing. 
This is all the data combined together, and you'll notice it's keeping the colors too, so if I had that legend up, I'd be able to distinguish between bidirectional and unidirectional best hits and all of that stuff. It starts with the reference genome. All of this, from here all the way to here, is our reference. From here all the way to here is the first comparison genome. From here all the way to here is the second comparison genome. The column heads underneath belong to that particular genome. Let's step through it, and I'll blow them up so you can see each one. This is the reference genome contig. This is the accession number of the contig where the first gene on the reference genome is located. This is the size of that gene. If you look up here, it says reference genome amino acid length. This one is 341 amino acids long. This is the PATRIC gene ID. Now, be careful clicking on this, because it's a hyperlink, and it's going to take you back to PATRIC and show you that gene. If that gene has an accession number in RefSeq or GenBank, we include it here. If in RefSeq or GenBank it's called by a different gene name, like Hemi, we include it here. This gives you the local PATRIC family ID number, this is the global PATRIC family ID number, this is what the gene is called in PATRIC, this is the start on this chromosome, this is the end on this chromosome, and this is the orientation. That means it's on the forward strand, and this goes for every single gene in this genome. You'll notice this one has a PATRIC ID but not a RefSeq gene, meaning it's called [inaudible] PATRIC annotation service, but Prokka, which is used by GenBank, didn't call this gene. Now we get to the more exciting part. Well, they're all pretty exciting. This is saying, what [inaudible] is it? It's bidirectional. This is the accession number of the contig on this Rev 1 genome that has the best hit to this gene. 
This is how big it is, in amino acids, and this is the PATRIC identifier. If it had a RefSeq locus tag, it would be included here, and if it had a RefSeq gene name, it would be included here. This is what it's called in PATRIC, and then here's where things get very powerful: the percent identity. It's got 100 percent identity, and you can see a couple of them are smaller, and this is the sequence coverage. It's showing you what the hits are and how good those hits are. When we get to canis, like here, and here in ovis too, these guys didn't have any hits of that caliber. Remember that we set a high bar of 70 percent sequence identity, 70 percent query coverage, and an E-value of 1e-70. These guys didn't have any genes that met that bar, and this guy is missing a few too. You can already see that this could be a powerful way to look very quickly and see how close the genes in two genomes are, but also to see large blocks of things that appear to be missing, because remember, this is walking down a chromosome. Look, when we get here to these LPS genes, remember we were looking at a specific FASTA file, and that's very limited. Suddenly, you can sort of get a feeling for why you might want to use this as a reference, because if these are the genes you're interested in, that's all you care about. There's all this other stuff that you don't even want to have to wade through. Let's go in and look at what that looks like. I'm going to go back here. Actually, I want to go to my jobs page, as we've discussed everything I think we need on this one. This is the one with the genome, and you can get quickly to the beautiful colors and the Circos diagram by clicking here. Okay, and that's going to show us this. But I also wanted to look at the other jobs so we could go back and forth between the files. This is when I used the whole genome as a reference. Here I'm going to go into my jobs. Here's the FASTA file. 
Let's do that and then click the View icon up here in the upper right. When you get more specific about what you're asking for, the diagram changes. You might think, "Well, it's going to look exactly the same as the last one," right? Let's see it from the feature group file. This is the feature group; the other one is the FASTA file itself. This is the feature group. Let's view that. Then click View. Let's describe what the differences are. Here's the feature group, and that was 72 genes. What this is showing, from the outside: the first one is the feature group. Then we have Brucella canis, Brucella ovis, Brucella abortus 2308, then the FASTA file, then the Rev 1 genome, and then the genes from 16M. This is showing where the genes are, because I took this feature group from PATRIC. It was trying hard to resolve it on the genomes, so it's showing you exactly where all of these genes are in the genome. You can see that there's a large number of LPS genes here. In fact, one thing I didn't show you on the other diagrams: some of these can be used as hyperlinks, and it'll open up and show you a particular gene if you're interested in it. But I'm not even interested in waiting for it to load, because you've got this, which is showing you the things that you were interested in, in the context of this whole thing. It's like blowing up these two lower circles. I'm just really impressed with this, which is giving you all the data. This is a large gene here; you can walk through it. This is the FASTA file against itself. Canis, ovis, abortus 2308, Rev 1, and then 16M. I really like the way this looks. Let's go back to the folder for it. If we were to download the genome comparison for this, I say, "Yes, I want to open that." Give it a chance to open, because it's heavy, because it's got all the coloring in it. We'll stretch it out. Come here. Then it's just showing you, for those genes of interest, all the BLAST data, and it's amazing. I just love this thing. 
I hope that you do as well, because it's an amazing tool, and it's a great way to really dig into the data. In the last video of this series, we'll talk about citing PATRIC, and what to do when you have problems and need to find help. Thanks for joining me, and see you next time.

Assignment eight. The end is so close, I can almost see it. This particular assignment is about the download files. You've run all the jobs, all the data's there, and I know there's a lot of it for you to go through, but I want you to look at at least some of the files and try to decide: what are the main differences that you saw between the Excel files from jobs that had a FASTA file, a feature group, and a genome as the reference? Look at those three, and right away, within the references, you'll see some things that are clear differences. Which of those, the FASTA file, the feature group, or the genome as the reference, does not give you protein family assignments? Which of those had more unidirectional BLAST hits? You'll have to scroll up or down, or you can put a dynamic filter on any of those columns and filter for unidirectional and see: did you get more unidirectional hits when you used the FASTA file as the reference, the feature group as the reference, or the genome as the reference? Another question: when you use each of those three, which file had lower hits? Lower-quality hits, that is; those would be the non-blue colors. You're getting down into the yellows and reds. In the direct comparison of the Canu and Unicycler assembled genomes, do you see any patterns related to the contigs? Look for the column that has the contigs, then scroll down that and see what you can say. This is part of our pathway to truth on what the different assembly strategies tell us. Now, this is the end of the assignment. I'm going to have a summary video that's going to take you down this path of, well, shall we say searching for truth? 
But it will take you down the path of everything we've done across the course. We've followed the same set of data, the same set of reads, through a number of different services and experiments, and we're going to look at that and summarize it, and hopefully it'll give you the sense of closure that comes at the end of this course. I'll see you there. Bye.
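For the assignment question about unidirectional hits, one alternative to filtering in Excel is to count the hit types from the tab-delimited text version of the genome comparison file. The Python sketch below is only an illustration. The column name `comp_genome_1_hit` and the values `bi` and `uni` are placeholders invented for this example, so check the header row of your own downloaded file and adjust them.

```python
import csv
import io
from collections import Counter


def count_hit_types(handle, hit_column):
    """Count the values found in one hit-type column of a tab-delimited file."""
    reader = csv.DictReader(handle, delimiter="\t")
    counts = Counter()
    for row in reader:
        value = (row.get(hit_column) or "").strip()
        # An empty cell is treated here as "no hit met the cutoffs" (an assumption).
        counts[value or "no hit"] += 1
    return counts


# Tiny made-up table standing in for a downloaded file.
sample = io.StringIO(
    "ref_gene\tcomp_genome_1_hit\n"
    "gene1\tbi\n"
    "gene2\tuni\n"
    "gene3\t\n"
    "gene4\tbi\n"
)
counts = count_hit_types(sample, "comp_genome_1_hit")
print(counts["bi"], counts["uni"], counts["no hit"])
```

With a real file, you would pass a handle from `open(path, newline="")` instead of the sample text, run it once for each of the three jobs (FASTA file, feature group, and genome as the reference), and compare the counts side by side.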