Congratulations, you've made it to the end of the comparative genomic analysis course. We started with reads. We went to assemblies, then we went to annotations. Then we went to comparisons of protein families, and then to comparisons of the proteins themselves, and that led us back to assemblies and sort of back to annotations. It seems like a never-ending cycle, but it isn't; there is an ending. But let's start at the beginning. Let's take a walk down memory lane, and I recognize that for some of you it may not be a pleasant experience. But still, let's go on nonetheless.
It all started with some read sets: some PacBio reads and two Illumina paired-end read sets that we joined together using two different assembly strategies. Keep in mind that the same reads were used in all of this; that's one thing I keep reminding myself of when I see these differences. They all had the same starting material. These were the assemblies that we submitted. I'm sure you can't forget this; it was probably the most fun you've had in ages. Well, when the assemblies returned, we could see that there were differences in the number of contigs and in the size of those contigs, and minor differences in the size of the assembly. Unicycler had two contigs, Canu had three, and remember those cute Bandage plots showing the different contigs.
The Comprehensive Genome Analysis also provided us with a hint as to where this genome belongs in the great scheme of life. Remember, when we submitted it, we just called it bacteria. But when you look at the tree that was run, it seems to be deeply embedded in the Staphylococcus genus. In fact, Staph aureus seems to be branching nearer to it than anything else, and this will be important a little bit later.
So as we continue our walk down memory lane, we took those genomes from all those different assembly strategies and loaded them into the Protein Family Sorter, which looks at differences in the annotated proteins.
And we did some clustering, and as you can see there, the Canu genomes are up at the top and the Unicycler genomes are down at the bottom. This should all be bringing back memories, with the differences that you can see based on the different assembly methods. And let me point out again, they were all assembled using the same read sets. So it's just showing you the differences that are generated by the different tools.
So we continue our walk along. Remember when you did some masterful filtering, and you were able to find genes that were unique to Unicycler but missing from Canu? Then we were able to take those genes and download them, and we saw something curious when we did that. Here's a list of those genes and the feature table that came down with it, so you can view the features. And if you look at an individual feature, you can see the information about it. You can also see that this one is on the second contig, so we wanted to look at the gene neighborhood.
Remember the neighborhood around this gene? I clicked on the Compare Region Viewer, and this is our guy here: this bacterium that we think is Staph aureus, or at least Staphylococcus; we don't really know who he is yet. Look at the closest relatives that the comparison tool was finding. These guys aren't anywhere near, so these genes here are found in more distantly related bacteria.
We thought that was a little curious, and we decided to first confirm by BLAST that the genes were gone from Canu. You'll notice that we had created a genome group of the Canu assemblies (another skill that you've got) and BLASTed the sequences of those particular genes against it, and we didn't get any BLAST hits. So we decided to try to find out what regions are missing. We decided to use the Proteome Comparison tool, which gives a very detailed BLAST analysis. We loaded the tool and then we looked at one comparison that had the Canu-assembled genome as the reference compared to the Unicycler one.
And then one with the Unicycler genome as the reference compared to the Canu one. Remember that Canu had three contigs and Unicycler had two. And you'll notice that you can see some pretty striking differences, not in the main chromosomes but in those small little ones. The genes on the second contig of the Unicycler assembly appeared to be missing in the Canu assembly. This is from the Excel sheet that comes with the Proteome Comparison tool, and if you scroll down to look at the second contig, you can see that all those genes appear to be missing in Canu. Remember, same starting material, and yet they're missing.
One thing you could do is look more deeply at these genomes and at what's going on here. If you look at the Canu assembly, you can see that some of the genes on that third small contig are missing. But the vast majority of the genes are still there, and in fact, the ones that are missing all appear to be named the same thing. So I suspect that they're a pseudogene of some type or other.
But let's get back to Unicycler and leave Canu behind, since we're looking at these missing genes. Remember the Bandage plot? All those genes are on that contig there. If you go to the genome landing page for this genome in PATRIC and look at the sequences, you can see which features are on the second contig by highlighting the row and clicking Features, and that shows you what the genes are. So we have these genes in Unicycler, but we're not seeing them in Canu.
So our quest for truth was to try to figure out whether we really think this is real or not. The question is, do the genes really exist, or are they an aberration of the assembly? I would say that yes, they do exist. Those genes all have good names. If they were all hypothetical, I might be worried that they were some artifact. But I feel like these are just things that the assembler couldn't resolve within the body of the main genome.
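For anyone who wants to repeat that reasoning outside the website, the filtering step and the "are they all hypothetical?" check can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the family IDs, genome labels, and product names below are made up for illustration, and the two helper functions are hypothetical, not part of PATRIC or of either assembler.

```python
def families_unique_to(presence, have, lack):
    """Return family IDs found in every genome of `have` and in no genome of `lack`.

    `presence` maps a protein family ID to the set of genome labels that
    contain at least one member of that family (hypothetical input format).
    """
    have, lack = set(have), set(lack)
    return sorted(
        family
        for family, genomes in presence.items()
        if have <= genomes and not (lack & genomes)
    )


def fraction_hypothetical(products):
    """Fraction of product names that are annotated as hypothetical proteins."""
    if not products:
        return 0.0
    hits = sum(1 for name in products if "hypothetical protein" in name.lower())
    return hits / len(products)


# Made-up presence/absence data: two Unicycler and two Canu assemblies.
presence = {
    "FAM_0001": {"unicycler_1", "unicycler_2", "canu_1", "canu_2"},
    "FAM_0002": {"unicycler_1", "unicycler_2"},
    "FAM_0003": {"canu_1", "canu_2"},
    "FAM_0004": {"unicycler_1", "canu_1"},
}

unique = families_unique_to(
    presence,
    have=["unicycler_1", "unicycler_2"],
    lack=["canu_1", "canu_2"],
)
print(unique)  # ['FAM_0002']

# Made-up product names for the genes on a small contig.
products = [
    "Replication initiation protein",
    "hypothetical protein",
    "Mobilization protein",
    "Hypothetical protein",
]
print(fraction_hypothetical(products))  # 0.5
```

A high fraction of hypothetical proteins would be one reason for suspicion, as described above; it is a rough heuristic, not proof either way, and a BLAST search is still the check to run before calling anything missing.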
Also, I want you to remember that these differences that we saw across these genomes, in this little tiny contig, are very small. Both Canu and Unicycler did a good job. And remember, when we ran Canu and Unicycler, there were some pretty striking differences in the time that it took. So that may be one thing to think about when you're using your own data.
But in continuing with our search for truth, do we have a conclusion? Yes, the genes exist, and mainly we have to recognize that different tools have different behaviors. So there are differences between Canu and Unicycler, but that doesn't mean that one is wrong or the other is right. I would also encourage you to try not to be rigid and just stick to one thing. Try different assembly strategies, try different things, to get the best representation of your data that you can. That leads you down the pathway to publication.
And finally, a motto that we should all live by: BLAST is your friend. Before you say that something is missing, be sure to BLAST and look for it. I would say, in all things, you must BLAST.
So, the last word. Thank you for using PATRIC. Go forth and publish, and please cite us, and thank you for joining me on this long, strange trip. Bye.