Why is genome annotation important?
Genome annotation is no simple feat, but it’s incredibly important in identifying the functional elements of DNA. Building the appropriate tools and pipelines is key.
With expertise gleaned from working with a diverse range of genomes - from aphids to wheat, and protists to fish - Earlham Institute scientists explain how genome annotation has advanced over the years, and why it is so important.
Gemy Kaithakottil is celebrating his tenth anniversary at the Earlham Institute this February, having helped to oversee a decade of dramatic transformation in how we annotate genomes.
“We’ve come a long way in the last ten years. Back then, before we began, the process was more like running software and scripts one by one. Now, we’ve merged everything into a streamlined pipeline and shaved a lot of time off the process.”
Throughout that time, Kaithakottil has worked as a Senior Bioinformatician and Software Developer in the Swarbreck Group, developing a suite of tools and pipelines that help us to more accurately map where genes lie along a genome.
Simply put, genome annotation involves taking genomic data - DNA or RNA sequences - and mapping the correct genes (or more accurately, functional elements) to the correct locations. It gives the genome meaning.
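To make that idea concrete, here is a minimal Python sketch of what an annotation is at its simplest: a list of named features, each tied to coordinates on a sequence. The `Feature` class, the `features_at` helper, and the example genes and coordinates are all hypothetical, invented for illustration. Real annotations are usually stored in formats such as GFF3 and carry far more detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """One annotated element on a sequence (hypothetical, GFF3-like fields)."""
    seqid: str   # chromosome or scaffold name
    kind: str    # e.g. "gene" or "exon"
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"
    name: str


def features_at(features, seqid, position):
    """Return every feature on `seqid` that covers `position`."""
    return [
        f for f in features
        if f.seqid == seqid and f.start <= position <= f.end
    ]


# Invented example data: two genes on one chromosome, one with two exons.
ANNOTATION = [
    Feature("chr1", "gene", 1000, 5000, "+", "geneA"),
    Feature("chr1", "exon", 1000, 1500, "+", "geneA.exon1"),
    Feature("chr1", "exon", 4200, 5000, "+", "geneA.exon2"),
    Feature("chr1", "gene", 8000, 9500, "-", "geneB"),
]

hits = features_at(ANNOTATION, "chr1", 1200)
print([f.name for f in hits])
```

Without the annotation, position 1200 on the sequence is just a base. With it, the same position can be placed inside the first exon of a named gene.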
According to Kaithakottil, this is an essential step that is frustratingly undervalued.
“People often spend a lot of effort on genome assembly, but eventually the research is going to work with the protein or the functional parts of it. If you're not going to give any effort to that part, then what's the point?
“You have to put equal - or even more - effort into the annotation.”
This can be done manually, by looking directly at the data and identifying the precise starting points of genes, but that takes a lot of time.
“When I was working at the Sanger Institute on the Human Genome Project, we annotated the human genome by going gene by gene across every chromosome,” says Dr David Swarbreck, Core Bioinformatics Group Leader at the Earlham Institute.
“We used manual annotation tools to visually examine alignments of cDNAs and proteins and, based on these, we could construct gene-models to define a gene's structure. This was a huge team effort and manual curation to this extent is not possible for most newly-sequenced and assembled genomes.
“I wanted something computational that would work in a similar way to a manual annotator, enable us to assess alternative gene models, generate metrics to aid that comparison and make choices over the models we include or exclude: allowing us to shape the annotation for specific projects but without us having to do it manually.”
You can learn from the experts how to annotate a genome at this year’s training workshop, delivered by the Core Bioinformatics Group here at the Earlham Institute.
Date: 17 - 19 May 2022
Register By: 17 April 2022
One of the first genomes annotated by Swarbreck and his group was that of the green peach aphid, Myzus persicae, in a collaboration with the Hogenhout Group at the John Innes Centre that continues to this day.
“We played around with all the available tools at the time, to see what was available,” says Swarbreck. “The problem was, many of the more comprehensive pipelines weren’t easy to translate to running in your own environment.”
Kaithakottil adds that, “with any new pipeline, you need to understand the software, then work out what parameters you need to use, or tweak, for a particular species. It's not one size fits all. You need to understand the species that you're working with.”
As those early, non-human genomes were being assembled and annotated, RNA-seq data from transcriptome sequencing - the potentially expressed functional elements - was becoming more prevalent. So, too, were longer reads from the sequencers.
“We found that there was quite a bit of variation between different methods and different ways of dealing with that data,” says Swarbreck. “We concluded that there was no single tool out there that we found worked for all situations.
“We were looking for something that would allow us to try to integrate results of all these different transcriptome assemblers.”
That led to the development of Mikado and Portcullis, which were essential tools in the huge global effort to sequence, assemble and annotate the genome of bread wheat - a major milestone for such a crucial source of food.
Mikado takes its name from the traditional pick-up-sticks game. According to Dr Luca Venturini, one of its former developers, the tool aims to “imagine genes as sticks and to capture the ones with the highest value without getting the others.”
What Mikado does, essentially, is find more real genes - filtering out false positives and identifying where there might have been false negatives. Gene duplications are one example: some software may mistakenly report two very similar genes as a single gene.
“At the time we were working as part of the International Wheat Genome Sequencing Consortium and wanted a transparent approach that would enable us to integrate two alternative gene sets created by our collaborators,” says Swarbreck. “We made some tweaks to Mikado, and used it as a method to bring these two gene sets together, essentially cherry picking the ‘better’ models from the two annotations.”
Since then, the group has been integral to many genome sequencing projects, from various plant and tree species through to insects, fish, rodents, fungi and numerous others.
Now, the aim is to integrate these tools to tackle the biggest prize of all - the Darwin Tree of Life Project that aims to sequence the DNA of all eukaryotic life in the UK.
“We've got a growing number of projects and collaborations that use these tools, but what we have aimed for all along is an easy to run, all-encompassing annotation pipeline,” says Swarbreck. “The solution to that was to develop what we’ve called the reat toolkit.”
That toolkit was most recently used in an effort to produce the best ever reference genome for tilapia - a fish of exceptional importance in global aquaculture. It helped bioinformatician Dr Will Nash produce a comprehensive genome annotation.
Reat contains a module for dealing with a whole variety of transcriptome data, including cDNA, PacBio, nanopore, and short reads, and there are different workflows for those different types of data as well.
There's also a module for dealing with homology data from protein alignments, together with a gene prediction module - and at the end of that a consolidated gene annotation across all these different methods.
“Rather than try to generate a single set of models for each project, we generate lots of different gene models through different routes,” explains Swarbreck. “We then use our Minos pipeline to bring these all together and select the final ‘best’ models.
“Rather than putting all our eggs in one basket, we accept that it’s best to vary parameters, the choice of tools, and the inputs into these tools. We can achieve a higher quality final annotation and have an approach that is more robust across projects by generating alternative gene models.
“Ultimately, you need to have some way of making a final selection. Minos provides us with a means of making that selection that we can control, allowing us to review and tweak as required.”
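The general idea of generating alternative models and then choosing between them can be sketched in a few lines of Python. This is an illustration only and not how Minos is implemented: the `GeneModel` class, the `select_best_models` function, the single `score` value and the example data are all assumptions made for the sketch. Here, overlapping models on the same strand are grouped into a locus and the highest-scoring model at each locus is kept.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneModel:
    """A candidate gene model from one annotation route (hypothetical fields)."""
    model_id: str
    seqid: str
    start: int     # 1-based, inclusive
    end: int       # 1-based, inclusive
    strand: str
    source: str    # which route produced the model
    score: float   # assumed: one combined quality score, higher is better


def select_best_models(models):
    """Group overlapping same-strand models into loci; keep the top scorer of each."""
    ordered = sorted(models, key=lambda m: (m.seqid, m.strand, m.start, m.end))
    selected = []
    locus = []        # models in the locus currently being built
    locus_end = None  # right-most coordinate covered by that locus
    for model in ordered:
        same_locus = (
            bool(locus)
            and model.seqid == locus[0].seqid
            and model.strand == locus[0].strand
            and model.start <= locus_end
        )
        if same_locus:
            locus.append(model)
            locus_end = max(locus_end, model.end)
        else:
            if locus:
                selected.append(max(locus, key=lambda m: m.score))
            locus = [model]
            locus_end = model.end
    if locus:
        selected.append(max(locus, key=lambda m: m.score))
    return selected


# Invented candidates from three different routes.
CANDIDATES = [
    GeneModel("t1", "chr1", 100, 900, "+", "transcriptome", 0.82),
    GeneModel("p1", "chr1", 150, 950, "+", "protein", 0.64),
    GeneModel("a1", "chr1", 120, 880, "+", "prediction", 0.71),
    GeneModel("p2", "chr1", 2000, 2600, "+", "protein", 0.90),
    GeneModel("a2", "chr1", 2100, 2500, "+", "prediction", 0.55),
    GeneModel("a3", "chr1", 5000, 5400, "-", "prediction", 0.40),
]

best = select_best_models(CANDIDATES)
print([(m.model_id, m.source) for m in best])
```

In this toy example the transcriptome-derived model wins at the first locus and a protein-derived model at the second, while the third locus keeps its only candidate. A real pipeline would weigh many metrics rather than one score, which is where the ability to review and tweak the selection comes in.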
The reat pipeline is open-source and available on GitHub to anyone who would like to use it. The same is true of Mikado, Portcullis, Minos and a range of other tools and pipelines for genome annotation.
“We are more than happy to keep in touch with anyone interested in using these pipelines,” says Kaithakottil. “People tend to ask questions in the issues section of GitHub, and we gladly help them there. We’ll reply to emails, too, of course!”
There’s also a training course on Genome Annotation, which Kaithakottil would encourage those looking to make use of these pipelines to sign up for.
“The workshop starts from the very basics,” he explains. “We introduce participants to genome annotation and then go through a number of pipelines, including our own, such as Mikado and Minos, as well as some external pipelines.
“We go through metrics and best practice, explore the parameters you should be using, and then cover how to install and run these tools and how to use them most effectively. Thanks to CyVerse UK, trainees can also access the resources via virtual machines to see their outputs, tweak parameters, and modify them to improve their results.”
If you’d like to sign up for the workshop in 2022, registrations end on April 17. Keep your eyes on the events calendar, which is regularly updated, for future events, or sign up to the Earlham Institute monthly newsletter.
This three-day course gives scientists an overview of eukaryotic genome annotation approaches. It covers advances in Next Generation Sequencing (NGS) technologies, transcriptome assembly, and best practice guidance for building gene models using short and long read sequencing data or cross-species proteins. It also covers how to integrate and assess different gene models and create a gene set that is ready for publication or release.