Bread wheat poses many challenges for genome assembly, starting with the fact that it actually combines three plant genomes in one. The sheer size of the genome, five times that of our own, together with an abundance of repetitive elements (which make up about 80% of the DNA sequence) and the duplicated, highly similar genes present across three sets of seven chromosomes, adds to the challenge.
After the initial release of the wheat genome in 2014, Earlham Institute were quick to follow this up with an improved assembly in 2015. For this, Bernardo Clavijo made major modifications to the DISCOVAR software (developed by the Broad Institute in the USA for the analysis of human genomes) so that it could distinguish repeat sections and give us maximal coverage of the genome.
To achieve this, newly generated, high-quality input data was required, made possible by the high capacity and skill of EI’s Genomics Pipelines team. Significant computing power was also needed; this was provided by EI’s high performance computing facilities, which were specially configured to run the three-week-long assembly. In fact, we won a supercomputing award for our wheat research, for the ‘best use of HPC application in life sciences’.
To help ensure that data could be shared amongst the research community, in the spirit of open science and as a founding principle of the UK Wheat Initiative, the new genome was first made available on EI’s Grassroots Genomics platform for BLAST searches. The full data set, including annotated genes, was later made available on EBI’s Ensembl Plants.
Grassroots Genomics is still going strong. Under the leadership of Rob Davey, Xingdong Bian and Simon Tyrell design and manage the platform, which provides a versatile data repository and analytical services and enables marker-assisted breeding through a 100% open source infrastructure that is freely available to researchers and the public.
Another outcome of EI’s work on the first wheat genome was the release of w2rap: a bioinformatics pipeline that can decipher complex genomes, not just that of wheat, and produce robust assemblies in conjunction with the best next-generation genome sequencing methods.