KAT is particularly useful for assembly validation. In diploid genomes, which carry two copies of each gene, regions can often be falsely duplicated or deleted during assembly. KAT can help to detect these artefacts by tracking both the data generated by the sequencer and the output of the assembler.
This sort of analysis is particularly useful at EI, where we sequence a diverse range of organisms - some of which are not only diploid, but tetraploid (pasta wheat), hexaploid (bread wheat) or even octoploid (in the case of strawberries).
The nice trick of KAT is that it carries out an internal back-check of your own assembly, covering the completeness and accuracy of the data, using just the input and the output.
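To make the idea of checking an assembly against its own input concrete, here is a minimal sketch in Python. It is not KAT's implementation: the function names `count_kmers` and `compare_reads_to_assembly` are hypothetical, the sequences are toy data, and it ignores reverse complements and sequencing errors, which a real tool has to handle. It simply counts K-mers in the reads and in the assembly, then reports K-mers that the assembly has lost and K-mers that the assembly contains more than once.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count every overlapping substring of length k across the sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def compare_reads_to_assembly(reads, contigs, k):
    """Illustrative comparison of read K-mers with assembly K-mers.

    Returns:
      missing  - K-mers seen in the reads but absent from the assembly
                 (possible deleted or unassembled sequence)
      repeated - K-mers that occur more than once in the assembly
                 (candidates for false duplication, or genuine repeats)
    """
    read_counts = count_kmers(reads, k)
    asm_counts = count_kmers(contigs, k)
    missing = {kmer for kmer in read_counts if kmer not in asm_counts}
    repeated = {kmer: n for kmer, n in asm_counts.items() if n > 1}
    return missing, repeated


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    reads = ["ACGTAC", "CGTACG"]
    contigs = ["ACGTA"]
    missing, repeated = compare_reads_to_assembly(reads, contigs, k=4)
    print("Missing from assembly:", sorted(missing))
    print("Repeated in assembly:", repeated)
```

In practice the interesting signal comes from comparing how often a K-mer appears in the reads with how often it appears in the assembly, across the whole spectrum, rather than from a simple present-or-absent test like this one.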
Bernardo added: “For the wheat genome, we checked the K-mer spectra all the way through using KAT, which means we could run the whole thing once, rather than running 20 different parameters and searching for the best one. With wheat, this would have been ridiculous - in terms of both computational power and cost.”
Before KAT, a lot of money and effort could be put into a sequencing project, only to find out it’s wrong at the end. With KAT, you know that your data is good, and you can validate your results at every stage.
The development of KAT was led by Bernardo Clavijo and Dan Mapleson, with George Kettleborough, Gonzalo Garcia and Jon Wright.