Why code? Bioinformatics for decoding living systems
To celebrate National Coding Week, we ask some of our bioinformaticians here at EI why they code.
Here at the Earlham Institute, we process a vast amount of plant, microbial, animal and human DNA data, and to decipher and analyse this genomic ‘big data’ we need coders. That’s where our highly-skilled bioinformaticians come in - developing software and algorithms to decode living systems, model biology and accelerate scientific discovery.
I code for three main reasons. First, I like to solve problems and build solutions but I am not so skilled with my hands, so this is as close to creating as I can get. Second, there is a certain beauty in writing lines of strange symbols and words of power that allow you to do so many different things; it is a bit like magic. Third, it pays the bills – more or less.
As a child, I wanted to be a game programmer (at some point). So, I talked my parents into buying me a book about game programming, which to this day has never been used properly (and is now terribly obsolete). I remember some bits of programming in the “information technology class” at school and I dabbled a bit here and there, trying to tackle different problems (e.g. building tools to “hack” the interface of the old AOL [America Online, an ancient internet service provider] client to do all kinds of fun things), but I didn’t really understand how code and algorithms work until the late 90s/early 2000s, i.e. during my transition from high school to civil service and then to university.
Currently, I am constantly adding bits and pieces to Earlham Institute’s bacterial genome assembly and annotation pipeline (bgrr|, pronounced “b-girl”), which is used to process and make sense of large customer and collaborator microbial sequencing data sets. I am also in the process of building bioinformatics pipelines and tools for different projects, e.g. for methylation analysis, transcriptome functional annotation, and CRISPR off-target activity.
I think what I’m most proud of would be bgrr|, as it is finally something that others can and will use and that has actually been playing a role in various internal and external projects. I have written various tools and bioinformatics web services before and those have seen some use, but they were definitely niche products and mostly for my own use (e.g. during my bachelor’s, master’s, and PhD research).
For my dream code, I don’t exactly know what it would do, but I’d really love to build something that would be widely known and used in the bioinformatics research community. Something like BLAST, or HMMER, or … It is not likely to happen, but sometimes it is nice to dream a little …
I code because:
1. I am passionate about building things from blocks or putting pieces together.
2. I love being intellectually engaged in what I do while applying logic and philosophy.
3. I am fascinated by using points one and two to solve specific biological problems, and coding fits in quite well.
I got into full-time and extensive coding as a Postdoctoral Researcher at the Earlham Institute. However, prior to this, I did a bit of coding while studying for my PhD at the University of Cambridge, which was split between wet lab and bioinformatics. My PhD programme also gave me the opportunity to decide whether to go into coding.
I am presently working on Non Coding RNA Genomics for the Wheat ISP funded by the BBSRC. This involves employing bioinformatics approaches to understand the evolution of lncRNAs in wheat using genomics, transcriptomics, and epigenomics data. I am particularly looking at how lncRNAs control gene expression, via secondary structure analysis, and how these help crops to adapt to adverse environmental conditions and to resist pests and pathogens.
The coding project that I’m most proud of is parsing PAML (baseml) output files. One of the software packages that I use for the above analysis is PAML, which has the baseml sub-algorithm for determining evolutionary rates. The output files produced by this algorithm normally need to be transformed from a mixture of Newick tree (nested tree-like data) and text format into readable tab-delimited formats, which are eventually used for statistical analysis of the data. I’m proud of it because I was able to deploy advanced programming techniques despite having only started full-time (extensive) coding in April/May 2018.
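To give a flavour of that kind of transformation, here is a minimal Python sketch that turns a Newick tree string into tab-delimited rows. It is not the parser described above and it does not reproduce the layout of a real baseml output file: it assumes the Newick string has already been pulled out of the file, that labels are unquoted, and that there are no bracketed comments. The function names (`parse_newick`, `branch_rows`, `newick_to_tsv`) and the `node1`, `node2` labels given to unnamed internal nodes are made up for illustration.

```python
import csv
import io


def parse_newick(text):
    """Parse a simple Newick string into nested dicts (name, length, children).

    Assumption: unquoted labels and no bracketed comments.
    """
    text = "".join(text.split()).rstrip(";")  # drop whitespace and the final ";"
    pos = 0

    def read_label():
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        return text[start:pos]

    def read_node():
        nonlocal pos
        children = []
        if pos < len(text) and text[pos] == "(":
            pos += 1  # skip "("
            children.append(read_node())
            while text[pos] == ",":
                pos += 1  # skip ","
                children.append(read_node())
            pos += 1  # skip ")"
        name = read_label()
        length = None
        if pos < len(text) and text[pos] == ":":
            pos += 1  # skip ":"
            length = float(read_label())
        return {"name": name, "length": length, "children": children}

    return read_node()


def branch_rows(node, parent="", counter=None):
    """Yield (node, parent, branch_length) tuples in pre-order.

    Unnamed internal nodes get hypothetical labels node1, node2, ...
    """
    if counter is None:
        counter = [0]
    name = node["name"]
    if not name:
        counter[0] += 1
        name = f"node{counter[0]}"
    yield name, parent, node["length"]
    for child in node["children"]:
        yield from branch_rows(child, name, counter)


def newick_to_tsv(text):
    """Return the tree as tab-delimited text with a header row."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["node", "parent", "branch_length"])
    for name, parent, length in branch_rows(parse_newick(text)):
        writer.writerow([name, parent, "" if length is None else length])
    return out.getvalue()


if __name__ == "__main__":
    print(newick_to_tsv("((A:0.1,B:0.2):0.05,C:0.3);"))
```

Each row holds one branch of the tree, so the resulting table can be loaded straight into a statistics package. A real pipeline would also need to locate the tree inside the surrounding text output, which this sketch leaves out.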
My dream code would be one that employs object-oriented programming and draws on data analytics and machine learning tools.
I code for work and for pleasure. Even then, the pleasure projects include bioinformatics ones, as I like to program stuff that is meaningful, and I know what itches I want to scratch in that field.
I got into coding during my undergraduate degree when learning statistics. My professor introduced me to R and Linux and I never looked back. It took me a while to appreciate that R was indeed a Turing complete language and that I was coding, as I used to think of it as just a sophisticated command or macro set for some interactive stats program. Eventually I started using other languages, and picking up new ones became increasingly easy.
I’m currently working on the BSG (Basic Sequence Graph) toolkit, as part of the Clavijo Group, and on BioJulia. BSG is a software toolset our group is developing both as a tool for us to experiment with assemblies and methods, and as a kit that people can actually use to assemble genomes using a variety of data types and sources (Illumina, PacBio, Nanopore etc.).
I’m most proud of BioJulia, as it’s the project that has helped me grow most as a coder, and it really helps other people. ‘Julia’ is a high-level language, but to use it most effectively you have to understand (loosely anyway) what it is that the compiler does, and how it produces optimised LLVM and assembly code.
That might not sound user-friendly, but Julia has some built-in utilities that you can use to highlight the problem areas of your code and help you see what is wrong: for example, where you are allocating lots of memory or are not writing “type-stable” code.
I won’t go too much into what makes code type-stable here, except to say there’s a great article where the author writes a function in type-stable Julia code, and then writes the same function (it does the same job) in a non-type-stable style. He then compares their performance and shows you the lower-level code the compiler generates for those two functions, and how they differ. This ability to easily inspect how your Julia code gets compiled really helps you get a grasp of the kinds of programming habits that are good and bad for performance, and how some of those habits translate to other languages too.
If I could choose a dream code, I’d like it very much if the O.A.S.I.S. from Ready Player One was a thing that existed.
Sabrina Ward, Software Developer, Earlham Institute.