Statistics is important for bioinformatics, writes EI Software Developer Dr Ben Ward, because it helps us solve the sampling problem - getting enough measurements to give us an accurate picture of what’s going on.
If you want to measure what species are in a forest, for example, you can’t survey the whole thing. You have to split it up into sections, quadrats at different locations, and see what species come up again and again.
Once you’ve done that many times over, you can build up a picture from those smaller pieces of information. But you have to do it enough times to make sure what you’re seeing is real.
It’s the same with genomes. We tend to split DNA into smaller pieces, which we then use to build up a bigger picture. We use statistics to measure whether what we’re seeing is real, or whether it’s down to chance, or just wrong. It’s a crucial quality control step.
Let’s say we split DNA up into many pieces and put them through a DNA sequencer. Some of those pieces of DNA might have been sequenced incorrectly. If we only generated one set of results, we’d get an erroneous picture of what is there.
The more times we sequence the same DNA, the more confident we can be in the full picture we are revealing. Sequences that come up only once can then be eliminated - they are very likely to be errors. The rest form a distribution, which usually has the shape of a bell curve.
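To make the counting idea concrete, here is a minimal sketch in Python. It makes some assumptions that are not in the text above: it counts short fixed-length subsequences (often called k-mers), uses made-up toy reads, and treats anything seen fewer than twice as a likely error. The function names and the `min_count` threshold are hypothetical and chosen for illustration only. Real tools are far more careful about where they draw that line.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k subsequence (k-mer) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def abundance_histogram(kmer_counts):
    """Map 'times seen' to 'number of distinct k-mers seen that many times'."""
    return Counter(kmer_counts.values())


def split_likely_errors(kmer_counts, min_count=2):
    """Separate rarely seen k-mers (assumed to be errors) from the rest.

    min_count is an illustrative threshold, not a recommended value.
    """
    likely_errors = {kmer for kmer, n in kmer_counts.items() if n < min_count}
    kept = {kmer: n for kmer, n in kmer_counts.items() if n >= min_count}
    return likely_errors, kept


# Toy data (invented): the same fragment read four times,
# with one miscalled base in the last read.
reads = ["ACGTACGA", "ACGTACGA", "ACGTACGA", "ACGTTCGA"]

kmer_counts = count_kmers(reads, k=4)
histogram = abundance_histogram(kmer_counts)
likely_errors, kept = split_likely_errors(kmer_counts)

for times_seen in sorted(histogram):
    print(times_seen, histogram[times_seen])
print(sorted(likely_errors))
```

In this toy case the single miscalled base produces four subsequences that appear only once, while the genuine ones appear three or four times. With real sequencing data, the histogram built this way is the distribution described above.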
If you run a sequencing experiment and you don’t get a nice statistical distribution, there’s no reason to think that you can assemble a genome with it. But the interesting thing is that you can already get some information purely from the statistics.
Humans are diploid: we carry two copies of each chromosome, one from each parent. Sometimes we inherit the same version of a gene from both our parents, but other times we inherit two different versions. You can see this in the statistical distribution, which (with good data) will have two peaks.
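As a rough illustration of spotting those two peaks, the sketch below looks for local maxima in a histogram of how often each sequence was seen. The histogram values are invented for the example, and `find_peaks` and its `first_bin` parameter are hypothetical names. It also assumes the leftmost bins hold likely errors and can be skipped. Real data is noisy and would normally need smoothing before a simple scan like this is reliable.

```python
def find_peaks(histogram, first_bin=1):
    """Return the indices of local maxima in a list of counts.

    histogram[i] is the number of distinct sequences seen i times.
    Bins below first_bin are skipped (assumed to be dominated by errors).
    """
    peaks = []
    for i in range(max(first_bin, 1), len(histogram) - 1):
        if histogram[i] > histogram[i - 1] and histogram[i] >= histogram[i + 1]:
            peaks.append(i)
    return peaks


# Invented toy histogram: a tall error spike at 1, then two humps.
toy_histogram = [0, 90, 30, 8, 5, 9, 20, 34, 22, 12,
                 10, 15, 28, 47, 60, 45, 25, 10, 3, 1]

peaks = find_peaks(toy_histogram, first_bin=2)
print(peaks)
```

In this toy example the scan reports one peak at 7 and another at 14. The numbers were chosen so that the smaller peak sits at half the coverage of the larger one, which is the pattern you would expect if one peak comes from sequence that differs between the two inherited copies and the other from sequence they share.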
In the animation below, you can see a representation of a statistical distribution of genome data. The black bars to the left represent likely errors, while the red bars represent the likely statistical distribution of the data.