10 things you need to know about getting into machine learning
So you want to be a bioinformatics data scientist (otherwise known as a bioinformatician), huh? Perhaps you want to learn a little bit of machine learning to sink your teeth into the abundant data that we’re generating nowadays and don’t know what to do with?
Here are ten things you should know before you start your data science journey: from the very basics through to some tips on what to do and, most importantly, what not to do.
1. Should I learn Python or R?

This is probably the question I get asked the most: “Matt, should I learn Python or R, or is it worth learning C++ or Java…”. The answer isn’t as clear-cut as it may seem. Although I work with Python for all of my code in production, I will use other languages when they are better suited.
For example, for plotting and visualising data I may use the R package ggplot2, or if I am working with large graph data structures that need a more efficient language, I may instead use C++. It’s about knowing the best tool for the job. Sometimes that will be Python and other times it will be R. I recommend being comfortable with both if possible, but if you are just getting started, Python is still king.
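The boundary between the two is blurrier than it first appears, too. As a small, hedged aside: the ggplot grammar is not locked to R. The sketch below uses plotnine, a third-party Python library that mimics ggplot2; it is purely an illustration, not part of the workflow described above, and the data is made up.

    import pandas as pd
    from plotnine import ggplot, aes, geom_point  # plotnine: a ggplot-style library for Python

    # Toy data purely for illustration
    df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2.1, 3.9, 6.2, 8.1]})

    # The same grammar-of-graphics idiom that ggplot2 users know, from Python
    plot = ggplot(df, aes(x="x", y="y")) + geom_point()
    plot.save("scatter.png")  # writes the figure to disk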
2. Know your buzzwords

The title of this article is full of buzzwords, in a field filled with buzzwords and technical terminology that can be hard to get your head around. So, in fact, is the rest of this article, so we’ve written you a handy key.
Machine learning (ML) is not the same as artificial intelligence (AI). ML is a subset of AI which aims to learn from data and improve with experience without being explicitly programmed to do so. AI, by contrast, is a buzzword used broadly to cover many different techniques employed by computer and data scientists; it typically turns up in presentations, grants and newspaper headlines, and is used to sell exciting new products.
3. Garbage in, garbage out

You can have the most advanced algorithm in the world, running on the most powerful computing infrastructure, but if you don’t start with good-quality data then you may as well not bother: you’ll get garbage out.
Although there are some feature engineering tricks or generative models you can use to lessen the impact of poor data, ultimately the model can only be as good as the data you give it.
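A few minutes of inspection before any modelling goes a long way. Here is a minimal sketch with pandas; the file name and its columns are hypothetical.

    import pandas as pd

    df = pd.read_csv("samples.csv")  # hypothetical input file

    # Quick sanity checks before any modelling
    print(df.isna().sum())        # missing values per column
    print(df.duplicated().sum())  # duplicated rows
    print(df.describe())          # ranges that may reveal unit errors or outliers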
4. Use the simplest model for your problem

So you have your data with 200 observations; you have cleaned and transformed it, and now you have built three models. During model evaluation you can see that the deep learning model is performing best, with the traditional machine learning models performing almost as well.
So, which is the best model?
The problem is that the more complex the model, the more parameters you have to tune, and the easier it is to end up ‘overfitting’ the model to your data (of which, remember, you don’t have much in the first place). An overfitted model won’t respond particularly well to ‘real world’ data (it won’t ‘generalise’, in ML speak), which was the whole point of building the model in the first place.
As a general rule of thumb in machine learning, try to use the simplest model that fits your problem.
But, like most rules, there are situations where it’s OK to break them. This is particularly the case with biological data, where the high dimensionality (lots of variables to consider) means simpler models begin to lag behind more complex ones. In that case a more complex model with an appropriate validation scheme, e.g. K-fold cross-validation or a holdout set, would be acceptable. This, in combination with regularisation methods, lets you tackle overfitting, so you still end up with a generalisable model.
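As a minimal sketch of what such a comparison might look like in scikit-learn (the data is synthetic, standing in for the 200 observations above, and the two models are arbitrary stand-ins for ‘simple’ and ‘complex’):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Synthetic stand-in for a small, high-dimensional dataset
    X, y = make_classification(n_samples=200, n_features=50, random_state=0)

    # An L2-regularised linear model versus a more complex ensemble
    simple = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
    complex_model = RandomForestClassifier(n_estimators=500, random_state=0)

    # 5-fold cross-validation gives a more honest estimate of generalisation
    for name, model in [("logistic regression", simple), ("random forest", complex_model)]:
        scores = cross_val_score(model, X, y, cv=5)
        print(f"{name}: {scores.mean():.2f} +/- {scores.std():.2f}")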
5. There is no single best model

A shift in best practice has seen data scientists move away from having one favourite model. Comparing the performance of many different models has made it clear that the best ‘learner’ differs from task to task. Model selection has therefore been extended from simply tuning the parameters of one model to empirically comparing the performance of several.
We can also combine the top-performing models for a given task using a method called ‘ensembling’. This creates a committee of models which combine their individual predictions through ‘voting’, and it can produce a much more powerful model with little extra effort.
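A hedged sketch of a voting ensemble in scikit-learn (synthetic data again; the three learners are arbitrary choices for illustration):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier, VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=200, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # A 'committee' of three different learners that vote on each prediction
    ensemble = VotingClassifier(
        estimators=[
            ("lr", LogisticRegression(max_iter=1000)),
            ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
            ("svm", SVC(probability=True, random_state=0)),
        ],
        voting="soft",  # average predicted probabilities; "hard" counts majority votes
    )
    ensemble.fit(X_train, y_train)
    print(ensemble.score(X_test, y_test))

Soft voting averages the predicted class probabilities, which tends to work better than a hard majority vote when the base models produce reasonable probability estimates.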
6. Correlation does not imply causation

This is a common misconception, and a concept which can be difficult to grasp at first. If we see two variables in the data changing in the same way at the same time, surely one must cause the other? Not always.
A classic example is global average temperature and the approximate number of pirates. Between 1820 and 2000 there was a steady decrease in the number of pirates worldwide, while global temperatures increased year on year. The two variables are correlated, but the relationship is not causal: climate change is responsible for many things, but a reduction in the number of pirates is unlikely to be one of them.
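You can reproduce the effect with any two trending series. A toy sketch, with numbers invented purely to make the point:

    import numpy as np

    rng = np.random.default_rng(0)
    years = np.arange(1820, 2001)

    # Two series that trend over time for unrelated reasons (illustrative values only)
    pirates = np.linspace(45000, 17, len(years)) + rng.normal(0, 500, len(years))
    temperature = np.linspace(13.7, 14.5, len(years)) + rng.normal(0, 0.1, len(years))

    # Strongly (negatively) correlated, yet neither causes the other
    print(np.corrcoef(pirates, temperature)[0, 1])  # close to -1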
Machine learning models are very good at finding underlying structure in data, but they are not very good at identifying underlying causes. This is not always a problem; sometimes the brief is simply to achieve the best predictive performance. But where explaining the data matters (as it particularly does in the biological sciences), some other form of judgement is needed to interpret what the models have found, and for now that judgement is usually human.
7. Most of your time goes on preparing data

Building the model is one small part of the job. Data cleaning, transformation and feature engineering take up the vast majority of your time. But what does that entail?
It means getting raw data into, say, a single dataset, and then extracting features (sets of variables derived from that data) that better represent the signal in it. This step has several purposes: it ensures data quality, it gets the data into a shape a model can actually use, and it makes sure the data is representative of the problem you are trying to solve.
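As a small, hedged example of what ‘extracting features’ can mean in practice (every column name here is invented):

    import pandas as pd

    # Toy raw data: one row per sample, with hypothetical columns
    raw = pd.DataFrame({
        "sample_id": ["S1", "S2", "S3"],
        "collected": ["2021-01-04", "2021-02-11", "2021-03-02"],
        "reads": [1_200_000, 980_000, 1_450_000],
        "mapped": [1_080_000, 690_000, 1_390_000],
    })

    features = pd.DataFrame({
        # A ratio is often more informative than either raw count
        "mapping_rate": raw["mapped"] / raw["reads"],
        # Dates become usable once converted to something numeric
        "month": pd.to_datetime(raw["collected"]).dt.month,
    })
    print(features)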
8. Can’t machine learning just do all of this for me?

No. Simple as that. Machine learning is not a magic bullet.
The worst offender here is deep learning, i.e. models built on deep neural networks. Although these are incredibly powerful models, and they do include automated feature engineering stages within the model itself, they take a large amount of time to tune and are no substitute for data cleaning and transformation.
9. How do I get good at machine learning?

Practice and patience. There’s no book, course, degree or tutorial that will give you more understanding than actually playing with your data. Visualise it, understand it and experiment with different models. Be critical, question every result you get and, most importantly, don’t be afraid to show your models and results to others, who will be more than happy to tell you what you might have missed.
10. Machine learning raises ethical questions

A hot topic at the moment, and one with a wide range of views, is the effect that machine learning, and deep learning in particular, might have on society. Some of these issues are playing out right now, as when Microsoft lost control of its AI Twitter bot, while others appear to be born from science fiction, the famous example being SkyNet. There are also issues you may encounter yourself as soon as you begin your data science journey.
One example is unconsciously biased machine learning models. If not checked properly, a model can become unintentionally biased towards a particular result, which ultimately leads to discrimination.
A model which is not checked correctly can also ‘cheat’. Suppose, for example, that the samples infected with a virus have an ID containing a “V”, while the uninfected samples have IDs containing a “W”. If the ID is given to the model as a feature, the model will appear to achieve a near-perfect score when classifying samples. What it is actually doing is ignoring the meaningful signals and patterns in the data and learning a shortcut from the ID instead. The problem therefore lies not with the machine learning models, but with the people using them.
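Here is a hedged sketch of that failure mode (entirely synthetic data; the “V”/“W” encoding mirrors the example above):

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    n = 200
    infected = rng.integers(0, 2, n)

    # The sample ID encodes the label: "V..." for infected, "W..." for not
    id_letter = ["V" if label else "W" for label in infected]
    X = pd.DataFrame({
        "id_is_v": [1 if letter == "V" else 0 for letter in id_letter],  # leaked feature
        "noise": rng.normal(size=n),                                     # genuinely uninformative
    })

    # Near-perfect score, but the model has learnt the ID, not the biology
    model = DecisionTreeClassifier(random_state=0)
    print(cross_val_score(model, X, infected, cv=5).mean())  # ~1.0

Dropping the ID-derived column sends the score back to chance, which is exactly the kind of check worth doing before trusting a suspiciously good result.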
Matthew Madgwick & Josh Colmer are funded as part of a Doctoral Training Programme on the Norwich Research Park.