Session 1: Introduction, environment and text manipulation
In this session we will introduce Python and get comfortable with the basic syntax of the language and with our development environment. We will start with an overview of our plans for the week and take care of any housekeeping details (like coffee breaks and catering arrangements). We will also use some of our time to confirm that everybody has a suitable programming environment, with the correct software and packages installed and no version problems.
To get started with programming, we'll introduce some examples of tools for working with text and show how they work in the context of biological sequence manipulation. We'll also cover the different types of errors and error messages, and learn how to go about fixing them methodically. Core concepts introduced: terminals, standard output, variables and naming, strings and characters, special characters, output formatting, statements, functions, methods, arguments, comments.
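As a flavour of this, here is a minimal sketch of string manipulation applied to a made-up DNA sequence (the sequence and variable names are just for illustration):

    # a short made-up DNA sequence stored in a variable
    my_dna = "ATGCGTGACCTGAAA"

    # strings come with built-in functions and methods
    print("length:", len(my_dna))
    print("number of A bases:", my_dna.count("A"))

    # replace() returns a new string, so we can transcribe DNA to RNA
    my_rna = my_dna.replace("T", "U")
    print("RNA version:", my_rna)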
Session 2: Lists and loops
In this session we'll start by thinking about the kinds of programs that we need to write to build and test machine learning models. An important idea is that we want to write programs that can deal with arbitrary amounts of data. In order to do so, we need two things: a way of storing large collections of values, and a way of processing them. In Python, lists and loops do these jobs respectively.
We'll go over the new syntax needed for each, and see how together they allow us to write programs that are much closer to being useful in the real world. This new syntax will allow us to see how lists, strings and files all share similar behaviour and how we can take advantage of that fact to write concise code. In the practical session we'll tackle some problems that involve iteration, as a necessary prerequisite for many aspects of ML work. Core concepts introduced: lists and arrays, blocks and indentation, variable scoping, iteration and the iteration interface, ranges.
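A minimal sketch of the pattern, using a made-up list of sequences:

    # a list stores an arbitrary number of values...
    sequences = ["ATGC", "ATGCGT", "AT", "ATGCGTGACC"]

    # ...and a for loop processes each one in turn
    for seq in sequences:
        # the indented block runs once per element
        print(seq, "has length", len(seq))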
Session 3: Intro to Pandas and thinking in columns
In this session we address the main difference between working with core Python objects and working with pandas: the need to think about operating on entire columns of values rather than one value at a time. Looking at a large number of examples will help to make this clear. Once we start thinking in this way, we'll find that we can do many common data processing tasks - filtering rows and columns, creating new columns, sorting, and summarizing columns - with very little code.
After a look at some special types of filtering that require slightly different syntax, we are in a position to practice solving some fairly tricky data analysis questions that involve a mixture of selecting, filtering and aggregating columns. This session will also give us a chance to introduce some of the datasets that we will be using in the machine learning part of the course.
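A minimal sketch of this column-wise style, using a small made-up table (in the course we'll load real datasets, e.g. with pd.read_csv):

    import pandas as pd

    # a tiny made-up table standing in for a real dataset
    df = pd.DataFrame({
        "species": ["mouse", "rat", "mouse", "human"],
        "length":  [120, 340, 95, 210],
    })

    # filtering rows: the comparison applies to the whole column at once
    long_records = df[df["length"] > 100]

    # creating a new column from an existing one, again column-wise
    df["length_kb"] = df["length"] / 1000

    # grouping and summarising a column in a single line
    print(df.groupby("species")["length"].mean())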
Session 4: Introducing seaborn
In this session we will turn our attention from data analysis (which normally produces tables of values as output) to data visualization (which produces figures as output). We'll start with an overview of the seaborn package then dive straight in to the core chart types for looking at distributions and relationships.
Histograms, kernel density plots and scatter plots are covered in this session, along with a few more exotic chart types like hex plot and contour plots, which can be useful alternatives to scatter plots when we have very large numbers of points to deal with. In this session we will also explore the power of seaborn's ability to map dataframe columns to things like marker size, shape and colour, and to easily make small multiple plots. Just like with pandas, by the end of this session we'll understand how to make complex charts with only a small amount of code. In the machine learning parts of the course we will rely heavily on visualisation to understand the behaviour and performance of our models.
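A minimal sketch of the idea, using one of seaborn's bundled example datasets rather than the course data:

    import seaborn as sns
    import matplotlib.pyplot as plt

    # one of seaborn's built-in example datasets
    penguins = sns.load_dataset("penguins")

    # a scatter plot where dataframe columns are mapped to x, y, colour and panels
    sns.relplot(
        data=penguins,
        x="bill_length_mm",
        y="bill_depth_mm",
        hue="species",   # colour encodes one column
        col="sex",       # small multiples: one panel per value of another column
    )
    plt.show()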
Session 5: Background to machine learning
In this session we cover some background:
- the relationship between machine learning and AI
- classical machine learning vs. deep learning
- the essential concept of learning from data.
We discuss some of the ways that we can organise ML approaches:
- supervised vs. unsupervised
- regression vs. classification
and start to consider the spectrum between simple and complicated methods.
We point out how a number of properties of ML methods tend to scale together:
- parameter count
- computational requirements
- data requirements
- interpretability.
Next, we turn to the practicalities of using ML methods for research: how datasets will be obtained, the difference between training and inference, and the use of pre-training and fine-tuning. This leads to an overview of universal issues when using ML: how to score and evaluate models, how to choose between models, how to visualise their behaviour, and how feature engineering and selection fit into an ML workflow.
Session 6: Core concepts of classification
In this session we dive straight in to a simple one-feature classification problem. We start off writing a manual classifier which allows us to get used to the core concepts of features/classes. We can use this very simple example to address two of the most important questions for understanding ML models:
- how can we score them, and
- how can we visualise their behaviour.
We also look at the concept of a confusion matrix - this will be important later when talking about different scoring metrics.
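As a minimal sketch of what such a manual classifier might look like (the measurements, classes and threshold are all made up):

    import pandas as pd

    # made-up single-feature data: a measurement and a known class for each sample
    measurements = [2.1, 5.3, 1.8, 7.0, 4.9, 6.2]
    true_classes = ["small", "big", "small", "big", "big", "big"]

    # classify by comparing each measurement to a hand-picked threshold
    predictions = []
    for value in measurements:
        if value > 5.0:
            predictions.append("big")
        else:
            predictions.append("small")

    # score: what fraction of predictions were correct?
    correct = sum(1 for p, t in zip(predictions, true_classes) if p == t)
    print("accuracy:", correct / len(true_classes))

    # a confusion matrix shows which classes get mixed up with which
    print(pd.crosstab(pd.Series(true_classes, name="true"),
                      pd.Series(predictions, name="predicted")))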
Taking a detailed look at our manual classifier reveals that this approach will not scale for a number of reasons, so we give an intuitive explanation of the K-Nearest-Neighbours algorithm. Pandas allows us to write a simple implementation, and we can use the tools we've already built to contrast this with our manual classifier. Now that we have a parameterised algorithm we can discuss the idea of systematic parameter searching that will form the basis of training more complex models.
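A minimal sketch of the KNN idea in pandas, using the same kind of made-up single-feature data:

    import pandas as pd

    # made-up training data: one feature plus a known class for each row
    train = pd.DataFrame({
        "length": [2.1, 5.3, 1.8, 7.0, 4.9, 6.2],
        "label":  ["small", "big", "small", "big", "small", "big"],
    })

    def knn_predict(new_value, k=3):
        # distance from the new point to every training point, column-wise
        distances = (train["length"] - new_value).abs()
        # take the k closest training rows and return their most common label
        nearest = train.loc[distances.nsmallest(k).index]
        return nearest["label"].mode()[0]

    print(knn_predict(2.5))   # expect "small"
    print(knn_predict(6.0))   # expect "big"

The number of neighbours k is the parameter we can search over systematically.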
At this point we can cover in detail an incredibly important point: the division of data into training/test sets - why this is necessary and how to do it. We also introduce the idea of cross validation.
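A minimal sketch of a random train/test split done with pandas alone (made-up data again):

    import pandas as pd

    # a made-up feature table standing in for our real dataset
    df = pd.DataFrame({
        "length": [2.1, 5.3, 1.8, 7.0, 4.9, 6.2, 3.3, 5.8],
        "label":  ["small", "big", "small", "big", "small", "big", "small", "big"],
    })

    # randomly assign 75% of rows to the training set...
    train = df.sample(frac=0.75, random_state=42)

    # ...and keep the remaining rows, which the model never sees, as the test set
    test = df.drop(train.index)
    print(len(train), "training rows,", len(test), "test rows")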
Exercise: building a KNN classifier for a new dataset, including parameter optimisation, visualisation and scoring. At the end, having two working examples lets us pinpoint what is common to all classification problems and get an intuitive sense of how this will apply to more complicated models.
Session 7: sklearn and adding features
In this session we have two main goals:
- explaining the architecture of the sklearn package
- getting started with the idea of feature engineering.
We start by quickly recapping the roles of sklearn, numpy, pandas and friends in the Python ML ecosystem, before diving into the practicalities of how to use sklearn. Spending a little time now on the data model and how models are represented will save a lot of time later on. This allows us to quickly reproduce the workflow from the previous session with a fraction of the code.
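A minimal sketch of the shared sklearn pattern, using a bundled example dataset in place of the course data:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # every sklearn model follows the same steps: create, fit, predict/score
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(X_train, y_train)           # learn from the training data
    predictions = model.predict(X_test)   # predict labels for unseen data
    print("test accuracy:", model.score(X_test, y_test))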
We can point out explicitly how using existing implementations of ML algorithms lets us focus on higher-level concerns, and briefly cover a few ideas and best practices that are more easily explained using sklearn (see the sketch after this list):
- stratified splitting
- balanced/unbalanced datasets
- potential pitfalls of sorted data.
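For example, stratified splitting is a one-argument change with sklearn (sketched here with a bundled dataset):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)

    # stratify=y keeps class proportions the same in the train and test sets,
    # which matters for unbalanced data and guards against sorted input files
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, shuffle=True, random_state=0)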
We can now do something that would be hard with our manual implementation but is very easy with sklearn: introduce extra features to our classification problem. We use the visualisation approaches that we've already learned to see the effect of additional features, and illustrate the importance of scaling - another aspect of feature engineering. This is a natural time to mention a few other types of feature engineering.
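A minimal sketch of scaling, with two made-up features on very different scales:

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    # made-up features on very different scales (e.g. metres vs. milligrams)
    X = np.array([[1.7, 70000.0],
                  [1.6, 55000.0],
                  [1.8, 82000.0]])

    # fit the scaler on training data only, then apply it to any data
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    print(X_scaled)   # each column now has mean 0 and standard deviation 1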
Now that we have a two-dimensional classifier, we can really investigate the effect of different parameters on the classifier's behaviour, and see how it fits into the universal issue of over/under-fitting. We finish on the idea of feature selection: given a dataset with many potential features, how to scalably pick useful ones. There are some intuitive approaches like sequential selection, and some univariate ones that leverage existing statistical knowledge.
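Both flavours of feature selection are available in sklearn; a minimal sketch, again with a bundled dataset:

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest, SequentialFeatureSelector, f_classif
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)

    # univariate selection: score each feature on its own with a statistical test
    best_two = SelectKBest(score_func=f_classif, k=2).fit(X, y)
    print("univariate scores:", best_two.scores_)

    # sequential selection: greedily add the feature that most improves the model
    sfs = SequentialFeatureSelector(KNeighborsClassifier(), n_features_to_select=2)
    sfs.fit(X, y)
    print("chosen features:", sfs.get_support())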
Exercise: taking a new dataset with many features, experiment with feature selection to optimise a classification model, being sure to pay attention to scaling and overfitting.
Session 8: Binary classification and new models
In this session we are going to cover two main topics: binary classification (as opposed to the multiclass problems we have been tackling so far) and some new algorithms. First, we will introduce binary classification as a particularly common and useful type of classification problem. Contrasting it with the kind of classification problem we've been looking at so far, we see that it makes some aspects of our workflow easier, and some (particularly scoring) harder.
We can refer back to our previous discussion of confusion matrices to explain recall/sensitivity/specificity/true positives/negatives and the trade-offs between them. At this point we can also discuss how to represent categorical data as features. The subtleties of different types of encoding (ordinal, one-hot, etc) and the practicalities of how to create them are another chance to touch on the importance of feature engineering to avoid bias.
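A minimal sketch of two common encodings, using a made-up categorical column:

    import pandas as pd

    # a made-up categorical feature
    df = pd.DataFrame({"habitat": ["forest", "desert", "forest", "ocean"]})

    # one-hot encoding: one 0/1 column per category, with no implied ordering
    print(pd.get_dummies(df, columns=["habitat"]))

    # ordinal encoding: a single integer column, which does imply an ordering
    df["habitat_code"] = df["habitat"].astype("category").cat.codes
    print(df)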
We will also go into a fair amount of detail on two completely new algorithms (see the sketch after this list):
- support vector machines (SVM)
- decision trees.
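Because sklearn models share an interface, trying both is a small change; a minimal sketch with a bundled dataset:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # swapping algorithms is a one-line change thanks to the shared interface
    for model in [SVC(kernel="rbf"), DecisionTreeClassifier(max_depth=3)]:
        model.fit(X_train, y_train)
        print(type(model).__name__, "accuracy:", model.score(X_test, y_test))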
With three entirely different classification algorithms in front of us we can use visualisation tools to compare their behaviour and to think about their interpretability. We can also start some benchmarking to directly measure their computational requirements, and some scoring to explicitly compare them and begin to answer an overarching question of many ML projects: how do we choose which type of model to use?
Exercise: with a complex dataset with mixed data types, carry out preprocessing with feature engineering/selection, then compare KNN, decision tree and SVM models. This exercise will show how different criteria for success in terms of recall/precision will sometimes lead to different model choices.
Session 9: Regression
Having spent a good deal of time discussing and solving classification problems, we now turn our attention to the other common type of ML problem: regression. After introducing a simple example, we can point out similarities to and differences with classification, particularly with regard to our practical workflow. Visualisation and scoring will be very different, but many ideas around feature engineering/selection and parameter searching will be the same.
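A minimal sketch of the regression workflow, using a bundled example dataset in place of the course data:

    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split

    # a bundled regression dataset: the target is a number, not a class
    X, y = load_diabetes(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = LinearRegression()
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)

    # scoring looks different from classification: average error, not accuracy
    print("mean absolute error:", mean_absolute_error(y_test, predictions))
    print("r squared:", model.score(X_test, y_test))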
Despite the classification/regression dichotomy, a few examples will make it clear that many problems can be stated in both regression and classification terms, and that many algorithms work for both with slight tweaks. The behaviour of regression models with categorical features leads to particularly interesting visualisations which allow us to check our intuitive understanding of how the algorithms work. We look at the effect of feature count on interpretability, and think about another way to organise ML methods based on whether prediction changes are linear or stepwise.
At this point we will introduce a new, large, unstructured dataset to use as an example of feature extraction, for which domain-specific knowledge will be useful. This example demonstrates how it's surprisingly easy to end up with many thousands of features, and allows us to recap ideas from session 7: which approaches to feature selection are viable for such large numbers of features.
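As one hypothetical illustration (not necessarily the dataset we'll use), counting short substrings (k-mers) in raw sequences shows how quickly feature counts grow:

    from itertools import product

    # made-up unstructured data: raw sequences with no obvious feature columns
    sequences = ["ATGCGTAC", "GGCTATCG", "ATATATCG"]

    # one simple extraction strategy: count every possible k-letter substring;
    # there are 4**k of them, so the feature count grows very quickly with k
    kmers = ["".join(p) for p in product("ACGT", repeat=3)]
    print("number of features:", len(kmers))

    features = []
    for seq in sequences:
        features.append({kmer: seq.count(kmer) for kmer in kmers})
    print(features[0]["ATG"])   # how often ATG appears in the first sequence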
Exercise: Creating features from scratch using an unstructured dataset, finding ones that have a strong predictive value, and checking that results make intuitive sense. We will get comfortable using ML techniques to identify patterns that are impossible to see with simple visualisation, and with interpreting a confusion matrix with many classes.
Session 10: ML workshop time
The last session is set aside for students to work on complete ML workflows, involving
- data gathering/merging/cleaning
- feature extraction/engineering
- feature selection
- model selection
- parameter searching
- model evaluation.
With real-world datasets we will likely have to write custom code to e.g. scrape data from websites, merge multiple datasets, clean and filter human-curated data files, etc. Students are encouraged to bring their own datasets, but suitable examples can easily be sourced for those who don’t have them yet.
Depending on students’ particular interests, either the students or the trainer will likely present some case studies at the very end of the course, showing how tools and ideas from the course apply to real scientific problems.