Session 1: Introduction, environment and text manipulation
In this session we will introduce Python and get comfortable with the basic syntax of the language and with our development environment. We will start with an overview of our plans for the week and take care of any housekeeping details (like coffee breaks and catering arrangements). We will also use some of our time to confirm that everybody has a suitable programming environment, with the correct software and packages installed and no version problems.
To get started with programming, we'll introduce some examples of tools for working with text and show how they work in the context of biological sequence manipulation. We'll also cover the different types of errors and error messages, and learn how to go about fixing them methodically. Core concepts introduced: terminals, standard output, variables and naming, strings and characters, special characters, output formatting, statements, functions, methods, arguments, comments.
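As a flavour of this, here is a minimal sketch of string manipulation applied to a made-up DNA sequence (the sequence and variable names are just for illustration):

    # a short made-up DNA sequence stored in a variable
    my_dna = "ATGCGTGACCTGAAA"

    # strings come with built-in functions and methods
    print("length:", len(my_dna))
    print("number of A bases:", my_dna.count("A"))

    # replace() returns a new string, so we can transcribe DNA to RNA
    my_rna = my_dna.replace("T", "U")
    print("RNA version:", my_rna)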
Session 2: Lists and loops
In this session we'll start by thinking about the kinds of programs that we need to write to build and test machine learning models. An important idea is that we want to write programs that can deal with arbitrary amounts of data. In order to do so, we need two things: a way of storing large collections of values, and a way of processing them. In Python, lists and loops do these jobs respectively.
We'll go over the new syntax needed for each, and see how together they allow us to write programs that are much closer to being useful in the real world. This new syntax will allow us to see how lists, strings and files all share similar behaviour and how we can take advantage of that fact to write concise code. In the practical session we'll tackle some problems that involve iteration, as a necessary prerequisite for many aspects of ML work. Core concepts introduced: lists and arrays, blocks and indentation, variable scoping, iteration and the iteration interface, ranges.
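A minimal sketch of the pattern, using a made-up list of sequences:

    # a list stores an arbitrary number of values...
    sequences = ["ATGC", "ATGCGT", "AT", "ATGCGTGACC"]

    # ...and a for loop processes each one in turn
    for seq in sequences:
        # the indented block runs once per element
        print(seq, "has length", len(seq))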
Session 3: Intro to Pandas and thinking in columns
In this session we address the main difference between working with core Python objects and working with pandas: the need to think about operating on entire columns of values rather than one value at a time. Looking at a large number of examples will help to make this clear. Once we start thinking in this way, we'll find that we can do many common data processing tasks - filtering rows and columns, creating new columns, sorting, and summarizing columns - with very little code.
After a look at some special types of filtering that require slightly different syntax, we are in a position to practice solving some fairly tricky data analysis questions that involve a mixture of selecting, filtering and aggregating columns. This session will also give us a chance to introduce some of the datasets that we will be using in the machine learning part of the course.
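A minimal sketch of this column-wise style, using a small made-up table (in the course we'll load real datasets, e.g. with pd.read_csv):

    import pandas as pd

    # a tiny made-up table standing in for a real dataset
    df = pd.DataFrame({
        "species": ["mouse", "rat", "mouse", "human"],
        "length":  [120, 340, 95, 210],
    })

    # filtering rows: the comparison applies to the whole column at once
    long_records = df[df["length"] > 100]

    # creating a new column from an existing one, again column-wise
    df["length_kb"] = df["length"] / 1000

    # grouping and summarising a column in a single line
    print(df.groupby("species")["length"].mean())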
Session 4: Introducing seaborn
In this session we will turn our attention from data analysis (which normally produces tables of values as output) to data visualization (which produces figures as output). We'll start with an overview of the seaborn package then dive straight in to the core chart types for looking at distributions and relationships.
Histograms, kernel density plots and scatter plots are covered in this session, along with a few more exotic chart types like hex plot and contour plots, which can be useful alternatives to scatter plots when we have very large numbers of points to deal with. In this session we will also explore the power of seaborn's ability to map dataframe columns to things like marker size, shape and colour, and to easily make small multiple plots. Just like with pandas, by the end of this session we'll understand how to make complex charts with only a small amount of code. In the machine learning parts of the course we will rely heavily on visualisation to understand the behaviour and performance of our models.
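A minimal sketch of the idea, using one of seaborn's bundled example datasets rather than the course data:

    import seaborn as sns
    import matplotlib.pyplot as plt

    # one of seaborn's built-in example datasets
    penguins = sns.load_dataset("penguins")

    # a scatter plot where dataframe columns are mapped to x, y, colour and panels
    sns.relplot(
        data=penguins,
        x="bill_length_mm",
        y="bill_depth_mm",
        hue="species",   # colour encodes one column
        col="sex",       # small multiples: one panel per value of another column
    )
    plt.show()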
Session 5: Background to machine learning
In this session we cover some background:
- the relationship between machine learning and AI
- classical machine learning vs. deep learning
- the essential concept of learning from data.
We discuss some of the ways that we can organise ML approaches:
- supervised vs. unsupervised
- regression vs. classification
and start to consider the spectrum between simple and complicated methods.
We point out how a number of properties of ML methods tend to scale together:
- parameter count
- computational requirements
- data requirements
- interpretability.
Next, we turn to the practicalities of using ML methods for research: how datasets will be obtained, the difference between training and inference, and the use of pre-training and fine-tuning. This leads to an overview of universal issues when using ML: how to score and evaluate models, how to choose between models, how to visualise their behaviour, and how feature engineering and selection fit into an ML workflow.
Session 6: Core concepts of classification
In this session we dive straight in to a simple one-feature classification problem. We start off writing a manual classifier which allows us to get used to the core concepts of features/classes. We can use this very simple example to address two of the most important questions for understanding ML models:
- how can we score them, and
- how can we visualise their behaviour.
We also look at the concept of a confusion matrix - this will be important later when talking about different scoring metrics.
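As a minimal sketch of what such a manual classifier might look like (the measurements, classes and threshold are all made up):

    import pandas as pd

    # made-up single-feature data: a measurement and a known class for each sample
    measurements = [2.1, 5.3, 1.8, 7.0, 4.9, 6.2]
    true_classes = ["small", "big", "small", "big", "big", "big"]

    # classify by comparing each measurement to a hand-picked threshold
    predictions = []
    for value in measurements:
        if value > 5.0:
            predictions.append("big")
        else:
            predictions.append("small")

    # score: what fraction of predictions were correct?
    correct = sum(1 for p, t in zip(predictions, true_classes) if p == t)
    print("accuracy:", correct / len(true_classes))

    # a confusion matrix shows which classes get mixed up with which
    print(pd.crosstab(pd.Series(true_classes, name="true"),
                      pd.Series(predictions, name="predicted")))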
Taking a detailed look at our manual classifier reveals that this approach will not scale for a number of reasons, so we give an intuitive explanation of the K-Nearest-Neighbours algorithm. Pandas allows us to write a simple implementation, and we can use the tools we've already built to contrast this with our manual classifier. Now that we have a parameterised algorithm we can discuss the idea of systematic parameter searching that will form the basis of training more complex models.
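A minimal sketch of the KNN idea in pandas, using the same kind of made-up single-feature data:

    import pandas as pd

    # made-up training data: one feature plus a known class for each row
    train = pd.DataFrame({
        "length": [2.1, 5.3, 1.8, 7.0, 4.9, 6.2],
        "label":  ["small", "big", "small", "big", "small", "big"],
    })

    def knn_predict(new_value, k=3):
        # distance from the new point to every training point, column-wise
        distances = (train["length"] - new_value).abs()
        # take the k closest training rows and return their most common label
        nearest = train.loc[distances.nsmallest(k).index]
        return nearest["label"].mode()[0]

    print(knn_predict(2.5))   # expect "small"
    print(knn_predict(6.0))   # expect "big"

The number of neighbours k is the parameter we can search over systematically.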
At this point we can cover in detail an incredibly important point: the division of data into training/test sets - why this is necessary and how to do it. We also introduce the idea of cross validation.
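A minimal sketch of a random train/test split done with pandas alone (made-up data again):

    import pandas as pd

    # a made-up feature table standing in for our real dataset
    df = pd.DataFrame({
        "length": [2.1, 5.3, 1.8, 7.0, 4.9, 6.2, 3.3, 5.8],
        "label":  ["small", "big", "small", "big", "small", "big", "small", "big"],
    })

    # randomly assign 75% of rows to the training set...
    train = df.sample(frac=0.75, random_state=42)

    # ...and keep the remaining rows, which the model never sees, as the test set
    test = df.drop(train.index)
    print(len(train), "training rows,", len(test), "test rows")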
Exercise: building a KNN classifier for a new dataset, including parameter optimisation, visualisation and scoring. At the end, having two working examples lets us pinpoint what is common to all classification problems and get an intuitive sense of how this will apply to more complicated models.
Session 7: sklearn and adding features
In this session we have two main goals:
- explaining the architecture of the sklearn package
- getting started with the idea of feature engineering.
We start by quickly recapping the roles of sklearn, numpy, pandas and friends in the Python ML ecosystem, before diving into the practicalities of how to use sklearn. Spending a little time now on the data model and how models are represented will save a lot of time later on. This allows us to quickly reproduce the workflow from the previous session with a fraction of the code.
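A minimal sketch of the shared sklearn pattern, using a bundled example dataset in place of the course data:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # every sklearn model follows the same steps: create, fit, predict/score
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(X_train, y_train)           # learn from the training data
    predictions = model.predict(X_test)   # predict labels for unseen data
    print("test accuracy:", model.score(X_test, y_test))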
We can point out explicitly how using existing implementations of ML algorithms lets us focus on higher-level concerns, and briefly cover a few ideas and best practices that are more easily explained using sklearn (see the sketch after this list):
- stratified splitting
- balanced/unbalanced datasets
- potential pitfalls of sorted data.
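For example, stratified splitting is a one-argument change with sklearn (sketched here with a bundled dataset):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)

    # stratify=y keeps class proportions the same in the train and test sets,
    # which matters for unbalanced data and guards against sorted input files
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, shuffle=True, random_state=0)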
We can now do something that would be hard with our manual implementation but is very easy with sklearn: introduce extra features to our classification problem. We use the visualisation approaches that we've already learned to see the effect of additional features, and illustrate the importance of scaling - another aspect of feature engineering. This is a natural time to mention a few other types of feature engineering.
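A minimal sketch of scaling, with two made-up features on very different scales:

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    # made-up features on very different scales (e.g. metres vs. milligrams)
    X = np.array([[1.7, 70000.0],
                  [1.6, 55000.0],
                  [1.8, 82000.0]])

    # fit the scaler on training data only, then apply it to any data
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    print(X_scaled)   # each column now has mean 0 and standard deviation 1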
Now that we have a two-dimensional classifier, we can really investigate the effect of different parameters on the classifier's behaviour, and see how it fits into the universal issue of over/under-fitting. We finish on the idea of feature selection: given a dataset with many potential features, how to scalably pick useful ones. There are some intuitive approaches like sequential selection, and some univariate ones that leverage existing statistical knowledge.
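Both flavours of feature selection are available in sklearn; a minimal sketch, again with a bundled dataset:

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest, SequentialFeatureSelector, f_classif
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)

    # univariate selection: score each feature on its own with a statistical test
    best_two = SelectKBest(score_func=f_classif, k=2).fit(X, y)
    print("univariate scores:", best_two.scores_)

    # sequential selection: greedily add the feature that most improves the model
    sfs = SequentialFeatureSelector(KNeighborsClassifier(), n_features_to_select=2)
    sfs.fit(X, y)
    print("chosen features:", sfs.get_support())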
Exercise: taking a new dataset with many features, experiment with feature selection to optimise a classification model, being sure to pay attention to scaling and overfitting.
Session 8: Binary classification and new models
In this session we are going to cover two main topics: binary classification (as opposed to the multiclass problems we have been tackling so far) and some new algorithms. First, we will introduce binary classification as a particularly common and useful type of classification problem. Contrasting it with the kind of classification problem we've been looking at so far, we see that it makes some aspects of our workflow easier, and some (particularly scoring) harder.
We can refer back to our previous discussion of confusion matrices to explain recall/sensitivity/specificity/true positives/negatives and the trade-offs between them. At this point we can also discuss how to represent categorical data as features. The subtleties of different types of encoding (ordinal, one-hot, etc) and the practicalities of how to create them are another chance to touch on the importance of feature engineering to avoid bias.
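A minimal sketch of two common encodings, using a made-up categorical column:

    import pandas as pd

    # a made-up categorical feature
    df = pd.DataFrame({"habitat": ["forest", "desert", "forest", "ocean"]})

    # one-hot encoding: one 0/1 column per category, with no implied ordering
    print(pd.get_dummies(df, columns=["habitat"]))

    # ordinal encoding: a single integer column, which does imply an ordering
    df["habitat_code"] = df["habitat"].astype("category").cat.codes
    print(df)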
We will also go into a fair amount of detail on two completely new algorithms (see the sketch after this list):
- support vector machines (SVM)
- decision trees.
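Because sklearn models share an interface, trying both is a small change; a minimal sketch with a bundled dataset:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # swapping algorithms is a one-line change thanks to the shared interface
    for model in [SVC(kernel="rbf"), DecisionTreeClassifier(max_depth=3)]:
        model.fit(X_train, y_train)
        print(type(model).__name__, "accuracy:", model.score(X_test, y_test))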
With three entirely different classification algorithms in front of us we can use visualisation tools to compare their behaviour and to think about their interpretability. We can also start some benchmarking to directly measure their computational requirements, and some scoring to explicitly compare them and begin to answer an overarching question of many ML projects: how do we choose which type of model to use?
Exercise: with a complex dataset with mixed data types, carry out preprocessing with feature engineering/selection, then compare KNN, decision tree and SVM models. This exercise will show how different criteria for success in terms of recall/precision will sometimes lead to different model choices.
Session 9: Regression
Having spent a good deal of time discussing and solving classification problems, we now turn our attention to the other common type of ML problem: regression. After introducing a simple example, we can point out similarities to and differences with classification, particularly with regard to our practical workflow. Visualisation and scoring will be very different, but many ideas around feature engineering/selection and parameter searching will be the same.
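A minimal sketch of the regression workflow, using a bundled example dataset in place of the course data:

    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split

    # a bundled regression dataset: the target is a number, not a class
    X, y = load_diabetes(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = LinearRegression()
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)

    # scoring looks different from classification: average error, not accuracy
    print("mean absolute error:", mean_absolute_error(y_test, predictions))
    print("r squared:", model.score(X_test, y_test))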
Despite the classification/regression dichotomy, a few examples will make it clear that many problems can be stated in both regression and classification terms, and that many algorithms work for both with slight tweaks. The behaviour of regression models with categorical features leads to particularly interesting visualisations which allow us to check our intuitive understanding of how the algorithms work. We look at the effect of feature count on interpretability, and think about another way to organise ML methods based on whether prediction changes are linear or stepwise.
At this point we will introduce a new, large, unstructured dataset to use as an example of feature extraction, for which domain-specific knowledge will be useful. This example demonstrates how it's surprisingly easy to end up with many thousands of features, and allows us to recap ideas from session 7: which approaches to feature selection are viable for such large numbers of features.
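As one hypothetical illustration (not necessarily the dataset we'll use), counting short substrings (k-mers) in raw sequences shows how quickly feature counts grow:

    from itertools import product

    # made-up unstructured data: raw sequences with no obvious feature columns
    sequences = ["ATGCGTAC", "GGCTATCG", "ATATATCG"]

    # one simple extraction strategy: count every possible k-letter substring;
    # there are 4**k of them, so the feature count grows very quickly with k
    kmers = ["".join(p) for p in product("ACGT", repeat=3)]
    print("number of features:", len(kmers))

    features = []
    for seq in sequences:
        features.append({kmer: seq.count(kmer) for kmer in kmers})
    print(features[0]["ATG"])   # how often ATG appears in the first sequence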
Exercise: Creating features from scratch using an unstructured dataset, finding ones that have a strong predictive value, and checking that results make intuitive sense. We will get comfortable using ML techniques to identify patterns that are impossible to see with simple visualisation, and with interpreting a confusion matrix with many classes.
Session 10: ML workshop time
The last session is set aside for students to work on complete ML workflows, involving
- data gathering/merging/cleaning
- feature extraction/engineering
- feature selection
- model selection
- parameter searching
- model evaluation.
With real-world datasets we will likely have to write custom code to e.g. scrape data from websites, merge multiple datasets, clean and filter human-curated data files, etc. Students are encouraged to bring their own datasets, but suitable examples can easily be sourced for those who don’t have them yet.
Depending on students’ particular interests, either the students or the trainer will likely present some case studies at the very end of the course, showing how tools and ideas from the course apply to real scientific problems.