AI and life sciences: why FAIR data is essential

DNA sequencing has unlocked massive amounts of complex information. AI can help us analyse this. But it’s only effective with the right data.

23 April 2025

Advances in sequencing mean vast amounts of data - including imaging, proteomics and metabolomics - are being generated worldwide every day.

There is far more than researchers can easily analyse by traditional methods, but properly trained AI systems can analyse this mountain of complex information, quickly picking up patterns and details too subtle for humans to see. 

And they are already changing the face of science. 

Last year’s Nobel Prize in Chemistry was awarded in part to the creators of DeepMind’s AI programme AlphaFold, which has transformed structural biology by enabling scientists to accurately predict protein structures from amino acid sequences. The first Nobel for AI - but quite possibly not the last.

But - as anyone who has ever been confused by their social media recommendations knows - AI can only help you if it can find and understand the data it needs.

Scientists around the world are using a plethora of different methods to generate, label, store and share their data. Without a universal approach, data can’t be easily accessed or understood. 

Professor Neil Hall, Director of the Earlham Institute, says the importance of FAIR - Findable, Accessible, Interoperable, and Reusable - data to AI cannot be overstated. He describes it as the difference between a photo stored on a phone and one stored in a box in the attic. 

“Metadata automatically attached to a phone photo will tell you the size of the photo, the format, when it was taken, where it was taken and can even offer some information about the subject.

“Crucially, the metadata is interoperable - I can look at this photo on my phone or my computer and see the same information. 

“A box of unsorted photos contains no information. They are in no order, and might not even have dates or names - sorting them relies on the knowledge of the person looking at them. The first example can be used by AI. The second can’t be used.”

The Earlham Institute is developing FAIR methods for managing, storing, and labelling data to ensure the data is readable and interpretable by AI.

“I once heard someone describe FAIR data as Fully AI Ready, which I thought worked very well as a description,” he says. 
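
Professor Hall's photo analogy can be sketched in a few lines of Python. This is only an illustration: the `Photo` record, its field names and the `find_photos` helper are hypothetical and are not part of any Earlham Institute tool. It shows why a record that carries its own metadata can be found by a machine, while one without metadata cannot.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Photo:
    """A photo plus whatever metadata travelled with it (hypothetical fields)."""
    filename: str
    taken_on: Optional[date] = None
    place: Optional[str] = None
    file_format: Optional[str] = None


def find_photos(photos, year, place):
    """Return the photos that can be matched on year and place without human help."""
    return [
        p for p in photos
        if p.taken_on is not None and p.taken_on.year == year and p.place == place
    ]


# The phone photo carries its metadata; the attic photo is just an image.
phone_photo = Photo("IMG_0001.jpg", date(2024, 7, 14), "Norwich", "JPEG")
attic_photo = Photo("scan_17.jpg")

matches = find_photos([phone_photo, attic_photo], year=2024, place="Norwich")
print([p.filename for p in matches])
```

Only the phone photo is returned. The attic photo may well have been taken in the same place and year, but nothing in the record says so, and the search has no way to know.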

Staying BioFAIR

AI requires well-curated, properly organised information. It’s the data needed to train the complex algorithms that sit behind the tools. It’s the only way systems like AlphaFold can learn to predict protein structures.

Widely adopted FAIR-compliant metadata standards and tools could open up a new world for the biosciences - one where scientists can interrogate existing data to answer new questions.

The Earlham Institute hosts two significant biodata infrastructure organisations: BioFAIR and ELIXIR-UK.

BioFAIR, funded with £34 million from the UK Research and Innovation (UKRI) Infrastructure Fund, is a groundbreaking national research infrastructure for life sciences.

It focuses on connecting researchers, digital professionals, and institutions, offering data commons (accessible, shareable datasets), shared workflows, and expertise and training in research data management.

At a recent conference of BBSRC Research Institutes, one session looked at the role of open science and FAIR data in research transparency. Pictured: Richard Ostler from Rothamsted Research, an ELIXIR-UK Fellow with strong connections to BioFAIR.

BioFAIR Director Tony Burdett says FAIR data principles make data more accessible and valuable for researchers, breaking down data barriers and fostering collaboration.

“The days when researchers would sit in their own labs, performing their own bespoke experiments, and writing up and publishing their own findings are long gone,” he says. “The world we live in now is much more data-driven, with a greater focus on team science. Today’s life sciences researchers are much more technology savvy.

“Being able to share data effectively with your collaborators, reuse data that's previously been published from experiments that other groups have done - these are core foundations in modern science. Watson and Crick made the discovery of the structure of DNA thanks to the timely sharing of data from Rosalind Franklin’s lab.”

“Data that isn’t FAIR creates massive barriers to the sorts of innovative insight today’s AI algorithms can yield.”

He says research typically shows that bioinformaticians and computational biologists doing data-driven science can spend as much as 80 per cent of their time simply collecting, organising, and cleaning the data they need.

This leaves only 20 per cent of their time for their key work - the computational analysis that supports novel research. 

“One of the ambitions of BioFAIR is to help change research culture, removing the barriers that mean researchers waste so much time simply finding and cleaning data that can be useful to them,” he says.

“The easy answer that people often jump to is: Let’s just train machines to do data cleanup for us! But I am sceptical about that.

“We need to find the right role for human experts to unlock the potential of AI. We need the people that are working in labs like those here at the Earlham Institute and creating new, really valuable genomic datasets, to work alongside expert data managers and expert data stewards, to think about how best to structure and organise the data so that it's reusable by others,” adds Tony.

“In general, I think the opportunities are not particularly in using AI to create FAIR data for us - but instead, if we can focus on creating FAIR data that is more ‘AI ready’, we will see a huge leap forward in what machines can do with that data, in surprising and sometimes unpredictable ways. That’s a huge amount of value that can be unlocked by AI.”

Prof Irene Papatheodorou, with Dr Felix Shaw and Dr Liliya Serazetdinova

Felix Shaw (centre) and Irene Papatheodorou (right) lead the COPO platform and are also working to develop standardised metadata systems for single-cell genomics.

The ELIXIR of knowledge

Pan-European organisation ELIXIR provides essential data services and infrastructure, helping countries and institutions to collaborate and share data so that data-driven science can be done effectively.

The Earlham Institute is the lead institute for the UK node (ELIXIR-UK) and Prof Hall is the co-leader of the organisation. The Institute’s Collaborative OPen Omics (COPO) is an ELIXIR-UK endorsed service, and plays a key role.

COPO helps researchers with uploading, labelling, and tagging their work in a consistent way. It is designed to make it easy to share both results and the metadata around them, storing them according to agreed terms so they are easily findable and describable.
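
The idea of labelling data "according to agreed terms" can be illustrated with a minimal sketch. This is not COPO's code or interface: the required fields, the list of allowed organism names and the `check_record` function are all assumptions made up for the example. Real metadata standards define their own fields and vocabularies.

```python
# Hypothetical agreed vocabulary, invented for this illustration only.
REQUIRED_FIELDS = {"sample_id", "organism", "collection_date"}
ALLOWED_ORGANISMS = {"Triticum aestivum", "Homo sapiens"}


def check_record(record):
    """Return a list of problems that would make a record hard to find or reuse."""
    missing = sorted(REQUIRED_FIELDS - set(record))
    problems = [f"missing field: {name}" for name in missing]
    organism = record.get("organism")
    if organism is not None and organism not in ALLOWED_ORGANISMS:
        problems.append(f"organism not in agreed terms: {organism}")
    return problems


good = {
    "sample_id": "S1",
    "organism": "Triticum aestivum",
    "collection_date": "2025-04-23",
}
bad = {"sample_id": "S2", "organism": "wheat"}

print(check_record(good))
print(check_record(bad))
```

The first record passes with no problems. The second is flagged twice: it has no collection date, and it labels its organism with an informal name that a person would understand but a search across datasets would probably miss.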

Dr Felix Shaw is a research software engineer at the Institute, working on COPO alongside fellow software engineers Debby Ku and Aaliyah Providence. 

“It’s been really interesting to be part of ELIXIR,” he says. “Europe and the UK have in general been very good at recognising the need for FAIR data and metadata, so I would say we are in a very strong position.”

While AI offers an exciting new approach for analysing the huge quantities of life science data currently being produced, a lot of work is still needed. Data must be organised and interpretable by AI tools.

But, with the right infrastructure, collaboration between human expertise and AI analysis could solve some of the most pressing challenges in genomics.

Researchers and machines working together, aided by robust data infrastructures like BioFAIR and tools like COPO, could drive the next wave of breakthroughs in life science. 

Article author

Amy Lyall

Scientific Communications and Outreach Officer