
Data Science for Cellular Genomics

Developing data standards, workflows, infrastructure, and methods to enable interpretation, sharing, reuse, and integration of single-cell datasets.

Project Summary.

Project Lead: Irene Papatheodorou

Funding:

This research is supported by the UKRI Biotechnology and Biological Sciences Research Council (BBSRC) through the Earlham Institute Strategic Programme Grant Cellular Genomics BBX011070/1 and its constituent work package BBS/E/ER/230001A.

At the Earlham Institute, we have a unique blend of computational and molecular expertise in single-cell genomics, developed through our National Bioscience Research Infrastructure in Transformative Genomics.

We have a strong reputation for developing robust systems for large-scale genomics projects supporting open and FAIR data principles.

Metadata is essential for ensuring data is FAIR (findable, accessible, interoperable, and reusable), a key principle of modern, open science. Well-described data allows researchers to replicate experiments and draw more accurate conclusions.

The field of single-cell genomics is moving incredibly fast, which makes curating and managing datasets for subsequent reuse time-consuming and costly, and leaves curated datasets at risk of rapidly becoming outdated.

To integrate these datasets and test novel hypotheses, we need to develop and apply approaches that harmonise them and improve biological interpretation.

As part of our Cellular Genomics strategic programme, we’re applying our expertise in data science and computational development to deliver metadata standards for single-cell biology and to develop bioinformatic pipelines for single-cell genomic data analysis.

We aim to generate outputs that will enable reliable and robust integration of single-cell data and metadata into maps of cellular characteristics or cell atlases.

The resulting cell atlases will be deployed within our CyVerse UK cloud for immediate reuse by the community. The underlying software and data infrastructure framework will also be publicly released so that others can build similar atlases themselves. 

Impact statement.

One of the major challenges with publicly available ’omics data is its high diversity and the highly variable quality of the associated metadata.

Improved metadata description and curation will promote and enable reproducible research and data reuse. Enhancements to our data brokering tool, COPO, will enable better description and easier submission of single-cell data to public repositories.

All our outputs will be open access: we will deposit our protocols, pipelines, and datasets in public repositories.

Our collaborations with IBM Research will contribute to making our data and computational developments openly accessible through cloud infrastructures (CyVerse UK) and as containerised pipelines and workflows for deployment on other public cloud offerings.

The research community will directly benefit from our computational and method developments through our training activities, offered to scientists at all career levels.