Data Science for Cellular Genomics
Developing data standards, workflows, infrastructure, and methods to enable interpretation, sharing, reuse, and integration of single-cell datasets.
Project Lead: Irene Papatheodorou
Funding:
This research is supported by the UKRI Biotechnology and Biological Sciences Research Council (BBSRC).
Funding is provided through the Earlham Institute Strategic Programme Grant Cellular Genomics BBX011070/1 and its constituent work package BBS/E/ER/230001A.
At the Earlham Institute, we have a unique blend of computational and molecular expertise in single-cell genomics, developed through our National Bioscience Research Infrastructure in Transformative Genomics.
We have a strong reputation for developing robust systems for large-scale genomics projects supporting open and FAIR data principles.
Metadata is essential for ensuring data is FAIR (findable, accessible, interoperable, and reusable), a key principle of modern, open science. It allows researchers to make deductions, replicate experiments, and draw more accurate conclusions.
The field of single-cell genomics moves incredibly fast, which makes curating and managing datasets for subsequent reuse time-consuming and costly, and leaves curated datasets at risk of quickly becoming outdated.
To integrate these datasets and test novel hypotheses, we need to develop and apply approaches that can harmonise them to improve biological interpretation.
As part of our Cellular Genomics strategic programme, we’re applying our knowledge in data science and computational developments to deliver metadata standards for single-cell biology, and developing bioinformatic pipelines for single-cell genomic data analyses.
We aim to generate outputs that will enable reliable and robust integration of single-cell data and metadata into maps of cellular characteristics or cell atlases.
The resulting cell atlases will be deployed within our CyVerse UK cloud for immediate reuse by the community. The underlying software and data infrastructure framework will also be publicly released so that others can build similar atlases themselves.
Managing metadata and laying the foundations for integrative analyses
Single-cell experiments produce important metadata at every stage, whether generated by instruments, recorded through human observation, or derived from primary or secondary analyses.
Working with our partners and collaborators, we aim to: define the properties of ‘good’ data for integration; develop guidance, standards, analytical workflows, and FAIR data infrastructure; and identify ways to standardise, integrate, and analyse heterogeneous datasets.
Improving and validating the data baselines for cellular genomics experiments
To enable single-cell sequencing applications in non-model systems, there is an urgent need to develop and deploy automated pipelines and algorithms that deliver reproducible analyses.
We are developing, testing, and improving such pipelines, and releasing them for use by the community.
Our software will be made available on the Earlham Institute’s Galaxy and CyVerse UK servers, so that scientists with limited bioinformatic knowledge or resources can use it.
Scaling up and scaling out data integration
To allow the integration of public data, and enable further investigations into the consequences of cellular heterogeneity, we need to control for potential confounders associated with the diverse origins of datasets.
Furthermore, data arising from single-cell investigations, especially for non-model organisms, represents a significant infrastructure challenge for the community.
We are developing pipelines and toolkits to enable the community to reuse and integrate single-cell data. To ensure our improved metadata standards, curations, and open access pipelines for reproducible single-cell analyses truly benefit the community, we aim to deliver a toolkit for the building of single-cell atlases.
COPO is a portal to describe, store and retrieve data more easily. Data description is critical to increase the value of the data itself, allowing scientists (and online search tools) to better understand its relevance.
If you are a life scientist looking for access to additional computational power, virtual machines or a web hosting service, CyVerse UK can help.
Galaxy is an open, web-based platform for accessible, reproducible, and transparent data-intensive research.
Cellular Genomics is a highly collaborative programme, including partners and collaborators from:
EMBL-EBI
ELIXIR
Plant Cell Atlas
IBM Research
PacBio
Alan Turing Institute
One of the major challenges associated with publicly available ‘omics data comes from its high diversity and the highly variable quality of the associated metadata.
Improved metadata description and curation will promote and enable reproducible research and data reuse. Enhancements to our data brokering tool, COPO, will enable better description and easier submission of single-cell data to public repositories.
All our outputs will be open access: we will deposit our protocols, pipelines, and datasets in public repositories.
Our collaborations with IBM Research will contribute to making our data and computational developments openly accessible through cloud infrastructures (CyVerse UK) and as containerised pipelines and workflows for deployment on other public cloud offerings.
The research community will directly benefit from our computational and method developments through our training activities, offered to scientists at all career levels.