Data is almost meaningless without metadata - the essential information about where, how and why the data was collected. This metadata is often buried in scientific papers or lab notebooks, and is rarely shared in a way that’s reusable.
Methods for effectively recording and managing metadata have historically been lacking. That’s where COPO, a big data broker for life sciences, comes in. COPO, developed by Earlham Institute’s Davey Group, allows users to easily upload all of the metadata associated with an experiment, ensuring that important data is found along with crucial context. It’s also readily accessible for anyone who needs it.
Today, COPO launches in support of the Darwin Tree of Life Project (DToL), which aims to understand our biodiversity by sequencing the DNA of all the animals, plants, fungi and protists in the British Isles. COPO is initially open to a small set of users, before training is rolled out to the wider community.
“When you do a sequencing experiment, the DNA comes from an organism which has to be collected either out in the wild, in a herbarium, from a seed bank, or even grown up in culture in a lab,” says Dr Rob Davey, Head of e-Infrastructure at Earlham Institute.
“There’s valuable information about the sample - where, when, how and why it was collected, for example, and perhaps the specific body part of an insect, or the salinity of the water where a single-celled organism was found. When you put all that together, there’s a lot of metadata to record.
“COPO is being used to track those collection events so that, when an organism’s DNA has been sequenced, the associated metadata is also available as this might be vital information for someone else’s experiments.”
DToL is a collaboration of 10 research institutes in the UK, which between them are working to record Britain’s biodiversity at an unprecedented scale. Such a collaboration requires an agreement on how data is stored and managed, so that it is easily accessible and searchable for anyone working on the project. COPO provides this: users upload a single spreadsheet, and COPO automatically submits all the data to the European Nucleotide Archive (ENA).
“COPO ensures that metadata is validated,” says EI Research Software Engineer Alice Minotto. “This could be metadata such as taxonomies, which can be tricky, as identifying organisms is not a fixed process. Names and species identification can change over time, and even within specific communities.
“Instead of having to check and submit this information manually, which would take a very long time for each row in the spreadsheet, COPO automates the process. This makes it far less time-consuming, easier, and eliminates errors.”
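To give a rough sense of what row-by-row validation of a sample spreadsheet can look like, here is a minimal Python sketch. It is not COPO's actual code: the column names, the `KNOWN_TAXA` set standing in for a taxonomy lookup, and the functions `validate_row` and `validate_manifest` are all assumptions made for illustration.

```python
import csv
import io
from datetime import date

# Hypothetical column names; a real manifest template may differ.
REQUIRED_FIELDS = (
    "specimen_id",
    "scientific_name",
    "collection_date",
    "latitude",
    "longitude",
)

# Tiny stand-in for a taxonomy lookup service (assumption for this sketch).
KNOWN_TAXA = {"Bombus terrestris", "Quercus robur"}


def validate_row(row, known_taxa=KNOWN_TAXA):
    """Return a list of human-readable problems found in one manifest row."""
    problems = []

    # Every required column must be present and non-empty.
    for field in REQUIRED_FIELDS:
        if not (row.get(field) or "").strip():
            problems.append(f"missing {field}")

    # The scientific name must be one the taxonomy lookup recognises.
    name = (row.get("scientific_name") or "").strip()
    if name and name not in known_taxa:
        problems.append(f"unrecognised scientific name: {name}")

    # The collection date must parse as an ISO date.
    raw_date = (row.get("collection_date") or "").strip()
    if raw_date:
        try:
            date.fromisoformat(raw_date)
        except ValueError:
            problems.append(f"collection_date is not an ISO date: {raw_date}")

    # Coordinates must be numbers within the valid range.
    for field, limit in (("latitude", 90.0), ("longitude", 180.0)):
        raw = (row.get(field) or "").strip()
        if not raw:
            continue
        try:
            value = float(raw)
        except ValueError:
            problems.append(f"{field} is not a number: {raw}")
            continue
        if abs(value) > limit:
            problems.append(f"{field} out of range: {raw}")

    return problems


def validate_manifest(text):
    """Map each failing data row number (1-based) to its list of problems."""
    reader = csv.DictReader(io.StringIO(text))
    report = {}
    for number, row in enumerate(reader, start=1):
        problems = validate_row(row)
        if problems:
            report[number] = problems
    return report


SAMPLE = (
    "specimen_id,scientific_name,collection_date,latitude,longitude\n"
    "S1,Bombus terrestris,2020-06-01,52.6,1.2\n"
    "S2,Bombus terestris,01/06/2020,95,1.2\n"
)

if __name__ == "__main__":
    for row_number, problems in validate_manifest(SAMPLE).items():
        print(f"row {row_number}: " + "; ".join(problems))
```

In this sketch the first sample row passes, while the second is flagged for a misspelt name, a non-ISO date and an out-of-range latitude. A real service would check names against a maintained taxonomy database rather than a fixed set, since accepted names change over time.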
Importantly, COPO is open to any life scientist who wants to annotate and submit data more quickly and easily.
“Anyone who wants to submit data to a repository that we support can use COPO,” says Minotto. “It makes the process of annotating your data easier.
“It can be expanded to other communities, similar to DToL. Since it’s open source, people could do that themselves.”
To find out more about COPO, you can visit the website.
To speak to the team, contact Dr Felix Shaw and Alice Minotto.