Open data sharing: how and why?
The more open biological data is, and the better it is shared, the more we can hope to get out of it for the benefit of all.
The biological sciences have been propelled into a new era in which big data and bioinformatics are inextricably intertwined, and new tools are required to explore the interactions between living systems at an unprecedented level of detail.
The world is becoming increasingly ruled by metadata. Metadata, or simply ‘data about data’, might seem fairly dull and only of interest to computer scientists. However, huge amounts of metadata are being recorded and analysed every second.
Your smartphone knows where you are, who you talk to, and what you pay for. LinkedIn and Facebook may know who you have been dating before your best friend does. Amazon assigns actual monetary value to individual customers and makes recommendations based on this. Every click you make directs targeted adverts to your social media accounts, based on your search and buying history.
While this might be scary for some people - and, indeed, misuse of metadata is a real privacy problem - efficient sharing of data has clear benefits for the industries that embrace and use it. Scientific fields such as genomics and bioinformatics should be no different.
Only a few years ago, it might have taken someone an entire PhD to painstakingly sequence the DNA encoding a single gene; we now have machines that do this in a matter of seconds. While still a gargantuan effort, we can now churn out an entire 17 Gb wheat genome in just over a week, and some institutes are busy decoding upwards of 100,000 human genomes.
17 Gb of DNA represents 17 billion letters of our natural information storage mechanism, and this translates to roughly the same amount of information on a computer system: stored as plain text, at one byte per base, 17 gigabases takes up roughly 17 gigabytes. This is clearly a lot of information to process. However, it’s not just DNA sequences that add to the data complexity of the biological sciences; potentially, it’s the results of every single experiment ever done in the history of published science - if they were all readily accessible, of course, but more on that later.
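To put those numbers in rough perspective, here is a back-of-the-envelope sketch (the per-genome sizes are approximate assumptions, and raw sequencing reads would take up far more space than the finished assemblies):

```python
# Rough storage estimates for plain-text sequence data, assuming one byte
# per base. Genome sizes are approximate; raw reads and quality scores
# would multiply these figures considerably.

GIGA = 1_000_000_000

wheat_genome_bases = 17 * GIGA        # ~17 gigabases, as quoted above
human_genome_bases = 3.2 * GIGA       # ~3.2 gigabases per human genome
num_human_genomes = 100_000           # scale of large sequencing programmes

wheat_gb = wheat_genome_bases / GIGA                       # gigabytes as text
cohort_pb = human_genome_bases * num_human_genomes / 1e15  # petabytes as text

print(f"Wheat genome as plain text: ~{wheat_gb:.0f} GB")
print(f"100,000 human genome assemblies as plain text: ~{cohort_pb:.2f} PB")
```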
We have the capacity not only to produce abundant data like never before - hence the “big data” buzzword phenomenon - but also to mine the internet for data that, taken together, can provide an incredible resource for better understanding life on Earth. By providing the right tools and platforms that allow researchers to share, distribute, access and use diverse datasets, we can make scientific data accessible, improving the reproducibility of research outputs and cross-disciplinary collaboration.
At Earlham Institute, the Data Infrastructure and Algorithms Group, led by Rob Davey, is responsible for just that.
There is not much point in having abundant data if its relevance isn’t properly understood, which is where COPO (Collaborative Open Plant Omics) comes in. Developed at EI by Felix Shaw and Anthony Etuk, COPO aims to make data of all types more searchable and more valuable to scientists.
Many funding bodies and journals now require that data is released openly at the time of publication, and many scientists understand the importance of data sharing.
However, some academic circles still see data reuse as a scourge on the scientific process, even coining the term “research parasites”. Much work is still not publicly available - perhaps due to a lack of understanding of where and how to deposit it, but almost universally because it can take a lot of time and attention to describe and share a dataset.
So the main aims of COPO are simple. Firstly, to enable scientists to better share their data. Secondly, and crucially, to give proper credit where credit is due by tracking the outputs of a researcher’s work. COPO hides much of the complex data capture and management from the end user, providing a simpler route for plant scientists to submit data and have it appear in public repositories - essentially a ‘metadata brokering service’.
Data entry and sharing can be a daunting and lengthy process, not only for novice scientists but also for well-established groups. COPO supplies web-based, wizard-like graphical interfaces that allow researchers to easily describe and prepare data for submission to a public repository, putting their work into context through efficient labelling and tagging, as well as giving advice on what sort of data would be appropriate to share in the first place. This lets researchers spend less time and effort describing their data and more time getting on with their experimental or analytical work.
Essentially, COPO allows different forms of metadata - information about datasets, such as the organism studied, experimental parameters and sample characteristics - to be submitted using consistent, standardised terminology, making the data more searchable and more usable. Metadata allows us to make sense of and join up data; COPO allows us to do that effectively.
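To make that concrete, here is a purely illustrative sketch of what a standardised sample description might look like - the field names and structure are hypothetical, not COPO’s actual schema:

```python
# Hypothetical example of a standardised sample description; the field names
# and layout here are illustrative only, not COPO's actual schema.
sample_metadata = {
    "sample_id": "wheat_leaf_001",
    "organism": {
        "name": "Triticum aestivum",
        "ncbi_taxon_id": 4565,        # NCBI Taxonomy ID for bread wheat
    },
    "tissue": "leaf",
    "growth_conditions": {
        "temperature_celsius": 20,
        "photoperiod_hours": 16,
    },
    "assay_type": "RNA-Seq",
}

# Because every record uses the same controlled vocabulary, datasets from
# different labs can be searched, compared and joined on the same terms.
```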
According to Felix Shaw, who recently presented COPO at the Plant and Animal Genomes Conference in San Diego: “COPO is a first step in a promising direction for science. It is likely that difficult problems such as feeding the global population and finding cures for rare diseases can only be solved by pooling data and resources.
“There is a lot of talk in the community about open data and ‘FAIR’ principles. Discussion is great and demonstrates resolve, and the individual technologies required to achieve these aims have been available for some time, so now we need to start joining them up in innovative ways to create tools of genuine use to our researchers. COPO is one such tool, and what we really need now is for people to embrace this change and start sharing.”
Big data analysis comes with large computing requirements, as well as an ever-increasing range of software and tools. However, not all research groups have equal access to these resources or have the necessary computing specialists to install and run specific programmes and platforms.
CyVerse, a project initiated in the US and developed under a large National Science Foundation (NSF) grant, aims to bring all of this together, allowing researchers to store, access and reuse different datasets, models and analytical tools hosted in an interconnected network of international institutions.
The Davey Group at EI are leading the deployment of CyVerse UK - the first CyVerse node outside of the US, which provides free HPC facilities for UK scientists to allow better data analysis and management.
The specific aims of CyVerse are to ensure reproducibility and shareability, with applications running in tandem across various computing environments. Data can be shared at any time, either publicly or privately among collaborators, while the documentation and code are open source to ensure that others can build on and benefit from CyVerse.
Essentially, if we want to get metaphorical, CyVerse is like an amusement arcade for biologists and bioinformaticians. It brings lots of different tools together into one place, making it easier to analyse data from a variety of sources.
So, rather than having to go to someone’s house to play Street Fighter, then travel to the other side of town to have a go on the Dance Mat - all of the games are readily accessible in one place.
Better still, the amusement arcade publishes its floor plan and where the power sockets are, and lets you share the lighting, the heating, and who can come in through the front door. So, if someone else has their own machine they want to install and run in the arcade, CyVerse is immediately able to help. This means CyVerse can support services all the way from low-level ‘plumbing’ applications up to user interfaces for data sharing and analysis.
Currently, the CyVerse UK node, developed and maintained by Erik van den Bergh and Alice Minotto, provides access to a number of bioinformatics applications and workflows, including Mikado and Gwasser developed at EI, with several others currently in the pipeline.
“It's great to be developing and supporting an open platform like CyVerse UK, especially at the Earlham Institute where we can contribute many tools and resources to the scientific community. It's a great feeling to be able to help more scientists to use the amount of biological data available for their research,” says Alice.
Eventually, we hope that CyVerse UK will become a National Capability at EI, making it a cornerstone of UK bioscience through providing HPC and storage for collaborative research.
The EI hub of Galaxy is another web-based open platform with a large number of “apps” for biologists that can be easily downloaded, installed and linked together to form pipelines, enabling greater and more transparent access to, and sharing of, bioscience data. Using EI’s state-of-the-art high-performance computing and next-generation sequencing technologies, the EI Galaxy hub provides tools and resources for a variety of bioinformatics applications.
Anil Thanki and Nicola Soranzo are responsible for developing such resources. Nicola develops and manages the Galaxy server at EI, while Anil presented recently at the Plant and Animal Genomes Conference in San Diego on developing a pipeline that helps users find collections of genes that are shared between organisms.

“One tool that is heavily used for this already is Ensembl Compara, which is a powerful pipeline but complicated to use,” says Anil.
“We have been working on incorporating this existing pipeline into Galaxy to provide a much more user-friendly version to work with, called GeneSeqToFamily. You can install and run GeneSeqToFamily using a web browser and the Galaxy interface, rather than having to install a large number of tools on your own computer and getting them to work in tandem together.”
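For those who prefer scripting to clicking, Galaxy servers can also be driven programmatically through the BioBlend Python library. The sketch below is a minimal illustration, assuming you have a server URL and an API key; the workflow name search is just an example:

```python
# Minimal sketch of querying a Galaxy server with BioBlend.
# The server URL and API key below are placeholders.
from bioblend.galaxy import GalaxyInstance

gi = GalaxyInstance(url="https://your-galaxy-server.example", key="YOUR_API_KEY")

# List the workflows visible to this account and look for one by name.
for wf in gi.workflows.get_workflows():
    if "GeneSeqToFamily" in wf["name"]:
        print(wf["id"], wf["name"])

# Create a fresh history to hold the inputs and outputs of an analysis.
history = gi.histories.create_history(name="gene-family-test")
print("Created history:", history["id"])
```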
This powerful tool is of great use to bioinformaticians, as it can be used not only to scour newly sequenced genomes for genes we already know about, but also to find completely novel genes, known as ‘orphan genes’. Orphan genes are unique to a given organism, having no known ‘homologues’ - at least among the organisms and genomes we have sequences for - and can therefore be incredibly interesting to study in terms of how they affect specific traits.
Importantly, the code and pipeline developed for the project are open source and can be installed on any Galaxy server, opening up the resource to the international scientific community.
Such diverse projects go to show how far the biological sciences have progressed into the digital era. They also shine a light on how important it is to foster cross-disciplinary skill and expertise within all scientific fields, and show that expertise and knowledge are best shared openly to make sure everyone can benefit.
Technical advances in analysis, like GeneSeqToFamily, and in infrastructure, like CyVerse, COPO and Galaxy, aim to help more users gain access to the information they need and the tools to analyse that information quickly and easily. Platforms like CyVerse are sufficiently powerful, and openly developed, that large-scale online software like COPO can run inside them, saving time and effort in getting these tools into the hands of users.
Now, it’s not just important for a biologist to understand genetics and biochemistry, but also to understand how to access and use abundant, large sets of data. However, as the Davey Group shows, there are computer scientists and bioinformaticians working hard to ensure that this sort of data is usable for everyone - whether you’re an expert bioinformatician or not. We want everyone to be data parasites, and proud of it!