Earlham Institute establishes the UK's first dedicated high-performance computing (HPC) cluster for the international data portal ‘CyVerse’, providing free, open-source genome analysis for big data research.
Genomics is increasingly a big data science, as now-commonplace high-throughput technologies support faster, cheaper data generation and analysis. This enables potentially exciting breakthroughs, as researchers can unearth previously hidden patterns and make new discoveries of biological significance.
However, the scientific community struggles to take full advantage of the data generated because of a lack of computing resource, appropriate support, and technical skills. Additionally, bioinformatics tools generated during research projects to test and validate biological hypotheses often remain limited to prototype form and can only be used by those with computational expertise.
Therefore, to undertake modern science when faced with a plethora of tools and datasets, researchers need to be able to efficiently store and access datasets, models, and analysis tools, ideally hosted in different global locations to facilitate international projects. This is where CyVerse can help.
CyVerse UK is an international collaboration between hardware and middleware engineers at EI, support staff in the Norwich Research Park Computing Infrastructure for Science (NRP CiS) team, the University of Arizona, the Texas Advanced Computing Center and Cold Spring Harbor Laboratory. It provides free, large-scale computing facilities and data storage designed for life scientists.
Erik van den Bergh, Lead Engineer of the CyVerse UK team, said: “Establishing the first CyVerse node outside of the US represents a vital hub in the UK for data analysis and management. CyVerse UK can provide free HPC facilities for all UK scientists as well as allowing integration of UK apps and pipelines into the wider international CyVerse ecosystem.
“CyVerse provides an intuitive web interface, Discovery Environment (DE), where scientists can upload data and run analyses. While this resource is hosted in the US, the DE can automatically run tools hosted in the CyVerse UK platform, giving geographical advantages to data access speed, analysis time, and data placement policy.”
CyVerse UK currently hosts two open-source apps and a new virtual machine environment. Gwasser (Ben Ward, Clark Group) is a statistics pipeline which performs Genome-Wide Association Studies for single phenotypes. Mikado (Luca Venturini, Swarbreck Group) is a lightweight Python pipeline to identify the optimal set of transcripts from multiple transcript assemblies. Both apps have been used for the analysis and recent publication of the allohexaploid wheat genome, a crop genome that is paramount in tackling the societal challenge of global food security.
The Polymarker pipeline, which creates efficient genome-specific SNP assays in wheat, will soon also be available to scientists, together with a modified ‘Tuxedo suite’ app developed by the University of Liverpool which executes a series of pipelines for RNA-seq analysis. CyVerse UK’s robust virtualisation platform will also provide back-end data services and web hosting for the COPO and Grassroots Genomics projects.
All tools are available through the Discovery Environment (de.iplantcollaborative.org). Full documentation can be found at cyverseuk.org.
The CyVerse UK node hardware and software environment has been set up and deployed by the core CyVerse UK team (Erik van den Bergh and Alice Minotto) in the Davey Group, Tim Stitt (Scientific Computing), and NBI Scientific Computing. The CyVerse UK project is a BBSRC-funded collaboration between EI, the University of Warwick, the University of Nottingham, and the University of Liverpool.