Grassroots Genomics
The Grassroots Genomics project at EI is a user-driven platform for integrating and reconciling wheat genomic information.
Support: grasshelpdesk@earlham.ac.uk
Grants:
BBS/E/T/000PR9783: Data access and analysis (Work Package 4 of Designing Future Wheat)
BB/N023420/1: Federating access to wheat data services for efficient genome-specific marker design
BB/L024144/1: CerealsDB: A community resource for wheat genomics
BB/M025519/1: Using field pathogenomics to study wheat yellow rust dispersal and population dynamics at a national and international scale
Integrative research requires extensive multi-level approaches to enrich and expose data and workflows so that informatics infrastructures can process them effectively. The Grassroots Genomics project represents EI’s contribution to the international Wheat Information System (WheatIS) to consolidate data and analyses, facilitating consistent approaches to generating, processing and disseminating public wheat datasets. The Grassroots Genomics platform runs on a lightweight set of middleware services, called the Grassroots Infrastructure, which comprises: a data management layer to provide structure to unstructured filesystems; interfaces to interact with local or cloud-based analysis platforms; a search layer to provide multi-faceted metadata and literature querying; and a web server layer to deliver content and provide access to public programmatic interfaces.
The Grassroots infrastructure framework can be run locally or packaged in virtual containers and deployed on a variety of hardware. It is therefore a decentralised system: information generators retain control over their resources, while interconnected resources can access each other consistently. EI has an extensive National Capability in e-infrastructure to provide scientific computing hardware to the UK research community and is therefore well positioned to build a point of access to previously disparate resources to serve wheat breeders, biologists and bioinformaticians. Coupling the Grassroots Genomics project with BBSRC-funded efforts to bring Galaxy and CyVerse UK to EI provides community-standardised methodologies for data integration, interpretation and discovery.
Grassroots Genomics uses iRODS to track data on a filesystem as objects rather than as files and folders. We use the iRODS APIs to abstract data search and data access functionality, so that data are exposed and shared consistently for use in downstream analyses. iRODS instances are designed to be federated with one another; for example, the DSpace instance under development at INRA also uses iRODS. Federation at the data level across geographical and political boundaries is therefore facilitated out-of-the-box.
Grassroots Genomics uses a standard Apache httpd webserver to serve web content to users, as well as an Apache Tomcat Java servlet container to host Java enterprise web applications for the platform. We have developed a single simple API as an Apache module so that the webserver can interact consistently with the range of Grassroots services, such as BLAST capability, iRODS data management, and ElasticSearch integration. We have also developed Grassroots services that can search third-party resources, such as the Ensembl, Agris, F1000, and BASE repositories.
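The idea of one API in front of many services can be illustrated with a minimal sketch. It is written in Python for readability rather than as an Apache module, and the service names, the "service" and "parameters" keys, and the reply fields are all hypothetical; they are not the actual Grassroots request schema.

```python
import json
from typing import Callable, Dict

# Hypothetical handlers: a real service would run BLAST or query an index.
ServiceHandler = Callable[[dict], dict]


def blast_service(params: dict) -> dict:
    # Placeholder reply; no BLAST job is actually started here.
    return {"service": "blast", "status": "submitted",
            "database": params.get("database")}


def search_service(params: dict) -> dict:
    # Placeholder reply; no search index is actually queried here.
    return {"service": "search", "status": "ok", "term": params.get("term")}


# Registry mapping a service name (assumed) to its handler.
SERVICES: Dict[str, ServiceHandler] = {
    "blast": blast_service,
    "search": search_service,
}


def handle_request(body: str) -> str:
    """Route one JSON request to the named service and return a JSON reply."""
    try:
        request = json.loads(body)
    except json.JSONDecodeError:
        return json.dumps({"status": "error", "reason": "invalid JSON"})
    if not isinstance(request, dict):
        return json.dumps({"status": "error", "reason": "request must be an object"})
    name = request.get("service")
    handler = SERVICES.get(name)
    if handler is None:
        return json.dumps({"status": "error", "reason": f"unknown service: {name}"})
    return json.dumps(handler(request.get("parameters") or {}))
```

Because every service sits behind the same entry point, adding a new capability in this sketch means registering one more handler, without changing how clients call the server.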
The collaborative approach to building the international WheatIS infrastructure means that WheatIS is actually a federated network of nodes. Each node might house different data, metadata or analytical processes, but can be federated into the global network through consistent shared APIs on various middleware levels. The Grassroots Genomics platform can therefore interact with EI’s National Capability HPC infrastructure through iRODS, as well as the WheatIS node in INRA (France), the CerealsDB at Bristol University (UK), and Ensembl/EnsemblGenomes at the EBI (UK).
To give quick access to information, we use metadata indexing to support fast, federated querying. We index user-relevant fields of iRODS iCAT metadata catalogues in order to quickly find and access Grassroots Genomics data objects, as well as content-mined literature text via Solr. We can also index content from third-party databases, such as CerealsDB, via our Grassroots APIs, so that database owners retain access to and control over their data.
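Indexing only the user-relevant metadata fields can be sketched with a small in-memory inverted index. This is an illustration of the general technique, not the Solr or iCAT implementation; the field names and object identifiers are assumptions chosen for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set

# Hypothetical field names; the text does not list which attributes are indexed.
INDEXED_FIELDS = ("species", "cultivar", "data_type")


def build_index(objects: Dict[str, Dict[str, str]],
                fields: Iterable[str] = INDEXED_FIELDS) -> Dict[str, Set[str]]:
    """Map each lower-cased token in the chosen fields to the objects holding it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for object_id, metadata in objects.items():
        for field in fields:
            value = metadata.get(field)
            if value is None:
                continue  # fields outside the chosen set are never indexed
            for token in value.lower().split():
                index[token].add(object_id)
    return index


def search(index: Dict[str, Set[str]], query: str) -> List[str]:
    """Return the sorted ids of objects that match every token in the query."""
    tokens = query.lower().split()
    if not tokens:
        return []
    hits = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        hits &= index.get(token, set())
    return sorted(hits)
```

A lookup then touches only the index rather than every metadata record, and fields that were not selected for indexing (an owner name, for instance) are not searchable.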
We use ElasticSearch to filter and aggregate metadata, which allows us to search the already indexed data across WheatIS nodes. Its distributed “shard” model supports federation of search functionality, allowing WheatIS nodes to host and seamlessly federate their own ElasticSearch shards. ElasticSearch integration enables Grassroots Genomics users to submit a search term to multiple indexed repositories, resulting in a detailed and rich search platform.
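As a rough sketch of filtering and aggregating in one request, the function below builds an ElasticSearch query body as a plain dictionary. The query DSL constructs (bool, multi_match, term, terms aggregation) are standard ElasticSearch, but the field names, including the "node" field used to identify a WheatIS node, are assumptions for illustration.

```python
from typing import List, Optional


def build_search_body(term: str, fields: List[str],
                      node: Optional[str] = None, size: int = 10) -> dict:
    """Build a query body: full-text match, optional node filter, per-node counts."""
    bool_query: dict = {
        "must": [{"multi_match": {"query": term, "fields": fields}}]
    }
    if node is not None:
        # "node" is a hypothetical field; an exact-match filter like this
        # expects it to be mapped as a keyword.
        bool_query["filter"] = [{"term": {"node": node}}]
    return {
        "size": size,
        "query": {"bool": bool_query},
        # Count matching documents per node so results can be grouped by source.
        "aggs": {"hits_per_node": {"terms": {"field": "node"}}},
    }
```

The returned dictionary could be serialised to JSON and posted to an index's search endpoint; leaving out the node argument searches every indexed repository at once.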
A vital requirement for Grassroots Genomics is to facilitate data-to-analysis infrastructure so that researchers do not have to funnel data through their own networks and storage media. This “Platform as a Service” (PaaS) architecture forms the basis of many cloud solutions for bioinformatics analysis, and the WheatIS project follows these conventions to deliver software to users.
Grassroots currently supports running analysis jobs on local hardware and managed HPC clusters via the DRMAA library, to which the commonly-used schedulers (SLURM, LSF, PBS, SGE) conform. This makes deployment much easier, facilitating the setup of new Grassroots nodes that are compliant with the WheatIS network, and providing a solid basis for Grassroots Genomics HPC requirements. We currently have two active projects at EI to deploy and maintain Galaxy and CyVerse instances on EI’s HPC hardware. As such, the WheatIS network will be able to exploit the APIs of both platforms to enable data transfer and workflow initiation, enabling Grassroots Genomics users to access and analyse a huge array of datasets and pipelines in one place.
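Scheduler-independent submission through DRMAA might look like the sketch below. It assumes the third-party drmaa-python package and a DRMAA-enabled scheduler are installed; the BLAST paths and database name are placeholders, and this is not the Grassroots implementation itself.

```python
from typing import List


def blast_command(query_fasta: str, database: str, output_path: str) -> List[str]:
    """Build a blastn command line with tabular output (placeholder paths)."""
    return ["blastn", "-query", query_fasta, "-db", database,
            "-outfmt", "6", "-out", output_path]


def submit_with_drmaa(command: List[str]) -> str:
    """Submit the command through DRMAA and block until the job finishes.

    Assumption: drmaa-python is installed and a scheduler is configured.
    """
    import drmaa  # imported lazily so blast_command works without a scheduler

    with drmaa.Session() as session:
        template = session.createJobTemplate()
        template.remoteCommand = command[0]
        template.args = command[1:]
        job_id = session.runJob(template)
        info = session.wait(job_id, drmaa.Session.TIMEOUT_WAIT_FOREVER)
        session.deleteJobTemplate(template)
        return f"job {job_id} exited with status {info.exitStatus}"
```

The same submit function would apply unchanged whichever DRMAA-conformant scheduler sits underneath, which is the portability the paragraph above describes.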
International Wheat Genome Sequencing Consortium
Ensembl plants - TGACv1 sequence
Grassroots provides a versatile data repository that comprises a range of open datasets (some under the Toronto licence) freely available for researchers and the public, as part of the Designing Future Wheat programme:
Grassroots also provides analytical services, such as our large-scale BLAST service over a range of wheat genomic resources, including the recently released TGACv1 genome and 5 additional elite genomes with relevance to breeding:
Grassroots Genomics BLAST Search
Marker assisted breeding is enabled through services such as Polymarker, allowing researchers to design primers against a range of available wheat genomes:
Develop genome specific primers with Polymarker
The Grassroots infrastructure is 100% open source and all our code can be found on our GitHub:
Grassroots Genomics GitHub
Leonelli S., Davey R. P., Arnaud E., Parry G., Bastow R. Nature plants (2017) 3 17086 doi:10.1038/nplants.2017.86
Wilkinson PA, Winfield MO, Barker GLA, et al. BMC Bioinformatics. 2016;17:256. doi:10.1186/s12859-016-1139-x.
A full explanation of the technology that comprises the Grassroots APIs and infrastructure can be found on our GRASSROOTS website.
URGI, INRA
WheatIS node partner
Wheat Initiative Wheat Information System Expert Working Group
WheatIS EWG partner consortium
WheatIS EWG Collaborators
List of WheatIS EWG collaborators
Bristol Wheat Genomics
University of Bristol group working on CerealsDB and other wheat resources
In recent years, there has been a revolution in the generation of genomic data for cereal crops, especially wheat. Our goal is to engage the community of wheat researchers, from breeders to bioinformaticians, in generating, evaluating and integrating wheat data.
Grassroots Genomics aims to connect data generators and data users, providing an information-rich data sharing and analysis platform that will enhance the value of available wheat genomic resources. We are committed to maintaining and promoting the principles of Open Data to enable discoveries and to enhance the value of integrative research, and we work with wheat communities and technologists to deliver a coordinated and federated infrastructure.