Transforming metagenome assembly from long reads with metaMDBG

MetaMDBG allows more efficient assembly of the highly accurate long reads produced by PacBio HiFi sequencing. The work was led by Dr Gaëtan Benoit, a former postdoc at the Earlham Institute, under the guidance of Dr Chris Quince and in collaboration with Dr Rayan Chikhi at Institut Pasteur.

Metagenomics is the study of the collective genetic material of microorganisms in a sample. It allows researchers to analyse the diversity and function of microbes in contexts ranging from soil to the human gut, without having to isolate and culture individual species.

Long-read sequencing has dramatically improved metagenomics by enabling more complete and accurate assembly of microbial genomes from bulk samples.

With long-read sequencing, researchers can read much longer fragments of DNA, ranging from thousands to hundreds of thousands of base pairs. Short-read sequencing, by comparison, typically produces reads that are a few hundred base pairs in length.

Assembling genomes from sequenced DNA fragments is a bit like solving a puzzle. Long DNA fragments function like large puzzle pieces: they are easier than small pieces to place in the correct order, especially in complex regions with repetitive sequences.
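The puzzle analogy can be made concrete with a toy example. The Python sketch below is purely illustrative: the "genome" and both fragments are made up, and real assemblers do not place reads by exact string matching. It simply counts how many positions a fragment could fit. A short fragment that falls inside a repeat fits in several places, while a longer fragment that spans the repeat and its flanking sequence fits in only one.

```python
def count_placements(genome, fragment):
    """Count every position where `fragment` matches `genome` exactly."""
    size = len(fragment)
    return sum(
        1
        for i in range(len(genome) - size + 1)
        if genome[i:i + size] == fragment
    )


# Hypothetical toy genome in which the repeat "ACACAC" occurs twice.
genome = "GGT" + "ACACAC" + "TTG" + "ACACAC" + "CGA"

short_piece = "ACAC"        # lies entirely inside the repeat
long_piece = "GTACACACTT"   # spans the first repeat plus its flanks

short_hits = count_placements(genome, short_piece)
long_hits = count_placements(genome, long_piece)

print(f"short piece fits in {short_hits} places")
print(f"long piece fits in {long_hits} place(s)")
```

In this toy case the short piece is ambiguous and the long piece is not, which is the intuition behind long reads being easier to assemble through repetitive regions.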

There are two main long-read sequencing platforms: Oxford Nanopore Technologies and Pacific Biosciences (PacBio).

“Nanopore sequencing is cheaper and reads are easier to generate but, until recently, had a high error rate,” explains Chris.

PacBio HiFi sequencing, on the other hand, generally produces longer reads with significantly higher accuracy. The choice between the two usually depends on the project's needs, including factors such as read length, accuracy, cost and application.

Dr Chris Quince, Group Leader at the Earlham Institute and Quadram Institute

Scaling-up metagenome reconstruction

Assembly is a significant computational challenge in genomics. Solving it requires advanced programming skills and substantial computational resources.

Chris is leading the High-resolution Microbiomics Group, based at the Earlham Institute and nearby Quadram Institute. His team is combining statistical bioinformatics with technological developments in sequencing to characterise microbial communities. Until 2023, Gaëtan Benoit worked as a postdoc in his group, developing a dedicated assembler for metagenomes.

Gaëtan recalls that, when he was recruited, the one thing he didn’t want to work on was a long-read assembler for metagenomics.

“In my mind the assembly field was super-competitive,” he says. Yet, because most people have been focusing on assembling the human genome, he has been able to address an important gap in the field of metagenome assembly.

Gaëtan used assembly algorithms created by Rayan Chikhi, his current supervisor at Institut Pasteur, as a starting point. “de Bruijn graphs are a computationally efficient way to represent overlaps and relationships between reads,” Gaëtan says.

He developed the metaMDBG tool by combining de Bruijn graphs with other strategies to deal with vast amounts of data and huge numbers of possible overlaps between reads. “It was a lot of work, but I like coding!” he adds.
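To give a flavour of the idea, here is a minimal textbook-style sketch of a de Bruijn graph in Python. It is not metaMDBG's implementation, and the reads, the value of k and the function names are all invented for illustration. Each read is cut into overlapping k-mers, and each k-mer becomes an edge between its first and last k-1 letters, so overlaps between reads are captured without comparing every read against every other read.

```python
from collections import defaultdict


def build_de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in a read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unambiguous_path(graph, start):
    """Extend a sequence from `start` while each node has exactly one successor."""
    path = start
    node = start
    visited = {start}
    while len(graph.get(node, ())) == 1:
        (node,) = graph[node]
        if node in visited:  # stop rather than loop forever on a cycle
            break
        visited.add(node)
        path += node[-1]
    return path


# Three made-up overlapping reads from the toy sequence ATGGCGTGCA.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = build_de_bruijn_graph(reads, k=4)
contig = walk_unambiguous_path(graph, "ATG")
print(contig)
```

Walking the graph from the first node stitches the three reads back into a single sequence. Real metagenomes contain repeats, sequencing errors and many closely related strains, which create branches in the graph; handling those at scale is where the additional strategies come in.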

When Gaëtan and colleagues tested metaMDBG on a range of PacBio HiFi metagenome data sets, they found that it was up to 12 times faster than two other state-of-the-art assemblers and required between one-tenth and one-thirtieth of the memory [1]. Importantly, it also gave better results: the researchers were able to retrieve more near-complete metagenome-assembled genomes, particularly from more complex samples.

“metaMDBG works particularly well on seawater and soil samples that contain more diverse communities, both in terms of species and strains,” Gaëtan explains.

The assembler is being used as part of the Earlham Institute’s strategic research programme, Decoding Biodiversity. In collaboration with the UK Centre for Ecology & Hydrology (UK CEH) and the Quadram Institute, Chris’ team aims to profile the genomic diversity of soil for the first time.

This information is crucial for understanding which soil microbes influence greenhouse gas emissions and provide nutrients to plants, and how they do so. “Most of what is in soil is unknown,” says Chris. “We are reconstructing completely new genomes.”

Image: metaMDBG works particularly well on samples that contain diverse communities, such as soil.

Adapting for nanopore sequencing

Gaëtan has recently adapted the assembler for long reads generated by Oxford Nanopore Technologies, and expects that this version of the assembler, nanoMDBG, will have a big impact on advancing metagenomic research.

Improvements in nanopore sequencing, including in accuracy, read length and throughput, mean that the assembler can achieve results similar to those obtained with PacBio HiFi reads.

Microbial communities often contain multiple strains of the same species, and these strains can have very different functions. “The efficiency of the assembler allows us to work with bigger data sets and to start thinking about co-assembling data sets and retrieving strain-level genomes,” Chris says.

Co-assembling samples will allow the team to examine changes in microbial diversity over space and time, while resolving genomes at the strain level will provide new insights into microbial community dynamics and pathogen evolution.

metaMDBG is open source, free to use and easy to install from GitHub. nanoMDBG will be available to the community soon.

Authored by Monica Hoyos Flight, writing for the Earlham Institute.
