Project

Comparative genomics of pathogenic bacteria

Omics techniques and bioinformatics analyses have become an integral part of modern biological investigation. In this dissertation, we focus on their application to the field of microbiology, more specifically to the domain of bacteria and their viruses (phages). The work presented can be broadly divided into three sections: i) the development of new bioinformatics methods made possible by the introduction of long read sequencing technologies, ii) the genomic and evolutionary characterization of bacteria relevant to clinical and environmental settings, and iii) an exploration of the determinants of interactions between bacteria and phages by combining omics and host range data analyses.

For the first section, we discuss two novel bioinformatics applications that make use of long read sequencing. These technologies enable the sequencing of DNA molecules that are thousands of base pairs (bp) long, and up to millions of bp. In chapter 2, we describe a method to tabulate the composition of synthetic DNA libraries that have been combinatorially assembled from individual DNA building blocks through “one-pot” assembly reactions. We show that nanopore sequencing enables a quality assessment of such libraries. The assessment is performed directly after the library is created in vitro and before it is used in downstream applications. It can highlight potential biases in the assembly reaction or issues with the upstream repository of individual DNA building blocks.
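The tabulation step described above can be illustrated with a minimal sketch. It assumes that each read has already been assigned to an ordered tuple of building blocks by an upstream matching step that is not shown here. The function name `tabulate_library`, the block names and the toy reads are all hypothetical and are not taken from the dissertation.

```python
from collections import Counter
from itertools import product


def tabulate_library(read_assignments, blocks_per_position):
    """Count observed block combinations and per-position block usage.

    read_assignments: iterable of tuples, one per read, naming the block
        detected at each position (hypothetical output of an upstream
        read-to-block matching step).
    blocks_per_position: list of lists, the blocks designed for each position.
    """
    variant_counts = Counter(tuple(r) for r in read_assignments)
    expected = set(product(*blocks_per_position))

    # Designed combinations never observed, and observed ones outside the design.
    missing = sorted(expected - set(variant_counts))
    unexpected = sorted(set(variant_counts) - expected)

    # Per-position usage, counted only over reads that match the design.
    position_usage = [Counter() for _ in blocks_per_position]
    for variant, n in variant_counts.items():
        if variant in expected:
            for pos, block in enumerate(variant):
                position_usage[pos][block] += n
    return variant_counts, missing, unexpected, position_usage


# Toy design: two promoters combined with two genes (hypothetical names).
design = [["P1", "P2"], ["G1", "G2"]]
reads = [("P1", "G1"), ("P1", "G1"), ("P1", "G2"), ("P2", "G1"), ("P1", "G1")]
counts, missing, unexpected, usage = tabulate_library(reads, design)
```

In this toy input, one designed combination is never observed and `P1` dominates the first position, which is the kind of assembly bias such a tabulation is meant to expose.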

Next, we turn to bacterial genome assembly, which is a critical computational step to reconstruct consensus genomes from sequencing data. Assemblies based on short read sequencing always fall short of reconstructing the full set of (closed) replicons in bacteria. These assemblies are referred to as “draft assemblies” as they necessitate further work to be fully completed. Large-scale sequencing efforts based on short read technologies have resulted in the accumulation of hundreds of thousands of bacterial draft genomes in public databases. In chapter 3, we introduce SASpector, a piece of software to compare draft with complete assemblies and identify the missing regions in the draft assembly. SASpector can be used to investigate the causes of the missingness, and the impact on feature annotation.
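As an illustration of what identifying missing regions involves, the sketch below computes the parts of a complete replicon that no draft contig aligns to. It is not SASpector's implementation. The function name `missing_regions` and the toy coordinates are hypothetical, and the alignment intervals are assumed to come from an external aligner.

```python
def missing_regions(replicon_length, aligned_intervals):
    """Return half-open [start, end) intervals of the complete replicon
    that are not covered by any draft-contig alignment.

    aligned_intervals: iterable of (start, end) pairs in replicon
        coordinates, half-open, possibly overlapping and unsorted.
    """
    gaps = []
    cursor = 0  # first position not yet known to be covered
    for start, end in sorted(aligned_intervals):
        # Clip alignments to the replicon boundaries.
        start = min(max(start, 0), replicon_length)
        end = min(max(end, 0), replicon_length)
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < replicon_length:
        gaps.append((cursor, replicon_length))
    return gaps


# Toy example: a 1000 bp replicon with three draft alignments.
gaps = missing_regions(1000, [(600, 900), (0, 300), (250, 400)])
missing_fraction = sum(end - start for start, end in gaps) / 1000
```

The resulting intervals could then be intersected with the feature annotation of the complete genome to see which genes are absent from the draft.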

In the second section, we apply a combination of short and long read sequencing techniques to various bacteria to study their genetics and phylogenomic placement. In chapter 4, we introduce hybrid assembly to fully reconstruct the genome of the Bacillus thuringiensis HER1410 strain, host of the model Betatectivirus virus Bam35. The combination of sequencing technologies enables us to close all the replicons of that isolate. This is followed by an extensive genomic analysis that highlights the presence of two megaplasmids and a structural variation analysis that reveals the presence of an integrative plasmidial prophage that appears to use flgK as its integration site.

In chapter 5, we analyze a longitudinal series of Burkholderia multivorans isolates that infected multiple cystic fibrosis patients in Belgian hospitals. In collaboration with the National Reference Centre for Burkholderia, we uncover the first known endemic strain from that species and show that the strain's genomic diversity between patients is much higher than expected. A combination of Illumina and Nanopore sequencing technologies allows us to tease apart in great detail the genomic variations between the different isolates and to reveal variations that were not captured by Illumina sequencing alone. Finally, the specific genetic background of the strain is shown to dictate in part the evolutionary trajectory of adaptation within the cystic fibrosis lung environment.

In chapter 6, we present a bacterial taxonomy study that stems from an rpoD-based screen previously developed in collaboration with our lab. The rpoD gene has a higher discriminative power than the 16S rDNA gene and allowed us to highlight new Pseudomonas species. In this work, 43 novel Pseudomonas species are described, significantly expanding the catalog of known species within that genus. The group organization of the genus is also discussed, and we propose to reclassify existing (non-type) strains. Finally, we propose a new segmentation of the P. putida group informed by the distribution of cyclic lipopeptide biosynthesis gene clusters.

In the third section of this thesis, we turn our attention to the interactions between phages and their bacterial hosts. We begin in chapter 7 with Pseudomonas virus PA5oct, a model virus with a large genome of 287 kbp, which was annotated more accurately than the previous state of the art by using RNA-Seq and mass spectrometry data. PA5oct does not appear to be organized in contiguous regions of temporal transcription, but some early features can be discriminated. Additionally, RNA-Seq enables us to annotate i) four non-coding DNA regions that are highly transcribed during the infection, and ii) six other regions of the phage genome that appear to be barely transcribed during the infection process, raising questions concerning their utility. Finally, we use a gene-sharing network approach to relate this isolate to the other known jumbo phages.

Next, we turn to the analysis of CRISPR-Cas, a known genetic defense system used by bacteria to mitigate phage infection. We show in chapter 8 that Pseudomonas aeruginosa strains equipped with CRISPR-Cas have a smaller genome on average than those without, and investigate how the population structure of the species confounds this relationship. This impact is then further stratified by considering co-located anti-CRISPR proteins on the genome. Interestingly, there is a paradoxical association between the presence of the system and the likelihood that the bacteria can be infected by diverse Pseudomonas phages. This observation is linked to a systematic depletion of other defense system genes in strains with CRISPR-Cas, and we show that these defense system genes are associated with genomic islands.

The population of P. aeruginosa has an extensive pangenome, which reflects the vast between-strain genomic diversity of that species. In chapter 9, we build models to predict phage susceptibility within P. aeruginosa by applying supervised machine learning techniques. We use the accessory gene clusters delineated during the pangenome analysis as “bag-of-genes” features and train classifiers using both black box and white box algorithms, including a linear SVM and Random Forests. We report an accuracy of 80% with an F1-score of 69% in the prediction of susceptibility using the linear SVM, and further propose to introspect the white box models for feature importance as a means to put forward candidate genes potentially linked to phage susceptibility.
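Two ingredients of this setup lend themselves to a short sketch: the “bag-of-genes” encoding and the accuracy and F1 metrics. No classifier is trained here, and the strain names, gene cluster names and labels are toy values that do not come from the dissertation. The helper names `bag_of_genes` and `accuracy_and_f1` are hypothetical.

```python
def bag_of_genes(strain_genes):
    """Encode each strain as a binary presence/absence vector over the
    sorted union of accessory gene clusters."""
    clusters = sorted(set().union(*strain_genes.values()))
    matrix = {
        strain: [1 if cluster in genes else 0 for cluster in clusters]
        for strain, genes in strain_genes.items()
    }
    return clusters, matrix


def accuracy_and_f1(y_true, y_pred, positive=1):
    """Compute accuracy and the F1-score of the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1


# Toy accessory genome content (hypothetical strains and gene clusters).
strains = {"s1": {"gc_a", "gc_b"}, "s2": {"gc_b"}, "s3": {"gc_a", "gc_c"}}
clusters, features = bag_of_genes(strains)

# Toy labels: 1 = susceptible to the phage, 0 = not susceptible.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0, 0, 0]
accuracy, f1 = accuracy_and_f1(y_true, y_pred)
```

With imbalanced classes, accuracy is lifted by the many correctly predicted negatives while F1 only reflects performance on the positive class, which is one reason the two metrics can differ noticeably.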

In the perspective chapter 10, we discuss the various machine learning (ML) approaches that have been applied to the investigation of phage-bacteria interactions and evaluate the type of data and predictive features that are available to train ML models. A three-layer ML model is also proposed, with each layer designed to reflect the biology of phage-bacteria interactions. This model could be applied in the future to predict phage susceptibility and produce ‘digital phagograms’.

Finally, in a second perspective chapter 11, we discuss the current limitations in the design of therapeutic phage cocktails as they are used in clinical trials and patient case series. The combination of phages into cocktails is often done empirically and without explicit design rules. However, given the demand for such phage products, the field is poised to take advantage of data-driven methods in the future, alongside the creation of phage banks and automated host-range assays. We describe here an implementation of association rule mining based on phage-bacteria host range datasets and show that this approach can detect positive and negative associations between phages, providing potential ground rules for the design of cocktail products.
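To make the idea of positive and negative associations concrete, the sketch below computes support, confidence and lift for every pair of phages in a host range table. It is a simplified pairwise version and not the implementation described in the chapter: full association rule mining also considers larger itemsets. The function name `pairwise_phage_rules` and the toy host range data are hypothetical.

```python
from itertools import combinations


def pairwise_phage_rules(host_range):
    """Compute support, confidence and lift for each pair of phages.

    host_range: dict mapping each bacterial strain to the set of phages
        that infect it (one "transaction" per strain).
    """
    n = len(host_range)
    phages = sorted(set().union(*host_range.values()))
    # Number of strains infected by each phage (at least 1 by construction).
    count = {p: sum(p in infected for infected in host_range.values()) for p in phages}

    rules = {}
    for a, b in combinations(phages, 2):
        both = sum(a in infected and b in infected for infected in host_range.values())
        support = both / n
        rules[(a, b)] = {
            "support": support,
            "confidence_a_to_b": both / count[a],
            "confidence_b_to_a": both / count[b],
            # lift > 1: the phages co-infect more often than expected by
            # chance; lift < 1: less often than expected.
            "lift": support / ((count[a] / n) * (count[b] / n)),
        }
    return rules


# Toy host range table (hypothetical strains and phages).
host_range = {
    "strain1": {"phiA", "phiB"},
    "strain2": {"phiA", "phiB"},
    "strain3": {"phiC"},
    "strain4": {"phiA", "phiC"},
}
rules = pairwise_phage_rules(host_range)
```

In a cocktail design setting, a pair with lift well below 1 covers largely different strains and may be complementary, whereas a pair with lift well above 1 has overlapping host ranges.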

Taken together, the works presented in this dissertation highlight new opportunities in the field of microbial genomics. First, we focused our attention on long read sequencing and the new bioinformatics analyses it enables. We then looked at the ways in which this technology can be used to provide new insights into the genomes of microbes. Finally, we shifted to phage-bacteria interactions and showed associations between phage susceptibility and genomic features.

Date: 29 Sep 2016 → 31 Dec 2021
Keywords: Bioinformatics
Disciplines: Scientific computing, Bioinformatics and computational biology, Public health care, Public health services, Genetics, Systems biology, Molecular and cell biology, Microbiology, Laboratory medicine, Biomaterials engineering, Biological system engineering, Biomechanical engineering, Other (bio)medical engineering, Environmental engineering and biotechnology, Industrial biotechnology, Other biotechnology, bio-engineering and biosystem engineering
Project type: PhD project