HDR Final Thesis: Sean Buckley - University of the Sunshine Coast, Queensland, Australia

HDR Final Thesis: Sean Buckley

We would like to invite you to attend the Final Thesis of Sean Buckley, a Doctor of Philosophy candidate in the School of Health and Behavioural Sciences.

Thesis Title: Using the random forest machine learning algorithm to predict harmful bacterial.

Abstract: Group A Streptococcus (GAS) is a strictly human pathogen that is responsible for more than half a million deaths annually. GAS-related clinical outcomes range in severity from asymptomatic carriage and mild self-limiting presentations (tonsillitis and 'school sores') to life-threatening invasive infections ('flesh-eating' disease, toxic shock, sepsis, and scarlet fever) and post-infection sequelae (including kidney failure and rheumatic heart disease). GAS is capable of expressing an arsenal of virulence genes as it survives and thrives in the diverse range of human tissues encountered throughout infection. Unlike many other bacteria, which engage multiple RNA polymerase sigma factors, GAS modulates its growth-phase gene expression globally through transcription response regulators (RRs). RRs control the initiation of transcription, often of multiple genes implicated in GAS virulence, forming complex and unresolved transcriptional regulatory networks. The aim of my research is to advance the understanding of GAS-host interactions and GAS virulence in the context of GAS transcriptional regulatory networks.

The gold standard of GAS molecular genotyping, the emm-type, is based on the variable 5' end of the emm gene. Along with mrp and enn, emm makes up the variable genes of the regulon of a key RR (that is, mga), which has been implicated in GAS virulence. Given the importance of GAS RRs, we hypothesised that variation in the RRs may correlate with the genomic traits (including emm-type, emm-subtype, country of origin, clinical outcomes, tissue preference, and propensity to cause invasive disease). We set out to quantify the variation in the DNA sequence of the response regulator genes (and the closely related family of two-component systems, TCSs), and to qualify the associated allele types. Using phylogeny and other indices of concordance, we then measured associations with the tested genomic traits. Finally, we sought to qualify the utility of the supervised random forest machine learning algorithm in predicting the tested genomic traits.
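
For readers unfamiliar with concordance indices, the sketch below illustrates one commonly used measure, the Wallace coefficient, which asks how often two isolates that share a type under one scheme also share a type under another. The abstract does not say which indices were used, so this is only an assumed example, and the allele labels and emm-types in it are made up for illustration.

```python
from collections import Counter


def pairs(n):
    """Number of unordered pairs that can be drawn from n items."""
    return n * (n - 1) // 2


def wallace(labels_a, labels_b):
    """Wallace coefficient W(A -> B).

    Of all pairs of isolates that share a type under scheme A,
    return the fraction that also share a type under scheme B.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both label lists must describe the same isolates")
    # Pairs of isolates grouped together by scheme A.
    same_a = sum(pairs(c) for c in Counter(labels_a).values())
    # Pairs grouped together by both schemes at once.
    same_both = sum(pairs(c) for c in Counter(zip(labels_a, labels_b)).values())
    if same_a == 0:
        raise ValueError("no pair of isolates shares a type under scheme A")
    return same_both / same_a


# Hypothetical data: one RR allele label and one emm-type per isolate.
rr_alleles = ["a1", "a1", "a1", "a2", "a2", "a3"]
emm_types = ["emm1", "emm1", "emm1", "emm12", "emm12", "emm12"]

allele_to_emm = wallace(rr_alleles, emm_types)  # does the allele predict emm-type?
emm_to_allele = wallace(emm_types, rr_alleles)  # does emm-type predict the allele?
print(allele_to_emm, emm_to_allele)
```

The coefficient is directional: in this toy data, sharing an RR allele always implies sharing an emm-type, while the reverse holds for only some pairs.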

The findings of this project, which have been summarised in three articles published in international peer-reviewed journals, include the following. We have characterised the distribution and diversity of 14 TCSs and 35 RRs and their associated intergenic regions, and identified strong associations between these alleles and GAS emm-type. We developed a machine learning (ML) workflow in which the random forest algorithm was applied to an RR-based GAS typing system (derived from 53 RRs) to predict the six genomic traits. With this workflow we were able to predict all of the genomic traits with high accuracy except for clinical outcome and tissue preference, which is not unexpected given the complexity of GAS-host interactions. Moreover, we found that the workflow is useful for discovering and explaining rare anomalies in the genes of the mga regulon, and we developed novel biological models explaining the plasticity of the GAS mga regulon. We suspect that adding judiciously chosen human genes to the dataset could increase the accuracy with which the clinical outcome of GAS infection is predicted. In conclusion, our workflow represents a template for inferring other untested GAS genomic traits, and ML has allowed us to interpret the biology of GAS and propose new evolutionary models.

Bio: In a previous chapter of his career, Sean applied the trade learnt during his Bachelor of Materials Engineering (UQ 1992). More recently, he commenced his transition to a 'carbon-based' world with his Bachelor of Biomedical Science at USC. He subsequently completed his Honours year with the Ventura lab (USC), investigating the phylogeny of G protein-coupled receptors using spiny lobster transcriptomes. The initial scope of his PhD focused on the characterisation of response regulator transcription factors in the genomes of the human pathogen group A Streptococcus. Serendipitously, this has expanded to include the burgeoning field of machine learning, under the tutelage of Professor Robert Harvey and Dr Zack Shan.

We look forward to seeing you there!