Identifying genetic determinants of complex phenotypes from whole genome sequence data

Long, George S; Hussen, Mohammed; Dench, Jonathan; Aris-Brosou, Stéphane

dc.contributor.author	Long, George S
dc.contributor.author	Hussen, Mohammed
dc.contributor.author	Dench, Jonathan
dc.contributor.author	Aris-Brosou, Stéphane
dc.date.accessioned	2019-06-16T04:02:21Z
dc.date.available	2019-06-16T04:02:21Z
dc.date.issued	2019-06-10
dc.date.updated	2019-06-16T04:02:22Z
dc.description.abstract	Background: A critical goal in biology is to relate the phenotype to the genotype, that is, to find the genetic determinants of various traits. However, while simple monofactorial determinants are relatively easy to identify, the underpinnings of complex phenotypes are harder to predict. While traditional approaches rely on genome-wide association studies based on Single Nucleotide Polymorphism data, the ability of machine learning algorithms to find these determinants in whole proteome data is still not well known.
Results: To better understand the applicability of machine learning in this case, we implemented two such algorithms, adaptive boosting (AB) and repeated random forest (RRF), and developed a chunking layer that facilitates the analysis of whole proteome data. We first assessed the performance of these algorithms and tuned them on an influenza data set, for which the determinants of three complex phenotypes (infectivity, transmissibility, and pathogenicity) are known based on experimental evidence. This allowed us to show that chunking improves runtimes by an order of magnitude. Based on simulations, we showed that chunking also increases the sensitivity of the predictions, reaching 100% with as few as 20 sequences in a small proteome as in the influenza case (5k sites), but may require at least 30 sequences to reach 90% on larger alignments (500k sites). While RRF has less specificity than random forest, its specificity was never below 50%, and RRF sensitivity was significantly higher at smaller chunk sizes. We then used these algorithms to predict the determinants of three types of drug resistance (to Ciprofloxacin, Ceftazidime, and Gentamicin) in a bacterium, Pseudomonas aeruginosa. While both algorithms performed well in the case of the influenza data, results were more nuanced in the bacterial case, with RRF making more sensible predictions, with smaller error rates, than AB.
Conclusions: Altogether, we demonstrated that ML algorithms can be used to identify genetic determinants in small proteomes (viruses), even when trained on small numbers of individuals. We further showed that our RRF algorithm may deserve more scrutiny, which should be facilitated by the decreasing costs of both sequencing and phenotyping of large cohorts of individuals.
dc.identifier.citation	BMC Genomics. 2019 Jun 10;20(1):470
dc.identifier.uri	https://doi.org/10.1186/s12864-019-5820-0
dc.identifier.uri	https://doi.org/10.20381/ruor-23558
dc.identifier.uri	http://hdl.handle.net/10393/39311
dc.language.rfc3066	en
dc.rights.holder	The Author(s)
dc.title	Identifying genetic determinants of complex phenotypes from whole genome sequence data
dc.type	Journal Article

Files

Original bundle

Name: 12864_2019_Article_5820.pdf
Size: 2.19 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 0 B
Format: Item-specific license agreed upon to submission

Collections

uOttawa authored publications from BioMed Central