Feature selection and aggregation for antibiotic resistance GWAS in Mycobacterium tuberculosis: a comparative study

K.O. Reshetnikov, D.I. Bykova, K.V. Kuleshov, K. Chukreev, E.P. Guguchkin, V.G. Akimkin, A.D. Neverov, G.G. Fedonin

bioRxiv (Cold Spring Harbor Laboratory) (2023)

Abstract
Drug resistance (DR) remains a global healthcare concern. In contrast to other human bacterial pathogens, acquiring mutations in the genome is the main mechanism of drug resistance for Mycobacterium tuberculosis (MTB). For some antibiotics, the resistance of a particular isolate can be predicted with high confidence from the presence of specific mutations, but for others our knowledge of the resistance mechanisms is only moderate. Statistical machine learning (ML) methods are used in attempts to infer new genes implicated in drug resistance. These methods use large collections of isolates with known whole-genome sequences and resistance status for different drugs. However, resistance to drugs that are used together in one treatment regimen is highly correlated, which complicates the inference of causal mutations by traditional ML. Recently, several new methods have been suggested to deal with correlated response variables in training data. In this study, we applied the following methods to tackle the confounding effect of resistance co-occurrence in a dataset of approximately 13 000 complete genomes of MTB with characterized resistance status for 13 drugs: logistic regression with different regularization penalty functions, a polynomial-time algorithm for the best-subset selection problem (ABESS), and the “Hungry, Hungry SNPos” (HHS) method. We compared these methods by their ability to select known causal mutations for resistance to each particular drug and to avoid selecting mutations in genes that are known to be associated with resistance to other drugs. ABESS significantly outperformed the other methods, selecting more relevant sets of mutations. We also showed that aggregating rare mutations into features indicating changes in PFAM domains increased the quality of prediction, and that these features were predominantly selected by ABESS.
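The aggregation of rare mutations into domain-level features mentioned above can be illustrated with a minimal sketch. The abstract does not specify the procedure, so everything here is an assumption made for illustration: the function name `aggregate_rare_mutations`, the rarity threshold `max_count`, the mutation-to-domain mapping, and the placeholder mutation and domain identifiers are all hypothetical and are not taken from the authors' code.

```python
from collections import defaultdict


def aggregate_rare_mutations(isolates, mutation_domain, max_count=2):
    """Replace rare mutations with a binary 'domain changed' feature.

    isolates: dict mapping isolate id -> set of mutation ids.
    mutation_domain: dict mapping mutation id -> domain id
        (hypothetical mapping; a real one would come from domain annotation).
    max_count: a mutation carried by at most this many isolates is
        treated as rare (assumed threshold, not from the paper).
    Returns a dict mapping isolate id -> set of feature names.
    """
    # Count how many isolates carry each mutation.
    counts = defaultdict(int)
    for mutations in isolates.values():
        for mutation in mutations:
            counts[mutation] += 1

    features = {}
    for isolate, mutations in isolates.items():
        selected = set()
        for mutation in mutations:
            is_rare = counts[mutation] <= max_count
            if is_rare and mutation in mutation_domain:
                # Rare mutation inside a domain: collapse to the domain.
                selected.add("domain:" + mutation_domain[mutation])
            else:
                # Common mutation, or rare but unmapped: keep as is.
                selected.add(mutation)
        features[isolate] = selected
    return features


# Toy data with placeholder identifiers.
isolates = {
    "iso1": {"geneA_m1", "geneB_m1"},
    "iso2": {"geneA_m1", "geneB_m2"},
    "iso3": {"geneA_m1", "geneB_m3"},
    "iso4": set(),
}
mutation_domain = {
    "geneA_m1": "DOM_A",
    "geneB_m1": "DOM_B",
    "geneB_m2": "DOM_B",
    "geneB_m3": "DOM_B",
}
features = aggregate_rare_mutations(isolates, mutation_domain, max_count=2)
```

In this toy example, the three distinct `geneB` mutations each occur in a single isolate and would carry little signal individually, whereas the shared `domain:DOM_B` feature is present in three isolates and can be offered to a feature selection method alongside the common `geneA_m1` mutation.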
Impact statement
Due to the high significance of the problem, many studies in the past decade have aimed to predict drug susceptibility/resistance of MTB from its genotype. Most such methods were based on prior biological knowledge, e.g. consideration of mutations occurring in known genes involved in the metabolism of drugs. In our study, we estimated to what extent ML methods could extract de novo biologically relevant associations of mutations with resistance/susceptibility to drugs from large datasets of clinical MTB isolates. As a criterion of accuracy, we used the known, experimentally verified associations between mutations in MTB genes and the corresponding drugs. The most accurate of the benchmarked approaches assigned most of these known genes to the correct drugs. The result of feature selection was robust despite the presence of population structure with strong phylogenetic and geographic signals in the dataset. We also designed an original approach for aggregating rare mutations and demonstrated that it improved the classification accuracy of ML models. To our knowledge, this study is the first comparison of modern feature selection methods applied to genome-wide association studies (GWAS) of MTB drug resistance.

Data Summary
The dataset unifies characterized whole-genome sequences of M. tuberculosis from multiple studies [1–10]. Short Illumina reads are available in public repositories (SRA or ENA). Sample IDs, phenotypes and links to the source papers are listed in Table S1. The dataset and the source code can be downloaded from the GitHub repository:

Competing Interest Statement
The authors have declared no competing interest.

Abbreviations
MTB: Mycobacterium tuberculosis
DR: drug resistance
MDR: multidrug resistant
XDR: extensively drug resistant
WHO: World Health Organization
NGS: next-generation sequencing
PCR: polymerase chain reaction
GWAS: genome-wide association study
DA: direct association
ML: machine learning
PCA: principal component analysis
HMM: hidden Markov model
MCP: minimax concave penalty
SCAD: smoothly clipped absolute deviation
HHS: “Hungry, Hungry SNPos”
ABESS: polynomial-time algorithm for the best-subset selection problem

Keywords
antibiotic resistance GWAS, Mycobacterium tuberculosis, antibiotic resistance