Prediction and Analysis of the Disease Association of Single Amino Acid Polymorphisms

Chinese Abstract

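Diseases are often associated with genomic variations such as single nucleotide polymorphisms (SNPs) and copy number variations (CNVs). The rapid development of high-throughput technologies has generated a large amount of unannotated variation data, so mining and predicting disease-associated genomic variations is currently a very important topic in bioinformatics research. Given the particular importance of single amino acid polymorphisms (SAPs) among SNPs, computational prediction of their disease association has been pursued extensively. However, existing algorithms tend to rely on a limited set of predictive attributes, and most simply apply machine learning algorithms off the shelf. On the one hand, this project explored a series of new attributes, including betweenness in the protein interaction network and enrichment scores for numerous KEGG pathways; on the other hand, it also tried partitioning the dataset appropriately and then training a separate machine learning classifier on each part. The results show that both improvements can raise classifier accuracy: the former achieved about 80% accuracy in cross-validation using the nearest neighbor method, and the latter, using support vector machines, improved accuracy by about 3.7 percentage points over training without partitioning. This lays a good foundation for developing high-accuracy prediction software. In addition, mining disease-related subsets from CNV data is a rapidly emerging research hotspot. As an important extension of this project, we used array comparative genomic hybridization to identify a set of CNVs in the GK rat, a model of type 2 diabetes (T2D), and used bioinformatics methods to prioritize 16 protein-coding genes and 2 small RNA genes strongly suspected to be associated with T2D, for experimental validation.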

English Abstract

Diseases are often associated with genomic variations such as single nucleotide polymorphisms (SNPs) and copy number variations (CNVs). The rapid development of high-throughput technologies has generated a huge amount of unannotated variation data, so mining and predicting disease-associated variations has become a very important research challenge in bioinformatics. Because of the particular importance of single amino acid polymorphisms (SAPs, a type of SNP), computational prediction of their disease association has been widely studied. However, previous methods often relied on a limited number of attributes and were frequently straightforward applications of existing machine learning approaches. In this project, we explored a set of novel attributes, including betweenness derived from the protein-protein interaction network and enrichment scores for various KEGG pathways, and we also partitioned the dataset appropriately before training a separate classifier on each part. The results demonstrate that both approaches improved the performance of the trained classifiers: the former achieved an accuracy of about 80% in cross-validation using the nearest neighbor algorithm, and the latter, using support vector machines, increased accuracy by about 3.7 percentage points compared with training without dataset partitioning. These results provide a good basis for developing high-accuracy prediction software. In addition, identifying disease-related variants among CNVs has rapidly become another active research focus. As an extension of this project, we identified a set of CNVs in the GK rat, a type 2 diabetes (T2D) model, using array-based comparative genomic hybridization (aCGH). Subsequent bioinformatics analysis prioritized 16 protein-coding genes and 2 microRNAs as strong candidates for association with T2D, which can be subjected to experimental validation.
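
To illustrate the betweenness attribute mentioned above, the following is a minimal sketch of how the betweenness of each node could be computed on an unweighted, undirected interaction network using Brandes' algorithm. It is not the implementation used in this project: the function name `betweenness`, the adjacency-dictionary input format, and the protein identifiers `P1` to `P4` are hypothetical and chosen only for illustration.

```python
from collections import deque


def betweenness(adjacency):
    """Unnormalized betweenness for an unweighted, undirected graph.

    `adjacency` maps each node to a list of its neighbours. It is assumed
    to be symmetric and to contain every node as a key (an assumption of
    this sketch, not something stated in the abstract).
    """
    scores = {node: 0.0 for node in adjacency}
    for source in adjacency:
        # Breadth-first search from `source`, counting shortest paths.
        order = []
        predecessors = {node: [] for node in adjacency}
        path_count = {node: 0 for node in adjacency}
        distance = {node: -1 for node in adjacency}
        path_count[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if distance[w] < 0:  # first visit to w
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:  # v lies on a shortest path to w
                    path_count[w] += path_count[v]
                    predecessors[w].append(v)
        # Accumulate pair dependencies from the farthest nodes back to the source.
        dependency = {node: 0.0 for node in adjacency}
        for w in reversed(order):
            for v in predecessors[w]:
                dependency[v] += path_count[v] / path_count[w] * (1 + dependency[w])
            if w != source:
                scores[w] += dependency[w]
    # Each unordered pair of endpoints was counted twice in an undirected graph.
    return {node: score / 2 for node, score in scores.items()}


# Hypothetical toy network: a chain P1 - P2 - P3 - P4.
toy_network = {
    "P1": ["P2"],
    "P2": ["P1", "P3"],
    "P3": ["P2", "P4"],
    "P4": ["P3"],
}
toy_scores = betweenness(toy_network)
```

In this toy chain the two inner nodes each lie on two shortest paths between other nodes, while the end nodes lie on none, so the inner nodes receive the higher betweenness. Such a score could then serve as one numeric attribute of the protein carrying a given SAP.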