Research on Biomarker Screening Methods and Strategies Based on Bayesian Variable Selection

Chinese Abstract

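High-throughput "omics" data contain disease-related biomarkers and offer unprecedented opportunities for studying the mechanisms of disease onset and progression. However, discovering biomarkers in high-dimensional, high-throughput omics data remains a central and difficult problem in biological data analysis. This project addresses the limitations of the filter methods (e.g., analysis of variance, rank-sum test) and wrapper methods (e.g., LASSO, support vector machines) commonly used for biomarker screening. It proposes a strategy that screens biomarkers within a Bayesian variable selection framework by integrating information on the internal structure of the data. The focus is on exploring and building a platform that screens biomarkers at the gene level by incorporating the specific structural information contained in the data at hand (e.g., the association network among genes), so as to further clarify the effect and value of this "fusion" screening strategy for biomarker discovery. Successful completion of this project will provide important methodology and computational tools for translational medicine and molecular biology, and will markedly promote the development and implementation of "personalized medicine".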

English Abstract

Biomarker discovery using high-throughput omics data of all types provides a great opportunity for the effective diagnosis, treatment, and prevention of complex diseases. Nonetheless, the challenge lies in finding biomarkers in high-dimensional omics data sets, which usually have relatively small sample sizes. This challenge is often called the "large p, small n" problem. Our literature review suggests that the most widely used methods for biomarker identification are unsatisfactory. First, selection via univariate testing (e.g., ANOVA and the rank-sum test) ignores the correlational or regulatory relationships between genes, and although many p-value adjustment schemes have been proposed, adjusting for multiple testing is neither straightforward nor explicit. Second, integrative selection methods (e.g., LASSO and support vector machines) can identify biomarkers at a global scope, but they usually work like a "black box" and provide little interpretability. As a solution, we propose a Bayesian variable selection (BVS) strategy for biomarker discovery in which informative prior distributions are used to produce meaningful selection results. The main aim of this project is to develop a methodology for constructing and formulating informative priors from the current study data. We will further evaluate the validity, accuracy, and efficiency of BVS for biomarker identification using both simulated and real breast cancer datasets. This research will provide an integrative biomarker discovery strategy and is expected to offer users an effective statistical methodology for biomarker discovery in translational medicine research. Our BVS strategy could greatly enhance the quality of personalized healthcare delivery.

Final Report Abstract

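High-throughput "omics" data contain disease-related biomarkers and offer unprecedented opportunities for studying the mechanisms of disease onset and progression. However, discovering biomarkers in high-dimensional, high-throughput omics data remains a central and difficult problem in biological data analysis. To address the limitations of the filter methods (e.g., analysis of variance, rank-sum test) and wrapper methods (e.g., LASSO, support vector machines) commonly used for biomarker screening, this project developed a biomolecular marker screening model based on Bayesian variable selection within generalized linear models. Through simulation experiments, we examined the screening performance of the model under different influencing factors. Using Gaussian graphical models to extract the internal structural features of the simulated data and fusing them with the model yielded good screening performance. We also provide a systematic method for extracting biological knowledge from the public bioinformatics literature. In addition, we carried out a comparative study of LASSO-based variable selection methods and proposed the idea of using LASSO to improve our Bayesian screening method.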
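
As an illustration only, one common way to write a network-informed Bayesian variable selection model within a generalized linear model is sketched below. The abstract does not give the project's actual model, so every symbol here is an assumption introduced for illustration: the link function g, the inclusion indicators \gamma_j, the slab variance \tau^2, the hyperparameters a and b, and the symmetric, zero-diagonal adjacency matrix W (which could, for example, be derived from the nonzero partial correlations of an estimated Gaussian graphical model).

```latex
\begin{align*}
g\bigl(\mathbb{E}[y_i \mid x_i]\bigr) &= \beta_0 + \sum_{j=1}^{p} x_{ij}\,\beta_j, \\
\beta_j \mid \gamma_j &\sim (1-\gamma_j)\,\delta_0 + \gamma_j\,\mathcal{N}(0,\tau^2), \\
p(\gamma \mid W) &\propto \exp\bigl(a\,\mathbf{1}^{\top}\gamma + b\,\gamma^{\top} W \gamma\bigr),
\qquad \gamma \in \{0,1\}^p .
\end{align*}
```

In this sketch, a more negative a favors sparser selections, and b >= 0 raises the prior probability of selecting genes that are connected in the network. Setting b = 0 reduces the prior to independent Bernoulli inclusion indicators with probability e^a / (1 + e^a), that is, selection without structural information.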