Entropy-Based Information Mining Methods for Public Health Big Data

Chinese Abstract (English translation)

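Big data is a collection of diverse datasets assembled for a specific purpose, characterized by huge volume, varied forms, rapid updating, and hidden value. Current applications and methods of big data analysis still have shortcomings. Researchers are often limited to analyzing a single dataset, and studies based on multi-source data are scarce. Common machine learning algorithms are not designed specifically for interactions among multidimensional variables, so they tend to miss high-value interaction information. In addition, existing methods are inefficient, which makes it hard to keep analysis results synchronized with data updates within a short time. Information entropy holds an important place in data mining because of its speed advantage. However, entropy-based methods still have deficiencies: they require variables to be mutually independent, their statistical analysis is too time-consuming, and they cannot control for confounding factors, among others. Big data in public health conceals high-value information, yet efficient and sound methods for extracting it are lacking. The approach of this study is to: (1) work at the level of public health big data, mining information from multi-source, dynamically updated data to build more accurate tumor risk prediction models; (2) exploit the strengths of information entropy and address the shortcomings of existing methods by exploring analysis methods and mining strategies that are computationally fast and statistically sound; (3) develop CPU and GPU parallel computing programs to provide practical tools.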

English Abstract

Big data is a collection of datasets assembled for a specific research purpose, characterized by volume, variety, velocity and veracity. However, the applications and methods of public health big data analysis still need improvement. Researchers have usually focused on a single dataset rather than on multiple datasets containing a wealth of public health information. Additionally, common machine learning methods are not designed to detect interactions among multiple variables, so they miss highly valuable interaction information related to disease. These common methods are also computationally slow, so statistical results cannot be refreshed quickly when the underlying data are updated. Information entropy methods are important in data mining because of their speed. However, entropy-based methods still fall short for big data analysis: they require independence among variables, their statistical analysis is time-consuming, and they cannot control for confounders. To date, there is no efficient and effective method for big data analysis of public health datasets. Based on these considerations, we aim to (1) build more accurate cancer risk prediction models based on public health big data, including demographic information, personal environmental exposure, regional environmental exposure, and so on; (2) propose a new information entropy method with fast computation and robust statistical performance for data mining; and (3) release CPU or GPU parallel computing software for convenient analysis of real data.

Project Completion Abstract

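Fast methods specific to the analysis of complex association signals in public health big data are lacking. Building on information entropy, we proposed the iterative entropy epistasis (IEE) method. IEE can be used to evaluate first-order and higher-order interactions. It not only accommodates non-independence among variables but is also unaffected by the variables' main effects, making it a detection method specific to interactions. Simulations showed that the type I error of IEE is controlled and that IEE outperforms existing entropy-based methods. For first-order interactions, the power of IEE is comparable to that of the log-linear model; for higher-order interactions, its power is clearly higher than that of the log-linear model. More importantly, IEE is faster to compute than the log-linear model, and the advantage grows with sample size. In addition, when detecting first-order and higher-order interactions, IEE maintains its type I error and power at 25% and 50% of the original iteration precision, respectively, and raises computation speed by a further 3 times and 1 time, respectively. Because the KSA statistic is always no smaller than the IEE statistic and is faster to compute, we further proposed a dimension-reduction analysis strategy: "KSA initial screening → IEE second screening → logistic test (KIL)". Simulations showed that the computational burden of KIL is only 30%-40% of the original total, while its power averages more than 92% of that of the logistic regression test. These theoretical results offer a reference for choosing statistical methods and dimension-reduction strategies. Based on the methodological results, we developed six software packages for data quality control and interaction information mining, all of which were granted software copyrights by the National Copyright Administration. We also mined information from different types of public health big data (clinical, genomic, epigenomic, transcriptomic, metabolomic), covering head and neck squamous cell carcinoma, non-small cell lung cancer, and oral squamous cell carcinoma, and identified multiple tumor-associated biomarkers that improved the risk prediction accuracy of the models. These applied results have practical value for personalized medicine and risk assessment.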
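
The staged screening idea, in which a cheap statistic that upper-bounds a costlier one prunes candidate interactions first, can be illustrated with a minimal Python sketch. It does not implement KSA or IEE, whose definitions are not given in this abstract. As stand-ins it assumes plain joint mutual information I(A,B;Y) for the first pass and the interaction gain I(A,B;Y) - I(A;Y) - I(B;Y) for the second, and it omits the final logistic test. The stand-ins share one property with the text: the first statistic is never smaller than the second, since mutual information is non-negative. All function names and the `keep_fraction` parameter are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2


def entropy(values):
    """Shannon entropy, in bits, of a sequence of hashable values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for two equal-length discrete sequences."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def interaction_gain(a, b, y):
    """Information the pair (A,B) carries about Y beyond A and B taken alone."""
    joint = list(zip(a, b))
    return (mutual_information(joint, y)
            - mutual_information(a, y)
            - mutual_information(b, y))


def staged_screen(columns, y, keep_fraction=0.3):
    """Two-pass screen over all variable pairs (stand-in statistics only).

    Pass 1 ranks every pair by joint mutual information and keeps the top
    fraction; pass 2 re-ranks the survivors by interaction gain.
    Returns a list of (gain, (name_a, name_b)) sorted from high to low.
    """
    pairs = list(combinations(sorted(columns), 2))
    by_joint_mi = sorted(
        pairs,
        key=lambda p: mutual_information(
            list(zip(columns[p[0]], columns[p[1]])), y),
        reverse=True,
    )
    n_keep = max(1, int(len(pairs) * keep_fraction))
    survivors = by_joint_mi[:n_keep]
    return sorted(
        ((interaction_gain(columns[i], columns[j], y), (i, j))
         for i, j in survivors),
        reverse=True,
    )


if __name__ == "__main__":
    # Toy data: y is the XOR of a and b, so neither variable is informative
    # alone, but the pair determines y. Column c is unrelated to y.
    columns = {
        "a": [0, 0, 1, 1, 0, 0, 1, 1],
        "b": [0, 1, 0, 1, 0, 1, 0, 1],
        "c": [0, 0, 0, 0, 1, 1, 1, 1],
    }
    y = [a ^ b for a, b in zip(columns["a"], columns["b"])]
    for gain, pair in staged_screen(columns, y):
        print(pair, round(gain, 3))
```

On the toy data the pure-interaction pair (a, b) survives the first pass and ranks first in the second. In a real analysis the surviving pairs would then go to a formal test such as logistic regression, and the kept fraction would have to be chosen to balance computational burden against power.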