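Chinese Abstract
Biomedical research often relies on effective variable selection methods to uncover key information in data. Studies have shown that, among classical statistical models for variable selection, best subset selection based on information criteria applies threshold criteria that are subjective, and its results are easily affected by random error, while coefficient shrinkage and estimation methods, represented by LASSO, tend to over-select variables. Building on the traditional LASSO variable selection method, this study proposes two improved models, Bootstrap ranking LASSO and Two-stage hybrid LASSO, to reduce the false positive rate of variable screening and overcome the tendency to over-select variables. The improved models will be systematically compared with and evaluated against existing methods through Monte Carlo simulation and empirical analysis. In addition, to address the unstable predictions of regression models built with LASSO, the project will use model ensemble methods and a multi-index optimization evaluation strategy to build an ensemble LASSO regression model with improved prediction accuracy and stability. Finally, the methods will be applied to identify the factors influencing dengue epidemics in Guangdong Province and to build a predictive model, and the empirical results will be used to refine the models.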
English Abstract
In biomedical research, effective variable selection methods are often used to discover key information in data. Research shows that, among classical variable selection models, best subset selection based on information criteria includes or excludes variables using specific threshold values, so the selection results are to some degree subjective and are vulnerable to random error. The LASSO (Least Absolute Shrinkage and Selection Operator) model, a representative coefficient shrinkage and variable selection method, tends to over-select variables. Building on the traditional LASSO model, this project aims to develop two improved variable selection models, the Bootstrap ranking LASSO and the Two-stage hybrid LASSO, to lower the false positive rate of variable screening and overcome this over-selection. Through Monte Carlo simulation and empirical analysis, we will systematically compare the two proposed models with existing variable selection methods. In addition, to address the unstable predictions of traditional LASSO regression models, we will combine model ensemble methods with a multi-index optimization evaluation strategy to construct a novel ensemble LASSO regression model with improved prediction accuracy and stability. Finally, the proposed methods will be applied to dengue monitoring data from Guangdong Province to identify factors related to dengue epidemics and to establish a predictive model of dengue. The empirical results will in turn be used to refine the models.
