Effects of Missing Data on Phylogenetic Tree Reconstruction and Molecular Clock Analyses Based on Ultraconserved Elements (UCEs)

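Abstract (translated from Chinese)

In reconstructing molecular phylogenetic relationships, missing data are hard to avoid: sequences are often not obtained for some genes in some samples. Whether to retain genes and samples with missing data, and how to exclude them, therefore becomes an issue. The appropriate treatment depends on how missing data affect phylogenetic tree reconstruction and its downstream analyses, and studies of these effects benefit the practice of such analyses. In terms of the pattern of missing data and information content, traditional dideoxy (Sanger) sequencing data differ from ultraconserved elements (UCEs) obtained by high-throughput sequencing, so the effects of missing data are also likely to differ. For the former, existing studies support retaining genes and samples with missing data to a large extent. UCEs have become important markers for systematic analyses in recent years, but the effects of their high-proportion, lineage-specific missing data have yet to be studied systematically. Using amphibians as an example, this project explores these effects on phylogenetic tree reconstruction and on molecular clock analysis, an important downstream analysis, and compares two strategies: retaining and removing missing data. Based on empirical and simulated data, we will generate datasets with different degrees of missing data, simulate treatments that remove missing data, and compare the results from the different datasets against those from complete or relatively complete reference datasets. We focus on missing data related to sequencing depth and to the concatenation of different datasets, with the aim of providing a reference for handling missing data appropriately and evaluating analytical results objectively.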

English Abstract

In the practice of molecular phylogenetic analysis, missing data are almost always present in assembled datasets: some taxa lack data for some genes, and some genes lack data for some taxa. This raises the question of how to deal with genes and taxa that have missing data. For example, one can include only those genes sampled in all taxa, or use all genes regardless of missing data. Clearly, knowledge of the effects of missing data on phylogenetic reconstruction and on downstream analyses such as molecular dating would be beneficial. Molecular phylogenetic analyses currently use sequence data obtained by different methods: traditional Sanger sequencing and next-generation sequencing, the latter yielding ultraconserved elements (UCEs). Compared with Sanger data, UCE data contain much more information and show a different pattern of missing data, with higher, lineage-dependent percentages of missing data. Consequently, the effects of missing data may differ between analyses based on Sanger and UCE datasets. For Sanger data, results from many studies generally support retaining genes and taxa with missing data to a large extent. UCE data have been increasingly used for phylogenetic analysis in recent years, but studies of the effects of missing data on UCE-based analyses are lacking. In this project, we use amphibians as an example to explore the effects of missing data on UCE-based phylogenetic reconstruction and divergence dating, and to compare strategies for dealing with missing data. Both empirical and simulated data will be used: from them, new datasets with different degrees of missing data will be generated and then subjected to various treatments that reduce missing data. Using complete or relatively complete datasets as references, results based on datasets with missing data can then be compared and evaluated.
Two sources of missing data in UCE datasets will be considered in this project: one related to sequencing coverage and the other to the supermatrix approach. This project will facilitate the treatment of missing data in UCE datasets and the interpretation of results based on them.
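
As an illustration only, the idea of generating datasets with different degrees of lineage-dependent missing data and then reducing it can be sketched in Python. This is a minimal sketch under stated assumptions, not the project's pipeline: the dataset is reduced to a taxon-by-locus presence/absence matrix, the taxon names are invented, and the function names (`make_complete`, `drop_loci`, `missing_fraction`, `filter_by_occupancy`) and the occupancy threshold are hypothetical choices made for this example.

```python
import random


def make_complete(taxa, n_loci):
    """Presence/absence matrix in which every taxon has every locus."""
    return {taxon: [True] * n_loci for taxon in taxa}


def drop_loci(matrix, drop_prob, seed=0):
    """Drop loci at a taxon-specific probability (lineage-dependent missingness).

    drop_prob maps taxon name -> probability of losing each locus;
    taxa not listed keep all of their loci.
    """
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    return {
        taxon: [present and rng.random() >= drop_prob.get(taxon, 0.0)
                for present in row]
        for taxon, row in matrix.items()
    }


def missing_fraction(matrix):
    """Fraction of taxon-by-locus cells that are missing."""
    cells = [present for row in matrix.values() for present in row]
    if not cells:
        return 0.0
    return 1 - sum(cells) / len(cells)


def filter_by_occupancy(matrix, min_occupancy):
    """Keep only loci present in at least min_occupancy (0..1) of the taxa."""
    n_taxa = len(matrix)
    n_loci = len(next(iter(matrix.values())))
    keep = [i for i in range(n_loci)
            if sum(row[i] for row in matrix.values()) / n_taxa >= min_occupancy]
    return {taxon: [row[i] for i in keep] for taxon, row in matrix.items()}, keep


# Hypothetical taxa; the drop probabilities are arbitrary illustration values.
taxa = ["frog_A", "frog_B", "salamander_C", "caecilian_D"]
complete = make_complete(taxa, 1000)
sparse = drop_loci(complete, {"salamander_C": 0.5, "caecilian_D": 0.8})
filtered, kept = filter_by_occupancy(sparse, 0.75)
print(f"missing before filtering: {missing_fraction(sparse):.2f}")
print(f"loci kept: {len(kept)}, missing after: {missing_fraction(filtered):.2f}")
```

In an actual study the reference (complete) and reduced matrices would each be passed to tree inference and dating, and the resulting topologies and node ages compared; that step is outside the scope of this sketch.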