The Science web site has published a fascinating study of modern and ancient human genome sequences that reveals previously unknown features of our evolutionary past. You can read much, much more by starting at: https://www.science.org/doi/10.1126/science.abi8264. Here is the introduction:
The characterization of modern and ancient human genome sequences has revealed previously unknown features of our evolutionary past. As genome data generation continues to accelerate—through the sequencing of population-scale biobanks and ancient samples from around the world—so does the potential to generate an increasingly detailed understanding of how populations have evolved.
However, such genomic datasets are highly heterogeneous. Samples from diverse times, geographic locations, and populations are processed, sequenced, and analyzed using a variety of techniques. The resulting datasets contain genuine variation but also complex patterns of missingness and error. This makes combining data challenging and hinders efforts to generate the most complete picture of human genomic variation.
To address these challenges, we use the foundational notion that the ancestral relationships of all humans who have ever lived can be described by a single genealogy or tree sequence, so named because it encodes the sequence of trees that link individuals to one another at every point in the genome. This tree sequence of humanity is immensely complex, but estimates of the structure are a powerful means of integrating diverse datasets and gaining greater insights into human genetic diversity. In this work, we introduce statistical and computational methods to infer such a unified genealogy of modern and ancient samples, validate the methods through a mixture of computer simulation and analysis of empirical data, and apply the methods to reveal features of human diversity and evolution.
We present a unified tree sequence of 3601 modern and eight high-coverage ancient human genome sequences compiled from eight datasets. This structure is a lossless and compact representation of 27 million ancestral haplotype fragments and 231 million ancestral lineages linking genomes from these datasets back in time. The tree sequence also benefits from the use of an additional 3589 ancient samples compiled from more than 100 publications to constrain and date relationships.
Using simulations and empirical analyses, we demonstrate the ability to recover relationships between individuals and populations as well as to identify descendants of ancient samples. We calculate the distribution of the time to most recent common ancestry between the 215 populations of the constituent datasets, revealing patterns consistent with substantial variation in historical population size and evidence of archaic admixture in modern humans.
The tree sequence also offers insight into patterns of recurrent mutation and sequencing error in commonly used genetic datasets. We find pervasive signals of sequencing error as well as a small subset of variant sites that appear to be erroneous.
Finally, we introduce an estimator of ancestor geographic location that recapitulates key features of human history. We observe signals of very deep ancestral lineages in Africa, the out-of-Africa event, and archaic introgression in Oceania. The method motivates improved spatiotemporal inference methods that will better elucidate the paths and timings of historic migrations.
The profusion of genetic sequencing data creates challenges for integrating diverse data sources. Our results demonstrate that whole-genome genealogies provide a powerful platform for synthesizing genetic data and investigating human history and evolution.
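For readers wondering what a "tree sequence" looks like in practice, the idea can be sketched in a few lines of Python. This is only a toy illustration, not the authors' method or software: the genome intervals, node numbers, and node times below are all invented. It shows the core notion from the introduction, that a different tree applies at different points along the genome, so the time to the most recent common ancestor of two samples depends on where you look.

```python
from bisect import bisect_right

# Hypothetical toy data (invented for illustration).
# Nodes 0-2 are sampled genomes; nodes 3-5 are inferred ancestors.
# NODE_TIME gives each node's age in generations before the present.
NODE_TIME = {0: 0, 1: 0, 2: 0, 3: 10, 4: 25, 5: 40}

# Each entry covers the genome interval [left, right) and maps child -> parent.
TREES = [
    (0, 50, {0: 3, 1: 3, 2: 4, 3: 4}),
    (50, 100, {0: 5, 1: 3, 2: 3, 3: 5}),
]


def tree_at(position):
    """Return the child -> parent map of the tree covering a genome position."""
    starts = [left for left, _, _ in TREES]
    index = bisect_right(starts, position) - 1
    if index < 0:
        raise ValueError(f"position {position} is outside the genome")
    left, right, parents = TREES[index]
    if not (left <= position < right):
        raise ValueError(f"position {position} is outside the genome")
    return parents


def ancestors(node, parents):
    """List a node followed by all of its ancestors, walking up to the root."""
    path = [node]
    while node in parents:
        node = parents[node]
        path.append(node)
    return path


def mrca(a, b, parents):
    """Return the most recent common ancestor of a and b, or None."""
    seen = set(ancestors(a, parents))
    for node in ancestors(b, parents):
        if node in seen:
            return node
    return None


def tmrca(a, b, position):
    """Return the age of the most recent common ancestor at a genome position."""
    node = mrca(a, b, tree_at(position))
    return None if node is None else NODE_TIME[node]
```

In this toy example, samples 0 and 1 share an ancestor 10 generations back at position 10 but only 40 generations back at position 75, because the tree linking them changes partway along the genome. A single family tree could not capture that, which is why the study works with a sequence of trees.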