Individuals

Individuals: 452
SNPs (millions): 6.5
Estimated diversity: 0.1%
Mean depth: 16.4

About

This is an individual quality control report generated by snpArcher for the dataset "final". In total, 11793862 SNPs were discovered before any filters were applied. The GATK best practices filters were then applied to this dataset and 5244920 SNPs were removed, leaving 6548942. The approximate nucleotide diversity in the sample, using the Watterson estimator, is 0.1%. For the purposes of this report, we apply several sensible filters: removing all indels, non-biallelic SNPs, SNPs with a minor allele frequency < 0.01, SNPs with >75% missing data, and samples with <2x sequencing depth. We then randomly selected SNPs within a set window size to end up with approximately 100k SNPs (in this report, 103416). These are effectively an LD-pruned set of SNPs. All analyses in this report are based on this set of approximately 100k SNPs. These should not be considered final analyses; they are solely intended to direct quality control of the dataset.
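For readers unfamiliar with the Watterson estimator mentioned above, the following is a minimal sketch of the standard per-site formula, theta_W = S / (a_n * L), where S is the number of segregating sites, a_n is the harmonic number over n - 1 sampled chromosomes, and L is the number of callable sites. The function names and the toy numbers are illustrative assumptions; the report does not state the callable genome length or exactly how the pipeline computes its estimate.

```python
def harmonic_number(n):
    """Return 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / i for i in range(1, n + 1))


def watterson_theta_per_site(num_segregating_sites, num_chromosomes, num_callable_sites):
    """Per-site Watterson estimator: S / (a_n * L).

    num_chromosomes is the number of sampled chromosomes
    (assumed here to be 2 x the number of diploid individuals).
    num_callable_sites is a hypothetical input; it is not given in the report.
    """
    if num_chromosomes < 2:
        raise ValueError("need at least two sampled chromosomes")
    if num_callable_sites <= 0:
        raise ValueError("number of callable sites must be positive")
    a_n = harmonic_number(num_chromosomes - 1)
    return num_segregating_sites / (a_n * num_callable_sites)


# Toy example with made-up numbers (not taken from this dataset).
theta = watterson_theta_per_site(
    num_segregating_sites=30, num_chromosomes=10, num_callable_sites=10_000
)
print(f"theta_W per site: {theta:.5f}")
```

Multiplying the result by 100 gives a percentage comparable in form to the diversity value reported above.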

Genomic PCA

Fig. 1: The first step is to run a principal component analysis (PCA) of the genotypes to identify broad clustering patterns in the data. To aid the visualization of sample groupings throughout the document, k-means clustering with k = 3 is applied. The three colors in each plot therefore refer to the three clusters identified in the genomic PCA. This is a good first pass to look for outlier samples, which may be either problematic or interesting. The following analyses help to distinguish between these possibilities. Note that these 3 clusters may or may not have any meaningful biological relevance!
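To illustrate how samples might be assigned to three color groups, here is a minimal sketch of k-means (Lloyd's algorithm) in plain Python. It assumes the clustering is run on each sample's PC coordinates, which the report does not state explicitly, and it uses a deterministic farthest-point initialization purely so the toy example is reproducible. The function names and data are hypothetical and are not the pipeline's implementation.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans(points, k=3, max_iterations=100):
    """Cluster points into k groups; returns one label per point."""
    # Farthest-point initialization: start from the first point, then
    # repeatedly add the point farthest from all centers chosen so far.
    centers = [points[0]]
    while len(centers) < k:
        farthest = max(
            points, key=lambda p: min(squared_distance(p, c) for c in centers)
        )
        centers.append(farthest)

    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


# Toy (PC1, PC2) coordinates for six hypothetical samples.
pc_coords = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (-5.0, 5.0), (-5.0, 5.1)]
print(kmeans(pc_coords, k=3))
```

Because k is fixed in advance, the algorithm always returns three groups, whether or not the data contain three real populations.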

Depth and PC Correlation

Fig. 2: If there are batch effects, the PCA groups will sometimes correlate with sequencing depth. An R² value (depth ~ PC) is shown for each component, and a large value may suggest there is a technical signal in the data. The percent variance explained by each PC is also shown, calculated as that PC's share of the variance explained by the 10 PCs in total.
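As a rough illustration of the two quantities in this panel, the sketch below computes R² for a simple linear regression of depth on one PC (which equals the squared Pearson correlation) and the percent variance explained relative to the sum over the retained components. The function names and toy values are assumptions for illustration only.

```python
def r_squared(x, y):
    """R^2 of a simple linear regression y ~ x (squared Pearson correlation)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0  # a constant variable explains no variance
    return (sxy * sxy) / (sxx * syy)


def percent_variance_explained(eigenvalues):
    """Each component's share of the total over the components supplied."""
    total = sum(eigenvalues)
    return [100.0 * value / total for value in eigenvalues]


# Toy data: PC1 scores and mean depth for five hypothetical samples.
pc1_scores = [-2.0, -1.0, 0.0, 1.0, 2.0]
mean_depths = [8.0, 12.0, 15.0, 21.0, 24.0]
print(f"R^2 (depth ~ PC1): {r_squared(pc1_scores, mean_depths):.3f}")
print(percent_variance_explained([3.0, 1.0]))
```

Note that percentages computed this way sum to 100 across the supplied components, so they overstate each PC's share of the total genotypic variance.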

SNP Depth

Fig. 3: There is typically a relationship between the amount of missing data and total sequencing depth. Use this plot to identify a potential cutoff for how strictly you want to filter individuals by sequencing depth and/or missingness. For example, one might remove individuals with a sequencing depth < 4 if the rate of missingness per SNP is higher than seems reasonable.

Mapping rate

Fig. 4: The percent of reads mapped is calculated from the mapping rate to the reference genome. A lower mapping rate may indicate that there are contaminants in your reads or that the sample is from the wrong species. For example, a mapping rate <80% suggests that either the sample is of the wrong species (so many reads did not map) or about 20% of the sample comes from another species (such as bacterial contamination).

Heterozygosity

Fig. 5: The inbreeding coefficient (F) is an estimate of excess homozygosity or heterozygosity. Values close to +1 indicate extensive homozygosity in the sample and values close to -1 indicate an excess of heterozygotes. Check for samples that are outliers in the PCA plot and have very negative F values, as these could indicate cross-contamination among samples.
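One common way to compute a per-sample inbreeding coefficient is F = (O_hom - E_hom) / (N - E_hom), where O_hom is the observed number of homozygous genotypes, E_hom is the number expected under Hardy-Weinberg proportions, and N is the number of non-missing sites. The sketch below assumes that formula with no small-sample correction; the report does not say which tool or exact method produced its values, and the function name and inputs are hypothetical.

```python
def inbreeding_coefficient(genotypes, alt_allele_freqs):
    """Method-of-moments F for one sample.

    genotypes: alternate-allele counts per site (0, 1, 2), or None if missing.
    alt_allele_freqs: alternate-allele frequency per site (assumed to be
    estimated from the whole dataset).
    """
    observed_hom = 0
    expected_hom = 0.0
    num_sites = 0
    for genotype, p in zip(genotypes, alt_allele_freqs):
        if genotype is None:
            continue  # skip missing calls
        num_sites += 1
        if genotype != 1:
            observed_hom += 1
        # Expected homozygosity under Hardy-Weinberg: 1 - 2p(1 - p).
        expected_hom += 1.0 - 2.0 * p * (1.0 - p)
    if num_sites == 0 or num_sites == expected_hom:
        raise ValueError("F is undefined: no informative (polymorphic) sites")
    return (observed_hom - expected_hom) / (num_sites - expected_hom)


# Toy example: four sites, each with allele frequency 0.5.
freqs = [0.5, 0.5, 0.5, 0.5]
print(inbreeding_coefficient([0, 2, 0, 2], freqs))  # all homozygous
print(inbreeding_coefficient([1, 1, 1, 1], freqs))  # all heterozygous
```

A sample contaminated with DNA from another individual tends to show extra heterozygous calls, which pushes F toward negative values.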

Tree

Fig. 6: A very simple neighbor joining tree is built from a simple distance matrix among all samples in the dataset. The leaves are colored by the clusters identified in the PCA.

Relatedness matrix

Fig. 7: We build a KING relatedness matrix using PLINK to evaluate whether there are closely related samples. A relatedness value of 0.5 indicates that two samples are identical, values close to 0.25 indicate parent-child or sibling relatedness, and second-degree relatives have values of ~0.125. You may want to remove samples with relatedness > 0.354 if you want to exclude very closely related samples, which can bias population genomic estimates.
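The values above can be turned into a simple screening step. The sketch below labels a kinship value using the commonly cited KING inference ranges (boundaries at 0.354, 0.177, 0.0884, and 0.0442) and lists sample pairs above a cutoff. The function names, the in-memory matrix layout, and the sample IDs are assumptions for illustration; PLINK's actual output files would need to be parsed first.

```python
def classify_kinship(kinship):
    """Label a KING kinship coefficient using commonly cited ranges."""
    if kinship > 0.354:
        return "duplicate or monozygotic twin"
    if kinship > 0.177:
        return "first-degree"
    if kinship > 0.0884:
        return "second-degree"
    if kinship > 0.0442:
        return "third-degree"
    return "unrelated or more distant"


def pairs_above_cutoff(sample_ids, kinship_matrix, cutoff=0.354):
    """Return (sample_a, sample_b, kinship) for each pair above the cutoff.

    kinship_matrix is assumed to be a square, symmetric list of lists
    ordered like sample_ids; only the upper triangle is read.
    """
    flagged = []
    for i in range(len(sample_ids)):
        for j in range(i + 1, len(sample_ids)):
            if kinship_matrix[i][j] > cutoff:
                flagged.append((sample_ids[i], sample_ids[j], kinship_matrix[i][j]))
    return flagged


# Toy matrix for three hypothetical samples.
ids = ["s1", "s2", "s3"]
matrix = [
    [0.50, 0.49, 0.01],
    [0.49, 0.50, 0.24],
    [0.01, 0.24, 0.50],
]
print(pairs_above_cutoff(ids, matrix))
print(classify_kinship(0.24))
```

When a pair is flagged, typically only one member needs to be dropped; which one to keep (for example, the sample with higher depth) is a separate decision.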

Map

A map will appear here if you include a .coords file.

Fig. 8: Here, an interactive map is produced if a coordinate file with latitude and longitude in decimal degrees is available. See the project README for how to set up this file for analysis.

Terrain Map

A terrain map will appear here if you provide a Google API key in the config file.

Fig. 9: Here, a terrain map is produced using the Google Maps API.

Admixture

Fig. 10: Admixture was run on the dataset for k = 2 and k = 3. These values are arbitrarily selected and no cross-validation is done. Because these groupings are made separately from the clusters in the PCA, samples are colored by the admixture assignments and not by the PCA groupings.

Generated by snpArcher
