Machine Learning overview for biogeographical ancestry prediction - a PLS-DA approach
Alladio E. (first author)
2022-01-01
Abstract
The biogeographical ancestry (BGA) of a trace or of a person/skeleton refers to the biologically determined component of ethnicity, which as a whole comprises both biological and cultural elements. Many people today are interested in researching their genealogy, and the ability to distinguish biogeographic information about populations and subgroups through DNA analysis plays an essential role in several fields, including forensics. For example, when no reference profiles of perpetrators or database matches are available for comparison, inferring the biogeographic origin of perpetrators or victims of unsolved cases is valuable for investigative and intelligence purposes. Current approaches to BGA estimation from SNP data are generally based on principal component analysis (PCA) and the STRUCTURE software. The present study provides an alternative method that combines multivariate data analysis and Machine Learning strategies to assess how well various commercial panels discriminate the BGA of unknown samples. Using datasets from the 1000 Genomes Project, the Simons Genome Diversity Project, and the Human Genome Diversity Project, which include African, American, Asian, European, and Oceanian individuals, Partial Least Squares-Discriminant Analysis (PLS-DA) and XGBoost were applied and their discriminatory power was compared.
File | Access | File type | Size | Format
---|---|---|---|---
1-s2.0-S1875176822001032-main.pdf | Restricted access | Postprint (author's final version) | 629.71 kB | Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.