Machine Learning overview for biogeographical ancestry prediction - a PLS-DA approach

Alladio E. (first author); 2022-01-01

Abstract

Biogeographical ancestry (BGA) of a trace or of a person or skeleton refers to the biologically determined component of ethnicity, which as a whole comprises both biological and cultural elements. Many people are now interested in researching their genealogy, and the ability to distinguish populations and subgroups biogeographically through DNA analysis plays an essential role in several fields, including forensics. For example, when no reference profiles or database matches are available for comparison, inferring the biogeographic origin of the perpetrators or victims of unsolved cases is valuable for investigative and intelligence purposes. Current approaches to BGA estimation from SNP data generally rely on principal component analysis (PCA) and the STRUCTURE software. The present study provides an alternative method that combines multivariate data analysis and Machine Learning strategies to assess how well various commercial panels discriminate the BGA of unknown samples. Partial Least Squares-Discriminant Analysis (PLS-DA) and XGBoost were applied to datasets from the 1000 Genomes Project, the Simons Genome Diversity Project, and the Human Genome Diversity Project, which include African, American, Asian, European, and Oceanian individuals, and the discriminatory power of the two techniques was compared.
2022
1
2
https://www.fsigeneticssup.com/article/S1875-1768(22)00103-2/fulltext
BGA, Machine learning, SNPs
Alladio, E., Poggiali, B., Cosenza, G., Cisana, S., Omedei, M., Garofano, P., Pilli, E.
Files in this item:
File: 1-s2.0-S1875176822001032-main.pdf
Access: restricted
File type: POSTPRINT (AUTHOR'S FINAL VERSION)
Size: 629.71 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2318/1880124