
Investigating Fairness with FanFAIR: is Pre-Processing Useful Only for Performances?

Gallese, Chiara
2025-01-01

Abstract

Artificial Intelligence, and Machine Learning systems in general, are becoming pervasive in our society, from industry to public administration. AI can often provide a very efficient means to support decision-making, but it can represent a danger in high-risk applications such as bio-medicine and healthcare. In particular, biased datasets might lead to inaccurate or discriminatory ML systems, undermining the accuracy of their predictions and putting patients’ health at risk. FanFAIR is a Python tool that provides the community with a semi-automatic means of assessing dataset fairness. FanFAIR is designed to integrate qualitative considerations – such as ethics, human rights assessment, and data protection – with quantitative indicators of dataset fairness, such as balance, the presence of invalid entries, or outliers. In this work, we extend FanFAIR to deal with categorical data and introduce a new algorithm for outlier detection in the presence of missing values. We then provide a case study on data collected from COVID-19 patients admitted to pneumology departments in Italy. We show how successive steps of data cleaning and variable selection improve the indicators provided by FanFAIR. This shows that data cleaning procedures are not only necessary to improve the performance of the machine learning algorithm that learns from the data, but are also a way to improve (a measure of) fairness. Hence, the proposed case study provides an example in which performance and fairness are not in conflict, as is commonly believed, but improve together.
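To make the quantitative side of the abstract concrete, the following is a minimal, hypothetical Python sketch (not taken from the FanFAIR codebase) of how indicators such as class balance, the rate of missing/invalid entries, and an outlier rate that tolerates missing values might be computed. The function names, the entropy-based balance measure, and the IQR fences are illustrative assumptions, not the tool's actual method.

```python
# Hypothetical sketch of dataset-fairness indicators of the kind FanFAIR reports:
# class balance, missing-entry rate, and a missing-value-tolerant outlier rate.
import numpy as np
import pandas as pd

def balance_score(labels: pd.Series) -> float:
    """Normalized entropy of the class distribution (1.0 = perfectly balanced)."""
    freqs = labels.value_counts(normalize=True).to_numpy()
    if len(freqs) <= 1:
        return 0.0
    entropy = -np.sum(freqs * np.log(freqs))
    return float(entropy / np.log(len(freqs)))

def missing_rate(df: pd.DataFrame) -> float:
    """Fraction of cells that are missing (NaN)."""
    return float(df.isna().to_numpy().mean())

def outlier_rate(values: pd.Series, k: float = 1.5) -> float:
    """Fraction of non-missing values outside the IQR fences, ignoring NaNs."""
    x = values.to_numpy(dtype=float)
    q1, q3 = np.nanpercentile(x, [25, 75])
    lo, hi = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    mask = ~np.isnan(x)
    return float(((x[mask] < lo) | (x[mask] > hi)).mean())

# Toy usage on a small synthetic dataset
df = pd.DataFrame({
    "age": [34, 51, np.nan, 47, 120, 29, 55],   # 120 is an implausible outlier
    "outcome": ["recovered", "recovered", "deceased",
                "recovered", "recovered", "recovered", "deceased"],
})
print("balance :", round(balance_score(df["outcome"]), 3))
print("missing :", round(missing_rate(df), 3))
print("outliers:", round(outlier_rate(df["age"]), 3))
```

Using NaN-aware percentiles lets the outlier check run before any imputation, which echoes the abstract's point that outlier detection must coexist with missing values; the paper's own algorithm may differ.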
Year: 2025
Published in: 2025 IEEE Symposium on Computational Intelligence in Health and Medicine, CIHM 2025
Pages: 1–7
https://aiandlaw.eu/wp-content/uploads/2025/05/IEEE_SSCI_2025___FanFAIR_2.pdf
Keywords: data cleaning; dataset assessment; debiasing; fairness; preprocessing; sensitive attributes
Rispoli, Michele; Nobile, Marco S.; Manzoni, Luca; D'Onofrio, Alberto; Confalonieri, Marco; Salton, Francesco; Confalonieri, Paola; Ruaro, Barbara; Ga...
Files in this record:
IEEE_SSCI_2025___FanFAIR_2.pdf (open access, preprint / first draft, 411.02 kB, Adobe PDF)
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2318/2080690