Most privacy-preserving machine learning methods are designed around continuous or numeric data, but categorical attributes are common in many application scenarios, including clinical and health records, census and survey data. Distance-based methods, in particular, have limited applicability to categorical data, since they do not capture the complexity of the relationships among different values of a categorical attribute. Although distance learning algorithms exist for categorical data, they may disclose private information about individual records if applied to a secret dataset. To address this problem, we introduce a differentially private family of algorithms for learning distances between any pair of values of a categorical attribute according to the way they are co-distributed with the values of other categorical attributes forming the so-called context. We define different variants of our algorithm and we show empirically that our approach consumes little privacy budget while providing accurate distances, making it suitable in distance-based applications, such as clustering and classification.

Differentially Private Distance Learning in Categorical Data

Battaglia, Elena
First
;
Pensa, Ruggero G.
Last
2021-01-01

Abstract

Most privacy-preserving machine learning methods are designed around continuous or numeric data, but categorical attributes are common in many application scenarios, including clinical and health records, census and survey data. Distance-based methods, in particular, have limited applicability to categorical data, since they do not capture the complexity of the relationships among different values of a categorical attribute. Although distance learning algorithms exist for categorical data, they may disclose private information about individual records if applied to a secret dataset. To address this problem, we introduce a differentially private family of algorithms for learning distances between any pair of values of a categorical attribute according to the way they are co-distributed with the values of other categorical attributes forming the so-called context. We define different variants of our algorithm and we show empirically that our approach consumes little privacy budget while providing accurate distances, making it suitable in distance-based applications, such as clustering and classification.
2021
35
5
2050
2088
https://link.springer.com/article/10.1007/s10618-021-00778-0
Differential privacy, Metric learning, Categorical attributes, Distance-based methods
Battaglia, Elena; Celano, Simone; Pensa, Ruggero G.
File in questo prodotto:
File Dimensione Formato  
dami2021_online.pdf

Accesso aperto

Descrizione: PDF online (open access)
Tipo di file: PDF EDITORIALE
Dimensione 1.52 MB
Formato Adobe PDF
1.52 MB Adobe PDF Visualizza/Apri
dami2021_printed.pdf

Accesso aperto

Descrizione: PDF a stampa
Tipo di file: PDF EDITORIALE
Dimensione 1.5 MB
Formato Adobe PDF
1.5 MB Adobe PDF Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/2318/1793715
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 1
  • ???jsp.display-item.citation.isi??? 1
social impact