GDTS: GAN-based Distributed Tabular Synthesizer

Birke, R;
2023-01-01

Abstract

Generative Adversarial Networks (GANs) are typically trained to synthesize data, from images to, more recently, tabular data, under the assumption that the training data is directly accessible. While learning image GANs on Federated Learning (FL) and Multi-Discriminator (MD) systems has only recently been demonstrated, it is unknown whether tabular GANs can be learned from decentralized data sources. Unlike image GANs, state-of-the-art tabular GANs require prior knowledge of the data distribution of each (discrete and continuous) column in order to agree on a common encoding, which puts privacy guarantees at risk. In this paper, we propose GDTS, a distributed framework for GAN-based tabular synthesizers. GDTS provides different system architectures to match the two training paradigms, termed GDTS_FL and GDTS_MD. Key to enabling learning on distributed data is the proposed novel privacy-preserving multi-source feature encoding, which captures the global data properties. In addition, GDTS encompasses a weighting strategy based on table similarity to counter the detrimental effects of non-IID data, and a validation pipeline to easily assess and compare the performance of different paradigms and hyperparameters. We evaluate the effectiveness of GDTS in terms of synthetic data quality and overall training scalability. Experiments show that GDTS_FL achieves better statistical similarity and machine learning utility between generated and original data compared to GDTS_MD.
2023
IEEE 16th International Conference on Cloud Computing (CLOUD)
Chicago, IL, USA
02-08 July 2023
Proceedings of the 2023 IEEE 16th International Conference on Cloud Computing (CLOUD)
IEEE COMPUTER SOC
570
576
979-8-3503-0481-7
Tabular GAN; federated learning; tabular data; Non-IID
Zhao, ZL; Birke, R; Chen, LY
Files in this item:
GDTS_IEEE_CLOUD_preprint.pdf

Open access

File type: PREPRINT (FIRST DRAFT)
Size: 2.69 MB
Format: Adobe PDF
GDTS_GAN-Based_Distributed_Tabular_Synthesizer.pdf

Restricted access

File type: PUBLISHER'S PDF
Size: 1.12 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2318/1949571