TabuLa: Harnessing Language Models for Tabular Data Synthesis

Zhao, Zilong; Birke, Robert; Chen, Lydia Y.

doi:10.1007/978-981-96-8186-0_20

Tabular data synthesis is essential for privacy and security in data-driven industries. While recent advancements adopt large language models (LLMs) for realistic tabular data generation, their long training times and limited reusability hinder practical applications. In this paper, we propose Tabula, a tabular data synthesizer that leverages the structure of LLMs. Unlike state-of-the-art (SOTA) LLM-based tabular data synthesizers that rely on pre-trained LLMs, Tabula discards the pre-trained weights originally designed for natural language tasks, focusing instead on a tailored approach for tabular data. In addition, Tabula introduces a token sequence compression strategy that significantly reduces training time while maintaining data quality, alongside a novel token padding method that improves sequence alignment across training batches. Experiments on six datasets show that Tabula achieves superior synthetic data utility compared to current SOTA methods. The results also demonstrate that a Tabula model trained on tabular datasets serves effectively as a foundation model for synthesizing new tabular datasets, and that the proposed padding method outperforms the conventional left and right padding strategies. Finally, Tabula reduces training time per epoch by 46.2% on average compared to SOTA LLM-based approaches while achieving higher data utility. Our code is available at https://github.com/zhao-zilong/Tabula.
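The abstract compares the proposed padding method against conventional left and right padding. As a point of reference only, the following minimal sketch shows what those two conventional baselines typically look like when batching token sequences of unequal length. The function name `pad_batch`, the padding id `0`, and the example token ids are hypothetical and chosen for illustration; Tabula's own padding method is not described in this record and is not reproduced here.

```python
from typing import List

PAD_ID = 0  # hypothetical padding token id, chosen only for this illustration


def pad_batch(seqs: List[List[int]], side: str = "right",
              pad_id: int = PAD_ID) -> List[List[int]]:
    """Pad every sequence to the length of the longest one in the batch.

    side="right" appends padding after the tokens (conventional right padding);
    side="left" prepends padding before the tokens (conventional left padding).
    """
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    width = max((len(s) for s in seqs), default=0)
    padded = []
    for s in seqs:
        fill = [pad_id] * (width - len(s))
        padded.append(fill + s if side == "left" else s + fill)
    return padded


# Two hypothetical tokenized rows of different lengths.
batch = [[5, 6, 7], [8, 9]]
right_padded = pad_batch(batch, "right")
left_padded = pad_batch(batch, "left")
```

With right padding the shorter row gains padding at its end, and with left padding at its start; in both cases tokens that represent the same column can land at different positions across rows, which is the kind of alignment issue a padding strategy has to contend with.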


Year	2025
Event title	29th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2025
Event location	Sydney, Australia
Event date	2025
Volume title	Lecture Notes in Computer Science
Publisher	Springer Science and Business Media Deutschland GmbH
Volume number	15874 LNCS
Pages	247 to 259
ISBN	9789819681853; 9789819681860
DOI	https://dx.doi.org/10.1007/978-981-96-8186-0_20
Keywords	Generative Model; LLM; Tabular Data
All authors	Zhao, Zilong; Birke, Robert; Chen, Lydia Y.
Publication type	04A-Conference paper in volume

Files in this item:

File	Size	Format
Tabula.pdf (restricted access; publisher's PDF)	1.26 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2318/2104558

CINECA IRIS Institutional Research Information System
