Improving Semantic Parsing and Text Generation Through Multi-Faceted Data Augmentation

Amin, Muhammad Saad; Anselma, Luca; Mazzei, Alessandro

doi:10.1109/access.2025.3593857

The increasing use of large language models has heightened the demand for larger datasets in natural language processing (NLP). Although various augmentation techniques are used to increase data quantity, many introduce noise or struggle with structurally complex inputs such as Discourse Representation Structures (DRS). This study introduces novel data augmentation techniques for both semantic parsing (Text-to-DRS) and text generation (DRS-to-Text), focusing on named entity augmentation, lexical substitution using WordNet, and grammatical transformation through changes in tense. The proposed methods considerably expand the Parallel Meaning Bank (PMB) dataset while preserving semantic accuracy and contextual relevance. The augmentation increased both gold and silver instances by a factor of 9, resulting in over 1.3 million new examples. We evaluated four transformer models (ByT5, mT5, T5, and mBART) on the augmented dataset. The experiments revealed substantial improvements across multiple performance metrics. Notably, for semantic parsing, we observed a 17.65% increase in SMATCH (F1) score; for text generation, we observed improvements of 14.38% in BLEU score and 6.43% in METEOR score. These improvements highlight the effectiveness of the proposed augmentation methods in boosting model capabilities for complex neural semantic parsing and generation tasks.
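
The named entity augmentation mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `swap_named_entity` and `augment`, the toy clause strings, and the assumption that names appear lowercased and quoted in the meaning representation are all hypothetical choices made for illustration. The point it shows is that a substitution must be applied to the sentence and to its DRS together, so that the pair stays aligned.

```python
import re


def swap_named_entity(text, drs, old_name, new_name):
    """Replace old_name with new_name in a sentence and in its paired DRS string."""
    # Whole-word, case-sensitive replacement in the surface sentence.
    new_text = re.sub(rf"\b{re.escape(old_name)}\b", new_name, text)
    # Assumption: the DRS stores names lowercased inside double quotes.
    new_drs = drs.replace(f'"{old_name.lower()}"', f'"{new_name.lower()}"')
    return new_text, new_drs


def augment(text, drs, old_name, candidates):
    """Produce one new (text, DRS) pair per candidate name, skipping the original."""
    return [
        swap_named_entity(text, drs, old_name, name)
        for name in candidates
        if name != old_name
    ]


# Toy example with a simplified, hypothetical clause notation.
pairs = augment(
    "Tom smiled.",
    'b1 Name x1 "tom"\nb1 smile e1',
    "Tom",
    ["Mary", "Tom", "Alex"],
)
for new_text, new_drs in pairs:
    print(new_text)
    print(new_drs)
```

Each candidate name yields one additional training pair, which is how a single source example can be multiplied several times without changing the structure of the meaning representation.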