
Beyond the Metrics: an Investigation into the Reliability of Evaluation Metrics for Domain Specific Graph-based Question Answering

Draetta L.; Stranisci M. A.; Corallo F.; Balestrucci P. F.; Oliverio M.; Damiano R.; Mazzei A.
2025-01-01

Abstract

Recently, knowledge graph-based approaches have gained wider adoption across domains thanks to their ability to enhance explainability and reduce hallucination in domain-specific tasks. Although graph-based architectures have shown promising results, evaluation remains an open issue due to the complexity of the analysis and the inherent subjectivity and variability of practical use scenarios and stakeholders' needs. In this context, we present GRADES (Graph-based Reliability Assessment of Domain-specific Evaluation Systems), an evaluation framework for graph-based question answering. To investigate the reliability of current state-of-the-art evaluation strategies, we insert both automatic and qualitative human-based evaluation at each step (information extraction, entity linking, and verbalization) of a reference graph-based QA pipeline. At the final step, domain experts are engaged to assess both the correctness and the soundness of the verbalized output. We apply the pipeline and evaluation framework to a case study in the literary domain, showing that the step-by-step evaluation highlights the limits of off-the-shelf tools in a practical use case.
2nd International Workshop on Retrieval-Augmented Generation Enabled by Knowledge Graphs, RAGE-KG 2025
Nara Prefectural Convention Center, Japan
CEUR Workshop Proceedings, CEUR-WS, Vol. 4079, pp. 83-96, 2025
Human-in-the-Loop; Knowledge Graph; Question Answering

Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2318/2122711