
On LLM Embeddings for Vulnerability Management

Talibzade R.; Bergadano F.; Drago I.

Abstract

Vulnerability management requires identifying, classifying, and prioritizing security threats. Recent research has explored using Large Language Models (LLMs) to analyze Common Vulnerabilities and Exposures (CVEs), generating metadata to categorize vulnerabilities (e.g., into CWEs) and to determine severity ratings. This has led some studies to use CVE datasets as benchmarks for LLM-based threat analysis. We reproduce and extend one such benchmark, testing three approaches: (i) TF-IDF embeddings with shallow classifiers, (ii) LLM-generated embeddings with shallow classifiers, and (iii) direct prompting of LLMs for vulnerability metadata extraction. For the latter, we replicate the exact prompts from a recent benchmark and evaluate multiple state-of-the-art LLMs. Results show that classic TF-IDF classifiers still win the benchmark, followed by the generative method. The best model (TF-IDF) achieves 74% accuracy on the classification task. This appears to be caused by the heavily schematic text of CVEs, where keywords already determine the key characteristics of a vulnerability; general-purpose LLMs with generic prompts fail to capture that. These results call for more careful evaluation by the community of LLM applications to cybersecurity problems.
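
To make approach (i) concrete, below is a minimal sketch of a TF-IDF pipeline with a shallow classifier mapping CVE descriptions to CWE categories, in the spirit of the baseline the abstract describes. The sample descriptions, labels, and hyperparameters are illustrative assumptions, not the paper's dataset or exact configuration.

```python
# Sketch of approach (i): TF-IDF features + a shallow classifier (assumed setup,
# not the paper's exact configuration).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical CVE descriptions and CWE labels, for illustration only.
descriptions = [
    "SQL injection in the login form allows remote attackers to execute arbitrary SQL commands.",
    "Stack-based buffer overflow in the parser allows remote attackers to execute arbitrary code.",
    "Cross-site scripting (XSS) vulnerability allows injection of arbitrary web script or HTML.",
    "Improper authentication allows remote attackers to bypass the login check.",
]
labels = ["CWE-89", "CWE-121", "CWE-79", "CWE-287"]

# Word and bigram TF-IDF suits schematic CVE text: keywords such as
# "SQL injection" or "buffer overflow" largely determine the class.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(descriptions, labels)

print(model.predict(["Blind SQL injection via the id parameter allows arbitrary SQL commands."]))
```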
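
Approach (ii) swaps the TF-IDF vectors for LLM-generated embeddings while keeping the shallow classifier. The sketch below uses the OpenAI embeddings API as one possible provider; the client and model name are assumptions, since the abstract does not name a specific embedding model.

```python
# Sketch of approach (ii): LLM embeddings + a shallow classifier (assumed provider).
import numpy as np
from openai import OpenAI
from sklearn.linear_model import LogisticRegression

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical CVE descriptions and CWE labels (same toy data as above).
descriptions = [
    "SQL injection in the login form allows remote attackers to execute arbitrary SQL commands.",
    "Stack-based buffer overflow in the parser allows remote attackers to execute arbitrary code.",
]
labels = ["CWE-89", "CWE-121"]

# Embed each description with an LLM embedding model (model name is an assumption).
resp = client.embeddings.create(model="text-embedding-3-small", input=descriptions)
X = np.array([item.embedding for item in resp.data])

# Train the same kind of shallow classifier on the dense embeddings.
clf = LogisticRegression(max_iter=1000).fit(X, labels)

query = client.embeddings.create(
    model="text-embedding-3-small",
    input=["Blind SQL injection via the id parameter."],
)
print(clf.predict(np.array([query.data[0].embedding])))
```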
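
Approach (iii) prompts an LLM directly for vulnerability metadata. The paper replicates the exact prompts of a recent benchmark; those prompts are not reproduced here, so the prompt below is a generic placeholder, and the client and model name are likewise assumptions.

```python
# Sketch of approach (iii): direct LLM prompting for metadata extraction
# (placeholder prompt, not the benchmark's exact prompt).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

cve_text = (
    "SQL injection in the login form allows remote attackers to execute "
    "arbitrary SQL commands via the username parameter."
)

response = client.chat.completions.create(
    model="gpt-4o",  # any chat-capable model; the paper evaluates several state-of-the-art LLMs
    messages=[
        {"role": "system", "content": "You are a vulnerability analyst."},
        {
            "role": "user",
            "content": (
                "Given the following CVE description, return the most likely "
                "CWE identifier and a severity rating.\n\n" + cve_text
            ),
        },
    ],
)
print(response.choices[0].message.content)
```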
Year: 2025
Conference: 9th Network Traffic Measurement and Analysis Conference (TMA)
Location: Copenhagen
Dates: 10-13 June 2025
Publisher: IEEE
Pages: 1-4
ISBN: 979-8-3315-5505-4
Keywords: Common Vulnerabilities and Exposures (CVE), Common Weakness Enumeration (CWE), Large Language Models (LLMs), TF-IDF, Vulnerability Management, Software Security
Files in this record:
File: 2025_CCS_LAMPS.pdf (restricted access; request a copy)
Type: Preprint (first draft)
Size: 517.98 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2318/2094090
Citations:
  • PMC: not available
  • Scopus: 0
  • Web of Science: not available