AutoPenBench: Benchmarking Generative Agents for Penetration Testing

Idilio Drago; Alexander Delsanto
2024-01-01

Abstract

Generative AI agents, software systems powered by Large Language Models (LLMs), are emerging as a promising approach to automating cybersecurity tasks. Among these tasks, penetration testing is particularly challenging due to its complexity and the diverse strategies required to simulate cyberattacks. Despite growing interest and initial studies on automating penetration testing with generative agents, there is still no comprehensive and standard framework for their evaluation, comparison and development. This paper introduces AUTOPENBENCH, an open benchmark for evaluating generative agents in automated penetration testing. We address the shortcomings of existing approaches with a comprehensive framework that includes 33 tasks, each representing a vulnerable system that the agent must attack. Tasks span increasing difficulty levels and include both in-vitro and real-world scenarios. To assess agent performance, we define generic and specific milestones that allow anyone to compare results in a standardised manner and to understand the limits of the agent under test. We show the benefits of our methodology by benchmarking two modular agent cognitive architectures: a fully autonomous agent and a semi-autonomous agent supporting human interaction. Our benchmark lets us compare their performance and limitations. For instance, the fully autonomous agent performs unsatisfactorily, achieving a 21% Success Rate across the benchmark, solving 27% of the simple tasks and only one real-world task. In contrast, the assisted agent shows substantial improvements, attaining a 64% Success Rate. AUTOPENBENCH also lets us observe how different LLMs, such as GPT-4o, Gemini Flash and OpenAI o1, affect the agents' ability to complete the tasks. We believe our benchmark fills the gap by offering a standard and flexible framework for comparing penetration testing agents on common ground. We hope to extend AUTOPENBENCH together with the research community by making it available at https://github.com/lucagioacchini/auto-pen-bench.
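
As an illustration of the milestone-based evaluation described in the abstract, the sketch below shows how per-task progress and the overall Success Rate could be computed. The names used here (Milestone, TaskRun, progress, success_rate) are assumptions made for illustration only and do not reflect the actual AutoPenBench code or API.

    from dataclasses import dataclass

    # Hypothetical sketch of milestone-based scoring as described in the abstract.
    # All names are illustrative assumptions, not the actual AutoPenBench API.

    @dataclass
    class Milestone:
        name: str
        reached: bool

    @dataclass
    class TaskRun:
        task_id: str
        milestones: list      # generic + task-specific checkpoints (list of Milestone)
        flag_captured: bool   # whether the final goal of the attack task was achieved

    def progress(run):
        """Fraction of milestones the agent reached within a single task."""
        if not run.milestones:
            return 0.0
        return sum(m.reached for m in run.milestones) / len(run.milestones)

    def success_rate(runs):
        """Share of tasks fully solved across the benchmark."""
        return sum(r.flag_captured for r in runs) / len(runs)

Under this reading, an agent that captures the flag in roughly 7 of the 33 tasks would score a Success Rate of about 21%, consistent with the fully autonomous agent's result reported above.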
Year: 2024
Pages: 1-22
https://arxiv.org/abs/2410.03225
Keywords: Generative agents, Large Language Models, Penetration testing, Cybersecurity
Authors: Luca Gioacchini, Marco Mellia, Idilio Drago, Alexander Delsanto, Giuseppe Siracusano, Roberto Bifulco
Files in this record:

2024_arxiv_autopenbench.pdf
  Description: Pre-Print
  File type: PREPRINT (FIRST DRAFT)
  Access: Open access
  Size: 1 MB
  Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2318/2019630