
Explainable Stacking Models based on Complementary Traffic Embeddings


Abstract

Network security relies on effective measurements and analysis for identifying malicious traffic. Recent proposals aim at automatically learning compact and informative representations (i.e., embeddings) of network traffic that capture salient features. These representations can serve multiple downstream tasks, streamlining the machine learning pipeline. Researchers have proposed techniques borrowed from Natural Language Processing (NLP) and Graph Neural Networks (GNN) to learn such embeddings, with both lines delivering promising results. This paper investigates the benefits of combining complementary sources of information represented by embeddings learnt via different techniques and from different data. We rely on classifiers based on traditional feature engineering and on automatic embedding generation (borrowing from NLP and GNN) to classify hosts observed from darknets and honeypots. We then stack these base classifiers, each trained on one embedding, through meta-learning to combine the complementary information sources and improve performance. Our results show that meta-learning outperforms each single classifier. Importantly, the proposed meta-learner provides explainability on the importance of the embedding types and the impact of each data source on the outcome. All in all, this work is a step forward in the search for more effective, general, understandable, and practical representations that could carry multiple traffic characteristics.
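The stacking approach the abstract describes — training one base classifier per embedding view and combining their predictions with a meta-learner — can be sketched with scikit-learn's `StackingClassifier`. This is a minimal illustration, not the paper's implementation: the synthetic data and the two base models stand in for the paper's NLP/GNN embeddings and per-embedding classifiers, which are not included in this record.

```python
# Hedged sketch of model stacking with a linear meta-learner, using
# synthetic data as a stand-in for the paper's darknet/honeypot embeddings.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for host features; in the paper each base model
# would instead see a different embedding of the same hosts.
X, y = make_classification(n_samples=500, n_features=40, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One base classifier per "view" (names are illustrative).
base = [
    ("view_a", make_pipeline(StandardScaler(), LogisticRegression())),
    ("view_b", RandomForestClassifier(n_estimators=100, random_state=0)),
]

# A linear meta-learner keeps its coefficients inspectable, echoing the
# paper's point that stacking can expose the importance of each source.
stack = StackingClassifier(estimators=base,
                           final_estimator=LogisticRegression())
stack.fit(X_train, y_train)
print(f"stacked accuracy: {stack.score(X_test, y_test):.2f}")
```

After fitting, `stack.final_estimator_.coef_` gives one weight per base-model output, which is one simple way to read off how much each information source contributes to the final decision.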
2024
IEEE European Symposium on Security and Privacy Workshops (EuroS&PW)
Vienna, Austria
08-12 July 2024
Proceedings of the 2024 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW)
IEEE COMPUTER SOC
pp. 261-272
Representation learning; traffic classification; meta-learning; model stacking
Gioacchini, Luca; Santos, Welton; Lopes, Barbara; Drago, Idilio; Mellia, Marco; Almeida, Jussara M.; Gonçalves, Marcos André
Open-access preprint file: 2024_WTMC_Stacking.pdf (Adobe PDF, 581.23 kB)

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2318/2031935
Citations
  • PubMed Central: ND
  • Scopus: 0
  • Web of Science: 0