Explainable Stacking Models based on Complementary Traffic Embeddings
Drago, Idilio;
2024-01-01
Abstract
Network security relies on effective measurement and analysis to identify malicious traffic. Recent proposals aim at automatically learning compact and informative representations (i.e., embeddings) of network traffic that capture salient features. These representations can serve multiple downstream tasks, streamlining the machine learning pipeline. Researchers have proposed techniques borrowed from Natural Language Processing (NLP) and Graph Neural Networks (GNN) to learn such embeddings, with both lines delivering promising results.

This paper investigates the benefits of combining complementary sources of information represented by embeddings learnt via different techniques and from different data. We rely on classifiers based on traditional feature engineering and on automatic embedding generation (borrowing from NLP and GNN) to classify hosts observed from darknets and honeypots. We then stack these base classifiers, each trained on one embedding, through meta-learning to combine the complementary information sources and improve performance.

Our results show that meta-learning outperforms each single classifier. Importantly, the proposed meta-learner provides explainability on the importance of the embedding types and the impact of each data source on the outcome. All in all, this work is a step forward in the search for more effective, general, understandable, and practical representations that could carry multiple traffic characteristics.
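The stacking scheme the abstract describes — several base classifiers, each trained on one embedding type, combined by a meta-learner whose weights expose the contribution of each source — can be sketched with scikit-learn's `StackingClassifier`. This is a minimal illustration, not the paper's implementation: the synthetic data and the three base-model names (`classic`, `nlp`, `gnn`) are placeholders standing in for the real feature-engineering, NLP, and GNN embeddings.

```python
# Hedged sketch of stacking with an explainable linear meta-learner.
# Synthetic data and model names are illustrative, not from the paper.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary task standing in for benign/malicious host labels.
X, y = make_classification(n_samples=600, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One base classifier per information source; here they differ only in
# hyperparameters, as stand-ins for models trained on distinct embeddings.
base = [
    ("classic", RandomForestClassifier(random_state=0)),
    ("nlp", RandomForestClassifier(max_depth=5, random_state=1)),
    ("gnn", RandomForestClassifier(max_depth=3, random_state=2)),
]

# A logistic-regression meta-learner stacks the base predictions; its
# fitted coefficients hint at how much each base classifier contributes,
# which is the kind of explainability the abstract refers to.
stack = StackingClassifier(estimators=base, final_estimator=LogisticRegression())
stack.fit(X_train, y_train)
acc = stack.score(X_test, y_test)
print(f"stacked accuracy: {acc:.2f}")
print("meta-learner coefficients:", stack.final_estimator_.coef_)
```

Inspecting `final_estimator_.coef_` (one weight per base classifier in the binary case) is what makes a linear meta-learner more interpretable than an opaque combiner.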
File | Size | Format
---|---|---
2024_WTMC_Stacking.pdf (open access; preprint, first draft) | 581.23 kB | Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.