Leveraging Adaptive I/O to Optimize Collective Data Shuffling Patterns for Big Data Analytics

Misale, Claudia
2017-01-01

Abstract

Big data analytics is an indispensable tool in transforming science, engineering, medicine, health care, finance and ultimately business itself. With the explosion of data sizes and the need for shorter time-to-solution, in-memory platforms such as Apache Spark are gaining increasing popularity. In this context, data shuffling, a particularly difficult transformation pattern, introduces important challenges. Specifically, data shuffling is a key component of complex computations that has a major impact on the overall performance and scalability. Thus, speeding up data shuffling is a critical goal. To this end, state-of-the-art solutions often rely on overlapping the data transfers with the shuffling phase. However, they employ simple mechanisms to decide how much data to fetch and where to fetch it from, which leads to sub-optimal performance and excessive auxiliary memory utilization for the purpose of prefetching. The latter aspect is a growing concern, given evidence that memory per computation unit is continuously decreasing while interconnect bandwidth is increasing. This paper contributes a novel shuffle data transfer strategy that addresses both dimensions by dynamically adapting the prefetching to the computation. We implemented this strategy in Spark, a popular in-memory data analytics framework. To demonstrate the benefits of our proposal, we ran extensive experiments on an HPC cluster with a large core count per node. Compared with the default Spark shuffle strategy, our proposal shows up to 40% better performance, 50% lower memory utilization for buffering, and excellent weak scalability.
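The key mechanism described in the abstract is keeping remote shuffle fetches within a bounded, adaptive memory budget, so that data transfers overlap with the reduce-side computation without over-buffering. The paper's actual design is not reproduced here; the following is a minimal, hypothetical Scala sketch, with invented names such as `AdaptiveShuffleReader` and `ShuffleBlock`, of how a reduce-side reader might grow its fetch-ahead window when the consumer stalls and shrink it when the buffer runs well ahead.

```scala
// Hypothetical sketch of adaptive shuffle prefetching (not the paper's code).
// A reduce-side reader keeps a bounded amount of data buffered and grows or
// shrinks its fetch-ahead window depending on how the consumer keeps up.

import scala.collection.mutable

// A remote map-output block to be fetched during the shuffle (illustrative type).
case class ShuffleBlock(id: Int, sizeBytes: Long)

class AdaptiveShuffleReader(blocks: Seq[ShuffleBlock],
                            minWindowBytes: Long = 8L << 20,    // 8 MB floor
                            maxWindowBytes: Long = 64L << 20) { // 64 MB ceiling

  private val pending = mutable.Queue(blocks: _*)          // not yet requested
  private val ready   = mutable.Queue.empty[ShuffleBlock]  // fetched, not yet consumed
  private var bufferedBytes = 0L                            // bytes currently buffered
  private var windowBytes   = minWindowBytes                // current fetch-ahead target

  // Issue fetches until the buffered data reaches the current window.
  // In a real system this would fire asynchronous network requests;
  // here the enqueue is a synchronous stand-in.
  private def fill(): Unit =
    while (pending.nonEmpty && bufferedBytes < windowBytes) {
      val b = pending.dequeue()
      bufferedBytes += b.sizeBytes
      ready.enqueue(b)
    }

  // Hand the next block to the reduce task, adapting the window as we go.
  def next(): Option[ShuffleBlock] = {
    if (ready.isEmpty && pending.nonEmpty)
      // Consumer would have stalled: prefetch more aggressively next time.
      windowBytes = math.min(maxWindowBytes, windowBytes * 2)
    else if (bufferedBytes > windowBytes / 2)
      // Buffer is running well ahead of the consumer: shrink to save memory.
      windowBytes = math.max(minWindowBytes, windowBytes / 2)
    fill()
    if (ready.isEmpty) None
    else {
      val b = ready.dequeue()
      bufferedBytes -= b.sizeBytes
      Some(b)
    }
  }
}

object Demo extends App {
  val blocks = (1 to 100).map(i => ShuffleBlock(i, sizeBytes = 4L << 20)) // 4 MB each
  val reader = new AdaptiveShuffleReader(blocks)
  Iterator.continually(reader.next())
          .takeWhile(_.isDefined)
          .foreach(b => println(s"consumed block ${b.get.id}"))
}
```

The window adjustment is the part that corresponds to "adapting the prefetching to the computation": the memory spent on buffering tracks what the reduce task actually needs rather than a fixed, worst-case allocation.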
Year: 2017
Volume: 28
Issue: 6
Pages: 1663-1674
Keywords: Big data; Context; Data transfer; Explosions; Memory management; Scalability; Sparks; Big data analytics; data shuffling; elastic buffering; memory-efficient I/O
Authors: Nicolae, Bogdan; Costa, Carlos H. A.; Misale, Claudia; Katrinis, Kostas; Park, Yoonho
Files in this record:

tpds_4aperto.pdf
  Access: open
  File type: preprint (first draft)
  Size: 2.84 MB
  Format: Adobe PDF

07740885.pdf
  Description: main article
  Access: restricted (copy available on request)
  File type: publisher's PDF
  Size: 2.71 MB
  Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2318/1624908
Citations
  • PMC: not available
  • Scopus: 21
  • Web of Science: 17