Scalability Versus Accuracy Trade-offs in Distributed Big Data Processing Frameworks: A Comparative Evaluation of Apache Spark, Flink, and Dask Using Benchmark Datasets

Isiaq O. ALABI; Hassan T. ABDULAZEEZ; Sulaiman AHMAD; Yahaya M. SANI

doi:10.33003/cy583644

Keywords:

Big data processing; Distributed computing; Apache Spark; Apache Flink; Dask; Performance benchmarking; Fault tolerance.

Abstract

The exponential growth in data volume, velocity, and variety has intensified demand for distributed processing frameworks that balance computational scalability with analytical accuracy. Apache Spark, Apache Flink, and Dask are three dominant open-source ecosystems, yet selecting an appropriate framework requires a nuanced understanding of their performance characteristics under diverse workloads. This study presents a systematic comparative evaluation of these frameworks using standardized benchmark datasets, namely the Transaction Processing Performance Council Decision Support benchmark (TPC-DS) at the 100 GB scale factor and HiBench version 7.1, across four dimensions: execution time, memory consumption, fault tolerance, and result consistency. Experiments were conducted on Amazon Web Services EC2 infrastructure using identical c5.4xlarge instances (16 vCPUs, 32 GB RAM) configured in standalone cluster mode. Results show that Spark performed best on batch-oriented SQL workloads, completing 92 of 99 TPC-DS queries with the lowest average runtime (18% faster than Flink, 32% faster than Dask). Flink exhibited superior latency characteristics and exactly-once processing semantics, recovering from simulated node failures within 12 seconds, compared with 45 seconds for Spark. Dask delivered competitive performance on iterative machine learning tasks but showed higher memory volatility and occasional floating-point inconsistencies during fault recovery. These findings provide empirical guidance for practitioners designing analytics pipelines in domains that require both timeliness and computational precision, including cybersecurity threat detection and financial analytics.
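
The abstract reports relative runtimes ("18% faster") and result consistency without stating how either was computed. The following is a minimal Python sketch of one common interpretation, not the authors' method: "faster" is read as the relative reduction in mean wall-clock runtime, and consistency as element-wise agreement within a floating-point tolerance. The function names, the repeat count, and the tolerance are hypothetical choices made for illustration.

```python
import math
import statistics
import time
from typing import Callable, Sequence


def mean_runtime(run_query: Callable[[], object], repeats: int = 3) -> float:
    """Average wall-clock seconds over `repeats` runs of a query callable."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()  # hypothetical stand-in for submitting one benchmark query
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples)


def percent_faster(baseline_s: float, candidate_s: float) -> float:
    """Relative runtime reduction of the candidate versus the baseline, in percent."""
    return 100.0 * (baseline_s - candidate_s) / baseline_s


def results_consistent(
    a: Sequence[float], b: Sequence[float], rel_tol: float = 1e-9
) -> bool:
    """True when two numeric result columns agree element-wise within rel_tol."""
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=rel_tol) for x, y in zip(a, b)
    )
```

Under this reading, a candidate averaging 82 s against a baseline averaging 100 s gives `percent_faster(100.0, 82.0) == 18.0`. Comparing with a tolerance rather than exact equality matters because distributed aggregations can sum floating-point values in different orders between runs.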

Published

25-04-2026

Issue

Vol. 2 No. 1 (2026): June 2026

Section

Articles

License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

How to Cite

[1]

Isiaq O. ALABI, Hassan T. ABDULAZEEZ, Sulaiman AHMAD, and Yahaya M. SANI, “Scalability Versus Accuracy Trade-offs in Distributed Big Data Processing Frameworks: A Comparative Evaluation of Apache Spark, Flink, and Dask Using Benchmark Datasets”, FJET, vol. 2, no. 1, pp. 395–401, Apr. 2026, doi: 10.33003/cy583644.
