Phan Thượng Cang * , Trần Thị Tố Quyên Triệu Thanh Ngoan

* Tác giả liên hệ (ptcang@cit.ctu.edu.vn)

Abstract

Handling and analyzing data quickly and efficiently in the era of big data is a significant challenge. Filtering algorithms enhance the performance of big data processing by eliminating irrelevant data, reducing computational costs, and shortening query processing times. This study evaluates the performance of five popular filtering algorithms: Bloom Filter, Cuckoo Filter, Quotient Filter, Morton Filter, and Vacuum Filter in an Apache Spark environment. Through experiments on large datasets, the results show that the Quotient Filter is the most efficient in terms of storage, the Cuckoo Filter demonstrates a good balance between insertion, search, and deletion speeds. The Bloom Filter is suitable for static data, the Morton Filter excels in search speed, and the Vacuum Filter has a slow insertion time but fast search and deletion times. Integrating these algorithms with Apache Spark significantly improves processing performance thanks to its distributed and parallel capabilities. The study results provide guidance on selecting suitable filtering algorithms and highlight the potential for effectively applying filtering algorithms in large-scale data processing.

Keywords: Apache Spark, Bloom Filter, Cuckoo Filter, Morton Filter, Quotient Filter, Vacuum Filter

Tóm tắt

Việc xử lý và phân tích dữ liệu nhanh chóng, hiệu quả trong kỷ nguyên dữ liệu lớn là thách thức quan trọng. Các thuật toán lọc giúp tăng hiệu suất xử lý dữ liệu lớn bằng cách loại bỏ dữ liệu không liên quan, giảm chi phí tính toán, rút ngắn thời gian xử lý truy vấn. Nghiên cứu này đánh giá hiệu năng của 5 thuật toán lọc phổ biến bao gồm Bloom Filter, Cuckoo Filter, Quotient Filter, Morton Filter và Vacuum Filter trong môi trường Apache Spark. Thông qua thực nghiệm trên các tập dữ liệu lớn, kết quả cho thấy Quotient Filter hiệu quả nhất về lưu trữ, Cuckoo Filter thể hiện sự cân bằng tốt giữa tốc độ chèn, tìm kiếm và xóa. Bloom Filter phù hợp với dữ liệu tĩnh, Morton Filter nổi trội về tốc độ tìm kiếm, Vacuum Filter có thời gian chèn chậm nhưng tìm kiếm và xóa nhanh. Việc kết hợp các thuật toán này với Apache Spark giúp cải tiến đáng kể hiệu suất xử lý nhờ khả năng phân tán và song song. Kết quả nghiên cứu cung cấp lựa chọn thuật toán lọc phù hợp và chỉ ra tiềm năng ứng dụng hiệu quả các thuật toán lọc trong xử lý dữ liệu quy mô lớn.

Từ khóa: Apache Spark, Bloom Filter, Cuckoo Filter, Morton Filter, Quotient Filter, Vacuum Filter

Article Details

Tài liệu tham khảo

Agarwal, S., Mozafari, B., Panda, A., Milner, H., Madden, S., & Stoica, I. (2013). BlinkDB: Queries with bounded errors and bounded response times on very large data. Proceedings of the 8th ACM European Conference on Computer Systems, EuroSys 2013, 29–42. ACM. https://doi.org/10.1145/2465351.2465355

Ahmad, F., Lee, S., Thottethodi, M., & Vijaykumar, T. N. (2012). PUMA: Purdue MapReduce Benchmarks Suite. In ECE Technical Reports.

Al-hisnawi, M., & Ahmadi, M. (2016). Deep Packet Inspection Using Quotient Filter. IEEE Communications Letters, 20(11), 2217–2220. https://doi.org/10.1109/LCOMM.2016.2601898

Breslow, A. D., & Jayasena, N. S. (2018). Morton filters: Faster, spaceefficient cuckoo filters via biasing, compression, and decoupled logical sparsity. Proceedings of the VLDB Endowment, 11(9). VLDB Endowment. https://doi.org/10.14778/3213880.3213884

Burdakov, A., Ermakov, E., Panichkina, A., Ploutenko, A., Grigorev, U., Ermakov, O., & Proletarskaya, V. (2019). Bloom Filter Cascade Application to SQL Query Implementation on Spark. Proceedings - 27th Euromicro International Conference on Parallel, Distributed and Network-Based Processing, PDP 2019. IEEE. https://doi.org/10.1109/EMPDP.2019.8671557

Chaudhuri, S., Ganti, V., & Kaushik, R. (2006). A primitive operator for similarity joins in data cleaning. Proceedings - International Conference on Data Engineering. IEEE. https://doi.org/10.1109/ICDE.2006.9

Ezzaki, F., Abghour, N., Elomri, A., Moussaid, K., & Rida, M. (2020). Bloom filter and its variants for the optimization of MapReduce’s algorithms: A review. Proceedings of 2020 5th International Conference on Cloud Computing and Artificial Intelligence: Technologies and Applications, CloudTech 2020. IEEE. https://doi.org/10.1109/CloudTech49835.2020.9365876

Fan, B., Andersen, D. G., Kaminsky, M., & Mitzenmacher, M. D. (2014). Cuckoo filter: Practically better than bloom. CoNEXT 2014 - Proceedings of the 2014 Conference on Emerging Networking Experiments and Technologies. Association for Computing Machinery. https://doi.org/10.1145/2674005.2674994

Fier, F., Augsten, N., Bouros, P., Leser, U., & Freytag, J. C. (2018). Set similarity joins on MapReduce: An experimental survey. Proceedings of the VLDB Endowment, 11(10). VLDB Endowment. https://doi.org/10.14778/3231751.3231760

García, S., Ramírez-Gallego, S., Luengo, J., Benítez, J. M., & Herrera, F. (2016). Big data preprocessing: methods and prospects. Big Data Analytics, 1(1). https://doi.org/10.1186/s41044-016-0014-0

Geil, A., Farach-Colton, M., & Owens, J. D. (2018). Quotient Filters: Approximate Membership Queries on the GPU. 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 451–462. IEEE. https://api.semanticscholar.org/CorpusID:3991218

Kumar, N., Sai, K. H. S., Hordiichuk, V., Menon, R., Catherene, J. A. C., Saha, G. C., & Balaji, K., (2023). Harnessing the Power of Big Data: Challenges and Opportunities in Analytics. Tuijin Jishu/Journal of Propulsion Technology, 44(2).
https://doi.org/10.52783/tjjpt.v44.i2.193

Li, L. (2021). Efficient Distributed Database Clustering Algorithm for Big Data Processing. Proceedings - 2021 6th International Conference on Smart Grid and Electrical Automation, ICSGEA 2021. IEEE. https://doi.org/10.1109/ICSGEA53208.2021.00118

Lu, Y., Prabhakar, B., & Bonomi, F. (2005). Bloom filters: Design innovations and novel applications. 43rd Annual Allerton Conference on Communication, Control and Computing 2005, 2.

Maulana, M. S., Linuwih, B. P., Nuha, H. H., & Satrya, G. B. (2023, August). Bloom, Xor, and Cuckoo Filter Comparison for Database’s Query Optimization. International Conference on ICT Convergence. IEEE. https://doi.org/10.1109/ICoICT58202.2023.10262536

Pandey, P., Bender, M. A., Johnson, R., & Patro, R. (2017). A general-purpose counting filter: Making every bit count. Proceedings of the ACM SIGMOD International Conference on Management of Data, Part F127746, 775–787. ACM.
https://doi.org/10.1145/3035918.3035963

Tran, T. T. Q., Phan, T. C., Laurent, A., & D’Orazio, L. (2020, July). Optimization for large-scale fuzzy joins using fuzzy filters in MapReduce. International Conference on Fuzzy Systems (FUZZ-IEEE) (pp 1-8). IEEE. . https://doi.org/10.1109/FUZZ48607.2020.9177610

Wang, M., Zhou, M., Shi, S., & Qian, C. (2019). Vacuum filters: More space-efficient and faster replacement for bloom and cuckoo filters. Proceedings of the VLDB Endowment, 13(2). VLDB Endowment. https://doi.org/10.14778/3364324.3364333

Yan, C., Zhao, X., Zhang, Q., & Huang, Y. (2017). Efficient string similarity join in multi-core and distributed systems. PLoS ONE, 12(3). https://doi.org/10.1371/journal.pone.0172526

Yu, M., Li, G., Deng, D., & Feng, J. (2016). String similarity search and join: a survey. In Frontiers of Computer Science (Vol. 10, Issue 3). https://doi.org/10.1007/s11704-015-5900-5

Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., & Stoica, I. (2010). Spark: cluster computing with working sets. Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, 10. USENIX Association. https://doi.org/ 10.5555/1863103.1863113.