The problem of data imbalances and some methods of processing imbalanced data in deep learning models

Lê Tống Thanh Hải* and Pham Ngọc Giàu

doi:10.22144/ctujos.2024.407

* Corresponding author (tonglethanhhai@tgu.edu.vn)


Received: 31-01-2024

Revised: 26-02-2024

Accepted: 30-04-2024

Published: 17-10-2024

Title: The problem of data imbalances and some methods of processing imbalanced data in deep learning models



Citation

Hải, L. T. T., & Giàu, P. N. (2024). Vấn đề mất cân bằng dữ liệu và một số phương pháp xử lý dữ liệu mất cân bằng trong mô hình học sâu [The problem of data imbalances and some methods of processing imbalanced data in deep learning models]. Tạp chí Khoa học Đại học Cần Thơ, 60(5). https://doi.org/10.22144/ctujos.2024.407

Issue

Vol. 60 No. 5 (2024)

Section

Information Technology

Abstract

This article addresses data imbalance, a common phenomenon in binary classification in which one class has significantly fewer samples than the other. We compare and evaluate several approaches to handling imbalanced data in deep learning, using the Cat-Dog dataset to study how imbalance affects classification. The solutions compared are improvements drawn from three approaches, Data, Model and Loss, all aimed at enhancing the predictive performance of machine learning algorithms. We also propose a Model approach that applies Transfer Learning with a ResNet-18 model pre-trained on the ImageNet dataset; it achieves an F1-score of 95.19% and an accuracy of 95.20% after only 10 epochs. This result is superior to those of previous studies that focused on improving Data and Loss.
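The abstract reports both an F1-score and an accuracy. A minimal sketch of why the two can diverge on imbalanced data is given below. The label counts and both sets of predictions are hypothetical and are not taken from the article, and the abstract does not state which F1 averaging the authors used; the sketch computes F1 for the minority class only.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem with the given positive label."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive class: 2*TP / (2*TP + FP + FN), or 0.0 if undefined."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical imbalanced test set: 95 majority samples (0), 5 minority samples (1).
y_true = [0] * 95 + [1] * 5

# A degenerate classifier that always predicts the majority class.
always_majority = [0] * 100

# A second hypothetical classifier: 2 false alarms, 4 of 5 minority samples found.
second_model = [0] * 93 + [1] * 2 + [1] * 4 + [0] * 1

for name, y_pred in [("always_majority", always_majority), ("second_model", second_model)]:
    print(name, round(accuracy(y_true, y_pred), 4), round(f1_score(y_true, y_pred), 4))
```

In this illustration the always-majority classifier scores high accuracy while finding no minority samples at all, which is why a metric such as F1 is usually reported alongside accuracy in imbalance studies.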

Keywords: Imbalanced Data, Binary Classification, Over-Sampling, Under-Sampling
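The two resampling keywords above can be illustrated with a short sketch of random over-sampling and random under-sampling. This shows the general techniques only, not the authors' implementation. The function names, the seed, and the 90/10 toy split are assumptions made for the example.

```python
import random
from collections import Counter


def _group_by_class(samples, labels):
    """Map each label to the list of samples that carry it."""
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    return by_class


def random_over_sample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of smaller classes up to the largest class size."""
    rng = random.Random(seed)
    by_class = _group_by_class(samples, labels)
    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out_samples.extend(items + extra)
        out_labels.extend([label] * target)
    return out_samples, out_labels


def random_under_sample(samples, labels, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    by_class = _group_by_class(samples, labels)
    target = min(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        out_samples.extend(rng.sample(items, target))
        out_labels.extend([label] * target)
    return out_samples, out_labels


# Hypothetical toy data: 90 samples of class 0 and 10 samples of class 1.
samples = list(range(100))
labels = [0] * 90 + [1] * 10

over_samples, over_labels = random_over_sample(samples, labels)
under_samples, under_labels = random_under_sample(samples, labels)

print("original:", Counter(labels))
print("over-sampled:", Counter(over_labels))
print("under-sampled:", Counter(under_labels))
```

Over-sampling keeps every original sample but repeats minority items, while under-sampling discards majority items; in practice either would be applied to the training split only, so that evaluation data keeps its original distribution.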


This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

