CLASSIFYING IMBALANCED DATA WITH ROUGHLY BALANCED BAGGING

Phan Bích Chung* and Đỗ Thanh Nghị

* Corresponding author: Phan Bích Chung

Received: 20-04-2021

Accepted: 03-06-2022

Published: 01-05-2011

Citation

Chung, P. B., & Nghị, Đ. T. (2011). Phân lớp dữ liệu không cân bằng với Roughly Balanced Bagging [Classifying imbalanced data with Roughly Balanced Bagging]. Tạp chí Khoa học Đại học Cần Thơ, (20b), 189-197. Retrieved from https://ctujsvn.ctu.edu.vn/index.php/ctujsvn/article/view/1139

Issue

No. 20b (2011)

Section

Technology

Abstract

In this paper, we present an improvement of the Roughly Balanced Bagging algorithm (Hido & Kashima, 2008) for classifying imbalanced datasets. We propose using ensemble-based algorithms, including Boosting (Freund & Schapire, 1995) and Random Forest (Breiman, 2001), as the base learner of the original Roughly Balanced Bagging instead of a single decision tree (Quinlan, 1993). In addition, we adjust the under-sampling of the majority class so that the number of majority-class examples drawn for each subset follows a negative binomial distribution with an adjustable parameter. Experimental results on imbalanced datasets from the UCI repository (Asuncion & Newman, 2007) show that our proposal outperforms the original Roughly Balanced Bagging.

Keywords: Imbalanced data, Roughly Balanced Bagging, Bagging, Boosting, AdaBoost, Random Forest, Decision Tree, Negative binomial distribution
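The sampling idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the parameter `p`, and the majority-vote aggregation are assumptions made for illustration. The sketch keeps the minority class at its full size, draws the majority-class count from a negative binomial distribution (the number of failures before `n` successes, where `n` is the minority size), and accepts any base learner through the hypothetical `fit_base` callable, so a boosted or random-forest model could be plugged in where a single tree would otherwise go.

```python
import random
from collections import Counter


def draw_negative_binomial(n_successes, p, rng):
    """Count failures observed before n_successes successes in Bernoulli(p) trials."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    failures = 0
    successes = 0
    while successes < n_successes:
        if rng.random() < p:
            successes += 1
        else:
            failures += 1
    return failures


def roughly_balanced_subset(minority, majority, p=0.5, rng=None):
    """Build one training subset with a roughly balanced class ratio.

    With p = 0.5 the expected majority count equals the minority size;
    p is the adjustable parameter (an assumed name, not from the paper).
    """
    rng = rng or random.Random()
    n_min = len(minority)
    n_maj = draw_negative_binomial(n_min, p, rng)
    # Both classes are drawn with replacement (bootstrap sampling).
    min_sample = [rng.choice(minority) for _ in range(n_min)]
    maj_sample = [rng.choice(majority) for _ in range(n_maj)]
    return min_sample, maj_sample


def fit_rbb(minority, majority, fit_base, n_bags=10, p=0.5, seed=0):
    """Train n_bags base models, each on its own roughly balanced subset.

    fit_base(pos, neg) is a hypothetical callable returning a predictor x -> label.
    """
    rng = random.Random(seed)
    models = []
    for _ in range(n_bags):
        pos, neg = roughly_balanced_subset(minority, majority, p, rng)
        models.append(fit_base(pos, neg))
    return models


def predict_rbb(models, x):
    """Aggregate base predictions by majority vote (a simplifying assumption)."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]
```

Because the majority count is itself random, the class ratio varies from one subset to the next while staying balanced on average. Choosing `p` below 0.5 yields more majority-class examples per subset on average, and choosing it above 0.5 yields fewer.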

References

Asuncion, A. & Newman, D.J.: UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science, 2007. [http://www.ics.uci.edu/~m-learn/MLRepository.html]

Breiman, L., Friedman, J., Olshen, R. and Stone, C.: Classification and Regression Trees. Chapman & Hall, New York, 1984.

Breiman, L.: Bagging predictors. Machine Learning 24(2):123–140, 1996.

Breiman, L.: Random Forests. Machine Learning, 45(1):5-32, 2001.

Chawla, N., Japkowicz, N. and Kolcz, A.: ICML Workshop on Learning from Imbalanced Data Sets. 2003.

Chawla, N., Japkowicz, N. and Kolcz, A.: Special Issue on Class Imbalances. In SIGKDD Explorations Vol. 6, 2004.

Chawla, N., Lazarevic, A., Hall, L.O. and Bowyer, K.W.: SMOTEBoost: Improving prediction of the minority class in boosting. In proc. of European Conf. on Principles and Practice of Knowledge Discovery in Databases, pp. 107–119, 2003.

Domingos, P.: Metacost: A general method for making classifiers cost sensitive. In proc. of Intl Conf. on Knowledge Discovery and Data Mining, pp. 155–164, 1999.

Freund, Y. and Schapire, R.: A decision-theoretic generalization of on-line learning and an application to boosting. In Computational Learning Theory: Proceedings of the Second European Conference, pp. 23–37, 1995.

Hido, S. and Kashima, H.: Roughly balanced bagging for imbalanced data. In proc. of SIAM Intl Conference on Data Mining, pp. 143–152, 2008.

Ihaka, R. and Gentleman, R.: R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics, 5(3):299-314, 1996.

Lenca, P., Lallich, S., Do, T-N. and Pham, N-K.: A comparison of different off-centered entropies to deal with class imbalance for decision trees. In The Pacific-Asia Conference on Knowledge Discovery and Data Mining, LNAI 5012, pp. 634–643, 2008.

Liu, X.-Y., Wu, J. and Zhou, Z.-H.: Exploratory under-sampling for class-imbalance learning. In proc. of Sixth IEEE Intl Conf. on Data Mining (ICDM’06), pp. 965–969, 2006.

Liu, X-Y. and Zhou, Z-H.: The influence of class imbalance on cost-sensitive learning: An empirical study. In proc. of Sixth IEEE Intl Conf. on Data Mining (ICDM’06), pp. 970–974, 2006.

Quinlan, J.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993.

van Rijsbergen, C.J.: Information Retrieval. Butterworth, 1979.

Vapnik, V.: The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.

Weiss, G.M. and Provost, F.: Learning when training data are costly: The effect of class distribution on tree induction. Journal of Artificial Intelligence Research 19:315–354, 2003.

Yang, Q. and Wu, X.: 10 Challenging Problems in Data Mining Research. Intl Journal of Information Technology and Decision Making 5(4), 597–604, 2006.
