IMPROVING CLASSIFICATION ACCURACY FOR MINORITY CLASSES IN IMBALANCED DATASETS
Abstract
A dataset is called imbalanced if some of its classes contain many more instances than others. In this case, accurately classifying samples from the small classes is very difficult, and the higher the imbalance ratio, the harder it is to obtain a good solution. Cost-sensitive learning is an effective approach to the class imbalance problem. In this paper, we present a decision system that takes misclassification costs into account. The system improves classification accuracy on the minority classes, which are the classes of interest in an imbalanced dataset. It builds on a study of cost-sensitive methods for classifying imbalanced data. The system is applied to medical diagnosis. Experimental results show that the accuracy of the diagnostic system is improved.
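As an illustration only, and not the system described in this paper, the general idea behind cost-sensitive decision making can be sketched as choosing the class with the lowest expected misclassification cost rather than the most probable class. The function name, the cost matrix, and the probabilities below are all hypothetical values chosen for the example.

```python
def min_expected_cost_class(probs, cost):
    """Return the index of the class with the lowest expected misclassification cost.

    probs[j]   : estimated probability that the true class is j
    cost[j][i] : cost of predicting class i when the true class is j
    """
    n_classes = len(probs)
    # Expected cost of predicting class i, averaged over the possible true classes j.
    expected = [
        sum(probs[j] * cost[j][i] for j in range(n_classes))
        for i in range(n_classes)
    ]
    return min(range(n_classes), key=expected.__getitem__)


# Hypothetical two-class setting: class 0 = majority (healthy), class 1 = minority (ill).
# Assumption: missing an ill patient costs 10 times more than a false alarm.
COST = [[0.0, 1.0],
        [10.0, 0.0]]

# The most probable class is 0, but the expected cost favours predicting class 1.
prediction = min_expected_cost_class([0.8, 0.2], COST)
```

With these assumed costs, a sample that is only 20% likely to belong to the minority class is still assigned to it, which is how unequal costs can shift decisions toward the small class.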