TEXT CLASSIFICATION: THE BAG-OF-WORDS MODEL AND ENSEMBLES OF MACHINE LEARNING MODELS
Abstract
This paper presents an approach to classifying text documents using the bag-of-words (BoW) model and ensemble-based learning algorithms. The ensemble-based learning algorithms include random multinomial naive Bayes (rMNB) and random oblique decision stump (rODS) models. The bag-of-words model represents each text document as a sparse vector of word occurrence counts. This pre-processing step produces a dataset with a very large number of dimensions. We therefore propose new algorithms, called boosting of random multinomial naive Bayes and boosting of random oblique decision stump models, which are usually suited to classifying very-high-dimensional datasets. Experimental results on a real dataset show that the proposed algorithms perform well compared with other algorithms. The new approach achieves an accuracy of 94.8%.
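As an illustration of the bag-of-words representation described in the abstract, the following is a minimal sketch in Python. It assumes whitespace tokenization and lowercasing, which the abstract does not specify; the function names `build_vocabulary` and `bag_of_words` and the two sample documents are hypothetical and are not taken from the paper.

```python
from collections import Counter


def build_vocabulary(docs):
    """Map each distinct lowercase token to a column index, in first-seen order."""
    vocab = {}
    for doc in docs:
        for token in doc.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def bag_of_words(doc, vocab):
    """Return a sparse vector {column index: count}; words outside vocab are ignored."""
    counts = Counter(t for t in doc.lower().split() if t in vocab)
    return {vocab[t]: c for t, c in counts.items()}


# Hypothetical toy corpus, for illustration only.
docs = ["the cat sat on the mat", "the dog chased the cat"]
vocab = build_vocabulary(docs)
vectors = [bag_of_words(d, vocab) for d in docs]
```

Each document stores only its non-zero counts, so the vector stays small even when the vocabulary, and hence the number of dimensions, grows very large.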