A revised cluster number estimation algorithm for big datasets (Giải thuật ước lượng số cụm dữ liệu cải tiến cho tập dữ liệu lớn)


doi:10.22144/ctu.jsi.2017.006

Dương Văn Hiếu*, Trần Huy Long and Phạm Ngọc Giàu

* Corresponding author (dvhieu@nomail.com)


Received: 15-09-2017

Revised: 10-10-2017

Accepted: 20-10-2017

Published: 20-10-2017

Title: A revised cluster number estimation algorithm for big datasets



Citation

Hiếu, D. V., Long, T. H., & Giàu, P. N. (2017). Giải thuật ước lượng số cụm dữ liệu cải tiến cho tập dữ liệu lớn. Tạp chí Khoa học Đại học Cần Thơ, (CĐ Công nghệ TT), 42-53. https://doi.org/10.22144/ctu.jsi.2017.006

Issue

Special issue: Information Technology (2017)

Section

Information Technology

Abstract

This paper presents a revised cluster number estimation algorithm for big datasets, designed to run on a standard personal computer. It improves on the Cell-MST-based cluster number estimation algorithm by applying a weighted distance in place of the Euclidean distance. The new algorithm is named the Weighted-Cell-MST-based cluster number estimation algorithm. When tested on the same datasets in the same environment, it gives more stable results than its predecessor.

Keywords: Big datasets, Cell-MST-based, Cluster number estimation, Weighted-Cell-MST-based

Summary

This paper presents an improved algorithm for estimating the number of clusters in a large dataset. The algorithm is designed to run on a personal computer with a basic configuration. It improves on the Cell-MST-based cluster number estimation algorithm by using a weighted distance in place of the Euclidean distance. The improved algorithm is named the Weighted-Cell-MST-based cluster number estimation algorithm. It gives more stable results than the original algorithm on the same datasets under the same experimental conditions.

Keywords: Minimum spanning tree, optimal graph, big datasets, dataset cell partitioning, cluster number estimation
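The abstract names the ingredients of the method (cells, a minimum spanning tree, and a weighted distance replacing the Euclidean one) but does not define them on this page. The following is therefore only a minimal sketch of the general idea, not the authors' algorithm. Everything in it is an assumption made for illustration: the square-grid cell partitioning, the function names, the weighting (Euclidean distance between cell centers divided by the square root of the smaller cell's point count), and the rule that cuts MST edges longer than `cut_factor` times the median edge.

```python
import math
import statistics
from collections import defaultdict


def cell_summaries(points, cell_size):
    """Group 2-D points into square grid cells; return (center, count) per non-empty cell."""
    cells = defaultdict(list)
    for x, y in points:
        key = (math.floor(x / cell_size), math.floor(y / cell_size))
        cells[key].append((x, y))
    summaries = []
    for members in cells.values():
        n = len(members)
        center = (sum(p[0] for p in members) / n, sum(p[1] for p in members) / n)
        summaries.append((center, n))
    return summaries


def euclidean(a, b):
    """Plain Euclidean distance between two cell centers."""
    return math.dist(a[0], b[0])


def weighted(a, b):
    """ASSUMED weighting (not from the paper): shrink distances between dense cells."""
    return math.dist(a[0], b[0]) / math.sqrt(min(a[1], b[1]))


def mst_edge_weights(cells, distance):
    """Prim's algorithm on the complete graph of cells; returns the n-1 MST edge weights."""
    n = len(cells)
    if n < 2:
        return []
    best = [math.inf] * n      # cheapest known link from each cell to the tree
    in_tree = [False] * n
    best[0] = 0.0              # start the tree at cell 0
    weights = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        weights.append(best[u])
        for v in range(n):
            if not in_tree[v]:
                best[v] = min(best[v], distance(cells[u], cells[v]))
    return weights[1:]         # drop the 0.0 placeholder of the start cell


def estimate_cluster_count(points, cell_size, distance, cut_factor=3.0):
    """Count MST edges much longer than the median edge; each cut adds one cluster."""
    cells = cell_summaries(points, cell_size)
    if len(cells) < 2:
        return len(cells)
    weights = mst_edge_weights(cells, distance)
    threshold = cut_factor * statistics.median(weights)  # ASSUMED cut rule
    return 1 + sum(w > threshold for w in weights)
```

Working on cell summaries rather than raw points is what keeps the spanning tree small enough for a modest machine: the tree is built over the non-empty cells, whose number depends on `cell_size` and not directly on the dataset size. Passing `euclidean` or `weighted` as the `distance` argument lets the two variants be compared on the same data.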

