Phạm Bích Như*, Nguyễn Thị Tú Trinh, Lê Thị Huỳnh Như, Lưu Minh Thư, Huỳnh Lan Thanh

* Corresponding author (pbnhu@ctu.edu.vn)

Abstract

This study clusters authors based on the keywords of their scientific papers, using singular value decomposition (SVD) and the K-means algorithm. First, the keywords are represented as TF-IDF vectors, and SVD is applied to reduce dimensionality while retaining the important features. Next, the K-means algorithm clusters the papers according to the similarity of their keywords, thereby grouping authors with similar research topics. Combining these methods makes the analysis of text data more efficient.

Keywords: K-means algorithm, singular value, singular value decomposition (SVD), term frequency-inverse document frequency (TF-IDF)
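
The pipeline described in the abstract (TF-IDF, then SVD, then K-means) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the toy papers and author names are hypothetical, the smoothed IDF formula and the row normalisation are assumed conventions, the choice of 2 SVD components and 2 clusters is arbitrary, and the deterministic farthest-point initialisation of K-means is used here only to keep the example reproducible.

```python
import math
from collections import Counter

import numpy as np

# Hypothetical toy data (not from the article): one (author, keywords) pair per paper.
papers = [
    ("Author A", ["svd", "matrix factorization", "low-rank approximation"]),
    ("Author B", ["svd", "low-rank approximation", "singular value"]),
    ("Author C", ["k-means", "cluster analysis", "image clustering"]),
    ("Author D", ["cluster analysis", "k-means", "fuzzy clustering"]),
]


def tfidf_matrix(docs):
    """Rows are papers, columns are keywords (assumed smoothed IDF, L2-normalised rows)."""
    vocab = sorted({kw for kws in docs for kw in kws})
    col = {kw: j for j, kw in enumerate(vocab)}
    n = len(docs)
    df = Counter(kw for kws in docs for kw in set(kws))  # document frequency
    X = np.zeros((n, len(vocab)))
    for i, kws in enumerate(docs):
        for kw, count in Counter(kws).items():
            idf = math.log((1 + n) / (1 + df[kw])) + 1.0
            X[i, col[kw]] = count * idf
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.where(norms == 0, 1.0, norms), vocab


def truncated_svd(X, k):
    """Project each row onto the top-k singular directions (U_k * S_k)."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k]


def kmeans(Z, k, n_iter=50):
    """Lloyd's algorithm with farthest-point initialisation (an assumption)."""
    centers = [Z[0]]
    while len(centers) < k:
        d = np.min([np.linalg.norm(Z - c, axis=1) for c in centers], axis=0)
        centers.append(Z[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = np.linalg.norm(Z[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            Z[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


X, vocab = tfidf_matrix([kws for _, kws in papers])
Z = truncated_svd(X, k=2)   # reduced representation of each paper
labels = kmeans(Z, k=2)     # cluster index per paper

# Authors whose papers fall in the same cluster are grouped together.
groups = {}
for (author, _), label in zip(papers, labels):
    groups.setdefault(int(label), set()).add(author)
print(groups)
```

On this toy input the two SVD-themed papers share no keywords with the two clustering-themed papers, so their authors end up in separate groups. On real data the number of retained singular values and the number of clusters would need to be chosen for the corpus at hand.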
