A solution to support colorectal cancer diagnosis based on gut bacterial data selected with diverse counterfactual explanations
Abstract
Colorectal cancer endangers human health if it is not detected and treated early. Analysis of microbial data from the gut environment can support diagnosis of this disease. This article proposes selecting microbial features with Diverse Counterfactual Explanations (DCE), a method for explaining the outputs of artificial intelligence algorithms. Classification with classical machine learning algorithms such as Random Forest and Gradient Boosting, using fewer than 3% of the original features, achieves AUC values of 0.7759, 0.8055, 0.8093, and 0.7923 on the Austrian, American, Chinese, and German-French cohorts, respectively. These results are better than those obtained on the original datasets, which contain more than 1900 microbial species.
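The abstract does not say how the counterfactual explanations are turned into a feature selection, so the following is only a minimal sketch of one plausible reading: rank each microbial feature by how often its value has to change across the counterfactuals generated for a set of samples, then keep a small top fraction. The function names, the taxon names, the abundance values, and the 50% cut-off are all hypothetical and chosen for illustration; in practice the counterfactuals would come from a counterfactual generator applied to a trained classifier rather than being written by hand.

```python
from collections import Counter
from typing import Dict, List, Sequence

# A sample maps a taxon name to its relative abundance (hypothetical layout).
Sample = Dict[str, float]


def change_frequency(
    queries: Sequence[Sample],
    counterfactuals: Sequence[Sequence[Sample]],
    tol: float = 1e-9,
) -> Dict[str, float]:
    """Fraction of counterfactuals in which each feature differs from its query."""
    names = list(queries[0]) if queries else []
    changed: Counter = Counter()
    total = 0
    for query, cfs in zip(queries, counterfactuals):
        for cf in cfs:
            total += 1
            for name in names:
                if abs(cf[name] - query[name]) > tol:
                    changed[name] += 1
    if total == 0:
        return {name: 0.0 for name in names}
    return {name: changed[name] / total for name in names}


def select_top_features(scores: Dict[str, float], max_fraction: float) -> List[str]:
    """Keep the highest-scoring features, at most max_fraction of them (at least one)."""
    k = max(1, int(len(scores) * max_fraction))
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda name: (-scores[name], name))
    return ranked[:k]


# Toy data (assumed, not from the article): two samples, two counterfactuals each.
queries = [
    {"taxon_a": 0.10, "taxon_b": 0.30, "taxon_c": 0.05, "taxon_d": 0.55},
    {"taxon_a": 0.20, "taxon_b": 0.25, "taxon_c": 0.15, "taxon_d": 0.40},
]
counterfactuals = [
    [
        {"taxon_a": 0.02, "taxon_b": 0.30, "taxon_c": 0.05, "taxon_d": 0.55},
        {"taxon_a": 0.01, "taxon_b": 0.30, "taxon_c": 0.12, "taxon_d": 0.55},
    ],
    [
        {"taxon_a": 0.05, "taxon_b": 0.25, "taxon_c": 0.15, "taxon_d": 0.40},
        {"taxon_a": 0.20, "taxon_b": 0.25, "taxon_c": 0.02, "taxon_d": 0.40},
    ],
]

scores = change_frequency(queries, counterfactuals)
selected = select_top_features(scores, max_fraction=0.5)
print(scores)
print(selected)
```

With a cut-off of 3% instead of 50%, the same selection step applied to roughly 1900 species would keep at most 57 features, which matches the order of magnitude reported in the abstract. The selected features would then be passed to a classifier such as Random Forest or Gradient Boosting and evaluated with AUC.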
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.