
    Construction of an interpretable risk prediction model for colorectal sessile serrated lesions based on machine learning algorithms

    Abstract:

    Objective: To explore the risk factors for colorectal sessile serrated lesions (SSL) and to construct an interpretable machine learning prediction model.

    Methods: Patients who underwent colonoscopy at the Digestive Endoscopy Center of Xuzhou Municipal Hospital Affiliated to Xuzhou Medical University between January 2019 and October 2024 were enrolled, and their clinical data and laboratory test results were collected. Based on colonoscopy findings and pathology reports, patients were assigned to an SSL group or a control group (normal colonoscopy findings and no polyps). Univariate analysis was used to identify factors associated with SSL, and LASSO regression was used to select the predictive variables. Patients were randomly split into a training set and a validation set at a 7:3 ratio, and four machine learning models were built in Python: logistic regression (LR), support vector machine (SVM), random forest (RF), and extreme gradient boosting (XGBoost). Model performance was evaluated with receiver operating characteristic (ROC) curves. The LR model was interpreted with Shapley additive explanations (SHAP), and SHAP histograms and SHAP summary plots were generated.

    Results: A total of 628 patients were included (329 in the SSL group and 299 in the control group). In univariate analysis, age, sex, body mass index (BMI), smoking history, alcohol consumption history, history of hypertension, white blood cell count, neutrophil count, neutrophil/lymphocyte ratio (NLR), monocyte count, red blood cell count, hemoglobin (Hb), fasting blood glucose, total cholesterol (TC), triglycerides (TG), TC/high-density lipoprotein (HDL) ratio, TG/HDL ratio, and triglyceride-glucose (TyG) index were associated with SSL (P<0.05). LASSO regression retained 14 key predictors: age, sex, BMI, smoking history, alcohol consumption history, hypertension, white blood cell count, NLR, monocyte count, Hb, fasting blood glucose, TC, TC/HDL ratio, and TyG index. In ROC curve analysis, the LR model achieved an AUC of 0.79 for predicting SSL, higher than the other three models, so it was selected for interpretation. The SHAP histogram ranked the most important predictors as age, Hb, sex, TyG index, and smoking history. Older age, higher Hb, higher TyG index, male sex, and a history of smoking had the largest influence on the predicted risk of SSL.

    Conclusions: The interpretable LR model based on machine learning algorithms has high predictive value for SSL.
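    The two evaluation steps named in the abstract, the ROC AUC and the SHAP-based ranking of an LR model, can be illustrated with a minimal standard-library sketch. It is not the authors' code: the abstract states only that Python was used, so the coefficients, intercept, feature subset, and patient rows below are made-up toy values, and the function names are hypothetical. For a linear model with features treated as independent, the SHAP value of feature i on the log-odds scale is coefficient_i × (x_i − mean of x_i), and the mean absolute SHAP value per feature gives the kind of ranking shown in a SHAP bar chart.

```python
from statistics import mean


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive case outscores a
    random negative case (ties count as 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def linear_shap(coefs, row, background_means):
    """SHAP values of a linear model on the log-odds scale, assuming
    independent features: coef * (value - background mean)."""
    return [b * (x - m) for b, x, m in zip(coefs, row, background_means)]


def shap_importance(coefs, rows, names):
    """Rank features by mean absolute SHAP value over all rows."""
    means = [mean(col) for col in zip(*rows)]
    per_row = [linear_shap(coefs, r, means) for r in rows]
    mean_abs = [mean(abs(v) for v in col) for col in zip(*per_row)]
    return sorted(zip(names, mean_abs), key=lambda t: t[1], reverse=True)


# Hypothetical toy inputs: three of the reported predictors, with
# made-up coefficients and patient values (not taken from the study).
names = ["age", "hb", "tyg"]
coefs = [0.05, 0.02, 0.8]
intercept = -12.0
rows = [[45, 130, 8.2], [62, 150, 9.1], [58, 142, 8.8], [39, 125, 8.0]]

# Log-odds score of each toy patient under the assumed LR model.
scores = [intercept + sum(b * x for b, x in zip(coefs, r)) for r in rows]

# Feature ranking by mean |SHAP|, the quantity a SHAP bar chart displays.
ranking = shap_importance(coefs, rows, names)

# Separate small AUC example with explicit labels and scores.
example_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

    In practice these quantities would more likely come from an established library; the sketch only makes explicit what an AUC of 0.79 and a SHAP importance ranking measure.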

       
