Abstract:
Objective To explore the risk factors for the occurrence of colorectal sessile serrated lesions (SSL) and to construct an interpretable machine learning model.
Methods Patients who underwent colonoscopy at the Digestive Endoscopy Center of Xuzhou Medical University Affiliated Xuzhou Municipal Hospital from January 2019 to October 2024 were enrolled, and their clinical data and laboratory test results were collected. Based on colonoscopy results and pathology reports, the patients were divided into an SSL group and a control group (patients with normal colonoscopy findings and no polyps). Univariate analysis was used to identify risk factors for the occurrence of SSL, and least absolute shrinkage and selection operator (LASSO) regression was used to select the predictive variables. The patients were randomly divided into a training set and a validation set in a 7:3 ratio, and four machine learning models (logistic regression, LR; support vector machine, SVM; random forest, RF; and extreme gradient boosting, XGBoost) were built in Python. The performance of the four models was evaluated with receiver operating characteristic (ROC) curves. The LR model was interpreted using Shapley additive explanations (SHAP), and SHAP histograms and summary plots were generated.
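The random 7:3 split and the ROC-based evaluation described in the Methods can be illustrated with a minimal sketch that uses only the Python standard library. The function names `split_train_validation` and `roc_auc`, the fixed seed, and the toy labels and scores are hypothetical and are not taken from the study; the abstract does not state which libraries or settings the authors used.

```python
import random


def split_train_validation(records, train_fraction=0.7, seed=0):
    """Shuffle the records and split them into training and validation lists.

    The seed is an arbitrary choice for reproducibility, not a study setting.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive case receives a
    higher score than a randomly chosen negative case (ties count as half).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: 628 placeholder patient IDs split 7:3.
train_ids, validation_ids = split_train_validation(range(628))

# Toy example: hypothetical predicted probabilities for four cases.
toy_auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

The pairwise definition is quadratic in the number of cases, which is acceptable at this sample size; a rank-based implementation would give the same value more efficiently.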
Results A total of 628 patients were included in the study, of whom 329 were in the SSL group and 299 were in the control group. Univariate analysis showed that age, sex, body mass index (BMI), smoking history, alcohol consumption history, hypertension history, white blood cell count, neutrophil count, neutrophil/lymphocyte ratio (NLR), monocyte count, red blood cell count, hemoglobin (Hb), fasting blood glucose, total cholesterol (TC), triglycerides (TG), TC/high-density lipoprotein (HDL) ratio, TG/HDL ratio, and triglyceride-glucose (TyG) index were associated with the occurrence of SSL (P<0.05). LASSO regression identified 14 key predictive factors: age, sex, BMI, smoking history, alcohol consumption history, hypertension history, white blood cell count, NLR, monocyte count, Hb, fasting blood glucose, TC, TC/HDL ratio, and TyG index. ROC curve analysis showed that the area under the curve (AUC) of the LR model for predicting the occurrence of SSL was 0.79, higher than that of the other three machine learning models, so the LR model was selected for further interpretation. The SHAP histogram showed that the most important predictive variables, in descending order, were age, Hb, sex, TyG index, and smoking history. Increasing age, elevated Hb, increased TyG index, male sex, and smoking history each had a significant positive effect on the prediction of SSL.
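To show how SHAP attributes an LR prediction to individual variables, the sketch below uses the closed form for a linear model under the assumption of independent features: in log-odds units, the contribution of feature i is its coefficient multiplied by the deviation of the patient's value from the feature mean. The coefficients, intercept, feature means, and patient values are hypothetical and are not results from the study; whether the authors used this linear explainer or another SHAP variant is not stated in the abstract.

```python
def log_odds(weights, intercept, x):
    """Linear predictor (log-odds) of a logistic regression model."""
    return intercept + sum(w * xi for w, xi in zip(weights, x))


def linear_shap_values(weights, feature_means, x):
    """SHAP values in log-odds units for a linear model.

    Assumes independent features, so each value is w_i * (x_i - mean_i).
    """
    return [w * (xi - m) for w, m, xi in zip(weights, feature_means, x)]


# Hypothetical model over three of the reported predictors.
feature_names = ["age", "Hb", "TyG index"]
weights = [0.04, 0.02, 0.5]        # illustrative coefficients only
intercept = -9.0                   # illustrative intercept only
feature_means = [55.0, 140.0, 8.5] # illustrative cohort means only

patient = [65.0, 150.0, 9.0]       # illustrative patient values only

shap_values = linear_shap_values(weights, feature_means, patient)
base_value = log_odds(weights, intercept, feature_means)

# Additivity: the base value plus all SHAP values recovers the prediction.
reconstructed = base_value + sum(shap_values)
actual = log_odds(weights, intercept, patient)

# Rank features by absolute contribution, as a SHAP importance plot does
# for a single patient (cohort plots average this over all patients).
ranking = sorted(zip(feature_names, shap_values), key=lambda t: -abs(t[1]))
```

A positive SHAP value pushes the predicted log-odds of SSL above the cohort baseline, which is the sense in which a variable has a positive effect on the prediction.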
Conclusions The interpretable LR model based on machine learning algorithms has high predictive value for the occurrence of SSL.