Construction of a risk prediction model for stroke based on random forest algorithm in Nanjing, China

万红燕; 刘婕; 郝舒欣; 刘悦

doi:10.13421/j.cnki.hjwsxzz.2024.07.008

Abstract:
Objective To construct a stroke risk prediction model based on the random forest algorithm using hospital diagnosis and treatment data from patients with stroke in Nanjing, China, and to identify risk factors for stroke, providing a reference for early intervention and clinical treatment.
Methods A total of 5 357 inpatients at a grade A tertiary hospital in Nanjing from May 2014 to December 2022 were included: 3 104 patients with a primary discharge diagnosis of stroke from the department of neurology (stroke group) and 2 253 non-stroke patients treated at the same hospital during the same period (control group). The cases were randomly divided at an 8∶2 ratio into a training set (4 285 cases) and a test set (1 072 cases). Demographic data, clinical laboratory indicators, meteorological data, and environmental data were incorporated to identify risk factors for stroke. The model was optimized through 5-fold cross-validation and parameter tuning. Its predictive performance was assessed by accuracy, precision, recall, F1 score, and the area under the curve (AUC). Shapley additive explanations (SHAP) values were used to quantify and attribute the contribution of each model feature.
Results The top ten risk factors for stroke were, in order, systolic blood pressure, age, glucose, neutrophil count, albumin, potassium, diastolic blood pressure, total protein, total cholesterol, and apolipoprotein A1. The performance of the random forest-based stroke risk prediction model was as follows: accuracy, 0.78; precision, 0.76; F1 score, 0.72; recall, 0.69; and AUC, 0.85.
Conclusion The random forest-based prediction model can assist in the early identification of patients with stroke and provide key information for timely intervention, and it has good application value.
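
The evaluation protocol described in Methods (8∶2 split, 5-fold cross-validation, and the reported metrics) can be sketched as follows. This is a minimal, standard-library-only illustration and not the authors' code: the random forest itself is omitted, all function names (`split_indices`, `k_fold_indices`, `classification_metrics`, `auc`) are hypothetical, and rounding the test set size up is an assumption chosen because it reproduces the reported set sizes.

```python
import math
import random


def split_indices(n, test_fraction=0.2, seed=0):
    """Shuffle 0..n-1 and return (train, test) index lists.

    Assumption: the test size is rounded up, which reproduces the
    reported 4 285 / 1 072 split for n = 5 357.
    """
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold_indices(indices, k=5):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        train = [j for i, fold in enumerate(folds) if i != held_out for j in fold]
        yield train, folds[held_out]


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = stroke)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def auc(y_true, scores):
    """AUC as the probability that a random positive case receives a higher
    score than a random negative case (ties count as 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    train, test = split_indices(5357)
    print("train/test sizes:", len(train), len(test))
    for fold, (tr, val) in enumerate(k_fold_indices(train, k=5), start=1):
        print("fold", fold, "train:", len(tr), "validation:", len(val))
```

As a consistency check on the reported results, the harmonic mean of the reported precision (0.76) and recall (0.69) is 2 × 0.76 × 0.69 / (0.76 + 0.69) ≈ 0.72, which matches the reported F1 score.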
