前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >临床预测模型概述6-统计模型实操-单/多因素Logistic回归

临床预测模型概述6-统计模型实操-单/多因素Logistic回归

原创
作者头像
凑齐六个字吧
发布2024-08-07 22:55:55
1190
发布2024-08-07 22:55:55
举报
文章被收录于专栏:临床预测模型

既往推文已经介绍过了logistic,cox,lasso回归(https://mp.weixin.qq.com/s/pXRZ1rYUr3lwH5OlDeB0_Q),接下来将重点进行代码的实操。

首先进行logistic模型的实际操练,简单回顾一下二项logistic回归(因为还有多项的hhh),其是指研究二分类结果与一些影响因素之间关系的分析方法。在各种临床/基础数据分析中,经常需要分析疾病/状态与各种影响/危险因素之间的定量关系,如鼻咽癌的发生于EB病毒定量、年龄、不同饮食习惯等因素之间的关系,而结局变量通常是二分类的,因此这种方法是研究者必须学会的方法之一。

logstic分析流程
1、输入数据
代码语言:javascript
复制
rm(list = ls())
load("TCGA-LIHC_sur_model.Rdata")
# check 数据
head(exprSet)[1:4,1:4]
#            TCGA-FV-A495-01A TCGA-ED-A7PZ-01A TCGA-ED-A97K-01A TCGA-ED-A7PX-01A
# WASH7P             1.913776        1.2986076         1.967382         1.586170
# AL627309.6         3.129116        0.5606928         3.831265         1.363539
# WASH9P             2.490476        2.8140204         2.960338         2.106464
# MTND1P23           2.773335        3.4114257         2.591028         3.353850
head(meta)
#                                ID OS    OS.time  race age gender stage  T  N M
# TCGA-FV-A495-01A TCGA-FV-A495-01A  0 0.03333333 WHITE  51 FEMALE    II  2 NA 0
# TCGA-ED-A7PZ-01A TCGA-ED-A7PZ-01A  0 0.20000000 ASIAN  61   MALE    II  2 NA 0
# TCGA-ED-A97K-01A TCGA-ED-A97K-01A  0 0.20000000 ASIAN  54   MALE   III  3  0 0
# TCGA-ED-A7PX-01A TCGA-ED-A7PX-01A  0 0.20000000 ASIAN  48 FEMALE    II  2 NA 0
# TCGA-BC-A3KF-01A TCGA-BC-A3KF-01A  0 0.26666667 WHITE  66 FEMALE     I  1 NA 0
# TCGA-DD-A4NR-01A TCGA-DD-A4NR-01A  1 0.30000000 WHITE  85 FEMALE     I  1  0 0
identical(rownames(meta),colnames(exprSet))
# [1] TRUE

str(meta)
# 'data.frame': 365 obs. of  10 variables:
#  $ ID     : chr  "TCGA-FV-A495-01A" "TCGA-ED-A7PZ-01A" "TCGA-ED-A97K-01A" "TCGA-ED-A7PX-01A" ...
#  $ OS     : int  0 0 0 0 0 1 0 0 0 1 ...
#  $ OS.time: num  0.0333 0.2 0.2 0.2 0.2667 ...
#  $ race   : chr  "WHITE" "ASIAN" "ASIAN" "ASIAN" ...
#  $ age    : int  51 61 54 48 66 85 70 75 84 75 ...
#  $ gender : chr  "FEMALE" "MALE" "MALE" "FEMALE" ...
#  $ stage  : chr  "II" "II" "III" "II" ...
#  $ T      : num  2 2 3 2 1 1 1 2 1 2 ...
#  $ N      : num  NA NA 0 NA NA 0 0 NA NA 0 ...
#  $ M      : num  0 0 0 0 0 0 0 0 0 0 ...
table(meta$race)
# AMERICAN INDIAN OR ALASKA NATIVE                            ASIAN        BLACK OR AFRICAN AMERICAN                            WHITE 
#                                1                              155                               17                              182
table(meta$age)
# 16 17 18 20 23 24 25 29 31 32 33 34 35 36 37 38 39 40 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 
#  1  1  1  3  2  2  1  2  1  2  1  1  3  1  1  5  2  2  2  5  3  7  4  3  9  3  8 15  7  7  9  8  8  5 16 14  8 15  8  5 15 11 16  9 
# 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 90 
# 17 13 10  6  9 10  7  9  7  5  3  2  5  4  1  1  1  2  1 
table(meta$gender)
# FEMALE   MALE 
#    119    246 
table(meta$stage)
#   I  II III  IV 
# 170  84  83   4
table(meta$T)
#   1   2   3   4 
# 180  91  78  13 
table(meta$N)
#   0   1 
# 248   4
table(meta$M)
#   0   1 
# 263   3

采用的是TCGA-LIHC的数据,其中自变量分别为race,age,gender,stage,T,N,M。本次实操暂时只用到meta中的数据。

其中ID,race,gender,stage是字符型数据,OS,age是整数型数据,OS.time,T,N,M是数值型数据。整数型数据和数值型数据的差别就是一种是整数,另一种是可以有小数。

2、check数据

在这个数据中,我们的因变量/结局变量是OS,其中0代表存活,1代表死亡。此外,我们也需要对自变量进行处理。

代码语言:javascript
复制
meta$race <- factor(meta$race,levels = c("ASIAN","AMERICAN INDIAN OR ALASKA NATIVE","BLACK OR AFRICAN AMERICAN","WHITE"))
meta$stage <- factor(meta$stage,levels = c("I","II","III","IV"))
meta$T <- factor(meta$T,levels = c("1","2","3","4"))
meta$age <- ifelse(meta$age > 60,">60","<=60")
meta$age <- factor(meta$age,levels = c("<=60",">60"))
str(meta)
# 'data.frame': 365 obs. of  10 variables:
#  $ ID     : chr  "TCGA-FV-A495-01A" "TCGA-ED-A7PZ-01A" "TCGA-ED-A97K-01A" "TCGA-ED-A7PX-01A" ...
#  $ OS     : int  0 0 0 0 0 1 0 0 0 1 ...
#  $ OS.time: num  0.0333 0.2 0.2 0.2 0.2667 ...
#  $ race   : Factor w/ 4 levels "ASIAN","AMERICAN INDIAN OR ALASKA NATIVE",..: 4 1 1 1 4 4 1 4 4 4 ...
#  $ age    : chr  "<=60" ">60" "<=60" "<=60" ...
#  $ gender : chr  "FEMALE" "MALE" "MALE" "FEMALE" ...
#  $ stage  : Factor w/ 4 levels "I","II","III",..: 2 2 3 2 1 1 1 2 1 2 ...
#  $ T      : Factor w/ 4 levels "1","2","3","4": 2 2 3 2 1 1 1 2 1 2 ...
#  $ N      : num  NA NA 0 NA NA 0 0 NA NA 0 ...
#  $ M      : num  0 0 0 0 0 0 0 0 0 0 ...

在R语言中,因子化的影响因素默认是以levels排序第一的值作为参考/哑变量。请注意这里对于有一些参数并没有进行因子化,以及存在NA值,这些情况会在下边的探索中解释。

3、构建模型
代码语言:javascript
复制
fit <- glm(OS~race+age+gender+stage+T+N+M,
           family=binomial(link = "logit"),data = meta) # link默认使用logit 
summary(fit)

# Call:
# glm(formula = OS ~ race + age + gender + stage + T + N + M, family = binomial(link = "logit"), 
#     data = meta)
# 
# Coefficients:
#                                       Estimate Std. Error z value Pr(>|z|)    
# (Intercept)                            -1.3213     0.3959  -3.337 0.000846 ***
# raceAMERICAN INDIAN OR ALASKA NATIVE  -15.3493  2399.5447  -0.006 0.994896    
# raceBLACK OR AFRICAN AMERICAN           1.1023     1.0382   1.062 0.288346    
# raceWHITE                               0.2644     0.3474   0.761 0.446538    
# age>60                                  0.3457     0.3261   1.060 0.289067    
# genderMALE                             -0.3704     0.3362  -1.102 0.270534    
# stageII                                33.6071  3393.4687   0.010 0.992098    
# stageIII                                0.3615  3393.4687   0.000 0.999915    
# stageIV                               -32.4250  3393.4687  -0.010 0.992376    
# T2                                    -33.1321  3393.4687  -0.010 0.992210    
# T3                                      0.9805  3393.4687   0.000 0.999769    
# T4                                      1.3966  3393.4688   0.000 0.999672    
# N                                      16.9158  2399.5447   0.007 0.994375    
# M                                      48.8854  3635.5707   0.013 0.989272    
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# (Dispersion parameter for binomial family taken to be 1)
# 
#     Null deviance: 288.04  on 226  degrees of freedom
# Residual deviance: 252.88  on 213  degrees of freedom
#   (138 observations deleted due to missingness)
# AIC: 280.88
# 
# Number of Fisher Scoring iterations: 15

Estimate(估计值):每个自变量的回归系数。它表示自变量每单位变化对因变量的对数几率(log-odds)造成的变化。正值表示增加事件发生的概率,负值表示减少事件发生的概率。

Std. Error(标准误差):回归系数的标准误,估计值的不确定性度量。标准误差越大,估计值的不确定性越大。

z value(z值):估计值除以标准误差,用于计算p值。z值越大(绝对值),表明自变量对因变量的影响越显著,Z值的平方就是wald值。

Pr(>|z|)(p值):z值对应的p值,用于检验系数是否显著。如果p值小于显著性水平(通常为0.05或0.01),则认为该系数显著。

β值:这里就是Estimate(估计值)。

● **OR(比值比,odds ratio)**:这里没有显示,计算方式是OR = exp(β/Estimate),其取值范围是0至正无穷。同时在分析的时候我们经常会提到HR(风险比,hazard ratio),这个值通常应用于生存分析模型中。

Null deviance和Residual devianve: 是指无效偏差(零偏差)和残差偏差,前者是指只有截距项(没有任何自变量)时模型的偏差,这个模型假设所有的观测值都预测为因变量的平均值(对于分类问题来说,就是预测为最常见的类别),后者是指包括自变量在内的模型的偏差。它衡量的是该模型相对于最优模型的拟合程度。通过比较 Null deviance 和 Residual deviance,可以评估引入自变量后模型的改进情况。如果 Residual deviance 明显小于 Null deviance,说明自变量在解释因变量方面起到了重要作用,所以这两个值的差异越大越好。

AIC(Akaike Information Criterion,赤池信息准则) 是用于模型选择的一个统计量。它提供了一种在模型复杂度和拟合优度之间进行权衡的方法。这个值需要在不同模型情况下进行比较,AIC值越低则表示模型拟合越好。

同时也应观察一下自变量的情况,可以看到之前因子化后的自变量会按照levels的顺序进行哑变量设置,比如age中的<=60的信息就没有出现了,stage中的I期的信息也没有出现了。如果没有进行因子化的自变量则会被自动设置成连续变量,比如这里的N和M自变量,因此我们在分析之前必须对自变量进行因子化处理

此外,这里还有一句提示“138 observations deleted due to missingness”,这是告诉我们有138个NA数据,在运行代码的时候自动去除了。由此可知,二项logistic回归整体分析的时候是可以不处理NA的,当然如果从数据分析的角度来说,可能最好还是需要选择删除或者插补数据之后再进行分析,后面进行多因素logstic分析时则不能存在NA值。

4、提取模型关键信息
代码语言:javascript
复制
# β/Estimate值
coef(fit) #或者 coefficients(fit)
                         # (Intercept) raceAMERICAN INDIAN OR ALASKA NATIVE        raceBLACK OR AFRICAN AMERICAN 
                         #  -1.3212525                          -15.3493043                            1.1023180 
                         #   raceWHITE                               age>60                           genderMALE 
                         #   0.2644191                            0.3456869                           -0.3704287 
                         #     stageII                             stageIII                              stageIV 
                         #  33.6070538                            0.3614614                          -32.4249885 
                         #          T2                                   T3                                   T4 
                         # -33.1321368                            0.9805348                            1.3965597 
                         #           N                                    M 
                         #  16.9157534                           48.8853540 
# β值的95%可信区间
# confint(fit)
#                                              2.5 %       97.5 %
# (Intercept)                             -2.1263379   -0.5673383
# raceAMERICAN INDIAN OR ALASKA NATIVE            NA  473.4443794
# raceBLACK OR AFRICAN AMERICAN           -1.0795219    3.2830430
# raceWHITE                               -0.4223957    0.9442724
# age>60                                  -0.2935513    0.9890797
# genderMALE                              -1.0272369    0.1691020
# stageII                               -498.9459572           NA
# stageIII                               -84.3987156   77.0091134
# stageIV                              -1243.3914736   95.6290959
# T2                                              NA  499.6907267
# T3                                     -83.7797130   77.6282226
# T4                                     -75.2487745   86.1520115
# N                                     -471.9392411           NA
# M                                      -83.1626544 1374.4319782

# OR值
exp(coef(fit))
                        # (Intercept) raceAMERICAN INDIAN OR ALASKA NATIVE        raceBLACK OR AFRICAN AMERICAN 
                        # 2.668009e-01                         2.157157e-07                         3.011138e+00 
                        #    raceWHITE                               age>60                           genderMALE 
                        # 1.302674e+00                         1.412960e+00                         6.904383e-01 
                        #      stageII                             stageIII                              stageIV 
                        # 3.938747e+14                         1.435426e+00                         8.279544e-15 
                        #           T2                                   T3                                   T4 
                        # 4.082214e-15                         2.665881e+00                         4.041273e+00 
                        #            N                                    M 
                        # 2.220334e+07                         1.700746e+21 

# OR值的95%可信区间
exp(confint(fit))[1:4,1:2]
#                                         2.5 %        97.5 %
# (Intercept)                          0.1192733  5.670327e-01
# raceAMERICAN INDIAN OR ALASKA NATIVE        NA 4.114163e+205
# raceBLACK OR AFRICAN AMERICAN        0.3397579  2.665677e+01
# raceWHITE                            0.6554746  2.570942e+00

# 其他数据的提取
# wald值
summary(fit)$coefficients[,3]^2
# P值
summary(fit)$coefficients[,4]
# 预测值
fitted(fit)
predict(fit,type = "response")
predict(fit) # 如果不适用type则默认返回logit(P)
# 偏差
deviance(fit)
# 残差自由度
df.residual(fit)
# 所有的数据提取其实是基于summary(fit)中的coefficients表格

OR值的解读,比如关于T2这个数据, 相比于T1,T2的患者出现死亡的风险是4.082214e-15,P值没有统计学意义。

5、使用不同的方法来筛选变量
代码语言:javascript
复制
# 单因素logstic数据手动提取
# 构建自定义函数
uni_model<-function(x){
  fml<-as.formula(paste0("OS~",x))  
  fit1<-glm(fml,data = meta,family = binomial)
  fit2<-summary(fit1)

  #计算所要的指标
  OR<-round(exp(coef(fit1)),2)
  SE<-round(fit2$coefficients[,2],3)  
  CI2.5<-round(exp(coef(fit1)-1.96*SE),2)
  CI97.5<-round(exp(coef(fit1)+1.96*SE),2)
  CI<-paste0(CI2.5,'-',CI97.5)
  B<-round(fit2$coefficients[,1],3)
  Z<-round(fit2$coefficients[,3],3)
  P<-round(fit2$coefficients[,4],3)

  #将计算出来的指标制作为数据框
  uni_model<-data.frame('characteristics'=x,'B'=B,
                        'SE'=SE,'OR'=OR,'CI'=CI,
                        'Z' =Z,'P'=P)[-1,]

  return(uni_model)
}

#指定参与分析的自变量
variable<-colnames(meta)[c(4:10)]  
variable

#运行上面自定义批量执行函数
uni_res<-lapply(variable,uni_model)
uni_res

# 使用plyr包
# plyr 是一个用于数据操作和转换的 R 包。它提供了一系列函数,用于对数据框、列表和其他数据结构进行分组操作和汇总计算。
library(plyr)
#生成单变量分析的结果
uni_res<-ldply(uni_res,data.frame)
uni_res
#    characteristics       B      SE          OR         CI      Z     P
# 1             race -12.641 535.411        0.00      0-Inf -0.024 0.981
# 2             race   0.569   0.524        1.77  0.63-4.93  1.085 0.278
# 3             race   0.570   0.233        1.77  1.12-2.79  2.444 0.015
# 4              age   0.340   0.220        1.41  0.91-2.16  1.547 0.122
# 5           gender  -0.442   0.230        0.64  0.41-1.01 -1.924 0.054
# 6            stage   0.281   0.295        1.32  0.74-2.36  0.952 0.341
# 7            stage   1.252   0.282        3.50  2.01-6.08  4.436 0.000
# 8            stage   2.182   1.168        8.86  0.9-87.43  1.868 0.062
# 9                T   0.331   0.280        1.39   0.8-2.41  1.180 0.238
# 10               T   1.246   0.284        3.48  1.99-6.07  4.388 0.000
# 11               T   2.244   0.680        9.43 2.49-35.77  3.301 0.001
# 12               N   0.669   1.009        1.95 0.27-14.11  0.663 0.507
# 13               M  16.305 840.274 12057537.52      0-Inf  0.019 0.985

# autoReg方式
library(autoReg)
autoReg::gaze(OS~race + age + gender + stage + T + N + M,data=meta) 
# ————————————————————————————————————————————————————————————————————————————————
#  Dependent:OS               levels                    0           1         p   
#      (N)                                           (N=234)     (N=131)          
# ————————————————————————————————————————————————————————————————————————————————
# race          ASIAN                               111 (48.5%)  44 (34.9%)   .078 
#               AMERICAN INDIAN OR ALASKA NATIVE       1 (0.4%)      0 (0%)        
#               BLACK OR AFRICAN AMERICAN             10 (4.4%)    7 (5.6%)        
#               WHITE                               107 (46.7%)  75 (59.5%)        
# age           <=60                                118 (50.4%)    55 (42%)   .150 
#               >60                                 116 (49.6%)    76 (58%)        
# gender        FEMALE                               68 (29.1%)  51 (38.9%)   .070 
#               MALE                                166 (70.9%)  80 (61.1%)        
# stage         I                                   127 (56.7%)  43 (36.8%)  <.001 
#               II                                   58 (25.9%)  26 (22.2%)        
#               III                                    38 (17%)  45 (38.5%)        
#               IV                                     1 (0.4%)    3 (2.6%)        
# T             1                                   133 (57.3%)  47 (36.2%)  <.001 
#               2                                    61 (26.3%)  30 (23.1%)        
#               3                                    35 (15.1%)  43 (33.1%)        
#               4                                      3 (1.3%)   10 (7.7%)        
# N             Mean ± SD                             0.0 ± 0.1   0.0 ± 0.2   .544 
# M             Mean ± SD                             0.0 ± 0.0   0.0 ± 0.2   .083 
# ————————————————————————————————————————————————————————————————————————————————
autoReg(fit, uni=TRUE, multi = TRUE, final=TRUE)
# ————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————
# Dependent: OS                                        0 (N=152)    1 (N=75)                OR (univariable)             OR (multivariable)                 OR (final) 
# ————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————
# race                                        ASIAN  103 (67.8%)    42 (56%)                                                                                           
#                  AMERICAN INDIAN OR ALASKA NATIVE     1 (0.7%)      0 (0%)         0.00 (0.00-Inf, p=.988)        0.00 (0.00-Inf, p=.992)                            
#                         BLACK OR AFRICAN AMERICAN     2 (1.3%)    2 (2.7%)       2.45 (0.33-17.99, p=.378)      3.42 (0.45-25.80, p=.232)                            
#                                             WHITE   46 (30.3%)  31 (41.3%)        1.65 (0.93-2.95, p=.089)       1.43 (0.75-2.75, p=.279)                            
# age                                          <=60   88 (57.9%)  37 (49.3%)                                                                                           
#                                               >60   64 (42.1%)  38 (50.7%)        1.41 (0.81-2.46, p=.223)                                                           
# gender                                     FEMALE   42 (27.6%)    30 (40%)                                                                                           
#                                              MALE  110 (72.4%)    45 (60%)        0.57 (0.32-1.03, p=.061)       0.70 (0.36-1.36, p=.295)   0.62 (0.34-1.16, p=.134) 
# stage                                           I   86 (56.6%)    24 (32%)                                                                                           
#                                                II     35 (23%)    15 (20%)        1.54 (0.72-3.27, p=.266)  5799088.68 (0.00-Inf, p=.992)                            
#                                               III   30 (19.7%)    33 (44%)        3.94 (2.02-7.70, p<.001)      1.50 (0.06-36.67, p=.805)                            
#                                                IV     1 (0.7%)      3 (4%)     10.75 (1.07-108.08, p=.044)      3.45 (0.14-85.32, p=.450)                            
# T                                               1   87 (57.2%)  25 (33.3%)                                                                                           
#                                                 2   36 (23.7%)    15 (20%)        1.45 (0.69-3.07, p=.331)        0.00 (0.00-Inf, p=.992)   1.46 (0.69-3.09, p=.328) 
#                                                 3   26 (17.1%)  28 (37.3%)        3.75 (1.87-7.51, p<.001)      2.58 (0.11-62.85, p=.561)   3.73 (1.86-7.51, p<.001) 
#                                                 4       3 (2%)    7 (9.3%)       8.12 (1.96-33.72, p=.004)     4.40 (0.16-124.34, p=.384)  7.20 (1.71-30.30, p=.007) 
# N                                       Mean ± SD    0.0 ± 0.1   0.0 ± 0.2       2.05 (0.28-14.88, p=.476)                                                           
# M                                       Mean ± SD    0.0 ± 0.0   0.0 ± 0.2  12155382.71 (0.00-Inf, p=.985)                                                           
# ——————————————————————————————————————————————————————————————————————————————————————————————————

# boom方式
library(broom)
fit <- glm(OS~race + age + gender + stage + T + N + M,family = binomial,data=meta)
tidy(fit)  # 将模型结果转化为整洁的数据框
# # A tibble: 14 × 5
#    term                                 estimate std.error statistic  p.value
#    <chr>                                   <dbl>     <dbl>     <dbl>    <dbl>
#  1 (Intercept)                            -1.32      0.396 -3.34     0.000846
#  2 raceAMERICAN INDIAN OR ALASKA NATIVE  -15.3    2400.    -0.00640  0.995   
#  3 raceBLACK OR AFRICAN AMERICAN           1.10      1.04   1.06     0.288   
#  4 raceWHITE                               0.264     0.347  0.761    0.447   
#  5 age>60                                  0.346     0.326  1.06     0.289   
#  6 genderMALE                             -0.370     0.336 -1.10     0.271   
#  7 stageII                                33.6    3393.     0.00990  0.992   
#  8 stageIII                                0.361  3393.     0.000107 1.00    
#  9 stageIV                               -32.4    3393.    -0.00956  0.992   
# 10 T2                                    -33.1    3393.    -0.00976  0.992   
# 11 T3                                      0.981  3393.     0.000289 1.00    
# 12 T4                                      1.40   3393.     0.000412 1.00    
# 13 N                                      16.9    2400.     0.00705  0.994   
# 14 M                                      48.9    3636.     0.0134   0.989   

# sjPlot方式
library(sjPlot)
fit <- glm(OS~race + age + gender + stage + T + N + M,family = binomial,data=meta)
tab_model(fit)  # 输出逻辑回归结果表格

# stargazer方式
library(stargazer)
fit <- glm(OS~race + age + gender + stage + T + N + M,family = binomial,data=meta)
stargazer(fit, type = "text")  # 输出格式化的结果
# ================================================================
#                                          Dependent variable:    
#                                      ---------------------------
#                                                  OS             
# ----------------------------------------------------------------
# raceAMERICAN INDIAN OR ALASKA NATIVE           -15.349          
#                                              (2,399.545)        
#                                                                 
# raceBLACK OR AFRICAN AMERICAN                   1.102           
#                                                (1.038)          
#                                                                 
# raceWHITE                                       0.264           
#                                                (0.347)          
#                                                                 
# age> 60                                         0.346           
#                                                (0.326)          
#                                                                 
# genderMALE                                     -0.370           
#                                                (0.336)          
#                                                                 
# stageII                                        33.607           
#                                              (3,393.469)        
#                                                                 
# stageIII                                        0.361           
#                                              (3,393.469)        
#                                                                 
# stageIV                                        -32.425          
#                                              (3,393.469)        
#                                                                 
# T2                                             -33.132          
#                                              (3,393.469)        
#                                                                 
# T3                                              0.981           
#                                              (3,393.469)        
#                                                                 
# T4                                              1.397           
#                                              (3,393.469)        
#                                                                 
# N                                              16.916           
#                                              (2,399.545)        
#                                                                 
# M                                              48.885           
#                                              (3,635.571)        
#                                                                 
# Constant                                      -1.321***         
#                                                (0.396)          
#                                                                 
# ----------------------------------------------------------------
# Observations                                     227            
# Log Likelihood                                -126.439          
# Akaike Inf. Crit.                              280.879          
# ================================================================
# Note:                                *p<0.1; **p<0.05; ***p<0.01

# report方式
library(report)
fit <- glm(OS~race + age + gender + stage + T + N + M,family = binomial,data=meta)
report(fit)  # 生成逻辑回归的报告
# We fitted a logistic model (estimated using ML) to predict OS with race, age, gender, stage, T, N and M (formula: OS ~ race +
# age + gender + stage + T + N + M). The model's explanatory power is moderate (Tjur's R2 = 0.14). The model's intercept,
# corresponding to race = ASIAN, age = <=60, gender = FEMALE, stage = I, T = 1, N = 0 and M = 0, is at -1.32 (95% CI [-2.13,
# -0.57], p < .001). Within this model:
# 
#   - The effect of race [AMERICAN INDIAN OR ALASKA NATIVE] is statistically non-significant and negative (beta = -15.35, 95% CI [,
# 473.44], p = 0.995; Std. beta = -15.35, 95% CI [, 473.44])
#   - The effect of race [BLACK OR AFRICAN AMERICAN] is statistically non-significant and positive (beta = 1.10, 95% CI [-1.08,
# 3.28], p = 0.288; Std. beta = 1.10, 95% CI [-1.08, 3.28])
#   - The effect of race [WHITE] is statistically non-significant and positive (beta = 0.26, 95% CI [-0.42, 0.94], p = 0.447; Std.
# beta = 0.26, 95% CI [-0.16, 0.94])
#   - The effect of age [>60] is statistically non-significant and positive (beta = 0.35, 95% CI [-0.29, 0.99], p = 0.289; Std.
# beta = 0.35, 95% CI [-0.29, 0.99])
#   - The effect of gender [MALE] is statistically non-significant and negative (beta = -0.37, 95% CI [-1.03, 0.17], p = 0.271;
# Std. beta = -0.37, 95% CI [-1.03, 0.30])
#   - The effect of stage [II] is statistically non-significant and positive (beta = 33.61, 95% CI [-498.95, ], p = 0.992; Std.
# beta = 33.61, 95% CI [-498.95, ])
#   - The effect of stage [III] is statistically non-significant and positive (beta = 0.36, 95% CI [-84.40, 77.01], p > .999; Std.
# beta = 0.36, 95% CI [-45.09, 44.27])
#   - The effect of stage [IV] is statistically non-significant and negative (beta = -32.42, 95% CI [-1243.39, 95.63], p = 0.992;
# Std. beta = -32.42, 95% CI [-1253.80, 93.73])
#   - The effect of T [2] is statistically non-significant and negative (beta = -33.13, 95% CI [, 499.69], p = 0.992; Std. beta =
# -33.13, 95% CI [, 499.69])
#   - The effect of T [3] is statistically non-significant and positive (beta = 0.98, 95% CI [-83.78, 77.63], p > .999; Std. beta =
# 0.98, 95% CI [-47.06, 48.58])
#   - The effect of T [4] is statistically non-significant and positive (beta = 1.40, 95% CI [-75.25, 86.15], p > .999; Std. beta =
# 1.40, 95% CI [-71.27, 61.90])
#   - The effect of N is statistically non-significant and positive (beta = 16.92, 95% CI [-471.94, ], p = 0.994; Std. beta = 2.23,
# 95% CI [-62.23, ])
#   - The effect of M is statistically non-significant and positive (beta = 48.89, 95% CI [-83.16, 1374.43], p = 0.989; Std. beta =
# 5.59, 95% CI [-13.05, 137.95])
# 
# Standardized parameters were obtained by fitting the model on a standardized version of the dataset. 95% Confidence Intervals
# (CIs) and p-values were computed using a Wald z-distribution approximation.

各位可以比较一下不同的方式,个人觉得自行提取,autoReg和broom包这三种方式比较好用。

6、多因素logstic回归筛选自变量
代码语言:javascript
复制
# 在用forward/backward/both方法的时候需要去除NA值!
meta <- na.omit(meta)
dim(meta) #去除了蛮多数据的
# [1] 227  10 

#1.多因素enter回归
formula <- as.formula(OS~race+age+gender+stage+T+N+M)
model_enter <- glm(formula,data = meta,family=binomial)
model_enter             
summary(model_enter)   

#2.多因素—forward
# 初始模型(仅包含截距项)
model_start <- glm(OS ~ 1, data = meta, family = binomial)
# 前向选择
model_forward <- step(model_start,
               scope = list(upper = ~ race+age+gender+stage+T+N+M,
                            lower = ~ 1),
               direction = "forward")
# Start:  AIC=290.04
# OS ~ 1
# 
#          Df Deviance    AIC
# + T       3   267.73 275.73
# + stage   3   268.19 276.19
# + M       1   281.32 285.32
# + gender  1   284.56 288.56
# <none>        288.04 290.04
# + age     1   286.56 290.56
# + N       1   287.55 291.55
# + race    3   283.88 291.88
# 
# Step:  AIC=275.73
# OS ~ T
# 
#          Df Deviance    AIC
# + M       1   264.76 274.76
# + age     1   265.12 275.12
# + gender  1   265.51 275.51
# <none>        267.73 275.73
# + N       1   267.08 277.08
# + race    3   263.63 277.63
# + stage   3   265.77 279.77
# 
# Step:  AIC=274.76
# OS ~ T + M
# 
#          Df Deviance    AIC
# + age     1   262.26 274.26
# + gender  1   262.66 274.66
# <none>        264.76 274.76
# + N       1   264.10 276.10
# + stage   3   260.54 276.54
# + race    3   260.84 276.84
# 
# Step:  AIC=274.26
# OS ~ T + M + age
# 
#          Df Deviance    AIC
# <none>        262.26 274.26
# + gender  1   260.52 274.52
# + N       1   261.64 275.64
# + stage   3   258.28 276.28
# + race    3   259.68 277.68
summary(model_forward)

#3.多因素-backward
model_backward<-step(model_enter,direction ="backward")
# Start:  AIC=280.88
# OS ~ race + age + gender + stage + T + N + M
# 
#          Df Deviance    AIC
# - race    3   254.91 276.91
# - T       3   257.74 279.74
# - age     1   254.00 280.00
# - gender  1   254.08 280.08
# - stage   3   258.21 280.21
# - N       1   254.61 280.61
# <none>        252.88 280.88
# - M       1   257.61 283.61
# 
# Step:  AIC=276.91
# OS ~ age + gender + stage + T + N + M
# 
#          Df Deviance    AIC
# - T       3   259.67 275.67
# - stage   3   260.12 276.12
# - N       1   256.60 276.60
# - gender  1   256.74 276.74
# <none>        254.91 276.91
# - age     1   257.00 277.00
# - M       1   259.54 279.54
# 
# Step:  AIC=275.67
# OS ~ age + gender + stage + N + M
# 
#          Df Deviance    AIC
# - N       1   259.77 273.77
# - gender  1   261.25 275.25
# - age     1   261.49 275.49
# <none>        259.67 275.67
# - M       1   263.85 277.85
# - stage   3   277.28 287.28
# 
# Step:  AIC=273.77
# OS ~ age + gender + stage + M
# 
#          Df Deviance    AIC
# - gender  1   261.41 273.41
# - age     1   261.65 273.65
# <none>        259.77 273.77
# - M       1   264.12 276.12
# - stage   3   277.56 285.56
# 
# Step:  AIC=273.41
# OS ~ age + stage + M
# 
#         Df Deviance    AIC
# <none>       261.41 273.41
# - age    1   263.69 273.69
# - M      1   265.48 275.48
# - stage  3   280.09 286.09
summary(model_backward)

#4.多因素-both
model_both<-step(model_enter,direction = "both")
# Start:  AIC=280.88
# OS ~ race + age + gender + stage + T + N + M
# 
#          Df Deviance    AIC
# - race    3   254.91 276.91
# - T       3   257.74 279.74
# - age     1   254.00 280.00
# - gender  1   254.08 280.08
# - stage   3   258.21 280.21
# - N       1   254.61 280.61
# <none>        252.88 280.88
# - M       1   257.61 283.61
# 
# Step:  AIC=276.91
# OS ~ age + gender + stage + T + N + M
# 
#          Df Deviance    AIC
# - T       3   259.67 275.67
# - stage   3   260.12 276.12
# - N       1   256.60 276.60
# - gender  1   256.74 276.74
# <none>        254.91 276.91
# - age     1   257.00 277.00
# - M       1   259.54 279.54
# + race    3   252.88 280.88
# 
# Step:  AIC=275.67
# OS ~ age + gender + stage + N + M
# 
#          Df Deviance    AIC
# - N       1   259.77 273.77
# - gender  1   261.25 275.25
# - age     1   261.49 275.49
# <none>        259.67 275.67
# + T       3   254.91 276.91
# - M       1   263.85 277.85
# + race    3   257.74 279.74
# - stage   3   277.28 287.28
# 
# Step:  AIC=273.77
# OS ~ age + gender + stage + M
# 
#          Df Deviance    AIC
# - gender  1   261.41 273.41
# - age     1   261.65 273.65
# <none>        259.77 273.77
# + N       1   259.67 275.67
# - M       1   264.12 276.12
# + T       3   256.60 276.60
# + race    3   257.84 277.84
# - stage   3   277.56 285.56
# 
# Step:  AIC=273.41
# OS ~ age + stage + M
# 
#          Df Deviance    AIC
# <none>        261.41 273.41
# - age     1   263.69 273.69
# + gender  1   259.77 273.77
# + N       1   261.25 275.25
# - M       1   265.48 275.48
# + T       3   258.28 276.28
# + race    3   258.92 276.92
# - stage   3   280.09 286.09
summary(model_both)

# 模型AIC比较
# 选择AIC值最小的模型
anova(model_enter,model_forward,test = "Chisq")
anova(model_enter,model_backward,test = "Chisq")
anova(model_enter,model_both,test = "Chisq")
anova(model_forward,model_backward,test = "Chisq")
anova(model_forward,model_both,test = "Chisq")
anova(model_backward,model_both,test = "Chisq")

# 计算多个模型的 AIC 值
aic_values <- AIC(model_enter, model_forward, model_backward, model_both)
# 打印 AIC 值
print(aic_values)
# 找到 AIC 值最小的模型
best_model <- aic_values[which.min(aic_values$AIC), ]
print(best_model)

# 使用MuMIn包
library(MuMIn)
# 将所有模型放入一个列表
model_list <- list(Enter = model_enter, Forward = model_forward, Backward = model_backward, Both = model_both)
# 使用 MuMIn 包的 model.sel() 函数进行比较
model_comparison <- model.sel(model_list)
# Model selection table 
#          (Intrc) age gendr     M     N race stage T df   logLik  AICc delta weight
# Backward  -1.524   +       30.84                +    6 -130.706 273.8  0.00  0.375
# Both      -1.524   +       30.84                +    6 -130.706 273.8  0.00  0.375
# Forward   -1.508   +       15.15                  +  6 -131.131 274.6  0.85  0.245
# Enter     -1.321   +     + 48.89 16.92    +     + + 14 -126.439 282.9  9.07  0.004
# 打印模型比较结果
print(model_comparison)
# 选择 AICc 值最小的模型
best_model <- get.models(model_comparison, 1)[[1]]
summary(best_model)

感觉在汇总结果的时候使用MuMIn包还是蛮方便的。

参考资料:

1、glm:https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/glm

2、autoReg: https://github.com/cardiomoon/autoReg

3、broom:https://github.com/tidymodels/broom

4、MuMIn:https://mumin-dataset.github.io/

:若对内容有疑惑或者有发现明确错误的朋友,请联系后台(欢迎交流)。更多内容可关注公众号:生信方舟

- END -

原创声明:本文系作者授权腾讯云开发者社区发表,未经许可,不得转载。

如有侵权,请联系 cloudcommunity@tencent.com 删除。

原创声明:本文系作者授权腾讯云开发者社区发表,未经许可,不得转载。

如有侵权,请联系 cloudcommunity@tencent.com 删除。

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
目录
  • logstic分析流程
    • 1、输入数据
      • 2、check数据
        • 3、构建模型
          • 4、提取模型关键信息
            • 5、使用不同的方法来筛选变量
              • 6、多因素logstic回归筛选自变量
              • 参考资料:
              相关产品与服务
              腾讯云 BI
              腾讯云 BI(Business Intelligence,BI)提供从数据源接入、数据建模到数据可视化分析全流程的BI能力,帮助经营者快速获取决策数据依据。系统采用敏捷自助式设计,使用者仅需通过简单拖拽即可完成原本复杂的报表开发过程,并支持报表的分享、推送等企业协作场景。
              领券
              问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档