Calculating confidence intervals for a logistic regression

I am using a binomial logistic regression to identify whether exposure to has_x or has_y impacts the likelihood that a user will click on something. My model is the following:

fit = glm(formula = has_clicked ~ has_x + has_y, 
          data=df, 
          family = binomial())

This is the output from my model:

Call:
glm(formula = has_clicked ~ has_x + has_y, 
    family = binomial(), data = active_domains)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.9869  -0.9719  -0.9500   1.3979   1.4233  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.504737   0.008847 -57.050  < 2e-16 ***
has_xTRUE   -0.056986   0.010201  -5.586 2.32e-08 ***
has_yTRUE    0.038579   0.010202   3.781 0.000156 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 217119  on 164182  degrees of freedom
Residual deviance: 217074  on 164180  degrees of freedom
AIC: 217080

Number of Fisher Scoring iterations: 4

As each coefficient is significant, I can use this model to obtain the predicted value for any of these combinations with the following approach:

predict(fit, data.frame(has_x = T, has_y=T), type = "response")

I don't understand how I can report the standard error of the prediction.

Do I just need to use $1.96*SE$? Or do I need to convert the $SE$ using an approach described here?
If I want to understand the standard error for both variables, how would I consider that?

Unlike this question, I am interested in understanding what the upper and lower bounds of the error are as a percentage. For example, if my prediction shows a value of 37% for True,True, can I calculate that this is $\pm 0.3\%$ for a $95\%$ CI? (0.3% chosen to illustrate my point)

Tags: regression, logistic, standard-error, logit

celenius

Duplicate: stats.stackexchange.com/questions/5304/…

kjetil b halvorsen

Possible duplicate of Why is there a difference between manually calculating a logistic regression 95% confidence interval, and using the confint() function in R?

kjetil b halvorsen

@kjetilbhalvorsen are you sure it is a duplicate, since the OP seems to want a prediction interval but seems to be working on the OR scale rather than the log scale, which may be the root of the problem?

mdewey

If you want to evaluate how well a logistic regression predicts, measures other than prediction + SE are generally used. A popular evaluation measure is the ROC curve with its respective AUC.

adibender

Might this be of any help? stackoverflow.com/questions/47414842/…

Xavier Bourret Sicotte

Answers:

Your question may come from the fact that you are dealing with odds ratios and probabilities, which is confusing at first. Since the logistic model is a non-linear transformation of $\beta^Tx$, computing the confidence intervals is not as straightforward.

Background

Recall that for the logistic regression model

Probability of $(Y = 1)$: $p = \frac{e^{\alpha + \beta_1x_1 + \beta_2 x_2}}{1 + e^{ \alpha + \beta_1x_1 + \beta_2 x_2}}$
Odds of $(Y = 1)$: $\left( \frac{p}{1-p}\right) = e^{\alpha + \beta_1x_1 + \beta_2 x_2}$
Log odds of $(Y = 1)$: $\log \left( \frac{p}{1-p}\right) = \alpha + \beta_1x_1 + \beta_2 x_2$

Consider the case where you have a one-unit increase in the variable $x_1$, i.e. $x_1 + 1$. The new odds are then

$\text{Odds}(Y = 1) = e^{\alpha + \beta_1(x_1 + 1) + \beta_2x_2} = e^{\alpha + \beta_1 x_1 + \beta_1 + \beta_2x_2 }$

Therefore the odds ratio (OR) is

$\frac{\text{Odds}(x_1 + 1)}{\text{Odds}(x_1)} = \frac{e^{\alpha + \beta_1(x_1 + 1) + \beta_2x_2} }{e^{\alpha + \beta_1 x_1 + \beta_2x_2}} = e^{\beta_1}$

Log odds ratio = $\beta_1$
Relative risk (or probability ratio) = $\frac{ \frac{e^{\alpha + \beta_1x_1 + \beta_1 + \beta_2 x_2}}{1 + e^{ \alpha + \beta_1x_1 + \beta_1 + \beta_2 x_2}}}{ \frac{e^{\alpha + \beta_1x_1 + \beta_2 x_2}}{1 + e^{ \alpha + \beta_1x_1 + \beta_2 x_2}}}$
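
As a minimal sketch of how the three scales relate, the coefficients reported in the question's summary output can be plugged in by hand. The variable names (`alpha`, `b_x`, `b_y`, `log_odds`) are illustrative only and are not part of the original post.

```r
# Coefficients copied from the summary output in the question
alpha <- -0.504737   # (Intercept)
b_x   <- -0.056986   # has_xTRUE
b_y   <-  0.038579   # has_yTRUE

# Linear predictor (log odds) for has_x = TRUE, has_y = TRUE
log_odds <- alpha + b_x * 1 + b_y * 1

# The same quantity on the odds and probability scales
odds <- exp(log_odds)
p    <- odds / (1 + odds)   # equivalent to base R's plogis(log_odds)

round(c(log_odds = log_odds, odds = odds, p = p), 4)
```

The probability comes out at roughly 0.37, in line with the 37% the question mentions for True,True.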

Interpreting coefficients

How would you interpret the coefficient value $\beta_j$? Assuming that everything else remains fixed:

For every unit increase in $x_j$ the log odds increase by $\beta_j$.
For every unit increase in $x_j$ the odds are multiplied by $e^{\beta_j}$, i.e. the odds ratio is $e^{\beta_j}$.
For every increase of $x_j$ from $k$ to $k + \Delta$ the odds are multiplied by $e^{\beta_j \Delta}$.
If the coefficient is negative, then an increase in $x_j$ leads to a decrease in the odds.
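
To make the interpretation concrete, here is a short sketch using the two slope estimates from the question. The value `delta <- 2` is a hypothetical step size that only makes sense for a numeric predictor; `has_x` in the question is logical, so that line is purely illustrative.

```r
b_x <- -0.056986   # has_xTRUE, from the question's summary output
b_y <-  0.038579   # has_yTRUE

# Multiplicative change in the odds for a one-unit increase
or_x <- exp(b_x)   # below 1: the odds of a click go down
or_y <- exp(b_y)   # above 1: the odds of a click go up

# Change in the odds for an increase of delta units (hypothetical delta)
delta <- 2
or_x_delta <- exp(b_x * delta)

c(or_x = or_x, or_y = or_y, or_x_delta = or_x_delta)
```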

Confidence intervals for a single parameter $\beta_j$

Do I just need to use $1.96*SE$? Or do I need to convert the SE using an approach described here?

Since the parameter $\beta_j$ is estimated using maximum likelihood estimation, MLE theory tells us that the estimate is asymptotically normal, and hence we can use the large-sample Wald confidence interval to get the usual

$\beta_j \pm z^* SE(\beta_j)$

Which gives a confidence interval on the log-odds ratio. Using the invariance property of the MLE allows us to exponentiate to get

$e^{\beta_j \pm z^* SE(\beta_j)}$

which is a confidence interval on the odds ratio. Note that these intervals are for a single parameter only.
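
A minimal sketch of the single-parameter Wald interval, using the estimates and standard errors printed in the question. The object names (`est`, `se`, `ci_log_or`, `ci_or`) are made up for this example.

```r
# Estimates and standard errors copied from the question's summary output
est <- c(has_xTRUE = -0.056986, has_yTRUE = 0.038579)
se  <- c(has_xTRUE =  0.010201, has_yTRUE = 0.010202)

z <- qnorm(0.975)   # about 1.96 for a 95% interval

# Wald interval on the log odds ratio scale
ci_log_or <- cbind(lower = est - z * se, upper = est + z * se)

# Exponentiate the endpoints for an interval on the odds ratio scale
ci_or <- exp(ci_log_or)

ci_log_or
ci_or
```

With the fitted object at hand, `confint.default(fit)` should give the same Wald limits on the log-odds scale, whereas `confint(fit)` is based on the profile likelihood and can differ slightly.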

If I want to understand the standard-error for both variables how would I consider that?

If you want simultaneous intervals for several parameters, you can use the Bonferroni procedure; otherwise, to account for all parameters at once, you can use the confidence interval for probability estimates.

Bonferroni procedure for several parameters

If $g$ parameters are to be estimated with family confidence coefficient of approximately $1 - \alpha$ , the joint Bonferroni confidence limits are

$\beta_g \pm z_{(1 - \frac{\alpha}{2g})}SE(\beta_g)$
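
The Bonferroni limits can be sketched in the same way. This assumes the two slope coefficients from the question are the $g = 2$ parameters of interest; `alpha` here is the family-wise error rate, not the model intercept.

```r
# Estimates and standard errors copied from the question's summary output
est <- c(has_xTRUE = -0.056986, has_yTRUE = 0.038579)
se  <- c(has_xTRUE =  0.010201, has_yTRUE = 0.010202)

alpha <- 0.05          # family-wise error rate
g     <- length(est)   # number of parameters covered jointly

# Bonferroni critical value: larger than the single-parameter 1.96
z_bonf <- qnorm(1 - alpha / (2 * g))

ci_joint <- cbind(lower = est - z_bonf * se, upper = est + z_bonf * se)
ci_joint
```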

Confidence intervals for probability estimates

The logistic model outputs an estimate of the probability of observing a one, and we aim to construct a frequentist interval around the true probability $p$ such that $Pr(p_{L} \leq p \leq p_{U}) = .95$.

One approach called endpoint transformation does the following:

Compute the upper and lower bounds of the confidence interval for the linear combination $x^T\beta$ (using the Wald CI)
Apply a monotonic transformation to the endpoints $F(x^T\beta)$ to obtain the probabilities.

Since $Pr(x^T\beta) = F(x^T\beta)$ is a monotonic transformation of $x^T\beta$

$[Pr(x^T\beta)_L \leq Pr(x^T\beta) \leq Pr(x^T\beta)_U] = [F(x^T\beta)_L \leq F(x^T\beta) \leq F(x^T\beta)_U]$

Concretely, this means computing $\beta^Tx \pm z^* SE(\beta^Tx)$ and then applying the inverse logit (logistic) transform to the result to get the lower and upper bounds:

$\left[\frac{e^{x^T\beta - z^* SE(x^T\beta)}}{1 + e^{x^T\beta - z^* SE(x^T\beta)}}, \frac{e^{x^T\beta + z^* SE(x^T\beta)}}{1 + e^{x^T\beta + z^* SE(x^T\beta)}}\right]$

The estimated approximate variance of $x^T\beta$ can be calculated using the covariance matrix of the regression coefficients using

$Var(x^T\beta) = x^T \Sigma x$

The advantage of this method is that the bounds cannot be outside the range $(0,1)$.
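
The endpoint transformation can be sketched as follows. The covariance matrix from the question is not available, so this example fits the same model formula to simulated data; the data, the seed and the "true" coefficients are assumptions made only for illustration.

```r
set.seed(1)

# Simulated stand-in for the question's data (hypothetical values)
n  <- 5000
df <- data.frame(has_x = runif(n) < 0.5, has_y = runif(n) < 0.5)
df$has_clicked <- rbinom(n, 1, plogis(-0.5 - 0.06 * df$has_x + 0.04 * df$has_y))

fit <- glm(has_clicked ~ has_x + has_y, data = df, family = binomial())

# Covariate vector for has_x = TRUE, has_y = TRUE (leading 1 is the intercept)
x <- c(1, 1, 1)

eta    <- sum(x * coef(fit))                     # x^T beta
se_eta <- sqrt(drop(t(x) %*% vcov(fit) %*% x))   # sqrt(x^T Sigma x)

# Wald interval on the logit scale, then inverse logit of each endpoint
z  <- qnorm(0.975)
ci <- c(lower = plogis(eta - z * se_eta), upper = plogis(eta + z * se_eta))

c(p_hat = plogis(eta), ci)
```

The manual `se_eta` should agree with the `se.fit` returned by `predict(fit, newdata, type = "link", se.fit = TRUE)` for the same covariate values.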

There are several other approaches as well, such as the delta method and bootstrapping, each of which has its own assumptions, advantages and limits.
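
For comparison, a rough sketch of the delta method on the same kind of simulated data (again an assumption, since the original data are not available). It approximates the standard error on the probability scale as $p(1-p)\,SE(x^T\beta)$ and builds a symmetric interval around $p$.

```r
set.seed(1)

# Simulated stand-in for the question's data (hypothetical values)
n  <- 5000
df <- data.frame(has_x = runif(n) < 0.5, has_y = runif(n) < 0.5)
df$has_clicked <- rbinom(n, 1, plogis(-0.5 - 0.06 * df$has_x + 0.04 * df$has_y))
fit <- glm(has_clicked ~ has_x + has_y, data = df, family = binomial())

new  <- data.frame(has_x = TRUE, has_y = TRUE)
link <- predict(fit, new, type = "link", se.fit = TRUE)

# Delta method: SE(p) is approximately p * (1 - p) * SE(x^T beta)
p_hat <- plogis(link$fit)
se_p  <- p_hat * (1 - p_hat) * link$se.fit

# Symmetric interval on the probability scale; unlike the endpoint
# transformation it is not guaranteed to stay inside (0, 1)
z <- qnorm(0.975)
ci_delta <- c(lower = unname(p_hat - z * se_p), upper = unname(p_hat + z * se_p))
ci_delta
```

Here `se_p` should match the `se.fit` that `predict()` reports with `type = "response"`. With a probability near 0.37 and a large sample the two kinds of interval are nearly identical; they tend to diverge for probabilities close to 0 or 1.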

Sources and info

My favorite book on this topic is "Applied Linear Statistical Models" by Kutner, Neter, Li, Chapter 14

Otherwise here are a few online sources:

Xavier Bourret Sicotte

Much of this is about CIs for the coefficients, which is a fine thing for the OP to know about, but are we sure that is what he needs? Your later section seems more relevant to me, but perhaps the distinction may be missed if read too quickly?

mdewey

Yes, you are probably right, but understanding odds, log odds and probabilities for logistic regression is something I struggled with in the past. I hope this post summarises the topic well enough that it might help someone in the future. Perhaps I could answer the question more explicitly by providing a CI, but we would need the covariance matrix.

Xavier Bourret Sicotte

To get the 95% confidence interval of the prediction, you can calculate the interval on the logit scale and then convert the bounds back to the probability scale (0 to 1). Here is an example using the Titanic dataset.

library(titanic)
data("titanic_train")

# Treat passenger class as a labelled factor rather than a number
titanic_train$Pclass = factor(titanic_train$Pclass, levels = c(1,2,3), labels = c('First','Second','Third'))

fit = glm(Survived ~ Sex + Pclass, data=titanic_train, family = binomial())

# Map a value on the logit (link) scale back to a probability
inverse_logit = function(x){
  exp(x)/(1+exp(x))
}

# Prediction and its standard error on the logit scale
predicted = predict(fit, data.frame(Sex='male', Pclass='First'), type='link', se.fit=TRUE)

# 95% CI bounds: build them on the logit scale, then transform back
se_high = inverse_logit(predicted$fit + (predicted$se.fit*1.96))
se_low = inverse_logit(predicted$fit - (predicted$se.fit*1.96))
expected = inverse_logit(predicted$fit)

The mean and low/high 95% CI.

> expected
        1 
0.4146556 
> se_high
        1 
0.4960988 
> se_low
        1 
0.3376243

And the output from just using type='response', which only gives the mean

predict(fit, data.frame(Sex='male', Pclass='First'), type='response')
        1 
0.4146556

Shawn

predict(fit, data.frame(Sex='male', Pclass='First'), type='response', se.fit=TRUE) will work.

Tony416