The test statistic for the Hosmer-Lemeshow test (HLT) for the goodness of fit (GOF) of a logistic regression model is defined as follows:
The sample is split into $d = 10$ deciles, $D_1, D_2, \ldots, D_d$; per decile the following quantities are computed:
- $O_{1g} = \sum_{i \in D_g} y_i$, i.e. the observed number of positive cases in decile $D_g$;
- $O_{0g} = n_g - \sum_{i \in D_g} y_i$, i.e. the observed number of negative cases in decile $D_g$;
- $E_{1g} = \sum_{i \in D_g} \hat{\pi}_i$, i.e. the estimated number of positive cases in decile $D_g$;
- $E_{0g} = \sum_{i \in D_g} (1 - \hat{\pi}_i)$, i.e. the estimated number of negative cases in decile $D_g$;
where $y_i$ is the observed binary outcome for the $i$-th observation and $\hat{\pi}_i$ the estimated probability for that observation.
Then the test statistic is defined as:
$$X^2_{HL} = \sum_{g=1}^{d} \left( \frac{(O_{1g} - E_{1g})^2}{E_{1g}} + \frac{(O_{0g} - E_{0g})^2}{E_{0g}} \right) = \sum_{g=1}^{d} \frac{(O_{1g} - n_g \hat{\pi}_g)^2}{n_g \hat{\pi}_g (1 - \hat{\pi}_g)},$$
where $\hat{\pi}_g$ is the average estimated probability in decile $g$ and $n_g$ is the number of companies in the decile.
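For concreteness, here is a minimal Python sketch of the statistic as defined above (the function name, the grouping by sorted estimated probabilities, and the equal-size splitting are my own choices, not part of the original definition); it also checks numerically that the per-cell sum and the compact form coincide.

```python
import numpy as np

def hosmer_lemeshow_statistic(y, pi_hat, d=10):
    """Hosmer-Lemeshow statistic based on d groups (deciles of estimated risk)."""
    y = np.asarray(y, dtype=float)
    pi_hat = np.asarray(pi_hat, dtype=float)

    # Group observations into d roughly equal groups by sorted estimated probability.
    order = np.argsort(pi_hat)
    groups = np.array_split(order, d)

    stat_cells = 0.0    # sum over the 2*d cells: (O - E)^2 / E
    stat_compact = 0.0  # compact form using the average estimated probability per group
    for idx in groups:
        n_g = len(idx)
        o1 = y[idx].sum()            # observed positives in group g
        e1 = pi_hat[idx].sum()       # estimated positives in group g
        o0, e0 = n_g - o1, n_g - e1  # observed / estimated negatives
        stat_cells += (o1 - e1) ** 2 / e1 + (o0 - e0) ** 2 / e0
        pi_bar = e1 / n_g            # average estimated probability in group g
        stat_compact += (o1 - n_g * pi_bar) ** 2 / (n_g * pi_bar * (1 - pi_bar))
    return stat_cells, stat_compact  # the two forms coincide
```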
According to Hosmer and Lemeshow (see this link), this statistic has (under certain assumptions) a $\chi^2$ distribution with $(d-2)$ degrees of freedom.
On the other hand, if I defined a contingency table with $d$ rows (corresponding to the deciles) and 2 columns (corresponding to the true/false binary outcome), then the test statistic for the $\chi^2$ test for this contingency table would be the same as the one defined above; however, in the contingency-table case this test statistic is $\chi^2$ with $(d-1)(2-1) = d-1$ degrees of freedom. So one degree of freedom more!
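One way to see which reference distribution fits better is a rough simulation under the null hypothesis. The sketch below is only illustrative (it reuses the hosmer_lemeshow_statistic function from the earlier sketch, and the model, sample size and number of replications are arbitrary choices of mine); it fits a logistic regression with statsmodels and compares the simulated statistics against $\chi^2(d-2)$ and $\chi^2(d-1)$.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(0)
d, n, n_sim = 10, 2000, 500
hl_values = []

for _ in range(n_sim):
    # Simulate data for which the logistic model is correct (the null hypothesis holds).
    x = rng.normal(size=n)
    p_true = 1.0 / (1.0 + np.exp(-(-0.5 + 1.0 * x)))
    y = rng.binomial(1, p_true)

    # Fit the logistic regression and obtain the estimated probabilities.
    X = sm.add_constant(x)
    pi_hat = sm.Logit(y, X).fit(disp=0).predict(X)

    hl, _ = hosmer_lemeshow_statistic(y, pi_hat, d=d)  # sketch from above
    hl_values.append(hl)

hl_values = np.array(hl_values)
# Compare the simulated null distribution with the two candidate chi-square references.
print("KS p-value vs chi2(d-2):", stats.kstest(hl_values, stats.chi2(d - 2).cdf).pvalue)
print("KS p-value vs chi2(d-1):", stats.kstest(hl_values, stats.chi2(d - 1).cdf).pvalue)
```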
How can this difference in the number of degrees of freedom be explained?
EDIT: additions after reading the comments:
@whuber
They say (see Hosmer DW, Lemeshow S. (1980), A goodness-of-fit test for the multiple logistic regression model. Communications in Statistics, A10, 1043-1069) that there is a theorem proven by Moore and Spruill from which it follows that if (1) the parameters are estimated using likelihood functions for ungrouped data and (2) the frequencies in the 2×g table depend on the estimated parameters, i.e. the cells are random, not fixed, then, under appropriate regularity conditions, the goodness-of-fit statistic under (1) and (2) is that of a central chi-square with the usual reduction in degrees of freedom due to estimated parameters, plus a sum of weighted chi-square variables.
So, if I understand their paper correctly, they try to find an approximation for this 'correction term', which, if I understand it well, is this weighted sum of chi-square random variables, and they do so by simulations; but I must admit that I do not fully understand what they say there, hence my question: why are these cells random? How does that influence the degrees of freedom? Would it be different if I fixed the cell boundaries and then classified the observations into fixed cells based on the estimated score? In that case the cells are not random, although the 'content' of the cells is.
@Frank Harrell: couldn't it be that the 'shortcomings' of the Hosmer-Lemeshow test that you mention in your comments below are just a consequence of the approximation by the weighted sum of chi-squares?
(See the rms package residuals.lrm and val.prob functions.)
Answers:
Hosmer D.W., Lemeshow S. (1980), A goodness-of-fit test for the multiple logistic regression model. Communications in Statistics, A10, 1043-1069 show that, under the null hypothesis and appropriate regularity conditions, the test statistic is asymptotically distributed as
$$\chi^2(g-p-1) + \sum_{i=1}^{p+1} \lambda_i\, \chi^2_i(1),$$
where $g$ is the number of groups and $p$ the number of covariates in the model.
(Note: the necessary conditions are not explicitly stated in Theorem 2 on page 1052, but if one reads the paper and the proof attentively they pop up.)
The second term $\sum_{i=1}^{p+1} \lambda_i \chi^2_i(1)$ results from the fact that the grouping is based on estimated, i.e. random, quantities (Hosmer, Lemeshow, 1980, p. 1051).
Using simulations they showed that the second term can be (in the cases used in the simulation) approximated by a $\chi^2(p-1)$ (Hosmer, Lemeshow, 1980, p. 1060).
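Putting these two statements together gives the degrees-of-freedom count that the question asks about (a back-of-the-envelope version of the argument, treating the two terms as roughly independent chi-squares, not the formal derivation in the paper):
$$\chi^2(g-p-1) + \sum_{i=1}^{p+1} \lambda_i\, \chi^2_i(1) \;\approx\; \chi^2(g-p-1) + \chi^2(p-1) \;\approx\; \chi^2\big((g-p-1)+(p-1)\big) = \chi^2(g-2),$$
one degree of freedom less than the $(g-1)(2-1) = g-1$ of the usual contingency-table rule.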
See also Hosmer Lemeshow (1980) Paper - Theorem 2
source
The theorem that you refer to (the "usual reduction of degrees of freedom due to estimated parameters" part) was mostly advocated by R.A. Fisher. In 'On the interpretation of Chi Square from Contingency Tables, and the Calculation of P' (1922) he argued for the $(R-1)(C-1)$ rule, and in 'The goodness of fit of regression formulae' (1922) he argued for reducing the degrees of freedom by the number of parameters used in the regression to obtain the expected values from the data. (It is interesting to note that people misused the chi-square test, with the wrong degrees of freedom, for more than twenty years after its introduction in 1900.)
Your case is of the second kind (regression) and not of the former kind (contingency table) although the two are related in that they are linear restrictions on the parameters.
Because you model the expected values based on your observed values, and you do this with a model that has two parameters, the 'usual' reduction in degrees of freedom is two plus one: an extra one because the $O_i$ need to sum up to a total, which is another linear restriction. Yet you end up effectively with a reduction of two, instead of three, because of the 'inefficiency' of the modeled expected values.
The chi-square test uses a $\chi^2$ statistic as a distance measure to express how close a result is to the expected data. In the many versions of the chi-square test, the distribution of this 'distance' is related to the sum of deviations of normally distributed variables (which is true in the limit only and is an approximation if you deal with non-normally distributed data).
For the multivariate normal distribution the density function is related to the $\chi^2$ by
$$f(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^d |\boldsymbol{\Sigma}|}} \, e^{-\frac{1}{2}\chi^2},$$
with $|\boldsymbol{\Sigma}|$ the determinant of the covariance matrix of $\mathbf{x}$, and $\chi^2 = (\mathbf{x}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})$ the Mahalanobis distance, which reduces to the Euclidean distance if $\boldsymbol{\Sigma} = I$.
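As a small numerical illustration of that distance (the vectors and covariance matrix below are made up for the example), here is a Python sketch showing that the quadratic form collapses to the squared Euclidean distance when $\boldsymbol{\Sigma} = I$.

```python
import numpy as np

# Toy values (my own choice) to illustrate the Mahalanobis distance above.
x = np.array([1.0, 2.0, 0.5])
mu = np.array([0.0, 1.0, 0.0])
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 0.5]])

diff = x - mu
chi2_mahalanobis = diff @ np.linalg.inv(Sigma) @ diff  # (x-mu)^T Sigma^{-1} (x-mu)
chi2_euclidean = diff @ diff                           # same expression with Sigma = I
print(chi2_mahalanobis, chi2_euclidean)
```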
In his 1900 article Pearson argued that the $\chi^2$ levels are spheroids and that he could transform to spherical coordinates in order to integrate a value such as $P(\chi^2 > a)$, which then becomes a single integral.
It is this geometrical representation, $\chi^2$ as a distance and also as a term in the density function, that can help to understand the reduction of degrees of freedom when linear restrictions are present.
First, the case of a 2x2 contingency table. You should notice that the four values $\frac{O_i - E_i}{\sqrt{E_i}}$ are not four independent normally distributed variables. They are instead related to each other and boil down to a single variable.
Let's use the table
$$\begin{array}{cc} o_{11} & o_{12} \\ o_{21} & o_{22} \end{array}$$
then, if the expected values were fixed, $\sum_{ij} \frac{(o_{ij}-e_{ij})^2}{e_{ij}}$ would be distributed as a chi-square distribution with four degrees of freedom. But often we estimate the $e_{ij}$ based on the $o_{ij}$, and then the variation is not like that of four independent variables. Instead we get that all the differences between $o$ and $e$ are the same,
$$(o_{11}-e_{11}) = -(o_{12}-e_{12}) = -(o_{21}-e_{21}) = (o_{22}-e_{22}),$$
and they are effectively a single variable rather than four. Geometrically you can see this as the $\chi^2$ value being integrated not over a four-dimensional sphere but over a single line.
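A quick simulation sketch of this point (the sample size and cell probabilities are arbitrary choices of mine): when the expected counts are estimated from the margins, the four differences $o_{ij}-e_{ij}$ collapse to a single random quantity.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate one 2x2 table under independence, estimate the expected counts
# from the margins, and inspect o_ij - e_ij.
n = 200
row = rng.binomial(1, 0.4, size=n)
col = rng.binomial(1, 0.6, size=n)
o = np.array([[np.sum((row == r) & (col == c)) for c in (0, 1)] for r in (0, 1)])

e = o.sum(axis=1, keepdims=True) * o.sum(axis=0, keepdims=True) / o.sum()

print(o - e)
# All four differences have the same absolute value with a +/- checkerboard of signs,
# so the four (o - e)/sqrt(e) terms carry only one independent random quantity -> 1 df.
```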
Note that this contingency-table test is not the case for the contingency table in the Hosmer-Lemeshow test (it uses a different null hypothesis!). See also section 2.1, 'the case when $\beta_0$ and $\underline{\beta}$ are known', in the article by Hosmer and Lemeshow. In their case you get $2g-1$ degrees of freedom and not $g-1$ degrees of freedom as in the $(R-1)(C-1)$ rule. This $(R-1)(C-1)$ rule is specifically the case for the null hypothesis that the row and column variables are independent (which creates $R+C-1$ constraints on the $o_i - e_i$ values). The Hosmer-Lemeshow test relates to the hypothesis that the cells are filled according to the probabilities of a logistic regression model, based on four parameters in the case of distributional assumption A and $p+1$ parameters in the case of distributional assumption B.
Second, the case of a regression. A regression does something similar to the difference $o-e$ as the contingency table does, and reduces the dimensionality of the variation. There is a nice geometrical representation for this, as the value $y_i$ can be represented as the sum of a model term $\beta x_i$ and a residual (not error) term $\epsilon_i$. The model term and the residual term each represent a dimensional space that is perpendicular to the other. That means the residual terms $\epsilon_i$ can not take just any possible value! Namely, they are reduced by the part which projects onto the model, and more particularly by one dimension for each parameter in the model.
Maybe the following images can help a bit
Below are 400 draws of three (uncorrelated) variables from the binomial distributions $B(n=60, p=1/6, 2/6, 3/6)$. They relate to normally distributed variables $N(\mu = np, \sigma^2 = np(1-p))$. In the same image the iso-surfaces for $\chi^2 = 1, 2, 6$ are drawn. Integrating over this space by using spherical coordinates, such that we only need a single integration (because changing the angle does not change the density), over $\chi$ results in $\int_0^a e^{-\frac{1}{2}\chi^2}\chi^{d-1} \, d\chi$, in which this $\chi^{d-1}$ part represents the area of the $d$-dimensional sphere. If we would limit the variables in some way, then the integration would not be over a $d$-dimensional sphere but over something of lower dimension.
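A small numerical check of that radial integral (the dimension and the upper limit below are arbitrary choices of mine): normalizing $\int_0^a e^{-\frac{1}{2}\chi^2}\chi^{d-1}d\chi$ should reproduce the CDF of the chi distribution with $d$ degrees of freedom.

```python
import numpy as np
from scipy import integrate, stats

# Dimension and upper limit are arbitrary choices for the check.
d, a = 3, 2.0

def radial(r):
    return np.exp(-0.5 * r ** 2) * r ** (d - 1)  # e^{-chi^2/2} chi^{d-1}

num, _ = integrate.quad(radial, 0, a)
den, _ = integrate.quad(radial, 0, np.inf)

print(num / den)               # normalized radial integral
print(stats.chi(df=d).cdf(a))  # CDF of the chi distribution with d degrees of freedom
```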
The image below can be used to get an idea of the dimensional reduction in the residual terms. It explains the least-squares fitting method in geometric terms.
In blue you have measurements. In red you have what the model allows. The measurement is often not exactly equal to the model and has some deviation. You can regard this, geometrically, as the distance from the measured point to the red surface.
The red arrows $\mu_1$ and $\mu_2$ have values $(1,1,1)$ and $(0,1,2)$ and could be related to some linear model such as $x = a + b \cdot z + \text{error}$, or
$$\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = a \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} + b \begin{pmatrix} 0 \\ 1 \\ 2 \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \end{pmatrix},$$
so the span of those two vectors $(1,1,1)$ and $(0,1,2)$ (the red plane) is the set of values for $x$ that are possible in the regression model, and $\epsilon$ is a vector that is the difference between the observed value and the regression/modeled value. In the least-squares method this vector is perpendicular (least distance is least sum of squares) to the red surface (and the modeled value is the projection of the observed value onto the red surface).
So this difference between the observed and the (modeled) expected values is a vector that is perpendicular to the model vectors (and this space has the dimension of the total space minus the number of model vectors).
In our simple example case, the total dimension is 3, the model has 2 dimensions, and the error has dimension 1 (so no matter which of those blue points you take, the error terms always have the same ratio and follow a single vector; the green arrows show a single example).
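A short sketch of this projection picture, using the same model vectors $(1,1,1)$ and $(0,1,2)$ (the observed point is my own arbitrary choice): the least-squares residual is perpendicular to the red plane and therefore lives in a one-dimensional space.

```python
import numpy as np

# The two red model vectors from the example; the observed (blue) point is my own choice.
m1 = np.array([1.0, 1.0, 1.0])
m2 = np.array([0.0, 1.0, 2.0])
X = np.column_stack([m1, m2])
x_obs = np.array([1.0, 3.0, 2.0])

# Least squares: project the observation onto the span of m1 and m2 (the red plane).
coef, *_ = np.linalg.lstsq(X, x_obs, rcond=None)
residual = x_obs - X @ coef

# The residual (green arrow) is perpendicular to both model vectors, up to rounding,
# so it lives in a 3 - 2 = 1 dimensional space.
print(residual @ m1, residual @ m2)
```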
I hope this explanation helps. It is in no way a rigorous proof and there are some special algebraic tricks that need to be solved in these geometric representations. But anyway, I like these two geometrical representations: the one for Pearson's trick of integrating the $\chi^2$ by using spherical coordinates, and the other for viewing the least-squares method as a projection onto a plane (or larger span).
I am always amazed how we end up with $\frac{o-e}{\sqrt{e}}$; this is, in my point of view, not trivial, since the normal approximation of a binomial involves dividing not by $\sqrt{e}$ but by $\sqrt{np(1-p)}$. In the case of contingency tables you can work it out easily, but in the case of regression or other linear restrictions it does not work out so easily, while the literature is often very quick to argue that 'it works out the same for other linear restrictions'. (An interesting example of the problem: if you perform the following experiment multiple times, 'throw a coin 10 times, twice, and only register the cases in which the sum is 10', then you do not get the typical chi-square distribution for this "simple" linear restriction.)
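For what it is worth, here is a minimal Monte Carlo version of that closing example (my own reading of the setup: two runs of 10 fair coin flips, keeping only pairs whose total is 10); the conditioned statistic clearly does not follow a $\chi^2(1)$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two runs of 10 fair coin flips; keep only the pairs whose total equals 10,
# then compute sum (o - e)^2 / e with e = 5 for each half.
x1 = rng.binomial(10, 0.5, size=200_000)
x2 = rng.binomial(10, 0.5, size=200_000)
keep = (x1 + x2) == 10

statistic = (x1[keep] - 5) ** 2 / 5 + (x2[keep] - 5) ** 2 / 5

# Under the restriction x1 + x2 = 10 the conditional variance of x1 is 25/19, not
# np(1-p) = 2.5, so the statistic is roughly a down-scaled chi-square(1), not a chi-square(1).
print("mean of statistic:", statistic.mean())             # about 0.53 instead of 1
print("P(statistic > 3.84):", (statistic > 3.84).mean())  # far below the nominal 5%
```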
source