Classical linear model - model selection


I have a classical linear model with 5 possible regressors. They are uncorrelated with each other and have fairly low correlation with the response. I have arrived at a model in which 3 of the regressors have significant coefficients by their t-statistics (p < 0.05). Adding either or both of the remaining variables gives p > 0.05 for the t-statistics of the added variables. This leads me to believe the 3-variable model is "best".

However, using the command anova(a, b) in R, where a is the 3-variable model and b is the full model, the p-value for the F statistic is < 0.05, which tells me to prefer the full model over the 3-variable model. How can I reconcile these apparent contradictions?

Thanks. PS Edit: some more background. This is homework, so I won't post details, but we are not told what the regressors represent; they are just numbered 1 to 5. We are asked to "derive an appropriate model, giving justification".

P Sellaz
An appropriate model might be taken to mean a model that effectively uses all pre-specified variables (accounting for nonlinearity, etc.). I hope your instructor understands that stepwise variable selection is invalid. Few do.
Frank Harrell
Hello again, and thanks. Sorry for all the back and forth. The instructions also say "There is not necessarily a 'best' model, and you do not necessarily have to include all the predictors." Also, there is no collinearity or nonlinearity. Actually, all the 5 predictors are generated by independent simulations from a normal distribution. Consequently, the correlations between the predictors and the response are also small (the largest is less than 0.1). Frankly, my intuition says the "best" model may be the sample mean (the adjusted R-squared is less than 0.03).
P Sellaz
@P Sellaz: given that this is homework using simulated data, your intuition might serve you well here. Write up a well-reasoned explanation for your intuition.
Zach
In general, it is correct that one does not have to include all the predictors to do a good job. But the data are incapable of telling you which predictors to use.
Frank Harrell

Answers:


The problem began when you sought a reduced model and used the data rather than subject matter knowledge to pick the predictors. Stepwise variable selection without simultaneous shrinkage to penalize for the selection, though often used, is an invalid approach. Much has been written about this. There is no reason to trust that the 3-variable model is "best", and there is no reason not to use the original list of pre-specified predictors. P-values computed after using P-values to select variables are not valid. This has been called "double dipping" in the functional imaging literature.

Here is an analogy. Suppose one is interested in comparing 6 treatments, but uses pairwise t-tests to pick which treatments are "different", resulting in a reduced set of 4 treatments. The analyst then tests for an overall difference with 3 degrees of freedom. This F test will have inflated type I error. The original F test with 5 d.f. is quite valid.
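
To see the inflation concretely, here is a minimal simulation sketch (not from the original answer; all predictors are pure noise, so any post-selection "significance" is a false positive):

set.seed(42)
n <- 100; n_sim <- 1000
false_pos <- 0
for (i in seq_len(n_sim)) {
  X <- matrix(rnorm(n * 5), n, 5)
  y <- rnorm(n)                                # response unrelated to all predictors
  p <- summary(lm(y ~ X))$coefficients[-1, 4]  # per-coefficient t-test p-values
  keep <- which(p < 0.05)                      # "select" the significant ones
  if (length(keep) > 0) {
    f <- summary(lm(y ~ X[, keep, drop = FALSE]))$fstatistic
    if (pf(f[1], f[2], f[3], lower.tail = FALSE) < 0.05)
      false_pos <- false_pos + 1
  }
}
false_pos / n_sim  # runs well above the nominal 0.05

The overall F test on all 5 pre-specified predictors, by contrast, rejects about 5% of the time, as it should.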

See http://www.stata.com/support/faqs/stat/stepwise.html for more information.

Frank Harrell
Thanks for your reply. I have added an edit to the original question. I hope that is OK. Any further advice would be most welcome.
P Sellaz

One answer would be "this cannot be done without subject matter knowledge". Unfortunately, that would likely get you an F on your assignment. Unless I were your professor. Then it would get an A.

But, given your statement that R2 is 0.03 and there are low correlations among all variables, I'm puzzled that any model is significant at all. What is N? I'm guessing it's very large.

Then there's this statement:

"all the 5 predictors are generated by independent simulations from a normal distribution."

Well, if you KNOW this (that is, your instructor told you) and if by "independent" you mean "not related to the DV" then you know that the best model is one with no predictors, and your intuition is correct.
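
For concreteness, the no-predictor model is just the intercept-only fit, whose single coefficient is the sample mean (a sketch; dat stands for a hypothetical data frame holding the response y and the five predictors):

# Intercept-only model: its one coefficient equals mean(dat$y)
null_model <- lm(y ~ 1, data = dat)
coef(null_model)

# Same comparison the asker made with anova(a, b); with predictors
# unrelated to y, this F test should typically fail to reject
full_model <- lm(y ~ ., data = dat)
anova(null_model, full_model)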

Peter Flom - Reinstate Monica
Hi Peter, and thanks. N is 900. The data were all produced by simulation. I KNOW this because we had to do the simulations ourselves. They are supposed to represent real data, as far as this homework is concerned. 100 simulations were conducted, and the 5 with the largest correlations to the response (also simulated, but only once) were chosen as the candidate regressors.
P Sellaz
Just be certain that you did in fact simulate no connection between any X and Y. Then, as others have said, a regression model is irrelevant and the overall mean is sufficient.
Frank Harrell
Yes, they are completely independent. We chose the data with the largest 5 correlations as the candidate regressors, from which we have to "derive an appropriate model, giving justification" but we "do not necessarily have to include all 5 predictors".
P Sellaz
It sounds like your professor is either a) Completely confused or b) doing something quite interesting. Hard to tell which. If he/she intended this to show the sort of thing that @FrankHarrell and I and others have been pointing out, then good! (that would be b). OTOH, if he/she is intending this to be a "real" regression, then uh-oh it's a).
Peter Flom - Reinstate Monica
I'll let you know which it is when the papers are marked :)
P Sellaz

You might try doing cross validation. Choose a subset of your sample, find the "best" model for that subset using F or t tests, then apply it to the full data set (full cross validation can get more complicated than this, but this would be a good start). This helps to alleviate some of the stepwise testing problems.
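
Here is a rough sketch of the split idea (assuming a data frame dat with response y; x1, x2, x3 stand for whichever predictors the tests on the training half end up selecting):

set.seed(1)
train_idx <- sample(nrow(dat), floor(nrow(dat) / 2))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Fit the model chosen on the training half, then check whether it
# predicts the held-out half any better than the mean alone
fit <- lm(y ~ x1 + x2 + x3, data = train)
mean((test$y - predict(fit, newdata = test))^2)  # held-out MSE of the model
mean((test$y - mean(train$y))^2)                 # held-out MSE of the mean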

See A Note on Screening Regression Equations by David Freedman for a cute little simulation of this idea.

Charlie

I really like the method used in the caret package: recursive feature elimination. You can read more about it in the package's variable selection vignette, but here's the basic process:

The basic idea is to use a criterion (such as t-statistics) to eliminate unimportant variables and see how that improves the predictive accuracy of the model. You wrap the entire thing in a resampling loop, such as cross-validation. Here is an example, using a linear model to rank variables in a manner similar to what you've described:

#Setup: simulate 5 candidate predictors; the response depends on p1, p2, and p5
set.seed(1)
p1 <- rnorm(50)
p2 <- rnorm(50)
p3 <- rnorm(50)
p4 <- rnorm(50)
p5 <- rnorm(50)
y <- 4 * rnorm(50) + p1 + p2 - p5

#Select variables by recursive feature elimination, resampled with repeated CV
require(caret)
X <- data.frame(p1, p2, p3, p4, p5)
RFE <- rfe(X, y, sizes = 1:5,
           rfeControl = rfeControl(functions = lmFuncs,
                                   method = "repeatedcv"))
RFE
plot(RFE)

#Fit the selected and full linear models and compare their summaries
dat <- data.frame(y, p1, p2, p3, p4, p5)
fmla <- as.formula(paste("y ~", paste(RFE$optVariables, collapse = " + ")))
fullmodel <- lm(y ~ p1 + p2 + p3 + p4 + p5, data = dat)
reducedmodel <- lm(fmla, data = dat)
summary(fullmodel)
summary(reducedmodel)

In this example, the algorithm detects that there are 3 "important" variables, but it only recovers 2 of them.

Zach