I have a classical linear model with 5 candidate regressors. They are uncorrelated with one another and have fairly low correlation with the response. I have arrived at a model in which 3 of the regressors have coefficients significant by their t statistics (p < 0.05). Adding either or both of the remaining variables gives p > 0.05 on the t statistics for the added variables. This leads me to believe the 3-variable model is the "best" one.
However, using the command anova(a, b) in R, where a is the 3-variable model and b is the full model, the p-value for the F statistic is < 0.05, which tells me to prefer the full model over the 3-variable model. How can I reconcile these apparent contradictions?
Thanks. PS Edit: some more background. This is homework, so I won't post details, but we are not told what the regressors represent; they are only numbered 1 through 5. We are asked to "obtain an appropriate model, giving justification".
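For concreteness, the partial F test that anova(a, b) performs in R can be computed by hand from the two models' residual sums of squares. Here is a minimal sketch in Python (numpy only; the simulated data, sample size, and coefficients are invented for illustration, not taken from the question):

```python
import numpy as np

def rss(X, y):
    # residual sum of squares of an OLS fit (X already includes the intercept column)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

def partial_f(X_red, X_full, y):
    # F statistic for H0: the extra columns in X_full all have zero coefficients
    n = len(y)
    df_num = X_full.shape[1] - X_red.shape[1]
    df_den = n - X_full.shape[1]
    rss_red, rss_full = rss(X_red, y), rss(X_full, y)
    return ((rss_red - rss_full) / df_num) / (rss_full / df_den)

rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(n, 5))
# three strong regressors plus two weak ones
y = X[:, :3] @ np.array([0.5, 0.4, 0.3]) + 0.1 * X[:, 3] + 0.1 * X[:, 4] + rng.normal(size=n)
ones = np.ones((n, 1))
X3 = np.hstack([ones, X[:, :3]])  # reduced model: 3 regressors
X5 = np.hstack([ones, X])         # full model: all 5 regressors
F = partial_f(X3, X5, y)
```

The point that reconciles the two results: the partial F test pools the evidence from both added variables, so two individually non-significant t statistics (each p slightly above 0.05) can still produce a jointly significant F.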
Answers:
The problem began when you sought a reduced model and used the data rather than subject matter knowledge to pick the predictors. Stepwise variable selection without simultaneous shrinkage to penalize for variable selection, though often used, is an invalid approach. Much has been written about this. There is no reason to trust that the 3-variable model is "best" and there is no reason not to use the original list of pre-specified predictors. P-values computed after using P-values to select variables are not valid. This has been called "double dipping" in the functional imaging literature.
Here is an analogy. Suppose one is interested in comparing 6 treatments, but uses pairwise t-tests to pick which treatments are "different", resulting in a reduced set of 4 treatments. The analyst then tests for an overall difference with 3 degrees of freedom. This F test will have inflated type I error. The original F test with 5 d.f. is quite valid.
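The inflation from selecting by significance is easy to see in a simulation. A sketch in Python (numpy only; the data are pure noise, and 1.96 is used as an approximate 5% two-sided critical value for the t statistics — both are my choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sims = 200, 5, 500
hits_any, hits_one = 0, 0
for _ in range(sims):
    X = rng.normal(size=(n, p))
    y = rng.normal(size=n)                 # null: no predictor is related to y
    Xd = np.hstack([np.ones((n, 1)), X])
    XtX_inv = np.linalg.inv(Xd.T @ Xd)
    beta = XtX_inv @ Xd.T @ y
    resid = y - Xd @ beta
    s2 = resid @ resid / (n - p - 1)
    t = beta[1:] / np.sqrt(s2 * np.diag(XtX_inv)[1:])
    hits_any += np.any(np.abs(t) > 1.96)   # "found" at least one predictor
    hits_one += np.abs(t[0]) > 1.96        # a single pre-specified test
rate_any = hits_any / sims
rate_one = hits_one / sims
```

With 5 independent null predictors, the chance that at least one t-test comes out "significant" is roughly 1 − 0.95⁵ ≈ 23%, while a single pre-specified test stays near the nominal 5% — which is the analogy's point about inflated type I error.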
See http://www.stata.com/support/faqs/stat/stepwise.html and stepwise regression for more information.
One answer would be "this cannot be done without subject matter knowledge". Unfortunately, that would likely get you an F on your assignment. Unless I was your professor. Then it would get an A.
But, given your statement that R² is 0.03 and there are low correlations among all variables, I'm puzzled that any model is significant at all. What is N? I'm guessing it's very large.
Then there's your statement that the regressors are "independent". Well, if you KNOW this (that is, your instructor told you), and if by "independent" you mean "not related to the DV", then you know that the best model is one with no predictors, and your intuition is correct.
You might try doing cross validation. Choose a subset of your sample, find the "best" model for that subset using F or t tests, then apply it to the full data set (full cross validation can get more complicated than this, but this would be a good start). This helps to alleviate some of the stepwise testing problems.
See A Note on Screening Regression Equations by David Freedman for a cute little simulation of this idea.
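The sample-splitting idea above can be sketched as follows, in Python (numpy only; the half/half split, the 1.96 cutoff, and the simulated data are illustrative choices, not part of the original answer):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 5
X = rng.normal(size=(n, p))
y = 0.5 * X[:, 0] + rng.normal(size=n)   # only the first predictor matters

def fit(Xd, y):
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return beta

train = np.arange(n) < n // 2
test = ~train
Xd = np.hstack([np.ones((n, 1)), X])

# step 1: on the training half, keep predictors whose |t| exceeds ~1.96
b = fit(Xd[train], y[train])
XtX_inv = np.linalg.inv(Xd[train].T @ Xd[train])
resid = y[train] - Xd[train] @ b
s2 = resid @ resid / (train.sum() - p - 1)
t = b[1:] / np.sqrt(s2 * np.diag(XtX_inv)[1:])
keep = np.where(np.abs(t) > 1.96)[0]

# step 2: refit the selected model and assess it on the held-out half
Xsel = np.hstack([np.ones((n, 1)), X[:, keep]])
b_sel = fit(Xsel[train], y[train])
mse = np.mean((y[test] - Xsel[test] @ b_sel) ** 2)
```

Because the selection step never sees the held-out half, the held-out error is an honest assessment of the selected model, which is what mitigates the double-dipping problem.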
I really like the method used in the caret package: recursive feature elimination. You can read more about it in the vignette, but here's the basic idea: use a criterion (such as t statistics) to eliminate unimportant variables and see how that improves the predictive accuracy of the model. You wrap the entire thing in a resampling loop, such as cross-validation. Here is an example, using a linear model to rank variables in a manner similar to what you've described:
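The original R code for this example was not preserved in the post. As a rough stand-in, here is a hand-rolled sketch of the elimination step in Python (numpy only, no caret; the simulated data and coefficients are my own): rank variables by repeatedly dropping the one with the smallest |t|. Note that caret's rfe additionally wraps this in a resampling loop, which this sketch omits.

```python
import numpy as np

def t_stats(X, y):
    # OLS t statistics (intercept is added here; X holds only the predictors)
    n = len(y)
    Xd = np.hstack([np.ones((n, 1)), X])
    XtX_inv = np.linalg.inv(Xd.T @ Xd)
    beta = XtX_inv @ Xd.T @ y
    resid = y - Xd @ beta
    s2 = resid @ resid / (n - Xd.shape[1])
    return beta[1:] / np.sqrt(s2 * np.diag(XtX_inv)[1:])

def rfe_rank(X, y):
    # eliminate the variable with the smallest |t| each round;
    # variables that survive longest are ranked as most important
    remaining = list(range(X.shape[1]))
    dropped = []
    while remaining:
        t = t_stats(X[:, remaining], y)
        worst = remaining[int(np.argmin(np.abs(t)))]
        remaining.remove(worst)
        dropped.append(worst)
    return dropped[::-1]  # most important first

rng = np.random.default_rng(3)
X = rng.normal(size=(150, 5))
y = X[:, :3] @ np.array([1.0, 0.8, 0.6]) + rng.normal(size=150)
ranking = rfe_rank(X, y)
```

With a resampling loop around this, you would track held-out error at each subset size and keep the size that predicts best, which is what the caret version does.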
In this example, the algorithm detects that there are 3 "important" variables, but it only gets 2 of them.