A/B testing: z-test vs t-test vs chi-squared vs Fisher's exact test


I'm trying to understand the reasoning behind choosing a specific test approach for a simple A/B test, i.e. two variations/groups with a binary response (converted or not). As an example, I'll use the data below:

Version  Visits  Conversions
A        2069     188
B        1826     220

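To make the numbers concrete, the raw conversion rates implied by this table can be computed directly (plain Python, no statistics library needed):

```python
# Observed A/B data from the table above
visits = {"A": 2069, "B": 1826}
conversions = {"A": 188, "B": 220}

# Conversion rate per variation
rates = {v: conversions[v] / visits[v] for v in visits}
print(rates)  # A: ~0.0909 (9.1%), B: ~0.1205 (12.0%)
```

So version B converts roughly three percentage points better in the sample; the question is whether that difference is statistically significant.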
The top answer here is excellent and discusses some of the underlying assumptions of the z-, t-, and chi-squared tests. But what I find confusing is that different online resources cite different approaches, and surely the assumptions for a basic A/B test should be more or less the same?

  1. For example, this article uses a z-score (image of the formula not reproduced).

  2. This article uses a different-looking formula, though I'm not sure it actually differs from the z-score calculation (image not reproduced).

  3. This article references the t-test (p. 152).

So what arguments can be made for these different approaches? Why would one prefer one over another?

To include one more candidate: the table above can be rewritten as a 2x2 contingency table, to which Fisher's exact test can be applied (p. 5)

              Non-converters  Converters  Row Total
Version A     1881            188         2069
Version B     1606            220         1826
Column Total  3487            408         3895
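As a sketch, Fisher's exact test on this 2x2 table can be run in Python with SciPy; `scipy.stats.fisher_exact` computes the same two-sided p-value as R's `fisher.test`:

```python
from scipy import stats

# 2x2 contingency table: rows = versions, columns = (non-converters, converters)
table = [[1881, 188],
         [1606, 220]]

odds_ratio, p_value = stats.fisher_exact(table)
print(p_value)  # ~0.00279, matching R's fisher.test
```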

But according to this thread, Fisher's exact test should only be used with smaller sample sizes (what's the cutoff?).

And then there are the paired t- and z-tests, the F-test (and logistic regression, but I want to leave that out for now)... I feel like I'm drowning in different testing approaches, and I just want to be able to make some kind of argument for the different methods in this simple A/B test case.

Using the example data I get the following p-values:

  1. https://vwo.com/ab-split-test-significance-calculator/ gives a p-value of 0.001 (z-score)

  2. http://www.evanmiller.org/ab-testing/chi-squared.html (using the chi-squared test) gives a p-value of 0.00259

  3. And in R, fisher.test(rbind(c(1881,188),c(1606,220)))$p.value gives a p-value of 0.002785305

Which I guess are all very close...
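The near-agreement is no accident: for a 2x2 table, the two-proportion z-statistic squared equals the (uncorrected) chi-squared statistic, so the two-sided z p-value and the chi-squared p-value coincide. A sketch in Python, with the z-test computed by hand and the chi-squared via SciPy without Yates' continuity correction:

```python
import math
from scipy import stats

conv_a, n_a = 188, 2069
conv_b, n_b = 220, 1826

# Pooled two-proportion z-test
p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se
p_z = 2 * stats.norm.sf(abs(z))  # two-sided p-value

# Chi-squared test on the same data (no continuity correction)
table = [[n_a - conv_a, conv_a], [n_b - conv_b, conv_b]]
chi2, p_chi2, dof, expected = stats.chi2_contingency(table, correction=False)

print(z, p_z)        # z ~ 3.01, p ~ 0.0026
print(chi2, p_chi2)  # chi2 ~ 9.07 = z**2, identical p-value
```

The one-sided p-value, `stats.norm.sf(z)` (about 0.0013), is presumably what the VWO calculator rounds to 0.001, which would explain the apparent discrepancy between the tools.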

Anyway, I'm just hoping for a healthy discussion of which approaches to use for online testing, where sample sizes are usually in the thousands and response rates are often 10% or less. My gut tells me to use chi-squared, but I want to be able to explain exactly why I'm choosing it over the many other ways of doing this.

L Xandor
source
I found this demonstration pretty helpful. It shows that the z-test for proportions is essentially equivalent to the chi-squared test of homogeneity on the 2x2 contingency table. rinterested.github.io/statistics/chi_square_same_as_z_test.html
yueyanw

Answers:


We use these tests for different reasons and under different circumstances.

  1. z-test. A z-test assumes that our observations are independently drawn from a Normal distribution with unknown mean and known variance. A z-test is used primarily when we have quantitative data. (i.e. weights of rodents, ages of individuals, systolic blood pressure, etc.) However, z-tests can also be used when interested in proportions. (i.e. the proportion of people who get at least eight hours of sleep, etc.)

  2. t-test. A t-test assumes that our observations are independently drawn from a Normal distribution with unknown mean and unknown variance. Note that with a t-test, we do not know the population variance. This is far more common than knowing the population variance, so a t-test is generally more appropriate than a z-test, but practically there will be little difference between the two if sample sizes are large.

With z- and t-tests, your alternative hypothesis will be that the population mean (or population proportion) of one group is either not equal to, less than, or greater than the population mean (or proportion) of the other group. This will depend on the type of analysis you seek to do, but your null and alternative hypotheses directly compare the means/proportions of the two groups.
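To see the "little difference at large n" claim on the A/B data itself, one can encode each visit as a 0/1 outcome and run a t-test; with thousands of observations, the t and z p-values essentially coincide. A sketch using SciPy (Welch's variant, which does not assume equal variances):

```python
from scipy import stats

# Reconstruct per-visit binary outcomes from the A/B totals
group_a = [1] * 188 + [0] * (2069 - 188)  # version A: converted (1) or not (0)
group_b = [1] * 220 + [0] * (1826 - 220)  # version B

t_stat, p_t = stats.ttest_ind(group_a, group_b, equal_var=False)
print(p_t)  # ~0.0028, very close to the z-test / chi-squared p-values
```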

  3. Chi-squared test. Whereas z- and t-tests concern quantitative data (or proportions in the case of z), chi-squared tests are appropriate for qualitative data. Again, the assumption is that observations are independent of one another. In this case, you aren't seeking a particular relationship. Your null hypothesis is that no relationship exists between variable one and variable two. Your alternative hypothesis is that a relationship does exist. This doesn't give you specifics about the relationship (i.e., in which direction it goes), but it will provide evidence that a relationship does (or does not) exist between your independent variable and your groups.

  4. Fisher's exact test. One drawback of the chi-squared test is that it is asymptotic: the p-value is only an approximation, one that is good when sample sizes are large but that may be inaccurate when they are small. Fisher's exact test, by contrast, calculates the p-value exactly, without relying on approximations that can be poor for small samples.
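For intuition about the small-sample caveat, compare the two tests on a tiny 2x2 table where the chi-squared approximation is dubious but Fisher's test is exact (a sketch; the table here is a made-up illustration):

```python
from scipy import stats

# Tiny table: the expected count in every cell is 2, far below the usual
# "at least 5 expected per cell" rule of thumb for the chi-squared test
small = [[3, 1],
         [1, 3]]

_, p_exact = stats.fisher_exact(small)
chi2, p_approx, dof, expected = stats.chi2_contingency(small, correction=False)

print(p_exact)   # ~0.486 (exact)
print(p_approx)  # ~0.157 (asymptotic approximation, badly off here)
```

With the A/B data in the question, all expected counts are in the hundreds or thousands, so the asymptotic and exact p-values barely differ, as the question's own calculations show.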

I keep discussing sample sizes - different references will give you different metrics as to when your samples are large enough. I would just find a reputable source, look at their rule, and apply their rule to find the test you want. I would not "shop around," so to speak, until you find a rule that you "like."

Ultimately, the test you choose should be based on a) your sample size and b) what form you want your hypotheses to take. If you are looking for a specific effect from your A/B test (for example, my B group has higher test scores), then I would opt for a z-test or t-test, pending sample size and the knowledge of the population variance. If you want to show that a relationship merely exists (for example, my A group and B group are different based on the independent variable but I don't care which group has higher scores), then the chi-squared or Fisher's exact test is appropriate, depending on sample size.

Does this make sense? Hope this helps!

Matt Brems
source
Thanks for the detailed answer! I'm going to go through it in detail - I'm sure I'll have a few questions!
L Xandor
Could you further explain how the chi-squared and Fisher exact tests don't indicate the direction of an effect? If all inferential statistical tests provide a confidence level around whether two sample sets are drawn from different populations or the same population, then what is it about the mathematical theory that won't let you say the directional difference in mean values would hold (B group has higher score)?
Chris F
For clarity, chi-squared test and Fisher's exact test are doing the same thing but the p-value is calculated slightly differently. (It's an approximation under chi-squared and an exact calculation under Fisher's exact.) I'll address chi-squared and it will generalize to Fisher's. The issue here is the premise. "If all inferential statistics tests provides a confidence level around whether two samples are drawn from..." - that's not what the chi-squared test does. The null hypothesis for the chi-squared test is that there is no association and the alternative hypothesis...
Matt Brems
...is that there is some association between the two categorical variables. You're merely testing for the existence of an association and not pre-specifying a certain direction. (There are some lesser-known statistics out there that DO specify a certain relationship, so it is possible; however this is not what the chi-squared test is designed to do.) To infer afterward that there is a particular directional relationship based on a p-value that was calculated under a different set of hypotheses designed to just test for the existence of an association would be a mistake.
Matt Brems
As an example, consider the hypotheses H0: μ = 0 versus HA: μ ≠ 0, and say you perform a t-test and get a p-value of 0.08. You would fail to reject the null at α = 0.05. If your estimate for μ was above 0, you might be tempted to switch to the hypotheses H0: μ ≤ 0 versus HA: μ > 0, for which the same data give a p-value of 0.04, and declare significance. But choosing the direction of the test after seeing the data invalidates the p-value: the hypotheses must be specified before looking at the results.
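The one- versus two-sided relationship in that example can be checked numerically: for a symmetric distribution like the t, the two-sided p-value is exactly twice the one-sided p-value when the estimate lies on the alternative's side. A sketch with an arbitrary illustrative t-statistic:

```python
from scipy import stats

t_stat, df = 1.80, 30  # arbitrary illustrative values

p_one_sided = stats.t.sf(t_stat, df)           # H0: mu <= 0 vs HA: mu > 0
p_two_sided = 2 * stats.t.sf(abs(t_stat), df)  # H0: mu = 0  vs HA: mu != 0

print(p_one_sided, p_two_sided)  # two-sided is exactly double the one-sided
```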
Matt Brems