Lo que es tan intrigante sobre este resultado es cuánto se parece a la distribución de un coeficiente de correlación. Hay una razón.
Suponga que (X,Y) es bivariada normal con correlación cero y varianza común σ2 para ambas variables. Dibuja una muestra iid (x1,y1),…,(xn,yn) . Es bien sabido y fácilmente establecido geométricamente (como lo hizo Fisher hace un siglo) que la distribución del coeficiente de correlación de la muestra
r=∑ni=1(xi−x¯)(yi−y¯)(n−1)SxSy
es
f(r)=1B(12,n2−1)(1−r2)n/2−2, −1≤r≤1.
(Aquí, como de costumbre, y ˉ y son medias muestrales y S x y S y son las raíces cuadradas de los estimadores de varianza insesgados). B es la función Beta , para la cualx¯y¯SxSyB
1B(12,n2−1)=Γ(n−12)Γ(12)Γ(n2−1)=Γ(n−12)π−−√Γ(n2−1).(1)
Para calcular , podemos explotar su invariancia bajo rotaciones en R n alrededor de la línea generada por ( 1 , 1 , ... , 1 ) , junto con la invariancia de la distribución de la muestra bajo las mismas rotaciones, y elegir y i / S y para ser cualquier vector unitario cuyos componentes sumen cero. Uno de esos vectores es proporcional a v = ( n - 1 , - 1 , … , - 1 ) . Su desviación estándar esrRn(1,1,…,1)yi/Syv=(n−1,−1,…,−1)
Sv=1n−1((n−1)2+(−1)2+⋯+(−1)2)−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−√=n−−√.
En consecuencia, debe tener la misma distribución quer
∑ni=1(xi−x¯)(vi−v¯)(n−1)SxSv=(n−1)x1−x2−⋯−xn(n−1)Sxn−−√=n(x1−x¯)(n−1)Sxn−−√=n−−√n−1Z.
Por lo tanto, todo lo que necesitamos es reescalar para encontrar la distribución de Z :rZ
fZ(z)=∣∣n−−√n−1∣∣f(n−−√n−1z)=1B(12,n2−1)n−−√n−1(1−n(n−1)2z2)n/2−2
for |z|≤n−1n√. Formula (1) shows this is identical to that of the question.
Not entirely convinced? Here is the result of simulating this situation 100,000 times (with n=4, where the distribution is uniform).
The first histogram plots the correlation coefficients of (xi,yi),i=1,…,4 while the second histogram plots the correlation coefficients of (xi,vi),i=1,…,4) for a randomly chosen vector vi that remains fixed for all iterations. They are both uniform. The QQ-plot on the right confirms these distributions are essentially identical.
Here's the R
code that produced the plot.
n <- 4
n.sim <- 1e5
set.seed(17)
par(mfrow=c(1,3))
#
# Simulate spherical bivariate normal samples of size n each.
#
x <- matrix(rnorm(n.sim*n), n)
y <- matrix(rnorm(n.sim*n), n)
#
# Look at the distribution of the correlation of `x` and `y`.
#
sim <- sapply(1:n.sim, function(i) cor(x[,i], y[,i]))
hist(sim)
#
# Specify *any* fixed vector in place of `y`.
#
v <- c(n-1, rep(-1, n-1)) # The case in question
v <- rnorm(n) # Can use anything you want
#
# Look at the distribution of the correlation of `x` with `v`.
#
sim2 <- sapply(1:n.sim, function(i) cor(x[,i], v))
hist(sim2)
#
# Compare the two distributions.
#
qqplot(sim, sim2, main="QQ Plot")
Reference
R. A. Fisher, Frequency-distribution of the values of the correlation coefficient in samples from an indefinitely large population. Biometrika, 10, 507. See Section 3. (Quoted in Kendall's Advanced Theory of Statistics, 5th Ed., section 16.24.)
I'd like to suggest this way to get the pdf of Z by directly calculating the MVUE ofP(X≤c) using Bayes' theorem although it's handful and complex.
SinceE[I(−∞,c)(X1)]=P(X1≤c) and Z1=X¯ , Z2=S2 are joint complete sufficient statistic, MVUE of P(X≤c) would be like this:
Now using Bayes' theorem, we get
The denominatorfZ1,Z2(z1,z2)=fZ1(z1)fZ2(z2) can be written in closed form because Z1∼N(μ,σ2n) , Z2∼Γ(n−12,2σ2n−1) are independent of each other.
To get the closed form of numerator, we can adopt these statistics:
which is the mean and the sample variance ofX2,X3,...,Xn and they are independent of each other and also independent of X1 . We can express these in terms of Z1,Z2 .
We can use transformation whileX1=x1 ,
SinceW1∼N(μ,σ2n−1) , W2∼Γ(n−22,2σ2n−2) we can get the closed form of this.
Note that this holds only for w2≥0 which restricts x1 to z1−n−1n√z2−−√≤x1≤z1+n−1n√z2−−√ .
So put them all together, exponential terms would disappear and you'd get,
From this,at this point, we can get the pdf ofZ=X1−z1z2√ using transformation.
By the way, the MVUE would be like this :
I am not a native English speaker and there could be some awkward sentences. I am studying statistics by myself with text book introduction to mathmatical statistics by Hogg. So there could be some grammatical or mathmatical conceptual mistakes. It would be appreciated if someone correct them.
Thank you for reading.
fuente