LASSO and ridge from the Bayesian perspective: what about the tuning parameter?

Penalized regression estimators such as LASSO and ridge are said to correspond to Bayesian estimators with certain priors.

Yes, that is correct. Whenever we have an optimisation problem involving the maximisation of the log-likelihood function plus a penalty function on the parameters, this is mathematically equivalent to posterior maximisation where the penalty function is taken to be the logarithm of a prior kernel.$^\dagger$ To see this, suppose we have a penalty function $w$ using a tuning parameter $\lambda$. The objective function in these cases can be written as:

$\begin{equation} \begin{aligned} H_\mathbf{x}(\theta|\lambda) &= \ell_\mathbf{x}(\theta) - w(\theta|\lambda) \\[6pt] &= \ln \Big( L_\mathbf{x}(\theta) \cdot \exp ( -w(\theta|\lambda)) \Big) \\[6pt] &= \ln \Bigg( \frac{L_\mathbf{x}(\theta) \pi (\theta|\lambda)}{\int L_\mathbf{x}(\theta) \pi (\theta|\lambda) d\theta} \Bigg) + \text{const} \\[6pt] &= \ln \pi(\theta|\mathbf{x}, \lambda) + \text{const}, \\[6pt] \end{aligned} \end{equation}$

where we use the prior $\pi(\theta|\lambda) \propto \exp ( -w(\theta|\lambda))$. Observe here that the tuning parameter in the optimisation is treated as a fixed hyperparameter in the prior distribution. If you are undertaking classical optimisation with a fixed tuning parameter, this is equivalent to undertaking a Bayesian optimisation with a fixed hyperparameter. For LASSO and ridge regression, the penalty functions and corresponding prior-equivalents are:

$\begin{equation} \begin{aligned} \text{LASSO Regression} & & \pi(\theta|\lambda) &= \prod_{k=1}^m \text{Laplace} \Big( 0, \frac{1}{\lambda} \Big) = \prod_{k=1}^m \frac{\lambda}{2} \cdot \exp ( -\lambda |\theta_k| ), \\[6pt] \text{Ridge Regression} & & \pi(\theta|\lambda) &= \prod_{k=1}^m \text{Normal} \Big( 0, \frac{1}{2\lambda} \Big) = \prod_{k=1}^m \sqrt{\lambda/\pi} \cdot \exp ( -\lambda \theta_k^2 ). \\[6pt] \end{aligned} \end{equation}$

The former method penalises the regression coefficients according to their absolute magnitude, which is the equivalent of imposing a Laplace prior located at zero. The latter method penalises the regression coefficients according to their squared magnitude, which is the equivalent of imposing a normal prior located at zero.
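As a rough numerical check of the correspondence above, here is a minimal Python sketch. The data `x`, the known variance `sigma2`, the value of `lam`, and the one-parameter Gaussian-mean model are all assumptions chosen for illustration, not part of the original argument. It checks two things: that each log-prior differs from the negative penalty only by a constant in $\theta$, and that the penalised maximiser matches the conjugate posterior mode under the Normal$(0, 1/(2\lambda))$ prior.

```python
import math

def laplace_logpdf(theta, lam):
    # Laplace(0, 1/lam): density (lam/2) * exp(-lam * |theta|)
    return math.log(lam / 2.0) - lam * abs(theta)

def normal_logpdf(theta, lam):
    # Normal(0, variance 1/(2*lam)): density sqrt(lam/pi) * exp(-lam * theta^2)
    return 0.5 * math.log(lam / math.pi) - lam * theta ** 2

lam = 1.7                              # illustrative tuning parameter
thetas = [-2.0, -0.3, 0.0, 0.5, 3.1]

# log-prior + penalty should be the same number for every theta
# (it is just the log normalising constant of the prior).
lasso_offsets = [laplace_logpdf(t, lam) + lam * abs(t) for t in thetas]
ridge_offsets = [normal_logpdf(t, lam) + lam * t ** 2 for t in thetas]

# Toy model (assumption): x_i ~ Normal(theta, sigma2) with sigma2 known.
x = [1.2, 0.8, 1.5, 0.9, 1.1]
sigma2 = 1.0

def penalised_objective(theta):
    # log-likelihood (up to a constant) minus the ridge penalty
    loglik = -sum((xi - theta) ** 2 for xi in x) / (2.0 * sigma2)
    return loglik - lam * theta ** 2

def golden_section_max(f, lo, hi, tol=1e-9):
    # Golden-section search for the maximiser of a unimodal function on [lo, hi].
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) > f(d):
            b = d
        else:
            a = c
    return 0.5 * (a + b)

# Frequentist side: maximise the penalised log-likelihood numerically.
theta_penalised = golden_section_max(penalised_objective, -10.0, 10.0)

# Bayesian side: conjugate normal posterior mode with prior precision 2*lam.
theta_map = (sum(x) / sigma2) / (len(x) / sigma2 + 2.0 * lam)

print(theta_penalised, theta_map)
```

The two printed values should agree up to the tolerance of the numerical search, which is the fixed-$\lambda$ equivalence in its simplest form.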

Now a frequentist would optimize the tuning parameter by cross-validation. Is there a Bayesian equivalent of doing so, and is it used at all?

So long as the frequentist method can be posed as an optimisation problem (rather than, say, including a hypothesis test or something like this), there will be a Bayesian analogy using an equivalent prior. Just as the frequentist may treat the tuning parameter $\lambda$ as unknown and estimate it from the data, the Bayesian may similarly treat the hyperparameter $\lambda$ as unknown. In a full Bayesian analysis this would involve giving the hyperparameter its own prior and finding the posterior maximum under this prior, which would be analogous to maximising the following objective function:

$\begin{equation} \begin{aligned} H_\mathbf{x}(\theta, \lambda) &= \ell_\mathbf{x}(\theta) - w(\theta|\lambda) - h(\lambda) \\[6pt] &= \ln \Big( L_\mathbf{x}(\theta) \cdot \exp ( -w(\theta|\lambda)) \cdot \exp ( -h(\lambda)) \Big) \\[6pt] &= \ln \Bigg( \frac{L_\mathbf{x}(\theta) \pi (\theta|\lambda) \pi (\lambda)}{\int L_\mathbf{x}(\theta) \pi (\theta|\lambda) \pi (\lambda) d\theta} \Bigg) + \text{const} \\[6pt] &= \ln \pi(\theta, \lambda|\mathbf{x}) + \text{const}. \\[6pt] \end{aligned} \end{equation}$

This method is indeed used in Bayesian analysis in cases where the analyst is not comfortable choosing a specific hyperparameter for their prior, and seeks to make the prior more diffuse by treating the hyperparameter as unknown and giving it a distribution. (Note that this is just an implicit way of giving a more diffuse prior to the parameter of interest $\theta$.)
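To make the joint maximisation over $(\theta, \lambda)$ concrete, here is a small Python sketch under several assumptions of mine: the same toy Gaussian-mean model with known `sigma2`, a normalised Normal$(0, 1/(2\lambda))$ prior (so the $\lambda$-dependent normalising constant is kept inside $-w(\theta|\lambda)$, which matters once $\lambda$ is free), and a hypothetical Gamma hyper-prior with shape `a` and rate `b`. Neither the hyper-prior nor the coordinate-ascent scheme comes from the argument above; they are just one convenient choice with closed-form updates.

```python
import math

x = [1.2, 0.8, 1.5, 0.9, 1.1]   # toy data (assumption)
sigma2 = 1.0                    # known observation variance (assumption)
a, b = 2.0, 1.0                 # hypothetical Gamma(shape a, rate b) hyper-prior

def joint_objective(theta, lam):
    # l_x(theta) - w(theta|lam) - h(lam), dropping constants free of (theta, lam)
    loglik = -sum((xi - theta) ** 2 for xi in x) / (2.0 * sigma2)
    log_prior = 0.5 * math.log(lam / math.pi) - lam * theta ** 2
    log_hyper = (a - 1.0) * math.log(lam) - b * lam
    return loglik + log_prior + log_hyper

def joint_map(n_iter=200):
    # Coordinate ascent: each update is the exact maximiser in one coordinate.
    n, sx = len(x), sum(x)
    theta, lam = 0.0, 1.0
    for _ in range(n_iter):
        theta = sx / (n + 2.0 * sigma2 * lam)      # d/dtheta = 0
        lam = (a - 0.5) / (theta ** 2 + b)         # d/dlam = 0 (needs a > 0.5)
    return theta, lam

theta_hat, lam_hat = joint_map()
print(theta_hat, lam_hat, joint_objective(theta_hat, lam_hat))
```

With a different hyper-prior the $\lambda$ update would generally have no closed form, and a numerical optimiser over both coordinates would be needed instead.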

(Comment from statslearner2 below) I'm looking for numerically equivalent MAP estimates. For instance, for a fixed-penalty ridge there is a Gaussian prior that will give me a MAP estimate exactly equal to the ridge estimate. Now, for k-fold CV ridge, what is the hyper-prior that would give me a MAP estimate similar to the CV-ridge estimate?

Before proceeding to look at $K$-fold cross-validation, it is first worth noting that, mathematically, the maximum a posteriori (MAP) method is simply an optimisation of a function of the parameter $\theta$ and the data $\mathbf{x}$. If you are willing to allow improper priors, then the scope encapsulates any optimisation problem involving a function of these variables. Thus, any frequentist method that can be framed as a single optimisation problem of this kind has a MAP analogy, and any frequentist method that cannot be framed as a single optimisation of this kind does not have a MAP analogy.

In the above form of model, involving a penalty function with a tuning parameter, $K$-fold cross-validation is commonly used to estimate the tuning parameter $\lambda$. For this method you partition the data vector $\mathbf{x}$ into $K$ sub-vectors $\mathbf{x}_1,...,\mathbf{x}_K$. For each sub-vector $k=1,...,K$ you fit the model with the "training" data $\mathbf{x}_{-k}$ and then measure the fit of the model with the "testing" data $\mathbf{x}_k$. In each fit you get an estimator for the model parameters, which then gives you predictions of the testing data, which can then be compared to the actual testing data to give a measure of "loss":

$\begin{matrix} \text{Estimator} & & \hat{\theta}(\mathbf{x}_{-k}, \lambda), \\[6pt] \text{Predictions} & & \hat{\mathbf{x}}_k(\mathbf{x}_{-k}, \lambda), \\[6pt] \text{Testing loss} & & \mathscr{L}_k(\hat{\mathbf{x}}_k, \mathbf{x}_k| \mathbf{x}_{-k}, \lambda). \\[6pt] \end{matrix}$

The loss measures for each of the $K$ "folds" can then be aggregated to get an overall loss measure for the cross-validation:

$\mathscr{L}(\mathbf{x}, \lambda) = \sum_k \mathscr{L}_k(\hat{\mathbf{x}}_k, \mathbf{x}_k| \mathbf{x}_{-k}, \lambda)$

One then estimates the tuning parameter by minimising the overall loss measure:

$\hat{\lambda} \equiv \hat{\lambda}(\mathbf{x}) \equiv \underset{\lambda}{\text{arg min }} \mathscr{L}(\mathbf{x}, \lambda).$
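The cross-validation steps above can be sketched in a few lines of Python. Everything specific here is an assumption for illustration: the toy data, the Gaussian-mean model with known `sigma2`, the ridge-type estimator, squared-error testing loss, interleaved folds, and a grid search standing in for the arg min over $\lambda$.

```python
def ridge_mean(train, lam, sigma2=1.0):
    # Estimator theta_hat(x_{-k}, lam): maximiser of
    #   -(1 / (2 * sigma2)) * sum((x_i - theta)^2) - lam * theta^2
    return sum(train) / (len(train) + 2.0 * sigma2 * lam)

def k_fold_split(x, K):
    # Partition x into K sub-vectors x_1, ..., x_K (interleaved, no shuffling).
    return [x[k::K] for k in range(K)]

def cv_loss(x, lam, K=4):
    # Aggregate loss L(x, lam) = sum_k L_k(x_hat_k, x_k | x_{-k}, lam)
    folds = k_fold_split(x, K)
    total = 0.0
    for k in range(K):
        test = folds[k]
        train = [xi for j, fold in enumerate(folds) if j != k for xi in fold]
        theta_hat = ridge_mean(train, lam)        # fit on the training data
        # The prediction for every test point is theta_hat; squared-error loss.
        total += sum((xi - theta_hat) ** 2 for xi in test)
    return total

x = [0.4, -0.9, 1.3, 0.2, -0.5, 1.1, 0.7, -0.3]   # toy data (assumption)
lam_grid = [0.1 * i for i in range(51)]           # candidate values 0.0 to 5.0

# lam_hat(x) = arg min over the grid of the aggregate cross-validation loss
lam_hat = min(lam_grid, key=lambda lam: cv_loss(x, lam))
print(lam_hat, cv_loss(x, lam_hat))
```

Note that the resulting `lam_hat` depends only on the data and the grid, not on any particular value of $\theta$, which is what the next step relies on.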

We can see that this is an optimisation problem, and so we now have two separate optimisation problems (i.e., the one described in the sections above for $\theta$, and the one described here for $\lambda$). Since the latter optimisation does not involve $\theta$, we can combine these optimisations into a single problem, with some technicalities that I discuss below. To do this, consider the optimisation problem with objective function:

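$\begin{equation} \begin{aligned} \mathcal{H}_\mathbf{x}(\theta, \lambda) &= \ell_\mathbf{x}(\theta) - w(\theta|\lambda) - \delta \mathscr{L}(\mathbf{x}, \lambda), \\[6pt] \end{aligned} \end{equation}$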

where $\delta > 0$ is a weighting value on the tuning-loss. As $\delta \rightarrow \infty$ the weight on optimisation of the tuning-loss becomes infinite and so the optimisation problem yields the estimated tuning parameter from $K$ -fold cross-validation (in the limit). The remaining part of the objective function is the standard objective function conditional on this estimated value of the tuning parameter. Now, unfortunately, taking $\delta = \infty$ screws up the optimisation problem, but if we take $\delta$ to be a very large (but still finite) value, we can approximate the combination of the two optimisation problems up to arbitrary accuracy.
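Here is a hedged Python sketch of this combined objective, reusing the same illustrative setup as before (toy data, Gaussian-mean model with known `sigma2`, penalty $w(\theta|\lambda) = \lambda \theta^2$, squared-error cross-validation loss, and a grid over $\lambda$). All of these are my assumptions for the example. Because the term $\delta \mathscr{L}(\mathbf{x}, \lambda)$ is free of $\theta$, the inner maximisation over $\theta$ is just the ridge estimate, and the outer search is over $\lambda$ alone.

```python
def ridge_mean(train, lam, sigma2=1.0):
    # Maximiser in theta of the penalised log-likelihood for fixed lam.
    return sum(train) / (len(train) + 2.0 * sigma2 * lam)

def cv_loss(x, lam, K=4, sigma2=1.0):
    # Aggregate K-fold squared-error loss L(x, lam), using interleaved folds.
    folds = [x[k::K] for k in range(K)]
    total = 0.0
    for k in range(K):
        train = [xi for j, fold in enumerate(folds) if j != k for xi in fold]
        theta_hat = ridge_mean(train, lam, sigma2)
        total += sum((xi - theta_hat) ** 2 for xi in folds[k])
    return total

def profiled_objective(x, lam, delta, sigma2=1.0):
    # max over theta of  l_x(theta) - lam * theta^2 - delta * L(x, lam)
    theta = ridge_mean(x, lam, sigma2)
    loglik = -sum((xi - theta) ** 2 for xi in x) / (2.0 * sigma2)
    return loglik - lam * theta ** 2 - delta * cv_loss(x, lam, sigma2=sigma2)

def combined_fit(x, lam_grid, delta):
    # Single optimisation over (theta, lam), with theta profiled out.
    best_lam = max(lam_grid, key=lambda lam: profiled_objective(x, lam, delta))
    return ridge_mean(x, best_lam), best_lam

x = [0.4, -0.9, 1.3, 0.2, -0.5, 1.1, 0.7, -0.3]   # toy data (assumption)
lam_grid = [0.1 * i for i in range(51)]

min_cv_loss = min(cv_loss(x, lam) for lam in lam_grid)
for delta in (0.0, 1.0, 1e8):
    theta_c, lam_c = combined_fit(x, lam_grid, delta)
    # The gap to the best cross-validation loss shrinks as delta grows.
    print(delta, lam_c, theta_c, cv_loss(x, lam_c) - min_cv_loss)
```

With `delta = 0.0` the search simply picks the smallest $\lambda$ on the grid, since the penalty can only lower the objective. For large `delta`, the selected $\lambda$ has a cross-validation loss within roughly (range of the penalised log-likelihood)/`delta` of the minimum.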

From the above analysis we can see that it is possible to form a MAP analogy to the model-fitting and $K$-fold cross-validation process. This is not an exact analogy, but it is a close one, up to arbitrary accuracy. It is also important to note that the MAP analogy no longer shares the same likelihood function as the original problem, since the loss function depends on the data and is thus absorbed as part of the likelihood rather than the prior. In fact, the full analogy is as follows:

$\begin{equation} \begin{aligned} \mathcal{H}_\mathbf{x}(\theta, \lambda) &= \ell_\mathbf{x}(\theta) - w(\theta|\lambda) - \delta \mathscr{L}(\mathbf{x}, \lambda) \\[6pt] &= \ln \Bigg( \frac{L_\mathbf{x}^*(\theta, \lambda) \pi (\theta, \lambda)}{\int L_\mathbf{x}^*(\theta, \lambda) \pi (\theta, \lambda) d\theta} \Bigg) + \text{const}, \\[6pt] \end{aligned} \end{equation}$

where $L_\mathbf{x}^*(\theta, \lambda) \propto \exp( \ell_\mathbf{x}(\theta) - \delta \mathscr{L}(\mathbf{x}, \lambda))$ and $\pi (\theta, \lambda) \propto \exp( -w(\theta|\lambda))$ , with a fixed (and very large) hyper-parameter $\delta$ .

$^\dagger$ This gives an improper prior in cases where the penalty does not correspond to the logarithm of a sigma-finite density.

Reinstate Monica

Ok +1 already, but for the bounty I'm looking for these more precise answers.

statslearner2

1. I do not get how (since frequentists generally use classical hypothesis tests, etc., which have no Bayesian equivalent) connects to the rest of what I or you are saying; parameter tuning has nothing to do with hypothesis tests, or does it? 2. Do I understand you correctly that there is no Bayesian equivalent to frequentist regularized estimation when the tuning parameter is selected by cross validation? What about empirical Bayes that amoeba mentions in the comments to the OP?

Richard Hardy

3. Since regularization with cross validation seems to be quite effective for, say, prediction, doesn't point 2. suggest that the Bayesian approach is somehow inferior?

Richard Hardy

@Ben, thanks for your explicit answer and the subsequent clarifications. You have once again done a wonderful job! Regarding 3., yes, it was quite a jump; it certainly is not a strict logical conclusion. But looking at your points w.r.t. 2. (that a Bayesian method can approximate the frequentist penalized optimization with cross validation), I no longer think that Bayesian must be "inferior". The last quibble on my side is, could you perhaps explain how the last, complicated formula could arise in practice in the Bayesian paradigm? Is it something people would normally use or not?

Richard Hardy

@Ben (ctd) My problem is that I know little about Bayes. Once it gets technical, I may easily lose the perspective. So I wonder whether this complicated analogy (the last formula) is something that is just a technical possibility or rather something that people routinely use. In other words, I am interested in whether the idea behind cross validation (here in the context of penalized estimation) is resounding in the Bayesian world, whether its advantages are utilized there. Perhaps this could be a separate question, but a short description will suffice for this particular case.

Richard Hardy
