Are there algorithms for computing "running" linear or logistic regression parameters?


The article "Accurately computing running variance" at http://www.johndcook.com/standard_deviation.html shows how to compute a running mean, variance, and standard deviation.

Are there algorithms for updating the parameters of a linear or logistic regression model "on the fly" in a similar way as each new training record arrives?

chl
With a large training set or a continuous stream of input data, you can use iterative algorithms such as stochastic gradient descent and consume the input in small batches as you go. Is that what you were asking?
andreister
Look up the RLS algorithm and its variants. en.wikipedia.org/wiki/Recursive_least_squares_filter
Memming

Answers:


The linear regression coefficients of y = ax + b are a = cov(x,y)/var(x) and b = mean(y) - a*mean(x).

So all you really need is an incremental method to compute cov(x,y). From this value, together with the variance of x and the means of x and y, you can compute the parameters a and b. As you will see in the pseudo code given below, the incremental computation of cov(x,y) is very similar to the incremental computation of var(x). This shouldn't be a surprise because var(x) = cov(x,x).

Here is the pseudo code you are probably looking for:

init(): meanX = 0, meanY = 0, varX = 0, covXY = 0, n = 0

update(x,y):
    n += 1
    dx = x - meanX
    dy = y - meanY
    varX += (((n-1)/n)*dx*dx - varX)/n
    covXY += (((n-1)/n)*dx*dy - covXY)/n
    meanX += dx/n
    meanY += dy/n

getA(): return covXY/varX
getB(): return meanY - getA()*meanX
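
For completeness, the same update written as a small, runnable Python class (a direct translation of the pseudo code above; the class and method names are mine, and varX/covXY are population quantities):

class RunningRegression:
    """Incrementally fits y = a*x + b from a stream of (x, y) pairs."""

    def __init__(self):
        self.mean_x = 0.0
        self.mean_y = 0.0
        self.var_x = 0.0    # population variance of x
        self.cov_xy = 0.0   # population covariance of x and y
        self.n = 0

    def update(self, x, y):
        self.n += 1
        dx = x - self.mean_x
        dy = y - self.mean_y
        self.var_x += (((self.n - 1) / self.n) * dx * dx - self.var_x) / self.n
        self.cov_xy += (((self.n - 1) / self.n) * dx * dy - self.cov_xy) / self.n
        self.mean_x += dx / self.n
        self.mean_y += dy / self.n

    def a(self):
        return self.cov_xy / self.var_x

    def b(self):
        return self.mean_y - self.a() * self.mean_x

Feeding it the points (0, 1), (1, 3), (2, 5) one at a time gives a = 2 and b = 1, as expected.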

I found this question while searching for an equivalent algorithm for incrementally computing a multivariate regression as R = (XᵀX)⁻¹XᵀY so that XR = Y + ε.
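
One simple way to do that incrementally is to maintain the sufficient statistics XᵀX and XᵀY as running sums and solve the normal equations whenever the coefficients are needed; a minimal NumPy sketch of that idea (class and variable names are illustrative, and in practice you would add regularization or a factorization update if XᵀX is ill-conditioned):

import numpy as np

class RunningMultipleRegression:
    """Accumulates XᵀX and XᵀY so that R = (XᵀX)⁻¹XᵀY can be recomputed at any time."""

    def __init__(self, n_features):
        self.xtx = np.zeros((n_features, n_features))
        self.xty = np.zeros(n_features)

    def update(self, x, y):
        x = np.asarray(x, dtype=float)
        self.xtx += np.outer(x, x)   # running sum of x xᵀ
        self.xty += y * x            # running sum of y x

    def coefficients(self):
        # lstsq is more forgiving than an explicit inverse when XᵀX is near-singular
        return np.linalg.lstsq(self.xtx, self.xty, rcond=None)[0]

Include a constant 1 as one of the features of each x if an intercept is wanted.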

chmike
Thank you for your contribution! The part of the question about linear regression is actually a duplicate of stats.stackexchange.com/questions/6920/… whose answers show how to update a multiple linear regression model. The present thread is allowed to stand because the logistic regression part of the question is of independent interest. Actually, even the logistic part has been duplicated at stats.stackexchange.com/questions/59174/….
whuber
I thought this answer would be useful considering the reference text given in the question. Thank you for the link. However, it's not what I'm looking for; my use case is apparently rather particular.
chmike
I believe it may be useful and is unique in offering working code.
whuber
May I ask why you multiply dx*dy by (n-1)/n?
FavorMylikes
Can you improve the code to calculate the p-value?
Nathan

For your two specific examples:

Linear Regression

The paper "Online Linear Regression and Its Application to Model-Based Reinforcement Learning" by Alexander Strehl and Michael Littman describes an algorithm called "KWIK Linear Regression" (see Algorithm 1) which provides an approximation to the linear regression solution using incremental updates. Note that this is not regularised (i.e. it is not Ridge Regression). I'm pretty sure that the method of Strehl & Littman cannot extend to that setting.
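
For exact (rather than approximate) incremental least squares, the recursive least squares (RLS) update mentioned in the comments is the standard alternative; it maintains an estimate of (XᵀX)⁻¹ and refreshes the coefficients at constant cost per observation. A minimal NumPy sketch (this is generic RLS, not the KWIK algorithm; the forgetting factor and initialisation are illustrative defaults):

import numpy as np

class RecursiveLeastSquares:
    """Classical RLS: least squares coefficients updated one observation at a time.

    lam is a forgetting factor in (0, 1]; lam = 1.0 corresponds to ordinary
    least squares over all data seen so far.
    """

    def __init__(self, n_features, lam=1.0, delta=1e6):
        self.w = np.zeros(n_features)
        self.P = delta * np.eye(n_features)   # running estimate of (XᵀX)⁻¹
        self.lam = lam

    def update(self, x, y):
        x = np.asarray(x, dtype=float)
        Px = self.P @ x
        k = Px / (self.lam + x @ Px)            # gain vector
        self.w += k * (y - x @ self.w)          # correct by the prediction error
        self.P = (self.P - np.outer(k, Px)) / self.lam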

Logistic Regression

This thread sheds some light on the matter. Quoting:

Even without a regularization constraint, logistic regression is a nonlinear optimization problem. Already this does not have an analytic solution, which is usually a prerequisite to deriving an update solution. With a regularization constraint, it becomes a constrained optimization problem. This introduces a whole new set of non-analytic complications on top of the ones that the unconstrained problem already had.

There are however other online (or incremental) methods for regression that you might want to look at, for example Locally Weighted Projection Regression (LWPR)
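
In practice, the usual workaround is to accept approximate rather than exact updates and run stochastic gradient descent on the logistic loss, one observation at a time. A minimal sketch (the learning rate and L2 penalty are arbitrary illustrative choices):

import numpy as np

class OnlineLogisticRegression:
    """Approximate online logistic regression via per-example gradient steps."""

    def __init__(self, n_features, lr=0.1, l2=0.0):
        self.w = np.zeros(n_features)
        self.b = 0.0
        self.lr = lr
        self.l2 = l2

    def predict_proba(self, x):
        return 1.0 / (1.0 + np.exp(-(x @ self.w + self.b)))

    def update(self, x, y):
        """One gradient step on the (optionally L2-regularised) log loss, y in {0, 1}."""
        x = np.asarray(x, dtype=float)
        g = self.predict_proba(x) - y           # d(log loss)/d(linear score)
        self.w -= self.lr * (g * x + self.l2 * self.w)
        self.b -= self.lr * g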

tdc
About logistic regression, I think you're unnecessarily pessimistic. Logistic regression is equivalent to computing posterior class probabilities for a two-class problem with each class Gaussian distributed, with different means and a shared covariance. The MLE for the covariance is just a weighted sum of the per-class covariances, so the sufficient statistics are just the per-class count, sum, and sum of squares. Obviously it's easy to contrive an exact update using the sufficient statistics.
Robert Dodier
@RobertDodier You've described linear discriminant analysis, not logistic regression. The last paragraph of the introductory section here clarifies the relationship.
ahfoss
@ahfoss Even if the per-class data are not normally distributed, one can still construct a model equivalent to logistic regression via per-class covariances.
Robert Dodier
@RobertDodier What is the equivalent model? You seem to be implying there is an obvious solution to a substantially difficult problem. Your solution is either more brilliant than you realize, or much less so.
ahfoss

As a general principle:

0) you keep the sufficient statistics and the current ML estimates

1) when you get new data, update the sufficient statistics and the estimates

2) When you don't have sufficient statistics you'll need to use all of the data.

3) Typically you don't have closed-form solutions; use the previous MLEs as the starting point, use some convenient optimization method to find the new optimum from there. You may need to experiment a bit to find which approaches make the best tradeoffs for your particular kinds of problem instances.

If your problem has a special structure, you can probably exploit it.
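
As a schematic illustration of points 2 and 3 for logistic regression (which has no finite-dimensional sufficient statistic): keep the data seen so far, and when a new batch arrives, warm-start a standard optimizer at the previous estimate, which usually converges in a few iterations. A sketch using scipy.optimize.minimize (the accumulation step is left abstract):

import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(beta, X, y):
    """Negative log likelihood of logistic regression, with y in {0, 1}."""
    z = X @ beta
    return np.sum(np.logaddexp(0.0, z) - y * z)   # sum of log(1 + e^z) - y*z

def refit(beta_prev, X_all, y_all):
    """Warm-start the optimizer at the previous MLE after new data arrive."""
    res = minimize(neg_log_likelihood, beta_prev, args=(X_all, y_all), method="BFGS")
    return res.x

# beta = np.zeros(p)
# for X_new, y_new in batches:
#     X_all, y_all = accumulate(X_new, y_new)   # hypothetical helper that appends the new batch
#     beta = refit(beta, X_all, y_all)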

A couple of potential references that may or may not have some value:

McMahan, H. B. and M. Streeter (2012),
Open Problem: Better Bounds for Online Logistic Regression,
JMLR: Workshop and Conference Proceedings, vol 23, 44.1–44.3

Penny, W.D. and S.J. Roberts (1999),
Dynamic Logistic Regression,
Proceedings IJCNN '99

Glen_b -Reinstate Monica
I agree with the idea of keeping sufficient statistics (if they exist for the problem), but doesn't the presence of sufficient statistics make the other stuff unnecessary? If you have the sufficient statistics, you can compute updated parameters exactly as if you used the entire data set. There is no need to take the current parameters into account and no need to experiment with optimization methods.
Robert Dodier
It's important to note that having sufficient statistics doesn't imply you have a solution to the equations, though.
Glen_b -Reinstate Monica

Adding to tdc's answer, there are no known methods to compute exact estimates of the coefficients at any point in time with just constant time per iteration. However, there are some alternatives which are reasonable and interesting.

The first model to look at is the online learning setting. In this setting, the world first announces a value of x, your algorithm predicts a value for y, the world announces the true value y', and your algorithm suffers a loss l(y,y'). For this setting it is known that simple algorithms (gradient descent and exponentiated gradient, among others) achieve sublinear regret. This means that the total extra loss your algorithm incurs (compared to the best possible fixed linear predictor in hindsight) grows more slowly than the number of examples, so the average extra loss per example goes to zero as you see more data. This works even in adversarial settings. There is a good paper explaining one popular strategy for proving these regret bounds. Shai Shalev-Shwartz's lecture notes are also useful.
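
To make the protocol concrete, here is a bare-bones online gradient descent learner for squared loss (the 1/√t step size is the standard choice behind such regret bounds; the projection onto a bounded comparison set used in the formal analyses is omitted for brevity):

import numpy as np

class OnlineGradientDescent:
    """Online protocol: predict, observe the true target, suffer squared loss, update."""

    def __init__(self, n_features, eta=0.1):
        self.w = np.zeros(n_features)
        self.eta = eta
        self.t = 0

    def predict(self, x):
        return x @ self.w

    def update(self, x, y_true):
        self.t += 1
        grad = 2.0 * (self.predict(x) - y_true) * x       # gradient of (w·x - y)²
        self.w -= (self.eta / np.sqrt(self.t)) * grad     # step size proportional to 1/√t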

There is an extension of the online learning setting called the bandit setting where your algorithm is only given a number representing how wrong it was (and no pointer to the right answer). Impressively, many results from online learning carry over to this setting, except here one is forced to explore as well as exploit, which leads to all sorts of interesting challenges.

Alexandre Passos

Other answers have pointed to the world of machine learning, and that is certainly one place where this problem has been addressed.

However, another approach that may be better suited to your needs is the use of the QR factorization with low-rank updates. Approaches to doing this, and to using it to solve least squares problems, are given in:

Updating the QR factorization and the least squares problem by Hammarling and Lucas.
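
Not the algorithm from that paper, but a compact way to see the idea: keep only the triangular factor R and z = Qᵀy from a thin QR factorization; when a new row arrives, re-triangularize the small stacked system, so the cost per update depends only on the number of predictors, not on how many rows have been seen. A rough NumPy sketch:

import numpy as np

class QRLeastSquares:
    """Least squares via a row-updated QR factorization.

    Keeps R (p x p, upper triangular) and z = Qᵀy (length p); solving
    R beta = z gives the least squares solution over all rows seen so far
    (once at least p linearly independent rows have arrived).
    """

    def __init__(self, n_features):
        self.R = np.zeros((n_features, n_features))
        self.z = np.zeros(n_features)

    def update(self, x, y):
        x = np.asarray(x, dtype=float)
        A = np.vstack([self.R, x])        # stack the new row under R: (p+1) x p
        b = np.append(self.z, y)
        Q1, R1 = np.linalg.qr(A)          # thin QR; re-triangularize at O(p³) cost
        self.R = R1
        self.z = Q1.T @ b

    def coefficients(self):
        return np.linalg.solve(self.R, self.z)

Dedicated updating routines (e.g. Givens rotations, as in the reference above) do the same re-triangularization more cheaply and stably; this version just shows the structure.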



You can use a standard Kalman filter package in R for this: sspir, dlm, kfas, etc. I feel that the KF is a much more developed area than online learning, so it may be more practical. You can use the model

y_t = β_t x_t + ε_t,   β_t = β_{t-1} + η_t

to allow your regression coefficients to vary slowly with time; the KF will then re-estimate them on each step (at constant cost per step) based on the most recent data. Alternatively, you can hold them constant,

β_t = β_{t-1},

and the KF will still re-estimate them on each step, this time assuming they are constant and simply incorporating each new observation to produce better and better estimates of the same coefficient values.

You can formulate a similar model for logistic regression,

y_t = logit⁻¹(β_t x_t + ε_t),   β_t = β_{t-1} + η_t

Since it is non-linear, you will need to use a non-linear filtering method from the above packages, such as the EKF or UKF.
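
For readers outside R, the same random-walk-coefficients model is easy to filter by hand. A minimal NumPy Kalman filter for y_t = x_tᵀβ_t + ε_t, β_t = β_{t-1} + η_t (the noise variances q and r are assumed known here; in practice you would tune or estimate them, and q = 0 gives the constant-coefficients case):

import numpy as np

class KalmanRegression:
    """Kalman filter for y_t = x_tᵀ β_t + ε_t with β_t = β_{t-1} + η_t."""

    def __init__(self, n_features, q=1e-4, r=1.0):
        self.beta = np.zeros(n_features)
        self.P = np.eye(n_features) * 1e3    # large initial uncertainty on β
        self.Q = np.eye(n_features) * q      # random-walk variance of each coefficient
        self.r = r                           # observation noise variance

    def update(self, x, y):
        x = np.asarray(x, dtype=float)
        P_pred = self.P + self.Q             # predict: coefficients follow a random walk
        Px = P_pred @ x
        s = x @ Px + self.r                  # innovation variance
        k = Px / s                           # Kalman gain
        self.beta = self.beta + k * (y - x @ self.beta)
        self.P = P_pred - np.outer(k, Px)

    def predict(self, x):
        return np.asarray(x, dtype=float) @ self.beta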
Kochede

This is to add to @chmike's answer.

The method appears to be similar to B. P. Welford's online algorithm for the standard deviation, which also calculates the mean. John Cook gives a good explanation here. Tony Finch (2009) provides a method for an exponentially weighted moving average and standard deviation:

diff := x - mean 
incr := alpha * diff 
mean := mean + incr 
variance := (1 - alpha) * (variance + diff * incr)

Looking at the previously posted answer and expanding upon it to include an exponential moving window:

init(): 
    meanX = 0, meanY = 0, varX = 0, covXY = 0, n = 0,
    meanXY = 0, varY = 0, desiredAlpha=0.01 #additional variables for correlation

update(x,y):
    n += 1
    alpha=max(desiredAlpha,1/n) #to handle initial conditions

    dx = x - meanX
    dy = y - meanY
    dxy = (x*y) - meanXY #needed for cor

    varX += ((1-alpha)*dx*dx - varX)*alpha
    varY += ((1-alpha)*dy*dy - varY)*alpha #needed for corXY
    covXY += ((1-alpha)*dx*dy - covXY)*alpha

    #alternate method: varX = (1-alpha)*(varX+dx*dx*alpha)
    #alternate method: varY = (1-alpha)*(varY+dy*dy*alpha) #needed for corXY
    #alternate method: covXY = (1-alpha)*(covXY+dx*dy*alpha)

    meanX += dx * alpha
    meanY += dy * alpha
    meanXY += dxy  * alpha

getA(): return covXY/varX
getB(): return meanY - getA()*meanX
corXY(): return (meanXY - meanX * meanY) / ( sqrt(varX) * sqrt(varY) )

In the above code, desiredAlpha can be set to 0, in which case the code operates without exponential weighting (plain running statistics). For an effective moving window of size desiredWindowSize, set desiredAlpha to 1/desiredWindowSize, as suggested by Modified_moving_average.

Side question: of the alternative calculations above, any comments on which is better from a precision standpoint?

References:

chmike (2013). https://stats.stackexchange.com/a/79845/70282

Cook, John (n.d.). Accurately computing running variance. http://www.johndcook.com/blog/standard_deviation/

Finch, Tony (2009). Incremental calculation of weighted mean and variance. https://fanf2.user.srcf.net/hermes/doc/antiforgery/stats.pdf

Wikipedia (n.d.). Welford's online algorithm. https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Online_algorithm

Chris