Un documento "Cálculo preciso de la varianza de carrera" en http://www.johndcook.com/standard_deviation.html muestra cómo calcular la media de carrera, la varianza y las desviaciones estándar.
¿Existen algoritmos en los que los parámetros de un modelo de regresión lineal o logística puedan actualizarse "dinámicamente" de manera similar a medida que se proporciona cada nuevo registro de entrenamiento?
Respuestas:
The linear regression coefficients of y = ax + b are a = cov(x,y)/var(x) and b = mean(y) − a·mean(x).
So all you really need is an incremental method to compute cov(x,y). From this value, the variance of x, and the means of both y and x, you can compute the parameters a and b.
As you will see in the pseudo code given below, incremental computation of cov(x,y) is very similar to incremental computation of var(x). This shouldn't be a surprise, because var(x) = cov(x,x).
Here is the pseudo code you are probably looking for:
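A Python sketch of those incremental updates (the variable names are illustrative, not the original pseudo code):

```python
# Running means of x and y plus running var(x) and cov(x,y) are enough to
# recover a and b at any time (Welford-style updates).
class RunningSimpleRegression:
    def __init__(self):
        self.n = 0
        self.mean_x = 0.0
        self.mean_y = 0.0
        self.var_x = 0.0    # population variance of the x values seen so far
        self.cov_xy = 0.0   # population covariance of x and y seen so far

    def update(self, x, y):
        self.n += 1
        dx = x - self.mean_x               # deviation from the *old* mean of x
        self.mean_x += dx / self.n
        self.mean_y += (y - self.mean_y) / self.n
        # the second factor uses the *updated* means (Welford's trick)
        self.var_x += (dx * (x - self.mean_x) - self.var_x) / self.n
        self.cov_xy += (dx * (y - self.mean_y) - self.cov_xy) / self.n

    def slope(self):       # a = cov(x,y) / var(x)
        return self.cov_xy / self.var_x

    def intercept(self):   # b = mean(y) - a * mean(x)
        return self.mean_y - self.slope() * self.mean_x
```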
I found this question while searching for an equivalent algorithm that incrementally computes a multivariate regression R = (X′X)^(-1) X′Y so that XR = Y + ϵ.
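One simple incremental route to that multivariate case, assuming the number of features p is small enough to hold a p-by-p matrix, is to accumulate the sufficient statistics X′X and X′Y one row at a time and solve only when an estimate is needed; a sketch (all names illustrative):

```python
import numpy as np

# Accumulate the sufficient statistics X'X (p x p) and X'Y (p,) one observation
# row at a time, and solve the normal equations R = (X'X)^(-1) X'Y on demand.
class IncrementalMultivariateRegression:
    def __init__(self, p):
        self.xtx = np.zeros((p, p))
        self.xty = np.zeros(p)

    def update(self, x_row, y):
        x_row = np.asarray(x_row, dtype=float)
        self.xtx += np.outer(x_row, x_row)
        self.xty += y * x_row

    def coefficients(self):
        # lstsq instead of an explicit inverse, so a rank-deficient X'X does not blow up
        return np.linalg.lstsq(self.xtx, self.xty, rcond=None)[0]
```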
Source
For your two specific examples:
Linear Regression
The paper "Online Linear Regression and Its Application to Model-Based Reinforcement Learning" by Alexander Strehl and Michael Littman describes an algorithm called "KWIK Linear Regression" (see Algorithm 1) which provides an approximation to the linear regression solution using incremental updates. Note that this is not regularised (i.e. it is not ridge regression). I'm pretty sure that the method of Strehl & Littman cannot extend to that setting.
Logistic Regression
This thread sheds some light on the matter. Quoting:
There are, however, other online (or incremental) methods for regression that you might want to look at, for example Locally Weighted Projection Regression (LWPR).
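For a concrete minimal illustration of an incremental approach, here is plain stochastic gradient descent on the logistic loss; this is a sketch, not the LWPR method mentioned above, and the constant learning rate is an assumption:

```python
import numpy as np

# One incremental pass per example: predict a probability, then take a gradient
# step on the log loss for that single example (y in {0, 1}).
class SGDLogisticRegression:
    def __init__(self, dim, eta=0.05):
        self.w = np.zeros(dim)
        self.eta = eta

    def predict_proba(self, x):
        return 1.0 / (1.0 + np.exp(-self.w @ x))

    def update(self, x, y):
        grad = (self.predict_proba(x) - y) * x   # gradient of the log loss for one example
        self.w -= self.eta * grad
```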
Source
As a general principle:
0) You keep the sufficient statistics and the current ML estimates.
1) When you get new data, update the sufficient statistics and the estimates.
2) When you don't have sufficient statistics, you'll need to use all of the data.
3) Typically you don't have closed-form solutions; use the previous MLEs as the starting point and use some convenient optimization method to find the new optimum from there. You may need to experiment a bit to find which approaches make the best tradeoffs for your particular kinds of problem instances (a sketch of this warm-start idea follows below).
If your problem has a special structure, you can probably exploit it.
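A sketch of point 3) for logistic regression; the data containers, the names, and the use of scipy's BFGS optimizer are assumptions for illustration, and the point is simply to warm-start each refit at the previous estimate:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_lik(beta, X, y):
    z = X @ beta
    # negative log-likelihood of logistic regression, written stably:
    # sum over examples of log(1 + exp(z)) - y * z, with y in {0, 1}
    return np.sum(np.logaddexp(0.0, z) - y * z)

def warm_start_refit(X_all, y_all, beta_prev):
    # refit on all data seen so far, starting the optimizer at the previous MLE
    result = minimize(neg_log_lik, beta_prev, args=(X_all, y_all), method="BFGS")
    return result.x
```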
A couple of potential references that may or may not have some value:
McMahan, H. B. and M. Streeter (2012),
Open Problem: Better Bounds for Online Logistic Regression,
JMLR: Workshop and Conference Proceedings, vol 23, 44.1–44.3
Penny, W.D. and S.J. Roberts (1999),
Dynamic Logistic Regression,
Proceedings IJCNN '99
Source
Adding to tdc's answer, there are no known methods to compute exact estimates of the coefficients at any point in time with just constant time per iteration. However, there are some alternatives which are reasonable and interesting.
The first model to look at is the online learning setting. In this setting, the world first announces a value of x, your algorithm predicts a value for y, the world announces the true value y', and your algorithm suffers a loss l(y,y'). For this setting it is known that simple algorithms (gradient descent and exponentiated gradient, among others) achieve sublinear regret. This means that, as you see more examples, the extra loss your algorithm incurs (compared to the best possible fixed linear predictor) grows only sublinearly, so the average extra loss per example goes to zero. This works even in adversarial settings. There is a good paper explaining one popular strategy to prove these regret bounds. Shai Shalev-Shwartz's lecture notes are also useful.
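A minimal sketch of that setting with online gradient descent on the squared loss; the step-size schedule eta0/sqrt(t) and all names are illustrative assumptions, not taken from the answer:

```python
import numpy as np

# Predict, observe the true value, suffer the loss, then take one gradient step.
class OnlineGradientDescentRegressor:
    def __init__(self, dim, eta0=0.1):
        self.w = np.zeros(dim)
        self.eta0 = eta0
        self.t = 0

    def predict(self, x):
        return float(self.w @ x)

    def update(self, x, y_true):
        # called after the true value is revealed
        self.t += 1
        eta = self.eta0 / np.sqrt(self.t)
        grad = 2.0 * (self.predict(x) - y_true) * x   # gradient of (w.x - y)^2
        self.w -= eta * grad
```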
There is an extension of the online learning setting called the bandit setting where your algorithm is only given a number representing how wrong it was (and no pointer to the right answer). Impressively, many results from online learning carry over to this setting, except here one is forced to explore as well as exploit, which leads to all sorts of interesting challenges.
Source
Other answers have pointed to the world of machine learning, and that is certainly one place where this problem has been addressed.
However, another approach that may be better suited to your needs is the use of the QR factorization with low-rank updates. Approaches to doing this and using it to solve least squares problems are given in:
Updating the QR factorization and the least squares problem by Hammarling and Lucas.
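A minimal sketch of this kind of row-by-row QR update, as an illustration of the general technique (Givens rotations) rather than the specific algorithm from the paper; only the p-by-p triangular factor R and the vector d = Q′y are kept:

```python
import numpy as np
from scipy.linalg import solve_triangular

class QRUpdatingLeastSquares:
    def __init__(self, p):
        self.R = np.zeros((p, p))   # upper-triangular factor
        self.d = np.zeros(p)        # running Q'y

    def add_row(self, x_row, y):
        row = np.asarray(x_row, dtype=float).copy()
        val = float(y)
        for i in range(len(self.d)):
            r = np.hypot(self.R[i, i], row[i])
            if r == 0.0:
                continue
            c, s = self.R[i, i] / r, row[i] / r
            # rotate row i of R against the incoming row to zero out row[i]
            Ri = self.R[i, :].copy()
            self.R[i, :] = c * Ri + s * row
            row = -s * Ri + c * row
            di = self.d[i]
            self.d[i] = c * di + s * val
            val = -s * di + c * val

    def coefficients(self):
        # well defined once p linearly independent rows have been absorbed
        return solve_triangular(self.R, self.d)
```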
Source
You can use some standard Kalman filter package in R for this - sspir, dlm, kfas, etc. I feel that the KF is a much more developed area than online learning, so it may be more practical. You may use a model of the form
y_t = beta_0(t) + beta_1(t) x_t + eps_t,
beta_i(t) = beta_i(t-1) + eta_i(t),
in which the regression coefficients follow a random walk and are re-estimated by the filter as each new observation arrives.
You can formulate a similar model for logistic regression, e.g.
y_t = 1/(1 + exp(-beta_0(t) - beta_1(t) x_t)) + eps_t,
with the same random-walk equations for the coefficients; since the observation equation is then nonlinear, an approximate (e.g. extended) Kalman filter is needed.
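Outside of those R packages, a minimal Python sketch of the filter for the linear model above; the noise variances q and r, the initial covariance, and all names are assumptions for illustration:

```python
import numpy as np

# Kalman filter for a dynamic regression whose coefficients (intercept, slope)
# follow a random walk. Q is the state noise covariance, R the observation noise.
class DynamicRegressionKF:
    def __init__(self, q=1e-4, r=1.0):
        self.beta = np.zeros(2)      # current coefficient estimate
        self.P = np.eye(2) * 1e3     # large initial uncertainty
        self.Q = np.eye(2) * q
        self.R = r

    def update(self, x, y):
        H = np.array([1.0, x])       # observation row: y_t = beta0 + beta1 * x_t + eps
        P_pred = self.P + self.Q     # predict step: random walk only inflates the covariance
        S = H @ P_pred @ H + self.R  # innovation variance (scalar)
        K = P_pred @ H / S           # Kalman gain (2-vector)
        self.beta = self.beta + K * (y - H @ self.beta)
        self.P = P_pred - np.outer(K, H) @ P_pred
        return self.beta
```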
Source
This is to add to @chmike's answer.
The method appears to be similar to B. P. Welford's online algorithm for the standard deviation, which also calculates the mean. John Cook gives a good explanation here. Tony Finch (2009) provides a method for an exponential moving average and standard deviation:
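In essence, the update described there can be written as the following small Python function (alpha is the smoothing weight):

```python
# Exponentially weighted mean and variance update in the style of Finch (2009).
def ew_update(mean, variance, x, alpha):
    diff = x - mean
    incr = alpha * diff
    mean = mean + incr
    variance = (1.0 - alpha) * (variance + diff * incr)
    return mean, variance
```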
Peering at the previously posted answer and expanding upon it to include an exponential moving window:
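A Python sketch of that combination; desiredAlpha follows the naming used in the note below, and this is an illustrative sketch rather than the originally posted code:

```python
# Running simple regression with an optional exponential moving window. With
# desiredAlpha = 0 the weights fall back to 1/n, which reproduces the plain
# (unweighted) running mean, variance, and covariance.
class ExpWeightedRegression:
    def __init__(self, desiredAlpha=0.0):
        self.alpha = desiredAlpha
        self.n = 0
        self.mean_x = 0.0
        self.mean_y = 0.0
        self.var_x = 0.0
        self.cov_xy = 0.0

    def update(self, x, y):
        self.n += 1
        a = self.alpha if self.alpha > 0 else 1.0 / self.n
        dx = x - self.mean_x
        dy = y - self.mean_y
        self.mean_x += a * dx
        self.mean_y += a * dy
        self.var_x = (1.0 - a) * (self.var_x + a * dx * dx)
        self.cov_xy = (1.0 - a) * (self.cov_xy + a * dx * dy)   # covariance analogue of the variance update

    def slope(self):
        return self.cov_xy / self.var_x

    def intercept(self):
        return self.mean_y - self.slope() * self.mean_x
```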
In the above "code", desiredAlpha could be set to 0 and if so, the code would operate without exponential weighting. It can be suggested to set desiredAlpha to 1/desiredWindowSize as suggested by Modified_moving_average for a moving window size.
Side question: of the alternative calculations above, any comments on which is better from a precision standpoint?
References:
chmike (2013). https://stats.stackexchange.com/a/79845/70282
Cook, John (n.d.). Accurately Computing Running Variance. http://www.johndcook.com/blog/standard_deviation/
Finch, Tony (2009). Incremental Calculation of Weighted Mean and Variance. https://fanf2.user.srcf.net/hermes/doc/antiforgery/stats.pdf
Wikipedia (n.d.). Welford's online algorithm. https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Online_algorithm
Source