Who invented stochastic gradient descent?


I am trying to understand the history of gradient descent and stochastic gradient descent. Gradient descent was invented by Cauchy in 1847: Méthode générale pour la résolution des systèmes d'équations simultanées, pp. 536–538. For more on this, see here.

Gradient descent methods have kept developing since then, and I am not familiar with their history. In particular, I am interested in the invention of stochastic gradient descent.

A reference that can be used in an academic paper would be more than welcome.

DaL
I learned about SGD before machine learning, so it must have come before all of this.
Aksakal
Well, Cauchy surely invented GD before machine learning, so it would not surprise me if SGD was invented before it too.
DaL
Kiefer-Wolfowitz stochastic approximation (en.wikipedia.org/wiki/Stochastic_approximation) is most of the way there, apart from not directly "simulating" the gradient.
Mark L. Stone
"Stochastic gradient descent" in ML is the same as the "stochastic subgradient method" in convex optimization. And subgradient methods were discovered during 1960–1970 in the USSR, in Moscow. Perhaps also in the United States. I saw a video in which Boris Polyak (author of the heavy-ball method) said that he (and everyone else) started thinking about subgradient methods in 1970 (youtube.com/watch?v=2PcidcPxvyk&t=1963s) ...
bruziuz

Answers:


Stochastic Gradient Descent is preceded by Stochastic Approximation, as first described by Robbins and Monro in their paper, A Stochastic Approximation Method. Kiefer and Wolfowitz subsequently published their paper, Stochastic Estimation of the Maximum of a Regression Function, which is more recognizable to people familiar with the ML variant of Stochastic Approximation (i.e. Stochastic Gradient Descent), as pointed out by Mark Stone in the comments. The 1960s saw plenty of research in that vein: Dvoretzky, Powell, and Blum all published results that we take for granted today. It is a relatively minor leap to get from the Robbins and Monro method to the Kiefer-Wolfowitz method, and merely a reframing of the problem to then get to Stochastic Gradient Descent (for regression problems). The above papers are widely cited as being the antecedents of Stochastic Gradient Descent, as mentioned in this review paper by Bottou, Curtis, and Nocedal, which provides a brief historical perspective from a Machine Learning point of view.
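As a concrete illustration of the Robbins-Monro scheme (a minimal sketch; the target function, noise model, and step sizes below are my own choices for the example, not taken from the paper):

```python
import random

random.seed(0)

# Robbins-Monro: find the root x* of M(x) = a when only noisy
# evaluations of M are available. Here M(x) = 2x + 1 (hidden from
# the algorithm) and a = 5, so the root is x* = 2.
def noisy_M(x):
    return 2 * x + 1 + random.gauss(0, 0.1)  # noisy measurement of M(x)

a = 5.0
x = 0.0
for n in range(1, 5001):
    step = 1.0 / n                 # step sizes with sum(step) = inf, sum(step**2) < inf
    x += step * (a - noisy_M(x))   # move toward the root using only noisy observations
```

Replacing the noisy function value with a noisy (e.g. single-sample) gradient in the same recursion is exactly the reframing that yields SGD.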

I believe that Kushner and Yin in their book Stochastic Approximation and Recursive Algorithms and Applications suggest that the notion had been used in control theory as far back as the 40's, but I don't recall if they had a citation for that or if it was anecdotal, nor do I have access to their book to confirm this.

Herbert Robbins and Sutton Monro, A Stochastic Approximation Method, The Annals of Mathematical Statistics, Vol. 22, No. 3 (Sep. 1951), pp. 400–407.

J. Kiefer and J. Wolfowitz, Stochastic Estimation of the Maximum of a Regression Function, The Annals of Mathematical Statistics, Vol. 23, No. 3 (1952), pp. 462–466.

Léon Bottou, Frank E. Curtis, and Jorge Nocedal, Optimization Methods for Large-Scale Machine Learning, technical report, arXiv:1606.04838.

David Kozak
Can you give exact references? And as for the invention of SGD, it seems to have been in the 40's, but it is not clear by whom and where.
DaL
Certainly it's widely believed to be Robbins and Monro in 1951 with Stochastic Approximation Algorithms. I have heard that something similar showed up in the control theory literature in the 40's (like I said, I think from Kushner and Yin but I don't have that book handy), but aside from that one place everyone seems to cite Robbins and Monro, including the Nocedal et al. reference I linked to.
David Kozak
So our leading candidate now is H. Robbins and S. Monro, A Stochastic Approximation Method, The Annals of Mathematical Statistics, 22(3):400–407, 1951, as cited by Bottou, Curtis, and Nocedal in pdfs.semanticscholar.org/34dd/…
DaL
I see it is referred to as the origin of SGD, but in the summary (the abstract, in today's terms) it is written: "M(x) is assumed to be a monotone function of x but is unknown to the experimenter, and it is desired to find the solution x = θ of the equation M(x) = α, where α is a given constant." If M(x) is unknown, one cannot differentiate it. Maybe it is another ancient ancestor?
DaL
Agreed, in some sense. Kiefer and Wolfowitz built on the analysis of this to come up with their paper, which is more recognizable in the form we see today, as mentioned above by Mark Stone. Their paper can be found here: projecteuclid.org/download/pdf_1/euclid.aoms/1177729392.
David Kozak

See

Rosenblatt F. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological review. 1958 Nov;65(6):386.

I am not sure if SGD was invented before this in optimization literature—probably was—but here I believe he describes an application of SGD to train a perceptron.

If the system is under a state of positive reinforcement, then a positive ΔV is added to the values of all active A-units in the source-sets of "on" responses, while a negative ΔV is added to the active units in the source-sets of "off" responses.

He calls these "two types of reinforcement".
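Read as an update rule, the reinforcement described in the quote corresponds to the perceptron update, i.e. SGD with a constant learning rate applied to one example at a time. A minimal sketch under that reading (the function name and toy data are mine, purely illustrative):

```python
# Perceptron update: SGD with a constant learning rate, one example per step.
def perceptron_update(w, x, y, lr=1.0):
    # y is the label in {-1, +1}; update only on a misclassified example
    pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1
    if pred != y:  # the "reinforcement": nudge weights toward the correct response
        w = [wi + lr * y * xi for wi, xi in zip(w, x)]
    return w

# A few passes over a tiny, linearly separable toy set, one sample at a time
data = [([1.0, 2.0], 1), ([2.0, 0.5], -1), ([0.5, 3.0], 1)]
w = [0.0, 0.0]
for _ in range(10):
    for x, y in data:
        w = perceptron_update(w, x, y)
```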

He also references a book with more on these "bivalent systems".

Rosenblatt F. The perceptron: a theory of statistical separability in cognitive systems (Project Para). Cornell Aeronautical Laboratory; 1958.

user0
A good step ahead, thanks! I found the first reference online here: citeseerx.ist.psu.edu/viewdoc/… I will go over it. However, I expected the algorithm to be stated more explicitly and formally.
DaL
+1 for the remark about optimization. Since it's used in Machine Learning to do optimization and since optimization became a big deal 40 or 50 years before ML -- and computers also entered the picture about the same time -- that seems like a good lead.
Wayne
I don't understand why you say that this quote describes SGD.
amoeba says Reinstate Monica
@amoeba Hopefully I am not making a mistake; I was just skimming the paper, but I thought he was describing the perceptron update, which is just SGD with a constant learning rate.
user0
That's right. I am just saying that the stochastic aspect is not evident from the quote you chose. I mean, "stochastic" GD simply means that updates are done one training sample at a time (instead of computing gradient using all the available training samples). The algorithm given in en.wikipedia.org/wiki/Perceptron#Steps makes this "stochastic" aspect immediately clear in step #2.
amoeba says Reinstate Monica
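The one-sample-at-a-time aspect described in the last comment can be made concrete by contrasting the two variants on a toy least-squares problem (all names and data below are illustrative, not from any of the papers discussed):

```python
import random

random.seed(0)
# Toy 1-D least squares: fit w in y = w * x (true w = 3, synthetic data)
xs = [random.uniform(-1.0, 1.0) for _ in range(100)]
ys = [3.0 * x + random.gauss(0.0, 0.05) for x in xs]

# Batch GD: each update averages the gradient over ALL samples
w_batch = 0.0
for _ in range(200):
    grad = sum(2.0 * (w_batch * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w_batch -= 0.1 * grad

# SGD: each update uses the gradient of ONE sample (the "stochastic" part)
w_sgd = 0.0
for _ in range(5):                  # 5 epochs over the data
    for x, y in zip(xs, ys):
        w_sgd -= 0.1 * 2.0 * (w_sgd * x - y) * x
```

Both recover a weight near 3; the only difference is whether each step sees the full-data gradient or a single sample's gradient.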