I am trying to understand the history of gradient descent and stochastic gradient descent. Gradient descent was invented by Cauchy in 1847: Méthode générale pour la résolution des systèmes d'équations simultanées, pp. 536–538. For more information about it, see here.
Since then, gradient descent methods have continued to be developed, and I am not familiar with their history. In particular, I am interested in the invention of stochastic gradient descent.
A reference that can be used in an academic paper would be more than welcome.
Answers:
Stochastic Gradient Descent is preceded by Stochastic Approximation, as first described by Robbins and Monro in their paper, A Stochastic Approximation Method (1951). Kiefer and Wolfowitz subsequently published their paper, Stochastic Estimation of the Maximum of a Regression Function (1952), which is more recognizable to people familiar with the ML variant of Stochastic Approximation (i.e., Stochastic Gradient Descent), as pointed out by Mark Stone in the comments. The 1960s saw plenty of research in that vein: Dvoretzky, Powell, and Blum all published results that we take for granted today. It is a relatively minor leap from the Robbins and Monro method to the Kiefer-Wolfowitz method, and merely a reframing of the problem to then get to Stochastic Gradient Descent (for regression problems). The above papers are widely cited as the antecedents of Stochastic Gradient Descent, as mentioned in this review paper by Nocedal, Bottou, and Curtis, which provides a brief historical perspective from a Machine Learning point of view.
I believe that Kushner and Yin, in their book Stochastic Approximation and Recursive Algorithms and Applications, suggest that the notion had been used in control theory as far back as the 1940s, but I don't recall whether they gave a citation for that or whether it was anecdotal, and I don't have access to the book to confirm this.
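To make the "reframing" from Robbins and Monro to Stochastic Gradient Descent concrete, here is a minimal sketch, not taken from any of the papers above. The function names, the toy problems, and the step size a_n = 1/n are all assumptions chosen for illustration. The same recursion is used twice: once to find a root from noisy observations, and once with the noisy observation taken to be the gradient of a single sampled squared error.

```python
import random

def robbins_monro(noisy_g, theta0, steps, rng):
    """Seek a root of E[noisy_g(theta)] = 0 with step sizes a_n = 1/n (an assumed schedule)."""
    theta = theta0
    for n in range(1, steps + 1):
        theta -= (1.0 / n) * noisy_g(theta, rng)
    return theta

def sgd_least_squares(data, w0, steps, rng):
    """Same recursion, where the noisy observation is a one-sample gradient."""
    def noisy_grad(w, rng):
        x, y = rng.choice(data)      # draw one training pair at random
        return (w * x - y) * x       # d/dw of 0.5 * (w*x - y)**2
    return robbins_monro(noisy_grad, w0, steps, rng)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Toy root-finding problem: g(theta) = theta - 2, observed with Gaussian noise.
root = robbins_monro(lambda t, r: (t - 2.0) + r.gauss(0.0, 1.0), 0.0, 20000, rng)

# Toy regression problem: y is close to 3*x, fitted with a single weight w.
data = [(x, 3.0 * x + rng.gauss(0.0, 0.05)) for x in [0.5, 1.0, 1.5, 2.0]]
slope = sgd_least_squares(data, 0.0, 20000, rng)

print(root, slope)
```

In this sketch `root` should settle near 2 and `slope` near 3. The Kiefer-Wolfowitz variant is not shown; it would replace `noisy_grad` with a finite-difference estimate built from noisy function values.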
See
I am not sure whether SGD was invented before this in the optimization literature (it probably was), but here I believe he describes an application of SGD to train a perceptron.
He calls these "two types of reinforcement".
He also references a book with more on these "bivalent systems".