How does stochastic gradient descent even work for neural nets?
How does stochastic gradient descent (meaning backpropagating and adjusting the network's weights and biases after every single sample) even work?

Doesn't that just tell the neural network to learn that particular sample, which isn't really what we want?

So instead of converging towards a solution that appropriately fits the entire training set, won't the neural net oscillate between solutions that are each optimal only for an individual sample?

To give a simple example: say I want my neural network to output $x = 1$ if the input is $1$, and $x = 0$ if the input is $0$. Suppose I train it on the input $0$, then $1$, then $0$ again, and so on. Each time the input is $0$, every weight gets adjusted towards outputting $0$; but in the next iteration every weight gets adjusted towards outputting $1$, counteracting the previous iteration.

machine-learning neural-networks optimization gradient-descent sgd
"Stochastic" typically means subsampling rows (samples): instead of performing a weight update using the gradient of an error computed over all input-output pairs, you do it for a random subset of input-output pairs, and change which ones you draw at each gradient step.
– EngrStudent, 11 hours ago
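The subsampling scheme this comment describes can be sketched in a few lines of NumPy. This is only an illustration: the regression data, batch size of 16, and learning rate below are all made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up regression data: y = 2x + noise
X = rng.normal(size=100)
y = 2.0 * X + 0.1 * rng.normal(size=100)

w = 0.0             # single weight to fit
learning_rate = 0.1

for step in range(200):
    # Draw a fresh random subset of input-output pairs at each gradient step
    idx = rng.choice(len(X), size=16, replace=False)
    xb, yb = X[idx], y[idx]
    # Gradient of the mean squared error over the mini-batch only
    grad = -2.0 * np.mean((yb - w * xb) * xb)
    w -= learning_rate * grad
```

Each update sees only 16 of the 100 pairs, yet `w` still drifts towards the underlying slope of 2, because the mini-batch gradients agree with the full gradient on average.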
asked 11 hours ago by Jaood; edited 10 hours ago by Firebug
4 Answers
You missed the fact that the optimization step is something like `new_estimate = previous_estimate + learning_rate * change`, so rather than oscillating, it would gradually average between the different solutions. If `learning_rate` is small, then each increment towards the new value is also small, and there won't be "jumps". Oscillating between different solutions happens when `learning_rate` is too big, so if you see this on the training-history plot, it suggests you should lower the `learning_rate`.
– Tim, answered 11 hours ago, edited 10 hours ago
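This update rule can be checked numerically. Below is a minimal sketch, not taken from the answer: a single made-up parameter is trained alternately on targets 0 and 1, once with a small learning rate and once with a large one.

```python
def train(learning_rate, steps=500):
    """Apply new = old + learning_rate * (target - old), alternating
    targets 0 and 1, and return the range of the last ten estimates."""
    estimate = 0.0
    tail = []
    for step in range(steps):
        target = step % 2  # 0, 1, 0, 1, ...
        estimate += learning_rate * (target - estimate)
        if step >= steps - 10:
            tail.append(estimate)
    return min(tail), max(tail)

lo_small, hi_small = train(0.05)  # settles in a narrow band around 0.5
lo_big, hi_big = train(0.9)       # keeps jumping between the two targets
```

With `learning_rate = 0.05` the late estimates all sit within a few hundredths of 0.5, the average of the two conflicting targets; with 0.9 they swing across most of the $[0, 1]$ interval, which is exactly the oscillation the question worries about.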
This is a good point. I commonly start with a learning rate of 0.05, but have seen cases where this is too high, and 0.02 or 0.03 often work well. The textbook recommendation for the initial learning rate is $1/\lambda_1$, the inverse of the largest eigenvalue of the covariance matrix of the input features, but this is usually far too large a value.
– JoleT, 7 hours ago
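The $1/\lambda_1$ rule of thumb mentioned in this comment can be computed directly. The feature matrix below is purely synthetic, with deliberately different scales per feature to make $\lambda_1$ large:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic inputs: three independent features with std 3.0, 1.0, 0.5
X = rng.normal(size=(500, 3)) * np.array([3.0, 1.0, 0.5])

cov = np.cov(X, rowvar=False)
lambda_1 = np.linalg.eigvalsh(cov)[-1]  # largest eigenvalue of the covariance
textbook_lr = 1.0 / lambda_1
```

Here the largest feature variance is about $3^2 = 9$, so the rule suggests a learning rate near $1/9 \approx 0.11$, noticeably larger than the 0.02 to 0.05 the comment reports working in practice.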
The type of problem you're pointing out is what happens when you overfit the training data. You're actually describing a rather extreme case of overfitting in which the network is adjusted too much for each individual case, whereas people generally talk about overfitting the training set as a whole, but the principle is the same.

To avoid this, you can adjust the "learning rate", which is essentially a measure of how much you adjust the weights in each step of learning. Using your example, every time the input is 0, the weights are only adjusted so that the output moves closer to 0 (how much closer depends on the learning rate). The idea is that by making small enough adjustments, you avoid the oscillating, overfitting behavior you're describing.
– Bill Clark, answered 11 hours ago
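As a sanity check on this answer, here is a minimal sketch of a single sigmoid unit trained per-sample on the question's 0/1 example. The use of the standard logistic (cross-entropy) gradient is an assumption of this sketch, not something from the answer; the point is that with a modest learning rate, the updates for the two inputs settle instead of cancelling.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w, b = 0.2, -0.1    # small arbitrary initial weights
learning_rate = 0.5

for step in range(2000):
    x = step % 2                  # alternate the inputs 0 and 1
    target = x                    # desired output equals the input
    out = sigmoid(w * x + b)
    delta = out - target          # logistic (cross-entropy) gradient
    w -= learning_rate * delta * x
    b -= learning_rate * delta

out_for_0 = sigmoid(b)            # prediction for input 0
out_for_1 = sigmoid(w + b)        # prediction for input 1
```

Even though every single update nudges the weights towards the current sample only, the unit ends up predicting close to 0 for input 0 and close to 1 for input 1: the small per-sample steps accumulate rather than counteracting each other.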
Your full data set is almost always itself a sample, unless you truly have the whole population, which is rare. It should then be easier to get comfortable with the idea that a batch is, in some sense, the same as the full data set, except for its size.

The caveat is that the batch should preserve the characteristics of the full data set. Imagine you're building a face-recognition data set. If your batches first take all black women, then white women, then white men, and finally black men, that's not a good situation, because the batches will have different characteristics than the sample. Ideally the batches are sampled randomly from the full set, and the full set itself is a random sample from the population.
– Aksakal, answered 8 hours ago, edited 7 hours ago
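The batching caveat in this answer can be demonstrated with made-up group labels: batches taken in stored order each contain a single group, while randomly sampled batches mirror the composition of the full set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up group labels, stored sorted by group: four groups of 250 each
labels = np.repeat([0, 1, 2, 3], 250)
batch_size = 100

# Taking batches in stored order: the first batch holds a single group
ordered_batch = labels[:batch_size]
groups_in_ordered = len(np.unique(ordered_batch))

# Shuffling first: the batch mixes all four groups, like the full set does
shuffled_batch = labels[rng.permutation(len(labels))][:batch_size]
groups_in_shuffled = len(np.unique(shuffled_batch))
```

A gradient computed on the ordered batch reflects only one group's characteristics, which is the situation the answer warns against.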
The outputs won't be 0 or 1. Even if you use a linear (identity) link on the output side, the output would be like a $\hat{y}$ from regression, not exactly 0 or 1. Further, the connection weights are typically initialized to small random numbers, e.g. in the range $[-0.5, 0.5]$, so with random connection weights at the first iteration, how could the predicted output be exactly 0 or 1? The same goes for logistic, tanh, and softmax activation functions on the output side.

For an identity link function on the output side with a binary $(0,1)$ result, I usually use $(-1,1)$ instead. ANNs also like inputs to be balanced in a range of $[-1,1]$, or at least mean-zero standardized, so feeding raw 0s and 1s to an ANN could be throwing you off. You're also supposed to permute (randomly shuffle) the input records fed to the ANN so it does not learn from the order of the data; I do this before each sweep (epoch) through the training data, since I don't want the ANN to learn from training object order.

After you train the ANN on the training data, you predict class $(0,1)$ with objects held out of training in each CV fold. Prediction accuracy is never based on predicting the output for objects used in training (those are used internally, but only for the training error: MSE or cross-entropy).
– JoleT, answered 8 hours ago, edited 7 hours ago
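The preprocessing steps this answer recommends (mean-zero standardization of the inputs, recoding $(0,1)$ targets as $(-1,1)$, and reshuffling before every epoch) can be sketched as follows. The data is entirely synthetic and the actual weight update is left as a placeholder, since it depends on the chosen network.

```python
import numpy as np

rng = np.random.default_rng(0)

X = rng.uniform(0, 10, size=(200, 3))  # synthetic raw features on a [0, 10] scale
y = rng.integers(0, 2, size=200)       # synthetic binary targets coded 0/1

# Mean-zero standardization of the inputs, per feature
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Recode the binary (0, 1) targets as (-1, 1)
y_pm = 2 * y - 1

for epoch in range(3):
    order = rng.permutation(len(X_std))  # reshuffle before every epoch
    for i in order:
        xi, ti = X_std[i], y_pm[i]
        # ... the per-sample weight update for the chosen network goes here ...
```

After standardization each feature has mean 0 and standard deviation 1, and the permutation ensures the network never sees the records in the same order twice.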