How does stochastic gradient descent even work for neural nets?


How does stochastic gradient descent (meaning where you backpropagate and adjust the weights and biases of the neural network after each single sample) even work?

Doesn't that just tell the neural network to learn that particular sample? Which isn't really what we want?

So instead of converging towards a solution that appropriately learns the entire training data, the neural net will oscillate between different solutions that are each better for an individual sample?

To give you a simple example: say I want my neural network to output $x = 1$ if the input is $1$ and I want it to output $x = 0$ if input is $0$.

Let's say I train it on the input $0$, then $1$, then $0$ again, and so on. Each time my input is $0$, every weight will get adjusted so as to output $0$. But then in the next iteration, every weight will get adjusted so as to output $1$, hence counteracting the previous iteration.

edited 10 hours ago

Firebug


asked 11 hours ago

Jaood


Typically, "stochastic" means subsampling rows (samples): instead of performing a weight update using a gradient derived from an error vector taken over all input-output pairs, you do it for a random subset of input-output pairs, and change which ones you draw at each gradient step.
– EngrStudent
11 hours ago

machine-learning neural-networks optimization gradient-descent sgd


4 Answers

You missed the fact that the optimization step is something like new_estimate = previous_estimate + learning_rate * change, so rather than oscillating, it would gradually average between the different solutions. If learning_rate is small, then the increment towards the new value will also be small, and there won't be "jumps". Oscillating between different solutions will happen if learning_rate is too big, so if you see this on the training history plot, it suggests that you should lower the learning_rate.
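As a rough illustration of the update rule described above, here is a minimal sketch applied to the example from the question. The model (a single weight w and bias b with a squared-error loss), the function name train_per_sample, and the chosen learning rate are assumptions made for this sketch, not something taken from the question or from any particular library.

```python
# Toy model (an assumption for illustration): prediction = w * x + b,
# trained with a squared-error loss, one sample per update.
def train_per_sample(learning_rate=0.1, steps=2000):
    w, b = 0.0, 0.0
    samples = [(0.0, 0.0), (1.0, 1.0)]  # (input, target), presented alternately
    for step in range(steps):
        x, target = samples[step % 2]
        error = (w * x + b) - target
        # Gradient of 0.5 * error**2 with respect to w and b.
        w -= learning_rate * error * x
        b -= learning_rate * error
    return w, b


w, b = train_per_sample()
print(f"w = {w:.4f}, b = {b:.4f}")
print(f"prediction for input 0: {b:.4f}, prediction for input 1: {w + b:.4f}")
```

In this toy setting each update only nudges the parameters a fraction of the way towards fitting the current sample, so the alternating updates settle on values that fit both samples rather than undoing each other.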

edited 10 hours ago

answered 11 hours ago

Tim♦


This is a good point. I commonly start with a learning rate of 0.05, but have seen cases where this is too high, and 0.02 and 0.03 often work well. The textbook recommendation for the initial learning rate is $1/\lambda_1$, the inverse of the first eigenvalue of the covariance matrix of the input features. But this is usually far too large a value.
– JoleT
7 hours ago

The type of problem you're pointing out is what happens when you overfit the training data. You're actually describing a rather extreme case of overfitting in which the network is adjusted too much for each case, whereas in general people talk about overfitting for the entire set of training data — but the principle is the same.

To avoid overfitting the data, you can adjust the "learning rate", which is essentially a measure of how much you adjust the weights in each step of learning. So using your example cases, every time the input is 0 the weights will only be adjusted so that the output is closer to 0 (how much closer depends on the learning rate). The idea is that by making small enough adjustments, you'll avoid the kind of oscillating / overfitting behavior you're describing.
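To make the role of the learning rate concrete, here is a small sketch of a deliberately pessimistic case: a single shared parameter theta that is pulled towards 0 and towards 1 on alternating steps, so the two targets really do conflict. The setup and the function name steady_state_swing are hypothetical and only meant to show how the size of the back-and-forth movement depends on the learning rate.

```python
def steady_state_swing(learning_rate, steps=5000, window=100):
    """Return (min, max) of theta over the last `window` updates."""
    theta = 0.0
    history = []
    for step in range(steps):
        target = float(step % 2)  # targets alternate 0, 1, 0, 1, ...
        # Move theta a fraction `learning_rate` of the way to the target.
        theta += learning_rate * (target - theta)
        history.append(theta)
    tail = history[-window:]
    return min(tail), max(tail)


for lr in (0.9, 0.1, 0.01):
    low, high = steady_state_swing(lr)
    print(f"learning_rate={lr}: theta stays within [{low:.3f}, {high:.3f}]")
```

With a large learning rate the parameter jumps almost all the way to each new target, while with a small one it stays in a narrow band around 0.5, the average of the two conflicting targets.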

answered 11 hours ago

Bill Clark


Your full data set is almost always a sample itself, unless you truly have the whole population, which is rare. That should make it easier to get comfortable with the idea that a batch is, in some sense, the same as the full data set, except for its size.

The caveat here is that the batch should preserve the characteristics of the full data set. Imagine that you're building a face recognition data set. If your batches first take all black women, then white women, then white men and finally black men, it's not a good situation, because the batches will have different characteristics than the full sample. Ideally the batches are sampled randomly from the full set, and the full set itself is a random sample from the population.
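The difference between ordered and randomly drawn batches can be sketched with a toy data set. Everything here is an assumption for illustration: four abstract groups labelled "A" to "D" with 250 records each, a batch size of 50, and a helper named groups_per_batch that counts how many distinct groups each batch contains.

```python
import random


def groups_per_batch(labels, batch_size):
    """Count the distinct group labels in each consecutive batch."""
    return [len(set(labels[i:i + batch_size]))
            for i in range(0, len(labels), batch_size)]


# Data stored group by group, as it might come out of a sorted file.
labels_sorted = [group for group in "ABCD" for _ in range(250)]

# The same records in random order (fixed seed so the run is repeatable).
labels_shuffled = labels_sorted[:]
random.Random(0).shuffle(labels_shuffled)

sorted_counts = groups_per_batch(labels_sorted, 50)
shuffled_counts = groups_per_batch(labels_shuffled, 50)
print("groups per batch, sorted data:  ", sorted_counts)
print("groups per batch, shuffled data:", shuffled_counts)
```

Every batch cut from the sorted data contains a single group, so each gradient step would see only one part of the data, whereas batches cut from the shuffled data mix the groups and look more like the full set.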

edited 7 hours ago

answered 8 hours ago

Aksakal


The outputs won't be 0 or 1. Even if you use a linear (identity) link on the output side, the output would be like a yhat from regression, but not 0 or 1. Further, the connection weights are e.g. initialized to small random numbers in the range [-0.5,0.5], so with random connection weights at the first iteration, how can the predicted output be 0 or 1? Same thing for logistic, tanh, and softmax activation functions on the output side.

With an identity link function on the output side and a binary (0,1) target, I usually recode the target as (-1,1) instead. ANNs also like inputs to be balanced in a range of [-1,1], or at least mean-zero standardized, so feeding raw 0s and 1s to an ANN could be throwing you off. You're also supposed to permute (randomly shuffle) the input records that are fed to an ANN, so that it does not learn from the order of the data. I do this before each sweep (epoch) through the training data.
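The preprocessing steps mentioned above might look roughly like the following sketch. The helper names to_plus_minus_one, standardize and epoch_orders are made up for this example, and the tiny input list is illustrative only; a real pipeline would normally rely on whatever the chosen framework provides.

```python
import random
from statistics import mean, pstdev


def to_plus_minus_one(labels):
    """Recode binary 0/1 targets as -1/+1."""
    return [2 * y - 1 for y in labels]


def standardize(values):
    """Shift to mean zero and scale to unit (population) standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def epoch_orders(n_samples, n_epochs, seed=0):
    """Yield a fresh random visiting order of the records for each epoch."""
    rng = random.Random(seed)
    for _ in range(n_epochs):
        order = list(range(n_samples))
        rng.shuffle(order)
        yield order


inputs = [0, 1, 0, 1, 1, 0]
targets = to_plus_minus_one(inputs)
scaled = standardize(inputs)
orders = list(epoch_orders(len(inputs), n_epochs=3))
for epoch, order in enumerate(orders):
    print(f"epoch {epoch}: visit records in order {order}")
```

Reshuffling inside the epoch loop, rather than once up front, means the network never sees the records in the same fixed sequence on every pass.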

After you train the ANN on the training data, you predict the class (0,1) for the objects held out of training in each CV fold. Prediction accuracy is never based on predicting the output for objects used in training. (Those objects are used internally, but only to compute the training error: MSE or cross-entropy.)

edited 7 hours ago

answered 8 hours ago

JoleT



