How does stochastic gradient descent even work for neural nets?

How does stochastic gradient descent (meaning where you backpropagate and adjust the weights and biases of the neural network after each single sample) even work?

Doesn't that just tell the neural network to learn that particular sample, which isn't really what we want?

So instead of converging towards a solution that appropriately learns the entire training data, won't the neural net oscillate between different solutions that are each better for individual samples?

To give you a simple example: say I want my neural network to output $x = 1$ if the input is $1$ and $x = 0$ if the input is $0$.

Let's say I train it on the input $0$, then $1$, then $0$ again, and so on. Each time my input is $0$, every weight will get adjusted so as to output $0$. But then in the next iteration, every weight will get adjusted so as to output $1$, counteracting the previous iteration.







machine-learning neural-networks optimization gradient-descent sgd














edited 10 hours ago by Firebug
asked 11 hours ago by Jaood











  • Typically it means subsampling rows (samples): instead of performing a weight update using the gradient derived from an error vector taken over all input-output pairs, you do it for a random subset of input-output pairs, and you change which ones you draw at each gradient step.
    – EngrStudent, 11 hours ago
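
The subsampling described in this comment can be sketched roughly as follows. This is only an illustration under assumptions that are not in the thread: a one-parameter linear model, noiseless data generated from a slope of 3, and made-up names and hyperparameters (`minibatch_sgd`, `batch_size`, `learning_rate`).

```python
import random

def minibatch_sgd(xs, ys, batch_size=8, learning_rate=0.1, steps=500, seed=0):
    """Fit y_hat = w * x by mini-batch SGD on the mean squared error."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        # Draw a fresh random subset of input-output pairs for this step.
        batch = rng.sample(range(len(xs)), batch_size)
        # Gradient of 0.5 * mean((w * x - y)**2) over the batch only.
        grad = sum((w * xs[i] - ys[i]) * xs[i] for i in batch) / batch_size
        w -= learning_rate * grad
    return w

xs = [i / 50 for i in range(-50, 51)]  # 101 inputs in [-1, 1] (toy data)
ys = [3.0 * x for x in xs]             # noiseless targets with slope 3
w = minibatch_sgd(xs, ys)
print(w)  # ends up close to the true slope
```

Each step sees only 8 of the 101 pairs, yet the estimate still moves towards the slope that fits all of them, because every random batch points in roughly the same direction as the full-data gradient.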


























4 Answers



You missed the fact that the optimization step is something like new_estimate = previous_estimate + learning_rate * change, so rather than oscillating, it would gradually average between different solutions. If learning_rate is small, then the increment towards the new value will also be small, and there won't be "jumps". Oscillating between different solutions will happen if learning_rate is too big, so if you see this on the training history plot, it suggests that you should lower the learning_rate.















edited 10 hours ago
answered 11 hours ago by Tim
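
The update rule in this answer can be tried on a deliberately conflicting toy problem. The sketch below is an illustration only and rests on assumptions not stated in the answer: a single scalar estimate, squared error, and targets that alternate between 0 and 1. The function name `run_sgd` and the two learning rates are made up.

```python
def run_sgd(learning_rate, steps=2000):
    """Apply new = previous + learning_rate * change for alternating targets."""
    estimate = 0.0
    history = []
    for step in range(steps):
        target = step % 2           # conflicting targets: 0, 1, 0, 1, ...
        change = target - estimate  # negative gradient of 0.5 * (estimate - target)**2
        estimate = estimate + learning_rate * change
        history.append(estimate)
    tail = history[-100:]           # look only at the late behaviour
    mean = sum(tail) / len(tail)
    spread = max(tail) - min(tail)
    return mean, spread

small_mean, small_spread = run_sgd(learning_rate=0.01)
large_mean, large_spread = run_sgd(learning_rate=0.9)
print(small_mean, small_spread)  # hovers tightly around the average of the targets
print(large_mean, large_spread)  # same average, but large jumps between steps
```

With the small learning rate the estimate settles near 0.5, the compromise between the two targets, and barely moves from step to step. With the large one it keeps swinging between values close to 0 and close to 1, which is the oscillation the question worries about.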











• This is a good point. I commonly start with a learning rate of 0.05, but have seen cases where this is too high, and 0.02 or 0.03 often work well. The textbook recommendation for the initial learning rate is $1/\lambda_1$, the inverse of the first (largest) eigenvalue of the covariance matrix of the input features. But this is usually far too large a value.
  – JoleT, 7 hours ago
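
For reference, the $1/\lambda_1$ value mentioned in this comment could be computed along these lines. This is a sketch under assumptions: it reads "first eigenvalue" as the largest one, uses plain power iteration instead of a linear-algebra library, and runs on made-up two-feature data. The helper names are hypothetical.

```python
def covariance_matrix(rows):
    """Population covariance matrix of a list of equal-length feature rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [
        [sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / n
         for j in range(d)]
        for i in range(d)
    ]

def largest_eigenvalue(matrix, iterations=200):
    """Power iteration for a symmetric matrix with a unique largest eigenvalue."""
    d = len(matrix)
    v = [1.0] * d  # start vector; must not be orthogonal to the top eigenvector
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
    return sum(v[i] * mv[i] for i in range(d))  # Rayleigh quotient, v has unit length

features = [(2.0, 0.0), (-2.0, 0.0), (0.0, 1.0), (0.0, -1.0)]  # toy data
lambda_1 = largest_eigenvalue(covariance_matrix(features))
suggested_learning_rate = 1.0 / lambda_1
print(lambda_1, suggested_learning_rate)
```

As the comment notes, this value is usually treated as too large in practice, so it is better read as an upper reference point than as a setting to use directly.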


































The type of problem you're pointing out is what happens when you overfit the training data. You're actually describing a rather extreme case of overfitting in which the network is adjusted too much for each case, whereas in general people talk about overfitting for the entire set of training data, but the principle is the same.

To avoid overfitting the data, you can adjust the "learning rate", which is essentially a measure of how much you adjust the weights in each step of learning. So using your example cases, every time the input is 0 the weights will only be adjusted so that the output is closer to 0 (how much closer depends on the learning rate). The idea is that by making small enough adjustments, you'll avoid the kind of oscillating / overfitting behavior you're describing.

















answered 11 hours ago by Bill Clark
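
The question's own toy task (input 0 should give 0, input 1 should give 1) can be run with small per-sample adjustments to see what happens. The sketch assumes a one-input linear model y_hat = w * x + b with squared error. The function name `sgd_identity_toy`, the starting values and the learning rate are illustrative choices, not anything taken from the thread.

```python
def sgd_identity_toy(learning_rate=0.05, steps=2000):
    """Per-sample SGD on y_hat = w * x + b, alternating the two training samples."""
    w, b = 0.3, -0.2                    # arbitrary starting values (assumption)
    samples = [(0.0, 0.0), (1.0, 1.0)]  # (input, target) pairs from the question
    for step in range(steps):
        x, y = samples[step % 2]        # input 0, then 1, then 0, ...
        error = (w * x + b) - y         # prediction minus target
        # Gradient of 0.5 * error**2 with respect to w and b.
        w -= learning_rate * error * x
        b -= learning_rate * error
    return w, b

w, b = sgd_identity_toy()
print(w, b)  # approaches w = 1, b = 0, which fits both samples at once
```

Each step only nudges the output towards its target, and because one setting of the weights satisfies both samples, the nudges add up instead of cancelling.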
























Your full data is almost always a sample itself, unless you truly have the whole population, which is rare. That should make it easier to get comfortable with the idea that a batch is, in some sense, the same as the full data set except for its size.

The caveat here is that the batch should preserve the characteristics of the full data set. Imagine that you're building a face recognition data set. If your batches first take all black women, then white women, then white men and finally black men, that's not a good situation, because the batches will have different characteristics than the sample. Ideally the batches are sampled randomly from the full set, and the full set itself is a random sample from the population.















edited 7 hours ago
answered 8 hours ago by Aksakal
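
The caveat about batch composition can be checked numerically. The sketch below is an assumption-laden illustration: the data set is reduced to a list of two group labels stored in sorted order, and the helper name `batch_label_fractions` is made up. It compares batches taken in storage order with batches taken after a random shuffle.

```python
import random

def batch_label_fractions(labels, batch_size):
    """Fraction of label-1 items in each consecutive batch."""
    return [
        sum(labels[i:i + batch_size]) / batch_size
        for i in range(0, len(labels), batch_size)
    ]

labels = [0] * 500 + [1] * 500  # toy data stored sorted by group
sorted_fracs = batch_label_fractions(labels, batch_size=100)

rng = random.Random(0)
shuffled = labels[:]
rng.shuffle(shuffled)           # random order, so batches are random subsets
shuffled_fracs = batch_label_fractions(shuffled, batch_size=100)

print(sorted_fracs)    # every batch is entirely one group
print(shuffled_fracs)  # every batch is close to the overall 50/50 mix
```

In storage order no single batch resembles the full set, while after shuffling every batch is a small-scale version of it.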





















                        0












                        $begingroup$

                        The outputs won't be 0 or 1. Even if you use a linear (identity) link on the output side, the output would be like a yhat from regression, but not 0 or 1. Further, the connection weights are e.g. initialized to small random numbers in the range [-0.5,0.5], so with random connection weights at the first iteration, how can the predicted output be 0 or 1? Same thing for logistic, tanh, and softmax activation functions on the output side.



                        For an identity link function on the output side if the result is binary (0,1), I usually use (-1,1) instead. ANNs also like inputs to be balanced in a range of [-1,1] or at least mean-zero standardized. So feeding 0,1's to an ANN could be throwing you off. You're also supposed to permute (randomly shuffle, or reorder) the input records that are fed to an ANN, so the ANN does not learn from the order of the data. I do this before each sweep (epoch) through the training data, since I don't want the ANN to learn from training object order.



                        After you train the ANN on the training data, you predict class (0,1) with objects held out of training that are in each CV fold. Prediction accuracy is never based on predicting the output for objects used in training. (they are used internally, but it's only for training error: MSE or cross-entropy).






                        share|cite|improve this answer











                        $endgroup$











































                            edited 7 hours ago

























                            answered 8 hours ago









JoleT















































