Which likelihood function is used in linear regression?

When deriving the maximum likelihood estimate for linear regression, we start from a likelihood function. Does it matter which of these two forms we use?

$P(y|x,w)$

$P(y,x|w)$

All the pages I have read on the internet use the first one.

I found that $P(y,x|w)$ is equal to $P(y|x,w) \cdot P(x)$,

so maximizing $P(y,x|w)$ with respect to $w$ is the same as maximizing $P(y|x,w)$, because $x$ and $w$ are independent.

The second function looks better to me because it means "what is the probability of the data ($x$ AND $y$) given the parameter", but the first function doesn't show that.

Is my point correct or not? Is there any difference?
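The factorization in the question can be checked numerically. The following is a minimal sketch, not part of the original post: the data-generating values (`w_true`, `sigma`, the normal distribution chosen for `x`) and all function names are assumptions made purely for illustration. It evaluates both log-likelihoods on a grid of `w` values and compares where each one peaks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: x ~ N(2, 1.5^2), y = w_true * x + Gaussian noise.
n, w_true, sigma = 200, 1.7, 0.5
x = rng.normal(2.0, 1.5, size=n)
y = w_true * x + rng.normal(0.0, sigma, size=n)


def gaussian_logpdf(z, mean, std):
    """Elementwise log density of N(mean, std^2) evaluated at z."""
    return -0.5 * np.log(2 * np.pi * std**2) - (z - mean) ** 2 / (2 * std**2)


def cond_loglik(w):
    """log P(y | x, w): the conditional likelihood used in most derivations."""
    return gaussian_logpdf(y, w * x, sigma).sum()


# log P(x): computed once, because it does not involve w at all.
log_px = gaussian_logpdf(x, 2.0, 1.5).sum()


def joint_loglik(w):
    """log P(y, x | w) = log P(y | x, w) + log P(x)."""
    return cond_loglik(w) + log_px


w_grid = np.linspace(0.0, 3.0, 3001)
cond = np.array([cond_loglik(w) for w in w_grid])
joint = np.array([joint_loglik(w) for w in w_grid])

w_hat_cond = w_grid[np.argmax(cond)]
w_hat_joint = w_grid[np.argmax(joint)]
print(w_hat_cond, w_hat_joint)
```

The two curves differ only by the constant `log_px`, so they peak at the same `w`. If the distribution of `x` were itself parameterized by `w`, this would no longer hold.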




















Tags: regression, mathematical-statistics, maximum-likelihood

asked 8 hours ago by floyd; edited 7 hours ago by floyd
          2 Answers
As you say, in this case the choice doesn't matter; the maximization over $w$ is the same. But you touch on a good point.

Typically for regression problems we prefer to write the version that conditions on $x$, because we think of regression as modeling the conditional distribution $p(y | x)$ while ignoring the details of what $p(x)$ may look like. This is because for the model $Y = f(X) + \varepsilon$, where $\varepsilon \sim \mathcal{N}(0, \sigma^2)$, we want the estimated function $\hat{f}$ we produce to be as close to $f$ as possible on average, measured by, say, least squares. This is the same thing as pulling $\hat{f}$ close to $f$ on each of its input values individually: the distribution $p(x)$ that describes how the inputs are scattered doesn't matter, as long as $\hat{f}$ is close to $f$ on those values. We only want to model how $y$ responds to $x$, and we don't care about how $x$ itself is distributed.

More mathematically, from an MLE standpoint the goal is to make the likelihood $P(x,y|w)$ as large as possible, but as you say, we are not modeling $P(x)$ through $w$, so $P(x)$ factors out and doesn't matter. From a fitting standpoint, if we want to minimize expected prediction error over $f$,
$$\min_f \text{EPE}(f) = \min_f \mathbb{E}(Y - f(X))^2$$
then, omitting some computation, we obtain the minimizing $f$ to be
$$f(x) = \mathbb{E}(Y | X = x)$$
so for least squares loss, the best possible $f$ depends only on the conditional distribution $p(y | x)$, and estimating it should not require any additional information.

However, in classification problems (logistic regression, linear discriminant analysis, Naive Bayes) there is a difference between a "generative" and a "discriminative" model; generative models do not condition on $x$ and discriminative models do. For instance, in linear discriminant analysis we do model $P(x)$ through $w$; we also do not use least squares loss in those.
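The computation that the answer omits can be sketched as follows. The shorthand $m(x) = \mathbb{E}(Y \mid X = x)$ is introduced here only for this sketch and is not notation from the answer; finite second moments are assumed.

$$
\begin{aligned}
\mathbb{E}\,(Y - f(X))^2 &= \mathbb{E}\,\big(Y - m(X) + m(X) - f(X)\big)^2 \\
&= \mathbb{E}\,(Y - m(X))^2 + 2\,\mathbb{E}\big[(Y - m(X))\,(m(X) - f(X))\big] + \mathbb{E}\,(m(X) - f(X))^2 \\
&= \mathbb{E}\,(Y - m(X))^2 + \mathbb{E}\,(m(X) - f(X))^2
\end{aligned}
$$

The cross term vanishes because, conditional on $X$, the factor $m(X) - f(X)$ is fixed and $\mathbb{E}\,(Y - m(X) \mid X) = 0$. The first remaining term does not depend on $f$, and the second is non-negative and equals zero when $f = m$, so the conditional mean minimizes the expected prediction error. Only $p(y | x)$ enters the minimizer; $p(x)$ affects only how the errors are weighted across inputs.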














• Thank you for the explanation. I saw some pages on the internet call $X$ "the model" and $y$ "the sample/data", with the likelihood written as $P(\text{data} \mid m, w)$, where $m$ stands for "model". So they write the likelihood as $P(y|x,w)$. In my question I called both $x$ AND $y$ "data". Am I right, or should I call $x$ "the model"?
  – floyd, 5 hours ago

• I wouldn't call $X$ the model. When they write $m$ that way, they're conditioning on a probabilistic model for the data in addition to the data $x$; when they write just $x$ instead of $m$, the model is assumed implicitly.
  – Drew N, 5 hours ago


















That's a good question since the difference is a bit subtle; hopefully this helps.

The difference between $p(y|x,w)$ and $p(y,x|w)$ lies in whether you model X as fixed or random. The first is the probability of Y conditioned on X and w (w is usually denoted $\beta$ in statistical settings), whereas the second is the joint probability of X and Y conditioned on w.

Usually, in simple linear regression,

$$Y = \beta_0 + \beta_1 X + \epsilon$$

you model X as fixed and independent of $\epsilon$, which follows a $N(0,\sigma^2)$ distribution. That is, Y is modeled as a linear function of some fixed input (X) plus some random noise ($\epsilon$). Therefore, it makes sense to write the likelihood as $p(y|X,w)$ in this setting, since X is fixed or modeled as constant, which is usually denoted explicitly as $X=x$.

Mathematically, when X is fixed, $p(X) = 1$, as it is constant. Therefore

$$p(y,X|w) = p(y|X,w)\,p(X) \implies p(y,X|w) = p(y|X,w)$$

If X is random rather than fixed and given, then $p(X)$ is no longer 1 and the equality above doesn't hold (the two sides differ by the factor $p(X)$), so it all comes down to whether you model X as random or fixed.
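For the simple linear regression model in this answer, the conditional likelihood $p(y|X,w)$ under Gaussian noise is maximized by the least squares estimate. The sketch below illustrates this on simulated data; the true coefficients, the noise level, and the helper names are hypothetical choices, and `sigma` is treated as known for simplicity (the maximizer over the coefficients does not depend on it). The inputs are drawn at random once and then conditioned on, that is, treated as fixed.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical simulated data from Y = beta0 + beta1 * X + eps, eps ~ N(0, sigma^2).
n, beta_true, sigma = 500, np.array([1.0, 2.0]), 0.8
x = rng.uniform(-3.0, 3.0, size=n)
X = np.column_stack([np.ones(n), x])  # design matrix with an intercept column
y = X @ beta_true + rng.normal(0.0, sigma, size=n)


def neg_loglik(beta):
    """Negative conditional log-likelihood -log p(y | X, beta) under Gaussian noise."""
    resid = y - X @ beta
    return 0.5 * n * np.log(2 * np.pi * sigma**2) + resid @ resid / (2 * sigma**2)


def neg_loglik_grad(beta):
    """Gradient of neg_loglik with respect to beta."""
    return -X.T @ (y - X @ beta) / sigma**2


# Ordinary least squares estimate.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# The gradient of the negative log-likelihood is (numerically) zero at the OLS solution.
grad_at_ols = neg_loglik_grad(beta_ols)
nll_at_ols = neg_loglik(beta_ols)

# Any small random perturbation of beta makes the likelihood worse.
perturbed = [neg_loglik(beta_ols + d) for d in rng.normal(0.0, 0.05, size=(100, 2))]
print(beta_ols, grad_at_ols, min(perturbed) - nll_at_ols)
```

Because the negative log-likelihood is a constant plus the residual sum of squares divided by $2\sigma^2$, minimizing it over the coefficients is the same problem as least squares, and no distribution for X is needed anywhere in the computation.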














• Thank you for the explanation. I saw some pages on the internet call $X$ "the model" and $y$ "the sample/data", with the likelihood written as $P(\text{data} \mid m, w)$, where $m$ stands for "model". So they write the likelihood as $P(y|x,w)$. In my question I called both $x$ AND $y$ "data". Am I right, or should I call $x$ "the model"?
  – floyd, 5 hours ago













Answer beginning "As you say, in this case the choice doesn't matter": answered 6 hours ago by Drew N














          answered 6 hours ago









          Samir Rachid Zaim















          • $begingroup$
            Thank you for the explanation. I saw some pages on the internet call $X$ "the model" and $y$ "the sample/data", with the likelihood written as $P(data|m, w)$, where $m$ stands for "model". So they write the likelihood as $P(y|x,w)$. In my question I called both $x$ and $y$ "data". Am I right, or should I call $x$ "the model"?
            $endgroup$
            – floyd
            5 hours ago






















































