Best way to visualize huge amount of dataMovieLens data setSklearn and PCA. Why is max n_row == max n_components?Pivot DataFrame while calculating new valuesBest way to represent data as features vectors in Pythonpanda grouping by month with transposeHow to measure correlation between several categorical features and a numerical label in Python?Histogram is extremely skewed to the leftPerform k-means clustering over multiple columnsProcessing csv file with more than 700K rows of data

Does the Way of Shadow monk's Shadow Step feature count as a magical ability?

Does every piano need tuning every year?

What are the consequences of high orphan block rate?

Why weren't the Death Star plans transmitted electronically?

Why are there two fundamental laws of logic?

what should be done first, handling missing data or dealing with data types?

Is there a recurrence relation which has no closed formula?

How do pilots align the HUD with their eyeballs?

How do you use the interjection for snorting?

A file manager to open a zip file like opening a folder, instead of extract it by using a archive manager

What is the meaning of word 'crack' in chapter 33 of A Game of Thrones?

Aesthetic proofs that involve Field Theory / Galois Theory

A famous scholar sent me an unpublished draft of hers. Then she died. I think her work should be published. What should I do?

What benefits does the Power Word Kill spell have?

Comma Code - Automate the Boring Stuff with Python

Is it acceptable to say that a reviewer's concern is not going to be addressed because then the paper would be too long?

Cut a cake into 3 equal portions with only a knife

Why does C++ have 'Undefined Behaviour' and other languages like C# or Java don't?

Does wetting a beer glass change the foam characteristics?

How to justify a team increase when the team is doing good?

Do I have advantage with Riposte when moving away from a flanked enemy and triggering an opportunity attack?

Do wheelchair aircraft exist?

Why, even after his imprisonment, people keep calling Hannibal Lecter "Doctor"?

I nicked the tip of the taper on a bottom bracket spindle. Is it still safe?



Best way to visualize huge amount of data


MovieLens data setSklearn and PCA. Why is max n_row == max n_components?Pivot DataFrame while calculating new valuesBest way to represent data as features vectors in Pythonpanda grouping by month with transposeHow to measure correlation between several categorical features and a numerical label in Python?Histogram is extremely skewed to the leftPerform k-means clustering over multiple columnsProcessing csv file with more than 700K rows of data






.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty margin-bottom:0;








2












$begingroup$


I have a data set of around 3M row. I has only 2 category (category- 2:1 ratio). Now i want to visualize(scatter plot) it's distribution to understand can the data linearly separable or not(In order to choose model type).I already try this and the plot is not understandable. What will be the best way to visualize this data set?










share|improve this question









$endgroup$




















    2












    $begingroup$


    I have a data set of around 3M row. I has only 2 category (category- 2:1 ratio). Now i want to visualize(scatter plot) it's distribution to understand can the data linearly separable or not(In order to choose model type).I already try this and the plot is not understandable. What will be the best way to visualize this data set?










    share|improve this question









    $endgroup$
















      2












      2








      2





      $begingroup$


      I have a data set of around 3M row. I has only 2 category (category- 2:1 ratio). Now i want to visualize(scatter plot) it's distribution to understand can the data linearly separable or not(In order to choose model type).I already try this and the plot is not understandable. What will be the best way to visualize this data set?










      share|improve this question









      $endgroup$




      I have a data set of around 3M row. I has only 2 category (category- 2:1 ratio). Now i want to visualize(scatter plot) it's distribution to understand can the data linearly separable or not(In order to choose model type).I already try this and the plot is not understandable. What will be the best way to visualize this data set?







      pandas matplotlib seaborn






      share|improve this question













      share|improve this question











      share|improve this question




      share|improve this question










      asked 10 hours ago









      Taimur IslamTaimur Islam

      2561 gold badge4 silver badges11 bronze badges




      2561 gold badge4 silver badges11 bronze badges























          2 Answers
          2






          active

          oldest

          votes


















          2














          $begingroup$

          I have three suggestions that may help.



          1. Reduce the point size

          2. Make the points highly transparent

          3. Downsample the points

          Since you do not provide any sample data, I will use some random data to illustrate.



          ## The purpose of S1 is to intermix the two classes at random
          S1 = sample(3000000)
          x = c(rnorm(2000000, 0, 1), rnorm(1000000, 3,1))[S1]
          y = c(rnorm(2000000, 0, 1), rnorm(1000000, 3,1))[S1]
          z = c(rep(1,2000000), rep(2,1000000))[S1]

          plot(x,y, pch=20, col=rainbow(3)[z])


          Initial plot



          The base plot without any adjustments is not very nice. Let's apply suggestions 1 and 2.



          plot(x,y, pch=20, cex=0.4, col=rainbow(3, alpha=0.01)[z]) 


          Reduced point size and high transparency



          Reducing the point size and making the points highly transparent helps some. This gives a better idea of the overlap between the two distributions.



          If we downsample, we don't need quite as much transparency.



          ## The purpose of S2 is to downsample the data
          S2 = sample(3000000, 100000)
          plot(x[S2],y[S2], pch=20, cex=0.4, col=rainbow(3, alpha=0.1)[z[S2]])


          Downsampled data



          This gives a different view that provides a similar, but not identical understanding of the two distributions.



          These are not magic, but I think that they are helpful.






          share|improve this answer









          $endgroup$






















            1














            $begingroup$

            Assuming you're using Python, the datashader module was created to effectively display very large number of points.



            I however recommend using the hvplot package instead as it includes datashader support and provides a pandas compatible API.



            # import modules
            import pandas as pd
            import hvplot.pandas

            # read your data into dataframe (or whatever source).
            df = pd.read_csv('large_file.csv')

            # plot using hvplot; normally df.plot
            df.hvplot.scatter('x_column', 'y_column', datashade=True')


            datashader in effect creates a series of images and only shows the data to the required resolution without over plotting. As you zoom in it updated the view with the refined detail.



            Note



            If you're reading in larger than RAM datasets you may want to check out dask as well.






            share|improve this answer









            $endgroup$

















              Your Answer








              StackExchange.ready(function()
              var channelOptions =
              tags: "".split(" "),
              id: "557"
              ;
              initTagRenderer("".split(" "), "".split(" "), channelOptions);

              StackExchange.using("externalEditor", function()
              // Have to fire editor after snippets, if snippets enabled
              if (StackExchange.settings.snippets.snippetsEnabled)
              StackExchange.using("snippets", function()
              createEditor();
              );

              else
              createEditor();

              );

              function createEditor()
              StackExchange.prepareEditor(
              heartbeatType: 'answer',
              autoActivateHeartbeat: false,
              convertImagesToLinks: false,
              noModals: true,
              showLowRepImageUploadWarning: true,
              reputationToPostImages: null,
              bindNavPrevention: true,
              postfix: "",
              imageUploader:
              brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
              contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/4.0/"u003ecc by-sa 4.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
              allowUrls: true
              ,
              onDemand: true,
              discardSelector: ".discard-answer"
              ,immediatelyShowMarkdownHelp:true
              );



              );














              draft saved

              draft discarded
















              StackExchange.ready(
              function ()
              StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f60555%2fbest-way-to-visualize-huge-amount-of-data%23new-answer', 'question_page');

              );

              Post as a guest















              Required, but never shown

























              2 Answers
              2






              active

              oldest

              votes








              2 Answers
              2






              active

              oldest

              votes









              active

              oldest

              votes






              active

              oldest

              votes









              2














              $begingroup$

              I have three suggestions that may help.



              1. Reduce the point size

              2. Make the points highly transparent

              3. Downsample the points

              Since you do not provide any sample data, I will use some random data to illustrate.



              ## The purpose of S1 is to intermix the two classes at random
              S1 = sample(3000000)
              x = c(rnorm(2000000, 0, 1), rnorm(1000000, 3,1))[S1]
              y = c(rnorm(2000000, 0, 1), rnorm(1000000, 3,1))[S1]
              z = c(rep(1,2000000), rep(2,1000000))[S1]

              plot(x,y, pch=20, col=rainbow(3)[z])


              Initial plot



              The base plot without any adjustments is not very nice. Let's apply suggestions 1 and 2.



              plot(x,y, pch=20, cex=0.4, col=rainbow(3, alpha=0.01)[z]) 


              Reduced point size and high transparency



              Reducing the point size and making the points highly transparent helps some. This gives a better idea of the overlap between the two distributions.



              If we downsample, we don't need quite as much transparency.



              ## The purpose of S2 is to downsample the data
              S2 = sample(3000000, 100000)
              plot(x[S2],y[S2], pch=20, cex=0.4, col=rainbow(3, alpha=0.1)[z[S2]])


              Downsampled data



              This gives a different view that provides a similar, but not identical understanding of the two distributions.



              These are not magic, but I think that they are helpful.






              share|improve this answer









              $endgroup$



















                2














                $begingroup$

                I have three suggestions that may help.



                1. Reduce the point size

                2. Make the points highly transparent

                3. Downsample the points

                Since you do not provide any sample data, I will use some random data to illustrate.



                ## The purpose of S1 is to intermix the two classes at random
                S1 = sample(3000000)
                x = c(rnorm(2000000, 0, 1), rnorm(1000000, 3,1))[S1]
                y = c(rnorm(2000000, 0, 1), rnorm(1000000, 3,1))[S1]
                z = c(rep(1,2000000), rep(2,1000000))[S1]

                plot(x,y, pch=20, col=rainbow(3)[z])


                Initial plot



                The base plot without any adjustments is not very nice. Let's apply suggestions 1 and 2.



                plot(x,y, pch=20, cex=0.4, col=rainbow(3, alpha=0.01)[z]) 


                Reduced point size and high transparency



                Reducing the point size and making the points highly transparent helps some. This gives a better idea of the overlap between the two distributions.



                If we downsample, we don't need quite as much transparency.



                ## The purpose of S2 is to downsample the data
                S2 = sample(3000000, 100000)
                plot(x[S2],y[S2], pch=20, cex=0.4, col=rainbow(3, alpha=0.1)[z[S2]])


                Downsampled data



                This gives a different view that provides a similar, but not identical understanding of the two distributions.



                These are not magic, but I think that they are helpful.






                share|improve this answer









                $endgroup$

















                  2














                  2










                  2







                  $begingroup$

                  I have three suggestions that may help.



                  1. Reduce the point size

                  2. Make the points highly transparent

                  3. Downsample the points

                  Since you do not provide any sample data, I will use some random data to illustrate.



                  ## The purpose of S1 is to intermix the two classes at random
                  S1 = sample(3000000)
                  x = c(rnorm(2000000, 0, 1), rnorm(1000000, 3,1))[S1]
                  y = c(rnorm(2000000, 0, 1), rnorm(1000000, 3,1))[S1]
                  z = c(rep(1,2000000), rep(2,1000000))[S1]

                  plot(x,y, pch=20, col=rainbow(3)[z])


                  Initial plot



                  The base plot without any adjustments is not very nice. Let's apply suggestions 1 and 2.



                  plot(x,y, pch=20, cex=0.4, col=rainbow(3, alpha=0.01)[z]) 


                  Reduced point size and high transparency



                  Reducing the point size and making the points highly transparent helps some. This gives a better idea of the overlap between the two distributions.



                  If we downsample, we don't need quite as much transparency.



                  ## The purpose of S2 is to downsample the data
                  S2 = sample(3000000, 100000)
                  plot(x[S2],y[S2], pch=20, cex=0.4, col=rainbow(3, alpha=0.1)[z[S2]])


                  Downsampled data



                  This gives a different view that provides a similar, but not identical understanding of the two distributions.



                  These are not magic, but I think that they are helpful.






                  share|improve this answer









                  $endgroup$



                  I have three suggestions that may help.



                  1. Reduce the point size

                  2. Make the points highly transparent

                  3. Downsample the points

                  Since you do not provide any sample data, I will use some random data to illustrate.



                  ## The purpose of S1 is to intermix the two classes at random
                  S1 = sample(3000000)
                  x = c(rnorm(2000000, 0, 1), rnorm(1000000, 3,1))[S1]
                  y = c(rnorm(2000000, 0, 1), rnorm(1000000, 3,1))[S1]
                  z = c(rep(1,2000000), rep(2,1000000))[S1]

                  plot(x,y, pch=20, col=rainbow(3)[z])


                  Initial plot



                  The base plot without any adjustments is not very nice. Let's apply suggestions 1 and 2.



                  plot(x,y, pch=20, cex=0.4, col=rainbow(3, alpha=0.01)[z]) 


                  Reduced point size and high transparency



                  Reducing the point size and making the points highly transparent helps some. This gives a better idea of the overlap between the two distributions.



                  If we downsample, we don't need quite as much transparency.



                  ## The purpose of S2 is to downsample the data
                  S2 = sample(3000000, 100000)
                  plot(x[S2],y[S2], pch=20, cex=0.4, col=rainbow(3, alpha=0.1)[z[S2]])


                  Downsampled data



                  This gives a different view that provides a similar, but not identical understanding of the two distributions.



                  These are not magic, but I think that they are helpful.







                  share|improve this answer












                  share|improve this answer



                  share|improve this answer










                  answered 5 hours ago









                  G5WG5W

                  2543 silver badges10 bronze badges




                  2543 silver badges10 bronze badges


























                      1














                      $begingroup$

                      Assuming you're using Python, the datashader module was created to effectively display very large number of points.



                      I however recommend using the hvplot package instead as it includes datashader support and provides a pandas compatible API.



                      # import modules
                      import pandas as pd
                      import hvplot.pandas

                      # read your data into dataframe (or whatever source).
                      df = pd.read_csv('large_file.csv')

                      # plot using hvplot; normally df.plot
                      df.hvplot.scatter('x_column', 'y_column', datashade=True')


                      datashader in effect creates a series of images and only shows the data to the required resolution without over plotting. As you zoom in it updated the view with the refined detail.



                      Note



                      If you're reading in larger than RAM datasets you may want to check out dask as well.






                      share|improve this answer









                      $endgroup$



















                        1














                        $begingroup$

                        Assuming you're using Python, the datashader module was created to effectively display very large number of points.



                        I however recommend using the hvplot package instead as it includes datashader support and provides a pandas compatible API.



                        # import modules
                        import pandas as pd
                        import hvplot.pandas

                        # read your data into dataframe (or whatever source).
                        df = pd.read_csv('large_file.csv')

                        # plot using hvplot; normally df.plot
                        df.hvplot.scatter('x_column', 'y_column', datashade=True')


                        datashader in effect creates a series of images and only shows the data to the required resolution without over plotting. As you zoom in it updated the view with the refined detail.



                        Note



                        If you're reading in larger than RAM datasets you may want to check out dask as well.






                        share|improve this answer









                        $endgroup$

















                          1














                          1










                          1







                          $begingroup$

                          Assuming you're using Python, the datashader module was created to effectively display very large number of points.



                          I however recommend using the hvplot package instead as it includes datashader support and provides a pandas compatible API.



                          # import modules
                          import pandas as pd
                          import hvplot.pandas

                          # read your data into dataframe (or whatever source).
                          df = pd.read_csv('large_file.csv')

                          # plot using hvplot; normally df.plot
                          df.hvplot.scatter('x_column', 'y_column', datashade=True')


                          datashader in effect creates a series of images and only shows the data to the required resolution without over plotting. As you zoom in it updated the view with the refined detail.



                          Note



                          If you're reading in larger than RAM datasets you may want to check out dask as well.






                          share|improve this answer









                          $endgroup$



                          Assuming you're using Python, the datashader module was created to effectively display very large number of points.



                          I however recommend using the hvplot package instead as it includes datashader support and provides a pandas compatible API.



                          # import modules
                          import pandas as pd
                          import hvplot.pandas

                          # read your data into dataframe (or whatever source).
                          df = pd.read_csv('large_file.csv')

                          # plot using hvplot; normally df.plot
                          df.hvplot.scatter('x_column', 'y_column', datashade=True')


                          datashader in effect creates a series of images and only shows the data to the required resolution without over plotting. As you zoom in it updated the view with the refined detail.



                          Note



                          If you're reading in larger than RAM datasets you may want to check out dask as well.







                          share|improve this answer












                          share|improve this answer



                          share|improve this answer










                          answered 3 hours ago









                          fswingsfswings

                          1212 bronze badges




                          1212 bronze badges































                              draft saved

                              draft discarded















































                              Thanks for contributing an answer to Data Science Stack Exchange!


                              • Please be sure to answer the question. Provide details and share your research!

                              But avoid


                              • Asking for help, clarification, or responding to other answers.

                              • Making statements based on opinion; back them up with references or personal experience.

                              Use MathJax to format equations. MathJax reference.


                              To learn more, see our tips on writing great answers.




                              draft saved


                              draft discarded














                              StackExchange.ready(
                              function ()
                              StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f60555%2fbest-way-to-visualize-huge-amount-of-data%23new-answer', 'question_page');

                              );

                              Post as a guest















                              Required, but never shown





















































                              Required, but never shown














                              Required, but never shown












                              Required, but never shown







                              Required, but never shown

































                              Required, but never shown














                              Required, but never shown












                              Required, but never shown







                              Required, but never shown







                              Popular posts from this blog

                              Canceling a color specificationRandomly assigning color to Graphics3D objects?Default color for Filling in Mathematica 9Coloring specific elements of sets with a prime modified order in an array plotHow to pick a color differing significantly from the colors already in a given color list?Detection of the text colorColor numbers based on their valueCan color schemes for use with ColorData include opacity specification?My dynamic color schemes

                              Invision Community Contents History See also References External links Navigation menuProprietaryinvisioncommunity.comIPS Community ForumsIPS Community Forumsthis blog entry"License Changes, IP.Board 3.4, and the Future""Interview -- Matt Mecham of Ibforums""CEO Invision Power Board, Matt Mecham Is a Liar, Thief!"IPB License Explanation 1.3, 1.3.1, 2.0, and 2.1ArchivedSecurity Fixes, Updates And Enhancements For IPB 1.3.1Archived"New Demo Accounts - Invision Power Services"the original"New Default Skin"the original"Invision Power Board 3.0.0 and Applications Released"the original"Archived copy"the original"Perpetual licenses being done away with""Release Notes - Invision Power Services""Introducing: IPS Community Suite 4!"Invision Community Release Notes

                              199年 目錄 大件事 到箇年出世嗰人 到箇年死嗰人 節慶、風俗習慣 導覽選單