Best way to visualize huge amount of dataMovieLens data setSklearn and PCA. Why is max n_row == max n_components?Pivot DataFrame while calculating new valuesBest way to represent data as features vectors in Pythonpanda grouping by month with transposeHow to measure correlation between several categorical features and a numerical label in Python?Histogram is extremely skewed to the leftPerform k-means clustering over multiple columnsProcessing csv file with more than 700K rows of data
Does the Way of Shadow monk's Shadow Step feature count as a magical ability?
Does every piano need tuning every year?
What are the consequences of high orphan block rate?
Why weren't the Death Star plans transmitted electronically?
Why are there two fundamental laws of logic?
what should be done first, handling missing data or dealing with data types?
Is there a recurrence relation which has no closed formula?
How do pilots align the HUD with their eyeballs?
How do you use the interjection for snorting?
A file manager to open a zip file like opening a folder, instead of extract it by using a archive manager
What is the meaning of word 'crack' in chapter 33 of A Game of Thrones?
Aesthetic proofs that involve Field Theory / Galois Theory
A famous scholar sent me an unpublished draft of hers. Then she died. I think her work should be published. What should I do?
What benefits does the Power Word Kill spell have?
Comma Code - Automate the Boring Stuff with Python
Is it acceptable to say that a reviewer's concern is not going to be addressed because then the paper would be too long?
Cut a cake into 3 equal portions with only a knife
Why does C++ have 'Undefined Behaviour' and other languages like C# or Java don't?
Does wetting a beer glass change the foam characteristics?
How to justify a team increase when the team is doing good?
Do I have advantage with Riposte when moving away from a flanked enemy and triggering an opportunity attack?
Do wheelchair aircraft exist?
Why, even after his imprisonment, people keep calling Hannibal Lecter "Doctor"?
I nicked the tip of the taper on a bottom bracket spindle. Is it still safe?
Best way to visualize huge amount of data
MovieLens data setSklearn and PCA. Why is max n_row == max n_components?Pivot DataFrame while calculating new valuesBest way to represent data as features vectors in Pythonpanda grouping by month with transposeHow to measure correlation between several categorical features and a numerical label in Python?Histogram is extremely skewed to the leftPerform k-means clustering over multiple columnsProcessing csv file with more than 700K rows of data
.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty margin-bottom:0;
$begingroup$
I have a data set of around 3M row. I has only 2 category (category- 2:1 ratio). Now i want to visualize(scatter plot) it's distribution to understand can the data linearly separable or not(In order to choose model type).I already try this and the plot is not understandable. What will be the best way to visualize this data set?
pandas matplotlib seaborn
$endgroup$
add a comment
|
$begingroup$
I have a data set of around 3M row. I has only 2 category (category- 2:1 ratio). Now i want to visualize(scatter plot) it's distribution to understand can the data linearly separable or not(In order to choose model type).I already try this and the plot is not understandable. What will be the best way to visualize this data set?
pandas matplotlib seaborn
$endgroup$
add a comment
|
$begingroup$
I have a data set of around 3M row. I has only 2 category (category- 2:1 ratio). Now i want to visualize(scatter plot) it's distribution to understand can the data linearly separable or not(In order to choose model type).I already try this and the plot is not understandable. What will be the best way to visualize this data set?
pandas matplotlib seaborn
$endgroup$
I have a data set of around 3M row. I has only 2 category (category- 2:1 ratio). Now i want to visualize(scatter plot) it's distribution to understand can the data linearly separable or not(In order to choose model type).I already try this and the plot is not understandable. What will be the best way to visualize this data set?
pandas matplotlib seaborn
pandas matplotlib seaborn
asked 10 hours ago
Taimur IslamTaimur Islam
2561 gold badge4 silver badges11 bronze badges
2561 gold badge4 silver badges11 bronze badges
add a comment
|
add a comment
|
2 Answers
2
active
oldest
votes
$begingroup$
I have three suggestions that may help.
- Reduce the point size
- Make the points highly transparent
- Downsample the points
Since you do not provide any sample data, I will use some random data to illustrate.
## The purpose of S1 is to intermix the two classes at random
S1 = sample(3000000)
x = c(rnorm(2000000, 0, 1), rnorm(1000000, 3,1))[S1]
y = c(rnorm(2000000, 0, 1), rnorm(1000000, 3,1))[S1]
z = c(rep(1,2000000), rep(2,1000000))[S1]
plot(x,y, pch=20, col=rainbow(3)[z])
The base plot without any adjustments is not very nice. Let's apply suggestions 1 and 2.
plot(x,y, pch=20, cex=0.4, col=rainbow(3, alpha=0.01)[z])
Reducing the point size and making the points highly transparent helps some. This gives a better idea of the overlap between the two distributions.
If we downsample, we don't need quite as much transparency.
## The purpose of S2 is to downsample the data
S2 = sample(3000000, 100000)
plot(x[S2],y[S2], pch=20, cex=0.4, col=rainbow(3, alpha=0.1)[z[S2]])
This gives a different view that provides a similar, but not identical understanding of the two distributions.
These are not magic, but I think that they are helpful.
$endgroup$
add a comment
|
$begingroup$
Assuming you're using Python, the datashader
module was created to effectively display very large number of points.
I however recommend using the hvplot
package instead as it includes datashader
support and provides a pandas
compatible API.
# import modules
import pandas as pd
import hvplot.pandas
# read your data into dataframe (or whatever source).
df = pd.read_csv('large_file.csv')
# plot using hvplot; normally df.plot
df.hvplot.scatter('x_column', 'y_column', datashade=True')
datashader
in effect creates a series of images and only shows the data to the required resolution without over plotting. As you zoom in it updated the view with the refined detail.
Note
If you're reading in larger than RAM datasets you may want to check out dask
as well.
$endgroup$
add a comment
|
Your Answer
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "557"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/4.0/"u003ecc by-sa 4.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f60555%2fbest-way-to-visualize-huge-amount-of-data%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
$begingroup$
I have three suggestions that may help.
- Reduce the point size
- Make the points highly transparent
- Downsample the points
Since you do not provide any sample data, I will use some random data to illustrate.
## The purpose of S1 is to intermix the two classes at random
S1 = sample(3000000)
x = c(rnorm(2000000, 0, 1), rnorm(1000000, 3,1))[S1]
y = c(rnorm(2000000, 0, 1), rnorm(1000000, 3,1))[S1]
z = c(rep(1,2000000), rep(2,1000000))[S1]
plot(x,y, pch=20, col=rainbow(3)[z])
The base plot without any adjustments is not very nice. Let's apply suggestions 1 and 2.
plot(x,y, pch=20, cex=0.4, col=rainbow(3, alpha=0.01)[z])
Reducing the point size and making the points highly transparent helps some. This gives a better idea of the overlap between the two distributions.
If we downsample, we don't need quite as much transparency.
## The purpose of S2 is to downsample the data
S2 = sample(3000000, 100000)
plot(x[S2],y[S2], pch=20, cex=0.4, col=rainbow(3, alpha=0.1)[z[S2]])
This gives a different view that provides a similar, but not identical understanding of the two distributions.
These are not magic, but I think that they are helpful.
$endgroup$
add a comment
|
$begingroup$
I have three suggestions that may help.
- Reduce the point size
- Make the points highly transparent
- Downsample the points
Since you do not provide any sample data, I will use some random data to illustrate.
## The purpose of S1 is to intermix the two classes at random
S1 = sample(3000000)
x = c(rnorm(2000000, 0, 1), rnorm(1000000, 3,1))[S1]
y = c(rnorm(2000000, 0, 1), rnorm(1000000, 3,1))[S1]
z = c(rep(1,2000000), rep(2,1000000))[S1]
plot(x,y, pch=20, col=rainbow(3)[z])
The base plot without any adjustments is not very nice. Let's apply suggestions 1 and 2.
plot(x,y, pch=20, cex=0.4, col=rainbow(3, alpha=0.01)[z])
Reducing the point size and making the points highly transparent helps some. This gives a better idea of the overlap between the two distributions.
If we downsample, we don't need quite as much transparency.
## The purpose of S2 is to downsample the data
S2 = sample(3000000, 100000)
plot(x[S2],y[S2], pch=20, cex=0.4, col=rainbow(3, alpha=0.1)[z[S2]])
This gives a different view that provides a similar, but not identical understanding of the two distributions.
These are not magic, but I think that they are helpful.
$endgroup$
add a comment
|
$begingroup$
I have three suggestions that may help.
- Reduce the point size
- Make the points highly transparent
- Downsample the points
Since you do not provide any sample data, I will use some random data to illustrate.
## The purpose of S1 is to intermix the two classes at random
S1 = sample(3000000)
x = c(rnorm(2000000, 0, 1), rnorm(1000000, 3,1))[S1]
y = c(rnorm(2000000, 0, 1), rnorm(1000000, 3,1))[S1]
z = c(rep(1,2000000), rep(2,1000000))[S1]
plot(x,y, pch=20, col=rainbow(3)[z])
The base plot without any adjustments is not very nice. Let's apply suggestions 1 and 2.
plot(x,y, pch=20, cex=0.4, col=rainbow(3, alpha=0.01)[z])
Reducing the point size and making the points highly transparent helps some. This gives a better idea of the overlap between the two distributions.
If we downsample, we don't need quite as much transparency.
## The purpose of S2 is to downsample the data
S2 = sample(3000000, 100000)
plot(x[S2],y[S2], pch=20, cex=0.4, col=rainbow(3, alpha=0.1)[z[S2]])
This gives a different view that provides a similar, but not identical understanding of the two distributions.
These are not magic, but I think that they are helpful.
$endgroup$
I have three suggestions that may help.
- Reduce the point size
- Make the points highly transparent
- Downsample the points
Since you do not provide any sample data, I will use some random data to illustrate.
## The purpose of S1 is to intermix the two classes at random
S1 = sample(3000000)
x = c(rnorm(2000000, 0, 1), rnorm(1000000, 3,1))[S1]
y = c(rnorm(2000000, 0, 1), rnorm(1000000, 3,1))[S1]
z = c(rep(1,2000000), rep(2,1000000))[S1]
plot(x,y, pch=20, col=rainbow(3)[z])
The base plot without any adjustments is not very nice. Let's apply suggestions 1 and 2.
plot(x,y, pch=20, cex=0.4, col=rainbow(3, alpha=0.01)[z])
Reducing the point size and making the points highly transparent helps some. This gives a better idea of the overlap between the two distributions.
If we downsample, we don't need quite as much transparency.
## The purpose of S2 is to downsample the data
S2 = sample(3000000, 100000)
plot(x[S2],y[S2], pch=20, cex=0.4, col=rainbow(3, alpha=0.1)[z[S2]])
This gives a different view that provides a similar, but not identical understanding of the two distributions.
These are not magic, but I think that they are helpful.
answered 5 hours ago
G5WG5W
2543 silver badges10 bronze badges
2543 silver badges10 bronze badges
add a comment
|
add a comment
|
$begingroup$
Assuming you're using Python, the datashader
module was created to effectively display very large number of points.
I however recommend using the hvplot
package instead as it includes datashader
support and provides a pandas
compatible API.
# import modules
import pandas as pd
import hvplot.pandas
# read your data into dataframe (or whatever source).
df = pd.read_csv('large_file.csv')
# plot using hvplot; normally df.plot
df.hvplot.scatter('x_column', 'y_column', datashade=True')
datashader
in effect creates a series of images and only shows the data to the required resolution without over plotting. As you zoom in it updated the view with the refined detail.
Note
If you're reading in larger than RAM datasets you may want to check out dask
as well.
$endgroup$
add a comment
|
$begingroup$
Assuming you're using Python, the datashader
module was created to effectively display very large number of points.
I however recommend using the hvplot
package instead as it includes datashader
support and provides a pandas
compatible API.
# import modules
import pandas as pd
import hvplot.pandas
# read your data into dataframe (or whatever source).
df = pd.read_csv('large_file.csv')
# plot using hvplot; normally df.plot
df.hvplot.scatter('x_column', 'y_column', datashade=True')
datashader
in effect creates a series of images and only shows the data to the required resolution without over plotting. As you zoom in it updated the view with the refined detail.
Note
If you're reading in larger than RAM datasets you may want to check out dask
as well.
$endgroup$
add a comment
|
$begingroup$
Assuming you're using Python, the datashader
module was created to effectively display very large number of points.
I however recommend using the hvplot
package instead as it includes datashader
support and provides a pandas
compatible API.
# import modules
import pandas as pd
import hvplot.pandas
# read your data into dataframe (or whatever source).
df = pd.read_csv('large_file.csv')
# plot using hvplot; normally df.plot
df.hvplot.scatter('x_column', 'y_column', datashade=True')
datashader
in effect creates a series of images and only shows the data to the required resolution without over plotting. As you zoom in it updated the view with the refined detail.
Note
If you're reading in larger than RAM datasets you may want to check out dask
as well.
$endgroup$
Assuming you're using Python, the datashader
module was created to effectively display very large number of points.
I however recommend using the hvplot
package instead as it includes datashader
support and provides a pandas
compatible API.
# import modules
import pandas as pd
import hvplot.pandas
# read your data into dataframe (or whatever source).
df = pd.read_csv('large_file.csv')
# plot using hvplot; normally df.plot
df.hvplot.scatter('x_column', 'y_column', datashade=True')
datashader
in effect creates a series of images and only shows the data to the required resolution without over plotting. As you zoom in it updated the view with the refined detail.
Note
If you're reading in larger than RAM datasets you may want to check out dask
as well.
answered 3 hours ago
fswingsfswings
1212 bronze badges
1212 bronze badges
add a comment
|
add a comment
|
Thanks for contributing an answer to Data Science Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
Use MathJax to format equations. MathJax reference.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f60555%2fbest-way-to-visualize-huge-amount-of-data%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown