Density scatter plot for huge dataset in matplotlib

Question

I wrote some code a while ago that used gaussian kde to make simple density scatter plots. However, for datasets larger than about 100,000 points, it just ran 'forever' (I killed it after a few days). A friend gave me some code in R that could create such a density plot in seconds (plot_fun.R), and it seems like matplotlib should be able to do the same thing.

I think the right place to look is 2d histograms, but I am struggling to get the density to be 'right'. I modified code I found at this question to accomplish this, but the density is not showing, it looks like only the densist posible points are getting any color.

Here is approximately the code I am using:

# initial data
x = -np.log10(np.random.random_sample(10000))
y = -np.log10(np.random.random_sample(10000))

#histogram definition
bins = [1000, 1000] # number of bins
thresh = 3  #density threshold

#data definition
mn = min(x.min(), y.min())
mx = max(x.max(), y.max())
mn = mn-(mn*.1)
mx = mx+(mx*.1)
xyrange = [[mn, mx], [mn, mx]]

# histogram the data
hh, locx, locy = np.histogram2d(x, y, range=xyrange, bins=bins)
posx = np.digitize(x, locx)
posy = np.digitize(y, locy)

#select points within the histogram
ind = (posx > 0) & (posx <= bins[0]) & (posy > 0) & (posy <= bins[1])
hhsub = hh[posx[ind] - 1, posy[ind] - 1] # values of the histogram where the points are
xdat1 = x[ind][hhsub < thresh] # low density points
ydat1 = y[ind][hhsub < thresh]
hh[hh < thresh] = np.nan # fill the areas with low density by NaNs

f, a = plt.subplots(figsize=(12,12))

c = a.imshow(
    np.flipud(hh.T), cmap='jet',
    extent=np.array(xyrange).flatten(), interpolation='none',
    origin='upper'
)
f.colorbar(c, ax=ax, orientation='vertical', shrink=0.75, pad=0.05)
s = a.scatter(
    xdat1, ydat1, color='darkblue', edgecolor='', label=None,
    picker=True, zorder=2
)

That produces this plot:

The KDE code is here:

f, a = plt.subplots(figsize=(12,12))
xy = np.vstack([x, y])
z = sts.gaussian_kde(xy)(xy)                                           
# Sort the points by density, so that the densest points are       
# plotted last                                                     
idx = z.argsort()                                                  
x2, y2, z = x[idx], y[idx], z[idx]                                 
s = a.scatter(                                                     
    x2, y2, c=z, s=50, cmap='jet',
    edgecolor='', label=None, picker=True, zorder=2                
)

That produces this plot:

The problem is, of course, that this code is unusable on large data sets.

My question is: how can I use the 2d histogram to produce a scatter plot like that? ax.hist2d does not produce a useful output, because it colors the whole plot, and all my efforts to get the above 2d histogram data to actually color the dense regions of the plot correctly have failed, I always end up with either no coloring or a tiny percentage of the densest points being colored. Clearly I just don't understand the code very well.

Using smaller marker ',' along with transparency alpha=0.1 (or any appropriate value) will give you similar visual info and should be very fast — Julien
I did, that is the first plot and set of code. My question is basically why it doesn't work, as you can see in the first plot, there is no density shading. — Mike Dacre

Julien Julien · Accepted Answer · 2018-04-06T01:38:16

Your histogram code assigns a unique color (color='darkblue') so what are you expecting? I think you are also over complicating things. This much simpler code works fine:

import numpy as np
import matplotlib.pyplot as plt

x, y = -np.log10(np.random.random_sample((2,10**6)))

#histogram definition
bins = [1000, 1000] # number of bins

# histogram the data
hh, locx, locy = np.histogram2d(x, y, bins=bins)

# Sort the points by density, so that the densest points are plotted last
z = np.array([hh[np.argmax(a<=locx[1:]),np.argmax(b<=locy[1:])] for a,b in zip(x,y)])
idx = z.argsort()
x2, y2, z2 = x[idx], y[idx], z[idx]

plt.figure(1,figsize=(8,8)).clf()
s = plt.scatter(x2, y2, c=z2, cmap='jet', marker='.')

Density scatter plot for huge dataset in matplotlib

1 Answers