I am trying to use scikit-learn's DBSCAN implementation for anomaly detection. It works fine on a small dataset (500 x 6), but it runs out of memory when I apply it to a larger one (180000 x 24). Is there something I can do to overcome this issue?
from sklearn.cluster import DBSCAN
import pandas as pd
from sklearn.preprocessing import StandardScaler
import numpy as np
data = pd.read_csv("dataset.csv")
# Drop non-continuous variables
data.drop(["x1", "x2"], axis = 1, inplace = True)
df = data
data = df.to_numpy().astype("float32", copy = False)
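# Standardize features to zero mean and unit variance before clustering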
stscaler = StandardScaler().fit(data)
data = stscaler.transform(data)
print "Dataset size:", df.shape
dbsc = DBSCAN(eps = 3, min_samples = 30).fit(data)
labels = dbsc.labels_
core_samples = np.zeros_like(labels, dtype = bool)
core_samples[dbsc.core_sample_indices_] = True
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print('Estimated number of clusters: %d' % n_clusters_)
df['Labels'] = labels.tolist()
#print(df.head(10))
print("Number of anomalies:", -1 * (df[df.Labels < 0]['Labels'].sum()))