
I want to use sklearn for PCA (followed by regression and k-means clustering). I have a dataset with 20k features and 2000k rows. However, for each row only a small subset of the features (typically about 5 of the 20k) has been measured.

How should I pad my pandas DataFrame or set up sklearn so that it does not use a feature for the rows where that feature has not been measured? (E.g. if I set null feature values to 0.0, would this distort the outcome?)

eg:

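from sklearn.decomposition import PCA

# array: 2D NumPy array; the first n columns are features, column n is the target
X = array[:, 0:n]
Y = array[:, n]
pca = PCA()
fit = pca.fit(X)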

If the dataset is padded with zeros for most feature values, will the PCA still be valid?

What are the features, and why are they null? If they're like term frequencies from a text document, then they should be zero, not null, and this is still a fine scenario for PCA. If they're continuous values from sensors, then maybe you want to impute them. - maxymoo
The features are physical analyses, each consisting of a single float value per analysis type (e.g. hardness, element concentration, colour, etc.), but for each row only some of them are measured. The values cannot be imputed. - Don Smythe
Well, sklearn can't deal with data containing nulls, so you'll have to do something with them. If you set them to zero when they wouldn't have been zero had you measured them, then yes, it will definitely distort the outcome. Maybe you could use decision trees to impute the null values? - maxymoo
What do your features represent? - MMF
It's probably not going to solve your problem, but you could use TruncatedSVD, another PCA-like decomposition method that accepts sparse inputs. It would work with your data, but it probably won't do what you expect. - Imanol Luengo
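A minimal sketch of the TruncatedSVD suggestion, on made-up toy data rather than the asker's dataset. The dimensions, the random values, and the choice of 10 components are all assumptions for illustration. It stores only the measured entries in a SciPy CSR matrix and passes that to sklearn's TruncatedSVD.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)

# Hypothetical toy dimensions; the real data would be about 2,000,000 x 20,000.
n_rows, n_features, n_measured = 1000, 200, 5

# For each row, pick which features were measured and draw a value for each.
row_idx = np.repeat(np.arange(n_rows), n_measured)
col_idx = np.concatenate(
    [rng.choice(n_features, size=n_measured, replace=False) for _ in range(n_rows)]
)
values = rng.normal(loc=10.0, scale=2.0, size=n_rows * n_measured)

# Only measured entries are stored; every other entry is an implicit 0.0.
X = sparse.csr_matrix((values, (row_idx, col_idx)), shape=(n_rows, n_features))

# TruncatedSVD accepts sparse input directly and never densifies it.
svd = TruncatedSVD(n_components=10, random_state=0)
X_reduced = svd.fit_transform(X)

print(X_reduced.shape)
print(svd.explained_variance_ratio_.sum())
```

Note that this only solves the memory problem. TruncatedSVD does not center the data and treats every unstored entry as a true zero, so the distortion from zero padding discussed above still applies.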

1 Answer


I see 3 options, although none of them fully solves your problem:

1) Replace the null values with 0, but that will definitely worsen your results.

2) Replace the unknown values with the mean or median of each feature. This might be better, but it will still give you a distorted PCA.

3) As a last option, don't use PCA, and search instead for a dimensionality reduction technique designed for sparse data.
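To make option 2 concrete, here is a small sketch using sklearn's SimpleImputer followed by PCA. The data is synthetic, and the 75% missing rate is an assumption chosen so the example stays small (the real dataset is far sparser). It also computes the average per-feature variance before and after imputation, which is one way to see the distortion.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(0)

# Hypothetical toy data: 500 rows, 20 features, fully observed at first.
X_full = rng.normal(loc=5.0, scale=2.0, size=(500, 20))

# Hide about 75% of the entries to mimic unmeasured values.
mask_missing = rng.random(X_full.shape) < 0.75
X_missing = X_full.copy()
X_missing[mask_missing] = np.nan

# Option 2: replace each NaN with the mean of the measured values in its column.
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X_missing)

pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X_imputed)

# Imputed entries all sit exactly on the column mean, so the per-feature
# variance shrinks compared with the fully observed data.
var_full = X_full.var(axis=0).mean()
var_imputed = X_imputed.var(axis=0).mean()
print(var_full, var_imputed)
```

Because every imputed entry sits on the column mean, the variance that PCA sees is dominated by the few measured values, and the effect grows as the data gets sparser. At the scale in the question, a dense 2,000,000 x 20,000 float64 array would also need roughly 320 GB, so this approach is mainly useful for experimenting on a subsample.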