I've read through the previously posted questions about these libraries, but they all seem to either involve training/test data, or only ask about PCA() or StandardScaler() in isolation. I'm confused about the difference between them and want to make sure I'm processing my data correctly.
I have a dataset of neuron signals over time, and I want each time point to be a point on the scatterplot. So each neuron is a column, and time is the index. Here is basically what I did:
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler as ss
def get_pca(df: pd.DataFrame):
    # Scale so that mean = 0 and standard deviation = 1.
    df = ss().fit_transform(df)
    # Need to fit_transform again?
    pca = PCA()
    df = pca.fit_transform(df)
    # Pull out the percentage of variance explained by each component.
    variance = np.round(
        pca.explained_variance_ratio_ * 100, decimals=1)
    labels = ['PC' + str(x) for x in range(1, len(variance) + 1)]
    df = pd.DataFrame(df, columns=labels)
    return df, variance, labels
# Make a dummy dataframe as a reproducible example
neurons = list('ABCD')
df = pd.DataFrame(
    np.random.randint(0, 100, size=(15, 4)),
    columns=neurons)
df, variance, labels = get_pca(df)
I can't figure out what this is actually doing to my data: why I'm calling StandardScaler().fit_transform and then PCA().fit_transform again, and whether this is the proper method for what I'm trying to achieve. Can anyone give me some insight into this process?
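For context, here is a small standalone sanity check I put together (with its own random data, not my real neuron recordings), based on my current understanding: StandardScaler's fit_transform should make each column mean 0 / std 1, and PCA's fit_transform should then just rotate those scaled points into the component axes rather than scale them again. Please correct me if this reading is wrong:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.integers(0, 100, size=(15, 4)).astype(float)

# Step 1: StandardScaler.fit_transform learns each column's mean and
# standard deviation, then returns the standardized data.
scaled = StandardScaler().fit_transform(X)
print(scaled.mean(axis=0).round(6))  # ~ [0. 0. 0. 0.]
print(scaled.std(axis=0).round(6))   # ~ [1. 1. 1. 1.]

# Step 2: PCA.fit_transform learns the principal axes of the scaled
# data and returns the coordinates of each row in that rotated basis.
pca = PCA()
scores = pca.fit_transform(scaled)
print(scores.shape)                         # (15, 4): same points, new axes
print(pca.explained_variance_ratio_.sum())  # ratios sum to ~1.0 with all PCs kept
```

So the two fit_transform calls are doing different things (standardizing vs. projecting), if I understand correctly, but I'd still like confirmation that chaining them like this is right for my time-point scatterplot.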