I've read through the previously posted questions about these classes, but they all seem to either involve training/test data, or only ask about PCA() or StandardScaler() on its own. I'm confused about the difference and really want to make sure I'm processing my data correctly.

I have a dataset of neuron signals over time, and I want each time-point to be a point on a scatter plot (see the plotting sketch after the code below). Each neuron is a column and time is the index. Here is basically what I did:

import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler as ss

def get_pca(df: pd.DataFrame):

    # Scale each column so that mean = 0 and standard deviation = 1.
    # (fit_transform returns a NumPy array, not a DataFrame)
    df = ss().fit_transform(df)

    # Need to fit_transform again? 
    pca = PCA()
    df = pca.fit_transform(df)

    # Pull out percentage of variance explained
    variance = np.round(
        pca.explained_variance_ratio_ * 100, decimals=1)
    labels = ['PC' + str(x) for x in range(1, len(variance) + 1)]
    
    df = pd.DataFrame(df, columns=labels)

    return df, variance, labels

# Make dummy dataframe as reproducible example
neurons = list('ABCD')

df = pd.DataFrame(
    np.random.randint(0, 100, size=(15, 4)),
    columns=neurons)

df, loading_scores, components = get_pca(df)
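
For context, this is the kind of scatter plot I'm after, with each row of the PCA output (one time-point) as a single point. This is just a sketch assuming matplotlib; the PC1/PC2 column names come from the labels built in get_pca():

import matplotlib.pyplot as plt

# Each row of the transformed DataFrame is one time-point,
# plotted on the first two principal components.
fig, ax = plt.subplots()
ax.scatter(df['PC1'], df['PC2'])
ax.set_xlabel(f'PC1 ({loading_scores[0]}%)')
ax.set_ylabel(f'PC2 ({loading_scores[1]}%)')
plt.show()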

I can't figure out what I'm actually doing to this data: why I'm calling fit_transform on StandardScaler() and then again on PCA(), and whether this is the proper method for what I'm trying to achieve. Can anyone give me some insight into this process?
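
For reference, I assume the same two steps could be chained with sklearn's Pipeline instead of calling fit_transform twice by hand. Here is a sketch of what I mean (raw_df is a hypothetical name for the original neurons-by-time DataFrame, before get_pca() is applied), though I'm not sure whether that changes the answer:

from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Sketch: chain scaling and PCA; I assume this is equivalent to
# calling fit_transform on each estimator in sequence.
pipe = Pipeline([('scale', StandardScaler()), ('pca', PCA())])
scores = pipe.fit_transform(raw_df)  # raw_df = original data (hypothetical name)

# The fitted PCA step still exposes the explained variance ratios.
variance = pipe.named_steps['pca'].explained_variance_ratio_ * 100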