Don't understand the output of Principal Component Analysis (PCA) in Python

Question

I did a PCA in Python on audio spectrograms and face the following problem: I have a matrix, where each row consists of flattened song features. After applying PCA it's clear to me, that the dimensions are reduced. BUT I can't find those dimensional data in the regular dataset.

import sys
import glob

from scipy.io.wavfile import read
from scipy import signal
from scipy.fftpack import fft
import numpy as np
import matplotlib.pyplot as plt
import pylab

# Read file to get samplerate and numpy array containing the signal 

files = glob.glob('../some/*.wav')

song_list = []

for wav in files:

    (fs, x) = read(wav)

    channels = [
        np.array(x[:, 0]),
        np.array(x[:, 1])
    ]

    # Combine channels to make a mono signal out of stereo
    channel =  np.mean(channels, axis=0)
    channel = channel[0:1024,]
    # Generate spectrogram 
    ## Freqs is the same with different songs, t differs slightly
    Pxx, freqs, t, plot = pylab.specgram(
        channel,
        NFFT=128, 
        Fs=44100, 
        detrend=pylab.detrend_none,
        window=pylab.window_hanning,
        noverlap=int(128 * 0.5))
    # Magnitude Spectrum to use
    Pxx = Pxx[0:2]
    X_flat = Pxx.flatten()
    song_list.append(X_flat)

song_matrix = np.vstack(song_list)

If I now apply PCA to the song_matrix...

import matplotlib
from matplotlib.mlab import PCA
from sklearn import decomposition


#test = matplotlib.mlab.PCA(song_matrix.T)

pca = decomposition.PCA(n_components=2)
song_matrix_pca = pca.fit_transform(song_matrix.T)


pca.components_ #These components should be most helpful to discriminate between the songs due to their high variance
pca.components_

...the final 2 components are the following: Final components - two dimensions from 15 wav-files The problem is, that I can't find those two vectors in the original dataset with all dimensions What am I doing wrong or am I misinterpreting the whole thing?

user2867432 user2867432 · Accepted Answer · 2015-10-19T23:54:08

PCA doesn't give you the vectors in your dataset. From Wikipedia : Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components.

Don't understand the output of Principal Component Analysis (PCA) in Python

2 Answers