
How do I create a correlation matrix for PCA in Python? Below, I create a DataFrame of the eigenvector loadings via pca.components_, but I do not know how to create the actual correlation matrix (i.e. how correlated each of the underlying variables is with the principal components). Any clues?

Further, I have noticed that many of these eigenvector loadings are negative in Python. I am trying to replicate a study conducted in Stata, and curiously the Python loadings come out negative where the Stata correlations are positive (please see the attached correlation matrix image that I am attempting to replicate in Python). This is just something that I have noticed - what is going on here?

Stata-Created Correlation Matrix

Thanks in advance.

import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
from dateutil.relativedelta import relativedelta
import blpinterface.blp_interface as blp
from scipy.stats import zscore
from sklearn.decomposition import PCA

#Set dates for analysis
startDate = "20000101"

#Construct tickers for analysis
tickers = ["USGG2YR Index", "USGG5YR Index", "USGG10YR Index", "USGG30YR Index", "USGGT10Y Index", ".30YREAL Index",
       "USGGBE10 Index", "USGGBE30 Index", ".RATEVOL1 Index", ".RATEVOL2 Index", "SPX Index", "S5INDU Index", "S5CONS Index", "VIX Index",
       ".DMFX Index", ".EMFX Index", "CL1 Comdty", "HG1 Comdty", "XAU Curncy"]

#Begin dataframe construction
mgr = blp.BLPInterface()

df = mgr.historicalRequest(tickers, "PX_LAST", startDate, "20160317")
df = df.dropna()
df = df.apply(zscore)

#Conduct PCA analysis
pca = PCA(n_components=3)
pca.fit(df) #Estimates the eigenvectors of the standardized data for all tickers, with history back to 2000
print(pd.DataFrame(pca.components_, columns=df.columns, index=["PC1", "PC2", "PC3"]).transpose()) #Eigenvector loadings, sorted from highest explained variance to lowest
print(pca.explained_variance_) #Eigenvalues (variance of the data projected onto each eigenvector)
print(pca.explained_variance_ratio_) #Explained variance ratio (i.e. how much of the variation in the time series is explained by each principal component); eigenvalue divided by the sum of all eigenvalues

#Project data onto the above loadings for each row in the time series
outputpca = pd.DataFrame(pca.transform(df), columns=["PC1", "PC2", "PC3"], index=df.index)
print(outputpca) #Principal component time series, projecting the data onto the above loadings; this is the sum product of the data and the eigenvector loadings for all three PCs for each row
outputpca.plot(title="Principal Components")
plt.show()
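
For reference, as far as I understand pca.transform simply centers the data on pca.mean_ (essentially zero here, since everything is z-scored) and takes the dot product with the loadings, which can be sanity-checked as follows:

#Sanity check: the transform should match the centered data dotted with the eigenvector loadings
manual = np.dot(df.values - pca.mean_, pca.components_.T)
print(np.allclose(manual, pca.transform(df))) #Prints True if the manual projection matches pca.transform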

1 Answer


You can use the correlation function available in the numpy module. Example:

cor_mat1 = np.corrcoef(X_std.T) #Correlation matrix of the standardized data (X_std: observations x variables)
eig_vals, eig_vecs = np.linalg.eig(cor_mat1) #Eigendecomposition of the correlation matrix
print('Eigenvectors \n%s' %eig_vecs)
print('\nEigenvalues \n%s' %eig_vals)
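
If what you are after is specifically the matrix in the Stata image, i.e. the correlation of each input series with the principal component score series, here is a minimal sketch using the df, pca and outputpca objects already built in the question (the variable names are taken from the question; the approach is just np.corrcoef column by column):

corr = pd.DataFrame([[np.corrcoef(df[t], outputpca[pc])[0, 1] for pc in outputpca.columns]
                     for t in df.columns],
                    index=df.columns, columns=outputpca.columns)
print(corr) #Rows are the input series, columns are PC1/PC2/PC3, entries are Pearson correlations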

This link presents an application using the correlation matrix in PCA.
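
On the sign question in the original post: eigenvectors (and hence PCA loadings) are only determined up to an overall sign, so Stata and numpy/sklearn can return the same component with opposite signs without either being wrong. Below is a minimal sketch of one common convention, flipping each component so that its largest-magnitude loading is positive; the convention itself is an illustrative choice, not something either library enforces:

loadings = pd.DataFrame(pca.components_, columns=df.columns, index=["PC1", "PC2", "PC3"]).transpose()
for pc in loadings.columns:
    if loadings[pc].loc[loadings[pc].abs().idxmax()] < 0: #Largest-magnitude loading is negative
        loadings[pc] = -loadings[pc] #Flip the whole component so its dominant loading is positive
print(loadings)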