
I am using principal component analysis (PCA) in R, based on ~30 variables, to compose an index that classifies individuals into 3 different categories (top, middle, bottom).

I have a dataframe of ~2000 individuals with 28 binary and 2 continuous variables.

Now, I would like to use the loading factors from PC1 to construct an index that classifies my 2000 individuals into 3 different groups based on these 30 variables.

Problem: Despite extensive research, I could not find out how to extract the loading factors from PCA_loadings and use them to give each individual a score (based on the loadings of the 30 variables), which would then allow me to rank the individuals for further classification. Also, does it make sense to display the loading factors in a graph?

  1. I've performed the following steps:

a) Ran a PCA using PCA_outcome <- prcomp(na.omit(df1), scale = T)

b) Extracted the loadings using PCA_loadings <- PCA_outcome$rotation

c) Removed all the variables for which the loading factors were close to 0.

  2. I have considered creating 30 new variables, one for each loading factor, which I would sum up wherever a binary variable == 1 (though I am not sure how to proceed with the continuous variables). This would give each individual a score. However, I do not know how to combine the 30 loading-factor values into a single score for each individual.

R code

df1 <- read.table(text=" 
          educ     call      house  merge_id    school  members       
A           1        0          1      12_3        0      0.9
B           0        0          0      13_3        1      0.8
C           1        1          1      14_3        0      1.1
D           0        0          0      15_3        1      0.8 
E           1        1          1      16_3        3      3.2", header=T)


## Run PCA
PCA_outcome <- prcomp(na.omit(df1), scale = T)

## Extract loadings
PCA_loadings <- PCA_outcome$rotation


## Explanation: A-E are 5 of the 2000 individuals and the variables (education, call, house, school, members) represent my 30 variables (binary and continuous).

Expected results:

- Get a rank score for each individual
- Subsequently, assign a category 1-3 to each individual

Do you have to use PCA? If you wanted to divide your individuals into three groups why not use a clustering approach, like k-means with k = 3? - Lyngbakr
Unfortunately, I have to. - msgh
I'm not sure I understand your question. If you want the PC score for PC1 for each individual, you can use PCA_outcome$x[, 1]. - Lyngbakr
Thank you very much for your reply @Lyngbakr. Is the PC score equivalent to an index? If yes, how is this PC score assembled? Simply by summing up the loading factors for all variables for each individual? The issue I have is that the data frame I use to run the PCA only contains information on households. However, I would need to merge each household with another dataset for individuals (to rank individuals according to their household scores). Thus, I need a merge_id in my PCA data frame. Is there a way to perform the PCA while keeping the merge_id in my data frame (see edited df above)? - msgh
"Is the PC score equivalent to an index?" - what I mean by this is: If the variables selected for the PCA indicated individuals' socio-economic status, would the PC give me a ranking for socio-economic status for each individual? - msgh

1 Answer


I'm not 100% sure what you're asking, but here's an answer to the question I think you're asking.

First of all, PC1 of a PCA won't necessarily provide you with an index of socio-economic status. As explained here, PC1 simply "accounts for as much of the variability in the data as possible". PC1 may well work as a good metric for socio-economic status for your data set, but you'll have to critically examine the loadings and see if this makes sense. Depending on the signs of the loadings, it could be that a very negative PC1 corresponds to a very positive socio-economic status. As I say: look at the results with a critical eye. An explanation of how PC scores are calculated can be found here. Anyway, that's a discussion that belongs on Cross Validated, so let's get to the code.
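To make the score calculation concrete, here is a minimal sketch on made-up toy data (the data frame `X` and the names `w`, `Z` and `manual_pc1` are my own, not from your data). It assumes the PCA was run with centring and scaling, as in your call. Each individual's PC1 score is then the weighted sum of their centred and scaled values, with the PC1 loadings as weights, so it is not simply the sum of the loadings for the variables that equal 1.

```r
# Toy data (made up for illustration): 3 binary and 1 continuous variable
X <- data.frame(
  educ    = c(1, 0, 1, 0, 1),
  call    = c(0, 0, 1, 0, 1),
  house   = c(1, 0, 1, 0, 1),
  members = c(0.9, 0.8, 1.1, 0.8, 3.2)
)

pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Loadings (weights) of each variable on PC1
w <- pca$rotation[, 1]

# Centre and scale the data with the same values prcomp() used
Z <- scale(X, center = pca$center, scale = pca$scale)

# Score = weighted sum of the standardised values, one number per individual
manual_pc1 <- as.vector(Z %*% w)

# Compare with the scores stored by prcomp(); this should be TRUE
all.equal(manual_pc1, unname(pca$x[, 1]))
```

Because binary and continuous variables are all standardised first, they enter the sum in the same way, so the continuous variables need no special treatment.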

It sounds like you want to perform the PCA, pull out PC1, and associate it with your original data frame (and merge_ids). If that's your goal, here's a solution.

# Create data frame
df <- read.table(text = "educ     call      house  merge_id    school  members       
A           1        0          1      12_3        0      0.9
B           0        0          0      13_3        1      0.8
C           1        1          1      14_3        0      1.1
D           0        0          0      15_3        1      0.8 
E           1        1          1      16_3        3      3.2", header = TRUE)

# Perform PCA
PCA <- prcomp(df[, names(df) != "merge_id"], scale = TRUE, center = TRUE)

# Add PC1
df$PC1 <- PCA$x[, 1]

# Look at new data frame
print(df)
#>   educ call house merge_id school members        PC1
#> A    1    0     1     12_3      0     0.9  0.1000145
#> B    0    0     0     13_3      1     0.8  1.6610864
#> C    1    1     1     14_3      0     1.1 -0.8882381
#> D    0    0     0     15_3      1     0.8  1.6610864
#> E    1    1     1     16_3      3     3.2 -2.5339491

Created on 2019-05-30 by the reprex package (v0.2.1.9000)
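If you then want a rank and a three-way grouping from the PC1 column, one possible approach is to cut the scores at their tertiles. This is only a sketch under assumptions: the `scores` data frame holds made-up values, the `flip` switch is a hypothetical helper for reversing the orientation, and tertiles are my choice of cut points rather than anything your assignment prescribes.

```r
# Toy PC1 scores (made up); in practice you would use the PC1 column
scores <- data.frame(
  merge_id = c("12_3", "13_3", "14_3", "15_3", "16_3", "17_3"),
  PC1      = c(0.10, 1.66, -0.89, 1.20, -2.53, 0.45)
)

# Assumption: high values should mean "top". If the loadings show that
# PC1 points the other way, set flip to TRUE to reverse the sign.
flip  <- FALSE
index <- if (flip) -scores$PC1 else scores$PC1

# Rank individuals: 1 = highest index value
scores$rank <- rank(-index, ties.method = "min")

# Cut points at the 1/3 and 2/3 quantiles of the index
breaks <- quantile(index, probs = c(0, 1/3, 2/3, 1))

# Assign each individual to bottom, middle or top
scores$group <- cut(index, breaks = breaks,
                    labels = c("bottom", "middle", "top"),
                    include.lowest = TRUE)

print(scores)
```

One caveat: with many tied scores the quantile cut points can coincide, and `cut()` stops with an error about non-unique breaks, so you may need to choose the breaks by hand in that case.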

As you say you have to use PCA, I'm assuming this is for a homework question, so I'd recommend reading up on PCA so that you get a feel of what it does and what it's useful for.
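On the merging question from the comments: once PC1 sits next to merge_id, attaching the household score to individuals could look roughly like the sketch below. The `individuals` data frame and its `person_id` column are hypothetical stand-ins for your second dataset, and the `households` values are made up.

```r
# Household-level scores (toy values standing in for the data frame with PC1)
households <- data.frame(
  merge_id = c("12_3", "13_3", "14_3"),
  PC1      = c(0.10, 1.66, -0.89)
)

# Hypothetical individual-level data; several people can share a household
individuals <- data.frame(
  person_id = 1:5,
  merge_id  = c("12_3", "12_3", "13_3", "14_3", "14_3")
)

# Left join: every individual keeps their row and inherits the household score
merged <- merge(individuals, households, by = "merge_id", all.x = TRUE)

print(merged)
```

With `all.x = TRUE`, individuals whose household has no score are kept with `NA` in PC1 rather than being dropped, which makes unmatched ids easy to spot.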