R: change factor levels of a variable and remove old ones

Question

I have a large data set read from SPSS files. It contains several rows and columns, combined from many small SPSS files. The SPSS files contained some mistakes, which I want to correct in R. When the data is read into R, the factor levels are full of noise, although the data looks fine in SPSS. I cannot change the factor levels in the many individual files in SPSS. Below is a small sample of the data:

data
    a  b                   c                  d    e
[1] 3  5 1 Very dissatisfied                  5    5
[2] 8  3                  10         Don't Know    1
[3] 7  5                   3                  8    6
[4] 3  5                   9                  6   99
[5] 9  4                   8  10 Very Satisfied    3
[6] 5 NA       99 Don't Know     Very Satisfied   10

levels(data[,1])
 [1] "1 Very Dissatisfied" "2"                 "3"             "4"                
 [5] "5"                   "6"                 "7"             "8"                
 [9] "9"                   "1" "10 Very Satisfied" "99 Don't know"
[12] "1 Very Bad"        "99"       "2 Satisfied"             "10"

The levels contain many mistakes. I want to correct them to something like the following:

x<-factor()
x<-ordered(x,levels=c("1 Very Dissatisfied","2 Satisfied","3 Satisfied","4 Satisfied",
"5 Satisfied","6 Satisfied","7 Satisfied","8 Satisfied","9 Satisfied","10 Very Satisfied",
"99 Dont Know"))

levels(x)
[1] "1 Very Dissatisfied"  "2 Satisfied"         "3 Satisfied"    "4 Satisfied"      
[5] "5 Satisfied"          "6 Satisfied"         "7 Satisfied"    "8 Satisfied"      
[9] "9 Satisfied"          "10 Very Satisfied"  "99 Dont Know"

I tried the following code:

for(j in c(1,2,5)){
    data[,j] <- factor(data[,j], levels = c(levels(data[,j]), levels(x)))
    for(i in 2:9){
        data[grep(i,data[,j]),j] <- paste(i,"Satisfied")}
}

This does not work. Please show me where I went wrong and what I should do instead.

Even once this code works, I still have to remove the unused garbage levels that the variable contains. How can I do that?

field210 · Accepted Answer · 2014-11-09T05:15:23

First, clean your data. The call below removes every non-digit character from each cell, so only the numeric codes and NA remain.

# Strip everything that is not a digit from each cell; the result is a character matrix
data <- apply(data, 1:2, function(x) gsub("[^0-9]", "", x))

Data will be like this:

     a   b   c    d    e
[1,] "3" "5" "1"  "5"  "5"
[2,] "8" "3" "10" "99" "1"
[3,] "7" "5" "3"  "8"  "6"
[4,] "3" "5" "9"  "6"  "99"
[5,] "9" "4" "8"  "10" "3"
[6,] "5" NA  "99" "10" "10"
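Stripping the labels down to bare codes also avoids a likely cause of the trouble in the question's loop: `grep(i, ...)` matches substrings, so the pattern 9 is also found inside "99 Don't Know". The sketch below illustrates the difference on a made-up vector (`raw` is hypothetical, modelled on the levels shown in the question). Note that the digit-stripping approach assumes the only digits in a label are its leading code.

```r
# Hypothetical sample values resembling the messy levels in the question
raw <- c("9", "99 Don't Know", "1 Very dissatisfied", "10 Very Satisfied", NA)

# Unanchored pattern: "9" is found inside "99 Don't Know" as well
loose <- grep("9", raw)

# Stripping non-digits first allows an exact comparison instead
cleaned <- gsub("[^0-9]", "", raw)
exact <- which(cleaned == "9")

print(loose)    # positions matched by the substring search
print(exact)    # positions whose code is exactly 9
print(cleaned)  # NA passes through gsub unchanged
```

Here `loose` picks up both the first and second element, while `exact` picks up only the first.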

Next, recode the cleaned strings into the labels you want.

# Install the car package (only needed once)
install.packages("car")

# Load the car package
library("car")

# Map each numeric code to its full label
replace_string <- function(x) {
  recode(x, '1="1 Very Dissatisfied";
             2="2 Satisfied";
             3="3 Satisfied";
             4="4 Satisfied";
             5="5 Satisfied";
             6="6 Satisfied";
             7="7 Satisfied";
             8="8 Satisfied";
             9="9 Satisfied";
            10="10 Very Satisfied";
            99="99 Dont Know"')
}

# Apply the recoding to every cell of the matrix
data <- apply(data, 1:2, replace_string)
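The recoded result is a character matrix, so the question's last point (ordered levels with the unused ones removed) still needs one more step. A minimal base R sketch of that step, assuming the values have already been reduced to bare numeric codes; the vector `codes` is made up for illustration and the names `labs`, `sat`, and `sat_used` are hypothetical:

```r
# Hypothetical cleaned codes for one column (digits only, possibly NA)
codes <- c("3", "8", "7", "3", "9", "5", NA, "99")

# Target labels from the question, in the desired order
labs <- c("1 Very Dissatisfied", paste(2:9, "Satisfied"),
          "10 Very Satisfied", "99 Dont Know")

# Map each code to its label; all 11 levels are defined at this point
sat <- factor(codes, levels = as.character(c(1:10, 99)),
              labels = labs, ordered = TRUE)

# Drop the levels that never occur in the data
sat_used <- droplevels(sat)

print(levels(sat_used))
```

`droplevels()` keeps the factor ordered and only removes levels with no observations. For a whole data frame of such columns, the same conversion could be applied column by column, for example with `lapply`.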
