0
votes

I have a large data set that was read from many small SPSS files and contains many rows and columns. The SPSS files contain some mistakes that I want to correct in R. When the data is read into R, the factor levels are full of noise, although the data looks fine in SPSS. I cannot fix the factor levels in each individual file in SPSS. Below is a small sample of the data:

data
    a  b                   c                  d    e
[1] 3  5 1 Very dissatisfied                  5    5
[2] 8  3                  10         Don't Know    1
[3] 7  5                   3                  8    6
[4] 3  5                   9                  6   99
[5] 9  4                   8  10 Very Satisfied    3
[6] 5 NA       99 Don't Know     Very Satisfied   10

levels(data[,1])
 [1] "1 Very Dissatisfied" "2"                 "3"             "4"                
 [5] "5"                   "6"                 "7"             "8"                
 [9] "9"                   "1" "10 Very Satisfied" "99 Don't know"
[12] "1 Very Bad"        "99"       "2 Satisfied"             "10"

The levels contain many mistakes. I want to correct them to something like the following:

x<-factor()
x<-ordered(x,levels=c("1 Very Dissatisfied","2 Satisfied","3 Satisfied","4 Satisfied",
"5 Satisfied","6 Satisfied","7 Satisfied","8 Satisfied","9 Satisfied","10 Very Satisfied",
"99 Dont Know"))

levels(x)
[1] "1 Very Dissatisfied"  "2 Satisfied"         "3 Satisfied"    "4 Satisfied"      
[5] "5 Satisfied"          "6 Satisfied"         "7 Satisfied"    "8 Satisfied"      
[9] "9 Satisfied"          "10 Very Satisfied"  "99 Dont Know"

I tried the following code:

for(j in c(1,2,5)){
    data[,j] <- factor(data[,j], levels = c(levels(data[,j]), levels(x)))
    for(i in 2:9){
        data[grep(i,data[,j]),j] <- paste(i,"Satisfied")}
}

This does not work. Please show me where I went wrong and what I should do instead.

Even once this code works, I still have to remove the unused garbage levels that the variable contains. How do I do that?


3 Answers

2
votes
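  1. Clean your data by stripping every non-digit character. Only digits and NA remain; an entry with no digits at all (for example a bare "Don't Know") becomes an empty string and has to be recoded separately.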

    data=apply(data,1:2,function(x) gsub("[^0-9]", "",x))
    

    Data will be like this:

          a   b   c    d    e   
    
    [1,] "3" "5" "1"  "5"  "5"     
    [2,] "8" "3" "10" "99" "1"   
    [3,] "7" "5" "3"  "8"  "6"   
    [4,] "3" "5" "9"  "6"  "99"  
    [5,] "9" "4" "8"  "10" "3"   
    [6,] "5" NA  "99" "10" "10"  
    
  2. Recode your string.

    # Install the car package (only needed once)
    install.packages("car")

    # Load the car package
    library("car")

    # Map each numeric code to its full label
    replace_string <- function(x) {
      recode(x, '1="1 Very Dissatisfied";
                 2="2 Satisfied";
                 3="3 Satisfied";
                 4="4 Satisfied";
                 5="5 Satisfied";
                 6="6 Satisfied";
                 7="7 Satisfied";
                 8="8 Satisfied";
                 9="9 Satisfied";
                 10="10 Very Satisfied";
                 99="99 Dont Know"')
    }

    # Recode every cell; note that apply() returns a character matrix
    data <- apply(data, 1:2, replace_string)
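
The two steps above can also be sketched per column, without car and without turning the data frame into a character matrix. This is only a sketch under assumptions: `df` is a made-up sample, `clean_scale` is a hypothetical helper, and the text patterns for label-only entries are guesses at the spellings present in the real files.

```r
# Hypothetical sample imitating the noisy columns from the question.
df <- data.frame(
  c = c("1 Very dissatisfied", "10", "3", NA, "8", "99 Don't Know"),
  d = c("5", "Don't Know", "8", "6", "10 Very Satisfied", "Very Satisfied"),
  stringsAsFactors = TRUE
)

# Target labels, in scale order.
target <- c("1 Very Dissatisfied", paste(2:9, "Satisfied"),
            "10 Very Satisfied", "99 Dont Know")

# Hypothetical helper: reduce each entry to its numeric code, then relabel.
clean_scale <- function(v) {
  s <- as.character(v)
  code <- gsub("[^0-9]", "", s)  # keep digits only; NA stays NA
  # Label-only entries have no digits, so map them by their text
  # (assumed spellings).
  code[code == "" & grepl("Don", s)] <- "99"
  code[code == "" & grepl("Very [Ss]", s)] <- "10"
  code[code == "" & grepl("Very [Dd]", s)] <- "1"
  # Anything still empty could not be interpreted.
  code[!is.na(code) & code == ""] <- NA
  idx <- match(code, as.character(c(1:10, 99)))
  factor(target[idx], levels = target, ordered = TRUE)
}

# Clean every column while keeping the data frame structure.
df[] <- lapply(df, clean_scale)
```

Because the result is built with `levels = target`, each column ends up as an ordered factor with exactly the eleven target levels, so no garbage levels are left to remove afterwards.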
    
1
votes

I would suggest leaving SPSS attributes as is by not using value labels from SPSS:

library(foreign)  # read.spss() comes from the foreign package
temp <- read.spss(file, use.value.labels = FALSE)

Then I would use ifelse to correct the labels based on your for loop:

# 2:9 mirrors the loop in the question; 1, 10 and 99 need their own labels
temp$c <- ifelse(as.numeric(temp$c) %in% 2:9, paste(temp$c, "Satisfied", sep=" "), temp$c)
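
Once the raw numeric codes are available, another option is to attach all labels in one step with the `levels` and `labels` arguments of `factor()`. A minimal sketch, assuming the codes are 1 to 10 plus 99; the vector `codes` is made-up sample data, not the real file.

```r
# Hypothetical numeric codes as they might come from read.spss()
# with use.value.labels = FALSE.
codes <- c(3, 8, 7, 3, 9, 5, 1, 10, 99, NA)

# One label per code, in scale order.
target <- c("1 Very Dissatisfied", paste(2:9, "Satisfied"),
            "10 Very Satisfied", "99 Dont Know")

# levels = the raw codes, labels = the text each code should show.
sat <- factor(codes, levels = c(1:10, 99), labels = target, ordered = TRUE)

levels(sat)
table(sat, useNA = "ifany")
```

Any code that is not listed in `levels` becomes NA, so the `table()` call is a quick way to spot unexpected values.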
0
votes

My mistake was in the grep call. grep(i, data) matches the digit anywhere in the string, so the pattern 1 also matched 10 and the pattern 9 also matched 99. I switched to the anchored pattern ^i$, which matches the whole string exactly, so ^9$ matches only 9 and not 99.
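
The difference between the two patterns can be seen on a small vector. This is just an illustration; `vals` is made-up sample data.

```r
vals <- c("1", "10", "9", "99", "1 Very Dissatisfied")

# Unanchored: matches any string that contains a 9.
unanchored <- grep("9", vals, value = TRUE)

# Anchored: matches only strings that are exactly "9".
i <- 9
pattern <- paste0("^", i, "$")
anchored <- grep(pattern, vals, value = TRUE)

unanchored  # "9" "99"
anchored    # "9"
```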

To remove the unused levels and treat the factor as an ordinal variable, I called ordered() on each column at the end, passing the target levels, and that solved the problem.
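
For comparison, a short sketch of two ways to get rid of unused levels. `droplevels()` keeps only the levels that actually occur, while `ordered()` with an explicit `levels` argument keeps exactly the listed levels, even those that do not occur in the data. The factor `f` is made-up sample data.

```r
# Factor that still carries the noisy levels "2" and "9" after correction.
f <- factor(c("2 Satisfied", "9 Satisfied", "2 Satisfied"),
            levels = c("2", "9", "2 Satisfied", "9 Satisfied"))

# Option 1: drop every level that no longer occurs.
dropped <- droplevels(f)
levels(dropped)

# Option 2: rebuild as an ordered factor with a fixed target set.
target <- c("2 Satisfied", "3 Satisfied", "9 Satisfied")
o <- ordered(f, levels = target)
levels(o)
```

The second form is what the loop in this answer relies on: it fixes the order of the levels and keeps the full scale even if some value is missing from a particular column.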

This is the exact code I used:

Step 1: Define the levels of the factor

x<-factor()
x<-ordered(x,levels=c("1 Very Dissatisfied","2 Satisfied","3 Satisfied","4 Satisfied","5 Satisfied","6 Satisfied","7 Satisfied","8 Satisfied","9 Satisfied","10 Very Satisfied","Dont Know"))

Step 2: Loop over the affected columns and correct the values in each one.

I used the following code:

for(j in c(28,29,32)){
    data[,j] <- factor(data[,j])
    # Add the target levels first so that assigning them later does not produce NA
    data[,j] <- factor(data[,j], levels = c(levels(data[,j]), levels(x)))
    # Correct the noisy values
    data[grep("99",data[,j]),j] <- "Dont Know"
    data[grep("Don",data[,j]),j] <- "Dont Know"
    data[grep("Very [Ss]",data[,j]),j] <- "10 Very Satisfied"
    data[grep("10",data[,j]),j] <- "10 Very Satisfied"
    data[grep("Very [Dd]",data[,j]),j] <- "1 Very Dissatisfied"
    data[grep("^1$",data[,j]),j] <- "1 Very Dissatisfied"
    # Correct the remaining plain digits 2 to 9 (anchored so 9 does not match 99)
    for(i in 2:9){
       data[grep(paste("^",i,"$",sep=""),data[,j]),j] <- paste(i,"Satisfied")
    }
    # Keep only the target levels (drops the unused ones) and make the factor ordered
    data[,j] <- ordered(data[,j],levels(x))
}
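
One thing worth checking before the final ordered() call: any value that is not listed in levels(x) silently becomes NA. The sketch below lists such leftovers on a made-up vector `v` that stands in for one partly cleaned column. With the patterns above, a level like "1 Very Bad" from the question would be one of them, since none of the grep calls matches it.

```r
# Target levels as defined in Step 1.
target <- c("1 Very Dissatisfied", paste(2:9, "Satisfied"),
            "10 Very Satisfied", "Dont Know")

# Hypothetical column contents after the grep corrections.
v <- c("3 Satisfied", "Dont Know", "1 Very Bad", NA, "10 Very Satisfied")

# Values that would be turned into NA by ordered(v, levels = target).
leftover <- setdiff(unique(v[!is.na(v)]), target)
leftover
```

If `leftover` is not empty, one more correction line is needed for each value it reports.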