I have a multi-label classification problem. I have a dataset available at the following link: dataset

This dataset is originally from the SIAM 2007 competition. It comprises aviation safety reports describing the problem(s) that occurred on certain flights. It is a multi-label, high-dimensional classification problem with 21519 rows and 30438 columns.

The dataset is provided as a file in .svm format. I read the file with read.delim in R and got the following output:

head(data[,1])

1 18 2:0.136082763488 6:0.136082763488 7:0.136082763488 12:0.136082763488 20:0.136082763488 23:0.136082763488 32:0.136082763488 37:0.136082763488 39:0.136082763488 43:0.136082763488 53:0.136082763488 57:0.136082763488 58:0.136082763488 59:0.136082763488 60:0.136082763488 61:0.136082763488 62:0.136082763488 63:0.136082763488 64:0.136082763488 65:0.136082763488 66:0.136082763488 67:0.136082763488 68:0.136082763488 69:0.136082763488 70:0.136082763488 71:0.136082763488 72:0.136082763488 73:0.136082763488 74:0.136082763488 75:0.136082763488 76:0.136082763488 77:0.136082763488 78:0.136082763488 79:0.136082763488 80:0.136082763488 81:0.136082763488 82:0.136082763488 83:0.136082763488 84:0.136082763488 85:0.136082763488 86:0.136082763488 87:0.136082763488 88:0.136082763488 89:0.136082763488 90:0.136082763488 91:0.136082763488 92:0.136082763488 93:0.136082763488 94:0.136082763488 95:0.136082763488 96:0.136082763488 97:0.136082763488 98:0.136082763488 99:0.136082763488
[2] 1,12,13,18,20 2:0.0916698497028 4:0.0916698497028 6:0.0916698497028 12:0.0916698497028 14:0.0916698497028 16:0.0916698497028 19:0.0916698497028 23:0.0916698497028 26:0.0916698497028 31:0.0916698497028 32:0.0916698497028 33:0.0916698497028 37:0.0916698497028 53:0.0916698497028 57:0.0916698497028 66:0.0916698497028 71:0.0916698497028 72:0.0916698497028 81:0.0916698497028 83:0.0916698497028 84:0.0916698497028 86:0.0916698497028 90:0.0916698497028 92:0.0916698497028 100:0.0916698497028 101:0.0916698497028 102:0.0916698497028 103:0.0916698497028 104:0.0916698497028 105:0.0916698497028 106:0.0916698497028 107:0.0916698497028 108:0.0916698497028 109:0.0916698497028 110:0.0916698497028 111:0.0916698497028 112:0.0916698497028 113:0.0916698497028 114:0.0916698497028 115:0.0916698497028 116:0.0916698497028 117:0.0916698497028 118:0.0916698497028 119:0.0916698497028 120:0.0916698497028 121:0.0916698497028 122:0.0916698497028 123:0.0916698497028 124:0.0916698497028 125:0.0916698497028 126:0.0916698497028 127:0.0916698497028 128:0.0916698497028 129:0.0916698497028 130:0.0916698497028 131:0.0916698497028 132:0.0916698497028 133:0.0916698497028 134:0.0916698497028 135:0.0916698497028 136:0.0916698497028 137:0.0916698497028 138:0.0916698497028 139:0.0916698497028 140:0.0916698497028 141:0.0916698497028 142:0.0916698497028 143:0.0916698497028 144:0.0916698497028 145:0.0916698497028 146:0.0916698497028 147:0.0916698497028 148:0.0916698497028 149:0.0916698497028 150:0.0916698497028 151:0.0916698497028 152:0.0916698497028 153:0.0916698497028 154:0.0916698497028 155:0.0916698497028 156:0.0916698497028 157:0.0916698497028 158:0.0916698497028 159:0.0916698497028 160:0.0916698497028 161:0.0916698497028 162:0.0916698497028 163:0.0916698497028 164:0.0916698497028 165:0.0916698497028 166:0.0916698497028 167:0.0916698497028 168:0.0916698497028 169:0.0916698497028 170:0.0916698497028 171:0.0916698497028 172:0.0916698497028 173:0.0916698497028 174:0.0916698497028 175:0.0916698497028 
176:0.0916698497028 177:0.0916698497028 178:0.0916698497028 179:0.0916698497028 180:0.0916698497028 181:0.0916698497028 182:0.0916698497028 183:0.0916698497028 184:0.0916698497028 185:0.0916698497028 186:0.0916698497028 187:0.0916698497028 188:0.0916698497028 189:0.0916698497028 190:0.0916698497028 191:0.0916698497028 192:0.0916698497028 193:0.0916698497028 194:0.0916698497028
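
For reference, each line appears to follow the multi-label LIBSVM layout: a comma-separated list of labels, then space-separated index:value pairs for the non-zero features. A minimal sketch of taking one such line apart, using a shortened copy of the second row above (the variable names are only illustrative):

```r
# Shortened copy of the second row shown above
line <- "1,12,13,18,20 2:0.0916698497028 4:0.0916698497028 6:0.0916698497028"

# Split on spaces: first token = labels, remaining tokens = index:value pairs
tokens <- strsplit(line, " ", fixed = TRUE)[[1]]

# Labels are comma-separated integers
labels <- as.integer(strsplit(tokens[1], ",", fixed = TRUE)[[1]])

# Each remaining token is "index:value"
pairs <- strsplit(tokens[-1], ":", fixed = TRUE)
index <- as.integer(vapply(pairs, `[`, character(1), 1))
value <- as.numeric(vapply(pairs, `[`, character(1), 2))

print(labels)
print(data.frame(index, value))
```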

How can I convert it into a regular dataset, with one row per report and one column per feature?

Any method other than read.delim for reading a ".svm" file in R would also be helpful.

From the URL you posted, it looks like the svm file is supposed to be read by the LIBSVM software referenced at the top of the page you used to download the data. Have you tried to read the file with the LIBSVM software? - Len Greski
Yes, I have tried, but it did not help. It works with a few of the datasets available on that website, but not with this one. @LenGreski - Abhishek Agnihotri

1 Answer

The solution below uses quite a few loops, but it solved my problem.

Below is the R code:

rm(list=ls())

data <- read.delim(file.choose(),header=F)

# Split each line on spaces: the first token holds the labels, the rest are feature:value pairs

temp <- list()

for(i in 1:length(data$V1)){
temp[i] <- strsplit(as.character(data$V1[i]),c(" "))
}

response <- list()

for(i in 1:length(temp)){
response[[i]] <- as.numeric(strsplit(temp[[i]][1],",")[[1]])
}

# Count how many labels each row carries
l.response <- 0

for (i in 1:length(response)){
l.response[i] <- length(response[[i]])
}

col.names <- paste(rep("R",22),1:22,sep="")



l.r <- length(temp)

df.response <- data.frame(R1=rep(0,l.r),R2=rep(0,l.r),R3=rep(0,l.r),R4=rep(0,l.r),R5=rep(0,l.r)
                         ,R6=rep(0,l.r),R7=rep(0,l.r),R8=rep(0,l.r),R9=rep(0,l.r),R10=rep(0,l.r)
                         ,R11=rep(0,l.r),R12=rep(0,l.r),R13=rep(0,l.r),R14=rep(0,l.r),R15=rep(0,l.r)
                         ,R16=rep(0,l.r),R17=rep(0,l.r),R18=rep(0,l.r),R19=rep(0,l.r),R20=rep(0,l.r)
                         ,R21=rep(0,l.r),R22=rep(0,l.r))



for(i in 1:length(response)){
df.response[i,(response[[i]]+1)] <- 1
}

# Number of rows in the dataset
v.l <- 21519

# One data frame of (feature index, value) pairs per row
v.list <- list()

f.vec <- 0
v.vec <- 0

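for(i in 1:length(temp)){

# Reset the buffers so entries from a longer previous row do not leak into this one
f.vec <- numeric(0)
v.vec <- numeric(0)

for(j in 2:length(temp[[i]])){

# Each token has the form "index:value"
pair <- as.numeric(strsplit(temp[[i]][j],":")[[1]])
f.vec[j-1] <- pair[1]
v.vec[j-1] <- pair[2]

}

v.list[[i]] <- data.frame(f.vec,v.vec)

}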

feature.name <- paste(rep("V",30438),1:30438,sep="")

v.l <- 21519

variables <- data.frame(temp = rep(0,v.l))

for(i in 1:length(feature.name)){

variables[,feature.name[i]] <- rep(0,v.l)

}


variables <- variables[,-1]

copy.variables <- variables

# Only the first 100 rows are filled here; the remaining rows of variables stay at 0
for(i in 1:100){

pos <- v.list[[i]][,"f.vec"]
replace <- v.list[[i]][,"v.vec"]

if(length(unique(pos))!=length(pos)){
repeat{

uni <- as.numeric(attr(which(table(pos)>1), "names"))

for(k in 1:length(uni)){

t.pos <- which(pos==uni[k])

pos <- pos[-t.pos[1]]

replace <- replace[-t.pos[1]]
}

if(length(unique(pos))==length(pos)) break
}
}
variables[i,pos]<- replace


}


dim(df.response)
dim(variables)

The code below gives the final data for the first 100 rows: the 30438 feature columns followed by the 22 label columns.

final.data <- cbind(variables[1:100,],df.response[1:100,])
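
The label step can probably be written without a loop or a hand-typed data.frame. The sketch below is only an illustration on a toy stand-in for the response list; like the loop above, it assumes 22 labels numbered from 0, so label k ends up in column k + 1.

```r
# Toy stand-in for the `response` list built earlier (labels per row, numbered from 0)
response <- list(18, c(1, 12, 13, 18, 20), c(0, 21))
n.labels <- 22  # assumption carried over from the 22 R columns used above

# One (row, column) pair per label occurrence
rows <- rep(seq_along(response), lengths(response))
cols <- unlist(response) + 1  # label k goes to column k + 1, as in the loop version

# Start from an all-zero matrix and set the listed cells to 1 in a single step
label.mat <- matrix(0, nrow = length(response), ncol = n.labels,
                    dimnames = list(NULL, paste0("R", 1:n.labels)))
label.mat[cbind(rows, cols)] <- 1

df.response <- as.data.frame(label.mat)
print(dim(df.response))
print(rowSums(df.response))
```

Indexing a matrix with a two-column matrix of (row, column) pairs sets all listed cells at once, which avoids assigning into a data frame row by row.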

Other solutions are welcome. @LenGreski
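
One possible alternative, offered as a sketch rather than a tested solution for this exact file: a dense table of 21519 x 30438 numbers needs roughly 5 GB, so it may be more practical to keep the features in a sparse matrix. The helper below, read_multilabel_svm, is a hypothetical name and not part of any package; it assumes the layout shown in the question (comma-separated labels, then index:value pairs with 1-based indices) and uses sparseMatrix from the Matrix package.

```r
library(Matrix)  # recommended package, normally installed alongside R

# Hypothetical helper: reads a multi-label LIBSVM-style file into a sparse
# feature matrix plus a list of integer label vectors (one per row).
read_multilabel_svm <- function(path, n.features = NULL) {
  lines <- readLines(path)
  lines <- lines[nzchar(trimws(lines))]  # drop blank lines
  tokens <- strsplit(lines, " ", fixed = TRUE)

  # First token of each line: comma-separated labels
  labels <- lapply(tokens, function(tk) {
    as.integer(strsplit(tk[1], ",", fixed = TRUE)[[1]])
  })

  # Remaining tokens: "index:value" pairs (empty tokens are skipped)
  feats <- lapply(tokens, function(tk) {
    f <- tk[-1]
    f[nzchar(f)]
  })
  pairs <- strsplit(unlist(feats), ":", fixed = TRUE)
  j <- as.integer(vapply(pairs, `[`, character(1), 1))
  x <- as.numeric(vapply(pairs, `[`, character(1), 2))
  i <- rep(seq_along(tokens), lengths(feats))  # row number for every pair

  if (is.null(n.features)) n.features <- max(j)
  # Note: sparseMatrix adds up the values of duplicated (i, j) pairs
  X <- sparseMatrix(i = i, j = j, x = x,
                    dims = c(length(tokens), n.features))
  list(X = X, labels = labels)
}

# Usage on a tiny made-up file in the same layout
tmp <- tempfile(fileext = ".svm")
writeLines(c("18 2:0.5 6:0.25", "1,12 2:0.1 4:0.2 7:0.3"), tmp)
res <- read_multilabel_svm(tmp, n.features = 8)
print(dim(res$X))
print(res$labels)
print(as.matrix(res$X))
```

For the real file, n.features could be set to 30438 so that the matrix has the full width even if the last features never occur; as.matrix should only be called on small subsets.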