read complicated dataset in R

Question

My dataset looks something like the sample below. In each pair, the number before the colon is the feature number and the number after it is the value of that feature. I am not sure how to import this dataset into R. Does anyone have any ideas?

236:24 500:163 732:234 869:117 885:106 1249:103 1280:158 1889:119 2015:55 2718:126 3307:137 3578:25 3770:26 4139:128 4723:114 4957:82 5128:50 5420:124 5603:135 5897:34 5946:117 6069:154 6153:55 6347:87 6372:77 6666:109 6866:223 6984:39 7709:253 7950:87 8078:38 8945:141 9316:111 9948:103 9989:68 10276:43 10530:76 10532:55 10799:15 10802:20 10848:82 11347:16 11871:51 11883:105 12534:133 12601:13 12781:178 12798:116 12842:106 12916:7 12935:51 12968:154 13028:58 13330:105 13384:2 13568:47 13641:632 13829:18 13964:62 14385:93 14392:272 15280:140 15424:119 15492:52 15523:31 16311:23 16464:69 16478:94 16584:102 16586:107 16705:272 17138:108 17181:150 17526:280 17540:163 18007:114 18050:53 18180:2 18806:160 18943:73 19055:41 19255:88 19774:59 19889:72 19921:45 101:68 572:57 732:63 962:120 1304:61 1831:60 1889:58 1973:105 2518:161 2629:228 2990:158 3147:75 3578:11 3860:88 4011:18 4623:141 4684:411 4758:69 4820:120 6149:102 6234:134 6306:118 6866:147 6927:89 6988:51 7048:178 7193:31 7257:61 7709:229 8061:125 8202:188 8272:17 8759:165 9104:77 9325:135 9860:97 10055:684 10532:180 10735:64 10744:267 10820:120 10848:186 10923:128 10936:129 11203:160 11303:144 11668:87 11867:97 11871:207 12191:83 12238:193 12380:51 12968:164 13369:58 13929:39 14531:102 14800:130 14931:99 15314:91 15632:62 16165:7 16353:120 16584:137 17216:172 18372:31 18893:75 19133:93 19154:101 19165:133 19607:20 19784:141 19889:97 19921:60

Is this entire data set on a single line? What perhaps bothers me more is which machine learning method you plan to use which can handle ~10K features. — Tim Biegeleisen
You essentially have a 2 column dataset separated by : - e.g.: x <- "236:24 500:163 732:234 869:117" and then read.table(text=scan(text=x, what=""), sep=":") works. — thelatemail

Ben Fasoli · Accepted Answer · 2017-09-04T02:09:28

Assuming your data is stored in input.txt,

input <- scan('input.txt', what = 'character')

# strsplit() returns feature, value, feature, value, ... so fill the matrix row by row
data <- as.data.frame(matrix(as.numeric(unlist(strsplit(input, ':'))), ncol = 2, byrow = TRUE))
colnames(data) <- c('Feature', 'Value')
str(data)
# 'data.frame': 158 obs. of  2 variables:
#  $ Feature: num  236 500 732 869 885 ...
#  $ Value  : num  24 163 234 117 106 103 158 119 55 126 ...
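Because unlist(strsplit(...)) returns the numbers interleaved as feature, value, feature, value, the way matrix() fills its cells decides whether the two columns come out right. A minimal sketch on four pairs taken from the sample (the names flat, by_col and by_row are only illustrative):

```r
x <- c("236:24", "500:163", "732:234", "869:117")
flat <- as.numeric(unlist(strsplit(x, ":")))
# flat is interleaved: 236 24 500 163 732 234 869 117

by_col <- matrix(flat, ncol = 2)                # default: fill column by column
by_row <- matrix(flat, ncol = 2, byrow = TRUE)  # fill row by row

by_col[, 1]  # 236 24 500 163: features and values mixed together
by_row[, 1]  # 236 500 732 869: features only
by_row[, 2]  # 24 163 234 117: values only
```

With the default column-wise fill, the first half of the flat vector lands in column 1 and the second half in column 2, which is why byrow = TRUE is the setting that keeps each pair on its own row.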

Alternatively, you can let read.table parse the input instead of splitting the strings manually. This is slightly slower but more readable.

data <- read.table(text = input, sep = ':')
colnames(data) <- c('Feature', 'Value')
str(data)
# 'data.frame': 158 obs. of  2 variables:
#  $ Feature: int  236 500 732 869 885 1249 1280 1889 2015 2718 ...
#  $ Value  : int  24 163 234 117 106 103 158 119 55 126 ...

Edit: adapted for your dataset. Reads your Feature/Value pairs into a data frame.

url <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/dexter/DEXTER/dexter_test.data'
input <- scan(url, what = 'character')
# byrow = TRUE keeps each feature:value pair on its own row
data <- as.data.frame(matrix(as.numeric(unlist(strsplit(input, ':'))), ncol = 2, byrow = TRUE))
colnames(data) <- c('Feature','Value')
str(data)
# 'data.frame': 192449 obs. of  2 variables:
#  $ Feature: num  236 500 732 869 885 ...
#  $ Value  : num  24 163 234 117 106 103 158 119 55 126 ...
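Note that scan() splits on all whitespace, newlines included, so the data frame above no longer records which line of the file each pair came from. If each line is one sample (an assumption about the file layout, not something stated in the answer), a minimal base R sketch that keeps the line index could look like this. read_sparse_pairs is a hypothetical helper name, and the two input lines are made up for illustration:

```r
# Hypothetical helper: parse "feature:value" lines into a long data frame
# that also records which input line each pair came from.
read_sparse_pairs <- function(lines) {
  # One character vector of "feature:value" tokens per input line
  tokens <- strsplit(trimws(lines), "[[:space:]]+")
  n_per_line <- lengths(tokens)
  # Splitting on ':' gives feature, value, feature, value, ...
  parts <- as.numeric(unlist(strsplit(unlist(tokens), ":", fixed = TRUE)))
  pairs <- matrix(parts, ncol = 2, byrow = TRUE)
  data.frame(
    Sample  = rep(seq_along(lines), times = n_per_line),
    Feature = pairs[, 1],
    Value   = pairs[, 2]
  )
}

# Two short made-up lines standing in for readLines('input.txt')
lines <- c("236:24 500:163 732:234", "101:68 572:57")
long <- read_sparse_pairs(lines)
long
```

For a real file, passing readLines('input.txt') instead of the made-up vector should give one row per pair, with Sample holding the line number. A long table like this can then be reshaped or converted to a sparse matrix, depending on what the modelling step expects.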
