1
votes

I have read this data set and want to combine the training set and the test set (I should mention that this is part of a Coursera course exercise).

I have read both data sets and named all the columns. The training data has 7352 rows and 562 columns, and the test set has 2947 rows and 562 columns. The column names are the same in both data sets.

When I try to join the data with bind_rows I get a data set with 10299 rows but with 478 columns, not 562.

When I use rbind I get the correct result, but I need to cast it again using tbl_df so I prefer doing it using bind_rows.

The following is the script I wrote. Running it from a folder containing the unzipped data from the above link (that is, the folder "UCI HAR Dataset") reproduces the problem.

## Setting the script folder to be current directory 
CurrentScriptDirectory = script.dir <- dirname(sys.frame(1)$ofile)
setwd(CurrentScriptDirectory)

library(dplyr)

#Reading the data
train_x <- tbl_df(read.table("./UCI HAR Dataset/train/X_train.txt"))
train_y <- tbl_df(read.table("./UCI HAR Dataset/train/y_train.txt"))
test_x <- tbl_df(read.table("./UCI HAR Dataset/test/X_test.txt"))
test_y <- tbl_df(read.table("./UCI HAR Dataset/test/y_test.txt"))

#Giving the y's proper names
colnames(train_y) <- c("Activity Name")
colnames(test_y) <- c("Activity Name")

#Reading features names
featuerNames<-read.table("./UCI HAR Dataset/features.txt")
featuerNames<-featuerNames[,2]

#Giving the training and test data proper names
colnames(train_x) <- featuerNames
colnames(test_x) <- featuerNames

labeledTrainingSet <- bind_cols(train_x,train_y)
labeledTestSet <- bind_cols(test_x,test_y)

labledDataSet <- bind_rows(labeledTrainingSet,labeledTestSet)

Can someone help me understand what I'm doing wrong?

2 Answers

4
votes

I've worked with that dataset and ran into the same issue. As others mentioned, there are duplicate features.

  1. Rename duplicate columns and make them legal. You can use:

make.names(X, unique = TRUE, allow_ = TRUE)

where X is a character vector. The function appends a numeric suffix to repeated names and replaces characters that are not legal in R names with ".", so you don't lose the original nomenclature. See http://www.inside-r.org/r-doc/base/make.names for more details.

  2. After all of your column names are unique, dplyr::bind_rows() will work!
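
As a rough illustration of step 1, here is a minimal base R sketch. The three feature names are made up for the example (they only mimic the style of the names in features.txt), and `feature_names` / `unique_names` are hypothetical variable names, not part of the original script.

```r
# Hypothetical feature names: the first two are identical, and all of them
# contain characters ("-", "(", ")", ",") that are not legal in R names.
feature_names <- c("fBodyAcc-bandsEnergy()-1,8",
                   "fBodyAcc-bandsEnergy()-1,8",
                   "tBodyAcc-mean()-X")

# make.names() replaces illegal characters with "." and, with unique = TRUE,
# appends a numeric suffix to repeated names so every name is distinct.
unique_names <- make.names(feature_names, unique = TRUE, allow_ = TRUE)
print(unique_names)

# Sanity check: no name appears more than once after the repair.
print(anyDuplicated(unique_names) == 0)
```

In the script from the question, the same call could be applied to the feature name vector before it is assigned with colnames(), so that both the training and the test set get the same repaired names.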
2
votes

I just checked it out. You have duplicated names in your featuerNames vector, and bind_rows drops columns whose names are duplicated. A small example reproduces the problem:

test1<- data.frame(c(1,2,3),c(1,NA,3), c(1,2,NA))
names(test1)<- c("A","B","B")

test2<- data.frame(c(1,2,3),c(1,NA,3), c(1,2,NA))
names(test2)<- c("A","B","B")


test3 <-bind_rows(test1, test2)
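
To find out which names are repeated before binding, something along these lines should work. It is only a sketch in base R that reuses the toy `test1` frame from above; `dup_names` is a hypothetical variable name, and make.unique() is swapped in here as one simple way to repair the names.

```r
# Same toy data frame as above, with the name "B" used twice.
test1 <- data.frame(c(1, 2, 3), c(1, NA, 3), c(1, 2, NA))
names(test1) <- c("A", "B", "B")

# duplicated() flags the second and later occurrences of each name.
dup_names <- names(test1)[duplicated(names(test1))]
print(dup_names)

# make.unique() appends ".1", ".2", ... to the repeated names only.
names(test1) <- make.unique(names(test1))
print(names(test1))
```

Run against the feature names from the question, sum(duplicated(featuerNames)) would be expected to account for the gap between 562 and 478 columns.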