1
votes

There is a given dummy dataset. I am trying to import it into the SAS environment and i also tried it in RStudio as well. But i was unable to grab the correct output. the dataset provided is as follows:

1    "name, age, salary, zipcode"
2    "A, 1, 100, 10010 B, 2, 200, 10011 C, 3, 300, 10012 D, 4, 400, 10014"

i copied the data from a CSV file. The line 1 was in the First cell of the CSV and Line 2 was in the second cell just below the first cell.

PS : 10010 is the zipcode where as B is the Name that is B is the starting of next observation and similiarly C is the next observation and D is the next observation.

The Desired output is:

Name   Age   Salary   Zipcode
 A      1     100      10010
 B      2     200      10011
 C      3     300      10012
 D      4     400      10014

i am stuck with this problem since a day ago. I am just able to move around the dataset in order to move the observations. Moreover the double qouted values are making this problem very complex. A working Solution on SAS or R would work great.

EDIT: The following code has been tried by me in SAS. Thanks SMW for pointing. :)

data new;
length Name $2 Age 8. Salary 8. Zipcode 8.;
infile 'book1.csv' dsd dlm = ',' firstobs = 2 LRECL =17 ;
input Name $ Age Salary Zipcode @@;
run;

Regards

2
Believe you need either the trailing "@" or you can use "\" to control what records go to what observation. Once you post an attempt of code I will know for sure, but not 100% certain of your question at this point.SMW
@SMW Added the Edit in the query. Thanks much. Please let me know for any Catch in the code.desmond.carros
Does the file literally have the quotes? Does it really have all of the data on the second line? You could probably get SAS to read it by telling SAS that space, comma and quotes are delimiters DLM=' ,"'. But if any of the values are blank or if the names have embedded spaces then it will get confused.Tom
@Tom Yes the file literally have the quotes. and all the data is on the second line. But the space delimiter is at the end of each OBSERVATION. Not at every single value. That's the reason the above code didn't worked efficiently. Thanks for the reply.desmond.carros

2 Answers

3
votes

Import the column names and prepare it

nam <- read.table("csv.csv", header=F, nrow=1, stringsAsFactors=F)
nam <- gsub(",", "", nam)
nam <- nam[2]                             # resolve " " in original data 
nam <- strsplit(nam, split=" ")           #

Capitalize the colnames

library(Hmisc)                   # edited
nam <- capitalize(nam)           # edited

Import the remaining file

dd <- read.table("csv.csv", skip=1, sep="", stringsAsFactors=F)

Reorder the file, extract first column

col_1 <- gsub(",", "", as.vector(dd[seq(from=1, to=16,by=4)])[1,])

Extract the remaining columns

col_all <- as.data.frame(t(matrix(dd, ncol=4)))

Clean commas

col_all <- apply(apply(col_all[-1], 2, gsub, patt=",", replace=""), 2, as.numeric)

and set the final object

data <- as.data.frame(cbind(col_1, col_all))
names(data) <- nam
data  # ouput

  Name Age Salary Zipcode
1    A   1    100   10010
2    B   2    200   10011
3    C   3    300   10012
4    D   4    400   10014

Edited: Capitalize only with base functions. Avoid library(Hmisc)and the following code line.

nam <- unlist(lapply(nam, function(x) {                    
   paste(toupper(substring(x, 1, 1)), substring(x, 2, max(nchar(nam))), sep="")
}))
0
votes

You can ask SAS to treat comma, blank and quote as delimiters and use trail @@ to allow it to read multiple observations from one line.

Let's build your example data file.

filename example temp;
data _null_;
  file example;
  put '"name, age, salary, zipcode"'
    / '"A, 1, 100, 10010 B, 2, 200, 10011 C, 3, 300, 10012 D, 4, 400, 10014"'
 ;
run;

And now read it.

data want;
  length Name $2 Age 8. Salary 8. Zipcode 8.;
  infile example dlm = ' ,"' firstobs = 2 ;
  input Name $ Age Salary Zipcode @@;
run;