0
votes

I'm trying to read in an sort data from a CSV file and I'm having trouble with the long strings involved. I've attached screenshots of part of my output from SAS and from the original dataset. (The dataset is from Kaggle, about Ted Talks.)

I'm having trouble with the variable "tags". Basically, I'm looking to read in the tags and sort the data accordingly (i.e. tags that mention children or education will be put into an Education category). So far I'm stuck with just reading them in though. Any help would be appreciated!

This is my code so far:

data tedtalks;
infile 'O:\ted_main1.csv' dlm = ',' firstobs = 2;
informat name $80.; 
informat main_speaker $20.; 
informat speaker_occupation $60.;
informat title $80.;
input comments duration event $ film_date languages 
main_speaker $ name $ num_speaker published_date 
speaker_occupation $ tags $ title $ views
;
run; 

proc print data=tedtalks; 
run; 

First few lines of CSV data:

comments    duration    event   film_date   languages   main_speaker    name    num_speaker published_date  speaker_occupation  tags    title   views
4553 1164 TED2006   1140825600  60  Ken Robinson    Ken Robinson: Do schools kill creativity?   1   1151367060  Author/educator ['children', 'creativity', 'culture', 'dance', 'education', 'parenting', 'teaching']    Do schools kill creativity? 47227110
265 977 TED2006 1140825600  43  Al Gore Al Gore: Averting the climate crisis    1   1151367060  Climate advocate    ['alternative energy', 'cars', 'climate change', 'culture', 'environment', 'global issues', 'science', 'sustainability', 'technology']  Averting the climate crisis 3200520
124 1286    TED2006 1140739200  26  David Pogue David Pogue: Simplicity sells   1   1151367060  Technology columnist    ['computers', 'entertainment', 'interface design', 'media', 'music', 'performance', 'simplicity', 'software', 'technology'] Simplicity sells    1636292
200 1116    TED2006 1140912000  35  Majora Carter   Majora Carter: Greening the ghetto  1   1151367060  Activist for environmental justice  ['MacArthur grant', 'activism', 'business', 'cities', 'environment', 'green', 'inequality', 'politics', 'pollution']    Greening the ghetto 1697550

SAS Output

Excel CSV Data

1
Seeing a picture of CSV data imported into Excel isn't helpful. Suggest you paste a few lines of the csv data into your question. Formatted as code.Quentin
Doesn't look like a comma separated file. Perhaps those a TABS between the values?Tom
For some reason when I open a CSV in Excel, it never shows the commas. The commas are there though. If I open it in Notepad, it suddenly shows them. shrugsKelsey Arthur
If you let Excel automatically open a CSV without telling it how to handle each field it will by default convert strings that look numbers or dates, even if the field is a character field.Tom

1 Answers

0
votes

If you have a delimited file then you probably want to use the DSD option on the INFILE statement. Also make sure to use the right delimiter. The INFORMAT statement is not for defining variables, it is for attaching instructions that will help SAS know how to read the variable from text. SAS already knows how to read character variables so there is no need to attach $xx. informats. Instead DEFINE your variables using LENGTH or ATTRIB statement.

32,767 is the longest you can make a character variable. Make sure to set the LRECL long enough also.

data tedtalks;
  infile 'O:\ted_main1.csv' dsd dlm='09'x firstobs=2 truncover lrecl=100000;
  length comments 8 duration 8 event $20 film_date languages 8
         main_speaker $100 name $200 num_speaker 8 
         published_date 8 speaker_occupation $100 
         tags $32767 title $200 views 8
  ;
  input comments -- views ;
run;

Not sure what format those dates are in but perhaps you can convert them to dates.

The tags look to be JSON type lists. It would be difficult to use the tags to "sort" the data, but pretty easy to use them to filter the data. You can leave them as is and search using functions like INDEXW() or FIND(). You could parse them into multiple variables or multiple observations using SCAN() function.