1
votes

I have a dataset with a couple million rows in CSV format that I want to import into Stata. I can do this, but there is a problem - a small percentage (but still many) of the observations are split across two lines in the CSV file. Most of the entries occupy only one line. The troublesome observations that take up two lines still follow the same comma-delimited pattern, but in the Stata dataset each such observation shows up as two rows, each containing only part of the data.

I used import delimited to import the data. Is there anything that can be done at the data import stage of the process in Stata? I would prefer not to have to deal with this in the original CSV file if possible.

***Update

Here is an example of what the csv file looks like:

var1,var2,var3,var4,var5 
text 1,    text 2,text 3   ,text 4,text 5
text 6,text 7,text 8,text9,text10
text 11,text 1     
         2,text 13,text14,text15
text16,text17,text18,text19,text20 

Notice that there is no comma at the end of the line. Also notice that the problem is with the observation that begins with text 11.

This is basically how it shows up in Stata:

    var1     var2     var3     var4     var5 
1   text 1   text 2   text 3   text 4   text 5
2   text 6   text 7   text 8   text9    text10
3   text 11  text 1
4   2        text 13  text14  text15
5   text16   text17   text18   text19   text20

The fact that a number sometimes appears right next to text isn't a mistake - it is just to illustrate that the data are more complex than shown here.

Of course, this is how I need the data:

    var1     var2     var3     var4     var5 
1   text 1   text 2   text 3   text 4   text 5
2   text 6   text 7   text 8   text9    text10
3   text 11  text 12  text 13  text14   text15
4   text16   text17   text18   text19   text20
Can you post a snippet of what your .csv file looks like? Make it such that, when saved to a .csv file and imported into Stata, it reproduces your problem. – Roberto Ferrer
Great idea - I just added what you asked for. Hopefully it clarifies things. – bill999
If missings are "predictable", you can maybe do some data management based on the number of missings on each row. Another option is to modify the input (with quotes) and use the bindquote() option. – Roberto Ferrer
They do in fact seem to be predictable. What did you have in mind for the data management? Would it make sense to try to do this in something like Python? – bill999

5 Answers

3
votes

A convoluted way is (comments inline):

clear
set more off

*----- example data -----

// read each line as a single string variable by choosing a delimiter
// that does not occur in the file; change it if necessary
insheet using "~/Desktop/stata_tests/test.csv", names delim(;)

list

*----- what you want -----

// compute number of commas
gen numcom = length(var1var2var3var4var5) ///
    - length(subinstr(var1var2var3var4var5, ",", "", .))

// save all data
tempfile orig
save "`orig'"

// keep observations that are fine
drop if numcom != 4

// save fine data
tempfile origfine
save "`origfine'"

*-----

// load all data
use "`orig'", clear

// keep offending observations
drop if numcom == 4

// for the -reshape-
gen i = int((_n-1)/2) +1
bysort i : gen j = _n

// check that pairs add up to 4 commas
by i : egen check = total(numcom)
assert check == 4

// no longer necessary
drop numcom check

// reshape wide
reshape wide var1var2var3var4var5, i(i) j(j)

// gen definitive variable
gen var1var2var3var4var5 = var1var2var3var4var51 + var1var2var3var4var52
keep var1var2var3var4var5

// append new observations with original good ones
append using "`origfine'"

// split
split var1var2var3var4var5, parse(,) gen(var)

// we're "done"
drop var1var2var3var4var5 numcom
list

But we don't really have the details of your data, so this may or may not work. It's just meant to be a rough draft. Depending on the memory space occupied by your data, and other details, you may need to improve parts of the code to make it more efficient.

Note: the file test.csv looks like

var1,var2,var3,var4,var5 
text 1,    text 2,text 3   ,text 4,text 5
text 6,text 7,text 8,text9,text10
text 11,text 1     
         2,text 13,text14,text15
text16,text17,text18,text19,text20

Note 2: I'm using insheet because I don't have Stata 13 at the moment. import delimited is the way to go if available.

Note 3: details on how the counting of commas works can be reviewed at Stata tip 98: Counting substrings within strings, by Nick Cox.

2
votes

I would try the following strategy.

  1. Import as a single string variable.
  2. Count commas on each line and combine following lines if lines are incomplete.
  3. Delete redundant material.

The comma count will be

length(variable) - length(subinstr(variable, ",", "", .)) 
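Here is a minimal sketch of that strategy, assuming the file is called data.csv, has a header row, has 5 comma-separated fields per complete observation, and that each broken observation spans exactly two physical lines (the file name and the "|" delimiter are just placeholders):

clear

// 1. read every physical line into a single string variable by using a
//    delimiter that (we assume) never occurs in the file
import delimited using "data.csv", delimiters("|") varnames(1)

// Stata builds the variable name from the header row; check with -describe-
// and adjust the -rename- if the generated name differs
rename var1var2var3var4var5 raw

// 2. count the commas on each line; a complete observation has 4
gen numcom = length(raw) - length(subinstr(raw, ",", "", .))

// 3. flag continuation lines, glue each one onto the line above,
//    and drop the leftovers
gen iscont = 0
replace iscont = 1 if _n > 1 & numcom < 4 & numcom[_n-1] < 4 & iscont[_n-1] == 0
replace raw = raw + raw[_n+1] if numcom < 4 & iscont == 0
drop if iscont == 1

// split the repaired lines back into five variables and clean up
split raw, parse(,) gen(var)
drop raw numcom iscont
list

Reading each line as a single field via an unused delimiter is the same trick the answer above plays with insheet and delim(;).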
1
votes

If the observations in question are quoted in the CSV file, then you can use the bindquote(strict) option.
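For example, a minimal call could look like this (the file name is just a placeholder):

// sketch only: with bindquote(strict), everything between a pair of double
// quotes is read as one field, even if it contains a comma or a line break
import delimited using "data.csv", bindquote(strict) varnames(1) clear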

1
votes

A bit of speculation without seeing the exact data: following Roberto Ferrer's comment, you might find the Stata command filefilter useful in cleaning the csv file before importing. You can substitute new and old string patterns, using basic characters as well as more complex \n and \r terms.
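As a purely hypothetical sketch, assuming the stray break in the example above is a Windows-style \r\n and the runs of blanks around it look exactly as shown (the file names, space counts, and the "1"/"2" fragments are taken from the posted example, not from the real data):

// stitch the two halves of the broken line back together before importing
filefilter "data.csv" "data_fixed.csv", from("1     \r\n         2") to("12") replace
import delimited using "data_fixed.csv", varnames(1) clear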

1
votes

I can't offer any code at the moment, but I suggest you take a good look at help import. The help entries for the infile and infix commands state:

An observation can be on more than one line.

(I don't know if this means that all observations should be on several lines, or if it can handle cases where only some observations are on more than one line.)

Check also the manuals if the examples and notes in the help files turn out to be insufficient.
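As a toy illustration of that quoted sentence (using made-up, space-delimited numeric data rather than the question's CSV), infile in free format keeps reading values across line breaks until all listed variables are filled:

// build a small file in which the second observation spans two lines
tempname fh
tempfile raw
file open `fh' using "`raw'", write text
file write `fh' "1 2 3 4 5" _n
file write `fh' "6 7 8" _n
file write `fh' "9 10" _n
file close `fh'

// free-format -infile- reads across the break, yielding two complete observations
infile a b c d e using "`raw'", clear
list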