Stata - inputting data from .txt with "" and ,

Question

I am using Perl to scrape the following data into a .txt file, which I will ultimately bring into Stata. What file format works? I have many such observations, so I would like an approach that generalizes.

The original data are of the form:

 First Name: Allen
 Last Name: Von Schmidt
 Birth Year: 1965
 Location: District 1, Ocean City, Cape May, New Jersey, USA

 First Name: Lee Roy
 Last Name: McBride
 Birth Year: 1967
 Location: Precinct 5, District 2, Chicago, Cook, Illinois, USA

The goal is to create the variables in Stata:

  First Name: Allen
  Last Name: Von Schmidt
  Birth Year: 1965
  County: Cape May
  State: New Jersey

  First Name: Lee Roy
  Last Name: McBride
  Birth Year: 1967
  County: Cook
  State: Illinois

What .txt layout would produce this, and how would I load it into Stata?

Also, the number of terms in Location varies, as in these two examples, but I always want the two terms before USA.

At the moment, I am wrapping each field in double quotes and separating the fields with commas in the .txt:

 "Allen","Von Schmidt","1965","District 1, Ocean City, Cape May, New Jersey, USA"
 "Lee Roy","McBride","1967","Precinct 5, District 2, Chicago, Cook, Illinois, USA"

Is there a better way to format the .txt? How would I create the corresponding variables in Stata?

Thank you for your help!

P.S. I know that Stata uses infile or insheet and can handle commas or tabs as separators between variables. I did not know how to scrape a variable like Location in Perl with all of those commas in it, so I added the "".

Dimitriy V. Masterov · Accepted Answer · 2013-02-15T22:38:12

There are two ways to do this. The first is to paste the data into your do-file and use input. Assuming the format is fairly regular, you can clean it up easily using commas to parse. Note that I removed the commas between the quoted fields; the commas inside the location string stay:

#delimit;
input
str100(first_name last_name yob geo);
 "Allen" "Von Schmidt" "1965" "District 1, Ocean City, Cape May, New Jersey, USA";
end;

compress;
destring, replace;

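/* split the location on commas; this creates geo1, geo2, ... */
split geo, parse(,);

/* these renames assume exactly five terms, as in the first example */
rename geo1 district;
rename geo2 city;
rename geo3 county;
rename geo4 state;
rename geo5 country;
drop geo;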
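
The renames above fit a location with exactly five terms. Since the question says the number of terms varies and only the two terms before USA are wanted, one possible alternative is to pull those two terms out with a regular expression anchored at the end of the string. This is only a minimal sketch, not part of the original answer: it is a standalone snippet using the default end-of-line delimiter (no #delimit ;), the two sample rows are copied from the question, and it assumes every location ends in at least three comma-separated terms.

```stata
* Standalone sketch: county and state are the two terms before the last one
clear
input str100 geo
"District 1, Ocean City, Cape May, New Jersey, USA"
"Precinct 5, District 2, Chicago, Cook, Illinois, USA"
end

* Pattern: <county>,<state>,<last term>, anchored at the end of the string.
* regexs(1) and regexs(2) return the first and second parenthesised groups;
* trim() removes the space that follows each comma.
gen county = trim(regexs(1)) if regexm(geo, "([^,]+),([^,]+),[^,]+$")
gen state  = trim(regexs(2)) if regexm(geo, "([^,]+),([^,]+),[^,]+$")

list county state
```

Rows whose location has fewer than three terms would be left with empty county and state, so it may be worth checking for those after running it.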

The second way is to insheet the data from the txt file directly, which is probably easier. This assumes that the commas between the quoted fields were not removed:

 #delimit;
 insheet first_name last_name yob geo using "raw_data.txt", clear comma nonames;

Then clean it up as in the first example.
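
In Stata 13 and later, import delimited supersedes insheet. As a hedged sketch of the same load step, assuming Stata 13 or newer and the same raw_data.txt file name used above (with the quoted, comma-separated layout from the question), something like the following should work. The v1 to v4 names are the defaults that import delimited assigns when the file has no header row.

```stata
* Assumes Stata 13+ and a quoted, comma-separated raw_data.txt with no header row
import delimited using "raw_data.txt", delimiters(",") varnames(nonames) clear

* With no header row the columns arrive as v1, v2, v3, v4
rename (v1 v2 v3 v4) (first_name last_name yob geo)

* Convert yob to numeric if it was read as a string
destring yob, replace
```

By default, commas inside double quotes are kept as part of the field, so the whole location string should land in geo.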
