0
votes

I'm using Stata 12.0.

I have a CSV file of exposures for days of the year e.g. 01/11/2002 (DMY).

I want these imported into Stata and it to recognise that it is a date variable. I've been using:

 insheet using "FILENAME", comma 

But by doing this I am only getting the dates as labels rather than names of the variables. I guess this is because Stata doesn't allow variable names to start with numbers. I have tried to reformat the cells as Dates in Excel and import but then Stata thinks the whole column is a Date and changes the exposure data into dates.

Any advice on the best course of action is appreciated...

2
Can you show an example of the first few rows of data? It sounds to me like you have wide data in which dates are the column names. In this case, I would append a letter to the column name (e.g. "D") and then reshape from wide to long (or just keep it wide) after importing into Stata. You cannot have a variable name 12/03/12 in Stata. By the way, Stata has extensive documentation on dates and times with a whole section on Excel dates! - ChrisP
As I write, this has generated two quite different guesses at what your problem is. @Roberto Ferrer's guess now seems to me more plausible than mine, but either way we really need clarification, as also requested by ChrisP. Asking a question and then ignoring comments and answers is a poor way to treat the forum, regardless of whether you now understand your problem. - Nick Cox
Apologies, I have been travelling with intermittent internet access. I am grateful for all your responses. My data looks very similar to the screenshots used by @Roberto Ferrer, however I have ~9000 rows... even so, his advice has been very helpful. Although there does seem to be a limitation in that if there are over 3 months of days stata turns unresponsive after the reshape command. 3 months creates 2741784 observations and although I have read that stata/se has the capability to store unlimited observations, I'm guessing I may not have the computing power to do so? - Marko
I've tested on some old computer and with that amount of data it takes around 8 minutes. There is probably some difference in your data structure not accounted for in this exercise. I've added a note in my answer with details. - Roberto Ferrer

2 Answers

3
votes

As commented elsewhere, I too think you probably have a dataset that is best formatted as panel data. However, I address first the specific problem I think you have according to your question. Then I show some code in case you are interested in switching to a panel structure.

Here is an example CSV file open as a spreadsheet:

csv file in spreadsheet

And here the same file, open in a text editor. Imagine the ; are ,. This is related to my system's language settings.

csv file in text editor

Running this (substitute delimiter(";") for comma, in your case):

clear all
set more off
insheet using "D:\xlsdates.csv", delimiter(";")

results in

problem

which I think is the problem you describe: dates as variable labels. You would like to have the dates as variable names. One solution is to use a loop and strtoname() to rename the variables based on the variable labels. The following goes after importing with insheet:

foreach var of varlist * {
    local j = "`: variable l `var''"
    local newname = strtoname("`j'", 1)
    rename `var' `newname'
}

The result is

renamed variables

The function strtoname() will substitute out the ilegal characters for _'s. See help strtoname.

Now, if you want to work with a panel structure, one way would be:

clear all
set more off

insheet using "D:\xlsdates.csv", delimiter(";")

* Rename variables
foreach var of varlist * {
    local j = "`: variable l `var''"
    local newname = strtoname("`j'", 1)
    rename `var' `newname'
}

* Generate ID
generate id = _n

* Change to long format
reshape long _, i(id) j(dat) string

* Sensible name
rename _ metric

* Generate new date variable
gen dat2 = date(dat,"DMY", 2050)
format dat2 %d

list, sepby(id)

As you can see, there's no need to do anything beforehand in Excel or in an editor. Stata seems to be enough in this case.

Note: I've reused code from http://www.stata.com/statalist/archive/2008-09/msg01316.html.

A further note on performance: A CSV file with 122 variables or days (columns) and 10,000 observations or subjects (rows) + 1 header row, will produce 1,220,000 observations after the reshape. I have tested this on some old machine with a 1.79 GHz AMD processor and 640 MB RAM and the reshape takes approximately 8 minutes. Stata 12 has a hard-limit of 2,147,483,647 observations (although available RAM determines if you can actually achieve it) and Stata SE of 32,767 variables.

1
votes

There seems to be some confusion here between the names that variables may have, the values that variables may have and the types that they may have.

Thus, the statement "Stata doesn't allow variables to start with numbers" appears to be a reference to Stata's rules for variable names; if it were true, numeric variables would be impossible.

Stata has no variable (i.e. storage) type that is a date. Strictly, it has no concept of a date variable, but dates may be held as strings or numbers. Dates may be held as strings insofar as any text indicating a date is likely to be a string that Stata can hold. This is flexible, but not especially useful. For almost all useful work, dates need to be converted to integers and then assigned a display format that matches their content to be readable by people. Stata has various conventions here, e.g. that daily dates are held as integers with 0 meaning 1 January 1960.

It seems likely in your case that daily dates are being imported as strings: if so, the function date() (also known as daily()) may be used to convert to an integer date. The example here just uses the minimal default display format for daily dates: friendlier formats exist.

. set obs 1
obs was 0, now 1

. gen sdate = "12/03/12"

. gen ndate = daily(sdate, "DMY", 2050)

. format ndate %td

. l

    +----------------------+
    |    sdate       ndate |
    |----------------------|
 1. | 12/03/12   12mar2012 |
    +----------------------+

If your variable names are being misread, as guessed by @ChrisP, you may need to tell us more. A short and concrete example is worth more than a longer verbal description.