0
votes

I have problems in reshaping data from wide to long format:

  1. I have no identifier variable for the wide variables.
  2. My dataset is quite wide. I do have about 7000 variables.
  3. The number of variables per ID is not constant, so for some IDs I have 5 and for others I have 10 variables.

I was hoping that this Stata FAQ could help me, but unfortunately this does not work properly (see following code snippets).

So I do have data that looks like the following example:

clear
input str45 Year
"2010"
"2011"
"2012"
"2014"
end
input str45 A101Meas0010
"1.50"
"1.70"
"1.71"
"1.71"
input str45 A101Meas0020
"50"
"60"
"65"
"64"
input str45 A101Meas0020A
"51"
"62"
"64"
"68"
input str45 FE123Meas0010
"1.60"
"1.75"
"1.92"
"1.94"
input str45 FE123Meas0020
"60"
"72"
"88"
"92"
list

     +-------------------------------------------------------------+
     | Year   A10~0010   A10~0020   A1~0020A   FE1~0010   FE1~0020 |
     |-------------------------------------------------------------|
  1. | 2010       1.50         50         51       1.60         60 |
  2. | 2011       1.70         60         62       1.75         72 |
  3. | 2012       1.71         65         64       1.92         88 |
  4. | 2014       1.71         64         68       1.94         92 |
     +-------------------------------------------------------------+

The final table I want to achieve would look something like this:

     +--------------------------------------------------+
     | Year         ID   Meas0010   Meas0020  Meas0020A |
     |--------------------------------------------------|
  1. | 2010       A101       1.50         50          . |
  2. | 2010      FE123       1.60         51         60 |
  3. | 2011       A101       1.70         60          . |
  4. | 2011      FE123       1.75         62         72 |
  5. | 2012       A101       1.71         65          . |
  6. | 2012      FE123       1.92         64         88 |
  7. | 2014       A101       1.71         64          . |
  8. | 2014      FE123       1.94         68         92 |
     +--------------------------------------------------+

I tried following code snippet close to the example from the Stata FAQ, but this throws an error:

unab vars : *Meas*
local stubs : subinstr local vars  "Meas0010" "", all
local stubs : subinstr local stubs "Meas0020" "", all
local stubs : subinstr local stubs "Meas0020A" "", all
reshape long "`stubs'", i(Year) j(Measurement) string
(note: j = Meas0010 Meas0020 Meas0020A)
(note: A101AMeas0010 not found)
variable A101Meas0010 not found
r(111);

Any ideas how to reshape this? I never had to reshape such an odd structure before.

Additional Question: In the example above I did have to specify the Measurement-Names Meas0010, Meas0020 and Meas0020A. Is it possible to automate this as well? All measurement names start with the keyword Meas, so the variable names are always of the structure _ID+MeasName, e.g. A101Meas0020A stands for ID A101 and Measurement Meas0020A.

The annoying thing is: I do know how to do this in MATLAB, but I am forced to use Stata here.

1
It's best to avoid negative comments on any software when you want help on it, even if you intend them humorously.Nick Cox
Cross-posted at statalist.org/forums/forum/general-stata-discussion/general/… Telling people about cross-posting is polite in any forum.Nick Cox
Forgetting to mention the Cross-Post is my fault. I'm sorry for that. But in no way did I misspeak about Stata in my post and I thus do not accept the hint of criticism mentioned. I am 100% fluent in MATLAB and could have solved this in minutes. Since in this project I do not have access to a MATLAB License, I need to solve this with Stata - a software I am not yet sufficiently acquainted with (and took me 3 hours without any progress). Please don't get me wrong - for specific applications as the one I am working on here, Stata is among the best tools to go for.EliteTUM
Thanks. Good to hear that "I am forced to use STATA here. :(" was not intended negatively.Nick Cox

1 Answers

1
votes

Your variable name structure is a little awkward, but there is a syntax to match. It's better covered in the help for reshape, and is only barely mentioned in the FAQ you cite (which I wrote, so I can be emphatic that it's intended as a supplement to the help, not the first line of documentation).

Your example yields to

clear
input str4 (Year  A101Meas0010  A101Meas0020  A101Meas0020A  FE123Meas0010  FE123Meas0020)  
"2010" "1.50" "50" "51" "1.60" "50"
"2011" "1.70" "60" "62" "1.75" "60"
"2012" "1.71" "65" "64" "1.92" "65"
"2014" "1.71" "64" "68" "1.94" "64"
end
reshape long @Meas0010 @Meas0020 @Meas0020A, i(Year) j(ID) string 
destring, replace 
sort Year ID 
list, sepby(Year) 


     +-----------------------------------------------+
     | Year      ID   Meas0010   Meas0020   Me~0020A |
     |-----------------------------------------------|
  1. | 2010    A101        1.5         50         51 |
  2. | 2010   FE123        1.6         50          . |
     |-----------------------------------------------|
  3. | 2011    A101        1.7         60         62 |
  4. | 2011   FE123       1.75         60          . |
     |-----------------------------------------------|
  5. | 2012    A101       1.71         65         64 |
  6. | 2012   FE123       1.92         65          . |
     |-----------------------------------------------|
  7. | 2014    A101       1.71         64         68 |
  8. | 2014   FE123       1.94         64          . |
     +-----------------------------------------------+

It seems bizarre that your example enters everything as string: note the destring in my code.

Without access to your dataset, I'd say that you should be able to find the more general syntax without automation. You know that there are at most about 10 measurements in the fullest case. In any event you are already showing the syntax tricks needed to remove strings you don't need.