
I have a dataset that should be reshaped to a wide format.

The data are currently long, with each observation identifying an "area" for an individual in a given school. Conventional reshaping code runs into problems because the data have two layers. First, the data should be reshaped wide so that each observation uniquely identifies a person and a school (with multiple areas). Second, we should end up with one observation for each person (containing multiple schools and multiple areas).

Here is an example of what the data look like now:

 * Example generated by -dataex-. To install: ssc install dataex
clear
input str4 id str2 school_code str1 area
"a111" "1x" "a"
"a111" "1x" "b"
"a111" "1x" "c"
"a111" "1y" "a"
"a111" "1y" "b"
"a111" "1y" "c"
"x222" "1z" "d"
"x222" "1z" "e"
"x222" "1z" "f"
"x222" "1k" "g"
"x222" "1k" "h"
"x222" "1k" "i"
end

And here is a tentative example of how I would like the dataset to look:

 * Example generated by -dataex-. To install: ssc install dataex
clear
input str4 id str2(school_code_1 school_code_2) str1(school1_area1 school1_area2 school1_area3 school2_area1 school2_area2 school2_area3)
"a111" "1x" "1y" "a" "b" "c" "a" "b" "c"
"x222" "1z" "1k" "d" "e" "f" "g" "h" "i"
end

1 Answer


Thanks for the data examples using dataex (SSC).

This is a standard reshape once you note the hint in this FAQ that you may need to create a new identifier.

clear
input str4 id str2 school_code str1 area
"a111" "1x" "a"
"a111" "1x" "b"
"a111" "1x" "c"
"a111" "1y" "a"
"a111" "1y" "b"
"a111" "1y" "c"
"x222" "1z" "d"
"x222" "1z" "e"
"x222" "1z" "f"
"x222" "1k" "g"
"x222" "1k" "h"
"x222" "1k" "i"
end

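* Number the observations within each id, keeping their original order
sort id, stable
by id: gen j = _n

* One observation per id, with school_code1-school_code6 and area1-area6
reshape wide school_code area, i(id) j(j)

list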

which produces one observation per id, as you ask, with the variables named school_code1 to school_code6 and area1 to area6.
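
If the variable names in the question (school_code_1, school1_area1 and so on) are wanted exactly, one possible route is two reshapes in sequence, one per layer. What follows is only a sketch under assumptions: the helper variables obs, first, a and s and the local macro stubs are names made up for illustration, and schools and areas are numbered in order of first appearance in the data, which may not be the order you want.

```stata
clear
input str4 id str2 school_code str1 area
"a111" "1x" "a"
"a111" "1x" "b"
"a111" "1x" "c"
"a111" "1y" "a"
"a111" "1y" "b"
"a111" "1y" "c"
"x222" "1z" "d"
"x222" "1z" "e"
"x222" "1z" "f"
"x222" "1k" "g"
"x222" "1k" "h"
"x222" "1k" "i"
end

* Record the original row order (obs is a hypothetical helper name)
gen long obs = _n

* a = area number within each person-school pair
bysort id school_code (obs): gen int a = _n

* s = school number within each person, by first appearance in the data
bysort id school_code (obs): gen long first = obs[1]
bysort id (first obs): gen int s = sum(first != first[_n-1])
drop obs first

* Step 1: one observation per person and school
reshape wide area, i(id s) j(a)

* Rename so that the school number can be placed where @ appears in each stub
rename school_code school_code_
local stubs
foreach v of varlist area* {
    rename `v' school_`v'
    local stubs `stubs' school@_`v'
}

* Step 2: one observation per person
reshape wide school_code_ `stubs', i(id) j(s)
order id school_code_* school*_area*
list
```

The @ in a reshape stub marks where the j value goes, which is what yields school1_area1 rather than a name with the number at the end. Run this as a do-file, since the local macro stubs has to survive between the loop and the second reshape.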

All that said, it is hard to imagine that this changed structure will make later Stata processing easier than the original data structure. Also, the new variables are distinguished only by arbitrary numeric suffixes: if your schools and areas were in a different order, what ends up in *1, *2, *3 would differ.

Small terminological point: the word "format" is heavily overloaded in computing, covering (at least) file formats, display formats, data structures and data (storage) types, whether rightly or wrongly so far as any particular software's formal terminology is concerned. In a Stata context there is a format command, which underlies one primary sense, that of display format. There is also a formal idea of file formats (e.g. http://www.stata.com/help.cgi?dta). So, although ambiguity rarely lasts long, I'd recommend talking of data layout or structure here (although the latter term is also overloaded...).