2
votes

I have a dataset with multiple subgroups (variable economist) and dates (variable temps99).

I want to run a tabsplit command, which does not accept the by or bysort prefixes. So I wrote a loop over a local macro to apply tabsplit to each subgroup in my data.

For example:

levelsof economist, local(liste)

foreach gars of local liste {
    display "`gars'"
    tabsplit SubjectCategory if economist=="`gars'", p(;) sort 
    return list
    replace nbcateco = r(r) if economist == "`gars'"
}

For each subgroup, Stata runs the tabsplit command and I use the variable nbcateco to store count results.

I did the same for the date so I can have the evolution of r(r) over time:

levelsof temps99, local(liste23)

foreach time of local liste23 {
    display "`time'"
    tabsplit SubjectCategory if temps99 == "`time'", p(;) sort
    return list
    replace nbcattime = r(r) if temps99 == "`time'"
}

Now I want to do it for each economist subgroup by date temps99. I tried multiple combinations, but I am not very good with macros (yet?).

What I want is to be able to have my r(r) for each of my subgroups over time.

2 Answers

1
votes

This is an example of the XY problem, I think. See http://xyproblem.info/

tabsplit is a command in the package tab_chi from SSC. I have no negative feelings about it, as I wrote it, but it seems quite unnecessary here.

You want to count categories in a string variable: semi-colons are your separators. So count semi-colons and add 1.

local SC SubjectCategory
gen NCategory = 1 + length(`SC') - length(subinstr(`SC', ";", "", .)) 

Then (e.g.) table or tabstat will let you explore further by groups of interest.
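As a minimal sketch of that exploration, here is the counting idea applied to a small made-up dataset (the five observations are invented purely for illustration; the variable names follow the question). Note one assumption made here that is not in the original line of code: an empty SubjectCategory would otherwise be counted as 1 category, so the sketch leaves such observations missing.

```stata
* toy data, invented for illustration only
clear
input str10 economist int temps99 str30 SubjectCategory
"A" 1999 "NEP-LAB;NEP-MAC"
"A" 1999 "NEP-DEV"
"A" 2000 "NEP-LAB;NEP-MAC;NEP-MON"
"B" 1999 "NEP-ENV"
"B" 2000 "NEP-ENV;NEP-ENE"
end

* categories per observation = number of semi-colons + 1
* (skip empty strings, which would otherwise yield 1)
local SC SubjectCategory
gen NCategory = 1 + length(`SC') - length(subinstr(`SC', ";", "", .)) if !missing(`SC')

* summary by economist
tabstat NCategory, by(economist) statistics(n mean max)

* total per economist-by-date cell, stored alongside the data
bysort economist temps99: egen ncat_cell = total(NCategory)
```

Bear in mind that this counts categories per observation, repeats included; it does not by itself reduce to distinct categories within a group.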

To see the counting idea, consider 3 categories with 2 semi-colons.

. display length("frog;toad;newt")
14

. display length(subinstr("frog;toad;newt", ";", "", .))
12

If we replace each semi-colon with an empty string, the change in length is the number of semi-colons deleted. Note that we don't have to change the variable to do this. Then add 1. See also this paper.

That said, a way to extend your approach might be

egen class = group(economist temps99), label 
su class, meanonly 
local nclass = r(max)
gen result = . 

forval i = 1/`nclass' {
    di "`: label (class) `i''" 
    tabsplit SubjectCategory if class == `i', p(;) sort
    return list
    replace result = r(r) if class == `i'
}

Using statsby would be even better. See also this FAQ.
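A rough sketch of what the statsby route might look like, assuming tabsplit (from tab_chi on SSC) is installed and leaves r(r) behind as in the question's loop. The toy data and the name nbcat are invented for illustration. The clear option replaces the data in memory with one observation per group, so save your dataset first (or use saving() instead) if you still need it.

```stata
* assumes tabsplit is installed: ssc install tab_chi
* toy data, invented for illustration only
clear
input str10 economist int temps99 str30 SubjectCategory
"A" 1999 "NEP-LAB;NEP-MAC"
"A" 1999 "NEP-DEV"
"A" 2000 "NEP-LAB;NEP-MAC;NEP-MON"
"B" 1999 "NEP-ENV"
"B" 2000 "NEP-ENV;NEP-ENE"
end

* run tabsplit once per economist-by-date group and collect r(r);
* nbcat is a hypothetical name for the collected result
statsby nbcat = r(r), by(economist temps99) clear: ///
    tabsplit SubjectCategory, p(;) sort

* one row per group: economist, temps99, nbcat
list economist temps99 nbcat, sepby(economist)
```

The appeal over the explicit loop is that the results come back as a dataset of groups, ready to merge or graph, without any macro bookkeeping.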

2
votes

Here's a solution that shows how to calculate the number of distinct publication categories within each by-group. This uses runby (from SSC). runby loops over each by-group, each time replacing the data in memory with the data from the current by-group. For each by-group, the commands contained in the user's program are executed. Whatever is left in memory when the user's program terminates is considered results and accumulates. Once all the groups have been processed, these results replace the data in memory.

I used the verbose option because I wanted to present the results for each by-group using nice formatting. The derivation of the list of distinct categories is done by splitting each list, converting to a long layout, and reducing to one observation per distinct value. The distinct_categories program generates one variable that contains the final count of distinct categories for the by-group.

* create a demonstration dataset
* ------------------------------------------------------------------------------
clear all
set seed 12345

* Example generated by -dataex-. To install: ssc install dataex
clear
input str19 economist
"Carmen M. Reinhart" 
"Janet Currie"       
"Asli Demirguc-Kunt" 
"Esther Duflo"       
"Marianne Bertrand"  
"Claudia Goldin"     
"Bronwyn Hughes Hall"
"Serena Ng"          
"Anne Case"          
"Valerie Ann Ramey"  
end

expand 20
bysort economist: gen temps99 = 1998 + _n
gen pubs = runiformint(1,10)
expand pubs
sort economist temps99
gen pubid = _n
local nep NEP-AGR NEP-CBA NEP-COM NEP-DEV NEP-DGE NEP-ECM NEP-EEC NEP-ENE ///
          NEP-ENV NEP-HIS NEP-INO NEP-INT NEP-LAB NEP-MAC NEP-MIC NEP-MON ///
          NEP-PBE NEP-TRA NEP-URE
gen SubjectCategory = ""
forvalues i=1/19 {
    replace SubjectCategory = SubjectCategory + " " + word("`nep'",`i') ///
        if runiform() < .1
}
replace SubjectCategory = subinstr(trim(SubjectCategory)," ",";",.)
leftalign   // from SSC
* ------------------------------------------------------------------------------


program distinct_categories
  dis _n _n _dup(80) "-"
  dis as txt "fille = " as res economist[1] as txt _col(68) " temps = " as res temps99[1]

  // if there are no subjects for the group, exit now to avoid a no obs error
  qui count if !mi(trim(SubjectCategory))
  if r(N) == 0 exit

  // split categories, reshape to a long layout, and reduce to unique values
  preserve
  keep pubid SubjectCategory
  quietly {
    split SubjectCategory, parse(;) gen(cat)
    reshape long cat, i(pubid)
    bysort cat: keep if _n == 1
    drop if mi(cat)
  }

  // show results and generate the wanted variable
  list cat
  local distinct = _N
  dis _n as txt "distinct = " as res `distinct'
  restore
  gen wanted = `distinct'
end

runby distinct_categories, by(economist temps99) verbose