2
votes

I've inherited some old Stata code (Stata11) that uses the xtile function to categorize observations in a vector by their quantiles (in this case, just the standard 5 quintiles, 20%, 40%, 60%, 80%, 100%).

I'm trying to replicate a piece of the code in Python and I am using the SciPy.stats.mstats function mquantiles() for the computation.

As near as I can tell from Stata documentation and searching online, the Stata xtile method tries to invert the empirical CDF of the data, and uses the equal-weighted average of all observations for which the CDF is flat to make the cutpoint. This seems like a very poor way to categorize quantiles, but it is what it is and I am sure there are cases where this is the right thing to do.

My question is how to make mquantiles() produce the same sort of breaking convention. I noticed that this function has two parameters, alphap and betap (the documentation calls them alpha and beta but you need the extra 'p' to get it to work, at least I do... I get an error if I just use 'alpha' and 'beta' with Python 2.7.1 and SciPy 0.10.0). But even in the SciPy docs, I can't see whether there's a combination of these parameters that produces the mean over flat CDF ranges.

I see what looks like the option to compute as the median or mode of this range, but not mean (it's also not clear if these SciPy median/mode options with alpha and beta are computed as the median/mode of the observations or of the range that would produce the flat CDF value.)

Any help disambiguating these different options and finding some documentation that helps me recreate the Stata convention in Python would be great. Please refrain from answers that just say "write your own quantile function." Firstly, that doesn't help me understand the conventions of either Stata or SciPy, and secondly, given these numerical libraries, writing my own quantile function should be a last resort. I can certainly do it, but it would be bad all around if I need to.

1
Well, at least I was able to reproduce Stata steps. If you were to say now that R documentation is terrific, I will just throw my hands up in despair ;)StasK
EMS, I don't understand what you are referring to about "poor documentation": not only is xtile open source (it is implemented within an ado-file), but it is extensively documented in the help with numerous examples, technical notes, and a full page of mathematical explanation in the "Methods and Formulas" section.whuber
As of Stata 13 (June 2013) entire documentation for Stata is publicly accessible stata.com/manuals13/dpctile.pdf is the pertinent manual entry.Nick Cox
I record my dissent with @EMS's persistent and incorrect references to an xtile function in Stata. Commands and functions are both defined in Stata but are quite disjoint. There is no xtile function in Stata. Stata's terminology is surely pertinent to Stata questions. Using some terminology out of personal preference or greater familiarity with other software is inappropriate.Nick Cox
You get to keep your original wording as a courtesy to you as original poster, but it seems that you have convinced no-one. The assertions about what is relevant here seem largely personal views. Cross-communication between people using different languages can't be helped by using incorrect terminology; programming is all about getting the details exactly right. There could be an xtile function in Stata, but there isn't one; that is a matter of fact.Nick Cox

1 Answers

7
votes

The scipy.stats.mquantiles documentation was poor and wrong in places, fixed now so that might be helpful... http://docs.scipy.org/scipy/docs/scipy.stats.mstats_basic.mquantiles/. That process started when you pointed out the alpha/beta, alphap/betap discrepancy. Thank you.

The implementation of mquantiles follow R.

The biggest difference comes from that R has 9 discrete types, where because scipy.stats.mquantiles calculates 'm' from 'alphap' and 'betap', scipy has a continuous range of "types" (for lack of a better word).

I admit that I do not understand all of the ins and outs of the statistics involved so I settled on a brute force evaluation. I found an xtile example at http://www.biostat.sdu.dk/~biostat/StataReferenceManual/StataRef.pdf and was able to match the results with alphap=0.5, and betap=0.5 (piecewise linear). Not definitive nor exhaustive, but all I have right now.

In [1]: import scipy.stats as st

In [9]: st.mstats.mquantiles([23,56,67,123,99,17],prob=[0.5],alphap=0.5,betap=.5)
Out[9]: array([ 61.5])

In [10]: st.mstats.mquantiles([23,56,67,123,99,17],prob=[0.33,0.66],alphap=0.5,betap=.5)
Out[10]: array([ 38.84,  81.72])

In [11]: st.mstats.mquantiles([23,56,67,123,99,17],prob=[0.25,0.5,0.75],alphap=0.5,betap=.5)
Out[11]: array([ 23. ,  61.5,  99. ])

The last is a little problematic since two of the division points are exactly on values in the data set. Stata/xtile (at least in the examples that I found) does not give the split points for the quantiles but gives the quantiles themselves. Given the sorted data set [17,23,56,67,99,123], Stata/xtile gave the categorization as [1,1,2,3,3,4] which means that for scipy.stat.mquantiles to match the upper bound of a quantile is greater than or equal to all values in that quantile.