I'm trying to compare two datasets in which an attribute is distributed differently. Is it possible to do stratified random sampling in one dataset using the strata proportions of another? To clarify, an example:
Dataset A has 1M records with the attribute color. Across the entire dataset, 50% are blue and 50% are red.
I have another dataset, Dataset B, of 100k records with the same attribute, color, but with a distribution of 20% blue and 80% red.
Is it possible for me to conduct stratified random sampling on Dataset A so that I get 100k records with 20% blue and 80% red?
I don't have any code written yet, simply because I don't know where to start. I've looked at the documentation for proc surveyselect, and it seems that this wouldn't be possible with it.
Right now, I'm planning to do it manually: split Dataset A by color, then take a 20k random sample from the blue records and an 80k random sample from the red records.
But in my real dataset I want to stratify by 2 attributes, each with more than 2 levels, so I'd like to think there's a more efficient way of doing this.
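
To make the manual approach concrete, here is a rough, untested sketch of what I have in mind. The dataset names (A, blue_sample, red_sample, A_sample), the seed, and the exact values 'blue' and 'red' are placeholders I've assumed for illustration; the sample sizes are just 20% and 80% of 100k.

```sas
/* Untested sketch of the manual approach; all dataset names are placeholders. */

/* Simple random sample of 20,000 records from the blue subset of A */
proc surveyselect data=A(where=(color='blue')) out=blue_sample
                  method=srs n=20000 seed=1234;
run;

/* Simple random sample of 80,000 records from the red subset of A */
proc surveyselect data=A(where=(color='red')) out=red_sample
                  method=srs n=80000 seed=1234;
run;

/* Stack the two samples into one 100k-record dataset (20% blue, 80% red) */
data A_sample;
    set blue_sample red_sample;
run;
```

With two stratifying attributes and several levels each, this would mean one such step per combination of levels, which is what I'd like to avoid.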