0 votes

I have a fictional dataframe like so, including continuous and categorical variables:

library(dplyr)
library(ggplot2)
library(tidyr)


df <- tibble(
  # each sample gets id from 1:1000
  id = 1:1000,
  # sex, categorical, either "f" or "m"
  sex = ifelse(runif(1000, 0, 1) < 0.5, "f","m"),
  # disease stage, categorical, either 1 or 2
  stage = ifelse(runif(1000,0,1) < 0.5, 1,2),
  # age, continuous
  age = runif(1000,20,80),
  # blood, continuous
  blood = runif(1000,10,1000)
)

The categorical variables have a nearly 50:50 distribution:

prop.table(table(df$sex))
prop.table(table(df$stage))

And the continuous variables have a rather arbitrary, non-normal distribution:

df %>% 
  gather(test, result, 4:5) %>%   
  ggplot(aes(result)) +
  geom_density() +
  facet_wrap(test ~ ., scales = "free")

Non-normal distribution of age and blood

If I now draw a sample of 100 rows from df, the resulting distributions are entirely different from the initial distribution

sample_df <- sample_n(df, 100, replace=F)

sample_df %>% 
  gather(test, result, 4:5) %>%   
  ggplot(aes(result)) +
  geom_density() +
  facet_wrap(test ~ ., scales = "free")

Distribution of n=100 samples

My question now is: how would I sample from df so that sample_df follows the distribution and probability of all of my parameters (sex, age, stage, blood)? I thought about fitting a regression model to df and picking samples based on the residuals, i.e. the distance of each sample to the regression line.
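
To make that idea concrete, here is a rough sketch of one interpretation of it (purely illustrative, and possibly not a good approach; the model blood ~ age + sex + stage and keeping the 100 rows with the smallest residuals are just assumptions for the example):

# fit a model and keep the rows closest to the regression line
fit <- lm(blood ~ age + sex + stage, data = df)

sample_resid <- df %>% 
  mutate(resid = abs(residuals(fit))) %>% 
  arrange(resid) %>% 
  slice(1:100)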

The actual underlying problem is a large cohort of patient data from which I want to pick a subcohort while preserving the distribution and probability of certain patient and disease characteristics.

Any help is highly appreciated.

Edit

I know that a sample of 1/10 of the population is small and that increasing the sample size will make the sample distribution approximate that of the population it was drawn from. To make my situation clearer: it is not feasible for me to use more than, say, 1/4 of my population. And with every draw from the population there is some risk that I pick a very unrepresentative cohort (sampling error). So basically I'm looking for a method to minimize this risk and to maximize the chance that my sample is the most accurate representation of the population.
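
To put a number on that risk, one could for example compare a single draw to the full population (only a sketch of what I mean by "unrepresentative"; the choice of statistics is arbitrary): a two-sample Kolmogorov-Smirnov statistic for a continuous variable and a difference in proportions for a categorical one:

sample_df <- sample_n(df, 100, replace = FALSE)

# distance between the sample and the population distribution of blood
suppressWarnings(ks.test(sample_df$blood, df$blood))$statistic

# difference in the proportion of females between sample and population
abs(mean(sample_df$sex == "f") - mean(df$sex == "f"))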

A sample of 100 is pretty small; it will very easily end up looking different from a uniform distribution. Run the following to compare the effect of sample sizes: x1 <- runif(100000,0,1); hist(sample(x1, 100)); hist(sample(x1, 10000)) - rg255
The samples are different because they are just that - samples. If you had 100 balls, half white and half black, and drew samples of 6, many would not be 3 white and 3 black. So it depends on what you want to use the samples for. If for a simulation, then the model would be designed for that stochastic randomness. Also note the Central Limit Theorem, i.e. the means of many samples will be approximately normally distributed. - SteveM
You are running the wrong simulation: age = runif(1000,20,80) means a uniform (flat) distribution from 20 to 80. If you want a normal distribution it should be rnorm(1000,<mean>,<sd>). - StupidWolf

2 Answers

0 votes

Your base population is sampled from a uniform distribution. Even with 1000 individuals, you can see from your figures that there is some "non-uniformness" to it. Your sample population is then just 100 individuals. By chance you will sample something that resembles, but does not perfectly reflect, your base population or a uniform distribution. The code below compares sample populations of 100 individuals and 20000 individuals.

# 100,000 draws from a uniform distribution as the "population"
x1 <- runif(100000, 0, 1)

# empty canvas to overlay the density estimates on
plot(NULL, xlim = c(0, 1), ylim = c(0, 1.2))

# 20 small samples (n = 100, red) vs 20 large samples (n = 20000, black)
for (i in 1:20) {
  points(density(sample(x1, 100)), type = "l", col = "red")
  points(density(sample(x1, 20000)), type = "l", col = "black")
}
0 votes

Okay, I think I figured out what I actually wanted: stratified sampling. Basically, define strata based on the frequency of certain parameters and sample from them.
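
As a sketch of what that could look like with dplyr (assuming strata defined by the categorical variables sex and stage, and drawing the same fraction of rows from every stratum):

# stratified sampling: sample 10% within each sex/stage stratum so that
# the categorical proportions of the population are preserved
strat_sample <- df %>% 
  group_by(sex, stage) %>% 
  sample_frac(0.1) %>% 
  ungroup()

# the proportions should now closely match the population
prop.table(table(strat_sample$sex))
prop.table(table(strat_sample$stage))

Continuous variables such as age could be binned (e.g. with cut()) and added to group_by() to stratify on them as well.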

Here's some further reading on that