I have a fictional data frame with both continuous and categorical variables:
library(dplyr)
library(ggplot2)
library(tidyr)
df <- tibble(
  # each sample gets an id from 1:1000
  id = 1:1000,
  # sex, categorical, either "f" or "m"
  sex = ifelse(runif(1000, 0, 1) < 0.5, "f", "m"),
  # disease stage, categorical, either 1 or 2
  stage = ifelse(runif(1000, 0, 1) < 0.5, 1, 2),
  # age, continuous
  age = runif(1000, 20, 80),
  # blood, continuous
  blood = runif(1000, 10, 1000)
)
The categorical variables have a nearly 50:50 distribution
prop.table(table(df$sex))
prop.table(table(df$stage))
And the continuous variables have a rather arbitrary, non-normal distribution:
df %>%
  gather(test, result, 4:5) %>%
  ggplot(aes(result)) +
  geom_density() +
  facet_wrap(test ~ ., scales = "free")
If I now draw 100 samples from df, the resulting distributions can look quite different from the initial distributions:
sample_df <- sample_n(df, 100, replace = FALSE)
sample_df %>%
  gather(test, result, 4:5) %>%
  ggplot(aes(result)) +
  geom_density() +
  facet_wrap(test ~ ., scales = "free")
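To put a number on how far such a sample drifts from the population, rather than judging it from the density plots alone, one option is to compare empirical CDFs for the continuous variables and proportions for the categorical ones. The following is only a minimal sketch: the seed and the helper name `ks_stat` are assumptions made for illustration. The helper computes the two-sample Kolmogorov-Smirnov distance, i.e. the largest vertical gap between two empirical CDFs.

```r
library(dplyr)

set.seed(1)  # assumed seed, only so the sketch is reproducible

# Population built as in the question
df <- tibble(
  id    = 1:1000,
  sex   = ifelse(runif(1000) < 0.5, "f", "m"),
  stage = ifelse(runif(1000) < 0.5, 1, 2),
  age   = runif(1000, 20, 80),
  blood = runif(1000, 10, 1000)
)

# Simple random sample of 100 rows without replacement
sample_df <- slice_sample(df, n = 100)

# ks_stat (hypothetical helper): largest gap between two empirical CDFs,
# evaluated at every observed value; 0 means identical distributions
ks_stat <- function(x, ref) {
  grid <- sort(unique(c(x, ref)))
  max(abs(ecdf(x)(grid) - ecdf(ref)(grid)))
}

# One discrepancy value per variable, each between 0 and 1
drift <- c(
  age   = ks_stat(sample_df$age, df$age),
  blood = ks_stat(sample_df$blood, df$blood),
  sex   = abs(mean(sample_df$sex == "f") - mean(df$sex == "f")),
  stage = abs(mean(sample_df$stage == 1) - mean(df$stage == 1))
)
print(round(drift, 3))
```

Smaller values mean the sample tracks the population more closely on that variable; rerunning with a different seed shows how much the drift varies from draw to draw.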
My question is: how would I sample from df so that my sample_df follows the distribution and probability of all of my parameters (sex, age, stage, blood)? I thought about fitting a regression model to the df and picking samples based on the residuals, i.e. the distance of each sample to the regression line.
The actual underlying problem is a large cohort of patient data from which I want to pick a subcohort while preserving the distribution and probability of certain patient and disease characteristics.
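For comparison with the regression idea, a different and commonly used technique is stratified sampling: cut the continuous variables into bins, form strata from all combinations of the categorical variables and the bins, and draw the same fraction from every stratum. The sketch below is only an illustration under assumptions that are not part of the original setup: the seed, the choice of two bins per continuous variable (`ntile(..., 2)`), and the names `age_bin`, `blood_bin` and `strat_df`.

```r
library(dplyr)

set.seed(42)  # assumed seed, only so the sketch is reproducible

# Population built as in the question
df <- tibble(
  id    = 1:1000,
  sex   = ifelse(runif(1000) < 0.5, "f", "m"),
  stage = ifelse(runif(1000) < 0.5, 1, 2),
  age   = runif(1000, 20, 80),
  blood = runif(1000, 10, 1000)
)

strat_df <- df %>%
  # Bin the continuous variables at their median (assumed: 2 bins each)
  mutate(
    age_bin   = ntile(age, 2),
    blood_bin = ntile(blood, 2)
  ) %>%
  # 2 x 2 x 2 x 2 = 16 strata
  group_by(sex, stage, age_bin, blood_bin) %>%
  # Draw about 10% within each stratum; per-stratum counts are whole
  # numbers, so the total is only approximately 100
  slice_sample(prop = 0.1) %>%
  ungroup()

# Compare categorical proportions in the sample and in the population
print(prop.table(table(strat_df$sex)))
print(prop.table(table(df$sex)))
```

More bins track the continuous distributions more closely but leave fewer rows per stratum, so the per-stratum rounding matters more; the bin count is a trade-off to tune, not a recommendation.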
Any help is highly appreciated.
Edit
I know that a sample of 1/10 of the population is too small and that increasing the sample size will make the distribution approximate that of the population it was drawn from. To make my situation clearer: it is not manageable for me to use more than, let's say, 1/4 of my population. And with every draw from the population there is some risk that I pick a very unrepresentative cohort (sampling error). So basically I'm looking for a method that minimizes this risk and maximizes the chance that my sample is an accurate representation of the population.
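One way to make the risk described above concrete is to draw many candidate samples of the affordable size, score each one by how far it deviates from the population, and keep the best-scoring candidate. This is only a sketch under stated assumptions: the seed, the number of candidates (200), the sample size of 250 (1/4 of the population), and the helpers `ks_stat` and `imbalance` are all hypothetical choices, and the score looks only at the marginal distribution of each variable.

```r
library(dplyr)

set.seed(123)  # assumed seed, only so the sketch is reproducible

# Population built as in the question
df <- tibble(
  id    = 1:1000,
  sex   = ifelse(runif(1000) < 0.5, "f", "m"),
  stage = ifelse(runif(1000) < 0.5, 1, 2),
  age   = runif(1000, 20, 80),
  blood = runif(1000, 10, 1000)
)

# ks_stat (hypothetical helper): largest gap between two empirical CDFs
ks_stat <- function(x, ref) {
  grid <- sort(unique(c(x, ref)))
  max(abs(ecdf(x)(grid) - ecdf(ref)(grid)))
}

# imbalance (hypothetical score): worst discrepancy over the four variables
imbalance <- function(s, pop) {
  max(
    ks_stat(s$age, pop$age),
    ks_stat(s$blood, pop$blood),
    abs(mean(s$sex == "f") - mean(pop$sex == "f")),
    abs(mean(s$stage == 1) - mean(pop$stage == 1))
  )
}

# Draw 200 candidate subcohorts of 250 rows each (1/4 of the population)
candidates <- lapply(1:200, function(i) slice_sample(df, n = 250))
scores <- vapply(candidates, imbalance, numeric(1), pop = df)

# Keep the candidate with the smallest worst-case discrepancy
best_df <- candidates[[which.min(scores)]]
print(summary(scores))
```

The spread of `scores` shows how much representativeness varies between random draws of the same size. A subcohort chosen this way is no longer a simple random sample, which may matter for any inference done on it afterwards.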
x1 <- runif(100000,0,1); hist(sample(x1, 100)); hist(sample(x1, 10000))
- rg255