6
votes

I have a set of more than 2000 numbers gathered from measurements. I want to sample from this data set, about 10 values per test, while preserving the probability distribution both overall and, as far as approximately possible, within each test. For example, in each test I want some small values, some mid-range values, and some large values, with the mean and variance close to those of the original distribution. Combining all the tests, I also want the mean and variance of all the samples to be close to those of the original distribution.

As my data set follows a long-tailed probability distribution, the data are not spread evenly across the value range:

Fig 1. Probability density plot of the ~2k data elements.

I am using Java. Right now I draw a uniformly distributed random index into the data set and return the element at that position:

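import java.util.Random;

// Returns one element chosen uniformly at random from the measured data.
public int getRandomData() {
    int[] data = {1231, 414, 222, 4211, 41, 203, 123, 432, ...};
    int length = data.length;
    Random r = new Random();
    int randomInt = r.nextInt(length);  // uniform index in [0, length)
    return data[randomInt];
}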

I don't know whether this works as I want, because I use the data in the order in which it was measured, and that order has a large amount of serial correlation.


2 Answers

3
votes

It works as you want. The order of the data is irrelevant.
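To illustrate why the order does not matter: a uniform random index picks every stored element with probability 1/n, wherever that element sits in the array. The sketch below is only an illustration under assumptions. The class name `OrderIndependenceSketch`, the helper `meanOfRandomDraws`, the eight sample values, and the seed are all hypothetical stand-ins, not part of the question's code. It draws from the data in its original order and from a sorted copy, then compares both sample means with the true mean.

```java
import java.util.Arrays;
import java.util.Random;

final class OrderIndependenceSketch {

    // Draws `draws` elements uniformly at random (with replacement)
    // and returns the mean of the drawn values.
    static double meanOfRandomDraws(int[] data, int draws, Random rng) {
        long sum = 0;
        for (int i = 0; i < draws; i++) {
            sum += data[rng.nextInt(data.length)];
        }
        return (double) sum / draws;
    }

    public static void main(String[] args) {
        // Hypothetical values; substitute the real measurements.
        int[] measured = {1231, 414, 222, 4211, 41, 203, 123, 432};

        // Same values, different storage order.
        int[] sorted = measured.clone();
        Arrays.sort(sorted);

        Random rng = new Random(42); // fixed seed only to make runs repeatable

        System.out.println("original order: " + meanOfRandomDraws(measured, 100_000, rng));
        System.out.println("sorted order:   " + meanOfRandomDraws(sorted, 100_000, rng));
        System.out.println("true mean:      " + Arrays.stream(measured).average().getAsDouble());
    }
}
```

Both sample means should land close to the true mean, because serial correlation in the stored order never enters the draw: only the index is random, and it is independent from one draw to the next.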

2
votes

Uniform random sampling preserves the probability distribution: every draw follows the empirical distribution of the data set.
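A rough way to check this for the setup in the question (many tests of about 10 draws each) is to pool all the draws and compare their mean and variance with those of the full data set. The sketch below assumes hypothetical names (`PooledSamplingSketch`, `runTests`, `mean`, `variance`), a made-up eight-value data set, and an arbitrary seed. Note that a single test of 10 draws from a long-tailed distribution can still deviate noticeably; it is the pooled result that settles near the original values.

```java
import java.util.Random;

final class PooledSamplingSketch {

    // Arithmetic mean of the values.
    static double mean(int[] values) {
        double sum = 0.0;
        for (int v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    // Population variance (divides by n) of the values.
    static double variance(int[] values) {
        double m = mean(values);
        double sumSq = 0.0;
        for (int v : values) {
            sumSq += (v - m) * (v - m);
        }
        return sumSq / values.length;
    }

    // Runs `tests` tests of `perTest` uniform draws (with replacement)
    // and returns all drawn values pooled into one array.
    static int[] runTests(int[] data, int tests, int perTest, Random rng) {
        int[] pooled = new int[tests * perTest];
        for (int i = 0; i < pooled.length; i++) {
            pooled[i] = data[rng.nextInt(data.length)];
        }
        return pooled;
    }

    public static void main(String[] args) {
        // Hypothetical values; substitute the real measurements.
        int[] data = {1231, 414, 222, 4211, 41, 203, 123, 432};

        Random rng = new Random(7); // fixed seed only to make runs repeatable
        int[] pooled = runTests(data, 10_000, 10, rng);

        System.out.println("data mean:       " + mean(data));
        System.out.println("pooled mean:     " + mean(pooled));
        System.out.println("data variance:   " + variance(data));
        System.out.println("pooled variance: " + variance(pooled));
    }
}
```

Reusing one `Random` instance across all tests, rather than creating a new one per call, keeps the draws from a single generator stream and avoids repeated construction.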