I have a set of more than 2000 numbers gathered from measurement. I want to sample from this data set, about 10 values per test, while preserving the probability distribution both overall and, as far as is approximately possible, within each test. For example, each test should contain some small values, some mid-range values, and some large values, with mean and variance approximately equal to those of the original data. Pooling the samples from all tests, the overall mean and variance should also be approximately equal to those of the original data.
My dataset has a long-tailed distribution, so the data are not spread evenly across the range of values:
Fig 1. Density plot of ~2k elements of data.
I am using Java. Right now I draw a uniformly distributed random index into the data array and return the element at that position:
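    import java.util.Random;

    // Reuse one generator instead of creating a new Random on every call.
    private final Random random = new Random();

    public int getRandomData() {
        int[] data = {1231, 414, 222, 4211, 41, 203, 123, 432, ...}; // measurements, in measured order
        // Pick an index uniformly from [0, data.length) and return that element.
        int randomInt = random.nextInt(data.length);
        return data[randomInt];
    }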
I don't know whether this works the way I want, because the data are stored in the order they were measured, and that sequence has a large amount of serial correlation.
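To make the per-test requirement concrete, here is a minimal sketch of one possible approach, not a definitive implementation: sort a copy of the data, split it into k equal-count strata, and draw one value uniformly from each stratum. The class name `StratifiedSampler`, the method `sample`, and the choice of exactly one draw per stratum are all illustrative assumptions and are not part of my current code. Because the copy is sorted, the measured order (and its serial correlation) does not affect the draws.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical helper: draws one value from each equal-count stratum of the sorted data. */
class StratifiedSampler {
    private final int[] sorted;   // private sorted copy; the measured order is discarded
    private final Random random;

    StratifiedSampler(int[] data, Random random) {
        this.sorted = data.clone();   // leave the caller's array untouched
        Arrays.sort(this.sorted);
        this.random = random;
    }

    /** Returns k values, one per contiguous quantile band; requires 1 <= k <= data length. */
    int[] sample(int k) {
        int n = sorted.length;
        if (k < 1 || k > n) {
            throw new IllegalArgumentException("k must be between 1 and " + n);
        }
        int[] out = new int[k];
        for (int i = 0; i < k; i++) {
            // Stratum i covers sorted indices [lo, hi); long arithmetic avoids int overflow.
            int lo = (int) ((long) i * n / k);
            int hi = (int) ((long) (i + 1) * n / k);
            out[i] = sorted[lo + random.nextInt(hi - lo)];
        }
        return out;
    }
}
```

A test would then call something like `new StratifiedSampler(data, new Random()).sample(10)`. The result comes back in ascending order, so it would need shuffling if the order within a test matters. When the data length is a multiple of k, every element is equally likely to be picked in a given test, so the pooled samples should follow the original distribution; with a long tail, though, the top stratum spans a wide range of values, so the variance of a single test can still differ noticeably from that of the full data set.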