Consider the following (excel) dataset:
m | r
----|------
2.0 | 3.3
0.8 |
| 4.0
1.3 |
2.1 | 5.2
| 2.3
| 1.9
2.5 |
1.2 | 3.0
2.0 | 2.6
My goal is to fill in missing values using the following condition:
Denote as R the pairwise correlation between the above two columns (around 0.68). Denote as R* the correlation after the empty cells have been filled in. Fill in the table so that (R - R*)^2 = 0. This is, I want to keep the correlation structure of the data intact.
So far I have done it using Matlab:
clear all;
m = xlsread('data.xlsx','A2:A11') ;
r = xlsread('data.xlsx','B2:B11') ;
rho = corr(m,r,'rows','pairwise');
x0 = [1,1,1,1,1,1];
lb = [0,0,0,0,0,0];
f = @(x)my_correl(x,rho);
SOL = fmincon(f,x0,[],[],[],[],lb)
where the function my_correl
is:
function X = my_correl(x,rho)
sum_m = (11.9 + x(1) + x(2) + x(3));
sum_r = (22.3 + x(1) + x(2) + x(3));
avg_m = (11.9 + x(1) + x(2) + x(3))/8;
avg_r = (22.3 + x(4) + x(5) + x(6))/8;
rho_num = 8*(26.32 + 4*x(1) + 2.3*x(2) + 1.9*x(3) + 0.8*x(4) + 1.3*x(5) + 2.5*x(6)) - sum_m*sum_r;
rho_den = sqrt(8*(22.43 + (4*x(1))^2 + (2.3*x(2))^2 + (1.9*x(3))^2) - sum_m^2)*sqrt(8*(78.6 + (0.8*x(4))^2 + (1.3*x(5))^ + (2.5*x(6))^2) - sum_r^2);
X = (rho - rho_num/rho_den)^2;
end
This function computes the correlation manually, where every missing data is a variable x(i)
.
The problem: my actual dataset has more than 20,000 observations. There is no way I can create that rho formula manually.
How can I fill in my dataset?
Note 1: I am open to use alternative languages like Python, Julia, or R. Matlab it's just my default one.
Note 2: a 100 points bounty will be awarded to the answer. Promise from now.
Amelia
package. Please see stats.stackexchange.com/questions/47247/… for more information about this package. The imputed dataset will not have exactly the same correlation as the incomplete dataset, but it should be close given that Amelia assumes the full dataset follows a multivariate normal distribution. – jav