I'm still in the early stages of learning R, but I've gotten a few functions working and now I'm looking at my final "project."
I've written a function that takes each of my four sources of data (different populations), creates histograms, performs Kolmogorov-Smirnov tests, and then graphs any significant results for a given row. What I want to do is run it with apply. The issue is that my function takes four arguments, and I don't know a way to make apply take four sources of data.
hist_fx <- function(w,x,y,z) {
hist(w,prob=TRUE,col="green",xlim=c(-1,1),ylim=c(0,3))
lines(density(w),col="red")
abline(v=c(mean(w)),col="red")
hist(x,prob=TRUE,col="blue",xlim=c(-1,1),ylim=c(0,3))
lines(density(x),col="red")
abline(v=c(mean(x)),col="red")
hist(y,prob=TRUE,col="yellow",xlim=c(-1,1),ylim=c(0,3))
lines(density(y),col="red")
abline(v=c(mean(y)),col="red")
hist(z,prob=TRUE,col="purple",xlim=c(-1,1),ylim=c(0,3))
lines(density(z),col="red")
abline(v=c(mean(z)),col="red")
all <- c(w,x,y,z)
hist(all,prob=TRUE,xlim=c(-1,0.5),ylim=c(0,3))
lines(density(w),col="purple")
lines(density(x),col="red")
lines(density(y),col="blue")
lines(density(z),col="green")
plot(ecdf(w),col="green")
plot(ecdf(x),col="blue",add=TRUE)
plot(ecdf(y),col="red",add=TRUE)
plot(ecdf(z),col="purple",add=TRUE)
t1 <- ks.test(w,x)
print(t1)
t2 <- ks.test(w,y)
print(t2)
t3 <- ks.test(w,z)
print(t3)
if(t1$p.value < 0.05) {
plot(ecdf(w),col="green")
plot(ecdf(x),col="blue",add=TRUE)
}
if(t2$p.value < 0.05) {
plot(ecdf(w),col="green")
plot(ecdf(y),col="red",add=TRUE)
}
if(t3$p.value < 0.05) {
plot(ecdf(w),col="green")
plot(ecdf(z),col="purple",add=TRUE)
}
}
I'm able to use this function with apply for one population at a time (i.e. turn hist_fx into a function of one variable). However, I can't find a way to make this work for all four populations at the same time. I've messed around with some for loops, though they haven't been successful as of yet.
One last thing that might be of use: my data is arranged such that independent variables are the rows and the dependent variables are columns. Consequently, I need to run these per row (hence my idea of a for loop).
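For context, the single-population case that does work looks roughly like the sketch below. The names `pop` and `hist_one` and the toy data are hypothetical stand-ins, not my real objects, and the function only computes the histogram bins instead of drawing them so that it returns something inspectable.

```r
# Toy stand-in for one population: 3 variables (rows) x 5 stars (columns).
# (Assumption: the real matrices have the same rows-as-variables layout.)
set.seed(1)
pop <- matrix(runif(15, min = -0.5, max = 0.5), nrow = 3)

# Hypothetical one-argument version of the function: it receives one row.
hist_one <- function(w) {
  h <- hist(w, plot = FALSE)  # compute the bins without drawing anything
  c(mean = mean(w, na.rm = TRUE), n_bins = length(h$counts))
}

# MARGIN = 1 means "call hist_one once per row".
res <- apply(pop, 1, hist_one)
```

Here `res` has one column per row of `pop`. This pattern works because apply walks over a single matrix, which is exactly what breaks down once four matrices are involved.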
EDIT:
Here's the dput for one of the populations:
dput(k2) structure(c(-0.15, 0.13, 0.23, -0.23, 0.06, -0.11, 0.107, 0.06, -0.17, 0.12, 0.06, -0.25, -0.32, 0.13, 0.06, -0.2, -0.08, 0.06, 0.12, 0.02, 0.11, -0.11, -0.15, 0.097, 0.347, -0.307, 0.097, -0.047, 0.09, 0.01, -0.217, 0.117, 0.03, -0.3, -0.33, 0.13, 0.19, -0.24, -0.08, -0.01, 0.15, 0.61, 0.18, -0.15, -0.103, 0.135, 0.31, -0.25, 0.157, -0.105, -0.08, 0.01, -0.165, 0.17, 0.1, -0.23, -0.28, 0.15, 0.13, -0.14, -0.06, 0.01, 0.07, -0.02, 0.11, -0.06, -0.123, 0.13, 0.35, -0.27, 0.165, -0.065, 0.135, 0.13, -0.17, 0.135, 0.08, -0.21, -0.25, 0.2, 0.16, -0.18, NA, -0.04, 0.05, -0.02, 0.13, -0.14, -0.13, 0.098, 0.27, -0.193, 0.062, -0.08, 0.057, 0.028, -0.199, 0.1, 0.04, -0.24, -0.32, 0.13, 0.13, -0.15, -0.05, 0.01, 0.08, -0.04, 0.1, -0.1, -0.14, 0.154, 0.261, -0.194, 0.1, -0.129, 0.063, 0.142, -0.136, 0.136, 0.08, -0.23, -0.24, 0.12, 0.1, -0.16, -0.06, 0.04, 0.09, -0.01, 0.04, -0.08, -0.127, 0.133, 0.337, -0.06, 0.11, -0.107, 0.16, 0.167, -0.183, 0.103, 0.05, -0.2, -0.3, 0.22, -0.01, -0.17, -0.14, 0.02, 0.07, 0.01, 0.11, -0.11, -0.155, 0.221, 0.22, -0.172, 0.09, -0.15, 0.12, 0.03, -0.153, 0.146, 0.11, -0.2, -0.24, 0.16, 0.07, -0.19, -0.1, 0.03, 0.17, 0.02, 0.09, -0.16, -0.062, 0.19, 0.269, -0.265, 0.118, -0.11, 0.126, 0.094, -0.186, 0.151, 0.08, -0.26, -0.31, 0.13, 0.09, -0.23, -0.12, 0.05, 0.13, 0.01, 0.11, -0.14, -0.095, 0.14, 0.24, -0.46, 0.09, -0.17, 0.08, 0.01, -0.24, 0.16, 0.04, -0.38, -0.39, 0.11, 0.06, -0.31, -0.25, 0.03, 0.21, -0.14, 0, -0.22, -0.07, 0.148, 0.311, -0.27, 0.11, -0.055, 0.16, 0.04, -0.197, 0.064, 0.09, -0.24, -0.34, 0.17, 0.07, -0.15, -0.18, 0.03, 0.13, 0.07, 0.13, -0.08, -0.136, 0.142, 0.27, -0.257, 0.1, -0.13, 0.103, 0.064, -0.197, 0.118, 0.06, -0.29, -0.35, 0.13, 0.1, -0.19, -0.13, 0.01, 0.1, -0.01, 0.13, -0.15), .Dim = c(22L, 12L))
To further clarify, here's the format of the actual data frame:
c1 c2 c3 c4
r2 x x x
r3 x x x
r4 x x x
Each column represents a star's values for the variable on the row. As such, I want to create a histogram for each row, for each dataset.
The argument names in the function are arbitrary; I chose them for simplicity's sake: w = population 1, x = population 2, y = population 3, z = population 4.
As for an example:
> hist_fx(k2[1,],n2[1,],j2[1,],g2[1,])
Two-sample Kolmogorov-Smirnov test
data: w and x
D = 1, p-value = 1.229e-05
alternative hypothesis: two-sided
Two-sample Kolmogorov-Smirnov test
data: w and y
D = 1, p-value = 1.229e-05
alternative hypothesis: two-sided
Two-sample Kolmogorov-Smirnov test
data: w and z
D = 1, p-value = 1.229e-05
alternative hypothesis: two-sided
My problem is that currently, I can only run the function one row at a time. I'd like to be able to do it for all rows. I was thinking of using apply because I've used it in a very similar context except only for one source of data.
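One direction that might work is to iterate over the shared row index instead of over a single matrix, so each call can pull the same row from all four populations. The following is only a sketch under assumptions: `k`, `n`, `j`, `g` are simulated stand-ins for my four matrices (assumed to have the same number of rows), and `ks_row` is a hypothetical, plot-free stand-in for hist_fx that just returns the three p-values.

```r
# Simulated stand-ins for the four populations: 3 variables x 12 stars each.
set.seed(42)
make_pop <- function(shift) {
  matrix(rnorm(3 * 12, mean = shift, sd = 0.1), nrow = 3)
}
k <- make_pop(0)
n <- make_pop(0.5)
j <- make_pop(0)
g <- make_pop(-0.5)

# Hypothetical stand-in for hist_fx: same four arguments, but it only
# returns the p-values of the three KS tests against population 1.
ks_row <- function(w, x, y, z) {
  c(wx = ks.test(w, x)$p.value,
    wy = ks.test(w, y)$p.value,
    wz = ks.test(w, z)$p.value)
}

# Loop over row numbers; each call receives row i of every matrix.
pvals <- t(sapply(seq_len(nrow(k)), function(i) {
  ks_row(k[i, ], n[i, ], j[i, ], g[i, ])
}))
```

`pvals` ends up with one row per variable and one column per comparison. Swapping `ks_row` for the plotting function and `sapply` for a plain `for (i in seq_len(nrow(k)))` loop should follow the same pattern, since only the row index is being iterated.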
dput(your_data)? I have an idea but don't want to post the answer without testing it. – Raphael Kw
… w, x, y, and z are and it's not clear why you would want to use apply. – IRTFM