Using apply using multiple sources of data?

Question

I'm still in the beginning stages of R but I've gotten a few functions down and now I'm looking for my final "project."

I've created a function that takes each of my four sources of data (different populations) and creates histograms, performs kolmogorov-smirnov tests, and then graphs any significant results for a given row. What I want to do is turn it into an apply function. However, the issue is that my function takes four variables, and I don't know a way to make apply take four sources of data.

hist_fx <- function(w,x,y,z) {
  hist(w,prob=TRUE,col="green",xlim=c(-1,1),ylim=c(0,3))
    lines(density(w),col="red")
    abline(v=c(mean(w)),col="red")

  hist(x,prob=TRUE,col="blue",xlim=c(-1,1),ylim=c(0,3))
    lines(density(x),col="red")
    abline(v=c(mean(x)),col="red")


  hist(y,prob=TRUE,col="yellow",xlim=c(-1,1),ylim=c(0,3))
    lines(density(y),col="red")
    abline(v=c(mean(y)),col="red")


  hist(z,prob=TRUE,col="purple",xlim=c(-1,1),ylim=c(0,3))
    lines(density(z),col="red")
    abline(v=c(mean(z)),col="red")

  all <- c(w,x,y,z)
    hist(all,prob=TRUE,xlim=c(-1,0.5),ylim=c(0,3))
    lines(density(w),col="purple")
    lines(density(x),col="red")
    lines(density(y),col="blue")
    lines(density(z),col="green")

  plot(ecdf(w),col="green")
  plot(ecdf(x),col="blue",add=TRUE)
  plot(ecdf(y),col="red",add=TRUE)
  plot(ecdf(z),col="purple",add=TRUE)

  t1 <- ks.test(w,x)
    print(t1)
  t2 <- ks.test(w,y)
    print(t2)
  t3 <- ks.test(w,z)
    print(t3)

  if(t1$p.value < 0.05) {
        plot(ecdf(w),col="green")
        plot(ecdf(x),col="blue",add=TRUE)
    }
  if(t2p.value < 0.05) {
        plot(ecdf(w),col="green")
        plot(ecdf(y),col="red",add=TRUE)
    }
  if(t3$p.value < 0.05) {
        plot(ecdf(w),col="green")
        plot(ecdf(z),col="purple",add=TRUE)
    }
}

I'm able to use this function with apply for one population at a time (i.e. turn hist_fx into a function of one variable). However, I can't find a way to make this work for all four populations at the same time. I've messed around with some for loops, though they haven't been successful as of yet.

One last thing that might be of use: my data is arranged such that independent variables are the rows and the dependent variables are columns. Consequently, I need to run these per row (hence my idea of a for loop).

EDIT:

Here's the dput for one of the populations:

dput(k2) structure(c(-0.15, 0.13, 0.23, -0.23, 0.06, -0.11, 0.107, 0.06, -0.17, 0.12, 0.06, -0.25, -0.32, 0.13, 0.06, -0.2, -0.08, 0.06, 0.12, 0.02, 0.11, -0.11, -0.15, 0.097, 0.347, -0.307, 0.097, -0.047, 0.09, 0.01, -0.217, 0.117, 0.03, -0.3, -0.33, 0.13, 0.19, -0.24, -0.08, -0.01, 0.15, 0.61, 0.18, -0.15, -0.103, 0.135, 0.31, -0.25, 0.157, -0.105, -0.08, 0.01, -0.165, 0.17, 0.1, -0.23, -0.28, 0.15, 0.13, -0.14, -0.06, 0.01, 0.07, -0.02, 0.11, -0.06, -0.123, 0.13, 0.35, -0.27, 0.165, -0.065, 0.135, 0.13, -0.17, 0.135, 0.08, -0.21, -0.25, 0.2, 0.16, -0.18, NA, -0.04, 0.05, -0.02, 0.13, -0.14, -0.13, 0.098, 0.27, -0.193, 0.062, -0.08, 0.057, 0.028, -0.199, 0.1, 0.04, -0.24, -0.32, 0.13, 0.13, -0.15, -0.05, 0.01, 0.08, -0.04, 0.1, -0.1, -0.14, 0.154, 0.261, -0.194, 0.1, -0.129, 0.063, 0.142, -0.136, 0.136, 0.08, -0.23, -0.24, 0.12, 0.1, -0.16, -0.06, 0.04, 0.09, -0.01, 0.04, -0.08, -0.127, 0.133, 0.337, -0.06, 0.11, -0.107, 0.16, 0.167, -0.183, 0.103, 0.05, -0.2, -0.3, 0.22, -0.01, -0.17, -0.14, 0.02, 0.07, 0.01, 0.11, -0.11, -0.155, 0.221, 0.22, -0.172, 0.09, -0.15, 0.12, 0.03, -0.153, 0.146, 0.11, -0.2, -0.24, 0.16, 0.07, -0.19, -0.1, 0.03, 0.17, 0.02, 0.09, -0.16, -0.062, 0.19, 0.269, -0.265, 0.118, -0.11, 0.126, 0.094, -0.186, 0.151, 0.08, -0.26, -0.31, 0.13, 0.09, -0.23, -0.12, 0.05, 0.13, 0.01, 0.11, -0.14, -0.095, 0.14, 0.24, -0.46, 0.09, -0.17, 0.08, 0.01, -0.24, 0.16, 0.04, -0.38, -0.39, 0.11, 0.06, -0.31, -0.25, 0.03, 0.21, -0.14, 0, -0.22, -0.07, 0.148, 0.311, -0.27, 0.11, -0.055, 0.16, 0.04, -0.197, 0.064, 0.09, -0.24, -0.34, 0.17, 0.07, -0.15, -0.18, 0.03, 0.13, 0.07, 0.13, -0.08, -0.136, 0.142, 0.27, -0.257, 0.1, -0.13, 0.103, 0.064, -0.197, 0.118, 0.06, -0.29, -0.35, 0.13, 0.1, -0.19, -0.13, 0.01, 0.1, -0.01, 0.13, -0.15), .Dim = c(22L, 12L))

To further clarify, here's the format of the actual data frame:

c1 c2 c3 c4

r2 x x x

r3 x x x

r4 x x x

Each column represents a star's values for the variable on the row. As such, I want to create a histogram for each row, for each dataset.

For the values of the function, I just used those variables for simplicity's sake. w = population 1, x = population 2, y = population 3, z = population 4.

As for an example:

 > hist_fx(k2[1,],n2[1,],j2[1,],g2[1,])

    Two-sample Kolmogorov-Smirnov test

data:  w and x
D = 1, p-value = 1.229e-05
alternative hypothesis: two-sided


    Two-sample Kolmogorov-Smirnov test

data:  w and y
D = 1, p-value = 1.229e-05
alternative hypothesis: two-sided


    Two-sample Kolmogorov-Smirnov test

data:  w and z
D = 1, p-value = 1.229e-05
alternative hypothesis: two-sided

My problem is that currently, I can only run the function one row at a time. I'd like to be able to do it for all rows. I was thinking of using apply because I've used it in a very similar context except only for one source of data.

Can you post a sample of your data using dput(your_data)? I have an idea but don't want to post the answer without testing it. — Raphael K
Post an example. It's not clear what w, x, y, and z are and it's not clear why you would want to use apply. — IRTFM
I posted an example and a sample of my data along with further clarifying my question. — learner_of_stats

Parfait Parfait · Accepted Answer · 2016-05-14T04:22:16

Not quite sure of your needs but consider transposing, t() to run plots column-wise for row data. And consider using mapply(), the multivariate type of the apply family which runs an operation element-wise at the same time for equal-length objects. Even break apart the operations as running them together may only print/plot the last iteration to screen.

Transpose (data used were slight variations of posted dput matrix)

pop1 <- data.frame(t(data))
pop2 <- data.frame(t(data))
pop3 <- data.frame(t(data))
pop4 <- data.frame(t(data))

Histograms

hist_fx <- function(w,x,y,z) {

  whist <- hist(w,prob=TRUE,col="green",xlim=c(-1,1),ylim=c(0,3))
  lines(density(w),col="red")
  abline(v=c(mean(w)),col="red")

  xhist <- hist(x,prob=TRUE,col="blue",xlim=c(-1,1),ylim=c(0,3))
  lines(density(x),col="red")
  abline(v=c(mean(x)),col="red")      

  yhist <- hist(y,prob=TRUE,col="yellow",xlim=c(-1,1),ylim=c(0,3))
  lines(density(y),col="red")
  abline(v=c(mean(y)),col="red")      

  zhist <- hist(z,prob=TRUE,col="purple",xlim=c(-1,1),ylim=c(0,3))
  lines(density(z),col="red")
  abline(v=c(mean(z)),col="red")

}

# HISTOGRAM PLOTS FOR EACH DF COLUMN 
output <- mapply(hist_fx, w=pop1, x=pop2, y=pop3, z=pop4)

Kolmogorov-Smirnov tests (using slight variations of dput data)

hist_fx <- function(w,x,y,z) {
  t1 <- ks.test(w,x)      
  t2 <- ks.test(w,y)      
  t3 <- ks.test(w,z)   

  if(t1$p.value < 0.05) {
     plot(ecdf(w),col="green")
     plot(ecdf(x),col="blue",add=TRUE)
  }
  if(t2$p.value < 0.05) {
     plot(ecdf(w),col="green")
     plot(ecdf(y),col="red",add=TRUE)
  }
  if(t3$p.value < 0.05) {
     plot(ecdf(w),col="green")
     plot(ecdf(z),col="purple",add=TRUE)
  }

  return(c(t1, t2, t3))
}

output <- mapply(hist_fx, w=pop1, x=pop2, y=pop3, z=pop4)

output    
#             X1                                  
# statistic   0.1666667                           
# p.value     0.9962552                           
# alternative "two-sided"                         
# method      "Two-sample Kolmogorov-Smirnov test"
# data.name   "w and x"                           
# statistic   0.25                                
# p.value     0.8474885                           
# alternative "two-sided"                         
# method      "Two-sample Kolmogorov-Smirnov test"
# data.name   "w and y"                           
# statistic   0.08333333                          
# p.value     1                                   
# alternative "two-sided"                         
# method      "Two-sample Kolmogorov-Smirnov test"
# data.name   "w and z"                           
#             X2                                  
# statistic   0.25                                
# p.value     0.8474885                           
# alternative "two-sided"                         
# method      "Two-sample Kolmogorov-Smirnov test"
# data.name   "w and x"                           
# statistic   0.08333333                          
# p.value     1                                   
# alternative "two-sided"                         
# method      "Two-sample Kolmogorov-Smirnov test"
# data.name   "w and y"                           
# statistic   0.1666667                           
# p.value     0.9962552                           
# alternative "two-sided"                         
# method      "Two-sample Kolmogorov-Smirnov test"
# data.name   "w and z"           
# ...

Using apply using multiple sources of data?

1 Answers