How can I make a larger data frame combining two data frames but some rows have missing values (details below)

Question

Here is some hypothetical data. This may seem like a stupid example, but it gets the idea across I think. So I am looking for a correlation between age and height in various locations (think countries or whatever), but I am separating it by gender. First, I am checking to make sure that the ages between genders are comparable by running Wilcox.test. Here is a hypothetical output with an extra column I added, telling me if ages are not statistically different/comparable.

Location    Age comparison p value (wilcox.test)    Age comparable (p =.05)?
A   0.04    no
B   0.07    yes
C   0.09    yes
D   0.1 yes
E   0.02    no
F   0.03    no
H   0.78    yes

Then I run a Spearman cor.test, but the n for males and females whose data was collected at each of these locations is 2 or less, so it cannot be run on some of the locations, giving me a sample output of:

Correlation between age and height in males

Location    Correlation p value (cor.test, pearson) Correlation present?
A   0.04    yes
B   0.17    no
H   0.47    no

Correlation between age and height in females

Location    Correlation p value (cor.test, pearson) Correlation present?
C   0.08    no
D   0.03    yes
E   0.19    no
F   0.17    no
H   0.23    no

Now, I want to combine this data (age comparison and correlation) into one data frame that makes it easy to analyze the data. Here is what I want to output:

    Location    Age comparison p value (wilcox.test)    Age comparable (p =.05)?    cor.test p value (male)     cor.test p value (female)   
    A   0.04    no  0.04    yes na  na
    B   0.07    yes 0.17    no  na  na
    C   0.09    yes na  na  0.08    no
    D   0.1 yes na  na  0.03    yes
    E   0.02    no  na  na  0.19    no
    F   0.03    no  na  na  0.17    no
    H   0.78    yes 0.47    no  0.23    no

However, I am not sure how to line up the outputs, entering in the necessary na due to n being too small.

Could you guide me through this process please? This is my first question, so sorry if it isn't formatted well/is unclear. If you have any questions please don't hesitate to ask :)

Thanks!

Re formatting, it's helpful if your question is easily reproducible (so it can be copy-pasted into a new R session and run). See stackoverflow.com/questions/5963269/… for guidance. — Frank
Use merge() to perform joins. Use all = T if you want to include everything in both tables (union), all.x for a left join, and all.y for a right join. It will automatically match by the common column name if it exists, in this case Location. Otherwise you can specify by.x and by.y — Mako212
@Mako212 So if I want the outputs of male and female in different columns (and the correlation output variable is named "Location A Male" for example), can I get the p.value matching to the location (each location being a row) in the right column (male/female)? How do I specify to look in the variable name whether it is male or female correlation, and output to the correct column based on that? And if row names do not match (age comparison named Location A, while correlation is Location A Male), how do I tell it to match them because "Location A" is the same, so I want it in the same row? — Keshav M

AkselA AkselA · Accepted Answer · 2017-07-31T15:45:42

As @Mako mentioned, what you are looking for is a "join", or merge() in R parlance.

I took the liberty to shorten the column names somewhat.

male <- read.table(text="
  Location    p corr
  A   0.04    yes
  B   0.17    no
  H   0.47    no", header=TRUE)

female <- read.table(text="
  Location    p corr
  C   0.08    no
  D   0.03    yes
  E   0.19    no
  F   0.17    no
  H   0.23    no", header=TRUE)

merge(x=male, y=female, by="Location", all=TRUE, suffixes=c("_male", "_female"))

#   Location p_male corr_male p_female corr_female
# 1        A   0.04       yes       NA        <NA>
# 2        B   0.17        no       NA        <NA>
# 3        H   0.47        no     0.23          no
# 4        C     NA      <NA>     0.08          no
# 5        D     NA      <NA>     0.03         yes
# 6        E     NA      <NA>     0.19          no
# 7        F     NA      <NA>     0.17          no

For more than two data frames you can use Reduce() in conjunction with merge().There's no neat way of applying suffixes using this method, but by default the sequence of columns is the same as the sequence you supplied the data frames in.

na <- read.table(text="
  Location    p corr
  A   0.07    no
  D   0.02    yes
  E   0.18    no
  F   0.16    no", header=TRUE)

Reduce(function(...) merge(..., by="Location", all=TRUE), list(female, male, na))
#   Location  p.x corr.x  p.y corr.y    p corr
# 1        C 0.08     no   NA   <NA>   NA <NA>
# 2        D 0.03    yes   NA   <NA> 0.02  yes
# 3        E 0.19     no   NA   <NA> 0.18   no
# 4        F 0.17     no   NA   <NA> 0.16   no
# 5        H 0.23     no 0.47     no   NA <NA>
# 6        A   NA   <NA> 0.04    yes 0.07   no
# 7        B   NA   <NA> 0.17     no   NA <NA>

How can I make a larger data frame combining two data frames but some rows have missing values (details below)

1 Answers