0
votes

Summary

Provided a dataframe in which I have several columns that are variables (each of them being numeric but one, which is a factor) and rows are observations,I would like to create a new column with the mean of all numeric columns + another one with a weighted mean of all numeric columns.

I have found quite some ways that apparently solve this problem (using dplyr, lapply, data.table... ) but none of them work with wide dataframes (and I am not sure I can convert it to long format -see below, and please be patient before marking as duplicate, as I haven't found any answer to my problem).

Long version:

I have a dataframe in wide format like the one provided below (the original one has more than 1700 observations of 20 variables grouped into 30 neighbourhoods) that is the result of of calculating the median of values of each variable:

df = data.frame(matrix(rnorm(15), nrow = 3))
df$neighbour = c("neighbour1", "neighbour2", "neighbour3")

df
> df
          X1         X2         X3         X4        X5  neighbour
1  1.0384405  0.6116994 -0.2075835  0.3206011 1.3855455 neighbour1
2 -0.5115649 -0.7722500  0.8374265 -1.3697758 0.1690452 neighbour2
3  1.0145282  0.6809156 -0.2918737  0.2912297 1.0689213 neighbour3

I would like to create

  • 1) a column named mean that is the mean of all numeric values (all columns but neighbour) and
  • 2) a wmean column with is the weighted mean of each column, where the weight is provided by the following vector: weight = c(.25, .05, .3, .3, .3)

My first attempt was using dplyr::mutate to create those columns, but I haven't succeeded, most likely because I'm doing it wrong (So If I haven't succeeded with a regular mean, I have no clue of how to perform a weighted mean):

df = df %>%
  mutate(mean = mean(select(-neighbour)))
Error in mutate_impl(.data, dots) : 
  argumento no válido para un operador unitario
> df = df %>%
+   mutate(mean = mean())
Error in mutate_impl(.data, dots) : 
  el argumento "x" está ausente, sin valor por omisión
> df = df %>%
+   mutate(mean = mean(is.numeric()))
Error in mutate_impl(.data, dots) : 
  0 arguments passed to 'is.numeric' which requires 1
> 

Also tried with mutate_each, but I'm assuming that my problem is that I do not know how to pass the right columns to calculate the mean (not to mention that I have no clue about weighted mean).

From what I have read there are many ways to create the desired columns:

  • This answer by Carlos Cinelli gives examples using sapply + filter, dplyr and tydr, but all these solutions are based on the fact they do not create a new column, with the median of each neighbour's observations but the median of each variables' values.

  • This answer by @Roland suggests to use data.table, but in order to be able to use it, my dataframe should have a column with the weight (whereas I do not have it and I'm afraid I wouldn't know how to create a column like that, provided that I have more than 1700 observations)

  • This answer by @Bob uses apply to create a mean of several columns (that's close to what I'm loooking for!) but still no clue of how to A) exclude the neighbour column, as otherwise it will fail, and B) to calculate the weighted mean.

Can anyone bring me some light with it? I am so ofuscated right now trying to solve this that I can't see the answer.

EDIT: As per @boshek's answer I have tried to convert from wide to long format and then applying summarise_each, but haven't succeeded neither:

df = df %>%
  gather(variable, value, -neighbour) %>%
  group_by(neighbour, variable) %>%
  summarise_each(., funs=mean)
4
Generally best to use set.seed before making a random-data example so it's easier to verify that answers work. For the same reason, I'd use sample(100, n) instead of rnorm(n), just for readability.Frank
Made some changes belowboshek
Fyi, Roland's way doesn't require a weight column. Simply using your weight vector works fine... melt(data.table(df))[, .(mean(value), weighted.mean(value, weight)), by=neighbour, verbose=TRUE] .. then you just need to merge / assign those values back.Frank
Thanks for your suggestion, Frank. I've never used set.seed nor sample() I'll have to study how they work for future questions.ccamara

4 Answers

2
votes
df$mean <- apply(df[1:5], 1, mean)
df$wt.mean <- apply(df[1:5], 1, weighted.mean, weight)
2
votes

Ok - so you want means ACROSS the row?

I'd use gather from dplyr then merge it back with your original data:

df.mean <- df %>%
  gather(variable, value, -neighbour) %>%
  group_by(neighbour) %>%
  summarise(mean_value=mean(value), wmean_value=weighted.mean(value))

df.comb <- df %>%
  full_join(.,df.mean, by=c("neighbour"))

There are a few ways to skin this cat but this is one.

Is this what you wanted?

1
votes

I think the rowMeans() function in base may be your best bet.

df$mean <- rowMeans(dplyr::select(df, starts_with("X")))

The weighted mean may be more difficult. I couldn't find a quick and clean way to do it, but here's an option that works:

# define a function that calculates a weighted mean
wmean <- function(x, weight){
  stopifnot(length(x) == length(weight))
  if(sum(weight) != 1) {
    message("Rescaling weights to sum to 1")
    weight <- weight/sum(weight)
  }
  wx <- sum(x * weight)
  return(wx)
}
# apply that function row by row to the X columns in df
df$wmean <- apply(X=dplyr::select(df, starts_with("X")), MARGIN=1, FUN=wmean, weight = weight)
0
votes

I know I'm a bit late posting this, but I was looking for a solution to a similar problem and found the rowWeightedMeans from the matrixStats library, wich also supports na.rm, you only need to convert to matrix, it works as follows:

library(matrixStats)
df$wmean <- rowWeightedMeans(as.matrix(df[ , c('X1', 'X2', 'X3', 'X4', 'X5')]), w = weight)

This worked perfectly for me and as mentioned, has the extra that supports na.rm = TRUE wich I needed