Summary
Given a data frame in which the columns are variables (all of them numeric except one, which is a factor) and the rows are observations, I would like to create a new column with the mean of all numeric columns, plus another one with a weighted mean of all numeric columns.
I have found quite a few approaches that apparently solve this problem (using dplyr, lapply, data.table...), but none of them work with wide data frames (and I am not sure I can convert mine to long format; see below, and please be patient before marking this as a duplicate, as I haven't found any answer to my problem).
Long version:
I have a data frame in wide format like the one provided below (the original one has more than 1700 observations of 20 variables, grouped into 30 neighbourhoods), which is the result of calculating the median of the values of each variable:
df = data.frame(matrix(rnorm(15), nrow = 3))
df$neighbour = c("neighbour1", "neighbour2", "neighbour3")
df
> df
X1 X2 X3 X4 X5 neighbour
1 1.0384405 0.6116994 -0.2075835 0.3206011 1.3855455 neighbour1
2 -0.5115649 -0.7722500 0.8374265 -1.3697758 0.1690452 neighbour2
3 1.0145282 0.6809156 -0.2918737 0.2912297 1.0689213 neighbour3
I would like to create:
- 1) a column named mean that is the mean of all numeric values (all columns but neighbour), and
- 2) a wmean column which is the weighted mean of all numeric columns, where the weights are provided by the following vector: weight = c(.25, .05, .3, .3, .3)
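For reference, the two target columns can be sketched in base R as follows. This is only a minimal sketch under stated assumptions, not a solution verified against the real data: it assumes the numeric columns appear in the same order as the entries of weight, the seed is arbitrary, and num_cols and m are illustrative names. Note that the weights sum to 1.2 rather than 1, so the weighted sum is divided by sum(weight), which is also what weighted.mean does.

```r
set.seed(42)  # arbitrary seed, only to make the example reproducible
df <- data.frame(matrix(rnorm(15), nrow = 3))
df$neighbour <- c("neighbour1", "neighbour2", "neighbour3")
weight <- c(.25, .05, .3, .3, .3)

# Pick the numeric columns BEFORE adding new ones, so that the new
# columns never leak into the calculation.
num_cols <- sapply(df, is.numeric)
m <- as.matrix(df[, num_cols])

# Row-wise mean of the numeric columns.
df$mean <- rowMeans(m)

# Row-wise weighted mean: sum(x * w) / sum(w) for every row.
df$wmean <- as.vector(m %*% weight) / sum(weight)

df
```

The matrix product avoids an explicit loop over rows, which may matter little for 30 rows but keeps the code short.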
My first attempt was to use dplyr::mutate to create those columns, but I haven't succeeded, most likely because I'm doing it wrong (and if I haven't succeeded with a regular mean, I have no clue how to compute a weighted mean):
df = df %>%
mutate(mean = mean(select(-neighbour)))
Error in mutate_impl(.data, dots) :
invalid argument to unary operator
> df = df %>%
+ mutate(mean = mean())
Error in mutate_impl(.data, dots) :
argument "x" is missing, with no default
> df = df %>%
+ mutate(mean = mean(is.numeric()))
Error in mutate_impl(.data, dots) :
0 arguments passed to 'is.numeric' which requires 1
I also tried mutate_each, but I assume my problem is that I do not know how to pass the right columns to calculate the mean (not to mention that I have no clue about the weighted mean).
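One way the columns might be passed inside mutate is sketched below. It assumes dplyr is installed and that the pipe is the magrittr %>% (so that . stands for the data frame on the left-hand side); the seed and the name result are illustrative only. The key differences from the attempts above are that select needs the data as its first argument, and that rowMeans and apply(..., 1, ...) work row by row, whereas mean() would collapse everything into a single number.

```r
library(dplyr)

set.seed(1)  # arbitrary seed, only to make the example reproducible
df <- data.frame(matrix(rnorm(15), nrow = 3))
df$neighbour <- c("neighbour1", "neighbour2", "neighbour3")
weight <- c(.25, .05, .3, .3, .3)

result <- df %>%
  mutate(
    # '.' is the original df, so the new 'mean' column is not
    # included when 'wmean' is computed.
    mean  = rowMeans(select(., -neighbour)),
    wmean = unname(apply(select(., -neighbour), 1, weighted.mean, w = weight))
  )

result
```

This sketch relies on the order of the numeric columns matching the order of weight; if the columns are reordered, the weights would need to be reordered too.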
From what I have read, there are many ways to create the desired columns:
- This answer by Carlos Cinelli gives examples using sapply + filter, dplyr and tidyr, but none of those solutions creates a new column with the median of each neighbour's observations; they compute the median of each variable's values instead.
- This answer by @Roland suggests using data.table, but in order to use it my data frame would need a column with the weights (I do not have one, and I'm afraid I wouldn't know how to create such a column, given that I have more than 1700 observations).
- This answer by @Bob uses apply to create a mean of several columns (that's close to what I'm looking for!), but I still have no clue how to A) exclude the neighbour column, as otherwise it will fail, and B) calculate the weighted mean.
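The two open points about apply, A) and B), could be approached roughly as follows. This is a hedged sketch rather than the method from any of the linked answers: is_num and num are illustrative names, the seed is arbitrary, and the weights are assumed to be in the same order as the numeric columns.

```r
set.seed(1)  # arbitrary seed, only to make the example reproducible
df <- data.frame(matrix(rnorm(15), nrow = 3))
df$neighbour <- factor(c("neighbour1", "neighbour2", "neighbour3"))
weight <- c(.25, .05, .3, .3, .3)

# A) Exclude the factor column: keep only columns for which
#    is.numeric() is TRUE, and store them before adding new columns.
is_num <- sapply(df, is.numeric)
num <- df[, is_num]

# Plain mean of each row (MARGIN = 1 means "by row").
df$mean <- unname(apply(num, 1, mean))

# B) Weighted mean of each row; extra arguments to apply() are
#    forwarded to the function, here as weighted.mean(x, w = weight).
df$wmean <- unname(apply(num, 1, weighted.mean, w = weight))

df
```

Selecting by type instead of by name means the neighbour column does not have to be mentioned explicitly, which may be convenient when there are 20 variables.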
Can anyone shed some light on this? I am so flustered right now trying to solve this that I can't see the answer.
EDIT: As per @boshek's answer, I have tried to convert from wide to long format and then apply summarise_each, but I haven't succeeded either:
df = df %>%
gather(variable, value, -neighbour) %>%
group_by(neighbour, variable) %>%
summarise_each(., funs=mean)
set.seed before making a random-data example so it's easier to verify that answers work. For the same reason, I'd use sample(100, n) instead of rnorm(n), just for readability. – Frank
melt(data.table(df))[, .(mean(value), weighted.mean(value, weight)), by=neighbour, verbose=TRUE] .. then you just need to merge / assign those values back. – Frank
set.seed nor sample() I'll have to study how they work for future questions. – ccamara