I have a dataset with hundreds of columns and thousands of rows. A sample of this dataset is:
df <- read.table(header=TRUE,
text="
income_2007 occupation_2007 income_2008 occupation_2008 income_2009 occupation_2009 income_2010 occupation_2010 income_2011 occupation_2011 income_2012 occupation_2012 income_2013 occupation_2013 income_2014 occupation_2014
1 6117 XYZ 6339 XYZ 6240 XYZ 6274 XYZ 7834 XYZ 8722 XYZ 7770 XYZ 6849 XYZ
2 6774 XYZ 8168 XYZ 8225 XYZ 7944 XYZ 8442 XYZ 9048 XYZ 9031 XYZ 8630 XYZ
3 NA NA NA NA 1390 XYZ NA NA 2758 XYZ 4054 XYZ 5666 XYZ 5398 XYZ
4 2237 XYZ 3134 XYZ 5118 XYZ 4484 XYZ 4441 XYZ 5125 XYZ 5012 XYZ 4406 XYZ
5 NA NA NA NA 2521 XYZ 2025 XYZ 3218 XYZ 4422 XYZ 5684 XYZ 6448 XYZ
6 9636 XYZ 8738 XYZ 8944 XYZ 10206 XYZ 11427 XYZ 13310 XYZ 12397 XYZ 13758 XYZ
7 5406 XYZ 6761 XYZ 8603 XYZ 8914 XYZ 9706 XYZ 9017 XYZ 9980 XYZ 10926 XYZ
8 1227 XYZ NA NA NA NA 2302 XYZ 4246 XYZ 4798 XYZ 6175 XYZ 5978 XYZ
9 NA NA NA NA NA NA NA NA 1547 XYZ 2631 XYZ 1931 XYZ 2488 XYZ
10 4876 XYZ 4438 XYZ 5844 XYZ 5884 XYZ 7532 XYZ 7739 XYZ 8630 XYZ 9573 XYZ
")
In this example I have columns for the average income and the main occupation in each of 8 years (2007-2014).
I need to create two new columns: the first with the natural logarithm of the median income over the first four years (2007-2010), and the second with the natural logarithm of the median income over the last four years (2011-2014). The expected result is:
df1 <- read.table(header=TRUE, dec=",",  # expected values below use a decimal comma
text="
income_2007 occupation_2007 income_2008 occupation_2008 income_2009 occupation_2009 income_2010 occupation_2010 income_2011 occupation_2011 income_2012 occupation_2012 income_2013 occupation_2013 income_2014 occupation_2014 ln_inc_07_10 ln_inc_11_14
1 6117 XYZ 6339 XYZ 6240 XYZ 6274 XYZ 7834 XYZ 8722 XYZ 7770 XYZ 6849 XYZ 8,741456116 8,96213539
2 6774 XYZ 8168 XYZ 8225 XYZ 7944 XYZ 8442 XYZ 9048 XYZ 9031 XYZ 8630 XYZ 8,994172434 9,085966917
3 NA NA NA NA 1390 XYZ NA NA 2758 XYZ 4054 XYZ 5666 XYZ 5398 XYZ 7,237059026 8,460834458
4 2237 XYZ 3134 XYZ 5118 XYZ 4484 XYZ 4441 XYZ 5125 XYZ 5012 XYZ 4406 XYZ 8,245121966 8,46094025
5 NA NA NA NA 2521 XYZ 2025 XYZ 3218 XYZ 4422 XYZ 5684 XYZ 6448 XYZ 7,728855824 8,527737405
6 9636 XYZ 8738 XYZ 8944 XYZ 10206 XYZ 11427 XYZ 13310 XYZ 12397 XYZ 13758 XYZ 9,136693832 9,461371427
7 5406 XYZ 6761 XYZ 8603 XYZ 8914 XYZ 9706 XYZ 9017 XYZ 9980 XYZ 10926 XYZ 8,946635209 9,194515822
8 1227 XYZ NA NA NA NA 2302 XYZ 4246 XYZ 4798 XYZ 6175 XYZ 5978 XYZ 7,475622643 8,591929538
9 NA NA NA NA NA NA NA NA 1547 XYZ 2631 XYZ 1931 XYZ 2488 XYZ NA 7,700521525
10 4876 XYZ 4438 XYZ 5844 XYZ 5884 XYZ 7532 XYZ 7739 XYZ 8630 XYZ 9573 XYZ 8,586719254 9,009997401
")
I would prefer to compute both new columns with dplyr, since that is the approach I am already using. What I have tried:
library(dplyr)
df1 <- df %>% mutate(ln_inc_07_10 = log(median(income_2007, income_2008, income_2009, income_2010)),
ln_inc_11_14 = log(median(income_2011, income_2012, income_2013, income_2014)))
However, this computes the same value for every row. I need the median to be computed across the income columns, separately for each row.
Please note that there are some missing values, and those should not be counted. When a row has an even number of columns with a valid income value, I would like the median to be the average of the two middle observations.
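To make the requirement concrete, here is a minimal sketch of one possible row-wise computation. It is an illustration, not a confirmed solution. It assumes dplyr 1.0.0 or later (for `rowwise()` and `c_across()`) and uses the tidyselect helper `num_range()` to pick the income columns by year. The names `toy` and `res` are hypothetical, and `toy` holds only the income columns of rows 1, 3 and 9 from the sample.

```r
library(dplyr)

# Hypothetical stand-in for the real data: rows 1, 3 and 9 of the sample,
# income columns only (occupation columns omitted for brevity).
toy <- data.frame(
  income_2007 = c(6117, NA, NA),
  income_2008 = c(6339, NA, NA),
  income_2009 = c(6240, 1390, NA),
  income_2010 = c(6274, NA, NA),
  income_2011 = c(7834, 2758, 1547),
  income_2012 = c(8722, 4054, 2631),
  income_2013 = c(7770, 5666, 1931),
  income_2014 = c(6849, 5398, 2488)
)

res <- toy %>%
  rowwise() %>%  # evaluate the mutate() below one row at a time
  mutate(
    # c_across() collects the selected columns of the current row into a vector;
    # na.rm = TRUE drops missing years before the median is taken.
    ln_inc_07_10 = log(median(c_across(num_range("income_", 2007:2010)), na.rm = TRUE)),
    ln_inc_11_14 = log(median(c_across(num_range("income_", 2011:2014)), na.rm = TRUE))
  ) %>%
  ungroup()      # remove the row-wise grouping afterwards

print(res[, c("ln_inc_07_10", "ln_inc_11_14")])
```

Base R's `median()` already averages the two middle values when the number of non-missing observations is even, and it returns `NA` when every value in the row is missing, which matches the expected `NA` for row 9 in the first period. `rowwise()` can be slow on large data, so this may need a vectorised alternative if performance matters.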