Function to compute summary statistics & apply to columns in R

Question

I need to write a function that takes a numeric vector and computes four summary statistics: the minimum, mean, median, and maximum. The result should be a vector of length four. Then I need to apply it to every column of my data frame and produce a new data frame with 5 columns that holds these results.

The original data frame looks like this (partial):

> dput(head(commodities,4))
structure(c(2054.86, 2131.01, 1978.38, 1932.46, 401.96, 372.19, 
422.91, 395.9, 66.58, 66.58, 69.9, 69.9, 136.36, 134.55, 118, 
114.51, 39.7, 40.26, 40.83, 41.41, 3167.16, 3236.82, 3091.1, 
2910.1, 168.67, 164.83, 184.38, 180.81, 162.56, 162, 169.89, 
162.9, 591.59, 596.6, 561.51, 541.45, 2592.63, 2916.71, 2303.83, 
2074.55, 88.72, 97.21, 93.53, 90.56, 986.56, 1040.81, 960.43, 
944.36, 980.08, 1000.38, 1009.4, 1015.04, 59.1, 48.7, 39.4, 38.1, 
12.15, 12.15, 12.15, 12.15, 117.23, 122.01, 119.32, 132.72, 1111.13, 
1166.24, 1117.74, 970.03, 84.39, 80.25, 80.25, 87.5, 146.08, 
159.57, 155.28, 152.79, 105.51, 114.17, 109.84, 108.26, 6584.8, 
6978.93, 6733.79, 6233.37, 35.64, 35.09, 36.01, 35.09, 40, 38.5, 
38.25, 38.15, 38, 36, 35.75, 35, 37, 37.04, 39.52, 39.5, 2271.72, 
2256.48, 2188.11, 2081.17, 347, 350, 338, 377, 547.05, 555.32, 
518.13, 504.91, 72.42, 65.53, 65, 51.86, 33.9, 32.55, 31.83, 
30.99, 395, 399, 415, 419, 68.82, 75.31, 66.35, 60.55, 7.45, 
7.6, 7.4, 7.43, 297.61, 308.29, 304.92, 302.95, 138, 131.23, 
131.23, 143.08, 13.34, 12.68, 12.79, 12.35, 201.76, 198.26, 186.1, 
181.2, 525.58, 518.75, 486.78, 451.07, 238.77, 241.36, 227.08, 
218.21, 17.3, 22.75, 19.63, 21.25, 566.93, 573.96, 535.28, 486.06, 
225.18, 233.09, 226.83, 221.81, 16973.59, 17090.21, 17460.59, 
17041.71, 40, 38, 35, 32, 175.63, 172.7, 163.51, 156.53, 553.12, 
568.15, 552.75, 510.65, 684.28, 722.57, 695.96, 688.13, 773.82, 
868.62, 740.75, 707.68), .Dim = c(4L, 48L), .Dimnames = list(
    c("1980M1", "1980M2", "1980M3", "1980M4"), c("aluminum", 
    "bananas", "barley", "beef", "coal", "cocoa", "coffee_arabica", 
    "coffee_robusta", "rapeseed_oil", "copper", "cotton", "fishmeal", 
    "groundnuts", "hides", "iron_ore", "lamb", "lead", "logs_soft", 
    "logs_hard", "maize", "nickel", "crude_westtx", "crude_brent", 
    "oil_dubai", "oil_westtx", "olive_oil", "oranges", "palm_oil", 
    "pork", "poultry", "rice", "rubber", "fish", "sawn_wood_hard", 
    "sawn_wood_soft", "shrimp", "soybean_meal", "soybean_oil", 
    "soybeans", "sugar", "sunflower_oil", "tea", "tin", "uranium", 
    "wheat", "wool_coarse", "wool_fine", "zinc")))

I tried writing a function like this:

commodities_summary <- function(x) {
  com_min <- min(x)
  com_mean <- mean(x)
  com_med <- median(x)
  com_max <- max(x)
  c(Min=com_min, Mean=com_mean, Median=com_med, Max=com_max)
}

Then I tried to apply it to all the columns in the data frame like this:

commodities2 <- ddply(.data=commodities, 
                      .variables=c(1:48), 
                      .fun=commodities_summary)

But instead of the expected result I keep getting a new data frame that looks like this:

> dput(head(commodities2,4))
structure(list(X1 = 2054.86, X2 = 2131.01, X3 = 1978.38, X4 = 1932.46, 
    X5 = 1775.8, X6 = 1668.96, X7 = 1758.07, X8 = 1783.63, X9 = 1655.07, 
    X10 = 1626.15, X11 = 1503.91, X12 = 1430.66, X13 = 1430.83, 
    X14 = 1452.38, X15 = 1442.95, X16 = 1369.23, X17 = 1297, 
    X18 = 1232.41, X19 = 1175.31, X20 = 1230.83, X21 = 1168.91, 
    X22 = 1140.18, X23 = 1081.85, X24 = 1130.88, X25 = 1113.26, 
    X26 = 1087.71, X27 = 1029.24, X28 = 996.21, X29 = 973.29, 
    X30 = 918.85, X31 = 957.85, X32 = 958.86, X33 = 959.53, X34 = 951.48, 
    X35 = 965.15, X36 = 987.39, X37 = 1076.53, X38 = 1230.46, 
    X39 = 1301.73, X40 = 1363.57, X41 = 1453.43, X42 = 1465.58, 
    X43 = 1520.21, X44 = 1603.67, X45 = 1613.73, X46 = 1566.26, 
    X47 = 1516.73, X48 = 1549.34, Min = 7.45, Mean = 937.015208333333, 
    Median = 172.15, Max = 16973.59), .Names = c("X1", "X2", 
"X3", "X4", "X5", "X6", "X7", "X8", "X9", "X10", "X11", "X12", 
"X13", "X14", "X15", "X16", "X17", "X18", "X19", "X20", "X21", 
"X22", "X23", "X24", "X25", "X26", "X27", "X28", "X29", "X30", 
"X31", "X32", "X33", "X34", "X35", "X36", "X37", "X38", "X39", 
"X40", "X41", "X42", "X43", "X44", "X45", "X46", "X47", "X48", 
"Min", "Mean", "Median", "Max"), row.names = 1L, class = "data.frame")

I need a data frame that has the column names (aluminum, bananas, barley, etc.) in the first column and the four summary statistics in the following 4 columns.

Do not post your data as an image, please learn how to give a reproducible example — Jaap
Can you try with dplyr i.e commodities %>% group_by_(names(commodities)[1:48]) %>% summarise_each(fun(commodities_summary(.))) — akrun
The image you posted is not easy for others to test. As @Jaap mentioned, please do post a dput output i.e. dput(droplevels(head(yourdata,10))) — akrun
I just updated using dput, hopefully this will be better. Sorry I'm all new to R. — codingishard

jogo · Accepted Answer · 2015-11-30T07:55:10

IMHO this will do what you want:

sapply(commodities, commodities_summary)

If needed, transpose the result afterwards. Example with the built-in cars data set:

> sapply(cars, commodities_summary)
       speed   dist
Min      4.0   2.00
Mean    15.4  42.98
Median  15.0  36.00
Max     25.0 120.00

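Because your data is a matrix, not a data frame (the dput output has .Dim and .Dimnames attributes), sapply() would run the function on every single element instead of every column. Convert it first: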

sapply(as.data.frame(commodities), commodities_summary)
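
To see why the conversion matters, here is a minimal sketch on a small made-up matrix (the names m and f are illustrative and not from the question). It compares sapply() on the raw matrix with sapply() after as.data.frame(). As an alternative that is not part of the answer above, apply() with MARGIN = 2 should also work on the matrix directly.

```r
# Toy 3 x 2 matrix; `m` and `f` are hypothetical names for illustration
m <- matrix(c(1, 2, 3, 10, 20, 30), nrow = 3,
            dimnames = list(NULL, c("a", "b")))
f <- function(x) c(Min = min(x), Max = max(x))

# On a matrix, sapply() treats each of the 6 elements as its own input
per_element <- sapply(m, f)

# After conversion, sapply() iterates over the 2 columns
per_column <- sapply(as.data.frame(m), f)

# Alternative: apply() over columns (MARGIN = 2) works on the matrix directly
per_column_apply <- apply(m, 2, f)

dim(per_element)       # 2 rows, 6 columns
dim(per_column)        # 2 rows, 2 columns
dim(per_column_apply)  # 2 rows, 2 columns
```
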

Or, if you want to transpose your results (one row per commodity):

t(sapply(as.data.frame(commodities), commodities_summary))
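
The question asks for a data frame with the commodity names in the first column followed by the four statistics. A minimal sketch of that last step, assuming a three-column stand-in for the full matrix and a first column called commodity (that column name is my assumption, it is not given in the question):

```r
# Summary function from the question, condensed
commodities_summary <- function(x) {
  c(Min = min(x), Mean = mean(x), Median = median(x), Max = max(x))
}

# Stand-in matrix: the first three columns of the question's dput output
commodities <- matrix(
  c(2054.86, 2131.01, 1978.38, 1932.46,
    401.96, 372.19, 422.91, 395.9,
    66.58, 66.58, 69.9, 69.9),
  nrow = 4,
  dimnames = list(c("1980M1", "1980M2", "1980M3", "1980M4"),
                  c("aluminum", "bananas", "barley"))
)

# One row per commodity, one column per statistic
stats <- t(sapply(as.data.frame(commodities), commodities_summary))

# Move the row names into a first column; "commodity" is an assumed name
commodities2 <- data.frame(commodity = rownames(stats), stats,
                           row.names = NULL, stringsAsFactors = FALSE)
commodities2
```

With the full matrix in place of the stand-in, this should give 48 rows and 5 columns.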
