I am learning R and I am trying to create histograms and boxplots for every column in a dataframe which has 80+ columns. The plots will be grouped based on the value of a column named "cluster".
Since this task is quite cumbersome, the column names are not user-friendly, and more tasks of this kind will come in the future, I would like to automate the process.
So I came up with the idea of writing a function that calls ggplot2's geom_histogram() and geom_boxplot(), creates two ggplot objects p1 and p2, stores them in a list, and returns that list. I would then loop through the columns of the dataframe, apply the function to each one, and store the results in a list called plots_all. Finally, I would extract the ggplot objects (histograms and boxplots) one at a time and print them.
However, I am having difficulty implementing this idea. Perhaps there are more efficient ways to perform the same task. In any case, I would appreciate your help.
More specifically: first, I cannot get the group means of a column to appear in the histogram when I use the function, although they do appear when I write the command by hand. Second, I cannot pass the name of the column to the function and use it to label the graph. Third, I have difficulty extracting exactly the plot I want from the list (I get both plots at once). Of course, I could write two functions, each dedicated to a single type of plot, but I am still curious why my method does not work as I expect.
Let me begin by giving some background information:
A glimpse of the data:
> head(df_all[, c(1:2, 48)])
  TOTAL_Estimated_Collateral_value_sum TOTAL_CREDIT_BUREAU_RATING_max cluster
1                          -0.17499342                    -0.37721374       1
2                          -0.86443362                    -0.50003823       1
3                           0.22211949                    -0.49997598       2
4                           0.01007717                    -0.07512348       1
5                          -0.77617685                    -0.49997598       2
6                          -1.43518056                    -0.42273492       1
> table(df_all$cluster)
    1     2     3
24342  8565  1350
The code I am using is the following:
library(ggplot2)  # ggplot(), geom_histogram(), geom_boxplot(), geom_vline(), labs()
library(plyr)     # ddply(), summarise()

plots <- function(w, n, df_all){
  # Takes three arguments: w is a column of the dataframe, n is the name of that
  # column and df_all is the dataframe from which the column originates.
  mu <- ddply(df_all, "cluster", summarise, grp.mean = mean(w, na.rm = TRUE)) # intended: mean of the column per cluster
  # Creates a histogram of the column w. The column name n is used in labs() for
  # the axis label and the graph title. Stores the histogram object in p1.
  p1 <- ggplot(df_all, aes(x = w, fill = cluster)) +
    geom_histogram(alpha = 0.7, position = "dodge") +
    geom_vline(data = mu, aes(xintercept = grp.mean, color = cluster),
               linetype = "dashed", size = 2) +
    labs(title = n, x = "Cluster", y = n)
  # Creates a boxplot of the column w by cluster. The column name n is used in
  # labs() for the axis label and the graph title. Stores the boxplot object in p2.
  p2 <- ggplot(df_all, aes(x = cluster, y = w, fill = cluster)) +
    geom_boxplot() + labs(title = n, x = "Cluster", y = n)
  plot <- list()   # Initialises an empty list
  plot[[1]] <- p1  # Stores the histogram as the first element
  plot[[2]] <- p2  # Stores the boxplot as the second element
  plot             # Returns the list
}
plots_all <- list() # Initialises an empty list to hold one list of plots per column
for (i in 1:38){ # Loops over a selection of the column indices of df_all
  n <- names(df_all[,i]) # Intended: the name of the column at index i, stored in n
  w <- df_all[,i]        # Extracts the df_all column at index i as a vector w
  # Calls plots() with these arguments and stores the returned list in plots_all
  plots_all[[i]] <- plots(w, n, df_all)
}
To get the plots for the first column I write:
plots_all[[1]]
This draws both plots (histogram and boxplot) in one go, so I do not get the chance to select which of the two to display.
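A minimal sketch of how the nested list could be indexed, using plain strings as hypothetical stand-ins for the ggplot objects so that it runs without any packages. Since plots() returns a list of two elements, plots_all[[1]] is that whole list, and printing a list prints every element in turn; a second [[ ]] picks out a single element:

```r
# Hypothetical stand-ins for ggplot objects: plain strings keep the sketch
# free of package dependencies. The list mirrors the shape of plots_all.
plots_all <- list(
  list("histogram of column 1", "boxplot of column 1"),
  list("histogram of column 2", "boxplot of column 2")
)

both      <- plots_all[[1]]       # the whole inner list: printing it shows both elements
hist_only <- plots_all[[1]][[1]]  # first element only (the histogram slot)
box_only  <- plots_all[[1]][[2]]  # second element only (the boxplot slot)

print(hist_only)
```

With real ggplot objects in the list, print(plots_all[[1]][[1]]) should then draw only the histogram and print(plots_all[[1]][[2]]) only the boxplot.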
Moreover, I get a histogram that looks like this:
As you can see, this histogram shows only one vertical mean line instead of one line for each of the three groups.
However, with the following code, written out by hand for a single column, the three group means do appear:
mu <- ddply(df_all, "cluster", summarise, grp.mean = mean(TOTAL_Estimated_Collateral_value_sum, na.rm = TRUE))
ggplot(df_all, aes(x = TOTAL_Estimated_Collateral_value_sum, fill = cluster)) +
  geom_histogram(alpha = 0.7, position = "dodge") +
  geom_vline(data = mu, aes(xintercept = grp.mean, color = cluster),
             linetype = "dashed", size = 2) +
  theme(legend.position = "top")
You can inspect the output here:
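A possible explanation for the difference, sketched in base R on hypothetical toy data. The assumption here is that ddply() behaves like split-then-summarise: a name such as TOTAL_Estimated_Collateral_value_sum is looked up inside each per-cluster piece, whereas w is a free-standing vector that is not a column of any piece, so every group would see the full vector and get the same mean. The names toy, pieces, same_mean and group_mean are made up for this illustration, and split() plus sapply() is only an analogy for what ddply() does:

```r
# Hypothetical toy data standing in for df_all
toy <- data.frame(
  value   = c(1, 2, 3, 10, 20, 30),
  cluster = factor(c(1, 1, 1, 2, 2, 2))
)
w <- toy$value  # free-standing copy of the column, as created in the loop

# One data frame per cluster (an analogy for how ddply splits its input)
pieces <- split(toy, toy$cluster)

# w is not a column of the piece, so each group averages the FULL vector
same_mean <- sapply(pieces, function(piece) mean(w, na.rm = TRUE))

# Referring to the column inside the piece yields one mean per group
group_mean <- sapply(pieces, function(piece) mean(piece$value, na.rm = TRUE))

print(same_mean)
print(group_mean)
```

If this is indeed what happens inside plots(), the three dashed lines would all be drawn at the same x position, which would look like a single line.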
So, 3 questions:
1) Why do the group means not appear as separate vertical lines when I use the function? What should I change?
2) How can I pass the information I want (a string derived from the name of the column I pass in) to the ggplot labs() function when I use the function, so that the axes and title are labelled appropriately? Should I use paste() in some way, and if so, how?
3) How can I control which plot is printed (histogram vs. boxplot)?
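To make the three questions concrete, here is a hedged sketch of one direction that might work: pass the column name as a string instead of the column itself, look the column up with the .data pronoun inside aes(), build labels with paste(), and return a named list. Everything in it is an assumption rather than a confirmed fix: make_plots, toy, var_a and var_b are hypothetical names, the .data pronoun needs a reasonably recent ggplot2 (3.0.0 or later, as far as I know), and linewidth replaces size for lines only from ggplot2 3.4.0. It also takes names from names(df), because names(df_all[, i]) is, I believe, NULL for a plain data.frame, where df_all[, i] is a bare vector.

```r
library(ggplot2)

# Hypothetical toy data standing in for df_all
set.seed(1)
toy <- data.frame(
  var_a   = rnorm(300),
  var_b   = runif(300),
  cluster = factor(sample(1:3, 300, replace = TRUE))
)

# Hypothetical helper: col and group are column NAMES (strings)
make_plots <- function(df, col, group = "cluster") {
  # Per-group means, computed from the columns of df itself
  mu <- aggregate(df[[col]], by = list(df[[group]]), FUN = mean, na.rm = TRUE)
  names(mu) <- c(group, "grp_mean")

  p_hist <- ggplot(df, aes(x = .data[[col]], fill = .data[[group]])) +
    geom_histogram(alpha = 0.7, position = "dodge", bins = 30) +
    geom_vline(data = mu,
               aes(xintercept = grp_mean, colour = .data[[group]]),
               linetype = "dashed", linewidth = 1) +  # linewidth: ggplot2 >= 3.4.0
    labs(title = paste("Histogram of", col), x = col, y = "Count",
         fill = group, colour = group)

  p_box <- ggplot(df, aes(x = .data[[group]], y = .data[[col]],
                          fill = .data[[group]])) +
    geom_boxplot() +
    labs(title = paste("Boxplot of", col, "by", group), x = group, y = col,
         fill = group)

  # Named elements make it easy to pick one plot later
  list(hist = p_hist, box = p_box)
}

# One list of plots per column, named after the column
cols <- setdiff(names(toy), "cluster")
plots_all <- lapply(cols, function(col) make_plots(toy, col))
names(plots_all) <- cols

print(plots_all[["var_a"]]$hist)  # draws only the histogram for var_a
```

Naming both levels of the list means a single plot can be requested by column name and plot type, for example plots_all[["var_b"]]$box, rather than by numeric position.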
Your advice will be appreciated.