1
votes

I am learning R and am trying to create histograms and boxplots for every column in a data frame that has 80+ columns. The plots will be grouped by the value of a column named "cluster".

Since this task is quite cumbersome, the column names are not user-friendly, and more tasks of this kind will come in the future, I would like to find a way to automate the process.

So I came up with the idea of writing a function that calls ggplot's geom_histogram() and geom_boxplot() and creates two ggplot objects, p1 and p2, which it stores in a list and returns. I would then loop through the columns of the data frame, apply the function to each one, and store the results in a list called plots_all. Finally, I would extract the ggplot objects (histograms and boxplots) one at a time and print them.

However, I have difficulty implementing this idea. Perhaps there are other, more efficient ways to perform the same task. In any case, I would appreciate your help.

More specifically, first, I cannot get the column means to appear by group in the histogram when I use the function, as they do when I write the command myself. Second, I cannot pass the name of the column to the function and use it to label the graph. Third, I have difficulty extracting exactly the plot I want from the list (I get both plots simultaneously). Of course, I could write two functions, each dedicated to a single type of plot, but I am still curious why my method is not working as I expect. So now, let's dive in!

Let me begin by giving some background information:

A glimpse of the data:

head(df_all[, c(1:2, 48)])
  TOTAL_Estimated_Collateral_value_sum TOTAL_CREDIT_BUREAU_RATING_max cluster
1                          -0.17499342                    -0.37721374       1
2                          -0.86443362                    -0.50003823       1
3                           0.22211949                    -0.49997598       2
4                           0.01007717                    -0.07512348       1
5                          -0.77617685                    -0.49997598       2
6                          -1.43518056                    -0.42273492       1
> table(df_all$cluster)

    1     2     3 
24342  8565  1350

The code I am using is the following:

plots <- function(w, n, df_all){
  # This function takes three arguments: w is a column of the dataframe, n is the name of that column and df_all is the dataframe from which the column originates

  mu <- ddply(df_all, "cluster", summarise, grp.mean=mean(w, na.rm = TRUE))  #calculate the mean of the column

  # Creates a histogram using the column w. The name of the column n is used in the labs() to define axes labels and
  # graph title. Stores the histogram object to a variable p1.
  p1 <- ggplot(df_all, aes(x = w, fill= cluster)) +
    geom_histogram(alpha = 0.7, position="dodge")+
    geom_vline(data=mu, aes(xintercept=grp.mean, color= cluster),
               linetype="dashed", size = 2)+
    labs(title = n, x = "Cluster", y = n) 

  # Creates a boxplot using the column w. The name of the column n is used in the labs() to define axes labels and
  # graph title. Stores the boxplot object to a variable p2.
  p2 <- ggplot(df_all, aes(x=cluster, y=w, fill=cluster)) +
    geom_boxplot() + labs(title = n, x = "Cluster", y = n)

  plot <- list() # Initiates an empty list
  plot[[1]] <- p1  #Appends the object p1 to the list
  plot[[2]] <- p2  #Appends the object p2 to the list

  plot # Returns the list
}

plots_all <- list() # Initiates an empty list to hold the results for all columns

for (i in 1 : 38){   # Loops over a selection of the indices of the columns of df_all
  n <- names(df_all[,i]) # Extracts the name of the column at index i and stores it to variable n
  w <- df_all[,i]  # Extracts the df_all column at index i and stores it to a vector w
  plots_all[[i]] <- plots(w, n, df_all) #Call the plots() function with the appropriate arguments and stores the 
                                        # returned list to a list plots_all
}

To get the plots for the first column I write:

plots_all[[1]]

This draws both plots (histogram and boxplot) at once, so I have no opportunity to select which of the two to display.

Moreover, I get a histogram that looks like this:

[Figure: problematic histogram produced by the function, without the group means]

As you can see, this histogram shows only one mean line instead of the means of the three groups as separate vertical lines.

However, using the following code, I can get the three group means to appear:

mu <- ddply(df_all, "cluster", summarise, grp.mean=mean(TOTAL_Estimated_Collateral_value_sum, na.rm = TRUE))
ggplot(df_all, aes(x=TOTAL_Estimated_Collateral_value_sum, fill= cluster)) +
  geom_histogram(alpha = 0.7, position="dodge")+
  geom_vline(data=mu, aes(xintercept=grp.mean, color= cluster),
             linetype="dashed", size = 2)+
  theme(legend.position="top")

You can inspect the output here:

[Figure: correct histogram with group means]

So, 3 questions:

1) Why do the group means not appear as vertical lines, as I would expect, when I am using the function? What should I change?

2) How can I pass to the ggplot labs() function the information I want (a string that is a function of the name of the column I am passing to the function) when I am using the function, to label the axes and title the graph appropriately? Should I use paste() in some way and, if so, how?

3) How can I control which plot is printed (histogram vs. boxplot)?

Your advice will be appreciated.


1 Answer

0
votes

I made the following change to the code of the function:

  plot <- list(list(p1), list(p2), list(mu))

Then I can get the elements separately using the following indexing syntax:

plots_all[[1]] <- plots(w, n, df_all)  # As in the loop from the question, here for the first variable
plots_all[[1]][[2]]  # Returns the boxplot (second plot) of the first variable
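
A side note on why plots_all[[1]] drew both plots in the question: printing a list prints every element in it, so both ggplot objects get drawn. As an alternative to nested positional indexing, a named list may be easier to read. The sketch below uses base R only; the placeholder strings stand in for the ggplot objects p1 and p2, and make_pair() and the column names colA and colB are made up for illustration.

```r
# Hypothetical helper: returns a named list instead of an unnamed one.
# The strings are placeholders for the ggplot objects p1 and p2.
make_pair <- function(col) {
  list(hist = paste("histogram of", col),
       box  = paste("boxplot of", col))
}

cols <- c("colA", "colB")             # made-up column names
plots_all <- lapply(cols, make_pair)  # one named pair per column
names(plots_all) <- cols              # outer list is indexed by column name

plots_all[["colA"]]$box     # only the boxplot entry for colA
plots_all[[2]][["hist"]]    # positional and named indexing can be mixed
```

With real ggplot objects in place of the strings, each of the last two lines would draw exactly one plot.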

I still have not found how to get the means right or how to use the name of the variable passed to the function when defining the plot labels and title.

Regarding the mean problem (mu): mu is returned with the same value for all levels of cluster. That value is essentially zero, which implies that the mean is taken over the entire column, ignoring 'cluster' (the variable is standardized). That is:

  cluster      grp.mean
1       1 -3.542677e-17
2       2 -3.542677e-17
3       3 -3.542677e-17

However, when I run the command outside of the function it returns the right answer:

> mu <- ddply(df_all, "cluster", summarise, grp.mean=mean(TOTAL_Estimated_Collateral_value_sum, na.rm = TRUE))
> mu
  cluster     grp.mean
1       1 -0.042860846
2       2  0.120947850
3       3  0.005481753
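
A likely explanation, offered as a sketch rather than a confirmed diagnosis: inside the function, w is a plain vector and not a column of the per-cluster pieces that ddply() creates, so mean(w) is probably evaluated on the full vector for every group. Similarly, names(df_all[, i]) returns NULL when df_all is a plain data frame, because a single extracted column is an unnamed vector; names(df_all)[i] gives the name. One possible way around both is to pass the column name and look the column up inside the function. The code below assumes ggplot2 3.0.0 or later (for the .data pronoun inside aes()); plots_by_name() and the toy data are made up for illustration, and base R tapply() is swapped in for ddply().

```r
library(ggplot2)

# Hypothetical rewrite of plots(): takes the column NAME, not the vector.
plots_by_name <- function(df, col, group = "cluster") {
  df[[group]] <- factor(df[[group]])  # discrete fill/colour scale

  # Per-group means: the column is looked up by name and split by group
  m  <- tapply(df[[col]], df[[group]], mean, na.rm = TRUE)
  mu <- data.frame(grp      = factor(names(m), levels = levels(df[[group]])),
                   grp.mean = as.numeric(m))
  names(mu)[1] <- group

  # Histogram with one dashed mean line per group; labels built with paste()
  p1 <- ggplot(df, aes(x = .data[[col]], fill = .data[[group]])) +
    geom_histogram(alpha = 0.7, position = "dodge", bins = 30) +
    geom_vline(data = mu,
               aes(xintercept = grp.mean, colour = .data[[group]]),
               linetype = "dashed") +
    labs(title = paste("Histogram of", col), x = col, y = "Count",
         fill = group, colour = group)

  # Boxplot of the same column by group
  p2 <- ggplot(df, aes(x = .data[[group]], y = .data[[col]],
                       fill = .data[[group]])) +
    geom_boxplot() +
    labs(title = paste("Boxplot of", col), x = group, y = col, fill = group)

  list(hist = p1, box = p2, mu = mu)
}

# Toy data standing in for df_all (values are made up)
toy <- data.frame(x = c(1, 2, 3, 10, 20, 30),
                  y = c(6, 5, 4, 3, 2, 1),
                  cluster = c(1, 1, 1, 2, 2, 2))

cols <- setdiff(names(toy), "cluster")   # every column except the grouping one
plots_all <- lapply(cols, function(cl) plots_by_name(toy, cl))
names(plots_all) <- cols

p <- plots_all[["x"]]$hist   # print(p) would draw only the histogram for x
```

For the real data, toy would be replaced by df_all, and cols could be restricted to the columns of interest, for example names(df_all)[1:38].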