4 votes

I am trying to plot the results of a Quadratic Discriminant Analysis (QDA) of the iris dataset using the MASS and ggplot2 packages. The script below shows, in its first part, the Linear Discriminant Analysis (LDA), but I do not know how to continue for the QDA. Objects of class "qda" are a bit different from "lda" objects: for example, I cannot find the proportion of trace (the % of between-group variance explained by each discriminant component), so I cannot add it to the graph axes. Any help or ideas on how to code this graph using ggplot2?
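For example, comparing the components of the two fitted objects shows where the difference lies (a quick check, separate from the plotting code below):

require(MASS)
irislda <- lda(Species ~ ., iris)
irisqda <- qda(Species ~ ., iris)
names(irislda)  # includes "svd", from which the proportion of trace is computed
names(irisqda)  # no "svd" component; "scaling" is a per-class array, not a single projection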

Code:

require(MASS)
require(ggplot2)
require(scales)
 

irislda <- lda(Species ~ ., iris)
prop.lda = irislda$svd^2/sum(irislda$svd^2)
plda <- predict(irislda,   iris)

datasetLDA = data.frame(species = iris[,"Species"], irislda = plda$x)
ggplot(datasetLDA) + geom_point(aes(irislda.LD1, irislda.LD2, colour = species, shape = species), size = 2.5) + 
    labs(x = paste("LD1 (", percent(prop.lda[1]), ")", sep=""),
       y = paste("LD2 (", percent(prop.lda[2]), ")", sep=""))

 
irisqda <- qda(Species ~ ., iris)
pqda <- predict(irisqda,   iris)
datasetQDA = data.frame(species = iris[,"Species"], irisqda = pqda$posterior) 
ggplot(datasetQDA) + geom_point(aes(???, ???, colour = species, shape = species), size = 2.5)
2
Please check this: michael.hahsler.net/SMU/EMIS7332/R/viz_classifier.html. Posterior probabilities can be plotted in order to see the distribution for each class! – Duck
I did not see any information about Quadratic Discriminant Analysis on that HTML page. – JohnSal

2 Answers

10 votes

Following Duck's comment: if you have only 2 dimensions, we can use the decisionplot function provided in the link to visualize the decision boundaries. It has to be altered slightly for more variables.

library(MASS)
model <- qda(Species ~ Sepal.Length + Sepal.Width, iris)
decisionplot(model, iris, class = "Species")

[figure: base-graphics decision plot for the QDA model]

The decisionplot function is shown below.

decisionplot <- function(model, data, class = NULL, predict_type = "class",
  resolution = 100, showgrid = TRUE, ...) {

  if(!is.null(class)) cl <- data[,class] else cl <- 1
  data <- data[,1:2]  # the first two columns of data are used as the plot axes
  k <- length(unique(cl))

  plot(data, col = as.integer(cl)+1L, pch = as.integer(cl)+1L, ...)

  # make grid
  r <- sapply(data, range, na.rm = TRUE)
  xs <- seq(r[1,1], r[2,1], length.out = resolution)
  ys <- seq(r[1,2], r[2,2], length.out = resolution)
  g <- cbind(rep(xs, each = resolution), rep(ys, times = resolution))
  colnames(g) <- colnames(r)
  g <- as.data.frame(g)

  ### guess how to get class labels from predict
  ### (unfortunately not very consistent between models)
  p <- predict(model, g, type = predict_type)
  if(is.list(p)) p <- p$class
  p <- as.factor(p)

  if(showgrid) points(g, col = as.integer(p)+1L, pch = ".")

  z <- matrix(as.integer(p), nrow = resolution, byrow = TRUE)
  contour(xs, ys, z, add = TRUE, drawlabels = FALSE,
    lwd = 2, levels = (1:(k-1))+.5)

  invisible(z)
}

If we wanted to recreate this with ggplot2, we would simply have to change the function to use ggplot2 functions rather than base graphics. This entails converting the data into data.frames and building the plot up along the way.

decisionplot_ggplot <- function(model, data, class = NULL, predict_type = "class",
                         resolution = 100, showgrid = TRUE, ...) {
  
  if(!is.null(class)) cl <- data[,class] else cl <- 1
  data <- data[,1:2]  # the first two columns of data are used as the plot axes
  cn <- colnames(data)
  
  k <- length(unique(cl))
  
  data$pch <- data$col <- as.integer(cl) + 1L
  gg <- ggplot(aes_string(cn[1], cn[2]), data = data) + 
    geom_point(aes_string(col = 'as.factor(col)', shape = 'as.factor(col)'), size = 3)
  
  # make grid
  r <- sapply(data[, 1:2], range, na.rm = TRUE)
  xs <- seq(r[1, 1], r[2, 1], length.out = resolution)
  ys <- seq(r[1, 2], r[2, 2], length.out = resolution)
  
  g <- cbind(rep(xs, each = resolution), 
             rep(ys, times = resolution))
  colnames(g) <- colnames(r)
  
  g <- as.data.frame(g)
  
  ### guess how to get class labels from predict
  ### (unfortunately not very consistent between models)
  p <- predict(model, g, type = predict_type)
  if(is.list(p)) p <- p$class
  g$col <- g$pch <- as.integer(as.factor(p)) + 1L
  
  if(showgrid) 
    gg <- gg + geom_point(aes_string(x = cn[1], y = cn[2], col = 'as.factor(col)'), data = g, shape = 20, size = 1)
  
  gg + geom_contour(aes_string(x = cn[1], y = cn[2], z = 'col'), data = g, inherit.aes = FALSE)
}

Usage:

decisionplot_ggplot(model, iris, class = "Species")

Note that it now returns the ggplot object itself, so one could use standard functions to change the title, theme, etc. Also, this is simply a direct translation: using geom_polygon with a suitable alpha would likely be more visually pleasing, and similar but better contours could be made with an alternative choice of geom_*.

[figure: ggplot2 version of the decision plot]
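For instance, here is a minimal sketch of the shaded-region idea, using geom_raster on a rebuilt prediction grid rather than geom_polygon (geom_raster is simpler here because the grid is regular); the grid construction mirrors what happens inside the function:

library(MASS)
library(ggplot2)

model <- qda(Species ~ Sepal.Length + Sepal.Width, iris)

# Rebuild the prediction grid over the two plotted variables,
# mirroring the grid construction inside decisionplot_ggplot.
resolution <- 100
xs <- seq(min(iris$Sepal.Length), max(iris$Sepal.Length), length.out = resolution)
ys <- seq(min(iris$Sepal.Width),  max(iris$Sepal.Width),  length.out = resolution)
grid <- expand.grid(Sepal.Length = xs, Sepal.Width = ys)
grid$Species <- predict(model, grid)$class

# Shaded class regions plus the observed points; since this returns a ggplot,
# you can add e.g. + ggtitle("QDA decision regions") + theme_minimal().
ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
  geom_raster(aes(fill = Species), data = grid, alpha = 0.3) +
  geom_point(aes(colour = Species, shape = Species), size = 3)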

1 vote

Adding to my own answer: as I stated above (for cases with more than 2 predictor variables),

It has to be altered slightly for more variables.

One would have to choose how to do this. One method would be to use the mean of numeric variables and the base level of factors, i.e. holding these at a constant level while plotting the remaining variables. This seemed interesting enough for the question, as the example given had more than 2 variables.
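To illustrate, a minimal sketch of those held-constant values (here for the case where Sepal.Width and Petal.Width are the variables not being plotted):

# Mean for numeric columns, first level/value for factors and characters;
# this is what the function below computes when modes.means is not supplied.
modes.means <- lapply(iris[, c("Sepal.Width", "Petal.Width")], function(x) {
  if (is.factor(x)) levels(x)[1]
  else if (is.character(x)) unique(x)[1]
  else mean(x)
})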

We can alter the decisionplot_ggplot function (quite substantially) to obtain this, letting the user specify the variables to be plotted through a new argument vars, while keeping the usage very similar to the previous function definition.

#' Create a decisionplot using the ggplot package.
#'
#' @param model any model that supports terms() and predict()
#' @param vars either a character vector of length 2 or a formula of the form V1 ~ V2; the first variable is mapped to the x axis and the second to the y axis of the plot.
#' @param data data to use for the visualization. Should contain all the variables needed by the model.
#' @param resolution number of grid points per numeric variable to use for the visualization.
#' @param showgrid logical; should the class regions be coloured?
#' @param ... further parameters passed to predict
#' @param modes.means named list of values to use for the variables not specified in vars.
decisionplot_ggplot <- function(model, 
                                vars, 
                                data, 
                                resolution = 100,
                                showgrid = TRUE, 
                                ...,
                                modes.means) {
  if(missing(model) || missing(vars) || missing(data))
    stop('model, vars or data is missing')
  if(!(is.character(vars) && length(vars) == 2) && !('formula' %in% class(vars) && length(vars <- all.vars(vars)) == 2))
    stop('vars should be either a formula or a character vector of length 2.')
  if(!is.data.frame(data))
    stop('data does not seem to conform with standard types.')
  
  trms <- terms(model)
  if(!all((other.vars <- attr(trms, 'term.labels')) %in% colnames(data)))
    stop('data is missing one or more variables in model.')
  lhs <- as.character(trms[[2]])
  # Set up data for prediction, for the data in vars
  prd.vars <- lapply(data[, vars], function(x){
    if(is.character(x) || is.factor(x)){
      unique(x)
    }else{
      r <- range(x)
      seq(r[1], r[2], length.out = resolution)
    }
  })
  names(prd.vars) <- vars
  # set up held-constant values for the remaining variables
  other.vars <- other.vars[!other.vars %in% vars]
  if(missing(modes.means)){
    if(length(other.vars)){
      modes.means <- lapply(data[, other.vars], function(x){
        if(is.character(x)){
          unique(x)[1]
        }else if(is.factor(x)){
          levels(x)[1]
        }else{
          mean(x)
        }
      }) 
      names(modes.means) <- other.vars
    }else
      modes.means <- NULL
  }else{
    if(!length(other.vars))
      stop('modes.means was provided, but there are no variables beyond vars to hold constant. Please leave this missing.')
    if(!all(other.vars %in% names(modes.means)))
      stop('modes.means are lacking one or more variables.')
    modes.means <- as.list(modes.means)
    if(any(lengths(modes.means) > 1))
      stop('modes.means should only contain a single value per variable.')
  }
  prd.data <- expand.grid(c(prd.vars, modes.means))
  p <- predict(model, prd.data, ...)
  prd.data$nm <- if(is.list(p)) p$class else p
  names(prd.data)[ncol(prd.data)] <- lhs
  # Create the final plot.
  gg <- ggplot(aes_string(vars[1], vars[2]), data = data) + 
    geom_point(aes_string(col = lhs, shape = lhs), size = 3) + 
    geom_contour(aes_string(vars[1], vars[2], z = paste0('as.integer(', lhs, ') + 1L')), data = prd.data, inherit.aes = FALSE)
  if(showgrid)
    gg <- gg + 
      geom_point(aes_string(vars[1], vars[2], col = lhs), data = prd.data, shape = 20, size = 0.5)
  gg
}

Now the added flexibility of this function allows us to plot some more complex models:

library(MASS)
library(ggplot2)
model <- qda(Species ~ ., iris)
decisionplot_ggplot(model, Sepal.Length ~ Petal.Length, iris, showgrid = TRUE)

[figure: decision plot for qda(Species ~ ., iris), Sepal.Length vs. Petal.Length]
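As an aside, vars can equivalently be given as a character vector; this call produces the same plot as the formula version above:

decisionplot_ggplot(model, c("Sepal.Length", "Petal.Length"), iris, showgrid = TRUE)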

One thing to be very aware of, however, is that this plot is very sensitive to the chosen values for the variables that are not included in the plot itself. For example, the following is not a very useful plot.

decisionplot_ggplot(model, Sepal.Length ~ Petal.Width, iris, showgrid = TRUE)

[figure: misleading decision plot, Sepal.Length vs. Petal.Width]

What we see here is that the held-constant average value of Petal.Length clearly has a huge impact on whether or not the model classifies an observation as setosa. Because it is held constant, we have lost a lot of the model's classification power in the plot.
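If you want to probe that sensitivity, the modes.means argument lets you hold the remaining variables at values of your own choosing; a quick sketch (the values below are arbitrary, chosen purely for illustration):

decisionplot_ggplot(model, Sepal.Length ~ Petal.Width, iris,
                    showgrid = TRUE,
                    modes.means = list(Sepal.Width = 3, Petal.Length = 2))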

This should very clearly indicate the obvious: once we move past models with 2 predictors, decision boundary plots are very hard to visualize in any useful manner. Personally, I would discourage using decision boundary plots for anything beyond simple models.