Removing outliers from data frame using boxplotting method

Question

I have a data frame with about 15 variables, and I need to remove outliers from each of them.

Following a tutorial on the web, I am using the boxplot method. I remove outliers in a stacked fashion, one variable at a time, until every variable in the data frame has been treated.

Here is my code. Is this a good way to remove outliers, and how could the code be improved?

#removing outliers from the columns
outliers <- boxplot(outlier_H_rem$var1, plot=FALSE)$out
if(length(outliers) == 0){ outlier_H_rem1<-outlier_H_rem
boxplot(outlier_H_rem1$var1)} else { 
outlier_H_rem1<-outlier_H_rem[-which(outlier_H_rem$var1 %in% outliers),]
var1<-outlier_H_rem1$var1}
boxplot(outlier_H_rem1$var1)

outliers <- boxplot(outlier_H_rem1$var2, plot=FALSE)$out
if(length(outliers) == 0){ outlier_H_rem2<-outlier_H_rem1
boxplot(outlier_H_rem2$var2)} else { 
outlier_H_rem2<-outlier_H_rem1[-which(outlier_H_rem1$var2 %in% outliers),]
moisture2<-outlier_H_rem2$var2}
boxplot(outlier_H_rem2$var2)

outlier_H_rem is the starting data frame. Each step takes the previous result and tests the next variable: outlier_H_rem1 is the result after treating var1, outlier_H_rem2 after var2, outlier_H_rem3 after var3, and so on. outlier_H_rem15 is the final data frame, with all 15 variables treated.

It depends. Do you want your variables as separate vectors in the end? — Humpelstielzchen
If by separate vectors you mean that they stay as in the original data frame of ~15 variables, so that I can treat them separately, then yes. Basically, I have two data frames with the same data from different sites. At some point after outlier removal, I need to merge them into one list. — XCeptable
Okay, so you're not interested in your cases/rows/observations but just want each variable separately cleaned of outliers. — Humpelstielzchen
Yes. To start with, in each turn I remove the rows that contain outliers for one variable, until all 15 variables are treated. — XCeptable
@XCeptable have the answers you have gotten been helpful? If any solved your problem consider accepting one as the answer. — Kresten

Steen Harsted · Accepted Answer · 2019-03-21T11:25:38

I can read from your answer to @Humpelstielzchen that you want to work with the variables as individual vectors, so I will answer accordingly. Please remember, though, that merging the variables afterwards might be difficult: you lose the positional order of the values when you extract them as individual vectors and then remove some observations.

In the example below I have created some sample data to illustrate this issue. Note that no outlier is introduced into var3. How will you merge the data later if the vectors have different lengths? Also, even where two variables end up with the same number of observations after outlier removal, their positions no longer match: the outlier is introduced in row 11 for var1 and in row 12 for var2, so the last value of var1 comes from row 12 of the original data while the last value of var2 comes from row 11.

If you are still okay with this, your method will work. I have added some comments to your code.

library(tidyverse)

set.seed(1)

outlier_H_rem <- tibble(
  var1 = rnorm(10, 0, 1),
  var2 = rnorm(10, 0, 1),
  var3 = rnorm(10, 0, 1)) %>% 
  #Introduce outliers
  rbind(c(5, 0, 0), c(0,7, 0))

outlier_H_rem

#removing outliers from the columns
outliers <- boxplot(outlier_H_rem$var1, plot=FALSE)$out

if(length(outliers) == 0){ 
  outlier_H_rem1 <- outlier_H_rem
  #boxplot(outlier_H_rem1$var1) - This line is irrelevant as you create the plot again after the if else call
  } else { 
  outlier_H_rem1 <- outlier_H_rem[-which(outlier_H_rem$var1 %in% outliers),]
  var1 <- outlier_H_rem1$var1 #What is the purpose of this line?
  }

boxplot(outlier_H_rem1$var1)
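
Since the same block has to be repeated for all 15 variables, the steps above could be wrapped in a loop. The following is only a minimal sketch, not code from the question: the helper names remove_outlier_rows and outliers_to_na and the toy data are made up for illustration. It uses boxplot.stats(), which returns the same $out values that boxplot(..., plot = FALSE) reports for a single vector. The second helper assumes that replacing outliers with NA is acceptable for your analysis; it keeps every row in place, which should make a later merge easier.

```r
# Hypothetical helper (not from the original post): drop every row that holds
# a boxplot outlier in any of the given columns, one column after another.
remove_outlier_rows <- function(df, cols = names(df)) {
  for (col in cols) {
    # Outliers are recomputed on the already-filtered data,
    # mirroring the stacked approach in the question
    out <- boxplot.stats(df[[col]])$out
    df <- df[!(df[[col]] %in% out), , drop = FALSE]
  }
  df
}

# Hypothetical alternative: keep all rows and set the outliers to NA,
# so every column keeps its original length and row order.
outliers_to_na <- function(df, cols = names(df)) {
  for (col in cols) {
    out <- boxplot.stats(df[[col]])$out
    df[[col]][df[[col]] %in% out] <- NA
  }
  df
}

# Small made-up data set with one obvious outlier per column
toy <- data.frame(
  var1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100),
  var2 = c(-50, 2, 3, 4, 5, 6, 7, 8, 9, 10)
)

cleaned <- remove_outlier_rows(toy)  # rows containing an outlier are dropped
masked  <- outliers_to_na(toy)       # same shape as toy, outliers become NA
```

Here cleaned keeps only the rows without an outlier in either column, while masked keeps all rows of toy with the outlying values set to NA. Which of the two fits better depends on whether you need complete rows or aligned positions for the merge.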
