0
votes

I have a data frame with about 15 variables. I have to remove outliers from the variables.

Following a tutorial on the web, I am using the boxplot method to remove outliers. I do this in a stacked way: I remove outliers from one variable at a time, and each step works on the data frame produced by the previous step, until all variables have been treated.

Here is my code. My question is: is this a good way to remove outliers, and how can I improve the code?

# removing outliers from the columns
outliers <- boxplot(outlier_H_rem$var1, plot=FALSE)$out
if(length(outliers) == 0){ outlier_H_rem1<-outlier_H_rem
boxplot(outlier_H_rem1$var1)} else { 
outlier_H_rem1<-outlier_H_rem[-which(outlier_H_rem$var1 %in% outliers),]
var1<-outlier_H_rem1$var1}
boxplot(outlier_H_rem1$var1)

outliers <- boxplot(outlier_H_rem1$var2, plot=FALSE)$out
if(length(outliers) == 0){ outlier_H_rem2<-outlier_H_rem1
boxplot(outlier_H_rem2$var2)} else { 
outlier_H_rem2<-outlier_H_rem1[-which(outlier_H_rem1$var2 %in% outliers),]
moisture2<-outlier_H_rem2$var2}
boxplot(outlier_H_rem2$var2)

outlier_H_rem is the original data frame. Each step tests the next variable on the result of the previous step (outlier_H_rem1 for var1, outlier_H_rem2 for var2, outlier_H_rem3 for var3, and so on), so outlier_H_rem15 is the final data frame in which all 15 variables have been treated.

2
It depends. Do you want your variables as separate vectors in the end? - Humpelstielzchen
By separate vectors, if you mean they stay the same as in the original data frame of about 15 variables so that I can treat them separately, then yes. Basically, I have two data frames with the same data from different sites. At some point after outlier removal, I need to merge them into one list. - XCeptable
Okay, so you're not interested in your cases/rows/observations but just want each variable separately cleaned of outliers. - Humpelstielzchen
Yes. To start with, in each turn I remove the rows that have outliers, separately for each variable, until all 15 variables are treated. - XCeptable
@XCeptable have the answers you have gotten been helpful? If any solved your problem consider accepting one as the answer. - Kresten

2 Answers

0
votes

I can read from your reply to @Humpelstielzchen that you want to work with the variables as individual vectors, so I will answer accordingly. Please remember, though, that merging the variables afterwards might be difficult: once you extract them as individual vectors and remove some observations, the values lose their original positions.

In the example below I have created some sample data to illustrate this issue. Please note that var3 does not have an outlier. How will you merge the data later, given that the vectors will have different lengths? Also, even though var1 and var2 both end up with 11 observations after outlier removal, their last elements come from different rows of the original data (rows 11 and 12).

Provided you are still OK with this, your method will work. I have added some comments to your code.

library(tidyverse)

set.seed(1)

outlier_H_rem <- tibble(
  var1 = rnorm(10, 0, 1),
  var2 = rnorm(10, 0, 1),
  var3 = rnorm(10, 0, 1)) %>% 
  #Introduce outliers
  rbind(c(5, 0, 0), c(0,7, 0))

outlier_H_rem

# removing outliers from the columns
outliers <- boxplot(outlier_H_rem$var1, plot=FALSE)$out

if(length(outliers) == 0){ 
  outlier_H_rem1 <- outlier_H_rem
  #boxplot(outlier_H_rem1$var1) - This line is irrelevant as you create the plot again after the if else call
  } else { 
  outlier_H_rem1 <- outlier_H_rem[-which(outlier_H_rem$var1 %in% outliers),]
  var1 <- outlier_H_rem1$var1 # What is the purpose of this line?
  }

boxplot(outlier_H_rem1$var1)
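
If you want to repeat this step for all 15 variables without copying the block 15 times, a loop is one option. The following is only a minimal sketch in base R, not a definitive implementation: remove_outlier_rows is a hypothetical helper name, and it assumes every listed column is numeric. It uses boxplot.stats(), the function boxplot() relies on to compute $out, and it indexes with a logical vector instead of -which(...). That also makes the if/else unnecessary, because -which(...) on a column without outliers would select zero rows.

```r
# Hypothetical helper (not from the question): treat the columns one after
# another and drop every row whose value is flagged by the boxplot rule,
# so each step works on the result of the previous step.
remove_outlier_rows <- function(df, cols = names(df)) {
  for (col in cols) {
    # Values beyond the whiskers (1.5 * IQR from the hinges)
    out <- boxplot.stats(df[[col]])$out
    # Logical indexing keeps all rows when `out` is empty
    df <- df[!(df[[col]] %in% out), , drop = FALSE]
  }
  df
}

# Usage on a small made-up data frame: the row holding 100 is dropped
toy <- data.frame(a = c(1, 2, 3, 4, 5, 100), b = c(10, 11, 12, 13, 14, 15))
cleaned <- remove_outlier_rows(toy)
```

Because later columns are evaluated on the already reduced data, the result can depend on the order of the columns, just as in the stacked approach from the question.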
0
votes

May I suggest a slightly different approach.

Transform your data from wide to long form, then calculate the outliers using the quartiles and the interquartile range.

Then filter out the outliers and transform back to wide form. Dropping the rows that contained outliers then gives you the desired result.

Building on the example data from @Steen Harsted:

library(tidyverse)

set.seed(1)

outlier_H_rem <- tibble(
  var1 = rnorm(10, 0, 1),
  var2 = rnorm(10, 0, 1),
  var3 = rnorm(10, 0, 1)) %>% 
  #Introduce outliers
  rbind(c(5, 0, 0), c(0,7, 0))

outlier_H_rem

# A tibble: 12 x 3
     var1    var2    var3
    <dbl>   <dbl>   <dbl>
 1 -0.626  1.51    0.919
 2  0.184  0.390   0.782
 3 -0.836 -0.621   0.0746
 4  1.60  -2.21   -1.99
 5  0.330  1.12    0.620
 6 -0.820 -0.0449 -0.0561
 7  0.487 -0.0162 -0.156
 8  0.738  0.944  -1.47
 9  0.576  0.821  -0.478
10 -0.305  0.594   0.418
11  5      0       0
12  0      7       0

outlier_H_rem %>% 
  # Collect data in tidy (long) form
  tidyr::gather("Feature", "Value", everything()) %>%
  ggplot2::ggplot(aes(x = Feature, y = Value)) + geom_boxplot()

[Image: boxplots of Value for each Feature (var1, var2, var3)]

Here is how to identify and remove the outliers using tools from the tidyverse:

outlier_H_rem %>% 
  # Collect data in tidy (long) form
  tidyr::gather("Feature", "Value", everything()) %>% 
  # Group by "Feature", add a row counter, and flag outliers
  # using the quartiles and the IQR
  group_by(Feature) %>% 
  mutate(r = row_number(),
         q1 = quantile(Value, probs = 0.25),
         q3 = quantile(Value, probs = 0.75),
         iqr = IQR(Value),
         outlier = Value < q1 - 1.5 * iqr | Value > q3 + 1.5 * iqr) %>% 
  # Filter out the outliers
  filter(!outlier) %>% 
  # Drop the helper columns
  select(-q1, -q3, -iqr, -outlier) %>%
  # Spread the results again; removed outliers show up as NA.
  # Optionally drop rows that contained an outlier using na.omit()
  spread(Feature, Value) %>% 
  # Remove the row counter
  select(-r)

# A tibble: 12 x 3
     var1     var2     var3
 *  <dbl>    <dbl>    <dbl>
 1 -0.626   1.51     0.919
 2  0.184   0.390    0.782
 3 -0.836  -0.621    0.0746
 4  1.60   NA       NA
 5  0.330   1.12     0.620
 6 -0.820  -0.0449  -0.0561
 7  0.487  -0.0162  -0.156
 8  0.738   0.944   NA
 9  0.576   0.821   -0.478
10 -0.305   0.594    0.418
11 NA       0        0
12  0      NA        0
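
The same idea, including the optional na.omit() step mentioned in the comments, can also be sketched in base R without reshaping. This is a rough sketch under assumptions rather than a drop-in replacement: mask_outliers is a hypothetical helper name, the cutoff multiplier k = 1.5 mirrors the rule used above, and the sample data is rebuilt as a plain data.frame with the same seed. Each column keeps its length, so row positions stay aligned across variables.

```r
# Hypothetical helper: set values outside [q1 - k * IQR, q3 + k * IQR] to NA
mask_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[[2]] - q[[1]]
  is_out <- !is.na(x) & (x < q[[1]] - k * iqr | x > q[[2]] + k * iqr)
  x[is_out] <- NA
  x
}

# Same sample data as above, with the two introduced outliers appended
set.seed(1)
dat <- data.frame(
  var1 = c(rnorm(10, 0, 1), 5, 0),
  var2 = c(rnorm(10, 0, 1), 0, 7),
  var3 = c(rnorm(10, 0, 1), 0, 0)
)

# Mask outliers column by column; the data frame keeps all 12 rows
masked <- dat
masked[] <- lapply(dat, mask_outliers)

# Optionally keep only the rows without any outlier
complete_rows <- na.omit(masked)
```

Keeping the NA values is useful if you later need to merge this data with the data from the other site by row; na.omit() is only needed when you want complete cases.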