I have a data frame in which every column contains different measurements except the first column which contains IDs. I want to create a smaller data frame which contains all columns for only those IDs which are outliers in at least one column. Here's what the data frame looks like now:

        BRICK       MARBLE          MASONITE        STEEL
ff5     1.9870268   0.3344881       0.09917627      3.205099
fdd     1.8088945   0.5292931       0.10868434      1.835525
fd9     1.2062831   0.2696240       0.12047189      3.279331

I have created vectors containing the outliers in each column using:

outliers_Marble = boxplot(Material$MARBLE, plot=FALSE)$out

I figured out how to make mini data frames that match a single outliers vector using

newframe = Material[match(outliers_Marble, Material$MARBLE), ]

The part that has me stumped is applying this method to each column with the appropriate outliers vector. I know I could do each one manually and then combine the resulting data frames, but I am really hoping that someone can help me find a way to combine multiple calls of the match function into a single command. Thanks in advance.

2 Answers

1
votes

Here's some test data with outliers added

set.seed(14)
dd<-data.frame(
    ID=paste0("ff",1:50),
    BRICK=rnorm(50,2),
    MARBLE=runif(50),
    MASONITE=runif(50, 0, .4),
    STEEL=rnorm(50,5)
)
dd$BRICK[5]<-6
dd$MARBLE[13]<-1.7
dd$MASONITE[26]<- -2
dd$STEEL[30]<- 20

Rather than using boxplot, I used boxplot.stats to get the ends of the whiskers, which makes it easier to find the indices of the outliers. Here's how you can do that:

outliers<-unique(unlist(lapply(dd[-1], function(x) {
    ex <- boxplot.stats(x)$stats; which(x<ex[1] | x>ex[5])
})))

And we can see that we found them

> outliers
[1]  5 13 26 30
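
To see why comparing against ex[1] and ex[5] works: boxplot.stats(x)$stats is a length-5 vector holding the lower whisker end, lower hinge, median, upper hinge and upper whisker end, so anything outside the first and last entries is what boxplot would report in $out. A small check, using a made-up vector x that is not part of the test data above:

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)   # made-up values; 100 is the outlier

bs <- boxplot.stats(x)
bs$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out     # the values lying beyond the whiskers

# Indices of points outside the whiskers, same test as in the lapply() call
idx <- which(x < bs$stats[1] | x > bs$stats[5])

# The values at those indices match what $out reports
stopifnot(identical(x[idx], bs$out))
```

Working with indices rather than values is what lets the results from different columns be merged with unique(unlist(...)).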

Since I've already combined the indices and removed duplicates, I can now drop those rows from the table:

# keep if/else on one line: at top level, an 'else' on its own line is a syntax error
newframe <- if (length(outliers) > 0) dd[-outliers, ] else dd

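
Note that the question asks for the rows that are outliers, not for what remains after removing them. A minimal sketch of that variant, assuming the same index-based approach as above. The data frame toy and its values are made up for illustration:

```r
# Hypothetical toy data: one obvious outlier per measurement column
toy <- data.frame(
    ID = paste0("id", 1:10),
    A  = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100),
    B  = c(-50, 1, 2, 3, 4, 5, 6, 7, 8, 9),
    stringsAsFactors = FALSE
)

# Same approach as above: row indices lying beyond the whiskers in any column
outliers <- unique(unlist(lapply(toy[-1], function(x) {
    ex <- boxplot.stats(x)$stats
    which(x < ex[1] | x > ex[5])
})))

# Positive indexing keeps only the outlier rows. Unlike negative indexing,
# it needs no special case when 'outliers' is empty: the result is simply
# a zero-row data frame.
outlier_rows <- toy[sort(outliers), ]
```

The sort() call is only there to return the rows in their original order.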
0
votes

Try the following:

#Defining function for outlier
outLierFun = function(x){boxplot(x, plot=FALSE)$out}

#Dropping the first column (the IDs, per the question), which needs no outlier test
colNames = colnames(Material)[-1]

#Finding out outlier
outliers = lapply(Material[colNames ], FUN = outLierFun)

#Empty Dataframe
newFrame = Material[0,] 
for(i in colNames){
  temp = subset(Material, get(i) %in% outliers[[i]])
  newFrame = unique(rbind(newFrame, temp))
}
#Final results
newFrame
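
The loop can also be avoided by building one logical flag per row. This is a sketch rather than a drop-in replacement: it assumes the first column holds the IDs, as described in the question, and the Material data frame below is a made-up stand-in for the real data. The helper name isOutlier is likewise hypothetical.

```r
# Made-up stand-in for the asker's data; first column holds the IDs
Material <- data.frame(
    ID     = paste0("id", 1:10),
    BRICK  = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100),
    MARBLE = c(-50, 1, 2, 3, 4, 5, 6, 7, 8, 9),
    stringsAsFactors = FALSE
)

# TRUE where a value is among the boxplot outliers of its own column
isOutlier <- function(x) x %in% boxplot.stats(x)$out

# One logical column per measurement, then keep rows with at least one TRUE
flags    <- sapply(Material[-1], isOutlier)
newFrame <- Material[rowSums(flags) > 0, ]
```

Because each row is selected at most once, no rbind or unique step is needed, and the rows stay in their original order.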