Using dplyr function group_by() with cut()

Question

I have a data set of real estate data. I'm trying to create a new column of days on market groups (labeled DOM_Groups) and group them into 15-day intervals (i.e. 0-14, 15-29, etc.). Then I'm trying to summarize() these groupings by the count of observations and the average sale price for each 15-day group.

I'm using the cut() function attempting to break my DOM_Groups into these 15-day intervals. In the base spreadsheet that I imported, the column containing the days on market has a unique observation in each cell, and the data in that column are numeric whole numbers...no decimals, no negative numbers.

When I run the following code, the tibble output is not grouping correctly, and it is including a negative number with a decimal, which does not exist in my data set. I'm not sure what to do to correct this.

gibbsMkt %>% 
  mutate(DOM_Groups = cut(DOM, breaks = 15, dig.lab = 2)) %>% 
  filter(Status == "SOLD") %>% 
  group_by(DOM_Groups) %>% 
  summarize(numDOM = n(),
            avgSP = mean(`Sold Price`, na.rm = TRUE))

The tibble output I get is this:


DOM_Groups        numDOM   avgSP
  <fct>              <int>   <dbl>
1 (-0.23,16]            74 561675.
2 (16,31]               18 632241.
3 (31,47]               11 561727.
4 (47,63]                8 545862.
5 (63,78]                7 729286.
6 (78,94]                6 624167.
7 (1.4e+02,1.6e+02]      2 541000 
8 (1.6e+02,1.7e+02]      1 535395

Also, for rows 7 & 8 in the tibble, the largest number is 164, so I also don't understand why these rows are being converted to scientific notation.

When I use an Excel pivot table, I get the output that I want to reproduce in R, which is depicted below:

How can I reproduce this in R with the correct code?

those are just labels, it doesn't mean you have negative numbers: levels(cut(rpois(1000, 1), breaks = 15)) and breaks = 15 means you will get 15 intervals, not cut the data into 15-unit intervals — rawr
Sorry, not being rude, but that did not help me to understand how to recreate the pivot table from Excel in R using the proper code. Since I'm wanting groups of 15 days, how do I accomplish that? Also, why is the tibble displaying a negative number? — EastBeast

rawr rawr · Accepted Answer · 2021-04-26T20:06:22

cut(x, breaks = 15) means x will be cut into 15 intervals--it cannot guess that you want 15-unit intervals starting with 0 and ending with 150. This is in the docs for ?cut:

breaks either a numeric vector of two or more unique cut points or a single number (greater than or equal to 2) giving the number of intervals into which x is to be cut.

You will need to define your own start and end to each interval such as:

seq(0, max(x), 15)
# [1]   0  15  30  45  60  75  90 105 120 135 150
cut(x, seq(0, max(x), 15))

However, if you set it up correctly, you can define your intervals and make labels at the same time.

set.seed(1)
x <- floor(runif(500, 0, 164))
from <- seq(0, max(x), 15)
to <- from + 15 - 1

labs <- sprintf('%s-%s', from, to)
# [1] "0-14"    "15-29"   "30-44"   "45-59"   "60-74"   "75-89"   "90-104"  "105-119" "120-134" "135-149" "150-164"

data.frame(table(cut(x, c(from, Inf), right = FALSE)), labels = labs)
#         Var1 Freq  labels
# 1     [0,15)   35    0-14
# 2    [15,30)   57   15-29
# 3    [30,45)   45   30-44
# 4    [45,60)   44   45-59
# 5    [60,75)   57   60-74
# 6    [75,90)   55   75-89
# 7   [90,105)   33  90-104
# 8  [105,120)   47 105-119
# 9  [120,135)   40 120-134
# 10 [135,150)   39 135-149
# 11 [150,Inf)   48 150-164

DOM_Groups <- cut(x, c(from, Inf), labs, right = FALSE)
data.frame(table(DOM_Groups))
#    DOM_Groups Freq
# 1        0-14   35
# 2       15-29   57
# 3       30-44   45
# 4       45-59   44
# 5       60-74   57
# 6       75-89   55
# 7      90-104   33
# 8     105-119   47
# 9     120-134   40
# 10    135-149   39
# 11    150-164   48

Your other question of "why am I getting negative numbers," as I mentioned this does not mean that you have negatives in your data--these are just labels generated by using breaks = 15 with your data.

These are the relevant lines in cut.default

if (length(breaks) == 1L) {
  if (is.na(breaks) || breaks < 2L) 
    stop("invalid number of intervals")
  nb <- as.integer(breaks + 1)
  dx <- diff(rx <- range(x, na.rm = TRUE))
  if (dx == 0) {
    dx <- if (rx[1L] != 0) 
      abs(rx[1L])
    else 1
    breaks <- seq.int(rx[1L] - dx/1000, rx[2L] + dx/1000, 
                      length.out = nb)
  }
  else {
    breaks <- seq.int(rx[1L], rx[2L], length.out = nb)
    breaks[c(1L, nb)] <- c(rx[1L] - dx/1000, rx[2L] + 
                             dx/1000)
  }

Using the x from before and breaks = 15, you can see how negatives are introduced:

breaks <- 15
nb <- as.integer(breaks + 1)
dx <- diff(rx <- range(x, na.rm = TRUE))
if (dx == 0) {
  dx <- if (rx[1L] != 0) 
    abs(rx[1L])
  else 1
  breaks <- seq.int(rx[1L] - dx/1000, rx[2L] + dx/1000, 
                    length.out = nb)
} else {
  breaks <- seq.int(rx[1L], rx[2L], length.out = nb)
  breaks[c(1L, nb)] <- c(rx[1L] - dx/1000, rx[2L] + dx/1000)
}
breaks
# [1]   -0.16300  10.86667  21.73333  32.60000  43.46667  54.33333  65.20000  76.06667  86.93333  97.80000 108.66667 119.53333 130.40000
# [14] 141.26667 152.13333 163.16300

levels(cut(x, breaks = 15))
# [1] "(-0.163,10.9]" "(10.9,21.7]"   "(21.7,32.6]"   "(32.6,43.5]"   "(43.5,54.3]"   "(54.3,65.2]"   "(65.2,76.1]"   "(76.1,86.9]"  
# [9] "(86.9,97.8]"   "(97.8,109]"    "(109,120]"     "(120,130]"     "(130,141]"     "(141,152]"     "(152,163]"

Using dplyr function group_by() with cut()

2 Answers