0
votes

I'm unable to subset the column "short_desc" in my dataframe and apply code that creates decile ranges corresponding to the value. The decile range is produced in the column "Value.fc".

The code I use to create the decile range is:

df <- df %>%
mutate(Value.fc = cut2(Value, g=10),
     Value.fc = factor(sapply(str_extract_all(Value.fc, "\\d+"),
                              function(x) paste(x, collapse="-"))),
     Value.fc = reorder(Value.fc, Value))

The code works when there is only one level for "short_desc". However, when I apply that code using the group_by() function the decile range is wrong. Sample data of what "Value.fc" looks like when I use group_by():

dput(head(df)) structure(list(
   state = c("Iowa", "Iowa", "Illinois"),
   short_desc = c("Corn, grain - yield, measured in bu / acre", "Corn, silage - yield, measured in tons / acre", "Corn, grain - yield, measured in bu / acre"), 
   Value = c(137.8, 13.5, 153.3), 
   FIPS = c("19001", "19001", "17001"), 
   Value.fc = c("135-0-150", "13-0-14-5", "150-4-157"))

The first value of "Value.fc" should look like "135-150" not "135-0-150". Any help would be appreciated.

1
I think it has something to do with that you have 3 observations but g in cut2 is set to 10 i.e. more cuts than observations. - Gautam
It’s just a sample of my data to share with you. I have 746 observations. - gm007

1 Answers

0
votes

Not quite an answer but can't post longer code in a comment. Works fine with data.table:

CODE

library(data.table)
library(Hmisc)
df <- data.table(state = state.name[1:50], 
                 value = unname(state.x77[, 1]))
df[, value.fc.inter := cut2(value, g = 10)]
df[, value.fc := unlist(lapply(stringr::str_extract_all(value.fc.inter, "\\d+"), 
                        function(z) paste(z, collapse = "-")))]

I have value.fc.inter as an intermediate variable to illustrate the process.

OUTPUT

> head(df)
        state value value.fc.inter    value.fc
1:    Alabama  3615  [ 2861, 3806)   2861-3806
2:     Alaska   365  [  365,  637)     365-637
3:    Arizona  2212  [ 1544, 2284)   1544-2284
4:   Arkansas  2110  [ 1544, 2284)   1544-2284
5: California 21198  [11197,21198] 11197-21198
6:   Colorado  2541  [ 2284, 2861)   2284-2861

If you want to order by value.fc, you can use the below code but I think the levels for value.fc need to be ordered before this would work as expected.

> df[order(value.fc, decreasing = F)]
             state value value.fc.inter    value.fc
 1:     California 21198  [11197,21198] 11197-21198
 2:       Illinois 11197  [11197,21198] 11197-21198
 3:       New York 18076  [11197,21198] 11197-21198
 4:   Pennsylvania 11860  [11197,21198] 11197-21198
 5:          Texas 12237  [11197,21198] 11197-21198
 6:        Arizona  2212  [ 1544, 2284)   1544-2284
 7:       Arkansas  2110  [ 1544, 2284)   1544-2284
 8:         Kansas  2280  [ 1544, 2284)   1544-2284
 9:       Nebraska  1544  [ 1544, 2284)   1544-2284
10:  West Virginia  1799  [ 1544, 2284)   1544-2284
11:       Colorado  2541  [ 2284, 2861)   2284-2861
12:    Mississippi  2341  [ 2284, 2861)   2284-2861
13:       Oklahoma  2715  [ 2284, 2861)   2284-2861
14:         Oregon  2284  [ 2284, 2861)   2284-2861
15: South Carolina  2816  [ 2284, 2861)   2284-2861
16:        Alabama  3615  [ 2861, 3806)   2861-3806
17:    Connecticut  3100  [ 2861, 3806)   2861-3806
18:           Iowa  2861  [ 2861, 3806)   2861-3806
19:       Kentucky  3387  [ 2861, 3806)   2861-3806
20:     Washington  3559  [ 2861, 3806)   2861-3806
21:         Alaska   365  [  365,  637)     365-637
22:       Delaware   579  [  365,  637)     365-637
23:         Nevada   590  [  365,  637)     365-637
24:        Vermont   472  [  365,  637)     365-637
25:        Wyoming   376  [  365,  637)     365-637
26:      Louisiana  3806  [ 3806, 4767)   3806-4767
27:       Maryland  4122  [ 3806, 4767)   3806-4767
28:      Minnesota  3921  [ 3806, 4767)   3806-4767
29:      Tennessee  4173  [ 3806, 4767)   3806-4767
30:      Wisconsin  4589  [ 3806, 4767)   3806-4767
31:        Georgia  4931  [ 4767, 5814)   4767-5814
32:        Indiana  5313  [ 4767, 5814)   4767-5814
33:       Missouri  4767  [ 4767, 5814)   4767-5814
34: North Carolina  5441  [ 4767, 5814)   4767-5814
35:       Virginia  4981  [ 4767, 5814)   4767-5814
36:        Florida  8277  [ 5814,11197)  5814-11197
37:  Massachusetts  5814  [ 5814,11197)  5814-11197
38:       Michigan  9111  [ 5814,11197)  5814-11197
39:     New Jersey  7333  [ 5814,11197)  5814-11197
40:           Ohio 10735  [ 5814,11197)  5814-11197
41:          Idaho   813  [  637,  868)     637-868
42:        Montana   746  [  637,  868)     637-868
43:  New Hampshire   812  [  637,  868)     637-868
44:   North Dakota   637  [  637,  868)     637-868
45:   South Dakota   681  [  637,  868)     637-868
46:         Hawaii   868  [  868, 1544)    868-1544
47:          Maine  1058  [  868, 1544)    868-1544
48:     New Mexico  1144  [  868, 1544)    868-1544
49:   Rhode Island   931  [  868, 1544)    868-1544
50:           Utah  1203  [  868, 1544)    868-1544
             state value value.fc.inter    value.fc

if you just want to group by then you can also use the by argument instead of using order:

> df[, .(state, value) , by = .(value.fc)][1:10]

     value.fc       state value
 1: 2861-3806     Alabama  3615
 2: 2861-3806 Connecticut  3100
 3: 2861-3806        Iowa  2861
 4: 2861-3806    Kentucky  3387
 5: 2861-3806  Washington  3559
 6:   365-637      Alaska   365
 7:   365-637    Delaware   579
 8:   365-637      Nevada   590
 9:   365-637     Vermont   472
10:   365-637     Wyoming   376