0
votes

I'm just learning R (and dplyr), and have what is likely a simple problem, though I have spent hours researching with no answer. I have two columns, A (character) and B (integer), in a data frame.

The data frame contains additional rows due to poor spelling in the original data (in column A) and so there are many rows that must be collapsed into a single row (whilst also collapsing their integer values (via sum()) in column B).

For example, there may be three rows that should instead be one:

CAR.............MPG

Mazda.........5

Mazzda...... 2

Mzda.......... 1

Should be

CAR...........MPG

Mazda........ 8

I'm using dplyr, and trying to find the manner in which I can collapse rows with similar characters (e.g. three letters), and pipe back into the original table.

Any thoughts or directions, preferably using base R or dplyr would be appreciated.

1
This will help you get an appropriate answer fast. stackoverflow.com/questions/5963269/…Hector Haffenden

1 Answers

1
votes

Using the input DF shown reproducibly in the Note at the end, use soundex or one of the other functions in the phonics package to derive a key for each CAR and then summarize by key:

library(dplyr)
library(phonics)

DF %>% 
  group_by(key = soundex(CAR)) %>%
  summarize(CAR = toString(CAR), MPG = sum(MPG)) %>%
  ungroup %>%
  select(-key)

giving:

# A tibble: 1 x 2
  CAR                   MPG
  <chr>               <int>
1 Mazda, Mazzda, Mzda     8

Note

Lines <- "CAR MPG
Mazda 5
Mazzda 2
Mzda 1"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE, strip.white = TRUE)