
I need to compute the (scaled) Hamming string distance d(x,y) = #{i : x_i != y_i, i = 1,...,n} / n, where x and y are strings of length n. I use R with dplyr/tidyverse and have defined the Hamming distance as

hamdist = function(x, y) mean(str_split(x, "")[[1]] != str_split(y, "")[[1]])

This works perfectly fine. However, since I want to apply it column-wise, I have to use the rowwise verb (or map2 from the purrr package). The problem: my data set contains ~50 million observations, so the calculation takes hours.

My question is therefore: is there a smoother/more efficient way to implement the Hamming string distance for column operations?

(dplyr solutions are preferable)

An example:

library(tidyverse) # loads dplyr, tibble, purrr, and stringr (for str_split)

n = 1000
l = 8

# generate n random lowercase strings of length l
rstr = function(n, l = 1) replicate(n, paste0(letters[floor(runif(l, 1, 27))], collapse = ""))

hamdist = function(x, y) mean(str_split(x, "")[[1]] != str_split(y, "")[[1]])

df = tibble(a = rstr(n, l), b = rstr(n, l))

df %>% mutate(dist = hamdist(a, b)) # wrong! hamdist is not vectorized; only the first pair of strings is compared
df %>% rowwise() %>% mutate(dist = hamdist(a, b)) # correct, but slow for n = 50 million
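
For reference, the map2 variant mentioned above looks like this; a sketch, assuming the tidyverse is loaded so that purrr's map2_dbl is available. It is equally correct but still works one pair at a time:

df %>% mutate(dist = map2_dbl(a, b, hamdist)) # correct, but still row-by-row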
Could you make a reproducible example? - Aurèle
I added an example. - Syd Amerikaner

1 Answer


See the stringdist package. Its stringdist function takes a method argument that can be set to "hamming". The stringdist package claims to be:

Built for speed, using openMP for parallel computing.
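
A minimal sketch of how this could replace the rowwise call (my own addition, not part of the answer itself): stringdist is vectorized over its arguments, and with method = "hamming" it returns the count of differing positions (Inf if the two strings differ in length), so dividing by the string length gives the scaled distance from the question:

library(stringdist)
library(dplyr)

df %>% mutate(dist = stringdist(a, b, method = "hamming") / nchar(a)) # vectorized: no rowwise() needed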