0
votes

I've been struggling with the following for some time now:

I want to calculate the difference in word counts (frequency of occurrence of features) between two data frames. Each data frame contains two columns: feature (the words) and frequency.

I want to achieve the following result with df A and df B: all features/words from df A, each with the frequency in A minus the frequency in B. However, when a feature in A does not appear in B, I just want the frequency of A back.

I've tried this with two sapply calls: one to obtain a named vector holding the feature and frequency of A, and one to obtain the frequency of the same feature in B if the feature exists there, otherwise 0. These two vectors were then combined to obtain the desired data frame. The solution works, but it is really slow.

Does anyone know a faster way of obtaining such a result?

2 Answers

0
votes

The basic operation you want here is a left join of the first data frame to the second data frame, using the feature/word as the join condition. One option would be to use the sqldf package:

library(sqldf)
sql <- "select a.feature, a.frequency - coalesce(b.frequency, 0) as difference "
sql <- paste0(sql, "from dfA a left join dfB b on a.feature = b.feature")

result <- sqldf(sql)

This probably isn't the fastest solution available in R; base R likely offers a more efficient one. But the solution above is brief, requiring only a few lines of code, and it is easy to read.
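For comparison, here is a minimal base R sketch of the same left-join idea using match(). The sample data below is made up for illustration, and the sketch assumes the column names feature and frequency from the question and that each feature appears at most once in dfB (match() only returns the first hit). I haven't benchmarked it against your data, so treat it as a starting point rather than a definitive answer.

```r
# Hypothetical example data; column names follow the question.
dfA <- data.frame(feature = c("apple", "banana", "cherry"),
                  frequency = c(5, 3, 2),
                  stringsAsFactors = FALSE)
dfB <- data.frame(feature = c("banana", "cherry", "date"),
                  frequency = c(1, 4, 7),
                  stringsAsFactors = FALSE)

# For each feature in dfA, find its row position in dfB (NA if absent).
idx <- match(dfA$feature, dfB$feature)

# Look up B's frequency; features missing from B count as 0.
freqB <- dfB$frequency[idx]
freqB[is.na(freqB)] <- 0

# Keep every feature of A with frequency(A) - frequency(B).
result <- data.frame(feature = dfA$feature,
                     difference = dfA$frequency - freqB,
                     stringsAsFactors = FALSE)
print(result)
```

Because match() is vectorised, this avoids the per-element lookups of the sapply approach. Features that occur only in dfB (here "date") are dropped, which matches the left-join behaviour described above.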

0
votes

You can use tidy text mining for this.

Please refer to the link below: tidy text mining