I have a single large text (~4 GB) in which I need to search for ~1 million phrases and replace each with its complementary phrase. Both the raw text and the replacements easily fit in memory. The naive solution would take years to finish: a single replacement takes about a minute, so 1 million of them comes to roughly 1 million minutes, or about two years.
Naive solution:
# One full pass over the ~4 GB text per phrase
for search, replace in replacements.items():
    text = text.replace(search, replace)
The regex method using re.sub is 10x slower:
import re
for search, replace in replacements.items():
    # re.escape so each phrase is matched literally, not as a pattern
    text = re.sub(re.escape(search), replace, text)
At any rate, this seems like a great place to use Boyer-Moore string search or Aho-Corasick, but these methods, as they are generally implemented, only support searching the string, not replacing in it as well.
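To make concrete what I am after, here is a rough sketch of a single-pass replace. It is not Aho-Corasick proper: it uses a plain character trie without failure links and takes the longest phrase that matches at each position, so the worst case is proportional to the text length times the longest phrase length. The names `build_trie` and `replace_all` are my own, and I have not measured whether a pure-Python trie holding ~1 million phrases is fast enough or small enough in practice.

```python
def build_trie(replacements):
    """Build a character trie; a node's None key holds the replacement."""
    root = {}
    for search, replace in replacements.items():
        if not search:
            continue  # skip empty phrases; they would match everywhere
        node = root
        for ch in search:
            node = node.setdefault(ch, {})
        node[None] = replace
    return root


def replace_all(text, replacements):
    """Replace every phrase in one left-to-right pass (leftmost-longest)."""
    root = build_trie(replacements)
    pieces = []   # output fragments, joined once at the end
    i = 0         # current scan position
    last = 0      # start of the unmatched run not yet copied to pieces
    n = len(text)
    while i < n:
        node = root
        j = i
        match_end = -1
        match_value = None
        # Walk the trie as far as the text allows, remembering the
        # longest phrase that ends along the way.
        while j < n and text[j] in node:
            node = node[text[j]]
            j += 1
            if None in node:
                match_end = j
                match_value = node[None]
        if match_end != -1:
            pieces.append(text[last:i])   # unmatched text before the hit
            pieces.append(match_value)    # the replacement itself
            i = last = match_end          # resume after the matched phrase
        else:
            i += 1
    pieces.append(text[last:])
    return "".join(pieces)
```

Note that this behaves differently from the loops above: replaced text is never rescanned, so with `{"a": "b", "b": "a"}` the input `"ab"` becomes `"ba"`, whereas the sequential loop would apply the second replacement to the output of the first.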
Alternatively, any tool (outside of Python) that can do this quickly would also be appreciated.
Thanks!
str.join
– cmd