3
votes

First I know very little about algorithms, so bear with me.

As I understand it, the Boyer Moore algorithm is fastest with a long key. So what if I have a very shor key (example 10 chars), and alot of text to search (over 10,000 chars). Would Boyer Moore be the best search algorithm for this scenario?

If not what would be?

2

2 Answers

2
votes

According to String searching algorithm, "The Boyer–Moore string search algorithm has been the standard benchmark for the practical string search literature." It's not always the fastest, but in general it's the way to go.

Knuth-Morris-Pratt and Boyer-Moore are very close in running time when you're talking about a small text buffer like 10,000 characters. Even a naive string search will be blindingly fast on a 10K buffer when run on a modern computer. I suspect you'll find that the difference between the KMP and Boyer-Moore two searching for a 10-character string in a 10,000 character buffer will be on the order of nanoseconds.

The best search algorithm in this scenario? That's going to depend on how often you need to call it. If it's something that gets called a few times a second (at most), I'd probably write a naive search and leave it at that. The difference between Boyer-Moore and naive search on that small buffer would be insignificant compared to the running time of your program, and your optimization effort would be better spent somewhere else. If I had to call it hundreds or thousands of times a second, I'd take the time to write an optimized Boyer-Moore search.

0
votes

To answer your question you need to visit one link only: http://www.codeproject.com/Articles/250566/Fastest-strstr-like-function-in-C

In fact the homepage of fastest textual searcher is this: http://www.sanmayce.com/Railgun/index.html

>So what if I have a very shor key (example 10 chars), and alot of text to search (over 10,000 chars). Exactly this key range (10-11 chars) is tested in my heavy (1TB) strstr showdown. There 400,000 words are searched in 300,000 words, ONE-BY-ONE, an excellent load for a strstr function!

>Would Boyer Moore be the best search algorithm for this scenario? According to my tests the fastest text searcher (using Microsoft V16 /Ox) is Railgun_Quadruplet_7Gulliver, yet it is second best when /Ox is used and Intel 12.1, you can test it yourself, see below.

If you have a fast machine, say i7 37*, I would like to see your results of my latest console benchmark package (testing Microsoft v16 vs Intel 12.1 compilers):
http://www.sanmayce.com/Downloads/_KAZE_Benchmark_2013.zip

Test on my T7500 2200Mhz notebook:

E:\_KAZE_Benchmark_2013>RUNME.bat

E:\_KAZE_Benchmark_2013>SKYFALL_TXT2HTML_MicrosoftV16.exe MIX_Tesla_BABAJI_Castaneda_Poirot_Holmes_Sunnah_Hadith_Quran_Bible_Patanjali_Dao_Simplici
ssimus.wrd.txt MASAKARI_General-Purpose_Grade_English_Wordlist_r3_316423_words.wrd /Gulliver
SKYFALL_TXT2HTML, revision 2, written by Kaze.
Size of 1st input file: 8916987
Size of 2nd input file: 3869529

Doing Search for each word with Railgun_Quadruplet_7Gulliver ...
Railgun_Quadruplet_7Gulliver performance: 1944+KB/clock
Average Pattern Length: 11
Function Invocations: 420,640
Function Inner-Loop Iterations: 131,873,881,926
Function Really Traversed: 1,142,740,439KB

E:\_KAZE_Benchmark_2013>SKYFALL_TXT2HTML_MicrosoftV16.exe MIX_Tesla_BABAJI_Castaneda_Poirot_Holmes_Sunnah_Hadith_Quran_Bible_Patanjali_Dao_Simplici
ssimus.wrd.txt MASAKARI_General-Purpose_Grade_English_Wordlist_r3_316423_words.wrd
SKYFALL_TXT2HTML, revision 2, written by Kaze.
Size of 1st input file: 8916987
Size of 2nd input file: 3869529

Doing Search for each word with Railgun_Quadruplet_7 ...
Railgun_Quadruplet_7 performance: 1747+KB/clock
Average Pattern Length: 11
Function Invocations: 420,640
Function Inner-Loop Iterations: 146,826,792,351
Function Really Traversed: 1,142,740,439KB

E:\_KAZE_Benchmark_2013>SKYFALL_TXT2HTML_MicrosoftV16.exe MIX_Tesla_BABAJI_Castaneda_Poirot_Holmes_Sunnah_Hadith_Quran_Bible_Patanjali_Dao_Simplici
ssimus.wrd.txt MASAKARI_General-Purpose_Grade_English_Wordlist_r3_316423_words.wrd /Hasherezade
SKYFALL_TXT2HTML, revision 2, written by Kaze.
Size of 1st input file: 8916987
Size of 2nd input file: 3869529

Doing Search for each word with Railgun_Hasherezade ...
Railgun_Hasherezade performance: 1774+KB/clock
Average Pattern Length: 11
Function Invocations: 420,640
Function Inner-Loop Iterations: 128,900,655,391
Function Really Traversed: 1,142,740,439KB

There are some up-and-downs when 32bit and 64bit code is used also significant margins between Microsoft & Intel.

Gulliver's "engine" is order 2 BM-Horspool tuned by me.

As you can see on my humble laptop Gulliver searches for patterns (Average Pattern Length: 11) at 1898MB/s, even the super beautiful BNDM bends a knee here.