9
votes

This question is to discuss how to code a spell corrector and is not duplicate of Delphi Spell Checker component.

Two years ago, I found and used the code of spell corrector by Peter Norvig at his website in Python. But the performance seemed not high. Quite interestingly, more languages that implement the same task have been appended in his webpage list recently.

Some lines in Peter's page include syntax like:

[a + c + b     for a, b in splits for c in alphabet]

How to translate it into delphi?

I am interested in how Delphi expert at SO will use the same theory and do the same task with some suitable lines and possible mediocre or better performance. This is not to downvote any language but to learn to compare how they implement the task differently.

Thanks so much in advance.

[Edit]

I will quote Marcelo Toledo who contributes C version, as saying "...While the purpose of this article [C version] was to show the algorithms, not to highlight Python...". Though his C version is with the second most lines, according to his article, his version is high performance when the dictionary file is huge. So this question is not highlight any language but to ask for delphi solution and it is not at all intended for competition, though Peter is influential in directing Google Research.

[Update]

I was enlightened by David's suggestion and studied theory and routine of Peter's page. A very rough and inefficient routine was done, slightly different from other languages, mine is GUI's. I am a beginner and learner in Delphi, I dare not post my complete code ( it is poorly written). I will outline my idea of how I did it. Your comment is welcome so that the routine will be improved.

My hardware and software is old. This is enough to my work (my specialization is not in computer or program related)

AMD Athlon Dual Core Processor
2.01 Ghz, 480 Memory
Windows XP SP2
IDE Delphi 7.0

This is the snapshot and record of processing time of 'correct' word. I tried Gettickcount, Tdatetime, and Queryperformancecounter to track correct time for word, but gettickcount and Tdatetime will output o ms for each check, so I have to use Queryperformancecounter. Maybe there are other ways to do it more precisely.

The total lines is 72, not including function that records check time. Number of lines may not be yardstick as mentioned above by Marcelo. The post is discuss how to do the task differently. Delphi Experts at SO will of course use minimum lines to do it with best performance.

Spell Checker

procedure Tmajorform.FormCreate(Sender: TObject);
begin
loaddict;
end;

procedure Tmajorform.loaddict;
var
fs: TFilestream;
templist: TStringlist;
p1: tperlregex;
w1: string;
begin
//load that big.txt (6.3M, is Adventures of Sherlock Holmes)
//templist.loadfromstream
//Use Tperlregex to tokenize ( I used regular expression by [Jan Goyvaerts][5])
//The load and tokenize time is about 7-8 seconds on my machine, Maybe there are other ways to
//speed up loading and tokenizing.
end;

procedure Tmajorform.edits1(str: string);
var
i: integer;
ch: char;
begin 
// This is to simulate Peter's page in order to fast generate all possible combinations.
// I do not know how to use set in delphi. I used array.
// Peter said his routine edits1 would generate 494 elements of 'something'. Mine will 
// generate 469. I do not know why. Before duplicate ignore, mine is over 500. After setting
// duplicate ignore, there are 469 unique elements for 'something'.
end;

procedure Tmajorform.correct(str: string);
var
i, j: integer;
begin
//This is a loop and binary search to add candidate word into list.
end;

procedure Tmajorform.Button2Click(Sender: TObject);
var
str: string;
begin
// Trigger correct(str: string);
end;

It seems by Tfilestream it can increase loading by 1-2 second. I tried using CreateFileMapping method but failed and it seemed a little complicated. Maybe there are other ways to load huge file fast. Because this big.txt will not be big considering availability of corpus, there should be more efficient way to load larger and larger file.

Another point is Delphi 7.0 does not have built-in regular expression. I have a look at other languages that do spell check at Perter's page, they are largely Directly calling their built-in regular expression. Of course, real expert does not need any built-in class or library and can build by himself. To beginner, some classes or libraries are convenience.

Your comment is welcome.

[Update]

I continued the research and further included edits2 function (edit distance 2). This will increase about another 12 lines of code. Peter said edit distance 2 would include almost all possibilities. 'something' will have 114,324 possibilities. My function will generate 102,727 UNIQUE possibilities for it. Of course, suggested words will also include more.

If with edits2, reponse time for correction obviously delay as it increases data by about 200 times. But I find some suggested corrections are obviously impossibilities as a typist will not type a error word that will be in the long corrected word list. So, edit distance 1 will be better provided that the big.txt file is sufficiently big to include more correct words.

Below is the snapshot of tracking edits 2 correct time.

enter image description here

1
Do you want from us to implement it in Delphi and tell you how many lines and how the performance was ? Sounds more like a competition than a serious question. Anyway we can't measure the results on different machines and compare it with Peter Norvig's one. -1 at least until you explain what you really want to know.user532231
@daemon_x, I want to know how to do it in delphi. How to translate the python syntax in Peter's page like "[(word[:i], word[i:]) for i in range(len(word) + 1)]" or "[a + c + b[1:] for a, b in splits for c in alphabet if b]". That Peter lists languages for the task does not mean competition but suggest that more than one language can do it. I have not tested all the language version. Though some languages are labeled with fewer lines, they may not mean better performance.Dylan
@user482742 - -1 removed; it might be very interesting to translate the code to Delphi, but it's not a good question (task) for SO.user532231
@daemon_x, Thanks. In order to avoid any possible loathing and misunderstanding, I edited and deleted languages and number of lines for the task from Peter's page.Dylan
Maybe you need to create separate question for each line you don't understand.. :)Harriv

1 Answers

8
votes

This is a Python list comprehension. It forms the Cartesian product of splits and alphabets.

Each item of splits is a tuple which is unpacked into a and b. Each item of alphabet is put into a variable called c. Then the 3 variables are concatenated, assuming that they are strings. The result of the list comprehension expression is a list containing elements of the form a + c + b, one element for each item in the Cartesian product.

In Python it could be written equivalently as

res = []
for a, b in splits:
  for c in alphabets:
    res.append(a + c + b)

In Delphi it would be

res := TStringList.Create;
for split in splits do
  for c in alphabets do
    res.Add(split.a + c + split.b);

I suggest you read up on Python list comprehensions to get a better understanding of this very powerful Python feature.