
I'm fascinated by the CAPTCHA system used on SO... I would like to know more about the "many factors" which make reCAPTCHA work. The developers, understandably given the potential for abuse, keep rather quiet about the exact inner workings of their system... But the behavior is well-documented, and so perhaps my curiosity can still be sated:

If I were to design a clone of reCAPTCHA, how might I go about it?


reCAPTCHA allows:

  1. a typing mistake
  2. at a place where people tend to make them. This suggests to me that you need historical data about errors, and then an algorithm built on that data.

Detecting typing mistakes requires extensive use of databases: one for words from the books being digitized and another for words that are already known.
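One way to make "allows a typing mistake" concrete is edit distance. The sketch below is only an illustration, not reCAPTCHA's actual check (which is not public): it computes the Levenshtein distance between the expected word and the typed word and accepts the answer if the distance stays within a tolerance. The function names and the `max_errors=1` default are assumptions chosen for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    # prev[j] is the distance between the processed prefix of a
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]


def accepts(expected: str, typed: str, max_errors: int = 1) -> bool:
    # Hypothetical tolerance rule: case-insensitive, at most max_errors edits.
    return levenshtein(expected.lower(), typed.lower()) <= max_errors
```

With these definitions, `accepts("morning", "mornin")` passes and `accepts("morning", "evening")` does not. Weighting errors by where people tend to make them would need the historical data mentioned above and is not shown here.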

Known technical details

  1. two databases: one for known words and the other for unknown words
  2. a subsequent database for combinations of words

Unknown technical details

  1. How can the words be separated on the fly so that you see a combination of words from different databases? This is a signal-processing question.
  2. How is the data from the two databases presented to the user?
  3. What is the initial format of the data in the two separate databases? PDF?
  4. What is the format of the data once the two databases are combined? PDF?
  5. How can the data from two PDF files be combined into one?
  6. How can you effectively rotate images?
  7. Which algorithms are used to separate the images from the book?

Related topics

  1. signal processing
  2. calculus: Fourier series and Laplace transforms for word-detection algorithms
  3. probability theory: to compute a "computer-human" coefficient that passes only at, for instance, a 95% confidence level
  4. perhaps number theory: we need to be efficient in storing and comparing the data
see this question: stackoverflow.com/questions/8472/…z -
@yx: The post does not answer my question. I want to know how many typing mistakes the captcha allows, and how it knows which letter is correct and which is not. - Léo Léopold Hertz 준영
Recaptcha works by pulling two word images from scanned books where the default OCR was unable to establish the exact text. One of the words shown is known to the system and the other is known only with a low degree of certainty (possibly even 0). You must enter the known word almost exactly and the lesser-known word within some computed distance of its suspected value. Your input is then used to help establish the value of the unknown word, so that it can eventually move to the 'known' category. - Joel Coehoorn
You won't get the math - the gritty details are of necessity not shared. However, I could tell you how I'd put something like that together, and it's much simpler than what you're proposing. - Joel Coehoorn
@Masi: I've edited this in the hope that it could be turned into something answerable. I understand your curiosity, but asking for details of a specific system on a public site when the developers aren't even putting those details on their own site is setting yourself up for disappointment. - Shog9
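
The two-word scheme Joel Coehoorn describes in the comments above could be sketched roughly as follows. This is a guess at the shape of such a system, not its real implementation: the `UnknownWord` class, the `check_response` function, the case-insensitive exact match on the known word, and the promotion threshold of 3 agreeing answers are all assumptions made for illustration. The distance check on the lesser-known word that he mentions is left out to keep the sketch short.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class UnknownWord:
    """A scanned word that OCR could not read, plus the human guesses so far."""
    votes: Counter = field(default_factory=Counter)
    promote_after: int = 3  # assumed threshold, not a documented value

    def record(self, guess: str) -> None:
        # Normalize so "Morning" and "morning " count as the same guess.
        self.votes[guess.strip().lower()] += 1

    def resolved(self):
        """Return the agreed text once one guess has enough votes, else None."""
        if not self.votes:
            return None
        text, count = self.votes.most_common(1)[0]
        return text if count >= self.promote_after else None


def check_response(known_text: str, typed_known: str,
                   unknown: UnknownWord, typed_unknown: str) -> bool:
    """Pass or fail the user on the known word only; if they pass,
    count their reading of the unknown word as one vote."""
    passed = typed_known.strip().lower() == known_text.lower()
    if passed:
        unknown.record(typed_unknown)
    return passed
```

The design choice worth noting is that the user is graded only on the word the system already knows; the answer for the other word is trusted just enough to count as a vote, and the word moves to the 'known' category only after several independent users agree.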

1 Answer