I would like to create an algorithm that can detect credit card numbers (CCNs) in various types of files.
The simplest way to find CCNs is to use regular expressions such as these:
- Visa:
  ^4[0-9]{12}(?:[0-9]{3})?$
  All Visa card numbers start with a 4. New cards have 16 digits. Old cards have 13.
- MasterCard:
  ^5[1-5][0-9]{14}$
  All MasterCard numbers start with the numbers 51 through 55. All have 16 digits.
- American Express:
  ^3[47][0-9]{13}$
  American Express card numbers start with 34 or 37 and have 15 digits.
- Diners Club:
  ^3(?:0[0-5]|[68][0-9])[0-9]{11}$
  Diners Club card numbers begin with 300 through 305, 36 or 38. All have 14 digits. There are Diners Club cards that begin with 5 and have 16 digits. These are a joint venture between Diners Club and MasterCard, and should be processed like a MasterCard.
- Discover:
  ^6(?:011|5[0-9]{2})[0-9]{12}$
  Discover card numbers begin with 6011 or 65. All have 16 digits.
- JCB:
  ^(?:2131|1800|35\d{3})\d{11}$
  JCB cards beginning with 2131 or 1800 have 15 digits. JCB cards beginning with 35 have 16 digits.
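To make the setup concrete, here is a minimal Python sketch of how these patterns could be applied to a candidate string. The names `CARD_PATTERNS` and `card_brand` are my own, made up for illustration. Note that because every pattern is anchored with `^` and `$`, it only matches when the whole string is the bare digit sequence, so a number written with spaces or dashes is not recognised.

```python
import re

# Brand patterns copied verbatim from the list above.
# The dict name and the helper below are illustrative, not from any library.
CARD_PATTERNS = {
    "Visa": re.compile(r"^4[0-9]{12}(?:[0-9]{3})?$"),
    "MasterCard": re.compile(r"^5[1-5][0-9]{14}$"),
    "American Express": re.compile(r"^3[47][0-9]{13}$"),
    "Diners Club": re.compile(r"^3(?:0[0-5]|[68][0-9])[0-9]{11}$"),
    "Discover": re.compile(r"^6(?:011|5[0-9]{2})[0-9]{12}$"),
    "JCB": re.compile(r"^(?:2131|1800|35\d{3})\d{11}$"),
}


def card_brand(candidate: str):
    """Return the first brand whose pattern matches the whole string, else None."""
    for brand, pattern in CARD_PATTERNS.items():
        if pattern.match(candidate):
            return brand
    return None
```

For example, `card_brand("4111111111111111")` gives `"Visa"`, while the same number written as `"4111 1111 1111 1111"` gives `None`.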
We can then check each number found with the Luhn mod-10 algorithm, and if it passes, we can say that we have found a CCN.
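For reference, the Luhn check I mean can be sketched like this in Python. The function name `luhn_valid` is just my choice, and it assumes the candidate has already been reduced to a plain string of ASCII digits.

```python
def luhn_valid(number: str) -> bool:
    """Return True if a string of ASCII digits passes the Luhn mod-10 check."""
    # Reject empty strings and anything that is not purely ASCII digits.
    if not (number.isascii() and number.isdigit()):
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for index, char in enumerate(reversed(number)):
        digit = int(char)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0
```

The well-known test number `4111111111111111` passes this check, and changing its last digit to `2` makes it fail.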
But in my experience this simple method produces a very high number of false positives and false negatives.
What algorithms or heuristics could be used to reduce the false positives and false negatives? Advanced software such as PCI Data Finder or Card Recon gives more reliable results, and those results are surely not achieved by simple regular-expression matching and a Luhn check alone.