5
votes

[Background Story]

I am working with a 5 year old user identification system, and I am trying to add IDs to the database. The problem I have is that the system that reads the ID numbers requires some sort of checksum, and no-one working here now has ever worked with it, so no-one knows how it works.

I have access to the list of existing IDs, which already have correct checksums. Also, as the checksum only has 16 possible values, I can create any ID I want and run it through the authentication system up to 16 times until I get the correct checksum (but this is quite time consuming)

[Question]

What methods can I use to help guess the checksum algorithm of used for some data? I have tried a few simple methods such as XORing and summing, but these have not worked.

So my question is: if I have data (in hexadecimal) like this:

data        checksum
00029921    1
00013481    B
00026001    3
00004541    8

What methods can I use work out what sort of checksum is used? i.e. should I try sequential numbers such as 00029921,00029922,00029923,... or 00029911,00029921,00029931,... If I do this what patterns should I look for in the changing checksum?

Similarly, would comparing swapped digits tell me anything useful about the checksum? i.e. 00013481 and 00031481

Is there anything else that could tell me something useful? What about inverting one bit, or maybe one hex digit?

I am assuming that this will be a common checksum algorithm, but I don't know where to start in testing it. I have read the following links, but I am not sure if I can apply any of this to my case, as I don't think mine is a CRC.

stackoverflow.com/questions/149617/how-could-i-guess-a-checksum-algorithm stackoverflow.com/questions/2896753/find-the-algorithm-that-generates-the-checksum cosc.canterbury.ac.nz/greg.ewing/essays/CRC-Reverse-Engineering.html

[ANSWER]

I have now downloaded a much larger list of data, and it turned out to be simpler than I was expecting, but for completeness, here is what I did.

data:

00024901 A
00024911 B
00024921 C
00024931 D
00042811 A
00042871 0
00042881 1
00042891 2
00042901 A
00042921 C
00042961 0
00042971 1
00042981 2
00043021 4
00043031 5
00043041 6
00043051 7
00043061 8
00043071 9
00043081 A
00043101 3
00043111 4
00043121 5
00043141 7
00043151 8
00043161 9
00043171 A
00044291 E

From these, I could see that when just one value was increased by a value, the checksum was also increased by the same value as in:

00024901 A
00024911 B

Also, two digits swapped did not change the checksum:

00024901 A
00042901 A

This means that the polynomial value (for these two positions at least) must be the same

Finally, the checksum for 00000000 was A, so I calculated the sum of digits plus A mod 16:
( (Σxi) +0xA )mod16
And this matched for all the values I had. Just to check that there was nothing sneaky going on with the first 3 digits that never changed in my data, I made up and tested some numbers as Eric suggested, and those all worked with this too!

1
Could you post some more data, for instance a few sequential numbers with checksums?Matti Virkkunen
What's the checksum for 00000000?James
Matti- yes, I'll get some more (sequential) numbers off the database later. James - I was trying to avoid making up lots of numbers if I could, because it takes a while to get the checksum, but 00000000 is probably a very good one to start with!Ed.C

1 Answers

8
votes

Many checksums I've seen use simple weighted values based on the position of the digits. For example, if the weights are 3,5,7 the checksum might be 3*c[0] + 5*c[1] + 7*c[2], then mod 10 for the result. (In your case, mod 16, since you have 4 bit checksum)

To check if this might be the case, I suggest that you feed some simple values into your system to get an answer:

1000000 = ?
0100000 = ?
0010000 = ?

... etc. If there are simple weights based on position, this may reveal it. Even if the algorithm is something different, feeding in nice, simple values and looking for patterns may be enlightening. As Matti suggested, you/we will likely need to see more samples before decoding the pattern.