0
votes

Posts: 1 md5 hash issue Options 9m ago

Hi all I'm a SAS newbie and need some help with the following issue. I'm experimenting with the md5 hash function and testing it on the sashelp.shoes dataset.

In a datastep, for each input observation, I'm creating a new variable that is a concatenation of each field using catx with comma as the field separator. I then use that variable as input to the md5 function and sure enough I get a hash value back which I print out to the log using PUT. I repeated this for each input line. However I then exported the shoes dataset to a CSV text file. The export enclosed the fields in double quotes and included $ and commas for the currency fields which I removed manually from the CSV file. So visually the the lines of the CSV looked exactly like the lines output from SAS. Next I wrote a little python script to read each line of the CSV text file and calculate a md5 hash for that. Unfortunately none of the hashes for the CSV file matched the hashes from SAS. Has anybody done something similar and if so can you tell me where I'm going wrong. I know the python code is correct as I checked the results using the built-in md5 checker in windows.

When I get into work again tomorrow I'll post some of the code I'm using if that helps. Meanwhile if anyone can help that would be appreciated

I'm using enterprise guide V4.3

As promised , here is the SAS code I'm using and the first few output records I'm getting.


/* concatenate all fields of a dataset and compute a checksum */

proc sql;

select name into :varstr2 separated by ','

from dictionary.columns

where libname = "SASHELP" and

memname = "SHOES";

quit;

 

 

data stuff(drop=check all);

format check $hex32.;

set sashelp.shoes end=end1;

newvar2 = catx(',',&varstr2);

all = catx(',',&varstr2);



check = md5(all);

put all;

put check;

run;

Africa,Boot,Addis Ababa,12,29761,191821,769
0F7503F59119E8248D89ED645F886871
Africa,Men's Casual,Addis Ababa,4,67242,118036,2284
8066D31E7C2A254EAB127C121B526DF7
Africa,Men's Dress,Addis Ababa,7,76793,136273,2433
653E4A1DF8B5708DF9C8B97587A1E981
Africa,Sandal,Addis Ababa,10,62819,204284,1861
D59E63E5319B4E3018F28D46A4CED9F9
Africa,Slipper,Addis Ababa,14,68641,279795,1771
1612FC1FE23B55078B7693ECE1E6D028

Now here is the python code and the same output records I'm getting for that:

import hashlib

filename = "f:/test/shoes.csv"
md5_hash = hashlib.md5()
with open(filename,"r") as f:
    for x in f:
        result=hashlib.md5(x.encode('utf-8'))
        print (x)
        print(result.hexdigest())
Africa,Boot,Addis Ababa,12,29761,191821,769

7001aaebd146b10aaed951cb692c6c4b
Africa,Men's Casual,Addis Ababa,4,67242,118036,2284

916a0c39554b70d691d03c71e8daa763
Africa,Men's Dress,Addis Ababa,7,76793,136273,2433

ea9e85e9843d3bb02206bc0ba7c3d5d4
Africa,Sandal,Addis Ababa,10,62819,204284,1861

5865cfc5d443b5a2e0038c573b5b6fb9
Africa,Slipper,Addis Ababa,14,68641,279795,1771

0226115fb928f326044ca43e186ae23a2

Update. I was thinking it might be something to do with newlines/linefeeds on the python side so changed my code to just look at the first input string in isolation.

import hashlib

x="Africa,Boot,Addis Ababa,12,29761,191821,769"
md5_hash = hashlib.md5()
result=hashlib.md5(x.encode('utf-8')) 
print (x)
print(result.hexdigest())

Africa,Boot,Addis Ababa,12,29761,191821,769
65d38fa13c098fc3959b1eb0c19b0427

Hmmm, still doesn't match with the SAS version
1
A single extra space can cause your MD5 to not match so if you did a manual fix I wouldn't trust the results. Instead, modify your export process to export how you want. I will say that it is standard to include the quotes around text fields to make it easier to read, so you may want to leave those it to avoid issues with your actual data. It's massive pain to read currency data for example if it does have comma's and no quotes.Reeza

1 Answers

1
votes

The differences come from spaces. If you have a string in SAS that is defined as say, 10 characters long, and if you just assign with a string which is just 5 bytes long, then there will still be 5 extra spaces in that variable.

For e.g.:

data sha256 (obs = 2);
  set sashelp.shoes;

  concatenated = strip(Region) || strip(Product) || strip(Subsidiary) || strip(put(Stores,8.));
  shahash = lowcase(put(sha256(concatenated), $hex64.));

run;

This gives :

enter image description here

Now when I modify this to (note the additional strip function that removes all extra spaces on both sides of the concatenated string):

data md5;
  set sashelp.shoes (obs = 2);

  concatenated = strip(Region) || strip(Product) || strip(Subsidiary) || strip(put(Stores,8.));
  shahash = lowcase(put(sha256(strip(concatenated)), $hex64.));

run;

This gives:

enter image description here

The variable concatenated is 59 bytes long (25 from region + 14 from product + 12 from subsidiary + 8 from stores). However the different records different lengths because of their exact contents. Applying strip on the concatenated string results in the hash generated only on the exact contents. This matches with Python:

comb = [b"AfricaBootAddis Ababa12", b"AfricaMen's CasualAddis Ababa4"]

for item in comb:
    print(str(sha256(item).hexdigest()))


62e548b48b547b8dd112f1440d55db70fd8219e864f571ec58a84400efdba0c0
7b1f64aca891316fd7047d4b39917ee625668c26507b70358c3927f066938ecd