Python. Identity in sets of objects. And hashing

1

votes

How do __hash__ and __eq__ use in identification in sets? For example some code that should help to solve some domino puzzle:

class foo(object):
    def __init__(self, one, two):
        self.one = one
        self.two = two
    def __eq__(self,other):
        if (self.one == other.one) and (self.two == other.two): return True
        if (self.two == other.one) and (self.one == other.two): return True
        return False
    def __hash__(self):
        return hash(self.one + self.two)

s = set()

for i in range(7):
    for j in range(7):
        s.add(foo(i,j))
len(s) // returns 28 Why?

If i use only __eq__() len(s) equals 49. Its ok because as i understand objects (1-2 and 2-1 for example) not same, but represent same domino. So I have added hash function.
Now it works the way i want, but i did not understand one thing: hash of 1-3 and 2-2 should be same so they should counted like same object and shouldn't added to set. But they do! Im stuck.

pythonobjecthashsetidentity

4

votes

Equality for dict/set purposes depends on equality as defined by __eq__. However, it is required that objects that compare equal have the same hash value, and that is why you need __hash__. See this question for some similar discussion.

The hash itself does not determine whether two objects count as the same in dictionaries. The hash is like a "shortcut" that only works one way: if two objects have different hashes, they are definitely not equal; but if they have the same hash, they still might not be equal.

In your example, you defined __hash__ and __eq__ to do different things. The hash depends only on the sum of the numbers on the domino, but the equality depends on both individual numbers (in order). This is legal, since it is still the case that equal dominoes have equal hashes. However, like I said above, it doesn't mean that equal-sum dominoes will be considered equal. Some unequal dominoes will still have equal hashes. But equality is still determined by __eq__, and __eq__ still looks at both numbers, in order, so that's what determines whether they are equal.

It seems to me that the appropriate thing to do in your case is to define both __hash__ and __eq__ to depend on the ordered pair --- that is, first compare the greater of the two numbers, then compare the lesser. This will mean that 2-1 and 1-2 will be considered the same.

2

votes

The hash is only a hint to help Python arrange the objects. When looking for some object foo in a set, it still has to check each object in the set with the same hash as foo.

It's like having a bookshelf for every letter of the alphabet. Say you want to add a new book to your collection, only if you don't have a copy of it already; you'd first go to the shelf for the appropriate letter. But then you have to look at each book on the shelf and compare it to the one in your hand, to see if it's the same. You wouldn't discard the new book just because there's something already on the shelf.

If you want to use some other value to filter out "duplicates", then use a dict that maps the domino's total value to the first domino you saw. Don't subvert builtin Python behavior to mean something entirely different. (As you've discovered, it doesn't work in this case, anyway.)

1

votes

The requirement for hash functions is that if x == y for two values, then hash(x) == hash(y). The reverse need not be true.

You can easily see why this is the case by considering hashing of strings. Lets say that hash(str) returns a 32-bit number, and we are hashing strings longer than 4 characters long (i.e. contain more than 32 bits). There are more possible strings than there are possible hash values, so some non-equal strings must share the same hash (this is an application of the pigeonhole principle).

Python sets are implemented as hash tables. When checking whether an object is a member of the set, it will call its hash function and use the result to pick a bucket, and then use the equality operator to see if it matches any of the items in the bucket.

With your implementation, the 2-2 and 1-3 dominoes will end up in the hash bucket, but they don't compare equal. Therefore, the both can be added to the set.

0

votes

You can read about this in the Python data model documentation, but the short answer is that you can rewrite your hash function as:

def __hash__(self):
    return hash(tuple(sorted((self.one, self.two))))

0

votes

I like the sound of the answer provided by Eevee, but I had difficulty imagining an implementation. Here's my interpretation, explanation and implementation of the answer provided by Eevee.

Use the sum of two domino values as dictionary the key.
Store either of the domino values as the dictionary value.

For example, given the domino '12', the sum is 3, and therefore the dictionary key will be 3. We can then pick either value (1 or 2) to store in that position (we'll pick the first value, 1).

domino_pairs = {}
pair = '12'
pair_key = sum(map(int, pair))
domino_pairs[pair_key] = int(pair[0]) # Store the first pair's first value.
print domino_pairs

Outputs:

{3: '1'}

Although we're only storing a single value from the domino pair, the other value can easily be calculated from the dictionary key and value:

pair = '12'
pair_key = sum(map(int, pair))
domino_pairs[pair_key] = int(pair[0]) # Store the first pair's first value.
# Retrieve pair from dictionary.
print pair_key - domino_pairs[pair_key] # 3-1 = 2

Outputs:

But, since two different pairs may have the same total, we need to store multiple values against a single key. So, we store a list of values against a single key (i.e. sum of two pairs). Putting this into a function:

def add_pair(dct, pair):
  pair_key = sum(map(int, pair))
  if pair_key not in dct:
    dct[pair_key] = []
  dct[pair_key].append(int(pair[0]))

domino_pairs = {}
add_pair(domino_pairs, '22')
add_pair(domino_pairs, '04')
print domino_pairs

Outputs:

{4: [2, 0]}

This makes sense. Both pairs sum to 4, yet the first value in each pair differs, so we store both. The implementation so far will allow duplicates:

domino_pairs = {}
add_pair(domino_pairs, '40')
add_pair(domino_pairs, '04')
print domino_pairs

Outputs

   {4: [4, 0]}

'40' and '04' are the same in Dominos, so we don't need to store both. We need a way of checking for duplicates. To do this we'll define a new function, has_pair:

 def has_pair(dct, pair):
  pair_key = sum(map(int, pair))
  if pair_key not in dct:
    return False
  return (int(pair[0]) in dct[pair_key] or 
    int(pair[1]) in dct[pair_key])

As normal, we get the sum (our dictionary key). If it it's not in the dictionary, then the pair cannot exist. If it is in the dictionary, we must check to see if either value in our pair exist in the dictionary 'bucket'. Let's insert this check into add_pair, and so we don't add duplicate domino pairs:

def add_pair(dct, pair):
  pair_key = sum(map(int, pair))
  if has_pair(dct, pair):
    return
  if pair_key not in dct:
    dct[pair_key] = []
  dct[pair_key].append(int(pair[0]))

Now adding duplicate domino pairs works correctly:

domino_pairs = {}
add_pair(domino_pairs, '40')
add_pair(domino_pairs, '04')
print domino_pairs

Outputs:

{4: [4]}

Lastly, a print function shows how from storing only the sum of a domino pair, and a single value from the same pair, is the same as storing the pair itself:

def print_pairs(dct):
  for total in dct:
    for a in dct[total]:
      a = int(a)
      b = int(total) - int(a)
      print '(%d, %d)'%(a,b)

Testing:

domino_pairs = {}
add_pair(domino_pairs, '40')
add_pair(domino_pairs, '04')
add_pair(domino_pairs, '23')
add_pair(domino_pairs, '50')
print_pairs(domino_pairs)

Outputs:

(4, 0)
(2, 3)
(5, 0)

Python. Identity in sets of objects. And hashing

5 Answers