10
votes

I have some data that looks something like this:

ID1 ID2 ID3  
ID1 ID4 ID5  
ID3 ID5 ID7 ID6  
...  
...  

where each row is a group.

My goal is to have a dictionary for each ID, followed by a set of the other IDs that share >= 1 group with it.

For example, this data would return {ID1: [ID2, ID3, ID4, ID5], ID2:[ID1, ID3] ... }

I can think of 3 options for this, and I'm wondering which is (generally) best:

  1. Check whether an ID is already in the list before adding it
  2. Create sets instead of lists, and add each ID to the set
  3. Add all IDs to the list, then convert all of the lists to sets at the end.
4
Best in what sense. speed? There's always the timeit module... - martineau

4 Answers

7
votes

Option 2 sounds the most logical to me, especially with a defaultdict it should be fairly easy to do :)

import pprint
import collections

data = '''ID1 ID2 ID3
ID1 ID4 ID5
ID3 ID5 ID7 ID6'''

groups = collections.defaultdict(set)

for row in data.split('\n'):
    cols = row.split()
    for groupcol in cols:
        for col in cols:
            if col is not groupcol:
                groups[groupcol].add(col)

pprint.pprint(dict(groups))

Results:

{'ID1': set(['ID2', 'ID3', 'ID4', 'ID5']),
 'ID2': set(['ID1', 'ID3']),
 'ID3': set(['ID1', 'ID2', 'ID5', 'ID6', 'ID7']),
 'ID4': set(['ID1', 'ID5']),
 'ID5': set(['ID1', 'ID3', 'ID4', 'ID6', 'ID7']),
 'ID6': set(['ID3', 'ID5', 'ID7']),
 'ID7': set(['ID3', 'ID5', 'ID6'])}
6
votes

TL;DR: Go with option 2. Just use sets from the start.

In Python, sets are hash-sets, and lists are dynamic arrays. Inserting is O(1) for both, but checking if an element exists is O(n) for the list and O(1) for the set.

So option 1 is immediately out. If you are inserting n items and need to check the list every time, then the overall complexity becomes O(n^2).

Options 2 and 3 are both optimal at O(n) overall. Option 2 might be faster in micro-benchnarks because you don't need to move objects between collections. In practice, choose the option that is easier to read and maintain in your specific circumstance.

0
votes

i agree with the previous analysis that option B is best, but a micro benchmark is often illuminating in these situations:

import time

class Timer(object):
  def __init__(self, desc):
    self.desc = desc
  def __enter__(self):
    self.start = time.time()
  def __exit__(self, type, value, traceback):
    self.finish = time.time()
    print self.desc, 'took', self.finish - self.start

data = list(range(4000000))

with Timer('option 2'):
  myset = set()
  for x in data: myset.add(x)

with Timer('option 3'):
  mylist = list()
  for x in data: mylist.append(x)
  myset = set(mylist)

the results were surprising to me:

$ python test.py 
option 2 took 0.652163028717
option 3 took 0.748883008957

i would have expected at least a 2x speed difference.

0
votes

So, I timed a few different options, and after a few iterations, came up with the following strategies. I thought that sets2 would be the winner, but listToSet2 was faster for every single type of group.

All of the functions except for listFilter were in the same ballpark - listFilter was much slower.

import random
import collections

small = [[random.randint(1,25) for _ in range(5)] for i in range(100)]
medium = [[random.randint(1,250) for _ in range(5)] for i in range(1000)]
mediumLotsReps = [[random.randint(1,25) for _ in range(5)] for i in range(1000)]
bigGroups = [[random.randint(1,250) for _ in range(75)] for i in range(100)]
huge = [[random.randint(1,2500) for _ in range(5)] for i in range(10000)]

def sets(groups):
    results = collections.defaultdict(set)
    for group in groups:
        for i in group:
            for j in group:
                if i is not j:
                    results[i].add(j)
    return results

def listToSet(groups):
    results = collections.defaultdict(list)
    for group in groups:
        for i,j in enumerate(group):
            results[j] += group[:i] + group[i:]
    return {k:set(v) for k, v in results.iteritems()}

def listToSet2(groups):
    results = collections.defaultdict(list)
    for group in groups:
        for i,j in enumerate(group):
            results[j] += group
    return {k:set(v)-set([k]) for k, v in results.iteritems()}

def sets2(groups):
    results = collections.defaultdict(set)
    for group in groups:
        for i in group:
            results[i] |= set(group)
    return {k:v - set([k]) for k, v in results.iteritems()}

def listFilter(groups):
    results = collections.defaultdict(list)
    for group in groups:
        for i,j in enumerate(group):
            filteredGroup = group[:i] + group[i:]
            results[j] += ([k for k in filteredGroup if k not in results[j]])
    return results