3
votes

This question is an extension of a previous question: rebuild python array based on common elements - but different enough to warrant a new question:

I've been struggling with this for a bit now. My data is an array of dictionaries from an sql query. Each element in the array represents a shipment, and there are common values based on the keys.

data = [
    {"CustName":"customer1", "PartNum":"part1", "delKey":"0001", "qty":"10", "memo":"blah1"},
    {"CustName":"customer1", "PartNum":"part1", "delKey":"0002", "qty":"10", "memo":"blah2"},
    {"CustName":"customer1", "PartNum":"part1", "delKey":"0003", "qty":"10", "memo":"blah3"},
    {"CustName":"customer2", "PartNum":"part3", "delKey":"0004", "qty":"20", "memo":"blah4"},
    {"CustName":"customer2", "PartNum":"part3", "delKey":"0005", "qty":"20", "memo":"blah5"},
    {"CustName":"customer3", "PartNum":"partXYZ", "delKey":"0006", "qty":"50", "memo":"blah6"},
    {"CustName":"customer3", "PartNum":"partABC", "delKey":"0007", "qty":"100", "memo":"blah7"}]

The output I want is grouped according to specific keys

dataOut = [
   {"CustName":"customer1", "Parts":[
        {"PartNum":"part1", "deliveries":[
            {"delKey":"0001", "qty":"10", "memo":"blah1"},
            {"delKey":"0002", "qty":"10", "memo":"blah2"},
            {"delKey":"0003", "qty":"10", "memo":"blah3"}]}]},
   {"CustName":"customer2", "Parts":[
        {"PartNum":"part3", "deliveries":[
            {"delKey":"0004", "qty":"20", "memo":"blah4"},
            {"delKey":"0005", "qty":"20", "memo":"blah5"}]}]},
   {"CustName":"customer3", "Parts":[
        {"PartNum":"partXYZ", "deliveries":[
            {"delKey":"0006", "qty":"50", "memo":"blah6"}]},
        {"PartNum":"partABC", "deliveries":[
            {"delKey":"0007", "qty":"100", "memo":"blah7"}]}]}]

I can get the grouping with a single level using defaultdict and list comprehension as provided by the previous question and modified slightly

d = defaultdict(list)
for item in data:
    d[item['CustName']].append(item)
print([{'CustName': key, 'parts': value} for key, value in d.items()])

But I can't seem to get the second level in the output array - the grouping b the PartNum key. Through some research, I think what I need to do is use defaultdict as the type of the outer `defaultdict' like so:

d = defaultdict(defaultdict(list))

which throws errors because defaultdict returns a function, so I need to use lambda (yes?)

d = defaultdict(lambda:defaultdict(list))
for item in data:
    d[item['CustName']].append(item) <----this?

My question is how to "access" the second level array in the loop and tell the "inner" defaultdict what to group on (PartNum)? The data comes to me from the database programmer and the project keeps evolving to add more and more data (keys), so I'd like this solution to be as general as possible in case more data gets thrown my way. I was hoping to be able to "chain" the defaultdicts depending on how many levels I need to go. I'm learning as I'm going, so I'm struggling trying to understand the lambda and the basics of the defaultdict type and where to go from here.

4
Sort the dict then apply groupby. Or do it beforehand with SQL if you can. I cannot help you more now, I am on a mobile phone... - Pynchia
Can there be two delKey's with the same number/value for a PartNum? - wwii
How many items in your actual data? - wwii
Potentially tens of thousands in the original data set. - guidoc
Do you care about the order of the values in the lists in your output? If not, you could easily get rid of those levels and make your structure just a nested set of dictionaries. Tree = lambda: defaultdict(Tree) is all the setup you need for that kind of structure. - Blckknght

4 Answers

2
votes

Using groupby as suggested by @Pynchia and using sorted for unordered data as suggested by @hege_hegedus:

from itertools import groupby
dataOut = []
dataSorted = sorted(data, key=lambda x: (x["CustName"], x["PartNum"]))
for cust_name, cust_group in groupby(dataSorted, lambda x: x["CustName"]):
    dataOut.append({
        "CustName": cust_name,
        "Parts": [],
    })
    for part_num, part_group in groupby(cust_group, lambda x: x["PartNum"]):
        dataOut[-1]["Parts"].append({
            "PartNum": part_num,
            "deliveries": [{
                "delKey": delivery["delKey"],
                "memo": delivery["memo"],
                "qty": delivery["qty"],
            } for delivery in part_group]
        })

If you look at the second for loop, this will hopefully answer your question about accessing the second level array in the loop.

2
votes

You could use a tree-like data structure based on an OrderedDefaultdict instead of a defaultdict(list). (The definition's from an unrelated answer of mine.)

from collections import OrderedDict

class OrderedDefaultdict(OrderedDict):
    def __init__(self, *args, **kwargs):
        if not args:
            self.default_factory = None
        else:
            if not (args[0] is None or callable(args[0])):
                raise TypeError('first argument must be callable or None')
            self.default_factory = args[0]
            args = args[1:]
        super(OrderedDefaultdict, self).__init__(*args, **kwargs)

    def __missing__ (self, key):
        if self.default_factory is None:
            raise KeyError(key)
        self[key] = default = self.default_factory()
        return default

Tree = lambda: OrderedDefaultdict(Tree)

d = Tree()
for rec in data:
    custName, partNum, delKey = rec['CustName'], rec['PartNum'], rec['delKey']
    details = {"qty": rec["qty"], "memo": rec["memo"]}
    d[custName]['Parts'][partNum]['deliveries'][delKey] = details

So, for the data shown in your question, d would end up containing:

d = {
    "customer1": {
        "Parts": {
            "part1": {
                "deliveries": {"0001": {"memo": "blah1", "qty": "10"},
                               "0002": {"memo": "blah2", "qty": "10"},
                               "0003": {"memo": "blah3", "qty": "10"}}}}},
    "customer2": {
        "Parts": {
            "part3": {
                "deliveries": {"0004": {"memo": "blah4", "qty": "20"},
                               "0005": {"memo": "blah5", "qty": "20"}}}}},
    "customer3": {
        "Parts": {
            "partXYZ": {
                "deliveries": {"0006": {"memo": "blah6", "qty": "50"}}},
            "partABC": {
                "deliveries": {"0007": {"memo": "blah7", "qty": "100"}}}}}
}

Which could just simply be printed out since it's now grouped the way you want.

0
votes

Sort by "CustName", "PartNum", "delKey". Iterate over the delivery items for each part, for each customer and accumulate to match your output spec.

I like to use operator.itemgetter - for me it makes things clearer.

import collections, itertools, operator

cust_name = operator.itemgetter('CustName')
part_num = operator.itemgetter('PartNum')
group_sort = operator.itemgetter('CustName', 'PartNum', 'delKey')
del_key = operator.itemgetter('delKey')
qty = operator.itemgetter('qty')
memo = operator.itemgetter('memo')


# sort on the relavent keys
data.sort(key = group_sort)
result = []

# iterate over customers
for custname, group1 in itertools.groupby(data, cust_name):
    cust_dict = {'CustName' : custname, 'Parts': []}
    # iterate over parts for this customer
    for partnum, group2 in itertools.groupby(group1, part_num):
        part_dict = {"PartNum" : partnum, 'deliveries' : []}
        # iterate over delivery items for this part
        for thing in group2:
            part_dict['deliveries'].append({'delKey':del_key(thing),
                                            'qty':qty(thing),
                                            'memo':memo(thing)})
        cust_dict['Parts'].append(part_dict)
    result.append(cust_dict)

This clearly iterates over the items in the original data multiple times which may be a performance hit -- but I don't see a way around multiple iteration for what you need to do.

0
votes

This is the prettiest way I could do it. It uses the same defaultdict idea to implement proper grouping, as python's builtin groupby function only works on ordered data.

Note that this version will mutate the items in the input dataset, so the leaf items in the result are the same dict instances as the input, but with "CustName" and "PartNum" entries deleted.

from collections import defaultdict

def groupby_mutate(seq, key):
  d = defaultdict(list)
  for item in seq:
    d[item[key]].append(item)
    del item[key]
  return d

def your_operation(data):
  return [ {
    'CustName': CustName,
    'Parts': [ { 
      'PartNum': PartNum,
      'deliveries': deliveries
    } for PartNum,deliveries in groupby_mutate(custItems, 'PartNum').items() ]
  } for CustName,custItems in groupby_mutate(data, 'CustName').items() ]


# try it
from pprint import *
data = [
    {"CustName":"customer1", "PartNum":"part1", "delKey":"0001", "qty":"10", "memo":"blah1"},
    {"CustName":"customer1", "PartNum":"part1", "delKey":"0002", "qty":"10", "memo":"blah2"},
    {"CustName":"customer1", "PartNum":"part1", "delKey":"0003", "qty":"10", "memo":"blah3"},
    {"CustName":"customer2", "PartNum":"part3", "delKey":"0004", "qty":"20", "memo":"blah4"},
    {"CustName":"customer2", "PartNum":"part3", "delKey":"0005", "qty":"20", "memo":"blah5"},
    {"CustName":"customer3", "PartNum":"partXYZ", "delKey":"0006", "qty":"50", "memo":"blah6"},
    {"CustName":"customer3", "PartNum":"partABC", "delKey":"0007", "qty":"100", "memo":"blah7"}
]

pprint(your_operation(data))

EDIT:

Just in the case somebody needs it in the future, here is a version that does not mutate the original data:

from collections import defaultdict

def groupby_getitem(seq, key):
  d = defaultdict(list)
  for item in seq:
    d[item[key]].append(item)
  return d

def your_operation(data):
  return [ {
    'CustName': CustName,
    'Parts': [ { 
      'PartNum': PartNum,
      'deliveries': [ dict(
        (k,v) for k,v in delivery.items() if not k in ['CustName', 'PartNum']
      ) for delivery in deliveries ]
    } for PartNum,deliveries in groupby_getitem(custItems, 'PartNum').items() ]
  } for CustName,custItems in groupby_getitem(data, 'CustName').items() ]