1
votes

I have records in a collection of the following format:

"_id" : "2013-05-23",
    "authors_who_sold_books" : [
        {
            "id" : "Charles Dickens",
            "num_sold" : 1,
            "customers" : [
                {
                   "time_bought" : 1368627290,
                   "customer_id" : 9715923
                }
            ]
        },
        {
            "id" : "JRR Tolkien",
            "num_sold" : 2,
            "customers" : [
                {
                    "date_bought" : 1368540890,
                    "customer_id" : 9872345
                },
                {
                    "date_bought" : 1368537290,
                    "customer_id" : 9163893
                }
            ]
        }
    ]
}

There is a record for each date, many of which will contain the same author. I'm after a query that returns the following:

{
    "_id" : "Charles Dickens",
    "num_sold" : 235,
    "customers" : [
        {
            "date_bought" : 1368627290,
            "customer_id" : 9715923
        },
        {
            "date_bought" : 1368622358,
            "customer_id" : 9876234
        },
        etc...
    ]
}

I've tried various combinations of aggregate, group, unwind and project but still can't quite get there and would really appreciate any suggestions.

For extra points, I'm actually doing this using the Ruby gem, so code specific to this would be great. I can convert normal MongoDB query language, however.

1
Did you try MapReduce?Philipp
The biggest problems I'm seeing is with how the documents are stored themselves. Is changing the structure of the data an option? The reason I ask, is if you have the _ID field set to the date, and each date document has a customers array, why do you need to store the date again in the customers array? Also, documents have a 16mb limit on their size, so if there's millions of sales in a single day, you may exceed that document size limit. I think it would be easier (again, speculating) if each sale was it's own record, then you could use the aggregation framework to create what you're seeking.Jesta
No, I haven't tried MapReduce yet...Serenthia
@Jesta Thanks for the feedback. The data is from a daily data dump. In my real situation (not looking at books and authors!) the id will actually be something along the lines of "2013-05-17_emails", "2013-05-17_banner-ads" etc, so contain a little more information. Also, the "date_bought" field is actually a timestamp, so I'll alter my question to better reflect this. I can also guarantee that the number of records will never be in the realm of millions! Thank you.Serenthia

1 Answers

9
votes

I took your sample data, slightly modified for a 2nd document, and then added them to a test collection. The documents I used are as follows:

{
    "_id" : "2013-05-23",
    "authors_who_sold_books" : [
        {
            "id" : "Charles Dickens",
            "num_sold" : 1,
            "customers" : [
                {
                    "time_bought" : 1368627290,
                    "customer_id" : 9715923
                }
            ]
        },
        {
            "id" : "JRR Tolkien",
            "num_sold" : 2,
            "customers" : [
                {
                    "date_bought" : 1368540890,
                    "customer_id" : 9872345
                },
                {
                    "date_bought" : 1368537290,
                    "customer_id" : 9163893
                }
            ]
        }
    ]
}
{
    "_id" : "2013-05-21",
    "authors_who_sold_books" : [
        {
            "id" : "Charles Dickens",
            "num_sold" : 3,
            "customers" : [
                {
                    "time_bought" : 1368627290,
                    "customer_id" : 9715923
                },
                {
                    "time_bought" : 1368627290,
                    "customer_id" : 9715923
                },
                {
                    "time_bought" : 1368627290,
                    "customer_id" : 9715923
                }
            ]
        },
        {
            "id" : "JRR Tolkien",
            "num_sold" : 1,
            "customers" : [
                {
                    "date_bought" : 1368540890,
                    "customer_id" : 9872345
                }
            ]
        }
    ]
}

Now, to get your expected results, I used the aggregation framework and ran this query:

db.collection.aggregate([
    {
        // First we unwind all the authors that sold books
        $unwind: '$authors_who_sold_books',
    },
    {
        // Next, we unwind each of the customers that purchased a book
        $unwind: '$authors_who_sold_books.customers'
    },
    {
        // Now we group them by "Author Name" (hoping they are unique!)
        $group: {
            _id: '$authors_who_sold_books.id',
            // Increment the number sold by each author
            num_sold: {
                $sum: 1
            },
            // Add the customer data to the array
            customers: {
                $push: '$authors_who_sold_books.customers'
            }
        }
    }
]);

I tried to document the above code, so that way it made a bit more sense. Basically, it's unwinding the data twice to create a document for each sale by author. First unwind by authors_who_sold_books, then unwinding authors_who_sold_books.customers.

The next step was just grouping them up and pushing all the customers into the customers array, and incrementing the num_sold by 1 for every unwinded document we had.

The results come out looking like:

{
    "result" : [
        {
            "_id" : "JRR Tolkien",
            "num_sold" : 3,
            "customers" : [
                {
                    "date_bought" : 1368540890,
                    "customer_id" : 9872345
                },
                {
                    "date_bought" : 1368537290,
                    "customer_id" : 9163893
                },
                {
                    "date_bought" : 1368540890,
                    "customer_id" : 9872345
                }
            ]
        },
        {
            "_id" : "Charles Dickens",
            "num_sold" : 4,
            "customers" : [
                {
                    "time_bought" : 1368627290,
                    "customer_id" : 9715923
                },
                {
                    "time_bought" : 1368627290,
                    "customer_id" : 9715923
                },
                {
                    "time_bought" : 1368627290,
                    "customer_id" : 9715923
                },
                {
                    "time_bought" : 1368627290,
                    "customer_id" : 9715923
                }
            ]
        }
    ],
    "ok" : 1
}

Hope this helps you figure out your real solution :)