3
votes

I'm new to MongoDB and Mongoose and I'm trying to use it to save stock ticks for daytrading analysis. So I imagined this Schema:

symbolSchema = Schema({
    name:String,
    code:String
});

quoteSchema = Schema({
    date:{type:Date, default: now},
    open:Number, 
    high:Number,
    low:Number,
    close:Number,
    volume:Number
});

intradayQuotesSchema = Schema({
    id_symbol:{type:Schema.Types.ObjectId, ref:"symbol"},
    day:Date,
    quotes:[quotesSchema]
});

From my link I receive information like this every minute:

date | symbol | open | high | low | close | volume

2015-03-09 13:23:00|AAPL|127,14|127,17|127,12|127,15|19734

I have to:

  1. Find the ObjectId of the symbol (AAPL).
  2. Discover if the intradayQuote document of this symbol already exists (symbol and date combination)
  3. Discover if the minute OHLCV data of this symbol exists on the quotes array (because it could be repeated)
  4. Update or create the document and update or create the quotes inside the array

I'm able to accomplish this task without veryfing if the quotes already exists, but this method can creates repeated entries inside quotes array:

symbol.find({"code":mySymbol}, function(err, stock) {
    intradayQuote.findOneAndUpdate({
        { id_symbol:stock[0]._id, day: myDay },
        { $push: { quotes: myQuotes } },
        { upsert: true },
        myCallback
    });
});

I already tried:

  • $addToSet instead of $push, but unfortunatelly this doesn't seems to work with array of documents
  • { id_symbol:stock[0]._id, day: myDay, 'quotes["date"]': myDate } on the conditions of findOneAndUpdate; but unfortunatelly if mongo doesn't find it, it creates a new document for the minute instead of appending to the quotes array.

Is there a way to get this working without using one more query (I'm already using 2)? Should I rethink my Schema to facilitate this job? Any help will be appreciated. Thanks!

1
Sorry but it's not really clear here what you need to "update" in the the existing items in the array and how the other fields than your "unique" constraint there apply to those updates. No $addToSet will not work since a "partial distinct key" does not constitute a "set" of disctinct objects. So you need multiple queries. The actual pattern is not clear without knowing how the updates would be applied.Neil Lunn
Sorry if my explanation is confusing. I believe it's complicated and maybe it's not even possible. I need to update the array document item where date is equal to the date I'm trying to upsert (inside the intradayQuote model that has ObjectId equal to the symbol _id of my symbol code). If I can match those conditions, I'll update the entire document inside the array (all the keys: open, high...). If I can not match, I should create a new document inside array. And if the intradayQuote document also doesn't exists, I should create it before create the array item.Léo Muniz
So replace the whole array element if there is already that distinct "minute" field right? That was all I was asking. I also don't see the point of the outer query here. It seems if you a looping anything it's a list of "quotes" in the format you are showing and then transforming to your schema pattern. Is that what this code is really trying to do? Why is the embedded array important? Therefore how do you actually use the data otherwise that would make this an advantage? Nothing wrong with how you have presented, but you might be served better by breaking the task down to distinct questionsNeil Lunn
That's right. And thanks for your pacience. Actually I'm not sure on my schema. And I don't know yet how exactly I'll use that data. The point on having embedded array is: there are many symbols and every symbol has about 540 daily quotes. So I imagined to save a new document per symbol per day. Everyday I usually need for the last 2 or 3 days of quotes of each symbol to make my main analysis (but I'll need the long historical prices for sure). So I broke daily quotes in different documents to have a good performance. But I'm new to MongoDB and maybe there is a better way to do it.Léo Muniz

1 Answers

3
votes

Basically put an $addToSet operator cannot work for you because your data is not a true "set" by definition being a collection of "completely distinct" objects.

The other piece of logical sense here is that you would be working on the data as it arrives, either as a sinlge object or a feed. I'll presume its a feed of many items in some form and that you can use some sort of stream processor to arrive at this structure per document received:

{
    "date": new Date("2015-03-09 13:23:00.000Z"),
    "symbol": "AAPL",
    "open": 127.14
    "high": 127.17,
    "low": 127.12 
    "close": 127.15,
    "volume": 19734
}

Converting to a standard decimal format as well as a UTC date since any locale settings really should be the domain of your application once data is retrieved from the datastore of course.

I would also at least flatten out your "intraDayQuoteSchema" a little by removing the reference to the other collection and just putting the data in there. You would still need a lookup on insertion, but the overhead of the additional populate on read would seem to be more costly than the storage overhead:

intradayQuotesSchema = Schema({
    symbol:{
        name: String,
        code: String
    },
    day:Date,
    quotes:[quotesSchema]
});

It depends on you usage patterns, but it's likely to be more effective that way.

The rest really comes down to what is acceptable to

stream.on(function(data) {

    var symbol = data.symbol,
        myDay = new Date( 
            data.date.valueOf() - 
                ( data.date.valueOf() % 1000 * 60 * 60 * 24 ));
    delete data.symbol;

    symbol.findOne({ "code": symbol },function(err,stock) {

        intraDayQuote.findOneAndUpdate(
            { "symbol.code": symbol , "day": myDay },
            { "$setOnInsert": { 
               "symbol.name": stock.name
               "quotes": [data] 
            }},
            { "upsert": true }
            function(err,doc) {
                intraDayQuote.findOneAndUpdate(
                    {
                        "symbol.code": symbol,
                        "day": myDay,
                        "quotes.date": data.date
                    },
                    { "$set": { "quotes.$": data } },
                    function(err,doc) {
                        intraDayQuote.findOneAndUpdate(
                            {
                                "symbol.code": symbol,
                                "day": myDay,
                                "quotes.date": { "$ne": data.date }
                            },
                            { "$push": { "quotes": data } },
                            function(err,doc) {

                            }
                       );    
                    }
                );
            }
        );    
    });
});

If you don't actually need the modified document in the response then you would get some benefit by implementing the Bulk Operations API here and sending all updates in this package within a single database request:

stream.on("data",function(data) {

    var symbol = data.symbol,
        myDay = new Date( 
            data.date.valueOf() - 
                ( data.date.valueOf() % 1000 * 60 * 60 * 24 ));
    delete data.symbol;

     symbol.findOne({ "code": symbol },function(err,stock) {
         var bulk = intraDayQuote.collection.initializeOrderedBulkOp();
         bulk.find({ "symbol.code": symbol , "day": myDay })
             .upsert().updateOne({
                 "$setOnInsert": { 
                     "symbol.name": stock.name
                     "quotes": [data] 
                 }
             });

         bulk.find({
             "symbol.code": symbol,
             "day": myDay,
             "quotes.date": data.date
         }).updateOne({
             "$set": { "quotes.$": data }
         });

         bulk.find({
             "symbol.code": symbol,
             "day": myDay,
             "quotes.date": { "$ne": data.date }
         }).updateOne({
             "$push": { "quotes": data }
         });

         bulk.execute(function(err,result) {
             // maybe do something with the response
         });            
     });
});

The point is that only one of the statements there will actually modify data, and since this is all sent in the same request there is less back and forth between the application and server.

The alternate case is that it might just be more simple in this case to have the actual data referenced in another collection. This then just becomes a simple matter of processing upserts:

intradayQuotesSchema = Schema({
    symbol:{
        name: String,
        code: String
    },
    day:Date,
    quotes:[{ type: Schema.Types.ObjectId, ref: "quote" }]
});


// and in the steam processor

stream.on("data",function(data) {

    var symbol = data.symbol,
        myDay = new Date( 
            data.date.valueOf() - 
                ( data.date.valueOf() % 1000 * 60 * 60 * 24 ));
    delete data.symbol;

    symbol.findOne({ "code": symbol },function(err,stock) {
         quote.update(
            { "date": data.date },
            { "$setOnInsert": data },
            { "upsert": true },
            function(err,num,raw) {
                if ( !raw.updatedExisting ) {
                    intraDayQuote.update(
                        { "symbol.code": symbol , "day": myDay },
                        { 
                            "$setOnInsert": {
                                "symbol.name": stock.name
                            },
                            "$addToSet": { "quotes": data }
                        },
                        { "upsert": true },
                        function(err,num,raw) {

                        }
                    );
                }
            }
        );
    });
});

It really comes down to how important to you is it to have the data for quotes nested within the "day" document. The main distinction is if you want to query those documents based on the data some of those "quote" fields or otherwise live with the overhead of using .populate() to pull in the "quotes" from the other collection.

Of course if referenced and the quote data is important to your query filtering, then you can always just query that collection for the _id values that match and use an $in query on the "day" documents to only match days that contain those matched "quote" documents.

It's a big decision where it matters most which path you take based on how your application uses the data. Hopefully this should guide you on the general concepts behind doing what you want to achieve.

P.S Unless you are "sure" that your source data is always a date rounded to an exact "minute" then you probably want to employ the same kind of date rounding math as used to get the discrete "day" as well.