I have Elasticsearch documents like the ones below, where I need to correct the age value based on creationtime and currentdate:

age = currentdate - creationtime

hits = [
   {
      "_id":"CrRvuvcC_uqfwo-WSwLi",
      "creationtime":"2018-05-20T20:57:02",
      "currentdate":"2021-02-05 00:00:00",
      "age":"60 months"
   },
   {
      "_id":"CrRvuvcC_uqfwo-WSwLi",
      "creationtime":"2013-07-20T20:57:02",
      "currentdate":"2021-02-05 00:00:00",
      "age":"60 months"
   },
   {
      "_id":"CrRvuvcC_uqfwo-WSwLi",
      "creationtime":"2014-08-20T20:57:02",
      "currentdate":"2021-02-05 00:00:00",
      "age":"60 months"
   },
   {
      "_id":"CrRvuvcC_uqfwo-WSwLi",
      "creationtime":"2015-09-20T20:57:02",
      "currentdate":"2021-02-05 00:00:00",
      "age":"60 months"
   }
]

I want to do a bulk update based on each document ID. The problem is that I need to correct 6 months of data, and the data size (doc count of the index) is almost 535329. I want to efficiently bulk-update age based on _id, for each day, on all documents, using Python.

Is there a way to do this without looping through every document? All the examples I came across that use Pandas dataframes for updates are based on a known value, but here I only get the _id when the code runs.

The logic I had written fetches all docs, stores their _id, and then updates the age for each _id. But that is not efficient if I want to update all documents in bulk for each day of the 6 months.

Can anyone give me some ideas for this or point me in the right direction?

What exactly do you need the _id for? Do you do another lookup with that id to update the age? Or is the age rather based on the difference of the two timestamps? – Joe - Elasticsearch Handbook
@JoeSorocin I need the _id so that I do not update some other document by mistake. It is just to keep track of which doc I am updating or, in case of failure, which document was last updated. – Jermy Fields
That won't be a problem because the updates will be atomic. Can you explain a little more of the logic behind the age calculation? Is it just the timestamp difference or does anything else come into play? – Joe - Elasticsearch Handbook
@JoeSorocin age is just a difference: creationtime is what we read from the original data on the server, and currentdate is what we insert as the time when the document was inserted. – Jermy Fields

1 Answer

As mentioned in the comments, fetching the IDs won't be necessary. You don't even need to fetch the documents themselves!

A single _update_by_query call will be enough. You can use ChronoUnit to get the difference after you've parsed the dates:

POST your-index-name/_update_by_query
{
  "query": {
    "match_all": {}
  },
  "script": {
    "source": """
      // Parse both timestamps; each field uses a different date format.
      def created = LocalDateTime.parse(ctx._source.creationtime, DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss"));
      def currentdate = LocalDateTime.parse(ctx._source.currentdate, DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"));

      // Whole months elapsed between creation and insertion.
      def months = ChronoUnit.MONTHS.between(created, currentdate);
      ctx._source.age = months + ' month' + (months > 1 ? 's' : '');
    """,
    "lang": "painless"
  }
}

The official Python client has this method too. Here's a working example.
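For illustration only, here is a minimal sketch of how the same request might be sent from Python. It assumes the `elasticsearch` package is installed and that your client version still accepts the `body=` keyword (newer clients prefer separate `query=` and `script=` arguments). The helper names `build_update_body` and `update_ages`, the host address, and the choice of `age` as the target field are assumptions of mine, not part of the request above.

```python
# Same logic as the Painless script in the request above.
# Writing to "age" is an assumption about the field you want to overwrite.
AGE_SCRIPT = """
def created = LocalDateTime.parse(ctx._source.creationtime, DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss"));
def currentdate = LocalDateTime.parse(ctx._source.currentdate, DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"));
def months = ChronoUnit.MONTHS.between(created, currentdate);
ctx._source.age = months + ' month' + (months > 1 ? 's' : '');
"""


def build_update_body(query=None):
    """Return the _update_by_query body; falls back to match_all."""
    return {
        "query": query if query is not None else {"match_all": {}},
        "script": {"source": AGE_SCRIPT, "lang": "painless"},
    }


def update_ages(client, index, query=None):
    """Run the update. `client` is expected to be an elasticsearch.Elasticsearch instance."""
    return client.update_by_query(index=index, body=build_update_body(query))


# Hypothetical usage (host and index name are placeholders):
# from elasticsearch import Elasticsearch
# es = Elasticsearch("http://localhost:9200")
# update_ages(es, "your-index-name", query={"ids": {"values": ["CrRvuvcC_uqfwo-WSwLi"]}})
```

Passing a narrower `query` (such as the `ids` query in the commented usage) keeps a first run confined to a few documents.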

🔑 Try running this update script on a small subset of your documents before letting it loose on your whole index, by replacing the match_all query I put there with a narrower one.
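When checking such a subset, it can help to compute the expected values independently. The following is a rough Python sketch of the same whole-month arithmetic; `months_between` and `expected_age` are hypothetical helper names, and the function is only intended to mirror ChronoUnit.MONTHS.between for the case where currentdate is not earlier than creationtime.

```python
from datetime import datetime


def months_between(start, end):
    """Whole months elapsed from start to end (assumes end >= start)."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    # The last month only counts once the day-of-month and time have been reached.
    if (end.day, end.time()) < (start.day, start.time()):
        months -= 1
    return months


def expected_age(creationtime, currentdate):
    """Format the age the same way the update script does."""
    created = datetime.strptime(creationtime, "%Y-%m-%dT%H:%M:%S")
    current = datetime.strptime(currentdate, "%Y-%m-%d %H:%M:%S")
    months = months_between(created, current)
    return f"{months} month{'s' if months > 1 else ''}"


print(expected_age("2018-05-20T20:57:02", "2021-02-05 00:00:00"))  # 32 months
```

Comparing these strings against the updated documents is a quick way to confirm the script behaves as intended before widening the query.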


💡 It's worth mentioning that unless you search on this age field, it doesn't need to be stored in your index because it can be calculated at query time.

You see, if your index mapping's dates are properly defined like so:

{
  "mappings": {
    "properties": {
      "creationtime": {
        "type": "date",
        "format": "yyyy-MM-dd'T'HH:mm:ss"
      },
      "currentdate": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss"
      },
      ...
    }
  }
}

the age can be calculated as a script field:

POST your-index-name/_search
{
  "query": {
    "match_all": {}
  },
  "script_fields": {
    "age_calculated": {
      "script": {
        "source": """
          def months = ChronoUnit.MONTHS.between(
                          doc['creationtime'].value,
                          doc['currentdate'].value );
          return months + ' month' + (months > 1 ? 's' : '');
        """
      }
    }
  }
}

The only caveat is, the value won't be inside of the _source but rather inside of its own group called fields (which implies that more script fields are possible at once!).

"hits" : [
  {
    ...
    "_id" : "FFfPuncBly0XYOUcdIs5",
    "fields" : {
      "age_calculated" : [ "32 months" ]   <--
    }
  },
  ...
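Since the calculated value lives under fields rather than _source, client code has to read it from there. A small Python sketch of one way to do that, assuming the search response is available as a plain dict with the usual hits.hits layout; `ages_by_id` is a hypothetical helper name and the sample response is hand-written to match the excerpt above.

```python
def ages_by_id(response, field="age_calculated"):
    """Map each hit's _id to the first value of the given script field."""
    result = {}
    for hit in response["hits"]["hits"]:
        # Script field values always come back as a list.
        values = hit.get("fields", {}).get(field, [])
        result[hit["_id"]] = values[0] if values else None
    return result


# Hand-written sample shaped like the response excerpt above.
sample = {
    "hits": {
        "hits": [
            {"_id": "FFfPuncBly0XYOUcdIs5", "fields": {"age_calculated": ["32 months"]}}
        ]
    }
}
print(ages_by_id(sample))  # {'FFfPuncBly0XYOUcdIs5': '32 months'}
```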