
I need to retrieve millions of records from Elasticsearch.

I am a beginner with Elasticsearch and do not yet know how to use it efficiently.

I am indexing the Author model shown below, using the NEST client from a .NET application.

My models are described below.

Author
--------------------------------
AuthorKey           string
List<Study>         Nested


Study
---------------------------------
PMID              int
PublicationDate   date
PublicationType   string
MeshTerms         string
Content           string

We have almost 10 million authors, and each author has completed at least 3 studies.

So there are approximately 30 million study records in the Elasticsearch index.

Now I would like to get each author's data along with their total study count.

Below is sample JSON data:

{
  "Authors": [
    {
      "AuthorKey": "Author1",
      "AuthorName": "karan",
      "AuthorLastName": "shah",
      "Study": [
        {
          "PMId": 1000,
          "PublicationDate": "2019-01-17T06:35:52.178Z",
          "content": "this is dummy content.how can i solve this",
          "MeshTerms": "karan,dharan,nilesh,manan,mehul sir,manoj",
          "PublicationType": [
            "ClinicalTrial",
            "Medical"
          ]
        },
        {
          "PMId": 1001,
          "PublicationDate": "2019-01-16T05:55:14.947Z",
          "content": "this is dummy content.how can i solve this",
          "MeshTerms": "karan1,dharan1,nilesh1,manan1,mehul1 sir,manoj1",
          "PublicationType": [
            "ClinicalTrial",
            "Medical"
          ]
        },
        {
          "PMId": 1002,
          "PublicationDate": "2019-01-15T05:55:14.947Z",
          "content": "this is dummy content for record2.how can i solve this",
          "MeshTerms": "karan2,dharan2,nilesh2,manan2,mehul2 sir,manoj2",
          "PublicationType": [
            "ClinicalTrial1",
            "Medical2"
          ]
        },
        {
          "PMId": 1003,
          "PublicationDate": "2011-01-15T05:55:14.947Z",
          "content": "this is dummy content for record3.how can i solve this",
          "MeshTerms": "karan3,dharan3,nilesh3,manan3,mehul3 sir,manoj3",
          "PublicationType": [
            "ClinicalTrial1",
            "Medical3"
          ]
        }
      ]
    },
    {
      "AuthorKey": "Author2",
      "AuthorName": "dharan",
      "AuthorLastName": "shah",
      "Study": [

        {
          "PMId": 2001,
          "PublicationDate": "2011-01-16T05:55:14.947Z",
          "content": "this is dummy content for author 2.how can i solve this",
          "MeshTerms": "karan1,dharan1,nilesh1,manan1,mehul1 sir,manoj1",
          "PublicationType": [
            "ClinicalTrial",
            "Medical"
          ]
        },
        {
          "PMId": 2002,
          "PublicationDate": "2019-01-15T05:55:14.947Z",
          "content": "this is dummy content for author 2.how can i solve this",
          "MeshTerms": "karan2,dharan2,nilesh2,manan2,mehul2 sir,manoj2",
          "PublicationType": [
            "ClinicalTrial1",
            "Medical2"
          ]
        },
        {
          "PMId": 2003,
          "PublicationDate": "2015-01-15T05:55:14.947Z",
          "content": "this is dummy content for record2.how can i solve this",
          "MeshTerms": "karan3,dharan3,nilesh3,manan3,mehul3 sir,manoj3",
          "PublicationType": [
            "ClinicalTrial1",
            "Medical3"
          ]
        }
      ]
    },
    {
      "AuthorKey": "Author3",
      "AuthorName": "Nilesh",
      "AuthorLastName": "Mistrey",
      "Study": [
        {
          "PMId": 3000,
          "PublicationDate": "2012-01-16T05:55:14.947Z",
          "content": "this is dummy content for author 2 .how can i solve this",
          "MeshTerms": "karan2,dharan2,nilesh2,manan2,mehul sir2,manoj2",
          "PublicationType": [
            "ClinicalTrial",
            "Medical"
          ]
        }
      ]
    }
  ]
}

How can I retrieve all authors along with their total study count, sorted by that count in descending order?

Expected output:

{
  "Authors": [
    {
      "AuthorKey": "Author1",
      "AuthorName": "karan",
      "AuthorLastName": "shah",
      "StudyCount": 4
    },
    {
      "AuthorKey": "Author2",
      "AuthorName": "dharan",
      "AuthorLastName": "shah",
      "StudyCount": 3
    },

    {
      "AuthorKey": "Author3",
      "AuthorName": "Nilesh",
      "AuthorLastName": "Mistrey",
      "StudyCount": 1
    }
  ]
}

Below is the mapping of the index:

{
  "authorindex": {
    "mappings": {
      "_doc": {
        "properties": {
          "AuthorKey": {
            "type": "keyword"
          },
          "AuthorLastName": {
            "type": "keyword"
          },
          "AuthorName": {
            "type": "keyword"
          },
          "Study": {
            "type": "nested",
            "properties": {
              "MeshTerms": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "PMId": {
                "type": "long"
              },
              "PublicationDate": {
                "type": "date"
              },
              "PublicationType": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "content": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
Could you please provide the mapping you are using? Did you already try to solve the problem? How? - Nikolay Vasiliev
@NikolayVasiliev I have tried, but I could not work out how to write a query that fulfills this requirement - Karan Shah
Please don't add things like that in the comments, use the edit link - James Z

1 Answer


There are a couple of options to tackle this.

  1. Use scripting, as suggested in this answer to a similar question;

  2. Precompute the number of studies per author, store it in the index as a simple integer field, and sort the results by it.

Depending on the situation you are facing, either option can work for you.

Option 1) will do if you need to experiment with the data and run occasional queries. It does not perform well, but it should work with the existing data and mapping.
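
For illustration, Option 1 could look roughly like the request body below, built here as a Python dictionary so it can be printed and pasted into any client. This is only a sketch: it assumes that scripts in your Elasticsearch version can read `params._source` in both the sort and the script-field contexts (this varies between versions, so check yours). The field name `StudyCount`, the helper `build_script_sort_query` and the page size are made up for the example.

```python
import json

# Painless expression that counts the entries of the nested "Study" array.
# Assumption: params._source is readable from sort and script-field scripts
# in the Elasticsearch version in use.
COUNT_STUDIES = "params._source.Study.size()"


def build_script_sort_query(size=100):
    """Build a search body that sorts authors by their number of studies."""
    script = {"lang": "painless", "source": COUNT_STUDIES}
    return {
        "size": size,
        # Return only the author-level fields, not the nested studies.
        "_source": ["AuthorKey", "AuthorName", "AuthorLastName"],
        # Expose the computed count on every hit ("StudyCount" is a made-up name).
        "script_fields": {"StudyCount": {"script": script}},
        # Sort by the same computed value, largest first.
        "sort": [{"_script": {"type": "number", "script": script, "order": "desc"}}],
        "query": {"match_all": {}},
    }


if __name__ == "__main__":
    print(json.dumps(build_script_sort_query(), indent=2))
```

The script has to load and parse the stored source of every matching document, which is why this approach is likely to be slow across roughly 10 million authors.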

Option 2), by contrast, requires a complete reindex and an additional (yet easy) step before the data is sent to Elasticsearch. On the positive side, it gives the best possible performance.
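
Option 2 might be sketched as follows. Before each author document is sent to Elasticsearch, a `StudyCount` value is derived from the length of its `Study` list, and the search then sorts on that plain field. `StudyCount`, `with_study_count`, `STUDY_COUNT_MAPPING` and `SORT_BY_STUDY_COUNT` are names invented for this example, and the sample documents are trimmed versions of the ones in the question.

```python
import json


def with_study_count(author):
    """Return a copy of the author document with a precomputed StudyCount."""
    doc = dict(author)
    doc["StudyCount"] = len(author.get("Study", []))
    return doc


# Mapping fragment to add under "properties" (hypothetical field name).
STUDY_COUNT_MAPPING = {"StudyCount": {"type": "integer"}}

# Search body: sort on the stored integer; AuthorKey breaks ties deterministically.
SORT_BY_STUDY_COUNT = {
    "_source": ["AuthorKey", "AuthorName", "AuthorLastName", "StudyCount"],
    "sort": [{"StudyCount": "desc"}, {"AuthorKey": "asc"}],
    "query": {"match_all": {}},
}

if __name__ == "__main__":
    authors = [
        {"AuthorKey": "Author3", "Study": [{"PMId": 3000}]},
        {"AuthorKey": "Author1", "Study": [{"PMId": 1000}, {"PMId": 1001},
                                           {"PMId": 1002}, {"PMId": 1003}]},
        {"AuthorKey": "Author2", "Study": [{"PMId": 2001}, {"PMId": 2002},
                                           {"PMId": 2003}]},
    ]
    docs = [with_study_count(a) for a in authors]
    # Local stand-in for the sort Elasticsearch would apply to the index.
    docs.sort(key=lambda d: (-d["StudyCount"], d["AuthorKey"]))
    for d in docs:
        print(d["AuthorKey"], d["StudyCount"])
    print(json.dumps(SORT_BY_STUDY_COUNT))
```

The count has to be recomputed and the document reindexed whenever an author's studies change, which is the extra step mentioned above.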

You can read about other ways of handling relations in Elasticsearch in the Handling relationships chapter of the Definitive Guide.

Hope that helps!