
Note: This is cross-posted on the Elasticsearch forum (https://discuss.elastic.co/t/store-size-1-000-times-the-document-byte-size/74258/4).

I am seeing a store.size roughly 1,000x larger than the byte size of my documents. I have a very simple mapping with very small documents (less than 1 KB). I've compared my mapping to Elasticsearch's internal mapping and they are the same, so it does not appear that any dynamic mapping is going on.

So far, I have ingested 60,437 documents and have a store.size of 19.6 GB (an average of about 300 KB per document), but the average byte size (String.getBytes().length) of the JSON is only 300-400 bytes per document. In another run, the index averaged about 1 MB to 3 MB per document.

I'm using Elasticsearch 5.2 on an m4.2xlarge EC2 instance. Elasticsearch was installed with almost all defaults, except for what I needed to change in order to pass the bootstrap checks and bind to a non-local IP. I've allocated 16 GB (half of my physical memory) to Elasticsearch.

I used to run Elasticsearch 2.x, ingesting FAR more fields and much larger documents than this handful of fields, and was only seeing about 20 KB per document, which was still substantial but manageable.

If anyone can point out anything that would fix this, I would appreciate it. Or is there an ES 5.x configuration I haven't seen that will resolve this?

Below is my mapping.

{
    "settings": {
        "index.query.default_field": "tweetText"
    },
    "mappings": {
        "tweet": {
            "_all": {
                "enabled": false
            },
            "properties": {
                "tweetDate": {
                    "type": "date",
                    "format": "EEE MMM dd HH:mm:ss Z YYYY||strict_date_optional_time||epoch_millis"
                },
                "userId": {
                    "type": "text",
                    "index": "not_analyzed"
                },
                "screenName": {
                    "type": "text",
                    "index": "not_analyzed"
                },
                "tweetText": {
                    "type": "text"
                },
                "cleanedText": {
                    "type": "text"
                },
                "tweetId": {
                    "type": "text",
                    "index": "not_analyzed"
                },
                "location": {
                    "type": "geo_point",
                    "ignore_malformed": true
                },
                "placeName": {
                    "type": "keyword",
                    "doc_values": true,
                    "eager_global_ordinals": false
                },
                "placeCountry": {
                    "type": "keyword",
                    "doc_values": true,
                    "eager_global_ordinals": true
                },
                "placeCountryCode": {
                    "type": "keyword",
                    "doc_values": false,
                    "eager_global_ordinals": false,
                    "index": false
                },
                "placeBoundingBox": {
                    "type": "geo_shape",
                    "tree": "quadtree",
                    "precision": "1m"
                },
                "resolvedUrls": {
                    "type": "text",
                    "index": "not_analyzed"
                },
                "hashtags": {
                    "type": "text"
                },
                "mentions": {
                    "type": "text"
                },
                "geoInferences": {
                    "properties": {
                        "matchedName": {
                            "type": "text"
                        },
                        "asciiName": {
                            "type": "keyword",
                            "doc_values": true,
                            "eager_global_ordinals": false
                        },
                        "country": {
                            "type": "keyword",
                            "doc_values": true,
                            "eager_global_ordinals": true
                        },
                        "county": {
                            "type": "text"
                        },
                        "countryCode": {
                            "type": "keyword",
                            "doc_values": false,
                            "eager_global_ordinals": false,
                            "index": false
                        },
                        "city": {
                            "type": "text"
                        },
                        "admin1Code": {
                            "type": "keyword",
                            "doc_values": false,
                            "eager_global_ordinals": false,
                            "index": false
                        },
                        "admin2Code": {
                            "type": "keyword",
                            "doc_values": false,
                            "eager_global_ordinals": false,
                            "index": false
                        },
                        "admin3Code": {
                            "type": "keyword",
                            "doc_values": false,
                            "eager_global_ordinals": false,
                            "index": false
                        },
                        "admin4Code": {
                            "type": "keyword",
                            "doc_values": false,
                            "eager_global_ordinals": false,
                            "index": false
                        },
                        "confidence": {
                            "type": "float",
                            "doc_values": false,
                            "ignore_malformed": false,
                            "index": false
                        },
                        "coordinates": {
                            "type": "geo_point",
                            "ignore_malformed": true
                        }
                    }
                },
                "temporalInferences": {
                    "type": "date",
                    "ignore_malformed": true
                }
            }
        }
    }
}

A sample document:

{
  "_index": "twitter",
  "_type": "tweet",
  "_id": "AVoZivLca9LOhnR10_ll",
  "_score": null,
  "_source": {
    "tweetDate": 1486487211000,
    "userId": "123456789",
    "screenName": "removed",
    "tweetText": "RT @wef: America’s dominance is over. By 2030, we'll have a handful of global powers https://www.weforum.org/agenda/2016/11/america-s-dominance-is-over/?utm_content=buffer73cd5&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer #wef17 https://twitter.com/wef/status/828994745200435200/photo/1",
    "cleanedText": "RT @wef: America s dominance is over. By 2030, we'll have a handful of global powers https://www.weforum.org/agenda/2016/11/america-s-dominance-is-over/?utm_content=buffer73cd5&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer #wef17 https://twitter.com/wef/status/828994745200435200/photo/1",
    "tweetId": "829013568288796672",
    "resolvedUrls": [
      "https://www.weforum.org/agenda/2016/11/america-s-dominance-is-over/?utm_content=buffer73cd5&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer"
    ],
    "hashtags": [
      "wef17"
    ],
    "mentions": [
      "wef"
    ],
    "geoInferences": [
      {
        "matchedName": "America",
        "asciiName": "United States",
        "country": "United States",
        "countryCode": "US",
        "coordinates": [
          -98.5,
          39.76
        ],
        "admin1Code": "00",
        "admin2Code": "",
        "admin3Code": "",
        "admin4Code": "",
        "confidence": 1
      }
    ],
    "temporalInferences": [
      1893474000000
    ]
  },
  "fields": {
    "temporalInferences": [
      1893474000000
    ],
    "tweetDate": [
      1486487211000
    ]
  },
  "sort": [
    1486487211000
  ]
}

The output from

GET /_cat/indices/twitter?pri&v&h=health,index,pri,rep,docs.count,mt,pri,rep,docs.count,store.size,pri.store.size

health | index | pri | rep | docs.count | mt | pri.mt | store.size | pri.store.size | pri.store.size
yellow | twitter | 5 | 1 | 26860 | 74 | 74 | 10.1gb | 10.1gb | 10.1gb

The output from:

GET /twitter/_stats

{
  "_shards": {
    "total": 10,
    "successful": 5,
    "failed": 0
  },
  "_all": {
    "primaries": {
      "docs": {
        "count": 26860,
        "deleted": 0
      },
      "store": {
        "size_in_bytes": 11027965678,
        "throttle_time_in_millis": 0
      },
      "indexing": {
        "index_total": 27397,
        "index_time_in_millis": 3568991,
        "index_current": 1,
        "index_failed": 0,
        "delete_total": 0,
        "delete_time_in_millis": 0,
        "delete_current": 0,
        "noop_update_total": 0,
        "is_throttled": false,
        "throttle_time_in_millis": 195961
      },
      "get": {
        "total": 0,
        "time_in_millis": 0,
        "exists_total": 0,
        "exists_time_in_millis": 0,
        "missing_total": 0,
        "missing_time_in_millis": 0,
        "current": 0
      },
      "search": {
        "open_contexts": 0,
        "query_total": 55,
        "query_time_in_millis": 294,
        "query_current": 0,
        "fetch_total": 36,
        "fetch_time_in_millis": 3209,
        "fetch_current": 0,
        "scroll_total": 0,
        "scroll_time_in_millis": 0,
        "scroll_current": 0,
        "suggest_total": 0,
        "suggest_time_in_millis": 0,
        "suggest_current": 0
      },
      "merges": {
        "current": 0,
        "current_docs": 0,
        "current_size_in_bytes": 0,
        "total": 76,
        "total_time_in_millis": 350987,
        "total_docs": 45409,
        "total_size_in_bytes": 4027595474,
        "total_stopped_time_in_millis": 0,
        "total_throttled_time_in_millis": 48633,
        "total_auto_throttle_in_bytes": 82233108
      },
      "refresh": {
        "total": 857,
        "total_time_in_millis": 2994887,
        "listeners": 0
      },
      "flush": {
        "total": 15,
        "total_time_in_millis": 291939
      },
      "warmer": {
        "current": 0,
        "total": 876,
        "total_time_in_millis": 534
      },
      "query_cache": {
        "memory_size_in_bytes": 0,
        "total_count": 0,
        "hit_count": 0,
        "miss_count": 0,
        "cache_size": 0,
        "cache_count": 0,
        "evictions": 0
      },
      "fielddata": {
        "memory_size_in_bytes": 24808,
        "evictions": 0
      },
      "completion": {
        "size_in_bytes": 0
      },
      "segments": {
        "count": 139,
        "memory_in_bytes": 186032131,
        "terms_memory_in_bytes": 185758725,
        "stored_fields_memory_in_bytes": 43976,
        "term_vectors_memory_in_bytes": 0,
        "norms_memory_in_bytes": 77888,
        "points_memory_in_bytes": 714,
        "doc_values_memory_in_bytes": 150828,
        "index_writer_memory_in_bytes": 1316180948,
        "version_map_memory_in_bytes": 42250,
        "fixed_bit_set_memory_in_bytes": 0,
        "max_unsafe_auto_id_timestamp": -1,
        "file_sizes": {

        }
      },
      "translog": {
        "operations": 11997,
        "size_in_bytes": 5555179
      },
      "request_cache": {
        "memory_size_in_bytes": 0,
        "evictions": 0,
        "hit_count": 195,
        "miss_count": 195
      },
      "recovery": {
        "current_as_source": 0,
        "current_as_target": 0,
        "throttle_time_in_millis": 0
      }
    },
    "total": {
      "docs": {
        "count": 26860,
        "deleted": 0
      },
      "store": {
        "size_in_bytes": 11027965678,
        "throttle_time_in_millis": 0
      },
      "indexing": {
        "index_total": 27397,
        "index_time_in_millis": 3568991,
        "index_current": 1,
        "index_failed": 0,
        "delete_total": 0,
        "delete_time_in_millis": 0,
        "delete_current": 0,
        "noop_update_total": 0,
        "is_throttled": false,
        "throttle_time_in_millis": 195961
      },
      "get": {
        "total": 0,
        "time_in_millis": 0,
        "exists_total": 0,
        "exists_time_in_millis": 0,
        "missing_total": 0,
        "missing_time_in_millis": 0,
        "current": 0
      },
      "search": {
        "open_contexts": 0,
        "query_total": 55,
        "query_time_in_millis": 294,
        "query_current": 0,
        "fetch_total": 36,
        "fetch_time_in_millis": 3209,
        "fetch_current": 0,
        "scroll_total": 0,
        "scroll_time_in_millis": 0,
        "scroll_current": 0,
        "suggest_total": 0,
        "suggest_time_in_millis": 0,
        "suggest_current": 0
      },
      "merges": {
        "current": 0,
        "current_docs": 0,
        "current_size_in_bytes": 0,
        "total": 76,
        "total_time_in_millis": 350987,
        "total_docs": 45409,
        "total_size_in_bytes": 4027595474,
        "total_stopped_time_in_millis": 0,
        "total_throttled_time_in_millis": 48633,
        "total_auto_throttle_in_bytes": 82233108
      },
      "refresh": {
        "total": 857,
        "total_time_in_millis": 2994887,
        "listeners": 0
      },
      "flush": {
        "total": 15,
        "total_time_in_millis": 291939
      },
      "warmer": {
        "current": 0,
        "total": 876,
        "total_time_in_millis": 534
      },
      "query_cache": {
        "memory_size_in_bytes": 0,
        "total_count": 0,
        "hit_count": 0,
        "miss_count": 0,
        "cache_size": 0,
        "cache_count": 0,
        "evictions": 0
      },
      "fielddata": {
        "memory_size_in_bytes": 24808,
        "evictions": 0
      },
      "completion": {
        "size_in_bytes": 0
      },
      "segments": {
        "count": 139,
        "memory_in_bytes": 186032131,
        "terms_memory_in_bytes": 185758725,
        "stored_fields_memory_in_bytes": 43976,
        "term_vectors_memory_in_bytes": 0,
        "norms_memory_in_bytes": 77888,
        "points_memory_in_bytes": 714,
        "doc_values_memory_in_bytes": 150828,
        "index_writer_memory_in_bytes": 1316180948,
        "version_map_memory_in_bytes": 42250,
        "fixed_bit_set_memory_in_bytes": 0,
        "max_unsafe_auto_id_timestamp": -1,
        "file_sizes": {

        }
      },
      "translog": {
        "operations": 11997,
        "size_in_bytes": 5555179
      },
      "request_cache": {
        "memory_size_in_bytes": 0,
        "evictions": 0,
        "hit_count": 195,
        "miss_count": 195
      },
      "recovery": {
        "current_as_source": 0,
        "current_as_target": 0,
        "throttle_time_in_millis": 0
      }
    }
  },
  "indices": {
    "twitter": {
      "primaries": {
        "docs": {
          "count": 26860,
          "deleted": 0
        },
        "store": {
          "size_in_bytes": 11027965678,
          "throttle_time_in_millis": 0
        },
        "indexing": {
          "index_total": 27397,
          "index_time_in_millis": 3568991,
          "index_current": 1,
          "index_failed": 0,
          "delete_total": 0,
          "delete_time_in_millis": 0,
          "delete_current": 0,
          "noop_update_total": 0,
          "is_throttled": false,
          "throttle_time_in_millis": 195961
        },
        "get": {
          "total": 0,
          "time_in_millis": 0,
          "exists_total": 0,
          "exists_time_in_millis": 0,
          "missing_total": 0,
          "missing_time_in_millis": 0,
          "current": 0
        },
        "search": {
          "open_contexts": 0,
          "query_total": 55,
          "query_time_in_millis": 294,
          "query_current": 0,
          "fetch_total": 36,
          "fetch_time_in_millis": 3209,
          "fetch_current": 0,
          "scroll_total": 0,
          "scroll_time_in_millis": 0,
          "scroll_current": 0,
          "suggest_total": 0,
          "suggest_time_in_millis": 0,
          "suggest_current": 0
        },
        "merges": {
          "current": 0,
          "current_docs": 0,
          "current_size_in_bytes": 0,
          "total": 76,
          "total_time_in_millis": 350987,
          "total_docs": 45409,
          "total_size_in_bytes": 4027595474,
          "total_stopped_time_in_millis": 0,
          "total_throttled_time_in_millis": 48633,
          "total_auto_throttle_in_bytes": 82233108
        },
        "refresh": {
          "total": 857,
          "total_time_in_millis": 2994887,
          "listeners": 0
        },
        "flush": {
          "total": 15,
          "total_time_in_millis": 291939
        },
        "warmer": {
          "current": 0,
          "total": 876,
          "total_time_in_millis": 534
        },
        "query_cache": {
          "memory_size_in_bytes": 0,
          "total_count": 0,
          "hit_count": 0,
          "miss_count": 0,
          "cache_size": 0,
          "cache_count": 0,
          "evictions": 0
        },
        "fielddata": {
          "memory_size_in_bytes": 24808,
          "evictions": 0
        },
        "completion": {
          "size_in_bytes": 0
        },
        "segments": {
          "count": 139,
          "memory_in_bytes": 186032131,
          "terms_memory_in_bytes": 185758725,
          "stored_fields_memory_in_bytes": 43976,
          "term_vectors_memory_in_bytes": 0,
          "norms_memory_in_bytes": 77888,
          "points_memory_in_bytes": 714,
          "doc_values_memory_in_bytes": 150828,
          "index_writer_memory_in_bytes": 1316180948,
          "version_map_memory_in_bytes": 42250,
          "fixed_bit_set_memory_in_bytes": 0,
          "max_unsafe_auto_id_timestamp": -1,
          "file_sizes": {

          }
        },
        "translog": {
          "operations": 11997,
          "size_in_bytes": 5555179
        },
        "request_cache": {
          "memory_size_in_bytes": 0,
          "evictions": 0,
          "hit_count": 195,
          "miss_count": 195
        },
        "recovery": {
          "current_as_source": 0,
          "current_as_target": 0,
          "throttle_time_in_millis": 0
        }
      },
      "total": {
        "docs": {
          "count": 26860,
          "deleted": 0
        },
        "store": {
          "size_in_bytes": 11027965678,
          "throttle_time_in_millis": 0
        },
        "indexing": {
          "index_total": 27397,
          "index_time_in_millis": 3568991,
          "index_current": 1,
          "index_failed": 0,
          "delete_total": 0,
          "delete_time_in_millis": 0,
          "delete_current": 0,
          "noop_update_total": 0,
          "is_throttled": false,
          "throttle_time_in_millis": 195961
        },
        "get": {
          "total": 0,
          "time_in_millis": 0,
          "exists_total": 0,
          "exists_time_in_millis": 0,
          "missing_total": 0,
          "missing_time_in_millis": 0,
          "current": 0
        },
        "search": {
          "open_contexts": 0,
          "query_total": 55,
          "query_time_in_millis": 294,
          "query_current": 0,
          "fetch_total": 36,
          "fetch_time_in_millis": 3209,
          "fetch_current": 0,
          "scroll_total": 0,
          "scroll_time_in_millis": 0,
          "scroll_current": 0,
          "suggest_total": 0,
          "suggest_time_in_millis": 0,
          "suggest_current": 0
        },
        "merges": {
          "current": 0,
          "current_docs": 0,
          "current_size_in_bytes": 0,
          "total": 76,
          "total_time_in_millis": 350987,
          "total_docs": 45409,
          "total_size_in_bytes": 4027595474,
          "total_stopped_time_in_millis": 0,
          "total_throttled_time_in_millis": 48633,
          "total_auto_throttle_in_bytes": 82233108
        },
        "refresh": {
          "total": 857,
          "total_time_in_millis": 2994887,
          "listeners": 0
        },
        "flush": {
          "total": 15,
          "total_time_in_millis": 291939
        },
        "warmer": {
          "current": 0,
          "total": 876,
          "total_time_in_millis": 534
        },
        "query_cache": {
          "memory_size_in_bytes": 0,
          "total_count": 0,
          "hit_count": 0,
          "miss_count": 0,
          "cache_size": 0,
          "cache_count": 0,
          "evictions": 0
        },
        "fielddata": {
          "memory_size_in_bytes": 24808,
          "evictions": 0
        },
        "completion": {
          "size_in_bytes": 0
        },
        "segments": {
          "count": 139,
          "memory_in_bytes": 186032131,
          "terms_memory_in_bytes": 185758725,
          "stored_fields_memory_in_bytes": 43976,
          "term_vectors_memory_in_bytes": 0,
          "norms_memory_in_bytes": 77888,
          "points_memory_in_bytes": 714,
          "doc_values_memory_in_bytes": 150828,
          "index_writer_memory_in_bytes": 1316180948,
          "version_map_memory_in_bytes": 42250,
          "fixed_bit_set_memory_in_bytes": 0,
          "max_unsafe_auto_id_timestamp": -1,
          "file_sizes": {

          }
        },
        "translog": {
          "operations": 11997,
          "size_in_bytes": 5555179
        },
        "request_cache": {
          "memory_size_in_bytes": 0,
          "evictions": 0,
          "hit_count": 195,
          "miss_count": 195
        },
        "recovery": {
          "current_as_source": 0,
          "current_as_target": 0,
          "throttle_time_in_millis": 0
        }
      }
    }
  }
}

EDIT 1: I've discovered the source of this issue. The bounding box seems to be at fault, though I have no idea why.

Once I remove the bounding box from the data being ingested, the index is a normal size (600 documents --> 550 KB). As soon as I add the bounding box back in (with a brand new index), the size skyrockets (3,593 documents --> 1.6 GB), even though only 84 of those documents contain a bounding box.
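To put those two runs side by side, here is a rough per-document calculation. It is only a sketch: it assumes the reported sizes are binary units (1 KB = 1024 bytes), and the last figure assumes that all of the space in the second index is due to the 84 documents with a bounding box, which is an upper bound rather than a measurement.

```python
KB = 1024          # assumption: sizes are reported in binary units
GB = 1024 ** 3


def bytes_per_doc(store_bytes: float, docs: int) -> float:
    """Average on-disk bytes per document."""
    return store_bytes / docs


# Run without the bounding box: 600 documents, 550 KB
without_bbox = bytes_per_doc(550 * KB, 600)

# Run with the bounding box: 3,593 documents, 1.6 GB
with_bbox_all_docs = bytes_per_doc(1.6 * GB, 3593)

# Upper bound: charge the whole index to the 84 documents that have a box
with_bbox_only_bbox_docs = bytes_per_doc(1.6 * GB, 84)

print(f"without bbox:          {without_bbox:,.0f} bytes/doc")
print(f"with bbox (all docs):  {with_bbox_all_docs / KB:,.0f} KB/doc")
print(f"with bbox (84 docs):   {with_bbox_only_bbox_docs / KB ** 2:,.1f} MB/doc")
```

Under those assumptions, a document without a bounding box costs a little under 1 KB, while each document with a bounding box costs somewhere in the region of 19 MB.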

Below is the JSON of the bounding box:

"placeBoundingBox": {
    "type": "polygon",
    "coordinates": [
      [
        [
          -71.191421,
          42.227797
        ],
        [
          -71.191421,
          42.399542
        ],
        [
          -70.986004,
          42.399542
        ],
        [
          -70.986004,
          42.227797
        ],
        [
          -71.191421,
          42.227797
        ]
      ]
    ]
  }

The mapping associated with the bounding box (from calling GET /INDEX_NAME):

"placeBoundingBox": {
    "type": "geo_shape",
    "tree": "quadtree",
    "precision": "1.0m"
  }

To demonstrate that the mapping does in fact work and is creating a proper geo_shape (even though Kibana doesn't recognize it as a geo_shape), I ran the following query and got back a successful hit:

GET /_search
{
  "query": {
    "bool": {
      "must": {
        "match_all": {

        }
      },
      "filter": {
        "geo_shape": {
          "placeBoundingBox": {
            "shape": {
              "type": "polygon",
              "coordinates": [
                [
                  [
                    -71.191421,
                    42.227797
                  ],
                  [
                    -71.191421,
                    42.399542
                  ],
                  [
                    -70.986004,
                    42.399542
                  ],
                  [
                    -70.986004,
                    42.227797
                  ],
                  [
                    -71.191421,
                    42.227797
                  ]
                ]
              ]
            },
            "relation": "within"
          }
        }
      }
    }
  }
}

I'd like to keep the bounding box. Is there something wrong with either the mapping or the data? Is 1.0m too fine-grained?

Some questions: what about shards/replicas in the previous use case? Right now you have 5 shards and 5 replicas (if I'm not mistaken); maybe that is why the size looks so large. - Mysterion
I added the replica field to the mapping and set it to 1, and there was no change. - Brooks
Could you set it to 0? Does that change the size? - Mysterion

1 Answer


The problem was the precision in the mapping, which was simply a typo (our index for Elasticsearch 2.x had the precision as 1km). One tiny letter made all the difference...

A 1 meter ("1m") precision creates an extremely bloated index.
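As a back-of-envelope illustration of why that one letter matters: a quadtree halves the cell width at every level, so the depth needed grows with log2(Earth's circumference / precision). The sketch below is a simplified estimate, not Elasticsearch's exact formula. The circumference, the metres-per-degree constant and the function names are my own assumptions, and the boundary-cell count ignores how the tree actually tiles a shape.

```python
import math

EARTH_EQUATOR_M = 40_075_017   # approximate equatorial circumference, metres
METRES_PER_DEGREE = 111_320    # approximate length of one degree of latitude


def approx_quadtree_levels(precision_m: float) -> int:
    """Levels needed so that a cell is no wider than precision_m."""
    return math.ceil(math.log2(EARTH_EQUATOR_M / precision_m))


def approx_boundary_cells(perimeter_m: float, precision_m: float) -> int:
    """Crude count of finest-level cells needed to trace a shape's outline."""
    return math.ceil(perimeter_m / precision_m)


# Rough perimeter of the sample bounding box from the question
height_m = (42.399542 - 42.227797) * METRES_PER_DEGREE
width_m = (71.191421 - 70.986004) * METRES_PER_DEGREE * math.cos(math.radians(42.3))
perimeter_m = 2 * (height_m + width_m)

for label, metres in [("1m", 1), ("50m", 50), ("1km", 1000)]:
    levels = approx_quadtree_levels(metres)
    cells = approx_boundary_cells(perimeter_m, metres)
    print(f"{label:>4}: ~{levels} levels, ~{cells:,} boundary cells")
```

With these assumptions, "1m" needs 26 levels against 16 for "1km", that is, ten extra levels and cells about 1,000 times narrower, which is in line with the roughly 1,000x blow-up described in the question.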

Removing the "precision" field from the mapping altogether makes it default to 50m and gives a reasonably sized index.
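For reference, a minimal sketch of the two ways to write the field mapping after the fix, built as Python dicts and printed as JSON. Only the "placeBoundingBox" field is shown; the variable names are mine, and the rest of the mapping from the question is assumed to stay unchanged.

```python
import json

# Option 1: drop "precision" entirely and rely on the default
default_precision_mapping = {
    "type": "geo_shape",
    "tree": "quadtree",
}

# Option 2: state the precision explicitly, as the old 2.x index did
explicit_1km_mapping = {
    "type": "geo_shape",
    "tree": "quadtree",
    "precision": "1km",   # "1km", not "1m"
}

print(json.dumps({"placeBoundingBox": default_precision_mapping}, indent=4))
print(json.dumps({"placeBoundingBox": explicit_1km_mapping}, indent=4))
```

Either fragment would replace the "placeBoundingBox" entry under "properties" when the index is created.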