0
votes

I'm trying to import a rather large (~200M docs) DocumentDB collection into Azure Search, but the indexer times out after ~24 hours. When the indexer restarts, it starts again from the beginning rather than from where it got to, so I can't get more than ~40M docs into the search index. The data source has a high-water mark set like this:

        var source = new DataSource();
        source.Name = DataSourceName;
        source.Type = DataSourceType.DocumentDb;
        source.Credentials = new DataSourceCredentials(myEnvDef.ConnectionString);
        source.Container = new DataContainer(myEnvDef.CollectionName, QueryString);
        source.DataChangeDetectionPolicy = new HighWaterMarkChangeDetectionPolicy("_ts");
        serviceClient.DataSources.Create(source);

The high-water mark appears to work correctly when testing on a small database.

Should the high-water mark be respected when the indexer fails like this, and if not, how can I index such a large data set?

1 Answer

1
votes

The 24-hour execution time limit is expected. The reason the indexer makes no incremental progress across those timeouts is that the data source uses a user-specified query (the QueryString argument passed to the DataContainer constructor). With a user-specified query, we cannot guarantee, and therefore cannot assume, that the documents in the query response are ordered by the _ts column, and that ordering is necessary to support incremental progress.

So, if a custom query isn't required for your scenario, consider not using it.

Alternatively, consider partitioning your data and creating multiple datasource / indexer pairs that all write into the same index. You can use the Datasource.Container.Query parameter to provide a DocumentDB query that selects one partition of your data with a WHERE filter. That way, each indexer has less work to do and, with sufficient partitioning, will fit under the 24-hour limit. Moreover, if your search service has multiple search units, multiple indexers will run in parallel, further increasing indexing throughput and decreasing the overall time to index your entire dataset.
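
As a rough sketch of the partitioning idea, the helper below generates one WHERE-filtered query per partition by splitting a numeric range into contiguous half-open slices. It is an illustration under assumptions, not part of the Azure Search SDK: the `PartitionQueries` class, the `Build` method, and the `bucket` field used in the usage note are all hypothetical, and it assumes your documents carry a numeric property whose range you know in advance.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of any SDK): builds one DocumentDB query per partition.
public static class PartitionQueries
{
    // Splits [min, max) into `parts` contiguous half-open ranges over `field`
    // and returns one query string per non-empty range.
    // Note: span * i can overflow for extremely large ranges; fine for a sketch.
    public static List<string> Build(string field, long min, long max, int parts)
    {
        if (parts <= 0) throw new ArgumentOutOfRangeException(nameof(parts));
        if (max <= min) throw new ArgumentException("max must be greater than min");

        var queries = new List<string>();
        long span = max - min;
        for (int i = 0; i < parts; i++)
        {
            long lo = min + span * i / parts;
            long hi = min + span * (i + 1) / parts;
            if (hi == lo) continue; // more partitions than values: skip empty slices
            queries.Add($"SELECT * FROM c WHERE c.{field} >= {lo} AND c.{field} < {hi}");
        }
        return queries;
    }
}
```

For example, `PartitionQueries.Build("bucket", 0, 100, 4)` yields four queries covering 0-25, 25-50, 50-75 and 75-100. Each string would then be passed as the query argument of a separate `new DataContainer(myEnvDef.CollectionName, query)`, one per datasource / indexer pair, with every indexer targeting the same index.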