When using the Azure-Cosmosdb-spark connector to pass an sql query to CosmosDB, it seems to add columns from the other documents in the collection. There are a number of documents in the collection for example user and company which are separated by a entity type. For example the user is set up as
{ "id": "user-000003",
"email": "someemail",
"firstName": "firstname",
"lastName": "lastname",
"username": "someusername",
"companyId": "company-000003",
"entity": "user"
}
and the company is set up as:
{ "id": "company-000003",
"contactName": "namegoes here",
"addressLine1": "Address line 1",
"addressLine2": "Address line 2",
"entity": "company"
}
Using the Azure-Cosmosdb-spark sdk I create my connection
cosmosConfig = {
"Endpoint" : "my endpoint goes here",
"Masterkey" : "my key goes here",
"Database" : "my database goes here",
"preferredRegions" : "my region goes here",
"Collection" : "my collection",
"SamplingRatio" : "1.0",
"schema_samplesize" : "1000",
"query_pagesize" : "2147483647",
}
and then
set it up to use that connection
cosmosdbConnection = spark.read.format("com.microsoft.azure.cosmosdb.spark").options(**cosmosConfig).load()
cosmosdbConnection.createOrReplaceTempView("c")
I then run the query
exampleQuery= 'SELECT c.* FROM c WHERE c.entity = "user"'
users = spark.sql(exampleQuery)
I expected to get a dataframe with the columns, id, email, firstName, lastName, username, companyId and entity as defined in the user document. It is however pulling through the column names from the company document as well, but all null values. Running the same query in the Azure Cosmos DB storage explorer or azure portal, just brings back the user documents.
I can just specify the columns names I would like, but if the schema changes I'll need to add those columns.
I assume its the query? I'm looking for the way to just get the columns from the documents in the sql query. I did think that it would just pass the sql query to the cosmosdb SQL API.
This is my first time using databricks with cosmos db, have googled around, but can't seem to see what I've done wrong.