I believe you cannot connect to the database directly because you first need to download it to the instance running your Cloud Function. Note that in the Cloud Functions environment the /tmp folder is the only writable directory, and since it is backed by an in-memory filesystem, the RAM assigned to your Cloud Function at deployment time is what will host the .db file.
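Because the downloaded file counts against the function's memory allocation, it can also help to delete it once the work is done. This is only a minimal sketch, assuming the same /tmp/chinook.db path used in the snippet further below:

import os

DB_PATH = "/tmp/chinook.db"

def cleanup_tmp_db(path=DB_PATH):
    # /tmp is backed by RAM, so removing the file frees memory that would
    # otherwise stay occupied between warm invocations of the function.
    if os.path.exists(path):
        os.remove(path)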
The following code snippet should work (it is based on the well-known Chinook SQLite sample database).
It is assumed that you have already created the destination table within the BigQuery dataset with the following schema:
ArtistId INTEGER NULLABLE
Name STRING NULLABLE
based on the schema of the artists table in the SQLite database (the specific query to run depends on it), and that the .db object stored within your Cloud Storage bucket is not nested under any folder.
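If the table does not exist yet, one way to create it with that schema is through the BigQuery Python client. This is only a sketch, and the table ID placeholder is the same one assumed in main.py below:

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder, replace with your own project, dataset and table names.
table_id = "[PROJECT-ID].[DATASET-ID].[TABLE-ID]"

schema = [
    bigquery.SchemaField("ArtistId", "INTEGER", mode="NULLABLE"),
    bigquery.SchemaField("Name", "STRING", mode="NULLABLE"),
]

table = bigquery.Table(table_id, schema=schema)
table = client.create_table(table)  # raises if the table already exists
print("Created table {}".format(table.full_table_id))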
- Within your local development setup, create a directory with the following files:
a. requirements.txt
google-cloud-storage
google-cloud-bigquery
pandas
pyarrow
b. main.py
from google.cloud import storage
from google.cloud import bigquery
import sqlite3
import pandas as pd

BUCKET_NAME = "[YOUR-BUCKET]"
OBJECT_NAME = "chinook.db"
DATABASE_NAME_IN_RUNTIME = "/tmp/chinook.db"
QUERY = "SELECT * FROM artists;"
TABLE_ID = "[PROJECT-ID].[DATASET-ID].[TABLE-ID]"

storage_client = storage.Client()
bigquery_client = bigquery.Client()


def get_db_file_from_gcs(bucket_name, object_name, filename):
    # Download the .db object from Cloud Storage into /tmp.
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(object_name)
    return blob.download_to_filename(filename)


def connect_to_sqlite_db(complete_filepath_to_db):
    connection = sqlite3.connect(complete_filepath_to_db)
    return connection


def run_query_to_db(connection, query):
    # Plain sqlite3 variant: returns the rows as a list of tuples.
    with connection:
        cursor = connection.cursor()
        cursor.execute(query)
        return cursor.fetchall()


def run_query_to_db_with_pandas(connection, query):
    # Pandas variant: returns the result set as a DataFrame, which can be
    # loaded into BigQuery directly.
    with connection:
        df = pd.read_sql_query(query, connection)
        return df


def gcssqlite_to_bq(request):
    print("Getting .db file from Storage")
    get_db_file_from_gcs(BUCKET_NAME, OBJECT_NAME, DATABASE_NAME_IN_RUNTIME)
    print("Downloaded .db file in CF instance RAM")
    print("Trying to connect to database using sqlite")
    cnx = connect_to_sqlite_db(DATABASE_NAME_IN_RUNTIME)
    print("Connected to database")
    print("Attempting to perform a query")
    results = run_query_to_db_with_pandas(cnx, QUERY)
    print("Writing data to BigQuery")
    bigqueryJob = bigquery_client.load_table_from_dataframe(results, TABLE_ID)
    bigqueryJob.result()
    print("The Job to write to Big Query is finished")
    return "Executed Function"
Update the main.py file as per your requirements.
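Optionally, before deploying you can run a quick local smoke test. Since the function never uses its request argument, it can be called directly from a small helper script; this script is not part of the original setup, and it requires Application Default Credentials with access to the bucket, the dataset and the table:

# local_test.py (hypothetical helper placed next to main.py)
from main import gcssqlite_to_bq

if __name__ == "__main__":
    # The function ignores its `request` argument, so None is enough here.
    print(gcssqlite_to_bq(None))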
Use the following command to deploy a publicly invocable Cloud Function, changing the parameters as per your needs:
gcloud functions deploy [CLOUD_FUNCTION_NAME] --region [REGION] --entry-point gcssqlite_to_bq --timeout 540 --memory 1024MB --runtime python38 --trigger-http --allow-unauthenticated
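Once deployed, the function can be triggered with a plain HTTP request; the URL below follows the standard pattern for HTTP-triggered functions and reuses the placeholders from the commands above:

curl "https://[REGION]-[PROJECT-ID].cloudfunctions.net/[CLOUD_FUNCTION_NAME]"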
DISCLAIMER: This approach is prone to out-of-memory (OOM) errors depending on the size of your .db file and the complexity of the queries issued against the database. Notice the memory limits offered by Cloud Functions (currently 4 GB at most); if you have a lot of data (on the order of GBs or TBs) in your .db files, a better approach might be to migrate your data to Cloud SQL (SQLite itself is not supported, though) and use federated queries directly from BigQuery.
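For reference, this is roughly what such a federated query could look like, issued through the same BigQuery Python client; the connection ID is a placeholder for an external connection you would create towards the Cloud SQL instance holding the migrated data:

from google.cloud import bigquery

client = bigquery.Client()

# [PROJECT-ID].[REGION].[CONNECTION-ID] is a placeholder for a BigQuery
# external connection pointing at the Cloud SQL instance with the data.
sql = """
SELECT *
FROM EXTERNAL_QUERY(
    "[PROJECT-ID].[REGION].[CONNECTION-ID]",
    "SELECT ArtistId, Name FROM artists;"
)
"""

for row in client.query(sql).result():
    print(row)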