
Background

I have implemented a Snowflake data pipeline (S3 log file > SNS > pipe > stage table > stream > task > stored proc/UDF > final table) in our production Snowflake database.
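
Roughly, the setup looks like this (object names, schemas, and the file format below are placeholders for illustration, not the real production definitions):

    -- Pipe: auto-ingest the S3 log files (via the SNS notification) into the stage table
    create pipe raw.logs_pipe auto_ingest = true as
      copy into raw.stage_logs
      from @raw.s3_log_stage
      file_format = (type = 'JSON');

    -- Stream: track changes on the stage table
    create stream raw.stage_logs_stream on table raw.stage_logs;

    -- Task: periodically call the stored proc that moves stream records into the final table
    create task raw.process_logs_task
      warehouse = etl_wh
      schedule = '5 minute'
    when system$stream_has_data('raw.stage_logs_stream')
    as
      call raw.process_stage_logs();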

While things were working on a smaller scale in our dev database, it seems the production pipeline has stopped working given the amount of data (6,416,006,096 records and growing) attempting to flow through it.

Problem

After some investigation so far, it looks like s3 log > SNS > pipe > stage table is OK, but things are stuck where the task retrieves records from the stream... The stream is NOT stale. I have spent a lot of time reading the docs on streams and have not found any help in there for my current issue.

It looks like the stream has too much data to return -- when I try a count(*) or a select * with limit 10 from the stream, it has not returned after 8 minutes (and counting)...
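
For concreteness, these are the kinds of probes that hang (the stream name is a placeholder):

    select count(*) from stage_logs_stream;        -- still running after 8+ minutes
    select * from stage_logs_stream limit 10;      -- same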

Even if I could limit the data returned, I have found through experimenting that once you consume from the stream within a transaction, you lose all of its changes even if you don't want them all (i.e., if you use a where clause to filter)...
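
For example (placeholder names), a filtered DML inside a transaction still advances the stream past all of its changes, not just the rows the where clause kept:

    begin;

    insert into final_table
      select * from stage_logs_stream
      where event_type = 'login';   -- filter keeps only some of the rows...

    commit;                          -- ...but the commit advances the stream offset past all of them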

Question

Is there any way to get anything to return from the stream without resetting it?

Is there any way to chunk the results from the stream without losing all changes within a transaction?

Is there some undocumented limit with streams--have I hit that?

Concern

I don't want to shut down the data pipeline because that means I may have to start all over, but I guess I will have to if I get no answers (I have contacted support too but have yet to hear back). Given that streams and tasks are still only in preview, I guess this shouldn't be a surprise, but I was told by Snowflake they would be GA by now.

I've personally been using streams and tasks without issue for months now. I've found you can SELECT without problems and without clearing the stream. The transactions come into play once you have DML involved (like running an UPDATE, INSERT, or DELETE on a table while referencing your stream), which will clear your stream. Are you able to increase your warehouse size to SELECT the contents of the stream? The selection from the stream might be a good support case directly to Snowflake. – Brock
Thanks for the response - I have a task that does involve DML (an insert while referencing the stream, which I was hoping to chunk but realize I can't). I got the select to work by increasing the warehouse size to 4XL (ugh). I have a support case open, but it might just be reality with the size of the data. – rakcs

1 Answer


Is there any way to get anything to return from the stream without resetting it?

You should be able to select from the stream without resetting it. Only using it in a DML statement (e.g., insert into mytable select * from mystream) will reset it.
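
A quick sketch of the difference (table and stream names are made up):

    -- A plain select does not advance the stream offset
    select * from mystream;

    -- DML that reads from the stream advances its offset when the transaction commits
    insert into mytable
      select * from mystream;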

Is there anyway to chunk the results from the stream without losing all changes within a transaction?

No, streams don't support chunking.

Is there some undocumented limit with streams--have I hit that?

I don't think there are undocumented limits; streams are essentially ranges on a table, so if there's a lot of data in the underlying table, it could take a while to scan it.

Some other considerations (a rough sketch of each follows the list):

  • Are you using the right-sized warehouse? If you have a lot of data in the stream and a lot of DML consisting of updates, deletes, and inserts, you might want to reconsider your warehouse size. I believe Snowflake does some partition-level comparisons to reconcile added and deleted data.

  • Can you "tighten" up how often you read from the stream so that there's less data to process each time?

  • Depending on the type of data you're interested in, Snowflake offers an append-only stream type, which only shows added data. This makes scanning much faster.
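
A rough sketch of those adjustments (names and sizes are placeholders, not recommendations tuned to your volumes):

    -- Temporarily use a larger warehouse while working off the backlog
    alter warehouse etl_wh set warehouse_size = 'XLARGE';

    -- Run the task more often so each run has less stream data to process
    alter task process_logs_task suspend;
    alter task process_logs_task set schedule = '1 minute';
    alter task process_logs_task resume;

    -- Append-only stream: only tracks inserted rows, so scanning is cheaper
    -- (note: creating a new stream starts it from a fresh offset)
    create stream stage_logs_stream_ao
      on table stage_logs
      append_only = true;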