I'd like to scrape a JavaScript-driven website using Scrapy + Splash on Google App Engine. Splash is distributed as a Docker image. Is there any way to use it within Google App Engine? App Engine itself runs on a Docker image, but I'm not sure how to load and access a secondary image (which is how Splash is used). Here are the Splash install instructions.
2 Answers
You can use Custom Runtimes in the App Engine Flexible Environment.
Custom runtimes let you build apps that run in an environment defined by a Dockerfile. By using a Dockerfile, you can use languages and packages that are not part of the Google Cloud Platform and use the same resources and tooling that are used in the App Engine flexible environment.
See About Custom Runtimes for more details. Please note that when you use a custom runtime, your application code has to handle some flexible environment life-cycle and health-check requests. See how to build a custom runtime for more information.
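To give an idea of what handling those requests can look like, here is a minimal sketch using only the Python standard library. The paths in `HEALTH_PATHS` and `LIFECYCLE_PATHS` and the port 8080 are assumptions based on the flexible environment's usual defaults; check them against your own `app.yaml`. The `route` and `serve` helpers are names made up for this example.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumed paths: legacy and split health checks, plus life-cycle hooks.
# Adjust these to whatever your app.yaml actually configures.
HEALTH_PATHS = {"/_ah/health", "/liveness_check", "/readiness_check"}
LIFECYCLE_PATHS = {"/_ah/start", "/_ah/stop"}


def route(path):
    """Map a request path to a (status code, body) pair."""
    if path in HEALTH_PATHS:
        return 200, "ok"
    if path in LIFECYCLE_PATHS:
        return 200, ""
    return 404, "not found"


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Ignore any query string when matching the path.
        status, body = route(self.path.split("?", 1)[0])
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    """Blocking call: listen on the port the container is expected to expose."""
    HTTPServer(("", port), Handler).serve_forever()
```

Calling `serve()` from the container's entry point starts the listener; in a real app the same routes would normally live inside whatever web framework the service already uses.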
Deploying the Splash service separately is the proper way to accomplish this.
I went ahead and tested a few different setups, and the only approach that let me run Splash on App Engine was to deploy it as a custom runtime, setting forwarded_ports so that I could connect directly to one of the service’s instances through its IP address.
This is clearly not an adequate solution: it comes with many limitations and, in the end, amounts to using Google Compute Engine without all the control it provides.
My suggestion is that you only deploy the Scrapy service of your application to App Engine, and leave the Splash service somewhere else, like in a GCE instance.
Once you have that, all you need to do is reserve a static IP address for the instance and connect to Splash from your App Engine app through that address.
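As a rough sketch of that last step, assuming you use the scrapy-splash plugin, the Scrapy project's settings could point at the instance like this. The address `203.0.113.10` is a placeholder from the documentation IP range, and port 8050 assumes Splash is running on its default HTTP port; the middleware entries follow the scrapy-splash README.

```python
# settings.py (sketch): point Scrapy at a Splash instance outside App Engine.
# SPLASH_HOST is a placeholder, not a real server; use your reserved static IP.
SPLASH_HOST = "203.0.113.10"
SPLASH_PORT = 8050  # assumed: Splash's default HTTP port
SPLASH_URL = f"http://{SPLASH_HOST}:{SPLASH_PORT}"

# Middleware configuration as recommended by the scrapy-splash README.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}
# Makes request deduplication aware of Splash arguments.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
```

Spiders would then yield `SplashRequest` objects from `scrapy_splash` instead of plain `Request` objects. You will probably also need a firewall rule on the GCE side that allows traffic to the Splash port, ideally restricted to your app's traffic.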