Trying to download all of the google search images using javascript

Question

I'm trying to write a script that downloads all the Google search images, to build a dataset for my ML project. I was following this tutorial to download the high-resolution images, but I get an error which says:

Refused to load the script 'https://ajax.googleapis.com/ajax/libs/jquery/2.2.0/jquery.min.js' because it violates the following Content Security Policy directive: "script-src 'report-sample' 'nonce-Q6xQOKx7e+e0TlGbQFPX3g' 'unsafe-inline'". Note that 'script-src-elem' was not explicitly set, so 'script-src' is used as a fallback

Some help would be greatly appreciated. I run this code by pasting it into the javascript console. Thanks!

var script = document.createElement('script');
script.src = "https://ajax.googleapis.com/ajax/libs/jquery/2.2.0/jquery.min.js";
document.getElementsByTagName('head')[0].appendChild(script);

// grab the URLs
var urls = $('.rg_di .rg_meta').map(function() {
  return JSON.parse($(this).text()).ou;
});

// write the URLs to file (one per line)
var textToSave = urls.toArray().join('\n');
var hiddenElement = document.createElement('a');
hiddenElement.href = 'data:attachment/text,' + encodeURI(textToSave);
hiddenElement.target = '_blank';
hiddenElement.download = 'urls.txt';
hiddenElement.click();

The Content Security Policy is used here to prevent you from doing exactly what you are trying to do. Google has instructed the browser not to load scripts from arbitrary sources, only from those it has explicitly allowed. There are probably browser extensions available that can override these policies locally in your browser, so if you want to use this only for yourself, I'd suggest you do some research in that direction. - CBroe

Tschallacka · Accepted Answer · 2020-04-02T11:08:03

You are using jQuery for something that can be done in native javascript.

document.querySelectorAll accepts CSS selectors much as jQuery does. It does not return an array, but an (in my opinion) unwieldy NodeList.

To get it to iterate properly, I prefer to spread it into an array and then call forEach on it.

[...document.querySelectorAll('.foo')].forEach((element, index) => {
   console.log(element.innerText);
});

<div class="foo">bar</div>
<div class="foo">baz</div>
<div class="foo">bal</div>

Also, the method of getting the data is different now.

First, you need to trigger a click on every image.
This activates JavaScript event handlers that set the href of the image's grandparent element.
You need to let the Google event handlers run first, so we defer the rest of our execution flow so that the Google script can do its thing and update the DOM. We do this with setTimeout().
Once the Google scripts have run and the DOM elements have been updated, our scheduled timeouts get a chance to run, and by then the hrefs have been populated.
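The timing can be illustrated outside the browser with a small sketch. Everything in it is made up for illustration: link is a plain object standing in for the grandparent element, and simulatedClick assumes that the page's handler populates the href in a later task rather than synchronously. It only shows why a read scheduled with setTimeout() sees the updated value while an immediate read does not.

```javascript
// Hypothetical stand-in for the link element; only its href matters here.
const link = { href: '' };
const order = [];

// Assumption: the page's click handler fills in the href in a later task,
// not synchronously inside click().
function simulatedClick(target) {
  setTimeout(() => {
    target.href = 'https://example.com/populated';
    order.push('page handler');
  }, 0);
}

simulatedClick(link);

// Reading immediately is too early: the handler has not run yet.
order.push('immediate read: "' + link.href + '"');

// Reading from a later timeout sees the populated value.
setTimeout(() => {
  order.push('deferred read: "' + link.href + '"');
  console.log(order.join('\n'));
}, 50);
```

The 50 ms delay mirrors the value used in the final script; it is a guess about how long the page needs, not a guaranteed bound.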

Before the click, the href of the link is not yet set.

After the click, the href has been populated. The URL that has been set is:

https://www.google.com/imgres?imgurl=https%3A%2F%2Fwww.researchgate.net%2Fprofile%2FJerome_Droniou%2Fpublication%2F305983658%2Ffigure%2Ffig5%2FAS%3A668650201690119%401536430039650%2FMesh-patterns-for-the-tests-using-the-HMM-method-left-Test-1-right-Test-2.png&imgrefurl=https%3A%2F%2Fwww.researchgate.net%2Ffigure%2FMesh-patterns-for-the-tests-using-the-HMM-method-left-Test-1-right-Test-2_fig5_305983658&tbnid=_UuLNMPCQAT0uM&vet=12ahUKEwjhsu31zcnoAhWbgKQKHR3jAdUQMygAegUIARDTAQ..i&docid=LThLi5REXoitfM&w=428&h=428&q=hmm%20test&ved=2ahUKEwjhsu31zcnoAhWbgKQKHR3jAdUQMygAegUIARDTAQ

In this URL, imgurl= is followed by something starting with https. This is our target image URL, but it is URL-encoded and is part of a larger URL.
So we extract it with some simple substring manipulation.
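As a minimal sketch of that extraction step: extractEncodedImgUrl is a hypothetical helper name, and the sample link is a shortened, made-up URL with the same shape as the one above. Unlike a bare indexOf chain, it returns null when the parameter is missing and copes with imgurl being the last parameter.

```javascript
// Hypothetical helper: returns the still-encoded value of the imgurl
// parameter, or null when the link has no such parameter.
function extractEncodedImgUrl(googleUrl) {
  const key = 'imgurl=';
  const keyPos = googleUrl.indexOf(key);
  if (keyPos === -1) {
    return null;
  }
  const start = keyPos + key.length;
  const end = googleUrl.indexOf('&', start);
  // If no '&' follows, the value runs to the end of the string.
  return end === -1 ? googleUrl.substring(start) : googleUrl.substring(start, end);
}

// Shortened, made-up link in the same shape as the real one.
const sampleLink = 'https://www.google.com/imgres?imgurl=https%3A%2F%2Fexample.com%2Fa.png&imgrefurl=https%3A%2F%2Fexample.com%2F&w=428';
console.log(extractEncodedImgUrl(sampleLink));
// prints: https%3A%2F%2Fexample.com%2Fa.png
```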

The extracted value still contains percent-encoded characters:

https%3A%2F%2Fwww.researchgate.net%2Fprofile%2FJerome_Droniou%2Fpublication%2F305983658%2Ffigure%2Ffig5%2FAS%3A668650201690119%401536430039650%2FMesh-patterns-for-the-tests-using-the-HMM-method-left-Test-1-right-Test-2.png

For that we can use decodeURIComponent() to transform it into a normal URL:

document.write(decodeURIComponent('https%3A%2F%2Fwww.researchgate.net%2Fprofile%2FJerome_Droniou%2Fpublication%2F305983658%2Ffigure%2Ffig5%2FAS%3A668650201690119%401536430039650%2FMesh-patterns-for-the-tests-using-the-HMM-method-left-Test-1-right-Test-2.png'))

We then add this to our array.
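As an alternative to the substring approach (not what the script in this answer does), the standard URL API can parse the link and decode the parameter in one step. This is only a sketch: imageUrlFromGoogleLink is a hypothetical helper name and the sample link is a shortened, made-up URL.

```javascript
// Hypothetical helper: parses the link and returns the decoded imgurl
// parameter, or null when the parameter is absent.
function imageUrlFromGoogleLink(googleUrl) {
  return new URL(googleUrl).searchParams.get('imgurl');
}

// Shortened, made-up link in the same shape as the real one.
const exampleLink = 'https://www.google.com/imgres?imgurl=https%3A%2F%2Fexample.com%2Fa.png&imgrefurl=https%3A%2F%2Fexample.com%2F&w=428';
console.log(imageUrlFromGoogleLink(exampleLink));
// prints: https://example.com/a.png
```

searchParams.get() already returns the decoded value, so no separate decodeURIComponent() call is needed in this variant.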

When we've handled everything, we create the urls file and download it.

var urls = [];
var count = 0; // number of scheduled timeouts that have not fired yet
[...document.querySelectorAll('.rg_i')].forEach((element, index) => {
   // the link that receives the href is the grandparent of the thumbnail
   let el = element.parentElement.parentElement;
   el.click();
   count++;
   // defer reading the href so the page's own handlers can populate it first
   setTimeout(() => {
       let google_url = el.href;

       // take everything between 'imgurl=' and the next '&'
       let start = google_url.indexOf('=' , google_url.indexOf('imgurl'))+1;
       let encoded = google_url.substring(start, google_url.indexOf('&', start));
       let url = decodeURIComponent(encoded);
       urls.push(url);
       console.log(count);
       // the last timeout to fire writes the file and triggers the download
       if(--count === 0) {
          let textToSave = urls.join('\n');
          let hiddenElement = document.createElement('a');
          hiddenElement.href = 'data:attachment/text,' + encodeURI(textToSave);
          hiddenElement.target = '_blank';
          hiddenElement.download = 'urls.txt';
          hiddenElement.click();
       }

   }, 50);

});
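One edge case when building a data: URI from collected URLs: encodeURI() leaves characters such as # unescaped, and a # inside a data: URI starts a fragment, so the text after it could be cut off. A hedged sketch of a stricter variant follows; buildUrlListDataUri is a hypothetical helper name, and the text/plain media type is a choice made for this sketch.

```javascript
// Hypothetical helper: joins the URLs one per line and encodes every
// reserved character, including '#', so the payload stays intact.
function buildUrlListDataUri(urlList) {
  return 'data:text/plain;charset=utf-8,' + encodeURIComponent(urlList.join('\n'));
}

const exampleUrls = ['https://a.example/x#frag', 'https://b.example/y'];
console.log(buildUrlListDataUri(exampleUrls));
// prints: data:text/plain;charset=utf-8,https%3A%2F%2Fa.example%2Fx%23frag%0Ahttps%3A%2F%2Fb.example%2Fy
```

The returned string can be assigned to the href of the hidden link in place of the encodeURI() expression.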
