
I’m writing a Node.js application that downloads entire websites using the Unix “wget” command, but I have a problem with some URLs inside the downloaded pages: .html is appended to the file names, e.g.

<img src="images/photo.jpeg.html"> or <script src="js/scripts.js.html">

The code I’m using is the following:

    var util = require('util'),
        exec = require('child_process').exec,
        child,
        url = 'http://www.example.com/';

    child = exec('wget --mirror -p --convert-links --html-extension -e robots=off -P /destination_folder/ ' + url,
      function (error, stdout, stderr) {
        console.log('stdout: ' + stdout);
        console.log('stderr: ' + stderr);
        if (error !== null) {
          console.log('exec error: ' + error);
        }
      });

N.B. If I run this command (wget --mirror -p --html-extension --convert-links -e robots=off -P . http://www.example.com) directly in the Unix shell, it works correctly.

Edit: this is the log returned after running the nodejs script:

--2017-04-04 11:49:49--  http://www.example.com/css/style.min.css
Reusing existing connection to www.example.com:80.
HTTP request sent, awaiting response... 304 Not Modified
File ‘/destination_folder/www.example.com/css/style.min.css.html’ not modified on server. Omitting download.

FINISHED --2017-04-04 11:50:11--
Total wall clock time: 22s
Downloaded: 50 files, 1.2M in 1.4s (855 KB/s)
/destination_folder/www.example.com/css/style.min.css.html: No such file or directory
Converting links in /destination_folder/www.example.com/css/style.min.css.html... nothing to do.
exec error: Error: stderr maxBuffer exceeded

I don’t understand where the problem is. Could you help me, please?

Thank you

You should probably describe in more detail how you've determined that it isn't working. Is error set (if so, what is the message)? Is there stderr output (if so, what does it contain)? Are neither of these the case but you're still not seeing anything in your destination directory? Something else? - mscdex
Is Node.JS executing the same version of wget that you execute from command line? - tiblu
@tiblu I think it is the same version. How can I check? - S Madry
@SMadry exec('wget --version') and run the same in terminal. - tiblu
@tiblu It is the same: 1.19.1 - S Madry

1 Answer


exec collects the child's stdout and stderr in in-memory buffers, and the size of those buffers is limited (the maxBuffer option).
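
As a rough illustration of that limit: exec accepts a maxBuffer option, and its default depends on the Node.js version. The sketch below keeps exec but raises the limit. The helper name execWithLargerBuffer and the 10 MiB value are assumptions chosen for illustration, not something from the question.

```javascript
const { exec } = require('child_process');

// Hypothetical helper: same as exec, but with a larger output buffer.
// The 10 MiB figure is an arbitrary assumption; pick what fits your logs.
function execWithLargerBuffer(command, callback) {
  return exec(command, { maxBuffer: 10 * 1024 * 1024 }, callback);
}

// Possible usage with the command from the question (not executed here):
// execWithLargerBuffer(
//   'wget --mirror -p --convert-links --html-extension -e robots=off ' +
//   '-P /destination_folder/ http://www.example.com/',
//   function (error, stdout, stderr) {
//     if (error !== null) console.log('exec error: ' + error);
//   });
```

This only moves the ceiling: a large enough mirror can still produce more log output than any fixed limit.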

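If the command produces a lot of output, the buffer may run out of space. wget writes its log to stderr, which matches the "stderr maxBuffer exceeded" error in your log. Try using spawn instead of exec. For reference, see: Difference between spawn and exec of Node.js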
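
A minimal sketch of the spawn approach, assuming you only need to see the output as it arrives rather than collect it all at once. The helper name runStreaming and its onChunk callback are hypothetical, introduced here for illustration.

```javascript
const { spawn } = require('child_process');

// Hypothetical helper: runs a command and hands each chunk of output to
// onChunk as it arrives, so nothing accumulates in a size-limited buffer.
function runStreaming(command, args, onChunk) {
  return new Promise((resolve, reject) => {
    const child = spawn(command, args);
    child.stdout.on('data', (chunk) => onChunk('stdout', chunk));
    child.stderr.on('data', (chunk) => onChunk('stderr', chunk));
    child.on('error', reject);                   // e.g. command not found
    child.on('close', (code) => resolve(code));  // resolves with the exit code
  });
}

// Possible usage with the options from the question (not executed here):
// runStreaming('wget',
//   ['--mirror', '-p', '--convert-links', '--html-extension',
//    '-e', 'robots=off', '-P', '/destination_folder/', 'http://www.example.com/'],
//   (name, chunk) => process.stdout.write(name + ': ' + chunk))
//   .then((code) => console.log('wget exited with code ' + code));
```

Note that spawn takes the arguments as an array and does not go through a shell by default, so the URL does not need to be concatenated into a command string.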