18
votes

When I browse website A in a normal browser (Chrome) and click a link on it, Chrome immediately downloads a report in the form of a CSV file.

When I checked the server's response headers, I got the following:

Cache-Control:private,max-age=31536000
Connection:Keep-Alive
Content-Disposition:attachment; filename="report.csv"
Content-Encoding:gzip
Content-Language:de-DE
Content-Type:text/csv; charset=UTF-8
Date:Wed, 22 Jul 2015 12:44:30 GMT
Expires:Thu, 21 Jul 2016 12:44:30 GMT
Keep-Alive:timeout=15, max=75
Pragma:cache
Server:Apache
Transfer-Encoding:chunked
Vary:Accept-Encoding

Now I want to download and parse this file using PhantomJS. I set the page's onResourceReceived listener to see whether Phantom receives/downloads the file.

clientRequests.phantomPage.onResourceReceived = function(response) {
    console.log('Response (#' + response.id + ', stage "' + response.stage + '"): ' + JSON.stringify(response));
};

When I make a Phantom request to download the file (that is, page.open('URL OF THE FILE')), I can see in the Phantom log that the file is downloaded. Here are the logs:

"contentType": "text/csv; charset=UTF-8",
    "headers": {
        "name": "Date",
        "value": "Wed, 22 Jul 2015 12:57:41 GMT"
    },
    "name": "Content-Disposition",
    "value": "attachment; filename=\"report.csv\"",
    "status":200,"statusText":"OK"

I received the file and its content, but how do I access the file data? When I print the current PhantomJS page object, I get the HTML of page A, which I don't want. I want the CSV file, which I need to parse using JavaScript.

3
Wtf man, if I were telling my coworkers to upvote my every post, I would have more than 600 points after these few years on StackOverflow and other networks. I was also surprised when I saw 3 upvotes in one hour, but that is good, not bad. If you investigate this problem, many people are facing the same issue, and here I want to see if anyone has found a good solution. – MrD
After writing my comment I looked at your post history and found it unlikely that voting fraud is at play here. Though, I still find it strange that you received 3 upvotes in less than 10 minutes in such low-vote tags as [phantomjs] and [casperjs]. Might be because of [http], but I somehow doubt it. – Artjom B.
Regarding the duplicate, I grabbed the wrong link, but it still contains a viable answer to your question, although it is wrapped in CasperJS code. I'm talking about page.onFileDownload of the PhantomJS fork. – Artjom B.
After days and days of investigation, this is almost impossible to do with PhantomJS. There are some solutions, but they are not so elegant. After spending just 3 hours on CasperJS I did it, so use CasperJS, and not only because of this problem: CasperJS is just more intuitive and easier to work with. – MrD

3 Answers

12
votes

I found a solution for PhantomJS. Reading through this discussion I found a jsfiddle which downloads a url via jQuery's ajax method and encodes the file as base64.

The file I wanted to download was plain text (CSV) so I have removed the encoding functions. My target page also already had jQuery included so I didn't need to inject jQuery into the target page.

My code assumes you have already opened the page you want to download the file from using PhantomJS, and that the page has jQuery in it. In my case I first had to log in to the site in order to get the download link.

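var fs = require('fs');

// `this` is assumed to be the already-opened page object in the calling context.
var page = this;

var result = page.evaluate(function() {
    var out;
    // Synchronous request, so `out` is set before evaluate() returns.
    $.ajax({
        'async' : false,
        'url' : 'fullurltodownload.csv',
        'success' : function(data, status, xhr) {
            out = data;
        }
    });
    return out;
});

// Write the CSV text to disk ('w' = write mode).
fs.write('mydownloadedfile.csv', result, 'w');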
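
Once the CSV text is in `result`, it can be parsed with plain JavaScript. The following is only a minimal sketch, not part of the original solution: `parseCsv` is a hypothetical helper, and it assumes comma separators and double-quote quoting (a doubled `""` inside a quoted field stands for one quote). It is written in ES5 so that it should also run inside PhantomJS.

```javascript
// Hypothetical helper: parse CSV text into an array of rows,
// where each row is an array of field strings.
function parseCsv(text) {
    var rows = [];
    var row = [];
    var field = '';
    var inQuotes = false;

    for (var i = 0; i < text.length; i++) {
        var ch = text.charAt(i);
        if (inQuotes) {
            if (ch === '"') {
                if (text.charAt(i + 1) === '"') {
                    field += '"';      // escaped quote ("")
                    i++;
                } else {
                    inQuotes = false;  // closing quote
                }
            } else {
                field += ch;
            }
        } else if (ch === '"') {
            inQuotes = true;
        } else if (ch === ',') {
            row.push(field);
            field = '';
        } else if (ch === '\n' || ch === '\r') {
            // Treat \r\n as a single line break.
            if (ch === '\r' && text.charAt(i + 1) === '\n') {
                i++;
            }
            row.push(field);
            rows.push(row);
            row = [];
            field = '';
        } else {
            field += ch;
        }
    }
    // Flush the last row when the text does not end with a line break.
    if (field !== '' || row.length > 0) {
        row.push(field);
        rows.push(row);
    }
    return rows;
}
```

For example, `parseCsv('name,qty\r\n"Smith, J",2\n')` returns `[['name', 'qty'], ['Smith, J', '2']]`, so `parseCsv(result)` would give the rows of the downloaded report.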
8
votes

After days and days of investigation, I have to say that there are some solutions:

  • In your evaluate function you can make an AJAX call to download and encode your file, then return the content to the Phantom script
  • You can use one of the custom Phantom libraries available on GitHub

If you need to download a file using PhantomJS, then run away from PhantomJS and use CasperJS. CasperJS is based on PhantomJS, but it has a much better and more intuitive syntax and program flow.

Here is a good post explaining "Why CasperJS is better than PhantomJS". In this post you can find a section about file download.

How to download a CSV file using CasperJS (this works even when the server sends the header Content-Disposition:attachment; filename='file.csv')

Here is a sample CSV file available for download: http://captaincoffee.com.au/dump/items.csv

In order to download this file using CasperJS execute the following code:

var casper = require('casper').create();

casper.start("http://captaincoffee.com.au/dump/", function() {
    this.echo(this.getTitle())
});
casper.then(function(){
    var url = 'http://captaincoffee.com.au/dump/csv.csv';
    require('utils').dump(this.base64encode(url, 'get'));
});

casper.run();

The code above will download the CSV file at http://captaincoffee.com.au/dump/csv.csv and print the result as a base64 string. This way you don't even have to save the data to a file; you already have it as a base64 string.

If you explicitly want to download the file to the file system, you can use the download function which is available in CasperJS.
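
To get the CSV text back out of that base64 string, it has to be decoded. The following is a minimal sketch under stated assumptions, not part of CasperJS: `decodeBase64Utf8` is a hypothetical helper written in plain ES5, and it assumes the file's bytes are UTF-8 (as the Content-Type header in the question indicates).

```javascript
// Hypothetical helper: decode a base64 string into text,
// assuming the underlying bytes are UTF-8.
function decodeBase64Utf8(b64) {
    var alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';
    // Drop padding, whitespace and anything else outside the alphabet.
    var clean = b64.replace(/[^A-Za-z0-9+\/]/g, '');
    var bytes = '';
    var buffer = 0;
    var bits = 0;

    for (var i = 0; i < clean.length; i++) {
        // Shift in 6 bits per character; keep only the low 12 bits.
        buffer = ((buffer << 6) | alphabet.indexOf(clean.charAt(i))) & 0xFFF;
        bits += 6;
        if (bits >= 8) {
            bits -= 8;
            bytes += String.fromCharCode((buffer >> bits) & 0xFF);
        }
    }
    // Reinterpret the byte string as UTF-8 text.
    return decodeURIComponent(escape(bytes));
}
```

Inside the `casper.then` callback this could be used as `var csvText = decodeBase64Utf8(this.base64encode(url, 'get'));`. For example, `decodeBase64Utf8('YSxiCjEsMg==')` returns the two-line text `a,b` / `1,2`.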

1
votes

The previous 2 answers assume you know the URL of the final CSV file in advance. That won't be the case if the link goes to an HTML page that does a JavaScript-computed redirect to the file and you don't want to evaluate that JavaScript outside of PhantomJS. Your options then are:

  1. put PhantomJS behind an upstream proxy, and use that proxy to intercept the download URL (and its expected Cookie and Referer headers). You would have to be careful to positively identify the real download URL and not some random data 'blob' if the page makes binary XMLHttpRequests as well;
  2. instead of PhantomJS, use Headless Chrome, which can automatically save downloaded files (or Firefox with PyVirtualDisplay, which can also be set to do this, or wait for Headless Firefox), and monitor the downloads directory. You would have to work out for yourself when the download has completed, or use an upstream proxy to monitor it for completion. However, Headless Chrome/Firefox cannot currently be set to ignore SSL certificates, so if the site goes "secure" it is much more difficult to monitor the requests of Headless Chrome/Firefox than those of PhantomJS, at least until Chromium issue 721739 is fixed. You could watch a CONNECT request, but if it is kept alive you have no way of knowing for sure that a transfer has finished;
  3. put PhantomJS behind an upstream proxy that changes all unknown content types to text/plain and deletes Content-Disposition headers, so you can read the file from PhantomJS in the normal way. That should work for a CSV file but won't work for binaries with 0-bytes in them.
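
The header rewriting in option 3 could look roughly like the sketch below. It is not tied to any particular proxy: `rewriteResponseHeaders` and `DISPLAYABLE_TYPES` are hypothetical names, the list of "known" types is an assumption to adjust for your site, and wiring the function into a real proxy is left out.

```javascript
// Hypothetical list of content types treated as "known" (displayable as-is).
var DISPLAYABLE_TYPES = [
    'text/html', 'text/plain', 'text/css',
    'application/xhtml+xml', 'application/xml',
    'application/javascript', 'application/json'
];

// Return a copy of the response headers with Content-Disposition removed and
// any unknown Content-Type replaced by text/plain (parameters such as charset are kept).
function rewriteResponseHeaders(headers) {
    var out = {};
    Object.keys(headers).forEach(function (name) {
        var lower = name.toLowerCase();
        if (lower === 'content-disposition') {
            return; // drop it so the body is rendered instead of downloaded
        }
        if (lower === 'content-type') {
            var parts = headers[name].split(';');
            var type = parts[0].trim().toLowerCase();
            var isImage = type.indexOf('image/') === 0;
            if (DISPLAYABLE_TYPES.indexOf(type) === -1 && !isImage) {
                parts[0] = 'text/plain';
                out[name] = parts.join(';');
                return;
            }
        }
        out[name] = headers[name];
    });
    return out;
}
```

With the headers from the question, `Content-Type: text/csv; charset=UTF-8` would become `text/plain; charset=UTF-8` and the `Content-Disposition` header would be dropped, while an HTML response would pass through unchanged.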

The first of these options (PhantomJS + upstream proxy) is made easier if the upstream proxy can monitor the Accept header that PhantomJS sends to the remote site. At least in PhantomJS version 2.1.1, main requests have Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8, stylesheet requests have Accept: text/css,*/*;q=0.1, and all other requests (images, scripts, XMLHttpRequest) default to Accept: */* although this can be overridden by sites that use XMLHttpRequest.setRequestHeader(). Therefore if the upstream proxy sees a request with an Accept header containing text/html, and passing on this request to the server results in a CSV file or other non-HTML document, then there's a good chance this is the one to save.
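
The Accept-header heuristic above can be sketched as a small predicate that a proxy might call for each request/response pair. `isLikelyDownload` is a hypothetical name, and treating only text/html and application/xhtml+xml as "HTML" is an assumption.

```javascript
// Hypothetical helper: true when the request asked for HTML (a "main" request)
// but the server answered with some other, non-HTML content type.
function isLikelyDownload(requestAccept, responseContentType) {
    var accept = (requestAccept || '').toLowerCase();
    // Strip parameters such as "; charset=UTF-8" from the response type.
    var type = (responseContentType || '').toLowerCase().split(';')[0].trim();

    var wantedHtml = accept.indexOf('text/html') !== -1;
    var gotHtml = type === 'text/html' || type === 'application/xhtml+xml';

    return wantedHtml && type !== '' && !gotHtml;
}
```

For the CSV report in the question, `isLikelyDownload('text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'text/csv; charset=UTF-8')` is `true`, whereas a request sent with `Accept: */*` (an image, script or XMLHttpRequest) gives `false`.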