Accessing PDF webpages by PhantomJS (Watir, GhostDriver)

Question

I am using watir-webdriver with chromedriver (on Mac OS X) for visual testing. Once that was working, I switched to headless testing with PhantomJS.

require 'watir-webdriver'

# Starts the browser selected by @browsing_type ('visual' or 'headless').
def browser_init
  client = Selenium::WebDriver::Remote::Http::Default.new
  client.timeout = @browser_timeout # HTTP client timeout, in seconds

  case @browsing_type
  when 'visual'
    @b = Watir::Browser.new :chrome, :http_client => client
  when 'headless'
    @b = Watir::Browser.new :phantomjs, :http_client => client
  end
end

Here is my problem. Some of the web pages that I test are either served entirely as PDF (with a URL ending in .pdf) or contain embedded PDF content. These pages are not reflected properly in the :phantomjs case: @b.title and @b.url still point to the previously visited page. In the :chrome case, the title and URL of a PDF page are accessible and hence verifiable for testing purposes.

Since PhantomJS is well known for its ability to produce PDF screenshots of the webpages, I am in doubt that it is not able to open PDF page on the web.

Am I right to understand that there is no PDF plugin for PhantomJS, or am I doing something wrong? I would greatly appreciate any advice about headless testing of PDF pages in either case.

It doesn't seem to be possible, and I doubt that PhantomJS 2 will change anything about it, although I haven't tried it yet. — Artjom B.
Thanks, ArtjomB.! Do you know of any other tool that would work? What about headless/Xvfb? I know that it should not run on Mac, but I have the option of switching to a Linux platform. — tudeng

Chuck van der Linden · Accepted Answer · 2015-02-16T23:53:06

The answer to your question really depends on what you mean by "accessing". If your purpose is to render the PDF and capture an image, you may be outside what PhantomJS can do and need a real browser. On the other hand, if your purpose is to parse the PDF and gain access to the text within the document, then you may be able to follow this example, use some JS libraries for dealing with PDF files, and run them via PhantomJS.
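If "accessing" only means confirming that a URL really serves a PDF, one possible workaround that bypasses PhantomJS entirely is to fetch the URL from Ruby and inspect the response. The sketch below is not part of the original answer: the helper names looks_like_pdf? and pdf_at? are hypothetical, and it assumes the page is reachable without the browser's cookies or session and without redirects.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: true when a response looks like a PDF document.
# Checks the Content-Type header and the "%PDF-" signature that PDF files start with.
def looks_like_pdf?(content_type, body)
  type_ok  = content_type.to_s.downcase.start_with?('application/pdf')
  magic_ok = body.to_s.b.start_with?('%PDF-'.b)
  type_ok && magic_ok
end

# Hypothetical helper: fetch the URL directly (no browser) and run the check.
# Assumption: no redirects and no login; the browser session is not reused.
def pdf_at?(url)
  response = Net::HTTP.get_response(URI.parse(url))
  response.is_a?(Net::HTTPSuccess) &&
    looks_like_pdf?(response['Content-Type'], response.body)
end
```

In a test, the URL of the link that should lead to the PDF can be read from the previous page and passed to pdf_at? instead of navigating to it with @b.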

As for PhantomJS being able to render a PDF, you made the following assumption:

Since PhantomJS is well known for its ability to produce PDF screenshots of the webpages, I am in doubt that it is not able to open PDF page on the web.

Formatting output as an embedded image using the PDF spec is one thing. Rendering a PDF (which includes images, text, text effects, backgrounds, etc.) is another, much more difficult thing. There are a lot of programs that can 'write' PDF format but cannot 'read' PDF files.

The main purpose of PhantomJS is to allow the execution of JavaScript code without the overhead of a browser and page rendering. You can use this to do things like test JS code that is part of an AJAX-driven web page. You can also use it as an interpreter to execute JS code. I believe (but have not found confirmation) that, for performance reasons, the only time PhantomJS actually incurs the overhead required to render web pages is when it creates screenshots. Also, since it is designed to be headless, there is no support for plugins, which is how rendering of PDF is typically supported by most browsers. That would be the major roadblock preventing you from actually rendering and then capturing screenshots of PDF 'pages'. For that you would need a real browser and an Adobe or other (Foxit?) plugin that supports PDF. Then, to run headless, you would need to use something like Xvfb.
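The Xvfb route could look roughly like the following on Linux. This is a minimal sketch, not a tested setup: the helper names xvfb_command and with_xvfb are hypothetical, the display number :99 and the screen geometry are arbitrary assumptions, and it assumes the Xvfb binary is installed and on the PATH.

```ruby
# Hypothetical helper: command line that starts a virtual X display.
def xvfb_command(display = ':99', screen = '1280x1024x24')
  ['Xvfb', display, '-screen', '0', screen]
end

# Hypothetical helper: run the block with DISPLAY pointing at a temporary
# Xvfb server, then stop the server and restore the previous DISPLAY.
def with_xvfb(display = ':99')
  previous = ENV['DISPLAY']
  pid = Process.spawn(*xvfb_command(display))
  sleep 1 # crude wait for the server to come up (assumption: 1 s is enough)
  ENV['DISPLAY'] = display
  yield
ensure
  ENV['DISPLAY'] = previous
  if pid
    Process.kill('TERM', pid)
    Process.wait(pid)
  end
end

# Usage sketch (requires the watir-webdriver gem and chromedriver):
#
#   with_xvfb do
#     require 'watir-webdriver'
#     b = Watir::Browser.new :chrome
#     b.goto 'http://example.com/some.pdf' # placeholder URL
#     puts b.url
#     b.close
#   end
```

Because the real browser runs inside the virtual display, the 'visual' branch of the question's browser_init could be reused unchanged within the block.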
