
I'm currently modifying the offline-dokuwiki[1] shell script to fetch the latest documentation for an application, so that it can be embedded automatically within instances of that application. This works quite well, except that in its current form it grabs three versions of each page:

  1. The full page including header and footer
  2. Just the content without header and footer
  3. The raw wiki syntax

I'm only actually interested in version 2. It is linked from the main pages by an HTML <link> tag in the <head>, like so:

<link rel="alternate" type="text/html" title="Plain HTML" 
href="/dokuwiki/doku.php?do=export_xhtml&amp;id=documentation:index" /> 

The URL is the same as that of the main wiki page, except that it contains 'do=export_xhtml' in the query string. Is there a way of instructing wget to download only these versions, or to automatically append '&do=export_xhtml' to every link it follows? If so, that would be a great help.
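For illustration, a single page in that plain-HTML form can be fetched by hand with wget simply by keeping the export_xhtml parameter in the URL. This is only a manual check, not the automated recursion I'm after, and the hostname and page id below are placeholders:

wget -O index.html \
  "http://www.example.com/dokuwiki/doku.php?do=export_xhtml&id=documentation:index"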

[1] http://www.dokuwiki.org/tips:offline-dokuwiki.sh (author: samlt)

I'm beginning to suspect that the solution might be to download just the content page, then parse it with sed to extract all the URLs with 'do=export_xhtml' appended to each, then recursively do the same with every URL so extracted. But I'd prefer it if wget could be instructed to fetch just the plain HTML versions in the first place, or to automatically add the query string to every URL it follows, if possible. – AntonChanning
You might get more feedback if you include a tag for xml or xhtml AND some sort of processing tool: xmlstarlet, xslt, awk, perl (NOT sed; there are dozens of posts about why sed can't do HTML parsing, ignore them at your peril! ;-) ) – shellter
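For reference, here is a rough sketch of the rewrite-and-recurse approach described in the first comment, using grep rather than sed to pull links out of the HTML. The hostname and output filenames are placeholders, it requires bash for the ${id//:/_} substitution, and it only goes one level deep; a real crawler would keep a list of visited ids and loop until no new ones appear.

# Fetch the plain-HTML export of the start page (placeholder hostname and id).
wget -q -O index.html \
  "http://www.example.com/dokuwiki/doku.php?do=export_xhtml&id=documentation:index"

# Extract the page ids of internal doku.php links, then fetch each one with
# do=export_xhtml appended to the query string.
grep -o 'doku.php?id=[^"&]*' index.html |
  sed 's/^doku.php?id=//' | sort -u |
  while read -r id; do
    wget -q -O "${id//:/_}.html" \
      "http://www.example.com/dokuwiki/doku.php?do=export_xhtml&id=${id}"
  done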

1 Answer


DokuWiki accepts the do parameter as an HTTP header as well. You could run wget with the parameter --header "X-DokuWiki-Do: export_xhtml".
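A minimal sketch of how that might be combined with a recursive fetch, assuming the wiki honors the header on every request during the crawl; the URL is a placeholder, and the mirroring flags (-r, -np, -k, -E) are just the usual wget choices rather than something this answer prescribes:

# Every request carries the header, so each page should come back as the
# plain XHTML export without header and footer.
wget -r -np -k -E \
  --header "X-DokuWiki-Do: export_xhtml" \
  "http://www.example.com/dokuwiki/doku.php?id=documentation:index"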