4
votes

With my Firefox browser I log in to a download site and click one of its query buttons. A small window named "Opening report1.csv" pops up, and I can choose 'Open with' or 'Save File'. I save the file.

For this action Live HTTP headers shows me:

https://myserver/ReportPage?download&NAME=ALL&DATE=THISYEAR

GET /ReportPage?download&NAME=ALL&DATE=THISYEAR HTTP/1.1
Host: myserver
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.8,de-DE;q=0.5,de;q=0.3
Accept-Encoding: gzip, deflate, br
Referer: https://myserver/ReportPage?4&NAME=ALL&DATE=THISYEAR
Cookie: JSESSIONID=88DEDBC6880571FDB0E6E4112D71B7D6
Connection: keep-alive
Upgrade-Insecure-Requests: 1

HTTP/1.1 200 OK
Date: Sat, 30 Dec 2017 22:37:40 GMT
Server: Apache-Coyote/1.1
Last-Modified: Sat, 30 Dec 2017 22:37:40 GMT
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Pragma: no-cache
Cache-Control: no-cache, no-store
Content-Disposition: attachment; filename="report1.csv"; filename*=UTF-8''report1.csv
Content-Type: text/csv
Content-Length: 332369
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive

Now I try to emulate this with requests.

$ python3
>>> import requests
>>> from lxml import html
>>>
>>> s = requests.Session()
>>> s.verify = './myserver.crt'  # certificate of myserver for https
>>>
>>> # get the login web page to enter username and password
... r = s.get( 'https://myserver' )
>>>
>>> # Get the URL for logging in. It's the action attribute of the login form.
... # We use XPath.
... tree = html.fromstring(r.text)
>>> loginUrl = 'https://myserver/' + list(tree.xpath("//form[@id='id4']/@action"))[0]
>>> print( loginUrl )   # it contains a session-id
https://myserver/./;jsessionid=77EA70CB95252426439097E274286966?0-1.loginForm
>>>
>>> # logging in with username and password
... r = s.post( loginUrl, data = {'username':'ingo','password':'mypassword'} )
>>> print( r.status_code )
200
>>> # try to get the download file using url from Live HTTP headers
... downloadQueryUrl = 'https://myserver/ReportPage?download&NAME=ALL&DATE=THISYEAR'
>>> r = s.get( downloadQueryUrl )
>>> print( r.status_code)
200
>>> print( r.headers )
{'Connection': 'Keep-Alive',
'Date': 'Sun, 31 Dec 2017 14:46:03 GMT',
'Cache-Control': 'no-cache, no-store',
'Keep-Alive': 'timeout=5, max=94',
'Transfer-Encoding': 'chunked',
'Expires': 'Thu, 01 Jan 1970 00:00:00 GMT',
'Pragma': 'no-cache',
'Content-Encoding': 'gzip',
'Content-Type': 'text/html;charset=UTF-8',
'Server': 'Apache-Coyote/1.1',
'Vary': 'Accept-Encoding'}
>>> print( r.url )
https://myserver/ReportPage?4&NAME=ALL&DATE=THISYEAR
>>>

The request succeeds, but I don't get the file download. There is no "Content-Disposition: attachment;" entry in the headers. I only get the page the query starts from, i.e. the page from the Referer.

Does this have something to do with the session cookie? requests seems to manage it automagically. Is there special handling for CSV files? Do I have to use streams? Is the download URL shown by Live HTTP Headers the right one? Maybe it is created dynamically?

How can I get a web page with "Content-Disposition: attachment;" from myserver and download its file with requests?

1
maybe you have to add some headers to the request - e.g. "User-Agent" - furas
did you check r.text? Maybe there is useful information - e.g. it could be a warning message. You could write it to a file and open that file in a browser. - furas
@furas thanks for pointing this out. I will look at this and try it. - Ingo
The application may be checking the path you take, so, after login, instead of going straight to the URL you previously sniffed for the report, try to go to the report page first, by following links on the webpage you got after login, and then do the appropriate action (filling forms, etc...) to start downloading your file. Have a look at the mechanize Python module. - Patrick Mevzek
@furas yes, I scanned r.text but cannot find anything; I will inspect it more carefully. Next I will try to extract the downloadQueryUrl from the download page. Maybe it is created dynamically. - Ingo

1 Answer

2
votes

I got it. @Patrick Mevzek pointed me in the right direction. Thank you for this.

After login I do not stay on the first logged-in page and call the query from there. Instead I request the report page, extract the query URL from it, and request that query URL. Now I get a response with "Content-Disposition: attachment;" in its header. It's then simple to print its text to stdout. I prefer that because I can redirect the output to any file. Info messages go to stderr so they don't mess up the redirected output. A typical call is ./download >out.csv.

For completeness, here is the script template, without any error checking, to clarify how it works.

#!/usr/bin/python3

import requests
import sys
from lxml import html

s = requests.Session()
s.verify = './myserver.crt'  # certificate of myserver for https

# get the login web page to enter username and password
r = s.get( 'https://myserver' )

# Get the URL for logging in. It's the action attribute of the login form.
# We use XPath.
tree = html.fromstring(r.text)
loginUrl = 'https://myserver/' + tree.xpath("//form[@id='id4']/@action")[0]

# logging in with username and password and go to ReportPage with queries
r = s.post( loginUrl, data = {'username':'ingo','password':'mypassword'} )
queryUrl = 'https://myserver/ReportPage?NAME=ALL&DATE=THISYEAR'
r = s.get( queryUrl )

# Get the download link for this query from the page. It's the link with
# the text 'Download (UTF8)'.
tree = html.fromstring( r.text )
downloadUrl = 'https://myserver/' + tree.xpath("//a[.='Download (UTF8)']/@href")[0]

# get the download file
r = s.get( downloadUrl )
if r.headers.get('Content-Disposition'):
    print( 'Downloading ...', file=sys.stderr )
    print( r.text )

# log out
r = s.get( 'https://myserver/logout' )
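
Two side notes on the script above. First, instead of hardcoding stdout, the server-suggested filename can be recovered from the Content-Disposition header itself. A minimal stdlib sketch (the helper name is my own, not part of the script above):

```python
from email.message import Message

def filename_from_disposition(value):
    """Extract the filename parameter from a Content-Disposition header value.

    Wrapping the value in an email.message.Message lets the stdlib do the
    header-parameter parsing for us.
    """
    msg = Message()
    msg['Content-Disposition'] = value
    return msg.get_filename()

# Example with the header value the server sent in the question:
print(filename_from_disposition('attachment; filename="report1.csv"'))  # report1.csv
```

Second, the question's hunch about streams is reasonable for large downloads: requests supports `s.get(downloadUrl, stream=True)` together with `r.iter_content(chunk_size=...)`, so the body can be written to disk chunk by chunk instead of being held in memory as `r.text`. For a 332 KB CSV like this one it makes no practical difference.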