
I want to scrape a website. The website I want to scrape doesn’t have an API.

What I want to do is this (in Python):

import requests

with requests.Session() as conn:
    url = "http://demo.ilias.de/login.php"
    auth = {
        "username": "benjamin",
        "password": "iliasdemo"
    }
    conn.post(url, data=auth)
    response = conn.get(url)
    do_work(response)

When trying to do the same thing with HTTPoison, the website responds with "Please enable session cookies in your browser!". Elixir code:

HTTPoison.post "http://demo.ilias.de/login.php", 
  "{\"username\":\"benjamin\", \"password\":\"iliasdemo\"}"

I guess the problem is with cookies.
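
One difference between the two snippets shows up before cookies even come into play: requests encodes data=auth as a form body, while the HTTPoison call sends a raw JSON string. A minimal sketch using only the Python standard library (nothing here contacts the demo site) prints the two payloads side by side:

```python
import json
from urllib.parse import urlencode

auth = {"username": "benjamin", "password": "iliasdemo"}

# What requests sends for data=auth: an application/x-www-form-urlencoded body.
form_body = urlencode(auth)

# Roughly what the HTTPoison call sends: a JSON string as the raw request body.
json_body = json.dumps(auth)

print(form_body)
print(json_body)
```

If the login form expects form fields, the Elixir call would presumably need a form-encoded body as well. Whether that alone removes the cookie message is an assumption, not something checked against the site.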

Update 1: It seems that not all cookies are saved. :hackney.cookies(headers) (where headers comes from %HTTPoison.Response{headers: headers}) does not return some of the cookies (e.g. authchallenge) that I see both in my browser and in the response to the Python code above. Could it be that hackney doesn't actually post anything?

I'm not entirely sure, because I have never used HTTPoison directly, but does line 108 in this test file help you at all? - ham-sandwich
@ham-sandwich In the test file they send a cookie. At the time of my request I don't have any cookies to send; I need to store them somehow first. I think there is a library for hackney that does that, I just don't know how to use it. - koolhazcker

1 Answer


I had a similar problem:

I make a GET request to a server API, and it responds with a 301 redirect to the same location and a "Set-Cookie" header carrying a sessionId. If you follow the redirect without sending the cookie back, the server responds with the same redirect and a new sessionId cookie, and this pattern repeats for as long as you never send the cookie back. If you do send the cookie back, the server responds with a 200 status code and the data you asked for.

The problem is that hackney, and consequently HTTPoison, cannot handle this scenario. It does have a :follow_redirect option that follows redirects when set, but it does not pick up the cookies and send them back between redirects.
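
The loop is easy to reproduce without a network. The following is a small simulation in Python, not real client code: fake_server is a hypothetical stand-in for the server described above, and the cookie value and path are made up. It compares following redirects with and without sending the cookie back.

```python
def fake_server(cookie_header):
    """Hypothetical stand-in for the server: redirects until it sees a session cookie."""
    if cookie_header and "sessionId=" in cookie_header:
        return 200, {}, "data"
    # No session cookie yet: redirect to the same place and hand out a cookie.
    return 301, {"Location": "/data", "Set-Cookie": "sessionId=abc123; Path=/"}, ""


def fetch(resend_cookies, max_redirects=5):
    """Follow redirects by hand, optionally sending received cookies back."""
    cookie_header = None
    for _ in range(max_redirects + 1):
        status, headers, body = fake_server(cookie_header)
        if status != 301:
            return status, body
        if resend_cookies:
            # Keep only the name=value part of Set-Cookie and drop the attributes.
            cookie_header = headers["Set-Cookie"].split(";", 1)[0]
    return status, body  # still being redirected after max_redirects hops


print(fetch(resend_cookies=False))  # (301, '')
print(fetch(resend_cookies=True))   # (200, 'data')
```

Without the cookie the client runs out of redirects and ends on a 301. With it, the second request succeeds.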

All browsers I tried (Firefox, Chrome, IE) were able to handle this scenario. Python and wget did the job as well.

Anyway, to keep it short, I wrote a workaround for my case, which may give some ideas to others with similar problems:

defmodule GVHTTP do
  defmacro __using__(_) do
    quote do
      use HTTPoison.Base

      def cookies_from_resp_headers(recv_headers) when is_list(recv_headers) do
        List.foldl(recv_headers, [], fn
          {"Set-Cookie", c}, acc -> [c|acc]
          _, acc -> acc
        end)
        |> Enum.map(fn(raw_cookie) ->
            :hackney_cookie.parse_cookie(raw_cookie)
            |> (fn
                  [{cookie_name, cookie_value} | cookie_opts] ->
                    { cookie_name, cookie_value,
                      cookie_opts
                    }
                  _error ->
                    nil
                end).()
        end)
        |> Enum.reject(&is_nil/1) # drop cookies that failed to parse
      end

      # Builds the value for hackney's :cookie option from the parsed cookies.
      def to_request_cookie(cookies) do
        joined =
          cookies
          |> Enum.map(fn {cookie_name, cookie_value, _cookie_opts} ->
            cookie_name <> "=" <> cookie_value
          end)
          |> Enum.join("; ")

        # "" => [], "foo1=abc" => ["foo1=abc"]
        if joined == "", do: [], else: [joined]
      end

      def get(url, headers \\ [], options \\ []) do
        case options[:follow_redirect] do
          true ->
            hackney_options = case options[:max_redirect] do
              0 -> options # allow HTTPoison to handle the case of max_redirect overflow error
              _ -> Keyword.drop(options, [:follow_redirect, :max_redirect])
            end
            case request(:get, url, "", headers, hackney_options) do
              {:ok, %HTTPoison.Response{status_code: code, headers: recv_headers}} when code in [301, 302, 307] ->
                {_, location} = List.keyfind(recv_headers, "Location", 0)
                req_cookie =
                  cookies_from_resp_headers(recv_headers)
                  |> to_request_cookie()

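                # Keep any other :hackney options the caller passed and
                # add the new cookies along with the previous ones.
                hackney_opts = options[:hackney] || []

                cookie =
                  [hackney_opts[:cookie] | req_cookie]
                  |> List.delete(nil)
                  |> Enum.join("; ")

                new_options =
                  options
                  |> Keyword.put(:max_redirect, (options[:max_redirect] || 5) - 1)
                  |> Keyword.put(:hackney, List.keystore(hackney_opts, :cookie, 0, {:cookie, cookie}))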
                get(location, headers, new_options)
              resp ->
                resp
            end
          _ ->
            request(:get, url, "", headers, options)
        end
      end

    end # quote end
  end  # __using__ end
end
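
For readers comparing this with the Python snippet in the question, the two helper functions boil down to roughly the following. This is only a sketch of the idea in plain Python with made-up header values. It is not how requests or hackney handle cookies internally, and like to_request_cookie it ignores attributes such as Path and Expires.

```python
def cookies_from_resp_headers(recv_headers):
    """Return (name, value) pairs for every Set-Cookie header in a list of header tuples."""
    cookies = []
    for header_name, header_value in recv_headers:
        if header_name.lower() != "set-cookie":
            continue
        # "sid=abc; Path=/; HttpOnly" -> "sid=abc"
        pair = header_value.split(";", 1)[0].strip()
        name, sep, value = pair.partition("=")
        if sep and name:  # skip malformed cookies
            cookies.append((name, value))
    return cookies


def to_request_cookie(cookies):
    """Join cookie pairs into a single Cookie header value."""
    return "; ".join(f"{name}={value}" for name, value in cookies)


# Made-up response headers, for illustration only.
headers = [
    ("Content-Type", "text/html"),
    ("Set-Cookie", "sessionId=abc123; Path=/; HttpOnly"),
    ("Set-Cookie", "authchallenge=xyz; Path=/"),
]
print(to_request_cookie(cookies_from_resp_headers(headers)))
# sessionId=abc123; authchallenge=xyz
```

One difference from the Elixir module: this sketch compares header names case-insensitively, whereas the pattern {"Set-Cookie", c} only matches that exact spelling.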