0
votes

I have a URL that works fine in every browser I tried (5 browsers on 2 computers), but when I fetch the page content with Get() of the Indy HTTP client, it returns error code 404, page not found. This is with the latest Indy SVN build (4985).

Why does this web server return code 404 for Indy, but code 200 for every browser?

I suspect this may be a bug in Indy caused by the "#" character in the URL (Indy cuts off everything after #). If so, is there any way to work around it? Maybe by replacing the # character with its escape code?

Here is my example code. All that is needed for this is Delphi with Indy components and a form with a button and a memo.

procedure TForm1.Button1Click(Sender: TObject);
var
  HTTPCLIENT1: TIdHTTP;
begin
  // Create the client before try..finally so that Free never runs on an uninitialized reference.
  HTTPCLIENT1 := TIdHTTP.Create(nil);
  try
    try
      Memo1.Clear;
      with HTTPCLIENT1 do
      begin
        HandleRedirects := True;
        Request.UserAgent := 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31';
        Memo1.Text := Get('http://www.visionofhumanity.org/gpi-data/#/2011/scor/');
        Caption := ResponseText;
      end;
    except
      on E: Exception do
        Memo1.Lines.Add('Exception: ' + E.Message);
    end;
  finally
    HTTPCLIENT1.Free;
  end;
end;

According to the answer to your previous question, revision 4985 was supposed to resolve that issue. Are you sure you've installed it correctly? Have you used Wireshark to compare what browsers send with what your program sends? - Rob Kennedy

I agree with Rob. I fixed that bug so # and everything after it would not be sent to the server anymore. I will double-check, but it was definitely working correctly last time. - Remy Lebeau

@RobKennedy This is another issue. The other post was about getting error code 500 with an anchor within a URL. That was fixed. This issue is about getting code 404 without an anchor within the URL. - Casady

@Casady: Like Rob said, you should be using Wireshark, or another packet sniffer, to diagnose problems like that. Your earlier problem was caused by Indy sending a different URL than browsers send, leading to a series of redirects that ultimately failed on the server end. This problem is likely similar. So you need to make sure Indy is actually sending the same URL that browsers send. - Remy Lebeau

@RemyLebeau You are right. This problem is similar but not the same. For some reason Indy does not strip the anchor from the URL, even in build 4985. - Casady

2 Answers

3
votes

# is a reserved character in URLs. If you want to use reserved characters inside of a URL, you need to url-encode them. TIdHTTP does not do that for you. It requires you to pass in an encoded URL, but you are passing in an unencoded URL instead. Since # is unencoded, it gets treated as an anchor and stripped off, so you are actually requesting http://www.visionofhumanity.org/gpi-data/, hence the 404 reply.

# is url-encoded as %23, so use this:

Memo1.Text := Get('http://www.visionofhumanity.org/gpi-data/%23/2011/scor/');

Or this:

Memo1.Text := Get(TIdURI.URLEncode('http://www.visionofhumanity.org/gpi-data/#/2011/scor/'));

Update: I tracked down the problem. It is another TIdURI parsing bug, this time related to having a / character after the # character. TIdURI checks for / characters before it checks for a # character, so the anchor portion of the URL was ending up in the TIdURI.Path property (previously it was ending up in the TIdURI.Params property) and thus submitted to the server. I have checked in a new fix (SVN rev 4987).
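
To make the ordering problem concrete, here is a simplified sketch. It is not Indy's actual TIdURI code; NaivePath and FragmentFirstPath are hypothetical helpers written only to illustrate how looking for / before # lets the anchor survive in the path when a / follows the #.

```delphi
program ParseOrderDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical illustration, not Indy code.
// Splits at the last '/' first, then looks for '#' only in the text after it.
function NaivePath(const AURL: string): string;
var
  Slash, Hash: Integer;
  Doc: string;
begin
  Slash := LastDelimiter('/', AURL);
  Result := Copy(AURL, 1, Slash);
  Doc := Copy(AURL, Slash + 1, MaxInt);
  Hash := Pos('#', Doc);
  if Hash > 0 then
    Doc := Copy(Doc, 1, Hash - 1);
  Result := Result + Doc;
end;

// Hypothetical illustration, not Indy code.
// Removes '#' and everything after it before any other splitting happens.
function FragmentFirstPath(const AURL: string): string;
var
  Hash: Integer;
begin
  Result := AURL;
  Hash := Pos('#', Result);
  if Hash > 0 then
    Result := Copy(Result, 1, Hash - 1);
end;

begin
  // The anchor contains '/', so the naive order never sees the '#'.
  WriteLn(NaivePath('/gpi-data/#/2011/scor/'));         // prints /gpi-data/#/2011/scor/
  WriteLn(FragmentFirstPath('/gpi-data/#/2011/scor/')); // prints /gpi-data/
end.
```

With an anchor that has no / in it (for example /page.html#top), both orders give the same result, which would explain why the earlier fix appeared to work.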

3
votes

Your suspicion is correct. You've included the # section of the address in your request. Browsers don't do that because that section is reserved for in-page navigation. The server doesn't know that, so it tries to fetch the resource that corresponds to the full URL you gave it, including the # and everything afterward. Nothing matches, so it fails with status 404.

Either do as the browsers do and strip that section from the URL prior to sending the request to the server, or update Indy to revision 4987 so that it will happen automatically. Merely escaping the character will continue to yield status 404.
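
If updating Indy is not an option, stripping the section yourself could look roughly like the sketch below. StripFragment is a hypothetical helper name, not part of Indy; it assumes the first # in the string starts the anchor, which holds for a URL whose literal # characters have been encoded as %23.

```delphi
program StripFragmentDemo;
{$APPTYPE CONSOLE}

// Hypothetical helper (not an Indy function): returns AURL without
// the '#' and everything that follows it.
function StripFragment(const AURL: string): string;
var
  Hash: Integer;
begin
  Hash := Pos('#', AURL);
  if Hash > 0 then
    Result := Copy(AURL, 1, Hash - 1)
  else
    Result := AURL;
end;

begin
  // prints http://www.visionofhumanity.org/gpi-data/
  WriteLn(StripFragment('http://www.visionofhumanity.org/gpi-data/#/2011/scor/'));
end.
```

In the question's code, the call would then become Memo1.Text := Get(StripFragment('http://www.visionofhumanity.org/gpi-data/#/2011/scor/')); so that only the part a browser would send reaches the server.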