60
votes

I am writing a small Java program to get the amount of results for a given Google search term. For some reason, in Java I am getting a 403 Forbidden but I am getting the right results in web browsers. Code:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;


public class DataGetter {

    public static void main(String[] args) throws IOException {
        getResultAmount("test");
    }

    private static int getResultAmount(String query) throws IOException {
        BufferedReader r = new BufferedReader(new InputStreamReader(new URL("https://www.google.com/search?q=" + query).openConnection()
                .getInputStream()));
        String line;
        String src = "";
        while ((line = r.readLine()) != null) {
            src += line;
        }
        System.out.println(src);
        return 1;
    }

}

And the error:

Exception in thread "main" java.io.IOException: Server returned HTTP response code: 403 for URL: https://www.google.com/search?q=test
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
    at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(Unknown Source)
    at DataGetter.getResultAmount(DataGetter.java:15)
    at DataGetter.main(DataGetter.java:10)

Why is it doing this?

4
@Perception um... what's an SSL endpoint? (sorry I'm clueless about this kind of stuff)tckmn
SSL (secure socket layer) is a method of ensuring security of data passed back and forth between a client and server. An SSL endpoint is a regular URL, but with https instead of http. Using SSL is more complicated than regular http because there needs to be handshaking between the client and server. Which in your case is unnecessary, since you can just use the 'normal' http endpoint for Google (http;//www.google.com/search)Perception
@Perception if I use normal http:// the same thing happenstckmn
Add the query you are working with too the question.Perception

4 Answers

121
votes

You just need to set user agent header for it to work:

URLConnection connection = new URL("https://www.google.com/search?q=" + query).openConnection();
connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11");
connection.connect();

BufferedReader r  = new BufferedReader(new InputStreamReader(connection.getInputStream(), Charset.forName("UTF-8")));

StringBuilder sb = new StringBuilder();
String line;
while ((line = r.readLine()) != null) {
    sb.append(line);
}
System.out.println(sb.toString());

The SSL was transparently handled for you as could be seen from your exception stacktrace.

Getting the result amount is not really this simple though, after this you have to fake that you're a browser by fetching the cookie and parsing the redirect token link.

String cookie = connection.getHeaderField( "Set-Cookie").split(";")[0];
Pattern pattern = Pattern.compile("content=\\\"0;url=(.*?)\\\"");
Matcher m = pattern.matcher(response);
if( m.find() ) {
    String url = m.group(1);
    connection = new URL(url).openConnection();
    connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11");
    connection.setRequestProperty("Cookie", cookie );
    connection.connect();
    r  = new BufferedReader(new InputStreamReader(connection.getInputStream(), Charset.forName("UTF-8")));
    sb = new StringBuilder();
    while ((line = r.readLine()) != null) {
        sb.append(line);
    }
    response = sb.toString();
    pattern = Pattern.compile("<div id=\"resultStats\">About ([0-9,]+) results</div>");
    m = pattern.matcher(response);
    if( m.find() ) {
        long amount = Long.parseLong(m.group(1).replaceAll(",", ""));
        return amount;
    }

}

Running the full code I get 2930000000L as a result.

5
votes

For me it worked by adding the header: "Accept": "*/*"

2
votes

You probably aren't setting the correct headers. Use LiveHttpHeaders (or equivalent) in the browser to see what headers the browser is sending, then emulate them in your code.

0
votes

It's because the site uses SSL. Try using the Jersey HTTP Client. You will probably also have to learn a little about HTTPS and the certificates, but I think Jersey can bet set to ignore most of the details relating to the actual security.