Googlebot is crawling our site and, based on our URL structure, it is guessing new possible URLs.
Our URLs follow the pattern /x/y/z/param1.value. Googlebot exchanges the values of x, y, z, and value with tons of different keywords.
The problem is that each such request triggers a very expensive operation, and it returns positive results only in very rare cases.
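To illustrate with made-up values (the real keywords are different), a genuine page and a guessed variant might look like:

```
/bikes/germany/road/color.red       <- real page
/bikes/france/road/material.carbon  <- guessed by Googlebot, almost never any results
```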
I tried to set a URL parameter in the crawling section of Webmaster Tools (param1. -> no crawling), but this does not seem to work, probably because of our inline URL format. (Would it be better to use the HTML GET format, ?param1=...?)
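For clarity, the two formats I mean (param1 and value are placeholders):

```
/x/y/z/param1.value    <- our current, path-embedded format
/x/y/z?param1=value    <- the classic GET / query-string format
```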
Since Disallow: */param1.* does not seem to be an allowed robots.txt entry, is there another way to disallow Google from crawling these pages?
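For reference, here is roughly what I tried. Google's documentation does mention * wildcard support in Disallow paths, so maybe a pattern anchored at / would behave differently, but I have not been able to confirm this:

```
User-agent: Googlebot
# what I tried (apparently not accepted):
Disallow: */param1.*
# possible variant, if the wildcard support applies here:
Disallow: /*param1.*
```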
As another solution I thought of detecting Googlebot and returning it a special page, but I have heard that Google penalizes this.
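Concretely, the detection I had in mind is nothing more than user-agent sniffing along these lines (is_googlebot is a hypothetical helper, not something we have in production):

```python
def is_googlebot(user_agent_header: str) -> bool:
    # Naive user-agent sniffing; serving different content based on this
    # is exactly what I fear Google would treat as cloaking.
    return "googlebot" in user_agent_header.lower()
```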
Currently we always return HTTP status code 200 and a human-readable page that says: "No targets for your filter criteria found". Would it help to return another status code?
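To make the status-code question concrete, here is a minimal sketch of what we do today and the alternative I am asking about, with Flask standing in purely for our real handler and 404 chosen just as an example of a "not found" code:

```python
from flask import Flask

app = Flask(__name__)

def run_expensive_filter(filter_path):
    # Stand-in for our real, expensive lookup; None means "no results".
    return None

@app.route("/<path:filter_path>")
def filtered_page(filter_path):
    result = run_expensive_filter(filter_path)
    if result is None:
        # What we do today: HTTP 200 plus a human-readable "nothing found" page.
        # The alternative I am asking about: return 404 (or 410) instead of 200.
        return "No targets for your filter criteria found", 200
    return result
```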