1
votes

I have 2 questions regarding crawlers and robots.

Background info

I only want Google and Bing to be excluded from the “disallow” and “noindex” limitations. In other words, I want ALL search engines except Google and Bing to follow the “disallow” and “noindex” rules. In addition, I would also like a “nosnippet” function for the search engines I mentioned (which all support “nosnippet”). Which code do I use to do this (using both robots.txt and X-Robots-Tag)?

I want to have this in both the robots.txt file and the .htaccess file (as an X-Robots-Tag header). I understand that robots.txt may be outdated, but I would like to give crawlers clear instructions even if those instructions are considered “ineffective” or “outdated”, unless you think otherwise.

Question 1

Did I get the following code right to allow only Google and Bing to index my pages (preventing other search engines from showing them in their results), and, furthermore, to prevent Bing and Google from showing snippets in their search results?

X-Robots-Tag code (Is this correct? I don't think I need to add "index" for googlebot and bingbot, since "index" is the default value, but I'm not sure.)

X-Robots-Tag: googlebot: nosnippet
X-Robots-Tag: bingbot: nosnippet
X-Robots-Tag: otherbot: noindex

robots.txt code (Is this correct? I think the first one is, but I'm not sure.)

    User-agent: Googlebot
    Disallow:
    User-agent: Bingbot
    Disallow:
    User-agent: *
    Disallow: /

or

    User-agent: *
    Disallow: /
    User-agent: Googlebot
    Disallow:
    User-agent: Bingbot
    Disallow:

Question 2: Conflicts between robots.txt and X-Robots-Tag

I anticipate conflicts between robots.txt and the X-Robots-Tag, because the disallow and noindex directives can't work in conjunction. (Is there any advantage to using X-Robots-Tag instead of robots.txt?) How do I get around this, and what is your recommendation?

End goal

As mentioned, the main goal is to explicitly tell all older robots (still using robots.txt) and all newer ones except Google and Bing (using X-Robots-Tag) not to show any of my pages in their search results (which I assume is what noindex does). I understand they may not all comply, but I want them ALL to know, except Google and Bing, not to show my pages in search results. To this end, I'm looking for robots.txt and X-Robots-Tag code that will work without conflict for the HTML sites I am trying to build.

1
"I understand that robots.txt may be outdated": Where is this coming from?unor
Hey, unor, I must be wrong about that. I guess robots.txt is still the main standard for instructing crawlers. I think I incorrectly assumed that everything is changing from robots.txt to X Robots Tag. Very new to this whole thing, and appreciate your efforts to get me on the right track. Thanks for that.VinceJ

1 Answer

1
votes

robots.txt is not outdated. It’s still the only open/vendor-agnostic way to control what should not get crawled. X-Robots-Tag (and the corresponding meta-robots) is the only open/vendor-agnostic way to control what should not get indexed.

As you're aware, you can't apply both to the same URL. There is no way around this. If a bot wants to crawl https://example.com/foo, it (hopefully) checks https://example.com/robots.txt first to see whether it's allowed to:

  • If crawling is allowed, the bot requests the document, and only then learns that it's not allowed to index it. By that point it has obviously already crawled the document, and it's still allowed to keep crawling it.

  • If crawling is disallowed, the bot doesn't request the document, and thus never learns that it's also not allowed to index it, because seeing the HTTP header or the HTML meta element would require crawling the document.
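The two branches above can be sketched as a small simulation, using Python's stdlib robots.txt parser (the `bot_decision` helper is hypothetical, purely for illustration; real crawlers are more complex):

```python
from urllib.robotparser import RobotFileParser

def bot_decision(agent: str, url: str, robots_txt: str, headers: dict) -> str:
    """Illustrative sketch of a well-behaved bot: robots.txt gates crawling;
    the X-Robots-Tag header can only be seen after fetching the document."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    if not rp.can_fetch(agent, url):
        # Crawling disallowed: the document is never requested, so any
        # noindex in its HTTP headers is never seen by this bot.
        return "not crawled; X-Robots-Tag never seen"
    # Crawling allowed: the bot fetches the page and only now reads headers.
    if "noindex" in headers.get("X-Robots-Tag", ""):
        return "crawled, then dropped from the index"
    return "crawled and indexable"

robots_txt = "User-agent: *\nDisallow: /\n"
headers = {"X-Robots-Tag": "noindex"}
print(bot_decision("SomeBot", "https://example.com/foo", robots_txt, headers))
# -> not crawled; X-Robots-Tag never seen
```

Note how the noindex intent is simply lost in the disallowed case: the header exists, but no well-behaved bot ever gets to read it.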

A Noindex field in robots.txt would solve this conflict, and Google once supported it as an experimental feature, but Google announced in 2019 that it would stop honoring it, so you can't expect it to work.

So, you have to choose: do you not want your pages to appear in other search engines' results (→ X-Robots-Tag), or do you not want other search engines' bots to crawl your documents (→ robots.txt)?

X-Robots-Tag

If you want to target all other bots (instead of listing each one, as your otherbot placeholder suggests, which would be virtually impossible), you should use

X-Robots-Tag: bingbot: nosnippet
X-Robots-Tag: googlebot: nosnippet
X-Robots-Tag: noindex

(I would assume Bingbot and Googlebot ignore the last line, as they already matched an earlier one, but to be sure, you could add index to both bots' lines.)
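For the .htaccess side of your question, these headers could be emitted with Apache's mod_headers, for example (a sketch; it assumes mod_headers is enabled and that .htaccess overrides are allowed on your server):

```apache
# .htaccess sketch: each "Header add" emits one X-Robots-Tag header line
Header add X-Robots-Tag "bingbot: nosnippet"
Header add X-Robots-Tag "googlebot: nosnippet"
Header add X-Robots-Tag "noindex"
```

Using `Header add` rather than `Header set` is deliberate: `set` would replace the header, while `add` appends a separate X-Robots-Tag line for each directive.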

robots.txt

Records (each record starts with a User-agent line) need to be separated by empty lines:

User-agent: *
Disallow: /

User-agent: Bingbot
Disallow:

User-agent: Googlebot
Disallow:

The order of the records doesn't matter, unless a bot "listens" to multiple names in your robots.txt (it follows the first record that matches its name, and only if no name matches does it fall back to the * record). So, after adding the empty lines, both of your robots.txt variants are fine.
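That both orderings behave the same can be checked with Python's stdlib robots.txt parser (an illustration only; real search engine parsers may differ in details):

```python
from urllib.robotparser import RobotFileParser

def decisions(robots_txt: str) -> dict:
    # Parse a robots.txt and report whether each bot may fetch a sample URL.
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    url = "https://example.com/some-page"
    return {a: rp.can_fetch(a, url) for a in ("Googlebot", "Bingbot", "OtherBot")}

bots_first = """User-agent: Googlebot
Disallow:

User-agent: Bingbot
Disallow:

User-agent: *
Disallow: /
"""

star_first = """User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:

User-agent: Bingbot
Disallow:
"""

print(decisions(bots_first))
print(decisions(star_first))
# Both print: {'Googlebot': True, 'Bingbot': True, 'OtherBot': False}
```

Either way, the named bots match their own records and may crawl everything, while any other bot falls back to the * record and is disallowed.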

You can also use one record for both bots:

User-agent: *
Disallow: /

User-agent: Bingbot
User-agent: Googlebot
Disallow:
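Python's stdlib parser also accepts this combined-record form and reaches the same decisions (a quick sanity check, not a guarantee for every crawler):

```python
from urllib.robotparser import RobotFileParser

# The combined-record robots.txt: two User-agent lines share one group.
combined = """User-agent: *
Disallow: /

User-agent: Bingbot
User-agent: Googlebot
Disallow:
"""

rp = RobotFileParser()
rp.parse(combined.splitlines())
url = "https://example.com/some-page"
for agent in ("Googlebot", "Bingbot", "OtherBot"):
    print(agent, rp.can_fetch(agent, url))
# Googlebot True, Bingbot True, OtherBot False
```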