
I was wondering how the recrawling strategy of large search engines works. For example, let's consider Google. We know that Google uses a dynamic interval for recrawling each website. Suppose that, according to Google's dynamic intervals, there are 100k sites that should be recrawled every 10 minutes, so the crawl of these 100,000 sites must finish in less than 10 minutes. There are probably two possible approaches:

1) Googlebot fetches the first page of each of these sites and extracts the list of URLs on that page. For each URL it checks whether that URL has been fetched before; if it is new, the page is fetched. This continues until the crawl ends or a specific depth threshold is reached (see the sketch after these two options).

2) Googlebot fetches every page again, whether it has been updated or not.
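To make option 1 concrete, here is a minimal sketch (my own illustration, not Googlebot's actual code): a breadth-first crawl that keeps a set of already-fetched URLs and stops at a depth threshold. The use of Python's standard library, the timeout, and the depth limit are assumptions made for the example.

    # A minimal sketch of option 1: breadth-first recrawl starting from a site's
    # front page, following only links not fetched before, up to a depth limit.
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen


    class LinkExtractor(HTMLParser):
        """Collects href/src attribute values found in a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            for name, value in attrs:
                if name in ("href", "src") and value:
                    self.links.append(value)


    def crawl_site(front_page_url, already_fetched, max_depth=2):
        """Fetch the front page, then any link not seen before, down to max_depth."""
        queue = deque([(front_page_url, 0)])
        while queue:
            url, depth = queue.popleft()
            if url in already_fetched or depth > max_depth:
                continue
            already_fetched.add(url)
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except OSError:
                continue  # unreachable page; skip it
            extractor = LinkExtractor()
            extractor.feed(html)
            for link in extractor.links:
                queue.append((urljoin(url, link), depth + 1))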

If Google uses the first strategy, how is a page with the same URL but updated content recrawled and indexed? If Google uses the second one, how can it recrawl all of those pages in less than 10 minutes? And what about the rest of the web? There are probably more than 6 billion web pages; how is recrawling all of them in a timely manner even possible? I really don't think it is possible, even using newer technologies like Nutch and Solr on a Hadoop infrastructure.

Regards.


1 Answer


We use a huge set of computers to fetch (or "crawl") billions of pages on the web. Googlebot uses an algorithmic process: computer programs determine which sites to crawl, how often, and how many pages to fetch from each site.

Googlebot's crawl process begins with a list of webpage URLs, generated from previous crawl processes and augmented with Sitemap data provided by webmasters. As Googlebot visits each of these websites it detects links (SRC and HREF) on each page and adds them to its list of pages to crawl. New sites, changes to existing sites, and dead links are noted and used to update the Google index.

https://support.google.com/webmasters/answer/182072?hl=en
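To illustrate the quoted description, here is a small sketch of seeding a crawl frontier from URLs kept from previous crawls plus URLs declared in a Sitemap. This is purely illustrative; the function names and the assumption that sitemap files are already downloaded to disk are mine, not Google's.

    # A rough sketch of seeding a crawl frontier: URLs from previous crawl runs
    # plus URLs listed in webmaster-provided Sitemaps (assumed to be local files).
    import xml.etree.ElementTree as ET

    SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


    def urls_from_sitemap(sitemap_path):
        """Read the <loc> entries from a sitemap.xml file on disk."""
        tree = ET.parse(sitemap_path)
        return [loc.text.strip() for loc in tree.iter(SITEMAP_NS + "loc") if loc.text]


    def seed_frontier(previous_crawl_urls, sitemap_paths):
        """Merge URLs seen in earlier crawls with URLs declared in Sitemaps."""
        frontier = set(previous_crawl_urls)
        for path in sitemap_paths:
            frontier.update(urls_from_sitemap(path))
        return frontier

As Googlebot visits pages from this frontier, newly discovered SRC and HREF links are added back into the list, so the frontier grows from crawl to crawl.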

First, why does it have to finish its job in 10 minutes?

As the first paragraph of that quote explains, not all sites are recrawled at the same interval; an algorithm determines how often each site is crawled.

So Googlebot will eventually fetch every page again, but at very different intervals. It's option (2) from your question, with a scheduling algorithm on top (a sketch of one such schedule follows).
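As an illustration of that "added algorithm", here is one simple adaptive-interval heuristic (my sketch, not Google's actual scheduler): revisit a page sooner when its content has changed since the last visit, and back off when it has not. The specific multipliers and the 10-minute / 30-day bounds are arbitrary assumptions.

    # A sketch of a common adaptive recrawl heuristic: each URL gets its own
    # interval, which shrinks when the page changed and grows when it did not.
    import hashlib
    from datetime import timedelta

    MIN_INTERVAL = timedelta(minutes=10)
    MAX_INTERVAL = timedelta(days=30)


    def content_fingerprint(page_bytes):
        """Cheap way to detect whether a page changed between visits."""
        return hashlib.sha256(page_bytes).hexdigest()


    def next_interval(previous_interval, old_fingerprint, new_fingerprint):
        """Recrawl changed pages sooner and stable pages less often."""
        if new_fingerprint != old_fingerprint:
            interval = previous_interval / 2   # page changed: come back sooner
        else:
            interval = previous_interval * 2   # page stable: back off
        return max(MIN_INTERVAL, min(MAX_INTERVAL, interval))

Under a scheme like this, a news front page ends up near the 10-minute bound while a static "about" page drifts toward the 30-day bound, which is why a fixed "100k sites every 10 minutes" budget is not how the problem is framed.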

They use a Hadoop-like distributed infrastructure for scalability.