I was wondering how the recrawling strategy of huge search engines works. Let's take Google as an example. We know that Google uses a dynamic interval for recrawling each website. Suppose that, according to these dynamic intervals, there are 100,000 sites that should be recrawled every 10 minutes; the crawl of those 100,000 sites would then have to finish in under 10 minutes. There seem to be two possible strategies:
1) Googlebot fetches the front page of each of these sites and extracts the list of URLs found on it. For each URL it checks whether that URL has been fetched before; if it is new, the page is fetched. This continues until the crawl finishes or a specific depth threshold is reached (see the sketch after this list).
2) Googlebot fetches every page again, whether it has been updated or not.
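To make strategy (1) concrete, here is a minimal sketch of what I have in mind: a breadth-first crawl of one site that skips URLs fetched before and stops at a depth threshold. This is just an illustration of the idea, not anyone's actual crawler; the `fetch` and `extract_links` helpers are simplistic stand-ins.

```python
# Sketch of strategy (1): BFS from a site's front page, skipping already-fetched
# URLs and stopping at a depth threshold. Helpers are illustrative only.
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

def fetch(url):
    """Download a page and return its HTML (errors swallowed for brevity)."""
    try:
        with urlopen(url, timeout=10) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except Exception:
        return ""

def extract_links(base_url, html):
    """Crude href extraction; a real crawler would use a proper HTML parser."""
    return [urljoin(base_url, href) for href in re.findall(r'href="([^"]+)"', html)]

def crawl_site(front_page, max_depth=3, already_fetched=None):
    """Fetch the front page, follow only new URLs, stop at max_depth."""
    already_fetched = already_fetched if already_fetched is not None else set()
    queue = deque([(front_page, 0)])
    while queue:
        url, depth = queue.popleft()
        if url in already_fetched or depth > max_depth:
            continue  # URL seen before, or too deep: skip it
        already_fetched.add(url)
        html = fetch(url)
        for link in extract_links(url, html):
            if link not in already_fetched:
                queue.append((link, depth + 1))
    return already_fetched
```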
If Google uses the first strategy, how is a page with the same URL but updated content recrawled and indexed? If it uses the second, how can it recrawl all of these pages in under 10 minutes? And what about the rest of the web? There are probably more than 6 billion web pages; how is recrawling all of them in a timely manner even possible? I really don't think it is feasible, even with technologies like Nutch and Solr on a Hadoop infrastructure.
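The only way I can imagine the "same URL, updated content" case working under strategy (1) is something like the following sketch: hash the fetched body, compare it with the hash stored from the previous visit, and adjust the per-URL recrawl interval depending on whether the page changed, with a min-heap keyed by the next due time acting as the scheduler. The function names, intervals, and adaptation rule are all my own assumptions, not Google's design.

```python
# Sketch (my assumption, not Google's actual design): detect updated content at a
# known URL via a content hash, and adapt a per-URL recrawl interval accordingly.
import hashlib
import heapq
import time

MIN_INTERVAL = 600         # 10 minutes, matching the example above
MAX_INTERVAL = 30 * 86400  # back off up to 30 days for pages that never change

def content_hash(html):
    return hashlib.sha256(html.encode("utf-8")).hexdigest()

def next_interval(old_interval, changed):
    """Shrink the interval when the page changed, grow it when it did not."""
    if changed:
        return max(MIN_INTERVAL, old_interval // 2)
    return min(MAX_INTERVAL, old_interval * 2)

def recrawl_loop(schedule, last_hash, fetch):
    """schedule: heap of (due_time, interval, url); last_hash: url -> previous hash."""
    while schedule:
        due, interval, url = heapq.heappop(schedule)
        time.sleep(max(0, due - time.time()))  # wait until the URL is due
        html = fetch(url)
        digest = content_hash(html)
        changed = last_hash.get(url) != digest
        last_hash[url] = digest
        interval = next_interval(interval, changed)
        heapq.heappush(schedule, (time.time() + interval, interval, url))
```

Even if something like this explains change detection and dynamic intervals for one machine, I still don't see how it scales to billions of pages, which is the core of my question.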
Regards.