
On a second-hand car seller website there are thousands of car ads. This is a typical ad -> alfa-romeo

If I crawl all of these ad pages (all different cars), I end up indexing lots of useless text that I don't want. I would like to crawl only specific fields, something like:

title, description, km of the car, power in cv (hp) — not the whole page.

I'm using Nutch since it has good integration with Solr, but Nutch is built to crawl everything, and I didn't find a plugin that solves my problem.

I already tried nutch-custom-search; it didn't work.

Do you know of something to solve my problem? I just want to crawl the pages of one specific website, extract only specific parts of those pages, and index them into Solr.

Or maybe another crawler with good Solr integration?

Thanks.

1 Answer


Also take a look at https://issues.apache.org/jira/browse/NUTCH-1870, an XPath plugin for Nutch. It allows you to extract the desired elements of the web page and store them in individual fields.
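Whichever plugin you end up with, the extraction step itself comes down to evaluating XPath expressions against the parsed page. Here is a minimal, self-contained sketch using the standard `javax.xml.xpath` API — the markup, class names, and field names are hypothetical, only meant to illustrate the kind of expressions you would configure for each Solr field:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class AdFieldExtractor {
    public static void main(String[] args) throws Exception {
        // Hypothetical, simplified ad markup; a real crawler would hand you
        // the fetched page (usually run through an HTML-to-DOM parser first).
        String html = "<div class=\"ad\">"
                + "<h1 class=\"title\">Alfa Romeo 156</h1>"
                + "<p class=\"description\">Well maintained, one owner.</p>"
                + "<span class=\"km\">120000</span>"
                + "<span class=\"hp\">150</span>"
                + "</div>";

        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(
                        html.getBytes(StandardCharsets.UTF_8)));

        XPath xp = XPathFactory.newInstance().newXPath();

        // One XPath expression per Solr field you want to index.
        String title = xp.evaluate("//*[@class='title']", doc);
        String km    = xp.evaluate("//*[@class='km']", doc);
        String hp    = xp.evaluate("//*[@class='hp']", doc);

        System.out.println(title + "|" + km + "|" + hp);
    }
}
```

The point is that only these three extracted strings would be sent to Solr as fields, instead of the full page text.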

If you're willing to consider a different crawler, take a look at https://github.com/DigitalPebble/storm-crawler/, a set of resources for building your own crawler on top of Apache Storm. The main gain with this approach is that it is a near-real-time (NRT) crawler.