Crawl Image using Apache Nutch

2

votes

I installed Apache Nutch 2.3.1, Solr 6.5.1, and MongoDB 3.4.7. After crawling URLs that contain many images, there are no images or videos in Solr or MongoDB. I edited the regex-urlfilter.txt file in Apache Nutch and deleted the suffixes related to images (.png, .jpeg, .gif, ...). I then edited the suffix-urlfilter.txt file and commented out jpeg, gif, and png as well.
Even after these changes, Apache Nutch doesn't crawl images. How can I crawl images and see them in Solr? From what I have read, I understand that I would need to create plug-ins. Is my impression correct?

mongodb apache solr web-crawler nutch

0

votes

Nutch supports several formats: plain text, HTML/XHTML+XML, XML, MS Office files, Adobe PDF, RSS, RTF, and MP3. Unfortunately, there is no support for any sort of image file. Apart from this, I'm curious: what do you want to index from an image file?

0

votes

If I understand your question correctly, what you want to accomplish is to extract all the metadata from the images and index only that in Solr, right?

If Nutch is not even fetching your images, then it is most likely that one of the URL filters is excluding those URLs from being fetched (check the logs). You need to describe your changes to the different files; otherwise it will be impossible to help you.
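To make the URL filter point concrete, here is a minimal Python sketch of how I understand regex-urlfilter.txt rules to be evaluated: each rule is a `+` (accept) or `-` (reject) followed by a regex, the first rule that matches decides, and a URL that matches nothing is dropped. This is only a simulation for reasoning about your rule files, not Nutch code. The function names and both rule sets are made up for illustration, and the rule sets are simplified rather than copied from the real default file.

```python
import re

def parse_rules(text):
    """Parse regex-urlfilter style lines: '+regex' accepts, '-regex' rejects."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            continue  # ignore lines this sketch does not understand
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """Return the verdict of the first matching rule; reject if none matches."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False

# Hypothetical, simplified rule sets (assumptions, not the shipped defaults).
DEFAULT_LIKE = r"""
# skip image and other binary suffixes
-\.(gif|jpg|png|ico|css|zip|jpeg|bmp|js)$
# accept anything else
+.
"""

IMAGES_ALLOWED = r"""
# image suffixes removed from the reject rule
-\.(ico|css|zip|js)$
+.
"""

if __name__ == "__main__":
    url = "http://example.com/a.png"
    print(accepts(parse_rules(DEFAULT_LIKE), url))
    print(accepts(parse_rules(IMAGES_ALLOWED), url))
```

If an image URL is still rejected after you edit the rules, another filter (for example the suffix filter) or a stale copy of the configuration may be the one excluding it, which is why the logs are worth checking.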

Now, back to the original question: if you want to index only image URLs (along with their metadata), then you need to filter what you index into Solr. Unfortunately, Nutch 2.3 doesn't offer this feature out of the box. In Nutch 1.x you could use mimetype-filter, which allows you to specify what you want to index into Solr/ES depending on the MIME type of the URL. My suggestion is to use Nutch 1.x unless you have a very good reason to use Nutch 2.x. Otherwise, you could port the mimetype-filter plugin to 2.x or write your own IndexingFilter that implements your own logic.
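The decision such a filter has to make is small. The sketch below shows only that logic in plain Python, under the assumption that each fetched document carries a content type string. It is not the mimetype-filter plugin and not the Nutch IndexingFilter API; `should_index`, the `contentType` key, and the sample documents are hypothetical names chosen for illustration.

```python
def should_index(mime_type, allowed_prefixes=("image/",)):
    """Return True if the document's MIME type starts with an allowed prefix."""
    if not mime_type:
        return False  # no content type detected: do not index
    # Drop parameters such as "; charset=UTF-8" before comparing.
    base = mime_type.split(";", 1)[0].strip().lower()
    return any(base.startswith(prefix) for prefix in allowed_prefixes)

# Hypothetical sample of fetched documents.
docs = [
    {"url": "http://example.com/index.html", "contentType": "text/html; charset=UTF-8"},
    {"url": "http://example.com/logo.png", "contentType": "image/png"},
    {"url": "http://example.com/photo.jpg", "contentType": "image/jpeg"},
]

# Keep only the documents that the filter would let through to the index.
to_index = [d["url"] for d in docs if should_index(d["contentType"])]

if __name__ == "__main__":
    print(to_index)
```

In a real plugin the same check would live in the indexing step, so HTML pages can still be fetched and parsed for outlinks while only the image documents reach Solr.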

Keep in mind that the information you'll get in Solr is limited to what Tika can extract from the image file (its metadata), which is usually not very well curated.


2 Answers