Getting clear content (without markup) with Nutch 1.9

Question

Using Nutch 1.9, how do I get the clear content (without HTML markup) of crawled pages and save it in readable form? Is Solr the way to do that, or can it be done without Solr, and if so, how?

And a subquestion: how do I control the crawling depth with the bin/crawl script? The bin/nutch crawl command had an option for that (and for topN), but it is deprecated now and won't execute.

Ramanan Ramanan · Accepted Answer · 2014-11-07T12:29:38

Add this to nutch-site.xml:

<!-- Tika properties to use Boilerpipe, according to Markus Jelsma -->
<property> 
  <name>tika.use_boilerpipe</name> 
  <value>true</value> 
</property> 
<property> 
  <name>tika.boilerpipe.extractor</name> 
  <value>ArticleExtractor</value> 
</property>

Note: this is for Nutch 1.7; I'm not sure about 1.9.

Use jsoup to get plain text.
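
A minimal sketch of the jsoup route, assuming the jsoup jar is on the classpath and that you already have a page's raw HTML as a String. The class name PlainText and the helper toPlainText are made up for illustration; they are not part of Nutch or jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class PlainText {
    // Hypothetical helper (not a Nutch or jsoup API): parses an HTML string
    // and returns the visible text of its <body>, with tags removed and
    // whitespace normalized by jsoup.
    static String toPlainText(String html) {
        Document doc = Jsoup.parse(html);
        return doc.body().text();
    }

    public static void main(String[] args) {
        // Sample markup standing in for the content of a crawled page.
        String html = "<html><head><title>Demo</title></head>"
                + "<body><h1>Heading</h1><p>Some <b>bold</b> text.</p>"
                + "<script>var x = 1;</script></body></html>";
        System.out.println(toPlainText(html));
    }
}
```

Using body().text() rather than text() on the whole document keeps the page title out of the result. Unlike the Boilerpipe setting above, this does not try to drop navigation or footer text; it only strips the markup.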
