I added a set of seeds to crawl using this command:
./bin/crawl /largeSeeds 1 http://localhost:8983/solr/ddcd 4
In the first iteration, all of the steps (inject, generate, fetch, parse, update-table, index, and delete duplicates) executed successfully. In the second iteration, the "CrawlDB update" step failed (see the error log below), and because of this failure the whole process terminated.
Software stack: nutch-branch-2.3.1, gora-hbase 0.6.1, Hadoop 2.5.2, hbase-0.98.8-hadoop2.
16/01/20 02:45:19 INFO parse.ParserJob: ParserJob: finished at 2016-01-20 02:45:19, time elapsed: 00:06:57
CrawlDB update for 1
/usr/share/searchEngine/nutch-branch-2.3.1/runtime/deploy/bin/nutch updatedb -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true 1453230757-13191 -crawlId 1
16/01/20 02:45:27 INFO crawl.DbUpdaterJob: DbUpdaterJob: starting at 2016-01-20 02:45:27
16/01/20 02:45:27 INFO crawl.DbUpdaterJob: DbUpdaterJob: batchId: 1453230757-13191
16/01/20 02:45:27 INFO plugin.PluginRepository: Plugins: looking in: /tmp/hadoop-root/hadoop-unjar5654418190157422003/classes/plugins
16/01/20 02:45:28 INFO plugin.PluginRepository: Plugin Auto-activation mode: [true]
16/01/20 02:45:28 INFO plugin.PluginRepository: Registered Plugins:
16/01/20 02:45:28 INFO plugin.PluginRepository: HTTP Framework (lib-http)
16/01/20 02:45:28 INFO plugin.PluginRepository: Html Parse Plug-in (parse-html)
16/01/20 02:45:28 INFO plugin.PluginRepository: MetaTags (parse-metatags)
16/01/20 02:45:28 INFO plugin.PluginRepository: the nutch core extension points (nutch-extensionpoints)
16/01/20 02:45:28 INFO plugin.PluginRepository: Basic Indexing Filter (index-basic)
16/01/20 02:45:28 INFO plugin.PluginRepository: XML Libraries (lib-xml)
16/01/20 02:45:28 INFO plugin.PluginRepository: Anchor Indexing Filter (index-anchor)
16/01/20 02:45:28 INFO plugin.PluginRepository: Basic URL Normalizer (urlnormalizer-basic)
16/01/20 02:45:28 INFO plugin.PluginRepository: Language Identification Parser/Filter (language-identifier)
16/01/20 02:45:28 INFO plugin.PluginRepository: Metadata Indexing Filter (index-metadata)
16/01/20 02:45:28 INFO plugin.PluginRepository: CyberNeko HTML Parser (lib-nekohtml)
16/01/20 02:45:28 INFO plugin.PluginRepository: Subcollection indexing and query filter (subcollection)
16/01/20 02:45:28 INFO plugin.PluginRepository: SOLRIndexWriter (indexer-solr)
16/01/20 02:45:28 INFO plugin.PluginRepository: Rel-Tag microformat Parser/Indexer/Querier (microformats-reltag)
16/01/20 02:45:28 INFO plugin.PluginRepository: Http / Https Protocol Plug-in (protocol-httpclient)
16/01/20 02:45:28 INFO plugin.PluginRepository: JavaScript Parser (parse-js)
16/01/20 02:45:28 INFO plugin.PluginRepository: Tika Parser Plug-in (parse-tika)
16/01/20 02:45:28 INFO plugin.PluginRepository: Top Level Domain Plugin (tld)
16/01/20 02:45:28 INFO plugin.PluginRepository: Regex URL Filter Framework (lib-regex-filter)
16/01/20 02:45:28 INFO plugin.PluginRepository: Regex URL Normalizer (urlnormalizer-regex)
16/01/20 02:45:28 INFO plugin.PluginRepository: Link Analysis Scoring Plug-in (scoring-link)
16/01/20 02:45:28 INFO plugin.PluginRepository: OPIC Scoring Plug-in (scoring-opic)
16/01/20 02:45:28 INFO plugin.PluginRepository: More Indexing Filter (index-more)
16/01/20 02:45:28 INFO plugin.PluginRepository: Http Protocol Plug-in (protocol-http)
16/01/20 02:45:28 INFO plugin.PluginRepository: Creative Commons Plugins (creativecommons)
16/01/20 02:45:28 INFO plugin.PluginRepository: Registered Extension-Points:
16/01/20 02:45:28 INFO plugin.PluginRepository: Parse Filter (org.apache.nutch.parse.ParseFilter)
16/01/20 02:45:28 INFO plugin.PluginRepository: Nutch Index Cleaning Filter (org.apache.nutch.indexer.IndexCleaningFilter)
16/01/20 02:45:28 INFO plugin.PluginRepository: Nutch Content Parser (org.apache.nutch.parse.Parser)
16/01/20 02:45:28 INFO plugin.PluginRepository: Nutch URL Filter (org.apache.nutch.net.URLFilter)
16/01/20 02:45:28 INFO plugin.PluginRepository: Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
16/01/20 02:45:28 INFO plugin.PluginRepository: Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
16/01/20 02:45:28 INFO plugin.PluginRepository: Nutch Protocol (org.apache.nutch.protocol.Protocol)
16/01/20 02:45:28 INFO plugin.PluginRepository: Nutch Index Writer (org.apache.nutch.indexer.IndexWriter)
16/01/20 02:45:28 INFO plugin.PluginRepository: Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
16/01/20 02:45:29 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
16/01/20 02:45:29 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
16/01/20 02:45:29 INFO Configuration.deprecation: mapred.compress.map.output is deprecated. Instead, use mapreduce.map.output.compress
16/01/20 02:45:29 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
16/01/20 02:45:29 INFO zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x60a2630a connecting to ZooKeeper ensemble=localhost:2181
16/01/20 02:45:29 INFO zookeeper.ZooKeeper: Client environment:zookeeper.version=3.4.6-1569965, built on 02/20/2014 09:09 GMT
16/01/20 02:45:29 INFO zookeeper.ZooKeeper: Client environment:host.name=cism479
16/01/20 02:45:29 INFO zookeeper.ZooKeeper: Client environment:java.version=1.8.0_65
16/01/20 02:45:29 INFO zookeeper.ZooKeeper: Client environment:java.vendor=Oracle Corporation
16/01/20 02:45:29 INFO zookeeper.ZooKeeper: Client environment:java.home=/usr/lib/jvm/jdk1.8.0_65/jre
16/01/20 02:45:35 INFO zookeeper.ClientCnxn: EventThread shut down
16/01/20 02:45:35 INFO mapreduce.JobSubmitter: number of splits:2
16/01/20 02:45:36 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1453210838763_0011
16/01/20 02:45:36 INFO impl.YarnClientImpl: Submitted application application_1453210838763_0011
16/01/20 02:45:36 INFO mapreduce.Job: The url to track the job: http://cism479:8088/proxy/application_1453210838763_0011/
16/01/20 02:45:36 INFO mapreduce.Job: Running job: job_1453210838763_0011
16/01/20 02:45:48 INFO mapreduce.Job: Job job_1453210838763_0011 running in uber mode : false
16/01/20 02:45:48 INFO mapreduce.Job: map 0% reduce 0%
16/01/20 02:47:31 INFO mapreduce.Job: map 33% reduce 0%
16/01/20 02:47:47 INFO mapreduce.Job: map 50% reduce 0%
16/01/20 02:48:08 INFO mapreduce.Job: map 83% reduce 0%
16/01/20 02:48:16 INFO mapreduce.Job: map 100% reduce 0%
16/01/20 02:48:31 INFO mapreduce.Job: map 100% reduce 31%
16/01/20 02:48:34 INFO mapreduce.Job: map 100% reduce 33%
16/01/20 02:50:30 INFO mapreduce.Job: map 100% reduce 34%
16/01/20 03:01:18 INFO mapreduce.Job: map 100% reduce 35%
16/01/20 03:11:58 INFO mapreduce.Job: map 100% reduce 36%
16/01/20 03:22:50 INFO mapreduce.Job: map 100% reduce 37%
16/01/20 03:24:22 INFO mapreduce.Job: map 100% reduce 50%
16/01/20 03:24:35 INFO mapreduce.Job: map 100% reduce 82%
16/01/20 03:24:38 INFO mapreduce.Job: map 100% reduce 83%
16/01/20 03:26:33 INFO mapreduce.Job: map 100% reduce 84%
16/01/20 03:37:35 INFO mapreduce.Job: map 100% reduce 85%
16/01/20 03:39:38 INFO mapreduce.Job: Task Id : attempt_1453210838763_0011_r_000001_0, Status : FAILED
Error: java.lang.IllegalArgumentException: Row length 41221 is > 32767
at org.apache.hadoop.hbase.client.Mutation.checkRow(Mutation.java:506)
at org.apache.hadoop.hbase.client.Mutation.checkRow(Mutation.java:487)
at org.apache.hadoop.hbase.client.Get.<init>(Get.java:89)
at org.apache.gora.hbase.store.HBaseStore.get(HBaseStore.java:208)
at org.apache.gora.hbase.store.HBaseStore.get(HBaseStore.java:79)
at org.apache.gora.store.impl.DataStoreBase.get(DataStoreBase.java:156)
at org.apache.gora.store.impl.DataStoreBase.get(DataStoreBase.java:56)
at org.apache.nutch.crawl.DbUpdateReducer.reduce(DbUpdateReducer.java:114)
at org.apache.nutch.crawl.DbUpdateReducer.reduce(DbUpdateReducer.java:42)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
16/01/20 03:39:39 INFO mapreduce.Job: map 100% reduce 50%
16/01/20 03:39:52 INFO mapreduce.Job: map 100% reduce 82%
16/01/20 03:39:55 INFO mapreduce.Job: map 100% reduce 83%
16/01/20 03:41:56 INFO mapreduce.Job: map 100% reduce 84%
16/01/20 03:53:39 INFO mapreduce.Job: map 100% reduce 85%
16/01/20 03:55:49 INFO mapreduce.Job: Task Id : attempt_1453210838763_0011_r_000001_1, Status : FAILED
Error: java.lang.IllegalArgumentException: Row length 41221 is > 32767
at org.apache.hadoop.hbase.client.Mutation.checkRow(Mutation.java:506)
at org.apache.hadoop.hbase.client.Mutation.checkRow(Mutation.java:487)
at org.apache.hadoop.hbase.client.Get.<init>(Get.java:89)
at org.apache.gora.hbase.store.HBaseStore.get(HBaseStore.java:208)
at org.apache.gora.hbase.store.HBaseStore.get(HBaseStore.java:79)
at org.apache.gora.store.impl.DataStoreBase.get(DataStoreBase.java:156)
at org.apache.gora.store.impl.DataStoreBase.get(DataStoreBase.java:56)
at org.apache.nutch.crawl.DbUpdateReducer.reduce(DbUpdateReducer.java:114)
at org.apache.nutch.crawl.DbUpdateReducer.reduce(DbUpdateReducer.java:42)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
16/01/20 03:55:50 INFO mapreduce.Job: map 100% reduce 50%
16/01/20 03:56:01 INFO mapreduce.Job: map 100% reduce 83%
16/01/20 03:58:02 INFO mapreduce.Job: map 100% reduce 84%
16/01/20 04:10:09 INFO mapreduce.Job: map 100% reduce 85%
16/01/20 04:12:33 INFO mapreduce.Job: Task Id : attempt_1453210838763_0011_r_000001_2, Status : FAILED
Error: java.lang.IllegalArgumentException: Row length 41221 is > 32767
at org.apache.hadoop.hbase.client.Mutation.checkRow(Mutation.java:506)
at org.apache.hadoop.hbase.client.Mutation.checkRow(Mutation.java:487)
at org.apache.hadoop.hbase.client.Get.<init>(Get.java:89)
at org.apache.gora.hbase.store.HBaseStore.get(HBaseStore.java:208)
at org.apache.gora.hbase.store.HBaseStore.get(HBaseStore.java:79)
at org.apache.gora.store.impl.DataStoreBase.get(DataStoreBase.java:156)
at org.apache.gora.store.impl.DataStoreBase.get(DataStoreBase.java:56)
at org.apache.nutch.crawl.DbUpdateReducer.reduce(DbUpdateReducer.java:114)
at org.apache.nutch.crawl.DbUpdateReducer.reduce(DbUpdateReducer.java:42)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
16/01/20 04:12:34 INFO mapreduce.Job: map 100% reduce 50%
16/01/20 04:12:45 INFO mapreduce.Job: map 100% reduce 82%
16/01/20 04:12:48 INFO mapreduce.Job: map 100% reduce 83%
16/01/20 04:14:46 INFO mapreduce.Job: map 100% reduce 84%
16/01/20 04:26:53 INFO mapreduce.Job: map 100% reduce 85%
16/01/20 04:29:09 INFO mapreduce.Job: map 100% reduce 100%
16/01/20 04:29:10 INFO mapreduce.Job: Job job_1453210838763_0011 failed with state FAILED due to: Task failed task_1453210838763_0011_r_000001
Job failed as tasks failed. failedMaps:0 failedReduces:1
16/01/20 04:29:11 INFO mapreduce.Job: Counters: 50
File System Counters
FILE: Number of bytes read=38378343
FILE: Number of bytes written=115957636
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=2382
HDFS: Number of bytes written=0
HDFS: Number of read operations=2
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Job Counters
Failed reduce tasks=4
Launched map tasks=2
Launched reduce tasks=5
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=789909
Total time spent by all reduces in occupied slots (ms)=30215090
Total time spent by all map tasks (ms)=263303
Total time spent by all reduce tasks (ms)=6043018
Total vcore-seconds taken by all map tasks=263303
Total vcore-seconds taken by all reduce tasks=6043018
Total megabyte-seconds taken by all map tasks=808866816
Total megabyte-seconds taken by all reduce tasks=30940252160
Map-Reduce Framework
Map input records=49929
Map output records=1777904
Map output bytes=382773368
Map output materialized bytes=77228942
Input split bytes=2382
Combine input records=0
Combine output records=0
Reduce input groups=754170
Reduce shuffle bytes=38318183
Reduce input records=881156
Reduce output records=754170
Spilled Records=2659060
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=17993
CPU time spent (ms)=819690
Physical memory (bytes) snapshot=4080136192
Virtual memory (bytes) snapshot=15234293760
Total committed heap usage (bytes)=4149739520
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=0
Exception in thread "main" java.lang.RuntimeException: job failed: name=[1]update-table, jobid=job_1453210838763_0011
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120)
at org.apache.nutch.crawl.DbUpdaterJob.run(DbUpdaterJob.java:111)
at org.apache.nutch.crawl.DbUpdaterJob.updateTable(DbUpdaterJob.java:140)
at org.apache.nutch.crawl.DbUpdaterJob.run(DbUpdaterJob.java:174)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.DbUpdaterJob.main(DbUpdaterJob.java:178)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Error running:
/usr/share/searchEngine/nutch-branch-2.3.1/runtime/deploy/bin/nutch updatedb -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true 1453230757-13191 -crawlId 1
Failed with exit value 1.
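To make the error concrete: the 32767 in the exception is Short.MAX_VALUE, the largest row key (in bytes) that the HBase client accepts, so the reducer appears to be looking up a key of 41221 bytes. Below is a minimal sketch of that length check, assuming the row key is roughly as long as the URL it is derived from. The class name RowKeyLengthCheck, the method fitsInRowKey, and the example.com URL are hypothetical and are not part of Nutch, Gora, or HBase.

```java
import java.nio.charset.StandardCharsets;

public class RowKeyLengthCheck {
    // Limit quoted in the exception: Short.MAX_VALUE = 32767 bytes.
    static final int MAX_ROW_LENGTH = Short.MAX_VALUE;

    // Hypothetical helper: true if the key, encoded as UTF-8,
    // is short enough to be used as an HBase row key.
    static boolean fitsInRowKey(String key) {
        return key.getBytes(StandardCharsets.UTF_8).length <= MAX_ROW_LENGTH;
    }

    public static void main(String[] args) {
        // Build an ASCII URL of exactly 41221 characters (the length from the log).
        StringBuilder longUrl = new StringBuilder("http://example.com/?q=");
        while (longUrl.length() < 41221) {
            longUrl.append('a');
        }

        System.out.println(fitsInRowKey("http://example.com/")); // true
        System.out.println(fitsInRowKey(longUrl.toString()));    // false
    }
}
```

If this reading is right, the failing key most likely comes from a very long URL discovered during parsing rather than from one of the seeds, but I have not confirmed that.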
Please advise.