1
votes

I use mysql as storage backend with nutch.

Job failed when crawling some sites. Got the following exception and exit nutch when reaching this page: http://www.appchina.com/users.html

Exception in thread "main" java.lang.RuntimeException: job failed: name=parse, jobid=job_local_0004
    at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:47)
    at org.apache.nutch.parse.ParserJob.run(ParserJob.java:249)
    at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68)
    at org.apache.nutch.crawl.Crawler.run(Crawler.java:171)
    at org.apache.nutch.crawl.Crawler.run(Crawler.java:250)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawler.main(Crawler.java:257)

So I modify the ./src/java/org/apache/nutch/util/NutchJob.java change the if (getConfiguration().getBoolean("fail.on.job.failure", true)) { to if (getConfiguration().getBoolean("fail.on.job.failure", false)) {

After recompiling, I won't get any exception, but unlimited restart crawling.

FetcherJob : timelimit set for : -1
FetcherJob: threads: 30
FetcherJob: parsing: false
FetcherJob: resuming: false
Using queue mode : byHost
Fetcher: threads: 30
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
QueueFeeder finished: total 2 records. Hit by time limit :0
fetching http://www.appchina.com/
fetching http://www.appchina.com/users.html
-finishing thread FetcherThread0, activeThreads=29
-finishing thread FetcherThread29, activeThreads=28
...
0/0 spinwaiting/active, 2 pages, 0 errors, 0.4 0.4 pages/s, 137 137 kb/s, 0 URLs in 0 queues
-activeThreads=0
ParserJob: resuming:    false
ParserJob: forced reparse:  false
ParserJob: parsing all
Parsing http://www.appchina.com/
Parsing http://www.appchina.com/users.html

UPDATE error in hadoop.log

2012-09-17 18:48:51,257 WARN  mapred.LocalJobRunner - job_local_0004
java.io.IOException: java.sql.BatchUpdateException: Incorrect string value: '\xE7\x94\xA8\xE6\x88\xB7...' for column 'text' at row 1
        at org.apache.gora.sql.store.SqlStore.flush(SqlStore.java:340)
        at org.apache.gora.sql.store.SqlStore.close(SqlStore.java:185)
        at org.apache.gora.mapreduce.GoraRecordWriter.close(GoraRecordWriter.java:55)
        at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.close(MapTask.java:651)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:766)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: java.sql.BatchUpdateException: Incorrect string value: '\xE7\x94\xA8\xE6\x88\xB7...' for column 'text' at row 1
        at com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatement.java:2028)
        at com.mysql.jdbc.PreparedStatement.executeBatch(PreparedStatement.java:1451)
        at org.apache.gora.sql.store.SqlStore.flush(SqlStore.java:328)
        ... 6 more
Caused by: java.sql.SQLException: Incorrect string value: '\xE7\x94\xA8\xE6\x88\xB7...' for column 'text' at row 1
        at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1073)
        at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3609)
        at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3541)
        at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2002)
        at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2163)
        at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2624)
        at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2127)
        at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2427)
        at com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatement.java:1980)
        ... 8 more

UPDATE again

I've drop the table gora created and create a similar table with a VARCHAR(128) id and utf8mb4 DEFAULT CHARSET. It works now. Why?

Anyone help?

1

1 Answers

0
votes

You need to add the hadoop logs for the Parse job. The stack trace attached is not showing that info. After u did that code change, did parsing happen successfully ?