
I have been trying to deploy a Nutch job (with custom plugins) on my Google Cloud Dataproc cluster, but I have been encountering many errors (some of them basic, I suspect).

I need an explicit step-by-step guide on how to do this. The guide should include how to set permissions and access files both in the GCS bucket and on the local file system (Windows 7).

I have tried this configuration, but with no success:

Region: global
Cluster: first-cluster
Job type: Hadoop
Jar files: gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/apache-nutch-1.12-SNAPSHOT.job
Main class or jar: gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/bin/nutch
Arguments: gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/seed.txt, -depth 4

I have also tried:

Region: global 
Cluster: first-cluster
Job type: Hadoop
Jar files: gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/apache-nutch-1.12-SNAPSHOT.job
Main class or jar: gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/bin/crawl
Arguments: gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/seed.txt, -depth 4

And:

Region: global
Cluster: first-cluster
Job type: Hadoop
Jar files: gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/apache-nutch-1.12-SNAPSHOT.job
Main class or jar: org.apache.nutch.crawl.Crawl
Arguments: gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/seed.txt, -depth 4

Follow-up: I have made some progress, but I am getting this error now:

17/07/28 18:59:11 INFO crawl.Injector: Injector: starting at 2017-07-28 18:59:11
17/07/28 18:59:11 INFO crawl.Injector: Injector: crawlDb: https://console.cloud.google.com/storage/browser/dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/urls
17/07/28 18:59:11 INFO crawl.Injector: Injector: urlDir: https://console.cloud.google.com/storage/browser/dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/crawlDb
17/07/28 18:59:11 INFO Configuration.deprecation: mapred.temp.dir is deprecated. Instead, use mapreduce.cluster.temp.dir
17/07/28 18:59:11 INFO crawl.Injector: Injector: Converting injected urls to crawl db entries.
17/07/28 18:59:11 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.6.1-hadoop2
17/07/28 18:59:11 ERROR crawl.Injector: Injector: java.lang.IllegalArgumentException: Wrong FS: https://console.cloud.google.com/storage/browser/dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/urls, expected: hdfs://first-cluster-m
    at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:648)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:194)
    at org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:106)
    at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1305)
    at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1301)
    at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:298)
    at org.apache.nutch.crawl.Injector.run(Injector.java:379)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.crawl.Injector.main(Injector.java:369)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.google.cloud.hadoop.services.agent.job.shim.HadoopRunClassShim.main(HadoopRunClassShim.java:19)

I know it has to do with the file system. How do I correctly reference files on Google Cloud Storage and on HDFS?

Follow-up: I have made some more progress with this config:

{ "reference": { "projectId": "ageless-valor-174413", "jobId": "108a7d43-671a-4f61-8ba8-b87010a8a823" }, "placement": { "clusterName": "first-cluster", "clusterUuid": "f3795563-bd44-4896-bec7-0eb81a3f685a" }, "status": { "state": "ERROR", "details": "Google Cloud Dataproc Agent reports job failure. If logs are available, they can be found in 'gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/google-cloud-dataproc-metainfo/f3795563-bd44-4896-bec7-0eb81a3f685a/jobs/108a7d43-671a-4f61-8ba8-b87010a8a823/driveroutput'.", "stateStartTime": "2017-07-28T18:59:13.518Z" }, "statusHistory": [ { "state": "PENDING", "stateStartTime": "2017-07-28T18:58:57.660Z" }, { "state": "SETUP_DONE", "stateStartTime": "2017-07-28T18:59:00.811Z" }, { "state": "RUNNING", "stateStartTime": "2017-07-28T18:59:02.347Z" } ], "driverOutputResourceUri": "gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/google-cloud-dataproc-metainfo/f3795563-bd44-4896-bec7-0eb81a3f685a/jobs/108a7d43-671a-4f61-8ba8-b87010a8a823/driveroutput", "driverControlFilesUri": "gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/google-cloud-dataproc-metainfo/f3795563-bd44-4896-bec7-0eb81a3f685a/jobs/108a7d43-671a-4f61-8ba8-b87010a8a823/", "hadoopJob": { "mainClass": "org.apache.nutch.crawl.Injector", "args": [ "https://console.cloud.google.com/storage/browser/dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/urls/", "https://console.cloud.google.com/storage/browser/dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/crawlDb/" ], "jarFileUris": [ "gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/apache-nutch-1.12-SNAPSHOT.job" ], "loggingConfig": {} } }

But I am getting the same "Wrong FS" error shown in the log above.
This is rather too broad, and is likely to close. It is much better for you to show what you have tried, including those errors you mentioned, and then someone can help with that. - halfer
I have tried this configuration but to no success: Region: global, Cluster: first-cluster, Job type: Hadoop, Jar files: gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/apache-nutch-1.12-SNAPSHOT.job, Main class or jar: gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/bin/nutch, Arguments: gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/seed.txt -depth 4 - user2665137

1 Answer


You can solve this problem by using the correct scheme (gs://) to refer to Google Cloud Storage files, and then changing the default file system to Google Cloud Storage.

Step 1:

Replace the Cloud Console browser URL
https://console.cloud.google.com/storage/browser/dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/urls
with the corresponding gs:// URI
gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/urls
and do the same for every other path passed as a job argument (the crawlDb directory, the seed list, and so on). The console.cloud.google.com address is only the Storage browser's web URL; Hadoop cannot read from it.
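For example, resubmitting the Injector step from the command line with gs:// paths could look like the sketch below. This is only an illustration, not a verified recipe: the bucket, cluster, and jar names are taken from the question, and Nutch 1.x's Injector expects the crawlDb path first and the URL directory second.

# Sketch: submit the Injector with gs:// URIs instead of Cloud Console URLs.
# Names below come from the question; adjust to your own bucket and cluster.
gcloud dataproc jobs submit hadoop \
  --cluster=first-cluster \
  --region=global \
  --class=org.apache.nutch.crawl.Injector \
  --jars=gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/apache-nutch-1.12-SNAPSHOT.job \
  -- gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/crawlDb \
     gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/urls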

Step 2:

Add the following property to your nutch-site.xml file:

<property>
  <name>fs.defaultFS</name>
  <value>gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia</value>
  <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The URI's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The URI's authority is used to determine the host, port, etc. for a file system.</description>
</property>

(This property is called "fs.default.name" in older versions of Hadoop.) Note that in deploy mode nutch-site.xml is packaged into the apache-nutch-1.12-SNAPSHOT.job file, so you need to rebuild that job file after changing the property.
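If rebuilding the .job file is inconvenient, it may also be possible to override the default file system per job through Dataproc's --properties flag, as in the sketch below. This is an assumption rather than a verified recipe: whether the override actually reaches Nutch depends on how its configuration is loaded on top of the cluster's Hadoop configuration.

# Sketch (unverified): set fs.defaultFS for a single job instead of editing nutch-site.xml.
gcloud dataproc jobs submit hadoop \
  --cluster=first-cluster \
  --region=global \
  --class=org.apache.nutch.crawl.Injector \
  --jars=gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/apache-nutch-1.12-SNAPSHOT.job \
  --properties=fs.defaultFS=gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia \
  -- gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/crawlDb \
     gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/urls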