
I cannot crawl the links in Facebook's API response. Everything works fine when I crawl other web pages. I'm using Nutch 2.2.1, HBase 0.9 for storage, and Solr for indexing. As the seed I'm using

https://graph.facebook.com/v2.10/me?fields=friends%7Bfeed%7Bpermalink_url%7D%2Cname%7D&access_token=<MY_ACC_TOKEN>

Injecting works: at the end of the crawl cycle, my seed is saved in my database. But during fetching, Nutch doesn't see any URLs:

Fetcher: throughput threshold: -1
-finishing thread FetcherThread49, activeThreads=0
Fetcher: throughput threshold sequence: 5
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues

I have already tried editing the files that discard URLs containing characters that mark probable queries, but nothing changed. I have also enabled HTTPS support, since it does not work by default.
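For reference, the rule I mean is the one in the stock regex-urlfilter.txt that rejects any URL containing one of the characters `?*!@=`. Below is a minimal Python sketch of how such a first-match-wins filter would treat my seed. It is only a model of the behaviour, not Nutch's actual code: the `accepts` helper, the shortened rule list, and the `TOKEN` placeholder are assumptions made for illustration.

```python
import re

# Rules modeled on a subset of the stock regex-urlfilter.txt.
# "-" rejects a matching URL, "+" accepts it; the first matching rule decides.
DEFAULT_RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"[?*!@=]"),  # skip URLs containing certain characters as probable queries
    ("+", r"."),        # accept anything else
]


def accepts(url, rules):
    """Return True if the first rule matching `url` is an accept rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched: the URL is dropped


# TOKEN is a placeholder, not a real access token.
SEED = ("https://graph.facebook.com/v2.10/me"
        "?fields=friends%7Bfeed%7Bpermalink_url%7D%2Cname%7D&access_token=TOKEN")

# The same rules with the "probable queries" rule taken out.
RELAXED_RULES = [rule for rule in DEFAULT_RULES if rule[1] != r"[?*!@=]"]

print(accepts(SEED, DEFAULT_RULES))  # rejected: the URL contains "?" and "="
print(accepts(SEED, RELAXED_RULES))  # accepted by the catch-all rule
```

If this model is right, the seed is filtered out as long as that rule stays active in whichever filter file the job actually reads.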

How can I solve this?

What do you mean, it “doesn't see any URL”? Where exactly would it be supposed to see those …? - CBroe
In my API request I've specified the permalink_url field for every feed node of my friend list. I just tried that API request with curl, and I can get the URLs that I want to crawl with Nutch. "Doesn't see" means that the output says "0 pages fetched and 0 URLs". - Lorenzo Epifani
You’re aware that each one of those friends would have to login to your app and grant it user_posts permission, before you can access their feed, right? - CBroe
“and i can get the urls that i want to crawl with nutch” - you mean you want to do actual scraping? That is not allowed by Facebook. - CBroe
Yes, I know; I've got about 10 friends with this permission (and others). The problem is that Nutch "doesn't see" links outside of HTML tags. This doesn't happen only with the Facebook API, but with other web pages too. For example, if I try to use Nutch on some Wikipedia pages, I can get all the links except for some "references section" links, which are in the body text outside of tags. I don't know the exact meaning of "web scraping"; I need text written by people for semantic analysis with Stanbol, without tracking anybody. - Lorenzo Epifani
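To illustrate what "links outside of HTML tags" means here: the API response is JSON, so the URLs are plain string values rather than anchor elements. The sketch below pulls them out with a JSON parser. The `SAMPLE` payload is a made-up approximation of the response shape with example.com placeholders, and `collect_values` is a hypothetical helper, not something Nutch or the Graph API provides.

```python
import json

# Made-up payload that approximates the shape of the API response.
SAMPLE = """
{
  "friends": {
    "data": [
      {"name": "Friend A",
       "feed": {"data": [{"permalink_url": "https://example.com/post/1"},
                         {"permalink_url": "https://example.com/post/2"}]}},
      {"name": "Friend B",
       "feed": {"data": [{"permalink_url": "https://example.com/post/3"}]}}
    ]
  }
}
"""


def collect_values(node, key):
    """Recursively collect every string stored under `key` in parsed JSON."""
    found = []
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key and isinstance(value, str):
                found.append(value)
            else:
                found.extend(collect_values(value, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_values(item, key))
    return found


urls = collect_values(json.loads(SAMPLE), "permalink_url")
print(urls)
```

A parser that only looks for anchor tags would find nothing in such a document, which would match the "0 URLs" output.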

1 Answer


Automated scraping is not allowed on Facebook.

  1. You will not engage in Automated Data Collection without Facebook's express written permission.

See the full ToS here:

https://www.facebook.com/apps/site_scraping_tos_terms.php