0
votes

I have this function running in a Google Sheets script that pulls HTML from subreddits and returns it to a spreadsheet. It works for me some/most of the time, but other times I get the error "Could not parse text. (line 13)", which is the line with var doc = Xml.parse(page, true);. Any idea why this is happening, or is this just a bug with Google Apps Script? Here's the code that works... sometimes.

function getRedditHTML() {

  var entries_array = [];
  var subreddit_array = ['https://www.reddit.com/r/news/','https://www.reddit.com/r/funny/','https://www.reddit.com/r/science/'];

  for (var s = 0; s < subreddit_array.length; s++) {

    var page = UrlFetchApp.fetch(subreddit_array[s]);

    //this is Line 13 that is breaking
    var doc = Xml.parse(page, true);

    var bodyHtml = doc.html.body.toXmlString();
    doc = XmlService.parse(bodyHtml);
    var root = doc.getRootElement();
    var entries = getElementsByClassName(root,'thing');

    for (var i = 0; i < entries.length; i++) {

      var title = getElementsByClassName(entries[i],'title');
      title = XmlService.getRawFormat().format(title[1]).replace(/<[^>]*>/g, "");

      var link = getElementsByClassName(entries[i],'comments');
      link = link[0].getAttribute('href').getValue();

      var rank = getElementsByClassName(entries[i],'rank');
      rank = rank[0].getValue();

      var likes = getElementsByClassName(entries[i],'likes');
      likes = likes[0].getValue();


      entries_array.push([rank, likes, title, link]);
    }
  }

  return entries_array.sort(function (a, b) { 
    return b[1] - a[1];
  });
}
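
Note: getElementsByClassName() is not a built-in Apps Script function; the script assumes a helper that walks the XmlService element tree and matches on the class attribute. A minimal sketch of such a helper (the exact version used may differ):

function getElementsByClassName(element, classToFind) {
  var results = [];
  var descendants = element.getDescendants(); // every nested node under this element
  for (var i = 0; i < descendants.length; i++) {
    var el = descendants[i].asElement(); // null for text/comment nodes
    if (el == null) continue;
    var classAttr = el.getAttribute('class');
    if (classAttr == null) continue;
    // a node can carry several space-separated classes, e.g. "thing link"
    var classes = classAttr.getValue().split(' ');
    if (classes.indexOf(classToFind) > -1) {
      results.push(el);
    }
  }
  return results;
}
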
Your script didn't work at all when I tried it. Also, I don't see any class called thing by itself; there are tags such as thing_id. Is it mandatory to do it via Apps Script for you? Because I have a couple of other really easy ways to import it directly into the sheet. – Aurielle Perlmann
Just tried the above code in a Google Sheets script and it worked for me. The class thing is on each div that contains a post in the HTML. Each div also has a unique id, but I want to pull all posts, which is why I pulled by class name. I tried other built-in functions in Google Sheets (e.g. IMPORTFEED), but they couldn't give me specific values I wanted like "rank" or "likes", so I decided to make this script. If you have other methods, let me know. – Keith

1 Answer

0
votes

Here is what I found upon playing with importXML (my usual way of doing this): for some reason I cannot narrow down, it DOES appear to randomly stall out and return null for a few minutes. So I'm guessing the issue with your script is not the code, but that the site or Google temporarily blocks / won't return the data.

However, I found the JSON endpoint for the piece you want, and I noticed that when the XML went down, the JSON didn't.

You can take that and adapt it to push your own array of topics/URLs - I just left it as one link for now to show you how the URL breaks down and where it should be modified (there's a rough multi-subreddit sketch after the function below):

The URL is: https://www.reddit.com/r/news/hot.json?raw_json=1&subredditName=news&sort=top&t=day&feature=link_preview&sr_detail=true&app=mweb-client

news is mentioned in 2 places, so just modify all your URLs to follow that pattern - you can easily load that JSON in a browser to see all the fields available.

Also, the hot.json portion is where you change which listing you want: the ranked list (called hot), or new, top, promoted, etc. - you just change that keyword.
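
For example (same query string as above, just swapping the keyword - hotUrl/newUrl/topUrl are just illustrative variable names):

// same subreddit, different listings - only the keyword before ".json" changes
var hotUrl = 'https://www.reddit.com/r/news/hot.json?raw_json=1&subredditName=news&sort=top&t=day&feature=link_preview&sr_detail=true&app=mweb-client';
var newUrl = 'https://www.reddit.com/r/news/new.json?raw_json=1&subredditName=news&sort=top&t=day&feature=link_preview&sr_detail=true&app=mweb-client';
var topUrl = 'https://www.reddit.com/r/news/top.json?raw_json=1&subredditName=news&sort=top&t=day&feature=link_preview&sr_detail=true&app=mweb-client';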

Score is the same as the upvotes/likes

function getSubReddit() {
  var ss = SpreadsheetApp.getActiveSpreadsheet();
  var sheet = ss.getActiveSheet(); // get active sheet
  var subject = 'news';
  var url = 'https://www.reddit.com/r/' + subject + '/hot.json?raw_json=1&subredditName=' + subject + '&sort=top&t=day&feature=link_preview&sr_detail=true&app=mweb-client'; // JSON endpoint for the subreddit data

  var response = UrlFetchApp.fetch(url); // fetch the API endpoint
  var json = response.getContentText(); // get the response content as text
  var redditData = JSON.parse(json); // parse text into a JSON object

  Logger.log(redditData); // log data to the logger to check

  // create empty array to hold one row of data points per post
  var statsRows = [];
  var date = new Date(); // timestamp for this pull

  // each entry in redditData.data.children is one post in the listing
  var posts = redditData.data.children;

  for (var i = 0; i < posts.length; i++) {
    var stats = [];

    stats.push(date); // timestamp
    stats.push(i + 1); // rank (position in the listing)
    stats.push(posts[i].data.score); // score (same as upvotes/likes)
    stats.push(posts[i].data.title); // title
    stats.push(posts[i].data.url); // article url
    // stats.push('http://www.reddit.com' + posts[i].data.permalink); // reddit permalink

    statsRows.push(stats);
  }

  // append one row per post to the active sheet
  for (var j = 0; j < statsRows.length; j++) {
    sheet.appendRow(statsRows[j]);
  }
}
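
And here's a rough sketch of the multi-subreddit version mentioned above - it loops the same endpoint over the three subreddits from the question and adds the subreddit name as an extra column (the function name and column layout here are just my own choice, adjust as needed):

function getSubReddits() {
  var sheet = SpreadsheetApp.getActiveSpreadsheet().getActiveSheet();
  var subjects = ['news', 'funny', 'science']; // the subreddits from the question
  var date = new Date(); // one timestamp for the whole pull

  for (var s = 0; s < subjects.length; s++) {
    // the subreddit name appears in two places in the URL, so both use subjects[s]
    var url = 'https://www.reddit.com/r/' + subjects[s] + '/hot.json?raw_json=1&subredditName=' + subjects[s] + '&sort=top&t=day&feature=link_preview&sr_detail=true&app=mweb-client';
    var posts = JSON.parse(UrlFetchApp.fetch(url).getContentText()).data.children;

    for (var i = 0; i < posts.length; i++) {
      sheet.appendRow([
        date,                 // timestamp
        subjects[s],          // subreddit
        i + 1,                // rank in the listing
        posts[i].data.score,  // score (upvotes/likes)
        posts[i].data.title,  // title
        posts[i].data.url     // article url
      ]);
    }
  }
}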
