2
votes

I'm about to write a small web-scraping program in Clojure / ClojureScript. It's quite a simple command-line app (for Linux), which visits a webpage, filters the results and prints them to the console.

However, this raises a few questions - not least because I come from a JS/Node.js background and Clojure is quite new to me.

(1) First of all: Is this a good task for a Clojure program, which will be delivered for the JVM as a .jar file? Starting the JVM is slow, but the program needs to start and stop quickly, since it's for everyday use. I guess there are ways to keep one JVM running in the background, waiting to execute jar files on demand. (?)

(2) The other approach would be to use ClojureScript and compile it to Node-friendly JavaScript. This would certainly solve the startup problem from the previous paragraph. But I'm not sure if it's necessary.

(3) The other question is which library to use, which is of course related to the previous points. Is there a good Clojure/ClojureScript library for this purpose, basically for querying the DOM with CSS selectors? In JS I would use jsdom, which reads HTML strings and creates a "Shadow DOM" from them. What are the equivalents in the Clojure world?

(4) A plus would certainly be a library that deals with common web-scraping tasks, such as handling information that is spread over several numbered pages (e.g. the results of a search engine).

Does anyone have some hints for me?


1 Answer

3
votes

As you've already identified, Clojure programs don't have to be compiled into JVM bytecode. Since you have a background in JS, I would recommend compiling your scraper for Node. If you're new to Clojure, then having some familiar tools can help.

This way you can set up the same toolchain for making client and server builds. You can also take advantage of the near-instant startup time of V8. That said, there are actually plenty of ways to make JVM startup less painful.

You might want to take a look at hickory, a library for parsing HTML strings and operating on them with CSS-like selectors. However, there is also a wealth of scraping libraries and tools available on NPM that you can reach for if you are compiling to JS.
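To give a feel for what selecting nodes with hickory looks like, here is a minimal sketch. It assumes JVM Clojure with hickory on the classpath (on the JVM its parser is backed by Jsoup); the namespace `scraper.links`, the function `extract-links`, and the sample markup are made up for illustration and are not part of the library.

```clojure
(ns scraper.links
  (:require [hickory.core :as h]
            [hickory.select :as s]))

;; Hypothetical sample markup standing in for a fetched page.
(def sample-html
  (str "<html><body><ul class=\"results\">"
       "<li><a href=\"/a\">First</a></li>"
       "<li><a href=\"/b\">Second</a></li>"
       "</ul></body></html>"))

(defn extract-links
  "Returns a vector of {:href .. :text ..} maps for every <a> element
  nested anywhere below an element with class \"results\"."
  [html]
  (let [tree (-> html h/parse h/as-hickory)]
    (->> tree
         ;; roughly the CSS selector `.results a`
         (s/select (s/descendant (s/class "results") (s/tag :a)))
         (mapv (fn [node]
                 {:href (get-in node [:attrs :href])
                  ;; keep only direct text children of the link
                  :text (apply str (filter string? (:content node)))})))))
```

Calling `(extract-links sample-html)` should give one map per link, which you can then filter and print. hickory selectors are ordinary functions that you combine, rather than CSS strings, so this is close to jsdom's `querySelectorAll` in spirit but not in syntax.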

For a scraping library with more features, you might want to check out enlive and this tutorial, as it seems like a great place to start.
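As far as I know, following numbered result pages is something you write yourself on top of enlive's selectors, but it is only a short loop. The following is a rough sketch under a few assumptions: enlive is on the classpath, each page marks its items with `li.item` and its next-page link with `a.next`, and `fake-pages` is a hypothetical in-memory stand-in for real HTTP requests. The names `parse` and `scrape-all` are my own.

```clojure
(ns scraper.pages
  (:require [net.cgrand.enlive-html :as html])
  (:import (java.io StringReader)))

;; Hypothetical "site": URL -> HTML. A real scraper would fetch over HTTP.
(def fake-pages
  {"/page/1" (str "<ul><li class=\"item\">A</li><li class=\"item\">B</li></ul>"
                  "<a class=\"next\" href=\"/page/2\">Next</a>")
   "/page/2" "<ul><li class=\"item\">C</li></ul>"})

(defn parse
  "Parses an HTML string into enlive nodes. A Reader is used because
  html-resource treats a plain string as a classpath resource name."
  [html-string]
  (html/html-resource (StringReader. html-string)))

(defn scrape-all
  "Collects the text of every li.item, starting at start-url and following
  a.next links until a page has none. `fetch` maps a URL to an HTML string."
  [fetch start-url]
  (loop [url start-url
         items []]
    (if-not url
      items
      (let [page     (parse (fetch url))
            found    (map html/text (html/select page [:li.item]))
            next-url (-> (html/select page [:a.next]) first :attrs :href)]
        (recur next-url (into items found))))))
```

Here `(scrape-all fake-pages "/page/1")` walks both pages, since a Clojure map can be called as a function of its keys. For a real site you would pass a function that downloads the URL instead, and probably cap the number of pages so that a cycle of "next" links cannot loop forever.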