3
votes

Could someone explain how to scrape the content of the <td> tags in a row whose <th> has a given content value (in this case I actually need to match on the content of the <b> tag, "Row1 title"), without scraping the <th> tag or any of its content in the process? Here is my test HTML:

<table class="table_class"> 
                    <tbody> 
                       <tr> 
                         <th>
                           <b>
                              Row1 title
                           </b>
                         </th> 
                         <td>2.660.784</td> 
                         <td>2.944.552</td> 
                         <td>Correct, has 3 td elements</td> 
                       </tr> 
                       <tr> 
                         <th>                                
                              Row2 title                                
                          </th> 
                         <td>2.660.784</td> 
                         <td>2.944.552</td> 
                         <td>Correct, has 3 td elements</td> 
                       </tr> 
                    </tbody>
</table>

Data which I want to extract should come from these tags:

                     <td>2.660.784</td> 
                     <td>2.944.552</td> 
                     <td>Correct, has 3 td elements</td> 

I have managed to create a function which returns the entire content of the table, but I would like to exclude the <th> node from the result and return only the data from the <td> nodes, whose content I can use for further parsing. Can anyone help me with this?

1 Answer

2
votes

With enlive, something like this:

(ns tutorial.so-scrape
  (:require [net.cgrand.enlive-html :as html]))

(defn parse-tds [url] 
 (html/select (html/html-resource (java.net.URL. url)) [:table :td])) 

should give you a sequence of all the td nodes, each of the form {:tag :td :attrs {...} :content (...)}. I am not aware of a way in enlive to get the content of those nodes directly, but I could be wrong.

You could then extract the content from that sequence with something along the lines of the following (where ws-content holds the sequence of nodes, e.g. the result of parse-tds):
(for [line ws-content] (apply str (:content line)))
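As a minimal sketch of that extraction step: as far as I know, enlive does ship a `text` helper that returns the text value of a node, which avoids handling `:content` by hand. The inline `sample-html` string, the namespace name, and the `td-texts` function below are hypothetical names made up for illustration, and the snippet is parsed from a string with `html-snippet` instead of from a URL.

```clojure
(ns tutorial.td-text-sketch
  (:require [net.cgrand.enlive-html :as html]
            [clojure.string :as str]))

;; Hypothetical inline copy of one row of the question's table,
;; used here instead of fetching a page from a URL.
(def sample-html
  (str "<table class=\"table_class\"><tbody><tr>"
       "<th><b>Row1 title</b></th>"
       "<td>2.660.784</td><td>2.944.552</td>"
       "</tr></tbody></table>"))

(defn td-texts
  "Returns the trimmed text of every td found inside a table in nodes.
  The th nodes are never selected, so their text is not included."
  [nodes]
  (map (comp str/trim html/text)
       (html/select nodes [:table :td])))

;; Usage: parse the string into nodes, then pull out the td texts.
(td-texts (html/html-snippet sample-html))
```

Because `text` walks nested nodes, this should also cope with cells that contain markup such as <b>, which a plain `(apply str (:content line))` would not.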

In regard to the question you posted yesterday (I am assuming you are still working with that page): the solution I gave there was a little complex, but it's also flexible. For example, if you change the tag-type function like this

(defn tag-type [node]
  (case (:tag node) 
   :td    ::TerminalNode
   ::IgnoreNode))

(that is, change the return value for all nodes to ::IgnoreNode except for :td), then it just gives you a sequence of the content of the :tds, which is probably close to what you want. Let me know if you need more help.

EDIT (in reply to comments below) I don't think selecting nodes based on their :content is possible with enlive alone - but you can certainly do so with Clojure.

For example, you could do something like

;; :content is a sequence, so join it into a string before matching
(for [line ws-content
      :when (re-find (re-pattern "WHAT YOU WANT TO MATCH") (apply str (:content line)))]
  (:content line))

This could work (you might have to tweak the (:content line) form a little, since :content is a sequence that can hold nested nodes as well as strings).
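To tie this back to the original question (match on the <th> text, return only the <td> data), here is a rough sketch that selects the rows first and filters them in plain Clojure. It assumes enlive's `select` and `text` behave as I understand them; `sample-html`, `tds-for-row`, and the namespace name are hypothetical and not part of the earlier answer.

```clojure
(ns tutorial.row-match-sketch
  (:require [net.cgrand.enlive-html :as html]
            [clojure.string :as str]))

;; Hypothetical, shortened inline version of the question's test table.
(def sample-html
  (str "<table class=\"table_class\"><tbody>"
       "<tr><th><b>Row1 title</b></th>"
       "<td>2.660.784</td><td>2.944.552</td></tr>"
       "<tr><th>Row2 title</th>"
       "<td>1.111.111</td><td>2.222.222</td></tr>"
       "</tbody></table>"))

(defn tds-for-row
  "For each table row whose th text matches the regex re, returns a
  sequence of that row's trimmed td texts. The th is only used for
  matching and is not part of the result."
  [nodes re]
  (for [row (html/select nodes [:table :tr])
        ;; text of the th, including any nested <b>, with whitespace trimmed
        :let [title (->> (html/select row [:th])
                         (map html/text)
                         (apply str)
                         str/trim)]
        :when (re-find re title)]
    (map (comp str/trim html/text)
         (html/select row [:td]))))

;; Usage: only the td texts of the "Row1 title" row are returned.
(tds-for-row (html/html-snippet sample-html) #"Row1 title")
```

Filtering on the row and then selecting its :td children keeps the matching logic in ordinary Clojure, so the regex can be swapped for any other predicate on the title string.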