I have some data in text form, taken from a webpage. It's quite lengthy but follows the form:
<p><span class="monthyear">Jan 2001</span>
<br><b>Foo text (2)</b></p>
<p><span class="monthyear">Nov 2006</span>
<br><b>Bar text (29)</b>
<br><b>More bar text (4)</b>
<br><b>Yet more bar text (102)</b></p>
<p><span class="monthyear">Apr 2004</span>
<br><b>Further foo text (1)</b>
<br><b>Combination foo and bar text (41)</b></p>
I want to extract the relevant parts of this into a data frame, like so:
  monthyear          info  n
1  Jan 2001      Foo text  2
2  Nov 2006      Bar text 29
3  Nov 2006 More bar text  4
...but I'm not sure how to do it. If I have the HTML in a character vector called text, I can extract the monthyear data using a function from the stringr package:
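library(stringr)
# Look behind for "monthyear"> and lazily match up to a 20xx year.
# stringr 1.0 and later treats patterns as ICU regex, so perl() is not needed.
monthyear <- str_extract_all(
  text[1], "(?<=\"monthyear\">).*?20[0-9]{2}"
)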
and I could extract the info and n data in the same sort of way, but given that there are multiple info and n entries for each monthyear entry, I'm not sure how to combine them. Am I going about this all wrong?
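For what it's worth, one way to keep each monthyear paired with its own entries while staying with regular expressions is to split the HTML into `<p>` blocks first and then extract within each block. The following is only a minimal base R sketch under some assumptions: the HTML is a single string, every block has exactly one monthyear span, and every entry ends in a count in parentheses. The helper name `extract_all` is made up for illustration.

```r
# Sample input copied from the example above (assumed to be one string)
text <- '<p><span class="monthyear">Jan 2001</span>
<br><b>Foo text (2)</b></p>
<p><span class="monthyear">Nov 2006</span>
<br><b>Bar text (29)</b>
<br><b>More bar text (4)</b>
<br><b>Yet more bar text (102)</b></p>'

# Hypothetical helper: all matches of a PCRE pattern within one string
extract_all <- function(pattern, x) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

# One element per <p>...</p> block; (?s) lets . match newlines
blocks <- extract_all("(?s)<p>.*?</p>", text)

rows <- lapply(blocks, function(block) {
  monthyear <- extract_all("(?<=\"monthyear\">)[^<]+", block)
  items <- extract_all("(?<=<b>)[^<]+", block)
  data.frame(
    monthyear = monthyear,                                   # recycled per item
    info = sub(" *\\([0-9]+\\)$", "", items),                # drop trailing "(n)"
    n = as.integer(sub("^.*\\(([0-9]+)\\)$", "\\1", items)), # keep only n
    stringsAsFactors = FALSE
  )
})

result <- do.call(rbind, rows)
print(result)
```

Because `data.frame()` recycles the single monthyear value across all entries in its block, the pairing falls out of the per-block loop rather than needing a separate merge step.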
Use the XML package and its parsing functions (specifically htmlTreeParse) instead of trying to regex your way back to data. – Justin
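A rough sketch of what the parser-based route might look like, assuming the XML package is installed and that every `<b>` entry is a direct child of its `<p>` and ends in a count in parentheses. The variable names and the trimmed sample input are illustrative only, not taken from the comment.

```r
library(XML)

# Trimmed sample input in the same shape as the question's HTML
html <- '<p><span class="monthyear">Jan 2001</span>
<br><b>Foo text (2)</b></p>
<p><span class="monthyear">Nov 2006</span>
<br><b>Bar text (29)</b>
<br><b>More bar text (4)</b></p>'

# Parse into an internal document so XPath queries can be run against it
doc <- htmlTreeParse(html, asText = TRUE, useInternalNodes = TRUE)

# Visit each paragraph and query relative to it, which keeps the
# monthyear and its entries together
paras <- getNodeSet(doc, "//p")

rows <- lapply(paras, function(p) {
  monthyear <- xpathSApply(p, "./span[@class='monthyear']", xmlValue)
  items <- xpathSApply(p, "./b", xmlValue)
  data.frame(
    monthyear = monthyear,
    info = sub(" *\\([0-9]+\\)$", "", items),                # drop trailing "(n)"
    n = as.integer(sub("^.*\\(([0-9]+)\\)$", "\\1", items)), # keep only n
    stringsAsFactors = FALSE
  )
})

result <- do.call(rbind, rows)
print(result)
```

The regular expressions are still there, but only to split "Foo text (2)" into its label and count; finding the elements is left to the parser.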