2
votes

I would like to extract sentences (not just HTML-stripped text) from web pages. Is this supported by popular HTML parsing libraries such as Jsoup?

Thanks,

Edit:

Sorry if the post was not clear. I need natural-language sentences, which are not necessarily separated by a period.

Thanks, everyone. I just found this library, http://alias-i.com/lingpipe/demos/tutorial/sentences/read-me.html, and it seems to be exactly what I want.

2
Be more precise. Give an example of the HTML content and tell us what you want to extract. - sp00m
Are you talking about natural language processing, or is a sentence any list of words separated by a dot? - PeterMmm
Take a look at diffbot.com; they do it in the cloud. - yegor256

2 Answers

1
votes

Jsoup provides a very convenient API for extracting and manipulating data, so in short: yes, it does provide this functionality.

-1
votes

You can use jQuery for that:

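// Collect the text of every <p> element (requires jQuery on the page).
var t = $('p').text();
// Naive split on periods: ignores '?' and '!' and breaks on abbreviations and decimals.
var sentences = t.split('.').map(function (s) { return s.trim(); }).filter(Boolean);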
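
Splitting on '.' alone will not satisfy the "not necessarily separated by a dot" requirement from the question. As a rough sketch of one alternative, the built-in Intl.Segmenter API (available in Node.js 16+ and recent browsers) can split text using Unicode sentence-boundary rules. The helper name splitSentences and the sample string are made up for illustration, and the input is assumed to be plain text that has already been extracted from the page. It is still rule-based and can misjudge abbreviations, so a trained sentence model such as the LingPipe one linked in the question may work better.

```javascript
// Sketch only: splitSentences is a hypothetical helper, not part of jQuery or Jsoup.
// Intl.Segmenter is a standard ECMAScript Intl API; no external library is needed.
function splitSentences(text, locale = 'en') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'sentence' });
  const sentences = [];
  // Each segment includes its trailing whitespace, so trim it before storing.
  for (const { segment } of segmenter.segment(text)) {
    const trimmed = segment.trim();
    if (trimmed.length > 0) {
      sentences.push(trimmed);
    }
  }
  return sentences;
}

// Assumed sample input; on a real page the text could come from $('p').text().
const sample = 'Is this a sentence? Yes, it is! And here is a third one.';
console.log(splitSentences(sample));
```

Note that $('p').text() concatenates paragraphs without a separator, so it may be safer to run the helper on each paragraph's text separately to keep sentences from adjacent paragraphs from running together.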