I am trying to build a recommender system which would recommend webpages to the user based on his actions(google search, clicks, he can also explicitly rate webpages). To get an idea the way google news does it, it displays news articles from the web on a particular topic. In technical terms that is clustering, but my aim is similar. It will be content based recommendation based on user's action.
So my questions are:
- How can I possibly trawl the internet to find related web-pages?
- And what algorithm should I use to extract data from web-page is textual analysis and word frequency the only way to do it?
- Lastly what platform is best suited for this problem. I have heard of Apache mahout and it comes with some re-usable algos, does it sound like a good fit?