Interesting NLP/machine-learning style project -- analyzing privacy policies

Question

I wanted some input on an interesting problem I've been assigned. The task is to analyze hundreds, and eventually thousands, of privacy policies and identify their core characteristics. For example: do they collect the user's location? Do they share or sell data to third parties?

I've talked to a few people, read a lot about privacy policies, and thought about this myself. Here is my current plan of attack:

First, read a lot of privacy policies and find the major "cues" or indicators that a certain characteristic is met. For example, if hundreds of privacy policies have the same line, "We will take your location.", that line could be a cue with 100% confidence that the policy includes taking the user's location. Other cues would give much smaller degrees of confidence about a certain characteristic. For example, the presence of the word "location" might increase the likelihood that the user's location is stored by 25%.

The idea would be to keep developing these cues, and their appropriate confidence intervals to the point where I could categorize all privacy policies with a high degree of confidence. An analogy here could be made to email-spam catching systems that use Bayesian filters to identify which mail is likely commercial and unsolicited.
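A minimal sketch of the cue idea described above. The cue patterns, their confidence values, and the noisy-OR rule used to combine them are all illustrative assumptions; the plan itself does not fix any of them.

```python
import re

# Hypothetical cues for one characteristic ("collects location").
# Patterns and confidence values are made up for illustration, not measured.
LOCATION_CUES = [
    (re.compile(r"\bwe will take your location\b", re.I), 1.00),
    (re.compile(r"\bgps\b", re.I), 0.40),
    (re.compile(r"\blocation\b", re.I), 0.25),
]

def cue_score(text, cues):
    """Combine the confidences of all matching cues with a noisy-OR."""
    p_none = 1.0  # probability that no matching cue "fires"
    for pattern, confidence in cues:
        if pattern.search(text):
            p_none *= 1.0 - confidence
    return 1.0 - p_none

print(cue_score("We may use your location and GPS data.", LOCATION_CUES))
```

Hand-assigned confidences like these are hard to keep consistent as the cue list grows, which is one reason to estimate them from labeled policies instead.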

I wanted to ask whether you guys think this is a good approach to this problem. How exactly would you approach a problem like this? Furthermore, are there any specific tools or frameworks you'd recommend using? Any input is welcome. This is my first time doing a project which touches on artificial intelligence, specifically machine learning and NLP.

The problem isn't really document classification. You would like to split each document into chunks, then label/categorize/summarize each chunk. A naive approach could treat each paragraph or grammatical sentence as a chunk, but it might be too crude. — tripleee
Only some paragraphs are actually salient to a typical user's privacy, though. I'm interested in the "hot-button" issues, like grabbing of location, selling to 3rd parties, etc. The standard boilerplate is irrelevant. — bgcode
One of the points I tried to make is that it would be a rather grave error for a system like this to fail to distinguish between "I know what this is, and I can ignore it" and "I don't know what this is". So I think in fact you do need to identify what you call "standard boilerplate". If indeed it is standard and boilerplate, it should be easy, compared to the main task. — tripleee

Fred Foo · Accepted Answer · 2012-03-14T21:18:41

The idea would be to keep developing these cues, and their appropriate confidence intervals to the point where I could categorize all privacy policies with a high degree of confidence. An analogy here could be made to email-spam catching systems that use Bayesian filters to identify which mail is likely commercial and unsolicited.

This is text classification. Given that you have multiple output categories per document, it's actually multilabel classification. The standard approach is to manually label a set of documents with the classes/labels that you want to predict, then train a classifier on features of the documents, typically word or n-gram occurrences or counts, possibly weighted by tf-idf.
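To make the feature step concrete, here is a small standard-library sketch of word n-gram counts weighted by tf-idf. The function names `ngrams` and `tfidf` are hypothetical, and the weighting used (raw count times log(N / df)) is just one common variant; libraries differ in smoothing and normalization.

```python
import math
import re
from collections import Counter

def ngrams(text, n_max=2):
    """Lowercased word n-grams for n = 1..n_max."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [" ".join(words[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(words) - n + 1)]

def tfidf(docs, n_max=2):
    """Return one {ngram: weight} dict per document.

    Weight = raw term count * log(N / document frequency).
    """
    counts = [Counter(ngrams(d, n_max)) for d in docs]
    # Document frequency: in how many documents each n-gram occurs.
    df = Counter(term for c in counts for term in c)
    n_docs = len(docs)
    return [{t: tf * math.log(n_docs / df[t]) for t, tf in c.items()}
            for c in counts]

# Toy documents, invented for illustration.
docs = ["We collect your location.", "We share data with third parties."]
vecs = tfidf(docs)
```

With this weighting, an n-gram that appears in every document (such as "we" here) gets weight zero, while n-grams specific to one policy are weighted up.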

Popular learning algorithms for document classification include naive Bayes and linear SVMs, though other classifier learners may work too. Any binary classifier can be extended to a multilabel one by the one-vs.-rest (OvR) construction, which trains one classifier per label.
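As a rough illustration of the OvR construction, the sketch below trains one small multinomial naive Bayes model per label using only the standard library. The class and function names, the label names, and the four training sentences are all invented for the example; in practice you would use an existing machine-learning library and far more labeled data.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9']+", text.lower())

class BinaryNB:
    """Multinomial naive Bayes for one yes/no label, with add-one smoothing."""

    def fit(self, docs, targets):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = Counter(targets)
        for doc, target in zip(docs, targets):
            self.word_counts[target].update(tokenize(doc))
        self.vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        return self

    def log_score(self, doc, target):
        counts = self.word_counts[target]
        total = sum(counts.values()) + len(self.vocab)
        # Smooth the class prior too, so log(0) cannot occur.
        n_docs = sum(self.doc_counts.values())
        score = math.log((self.doc_counts[target] + 1) / (n_docs + 2))
        for word in tokenize(doc):
            if word in self.vocab:  # ignore words never seen in training
                score += math.log((counts[word] + 1) / total)
        return score

    def predict(self, doc):
        return self.log_score(doc, True) > self.log_score(doc, False)

def fit_one_vs_rest(docs, label_sets):
    """Train one independent binary classifier per label."""
    labels = sorted(set().union(*label_sets))
    return {label: BinaryNB().fit(docs, [label in s for s in label_sets])
            for label in labels}

def predict_labels(models, doc):
    """Return every label whose binary classifier says yes."""
    return {label for label, model in models.items() if model.predict(doc)}

# Toy training data, invented for illustration.
train_docs = [
    "We collect your precise location from your device.",
    "We share your data with third parties for advertising.",
    "We collect your location and share it with third parties.",
    "We use cookies to remember your preferences.",
]
train_labels = [{"location"}, {"third_party"}, {"location", "third_party"}, set()]

models = fit_one_vs_rest(train_docs, train_labels)
print(predict_labels(models, "We collect your location."))
```

Because each label gets its own classifier, a policy can receive several labels at once or none at all, which is what distinguishes multilabel from ordinary multiclass classification.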

