Graph Database techniques: discovering degrees of congruence

Question

Assume we have a collection of 100,000 people who each have a list of attributes such as:

height: "between 130 and 140 cm"
eyecolor: "blue"
age_range: "16-18"
favorite_music_type: "jazz"
home_city: "NYC"
owns_a_boat: "no"
preferred_flower: "hyacinth"
bathing_frequency_per_month: 60
car_type: "minivan"
house_type: "apartment"
wears_jeans: "often"
wears_sandals: "never"
wears_boots: "sometimes"

The set of attributes may vary somewhat from person to person. The number of attributes can change, and the types of attributes may change. And of course, the values of the attributes may change. However, given one individual, we assume some of his attributes overlap with those of a number of people in our collection.

The question I have is: "What is the best way of expressing these various attributes in a graph database such that I can most quickly select a group of, say, 50 people whose attributes most closely match those of a particular individual, and order them from best match to worst match?"

Thanks Kenny,

In your example Cypher query, do I understand correctly that each feature node contains a key:value pair that identifies the attribute and its corresponding value?

Here is a somewhat more complex congruence matching problem.

Assume we have a feature set, (A, B, C, D, E, F), and 100,000 people whose preferences match this feature set to some degree. For each feature, a person may have a preference, but they may also have NO preference.

For example, Lena's preferences are (A, B, C, X, Y, Z), and Robert's preferences are (A, B, C, _, _, _), where an underbar, (_), signifies that any choice is OK.

We would like to rate Robert higher than Lena in terms of preference matching because, while he and Lena have the same number of matching preferences, Robert has fewer mismatching preferences.

Here is a more concrete example:

Let's say we have 100,000 people who are interested in cars, and we know which features of cars are important to them. We have, say, 10 cars with various features, and for each of the 10 cars we want to select a group of, say, 50 people whose desired car features best match that car.

Some people will have no preference about a subset of the car features. For example, Lester has no preference with respect to transmission: either 'automatic' or 'manual' would be just fine. Rebecca has no preference with respect to 'color', 'power_windows', and 'power_door_locks': any color would be fine, and she does not care whether the car has power windows and door locks.

So, for example, here is a car with a defined set of features:

engine: '4cylinder'
transmission: 'automatic'
color: 'dark blue'
size: 'subcompact'
age: 'less than 4 years'
power_windows: 'yes'
power_door_locks: 'yes'
average_gas_milage: 'greater than 30mpg'

And here we have two individuals, Lester and Rebecca, who have indicated the features that are important to them:

Lester:
engine: '4cylinder'
color: 'dark blue'
size: 'subcompact'
age: 'less than 4 years'
power_windows: 'yes'
power_door_locks: 'yes'
average_gas_milage: 'greater than 30mpg'

Rebecca:
engine: '4cylinder'
transmission: 'automatic'
size: 'subcompact'
age: 'less than 4 years'
average_gas_milage: 'greater than 30mpg'

So how can we best select and order a group of 50 people whose feature-preferences best match each car? In this case we want the people with the maximal matching-feature preferences ranked first, but we also want to include those people who would be happy with any value of particular attributes.
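One possible way to express this ranking in Cypher is sketched below. It is not taken from the thread: the `Car` label, the `PREFERS` relationship type, the `key` and `value` properties on `Feature` nodes, and the car name "car-1" are all assumptions made for illustration. A person with no `PREFERS` relationship for a given key is read as having no preference for that feature.

```cypher
// Assumed model (hypothetical names):
//   (:Car)-[:HAS_FEATURE]->(:Feature {key, value})
//   (:Person)-[:PREFERS]->(:Feature {key, value})
// A missing PREFERS relationship for a key means "no preference".
MATCH (car:Car {name: "car-1"})-[:HAS_FEATURE]->(cf:Feature)
MATCH (p:Person)-[:PREFERS]->(pf:Feature)
WHERE pf.key = cf.key
WITH car, p,
     // stated preference equals the car's value for that key
     sum(CASE WHEN pf.value = cf.value THEN 1 ELSE 0 END) AS matches,
     // stated preference conflicts with the car's value for that key
     sum(CASE WHEN pf.value <> cf.value THEN 1 ELSE 0 END) AS mismatches
RETURN p.name, matches, mismatches
ORDER BY matches DESC, mismatches ASC
LIMIT 50
```

If the car and the two people above were loaded under this assumed model, Lester would score 7 matches and Rebecca 5, each with 0 mismatches, so the missing preferences would not count against either of them. The sketch is not tuned for 100,000 people; joining on `key` by property comparison is the simplest form, not necessarily the fastest.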

Kenny Bastani · Accepted Answer · 2014-03-26T01:52:57

Great question. First, I recommend taking the free online training course that masterfully introduces you to the basic concepts behind Neo4j and the Cypher query language: http://www.neo4j.org/training

Your problem is a simple data modeling exercise that I'm happy to walk you through. When you model data as a graph, certain attributes or properties of a class, for instance a person, can themselves be represented as nodes.
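As a rough sketch of what that could look like, the statement below creates two people and three shared feature nodes. The `Person` label and `HAS_FEATURE` relationship come from the query in this answer; the `Feature` label, its `key` and `value` properties, and the person "Jane Roe" are assumptions made for illustration only.

```cypher
// MERGE reuses an existing node rather than creating a duplicate,
// so everyone with blue eyes points at the same Feature node.
MERGE (john:Person {name: "John Doe"})
MERGE (jane:Person {name: "Jane Roe"})          // hypothetical second person
MERGE (eyes:Feature {key: "eyecolor", value: "blue"})
MERGE (music:Feature {key: "favorite_music_type", value: "jazz"})
MERGE (city:Feature {key: "home_city", value: "NYC"})
MERGE (john)-[:HAS_FEATURE]->(eyes)
MERGE (john)-[:HAS_FEATURE]->(music)
MERGE (john)-[:HAS_FEATURE]->(city)
MERGE (jane)-[:HAS_FEATURE]->(eyes)
MERGE (jane)-[:HAS_FEATURE]->(music)
```

With only this sample data loaded, the similarity query further down in this answer should pair John Doe with Jane Roe at a feature_count of 2, since they share the eye color and music features but not the city.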

By looking at some sample data for the class person, you will of course notice some redundancy in the values of the properties. These overlaps allow us to select on that class and group results by a shared property. That's fairly easy to do in most databases. What Neo4j lets you do is take an arbitrary set of features belonging to a person and then select all similar people based on those shared features.

MATCH (john:Person {name: "John Doe"})-[:HAS_FEATURE]->(feature),
      (feature)<-[:HAS_FEATURE]-(people)
WITH john, count(DISTINCT feature) as feature_count, people
RETURN john.name, people.name, feature_count
ORDER BY feature_count DESC

This query finds a person named John Doe and all the features that belong to him. Each feature is a node that represents a value that can be attributed to a person. Each feature is unique, as a single node, and groups people together.

The query then finds all people who share a feature with john. Then, in the WITH clause, the query counts the features that john shares with each person. Finally, the query returns john's name, the name of the person he shares features with, and the number of features they share. The results are ordered by feature_count, descending.

This returns the people who share the most features with John Doe, best match first.
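Since the question asks for a group of about 50 people, one small variation is to cap the result with a LIMIT clause. This is a sketch rather than part of the original query; the `:Person` label on `people` is an added assumption that every matching node is a person.

```cypher
// Same traversal as above, restricted to the 50 closest matches.
MATCH (john:Person {name: "John Doe"})-[:HAS_FEATURE]->(feature)<-[:HAS_FEATURE]-(people:Person)
WITH john, people, count(DISTINCT feature) AS feature_count
RETURN john.name, people.name, feature_count
ORDER BY feature_count DESC
LIMIT 50
```

Because LIMIT is applied after ORDER BY, the 50 rows kept are the ones with the highest feature_count.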
