euclidean docs

mneedham · mneedham · commit bf62e58362d9 · 2018-09-18T12:22:30.000+01:00
diff --git a/doc/asciidoc/algorithms-similarity.adoc b/doc/asciidoc/algorithms-similarity.adoc
@@ -12,6 +12,8 @@ These algorithms help calculate the similarity of nodes:
 
 * <<algorithms-similarity-jaccard, Jaccard Similarity>> (`algo.similarity.jaccard`)
 * <<algorithms-similarity-cosine, Cosine Similarity>> (`algo.similarity.cosine`)
+* <<algorithms-similarity-euclidean, Euclidean Distance>> (`algo.similarity.euclidean`)
 
 include::similarity-jaccard.adoc[leveloffset=2]
 include::similarity-cosine.adoc[leveloffset=2]
+include::similarity-euclidean.adoc[leveloffset=2]
diff --git a/doc/asciidoc/scripts/similarity-euclidean.cypher b/doc/asciidoc/scripts/similarity-euclidean.cypher
@@ -0,0 +1,85 @@
+// tag::function[]
+RETURN algo.similarity.euclideanDistance([3,8,7,5,2,9], [10,8,6,6,4,5]) AS similarity
+// end::function[]
+
+// tag::create-sample-graph[]
+
+MERGE (french:Cuisine {name:'French'})
+MERGE (italian:Cuisine {name:'Italian'})
+MERGE (indian:Cuisine {name:'Indian'})
+MERGE (lebanese:Cuisine {name:'Lebanese'})
+MERGE (portuguese:Cuisine {name:'Portuguese'})
+
+MERGE (zhen:Person {name: "Zhen"})
+MERGE (praveena:Person {name: "Praveena"})
+MERGE (michael:Person {name: "Michael"})
+MERGE (arya:Person {name: "Arya"})
+MERGE (karin:Person {name: "Karin"})
+
+MERGE (praveena)-[:LIKES {score: 9}]->(indian)
+MERGE (praveena)-[:LIKES {score: 7}]->(portuguese)
+
+MERGE (zhen)-[:LIKES {score: 10}]->(french)
+MERGE (zhen)-[:LIKES {score: 6}]->(indian)
+
+MERGE (michael)-[:LIKES {score: 8}]->(french)
+MERGE (michael)-[:LIKES {score: 7}]->(italian)
+MERGE (michael)-[:LIKES {score: 9}]->(indian)
+
+MERGE (arya)-[:LIKES {score: 10}]->(lebanese)
+MERGE (arya)-[:LIKES {score: 10}]->(italian)
+MERGE (arya)-[:LIKES {score: 7}]->(portuguese)
+
+MERGE (karin)-[:LIKES {score: 9}]->(lebanese)
+MERGE (karin)-[:LIKES {score: 7}]->(italian)
+
+// end::create-sample-graph[]
+
+// tag::stream[]
+MATCH (p:Person), (c:Cuisine)
+OPTIONAL MATCH (p)-[likes:LIKES]->(c)
+WITH {item:id(p), weights: collect(coalesce(likes.score, 0))} as userData
+WITH collect(userData) as data
+CALL algo.similarity.euclidean.stream(data)
+YIELD item1, item2, count1, count2, similarity
+RETURN algo.getNodeById(item1).name AS from, algo.getNodeById(item2).name AS to, similarity
+ORDER BY similarity
+// end::stream[]
+
+// tag::stream-similarity-cutoff[]
+MATCH (p:Person), (c:Cuisine)
+OPTIONAL MATCH (p)-[likes:LIKES]->(c)
+WITH {item:id(p), weights: collect(coalesce(likes.score, 0))} as userData
+WITH collect(userData) as data
+CALL algo.similarity.euclidean.stream(data, {similarityCutoff: 17.0})
+YIELD item1, item2, count1, count2, similarity
+RETURN algo.getNodeById(item1).name AS from, algo.getNodeById(item2).name AS to, similarity
+ORDER BY similarity
+// end::stream-similarity-cutoff[]
+
+// tag::stream-topk[]
+MATCH (p:Person), (c:Cuisine)
+OPTIONAL MATCH (p)-[likes:LIKES]->(c)
+WITH {item:id(p), weights: collect(coalesce(likes.score, 0))} as userData
+WITH collect(userData) as data
+CALL algo.similarity.euclidean.stream(data, {topK:1})
+YIELD item1, item2, count1, count2, similarity
+RETURN algo.getNodeById(item1).name AS from, algo.getNodeById(item2).name AS to, similarity
+ORDER BY from
+// end::stream-topk[]
+
+// tag::write-back[]
+MATCH (p:Person), (c:Cuisine)
+OPTIONAL MATCH (p)-[likes:LIKES]->(c)
+WITH {item:id(p), weights: collect(coalesce(likes.score, 0))} as userData
+WITH collect(userData) as data
+CALL algo.similarity.euclidean(data, {topK: 1, write:true})
+YIELD nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, stdDev, p25, p50, p75, p90, p95, p99, p999, p100
+RETURN nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, p95
+// end::write-back[]
+
+// tag::query[]
+MATCH (p:Person {name: "Praveena"})-[:SIMILAR]->(other),
+      (other)-[:LIKES]->(cuisine)
+RETURN cuisine.name AS cuisine
+// end::query[]
diff --git a/doc/asciidoc/similarity-euclidean.adoc b/doc/asciidoc/similarity-euclidean.adoc
@@ -0,0 +1,247 @@
+[[algorithms-similarity-euclidean]]
+= The Euclidean Distance algorithm
+
+[abstract]
+--
+This section describes the Euclidean Distance algorithm in the Neo4j Graph Algorithms library.
+--
+
+// tag::introduction[]
+Euclidean Distance measures the straight line distance between two points in n-dimensional space.
+// end::introduction[]
+
+
+[[algorithms-similarity-euclidean-context]]
+== History and explanation
+
+// tag::explanation[]
+
+Euclidean Distance is computed using the following formula:
+
+[subs = none]
+\( similarity(p_1, p_2) = \sqrt{\sum_{i~\in~\textrm{item}} (s_{p_1} - s_{p_2})^2} \)
+
+The library contains both procedures and functions to calculate similarity between sets of data.
+The function is best used when calculating the similarity between small numbers of sets.
+The procedures parallelize the computation and are therefore a better bet when computing similarities on bigger datasets.
+
+Euclidean similarity is only calculated over non-NULL dimensions.
+When calling the function we should provide lists that contain the overlapping items.
+The procedures expect to receive the same length lists for all items so we need to pad those lists with 0s where necessary.
+
+// end::explanation[]
+
+[[algorithms-similarity-euclidean-usecase]]
+== Use-cases - when to use the Euclidean Distance algorithm
+
+// tag::use-case[]
+We can use the Euclidean Similarity algorithm to work out the similarity between two things.
+We might then use the computed similarity as part of a recommendation query.
+e.g. recommend some movies to me based on the preferences of users who have rated other movies that I've seen in a similar way
+// end::use-case[]
+
+
+[[algorithms-similarity-euclidean-sample]]
+== Euclidean algorithm sample
+
+.The following will return the Euclidean similarity of two lists of numbers
+[source, cypher]
+----
+include::scripts/similarity-euclidean.cypher[tag=function]
+----
+
+// tag::function[]
+.Results
+[opts="header",cols="1"]
+|===
+| similarity
+| 8.426149773176359
+|===
+// end::function[]
+
+// tag::function-explanation[]
+These two lists of numbers have a Euclidean Distance of 8.42.
+
+// end::function-explanation[]
+
+.The following will create a sample graph:
+[source, cypher]
+----
+include::scripts/similarity-euclidean.cypher[tag=create-sample-graph]
+----
+
+.The following will return a stream of node pairs along with their intersection and Euclidean similarities
+[source, cypher]
+----
+include::scripts/similarity-euclidean.cypher[tag=stream]
+----
+
+// tag::stream[]
+.Results
+[opts="header"]
+|===
+| from     | to       |similarity
+| Arya     | Karin    | 7.681145747868608  
+| Zhen     | Michael  | 7.874007874011811  
+| Zhen     | Praveena | 12.569805089976535 
+| Praveena | Michael  | 12.727922061357855 
+| Michael  | Karin    | 15.033296378372908 
+| Praveena | Karin    | 16.1245154965971   
+| Zhen     | Karin    | 16.30950643030009  
+| Praveena | Arya     | 16.76305461424021  
+| Michael  | Arya     | 17.406895185529212 
+| Zhen     | Arya     | 19.621416870348583 
+
+|===
+// end::stream[]
+
+Arya and Karin have the most similar food preferences, with a Euclidean Distance of 7.68.
+Lower scores are better here - a score of 0 would indicate that users have exactly the same preferences
+
+We might decide that we don't want to see users with a similarity above 17 returned in our results.
+We can filter those out by passing in the `similarityCutoff` parameter.
+
+.The following will return a stream of node pairs that have a similarity of at most 17 along with their Euclidean Distance
+[source, cypher]
+----
+include::scripts/similarity-euclidean.cypher[tag=stream-similarity-cutoff]
+----
+
+// tag::stream-similarity-cutoff[]
+.Results
+[opts="header"]
+|===
+| from     | to       |similarity
+| Arya     | Karin    | 7.681145747868608
+| Zhen     | Michael  | 7.874007874011811
+| Zhen     | Praveena | 12.569805089976535
+| Praveena | Michael  | 12.727922061357855
+| Michael  | Karin    | 15.033296378372908
+| Praveena | Karin    | 16.1245154965971
+| Zhen     | Karin    | 16.30950643030009
+| Praveena | Arya     | 16.76305461424021
+|===
+// end::stream-similarity-cutoff[]
+
+We can see that those users with a high score have been filtered out.
+If we're implementing a k-Nearest Neighbors type query we might instead want to find the most similar `k` users for a given user.
+We can do that by passing in the `topK` parameter
+
+.The following will return a stream of users along with the most similar user to them i.e. k=1
+[source, cypher]
+----
+include::scripts/similarity-euclidean.cypher[tag=stream-topk]
+----
+
+// tag::stream-topk[]
+.Results
+[opts="header",cols="1,1,1"]
+|===
+| from     | to       | similarity
+| Arya     | Karin   | 7.681145747868608  
+| Karin    | Arya    | 7.681145747868608  
+| Michael  | Zhen    | 7.874007874011811  
+| Praveena | Zhen    | 12.569805089976535 
+| Zhen     | Michael | 7.874007874011811  
+
+|===
+// end::stream-topk[]
+
+These results will not be symmetrical.
+For example, the person most similar to Praveena is Zhen, but the person most similar to Zhen is Michael.
+
+.Parameters
+[opts="header",cols="1,1,1,1,4"]
+|===
+| Name             | Type   | Default        | Optional | Description
+| data             | list   | null           | no       | A list of maps of the following structure: `{item: nodeId, weights: [weight, weight, weight]}`
+| top              | int    | 0              | yes      | The number of similar pairs to return. If `0` it will return as many as it finds.
+| topK             | int    | 0              | yes      | The number of similar values to return per node. If `0` will return as many as it finds.
+| similarityCutoff | int    | -1             | yes      | The threshold for Euclidean distance. Values above this will not be returned.
+| degreeCutoff     | int    | 0              | yes      | The threshold for the number of items in the `targets` list. If the list contains less than this amount that node will be excluded from the calculation.
+| concurrency      | int    | available CPUs | yes      | The number of concurrent threads
+|===
+
+.Results
+[opts="header",cols="1,1,6"]
+|===
+| Name         | Type | Description
+| item1      | int  | The ID of one node in the similarity pair
+| item2      | int  | The ID of other node in the similarity pair
+| count1       | int  | The size of the `targets` list of one node
+| count2       | int  | The size of the `targets` list of other node
+| intersection | int  | The number of intersecting values in the two nodes `targets` lists
+| similarity   | int  | The Euclidean distance between the two nodes
+|===
+
+.The following will find the most similar user for each user and store a relationship between those users.
+[source, cypher]
+----
+include::scripts/similarity-euclidean.cypher[tag=write-back]
+----
+
+// tag::write-back[]
+.Results
+[opts="header"]
+|===
+| nodes | similarityPairs | write | writeRelationshipType | writeProperty | min  | max  | mean | p95
+| 5      | 5                |true   | SIMILAR              | score       | 7.681121826171875 | 12.569793701171875 | 8.736004638671876 | 12.569793701171875
+|===
+// end::write-back[]
+
+We might then write a query to find out what types of cuisine people similar to us like:
+
+.The following will find the most similar user to `Praveena` and return their favourite cuisine
+[source, cypher]
+----
+include::scripts/similarity-euclidean.cypher[tag=query]
+----
+
+// tag::query[]
+.Results
+[opts="header",cols="1"]
+|===
+| cuisine
+| Indian
+| French
+|===
+// end::query[]
+
+.Parameters
+[opts="header",cols="1,1,1,1,4"]
+|===
+| Name                   | Type    | Default        | Optional | Description
+| data                   | list    | null           | no       | A list of maps of the following structure: `{item: nodeId, weights: [weight, weight, weight]}`
+| top                    | int     | 0              | yes      | The number of similar pairs to return. If `0` it will return as many as it finds.
+| topK                   | int     | 0              | yes      | The number of similar values to return per node. If `0` will return as many as it finds.
+| similarityCutoff       | int     | -1             | yes      | The threshold for Euclidean distance. Values above this will not be returned.
+| degreeCutoff           | int     | 0              | yes      | The threshold for the number of items in the `targets` list. If the list contains less than this amount that node will be excluded from the calculation.
+| concurrency            | int     | available CPUs | yes      | The number of concurrent threads
+| write                  | boolean | false          | yes      | Indicates whether results should be stored.
+| writeRelationshipType  | string  | SIMILAR        | yes      | The relationship type to use when storing results.
+| writeProperty          | string  | score          | yes      | The property to use when storing results.
+|===
+
+.Results
+[opts="header",cols="1,1,6"]
+|===
+| Name                  | Type    | Description
+| nodes                 | int     | The number of nodes passed in
+| similarityPairs       | int     | The number of pairs of similar nodes computed
+| write                 | boolean | Indicates whether results were stored
+| writeRelationshipType | string  | The relationship type used when storing results.
+| writeProperty         | string  | The property used when storing results.
+| min                   | double  | The minimum similarity score computed
+| max                   | double  | The maximum similarity score computed
+| mean                  | double  | The mean of similarities scores computed
+| stdDev                | double  | The standard deviation of similarities scores computed
+| p25                   | double  | The 25 percentile of similarities scores computed
+| p50                   | double  | The 50 percentile of similarities scores computed
+| p75                   | double  | The 75 percentile of similarities scores computed
+| p90                   | double  | The 90 percentile of similarities scores computed
+| p95                   | double  | The 95 percentile of similarities scores computed
+| p99                   | double  | The 99 percentile of similarities scores computed
+| p999                  | double  | The 99.9 percentile of similarities scores computed
+| p100                  | double  | The 25 percentile of similarities scores computed
+
+|===