This repository was archived by the owner on Apr 22, 2020. It is now read-only.

Commit bf62e58

committed
euclidean docs
1 parent 7ef809b commit bf62e58

File tree

3 files changed

+334
-0
lines changed

doc/asciidoc/algorithms-similarity.adoc

Lines changed: 2 additions & 0 deletions
@@ -12,6 +12,8 @@ These algorithms help calculate the similarity of nodes:
12 12

13 13
* <<algorithms-similarity-jaccard, Jaccard Similarity>> (`algo.similarity.jaccard`)
14 14
* <<algorithms-similarity-cosine, Cosine Similarity>> (`algo.similarity.cosine`)
15+
* <<algorithms-similarity-euclidean, Euclidean Distance>> (`algo.similarity.euclidean`)
15 16

16 17
include::similarity-jaccard.adoc[leveloffset=2]
17 18
include::similarity-cosine.adoc[leveloffset=2]
19+
include::similarity-euclidean.adoc[leveloffset=2]
Lines changed: 85 additions & 0 deletions
@@ -0,0 +1,85 @@
1+
// tag::function[]
2+
RETURN algo.similarity.euclideanDistance([3,8,7,5,2,9], [10,8,6,6,4,5]) AS similarity
3+
// end::function[]
4+
5+
// tag::create-sample-graph[]
6+
7+
MERGE (french:Cuisine {name:'French'})
8+
MERGE (italian:Cuisine {name:'Italian'})
9+
MERGE (indian:Cuisine {name:'Indian'})
10+
MERGE (lebanese:Cuisine {name:'Lebanese'})
11+
MERGE (portuguese:Cuisine {name:'Portuguese'})
12+
13+
MERGE (zhen:Person {name: "Zhen"})
14+
MERGE (praveena:Person {name: "Praveena"})
15+
MERGE (michael:Person {name: "Michael"})
16+
MERGE (arya:Person {name: "Arya"})
17+
MERGE (karin:Person {name: "Karin"})
18+
19+
MERGE (praveena)-[:LIKES {score: 9}]->(indian)
20+
MERGE (praveena)-[:LIKES {score: 7}]->(portuguese)
21+
22+
MERGE (zhen)-[:LIKES {score: 10}]->(french)
23+
MERGE (zhen)-[:LIKES {score: 6}]->(indian)
24+
25+
MERGE (michael)-[:LIKES {score: 8}]->(french)
26+
MERGE (michael)-[:LIKES {score: 7}]->(italian)
27+
MERGE (michael)-[:LIKES {score: 9}]->(indian)
28+
29+
MERGE (arya)-[:LIKES {score: 10}]->(lebanese)
30+
MERGE (arya)-[:LIKES {score: 10}]->(italian)
31+
MERGE (arya)-[:LIKES {score: 7}]->(portuguese)
32+
33+
MERGE (karin)-[:LIKES {score: 9}]->(lebanese)
34+
MERGE (karin)-[:LIKES {score: 7}]->(italian)
35+
36+
// end::create-sample-graph[]
37+
38+
// tag::stream[]
39+
MATCH (p:Person), (c:Cuisine)
40+
OPTIONAL MATCH (p)-[likes:LIKES]->(c)
41+
WITH {item:id(p), weights: collect(coalesce(likes.score, 0))} as userData
42+
WITH collect(userData) as data
43+
CALL algo.similarity.euclidean.stream(data)
44+
YIELD item1, item2, count1, count2, similarity
45+
RETURN algo.getNodeById(item1).name AS from, algo.getNodeById(item2).name AS to, similarity
46+
ORDER BY similarity
47+
// end::stream[]
48+
49+
// tag::stream-similarity-cutoff[]
50+
MATCH (p:Person), (c:Cuisine)
51+
OPTIONAL MATCH (p)-[likes:LIKES]->(c)
52+
WITH {item:id(p), weights: collect(coalesce(likes.score, 0))} as userData
53+
WITH collect(userData) as data
54+
CALL algo.similarity.euclidean.stream(data, {similarityCutoff: 17.0})
55+
YIELD item1, item2, count1, count2, similarity
56+
RETURN algo.getNodeById(item1).name AS from, algo.getNodeById(item2).name AS to, similarity
57+
ORDER BY similarity
58+
// end::stream-similarity-cutoff[]
59+
60+
// tag::stream-topk[]
61+
MATCH (p:Person), (c:Cuisine)
62+
OPTIONAL MATCH (p)-[likes:LIKES]->(c)
63+
WITH {item:id(p), weights: collect(coalesce(likes.score, 0))} as userData
64+
WITH collect(userData) as data
65+
CALL algo.similarity.euclidean.stream(data, {topK:1})
66+
YIELD item1, item2, count1, count2, similarity
67+
RETURN algo.getNodeById(item1).name AS from, algo.getNodeById(item2).name AS to, similarity
68+
ORDER BY from
69+
// end::stream-topk[]
70+
71+
// tag::write-back[]
72+
MATCH (p:Person), (c:Cuisine)
73+
OPTIONAL MATCH (p)-[likes:LIKES]->(c)
74+
WITH {item:id(p), weights: collect(coalesce(likes.score, 0))} as userData
75+
WITH collect(userData) as data
76+
CALL algo.similarity.euclidean(data, {topK: 1, write:true})
77+
YIELD nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, stdDev, p25, p50, p75, p90, p95, p99, p999, p100
78+
RETURN nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, p95
79+
// end::write-back[]
80+
81+
// tag::query[]
82+
MATCH (p:Person {name: "Praveena"})-[:SIMILAR]->(other),
83+
(other)-[:LIKES]->(cuisine)
84+
RETURN cuisine.name AS cuisine
85+
// end::query[]
Lines changed: 247 additions & 0 deletions
@@ -0,0 +1,247 @@
1+
[[algorithms-similarity-euclidean]]
2+
= The Euclidean Distance algorithm
3+
4+
[abstract]
5+
--
6+
This section describes the Euclidean Distance algorithm in the Neo4j Graph Algorithms library.
7+
--
8+
9+
// tag::introduction[]
10+
Euclidean Distance measures the straight-line distance between two points in n-dimensional space.
11+
// end::introduction[]
12+
13+
14+
[[algorithms-similarity-euclidean-context]]
15+
== History and explanation
16+
17+
// tag::explanation[]
18+
19+
Euclidean Distance is computed using the following formula:
20+
21+
[subs = none]
22+
\( similarity(p_1, p_2) = \sqrt{\sum_{i~\in~\textrm{item}} (s_{p_1,i} - s_{p_2,i})^2} \)
23+
24+
The library contains both procedures and functions to calculate similarity between sets of data.
25+
The function is best used when calculating the similarity between small numbers of sets.
26+
The procedures parallelize the computation and are therefore a better bet when computing similarities on bigger datasets.
27+
28+
Euclidean similarity is only calculated over non-NULL dimensions.
29+
When calling the function we should provide lists that contain the overlapping items.
30+
The procedures expect to receive lists of the same length for all items, so we need to pad those lists with 0s where necessary.
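
The padding step can be illustrated outside the database. The following is a minimal Python sketch, not part of the library: the names `CUISINES`, `to_weights` and `euclidean_distance` are hypothetical, and the ratings mirror two people from the sample graph used later in this section.

```python
from math import sqrt

# Hypothetical data: every dimension we compare on, in a fixed order.
CUISINES = ["French", "Italian", "Indian", "Lebanese", "Portuguese"]

# Ratings for two people; cuisines they have not rated are simply absent.
arya = {"Lebanese": 10, "Italian": 10, "Portuguese": 7}
karin = {"Lebanese": 9, "Italian": 7}


def to_weights(ratings, dimensions):
    """Return one weight per dimension, using 0 where no rating exists."""
    return [ratings.get(d, 0) for d in dimensions]


def euclidean_distance(xs, ys):
    """Straight-line distance between two equal-length lists of numbers."""
    if len(xs) != len(ys):
        raise ValueError("both lists must have the same length")
    return sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)))


arya_weights = to_weights(arya, CUISINES)    # [0, 10, 0, 10, 7]
karin_weights = to_weights(karin, CUISINES)  # [0, 7, 0, 9, 0]
distance = euclidean_distance(arya_weights, karin_weights)
print(round(distance, 2))  # about 7.68
```

Because both lists are built from the same ordered set of dimensions, position `i` always refers to the same cuisine in each list.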
31+
32+
// end::explanation[]
33+
34+
[[algorithms-similarity-euclidean-usecase]]
35+
== Use-cases - when to use the Euclidean Distance algorithm
36+
37+
// tag::use-case[]
38+
We can use the Euclidean Similarity algorithm to work out the similarity between two things.
39+
We might then use the computed similarity as part of a recommendation query.
40+
For example, we could recommend movies to a user based on the preferences of other users who have given similar ratings to movies that user has seen.
41+
// end::use-case[]
42+
43+
44+
[[algorithms-similarity-euclidean-sample]]
45+
== Euclidean algorithm sample
46+
47+
.The following will return the Euclidean similarity of two lists of numbers
48+
[source, cypher]
49+
----
50+
include::scripts/similarity-euclidean.cypher[tag=function]
51+
----
52+
53+
// tag::function[]
54+
.Results
55+
[opts="header",cols="1"]
56+
|===
57+
| similarity
58+
| 8.426149773176359
59+
|===
60+
// end::function[]
61+
62+
// tag::function-explanation[]
63+
These two lists of numbers have a Euclidean Distance of approximately 8.43.
64+
65+
// end::function-explanation[]
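
As a check on the result above, the calculation can be worked through by hand from the two input lists:

[subs = none]
\( \sqrt{(3-10)^2 + (8-8)^2 + (7-6)^2 + (5-6)^2 + (2-4)^2 + (9-5)^2} = \sqrt{49 + 0 + 1 + 1 + 4 + 16} = \sqrt{71} \approx 8.426 \)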
66+
67+
.The following will create a sample graph:
68+
[source, cypher]
69+
----
70+
include::scripts/similarity-euclidean.cypher[tag=create-sample-graph]
71+
----
72+
73+
.The following will return a stream of node pairs along with their Euclidean Distance
74+
[source, cypher]
75+
----
76+
include::scripts/similarity-euclidean.cypher[tag=stream]
77+
----
78+
79+
// tag::stream[]
80+
.Results
81+
[opts="header"]
82+
|===
83+
| from | to |similarity
84+
| Arya | Karin | 7.681145747868608
85+
| Zhen | Michael | 7.874007874011811
86+
| Zhen | Praveena | 12.569805089976535
87+
| Praveena | Michael | 12.727922061357855
88+
| Michael | Karin | 15.033296378372908
89+
| Praveena | Karin | 16.1245154965971
90+
| Zhen | Karin | 16.30950643030009
91+
| Praveena | Arya | 16.76305461424021
92+
| Michael | Arya | 17.406895185529212
93+
| Zhen | Arya | 19.621416870348583
94+
95+
|===
96+
// end::stream[]
97+
98+
Arya and Karin have the most similar food preferences, with a Euclidean Distance of 7.68.
99+
Lower scores are better here: a score of 0 would indicate that two users have exactly the same preferences.
100+
101+
We might decide that we don't want to see users with a similarity above 17 returned in our results.
102+
We can filter those out by passing in the `similarityCutoff` parameter.
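
The effect of such a cutoff can be sketched in plain Python. This is an illustration only, not the library's implementation: `pairs_within` is a hypothetical helper, the weight lists are the zero-padded ratings from the sample graph (in the order French, Italian, Indian, Lebanese, Portuguese), and the cutoff is assumed to keep distances less than or equal to the threshold.

```python
from itertools import combinations
from math import sqrt

# Zero-padded ratings from the sample graph, one entry per cuisine in the
# order French, Italian, Indian, Lebanese, Portuguese.
weights = {
    "Zhen": [10, 0, 6, 0, 0],
    "Praveena": [0, 0, 9, 0, 7],
    "Michael": [8, 7, 9, 0, 0],
    "Arya": [0, 10, 0, 10, 7],
    "Karin": [0, 7, 0, 9, 0],
}


def euclidean_distance(xs, ys):
    """Straight-line distance between two equal-length lists of numbers."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)))


def pairs_within(data, cutoff):
    """Unordered pairs whose distance is at most `cutoff`, closest first."""
    scored = [
        (a, b, euclidean_distance(data[a], data[b]))
        for a, b in combinations(data, 2)
    ]
    return sorted((p for p in scored if p[2] <= cutoff), key=lambda p: p[2])


close_pairs = pairs_within(weights, 17.0)
for a, b, d in close_pairs:
    print(a, b, round(d, 2))
```

With a cutoff of 17, the two pairs involving Arya that are further apart (Michael and Arya, Zhen and Arya) drop out, leaving eight pairs.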
103+
104+
.The following will return a stream of node pairs that have a similarity of at most 17 along with their Euclidean Distance
105+
[source, cypher]
106+
----
107+
include::scripts/similarity-euclidean.cypher[tag=stream-similarity-cutoff]
108+
----
109+
110+
// tag::stream-similarity-cutoff[]
111+
.Results
112+
[opts="header"]
113+
|===
114+
| from | to |similarity
115+
| Arya | Karin | 7.681145747868608
116+
| Zhen | Michael | 7.874007874011811
117+
| Zhen | Praveena | 12.569805089976535
118+
| Praveena | Michael | 12.727922061357855
119+
| Michael | Karin | 15.033296378372908
120+
| Praveena | Karin | 16.1245154965971
121+
| Zhen | Karin | 16.30950643030009
122+
| Praveena | Arya | 16.76305461424021
123+
|===
124+
// end::stream-similarity-cutoff[]
125+
126+
We can see that those users with a high score have been filtered out.
127+
If we're implementing a k-Nearest Neighbors type query we might instead want to find the most similar `k` users for a given user.
128+
We can do that by passing in the `topK` parameter.
129+
130+
.The following will return a stream of users along with the most similar user to them i.e. k=1
131+
[source, cypher]
132+
----
133+
include::scripts/similarity-euclidean.cypher[tag=stream-topk]
134+
----
135+
136+
// tag::stream-topk[]
137+
.Results
138+
[opts="header",cols="1,1,1"]
139+
|===
140+
| from | to | similarity
141+
| Arya | Karin | 7.681145747868608
142+
| Karin | Arya | 7.681145747868608
143+
| Michael | Zhen | 7.874007874011811
144+
| Praveena | Zhen | 12.569805089976535
145+
| Zhen | Michael | 7.874007874011811
146+
147+
|===
148+
// end::stream-topk[]
149+
150+
These results will not necessarily be symmetrical.
151+
For example, the person most similar to Praveena is Zhen, but the person most similar to Zhen is Michael.
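
The asymmetry is easy to reproduce with a small nearest-neighbor sketch in Python. This is an illustration under stated assumptions rather than the library's algorithm: `nearest_neighbors` is a hypothetical helper, and the weight lists are the zero-padded ratings from the sample graph.

```python
from math import sqrt

# Zero-padded ratings from the sample graph, one entry per cuisine in the
# order French, Italian, Indian, Lebanese, Portuguese.
weights = {
    "Zhen": [10, 0, 6, 0, 0],
    "Praveena": [0, 0, 9, 0, 7],
    "Michael": [8, 7, 9, 0, 0],
    "Arya": [0, 10, 0, 10, 7],
    "Karin": [0, 7, 0, 9, 0],
}


def euclidean_distance(xs, ys):
    """Straight-line distance between two equal-length lists of numbers."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)))


def nearest_neighbors(data, k):
    """For each name, the k other names with the smallest distance."""
    result = {}
    for name, vector in data.items():
        ranked = sorted(
            (euclidean_distance(vector, other_vector), other)
            for other, other_vector in data.items()
            if other != name
        )
        result[name] = [other for _, other in ranked[:k]]
    return result


top1 = nearest_neighbors(weights, 1)
print(top1["Praveena"], top1["Zhen"])  # ['Zhen'] ['Michael']
```

Each person's neighbors are ranked independently, so Zhen can be Praveena's closest match while Zhen's own closest match is Michael.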
152+
153+
.Parameters
154+
[opts="header",cols="1,1,1,1,4"]
155+
|===
156+
| Name | Type | Default | Optional | Description
157+
| data | list | null | no | A list of maps of the following structure: `{item: nodeId, weights: [weight, weight, weight]}`
158+
| top | int | 0 | yes | The number of similar pairs to return. If `0` it will return as many as it finds.
159+
| topK | int | 0 | yes | The number of similar values to return per node. If `0` will return as many as it finds.
160+
| similarityCutoff | int | -1 | yes | The threshold for Euclidean distance. Values above this will not be returned.
161+
| degreeCutoff | int | 0 | yes | The threshold for the number of items in the `weights` list. If the list contains fewer than this amount that node will be excluded from the calculation.
162+
| concurrency | int | available CPUs | yes | The number of concurrent threads
163+
|===
164+
165+
.Results
166+
[opts="header",cols="1,1,6"]
167+
|===
168+
| Name | Type | Description
169+
| item1 | int | The ID of one node in the similarity pair
170+
| item2 | int | The ID of other node in the similarity pair
171+
| count1 | int | The size of the `weights` list of one node
172+
| count2 | int | The size of the `weights` list of the other node
173+
| intersection | int | The number of intersecting values in the two nodes' `weights` lists
174+
| similarity | double | The Euclidean distance between the two nodes
175+
|===
176+
177+
.The following will find the most similar user for each user and store a relationship between those users.
178+
[source, cypher]
179+
----
180+
include::scripts/similarity-euclidean.cypher[tag=write-back]
181+
----
182+
183+
// tag::write-back[]
184+
.Results
185+
[opts="header"]
186+
|===
187+
| nodes | similarityPairs | write | writeRelationshipType | writeProperty | min | max | mean | p95
188+
| 5 | 5 |true | SIMILAR | score | 7.681121826171875 | 12.569793701171875 | 8.736004638671876 | 12.569793701171875
189+
|===
190+
// end::write-back[]
191+
192+
We might then write a query to find out what types of cuisine people similar to us like:
193+
194+
.The following will find the most similar user to `Praveena` and return the cuisines that user likes
195+
[source, cypher]
196+
----
197+
include::scripts/similarity-euclidean.cypher[tag=query]
198+
----
199+
200+
// tag::query[]
201+
.Results
202+
[opts="header",cols="1"]
203+
|===
204+
| cuisine
205+
| Indian
206+
| French
207+
|===
208+
// end::query[]
209+
210+
.Parameters
211+
[opts="header",cols="1,1,1,1,4"]
212+
|===
213+
| Name | Type | Default | Optional | Description
214+
| data | list | null | no | A list of maps of the following structure: `{item: nodeId, weights: [weight, weight, weight]}`
215+
| top | int | 0 | yes | The number of similar pairs to return. If `0` it will return as many as it finds.
216+
| topK | int | 0 | yes | The number of similar values to return per node. If `0` will return as many as it finds.
217+
| similarityCutoff | int | -1 | yes | The threshold for Euclidean distance. Values above this will not be returned.
218+
| degreeCutoff | int | 0 | yes | The threshold for the number of items in the `weights` list. If the list contains fewer than this amount that node will be excluded from the calculation.
219+
| concurrency | int | available CPUs | yes | The number of concurrent threads
220+
| write | boolean | false | yes | Indicates whether results should be stored.
221+
| writeRelationshipType | string | SIMILAR | yes | The relationship type to use when storing results.
222+
| writeProperty | string | score | yes | The property to use when storing results.
223+
|===
224+
225+
.Results
226+
[opts="header",cols="1,1,6"]
227+
|===
228+
| Name | Type | Description
229+
| nodes | int | The number of nodes passed in
230+
| similarityPairs | int | The number of pairs of similar nodes computed
231+
| write | boolean | Indicates whether results were stored
232+
| writeRelationshipType | string | The relationship type used when storing results.
233+
| writeProperty | string | The property used when storing results.
234+
| min | double | The minimum similarity score computed
235+
| max | double | The maximum similarity score computed
236+
| mean | double | The mean of the similarity scores computed
236+
| stdDev | double | The standard deviation of the similarity scores computed
237+
| p25 | double | The 25th percentile of the similarity scores computed
238+
| p50 | double | The 50th percentile of the similarity scores computed
239+
| p75 | double | The 75th percentile of the similarity scores computed
240+
| p90 | double | The 90th percentile of the similarity scores computed
241+
| p95 | double | The 95th percentile of the similarity scores computed
242+
| p99 | double | The 99th percentile of the similarity scores computed
243+
| p999 | double | The 99.9th percentile of the similarity scores computed
245+
| p100 | double | The 100th percentile (maximum) of the similarity scores computed
246+
247+
|===
