|
| 1 | +[[algorithms-similarity-euclidean]] |
| 2 | += The Euclidean Distance algorithm |
| 3 | + |
| 4 | +[abstract] |
| 5 | +-- |
| 6 | +This section describes the Euclidean Distance algorithm in the Neo4j Graph Algorithms library. |
| 7 | +-- |
| 8 | + |
| 9 | +// tag::introduction[] |
| 10 | +Euclidean Distance measures the straight-line distance between two points in n-dimensional space.
| 11 | +// end::introduction[] |
| 12 | + |
| 13 | + |
| 14 | +[[algorithms-similarity-euclidean-context]] |
| 15 | +== History and explanation |
| 16 | + |
| 17 | +// tag::explanation[] |
| 18 | + |
| 19 | +Euclidean Distance is computed using the following formula: |
| 20 | + |
| 21 | +[subs = none] |
| 22 | +\( similarity(p_1, p_2) = \sqrt{\sum_{i~\in~\textrm{item}} (s_{p_1} - s_{p_2})^2} \) |
| 23 | + |
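As a minimal sketch of the formula above, written in plain Python and independent of the library (the function name `euclidean_distance` is an assumption made for illustration), the computation might look like this:

```python
import math


def euclidean_distance(p1, p2):
    """Straight-line distance between two equal-length lists of numbers."""
    if len(p1) != len(p2):
        raise ValueError("both lists must have the same length")
    # Sum the squared per-dimension differences, then take the square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p1, p2)))


print(euclidean_distance([0, 0], [3, 4]))  # 5.0
```

Identical lists give a distance of 0, and the value grows as the lists diverge.
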
| 24 | +The library contains both procedures and functions to calculate similarity between sets of data. |
| 25 | +The function is best used when calculating the similarity between a small number of sets.
| 26 | +The procedures parallelize the computation and are therefore better suited to computing similarities on larger datasets.
| 27 | + |
| 28 | +Euclidean similarity is only calculated over non-NULL dimensions. |
| 29 | +When calling the function, we should provide lists that contain the overlapping items.
| 30 | +The procedures expect lists of the same length for all items, so we need to pad those lists with 0s where necessary.
| 31 | + |
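The two preparation steps described above could be sketched in plain Python as follows. This is an illustration only: the helper names `overlapping_lists` and `padded_lists`, the user names, and the ratings are all hypothetical and are not part of the library.

```python
def overlapping_lists(ratings_a, ratings_b):
    """Keep only the items that both users have rated, in a stable order."""
    shared = sorted(set(ratings_a) & set(ratings_b))
    return [ratings_a[i] for i in shared], [ratings_b[i] for i in shared]


def padded_lists(all_ratings, fill=0):
    """Give every user a list of the same length, padding missing items."""
    items = sorted({item for r in all_ratings.values() for item in r})
    return {
        user: [r.get(item, fill) for item in items]
        for user, r in all_ratings.items()
    }


# Hypothetical ratings keyed by user, then by item.
ratings = {"A": {"x": 5, "y": 3}, "B": {"y": 4, "z": 1}}

print(overlapping_lists(ratings["A"], ratings["B"]))  # ([3], [4])
print(padded_lists(ratings))  # {'A': [5, 3, 0], 'B': [0, 4, 1]}
```

The first helper produces input suited to the function, and the second produces the equal-length lists that the procedures expect.
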
| 32 | +// end::explanation[] |
| 33 | + |
| 34 | +[[algorithms-similarity-euclidean-usecase]] |
| 35 | +== Use-cases - when to use the Euclidean Distance algorithm |
| 36 | + |
| 37 | +// tag::use-case[] |
| 38 | +We can use the Euclidean Distance algorithm to work out the similarity between two things.
| 39 | +We might then use the computed similarity as part of a recommendation query.
| 40 | +For example, we could recommend movies to a user based on the preferences of other users who gave similar ratings to movies that the user has already seen.
| 41 | +// end::use-case[] |
| 42 | + |
| 43 | + |
| 44 | +[[algorithms-similarity-euclidean-sample]] |
| 45 | +== Euclidean algorithm sample |
| 46 | + |
| 47 | +.The following will return the Euclidean similarity of two lists of numbers |
| 48 | +[source, cypher] |
| 49 | +---- |
| 50 | +include::scripts/similarity-euclidean.cypher[tag=function] |
| 51 | +---- |
| 52 | + |
| 53 | +// tag::function[] |
| 54 | +.Results |
| 55 | +[opts="header",cols="1"] |
| 56 | +|=== |
| 57 | +| similarity |
| 58 | +| 8.426149773176359 |
| 59 | +|=== |
| 60 | +// end::function[] |
| 61 | + |
| 62 | +// tag::function-explanation[] |
| 63 | +These two lists of numbers have a Euclidean Distance of approximately 8.43.
| 64 | + |
| 65 | +// end::function-explanation[] |
| 66 | + |
| 67 | +.The following will create a sample graph: |
| 68 | +[source, cypher] |
| 69 | +---- |
| 70 | +include::scripts/similarity-euclidean.cypher[tag=create-sample-graph] |
| 71 | +---- |
| 72 | + |
| 73 | +.The following will return a stream of node pairs along with their intersection and Euclidean similarities |
| 74 | +[source, cypher] |
| 75 | +---- |
| 76 | +include::scripts/similarity-euclidean.cypher[tag=stream] |
| 77 | +---- |
| 78 | + |
| 79 | +// tag::stream[] |
| 80 | +.Results |
| 81 | +[opts="header"] |
| 82 | +|=== |
| 83 | +| from | to |similarity |
| 84 | +| Arya | Karin | 7.681145747868608 |
| 85 | +| Zhen | Michael | 7.874007874011811 |
| 86 | +| Zhen | Praveena | 12.569805089976535 |
| 87 | +| Praveena | Michael | 12.727922061357855 |
| 88 | +| Michael | Karin | 15.033296378372908 |
| 89 | +| Praveena | Karin | 16.1245154965971 |
| 90 | +| Zhen | Karin | 16.30950643030009 |
| 91 | +| Praveena | Arya | 16.76305461424021 |
| 92 | +| Michael | Arya | 17.406895185529212 |
| 93 | +| Zhen | Arya | 19.621416870348583 |
| 94 | + |
| 95 | +|=== |
| 96 | +// end::stream[] |
| 97 | + |
| 98 | +Arya and Karin have the most similar food preferences, with a Euclidean Distance of 7.68. |
| 99 | +Lower scores are better here: a score of 0 would indicate that two users have exactly the same preferences.
| 100 | + |
| 101 | +We might decide that we don't want to see users with a similarity above 17 returned in our results. |
| 102 | +We can filter those out by passing in the `similarityCutoff` parameter. |
| 103 | + |
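The effect of the cutoff can be modelled in a few lines of Python. This is a sketch of the behaviour described above, not the library's implementation; the helper name `apply_cutoff` is hypothetical, and the rows are a subset copied from the stream results above.

```python
# A few (from, to, distance) rows copied from the stream results above.
pairs = [
    ("Arya", "Karin", 7.681145747868608),
    ("Praveena", "Arya", 16.76305461424021),
    ("Michael", "Arya", 17.406895185529212),
    ("Zhen", "Arya", 19.621416870348583),
]


def apply_cutoff(pairs, cutoff):
    """Keep only the pairs whose distance does not exceed the cutoff."""
    return [p for p in pairs if p[2] <= cutoff]


kept = apply_cutoff(pairs, 17)
print([(a, b) for a, b, _ in kept])  # [('Arya', 'Karin'), ('Praveena', 'Arya')]
```
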
| 104 | +.The following will return a stream of node pairs that have a similarity of at most 17 along with their Euclidean Distance |
| 105 | +[source, cypher] |
| 106 | +---- |
| 107 | +include::scripts/similarity-euclidean.cypher[tag=stream-similarity-cutoff] |
| 108 | +---- |
| 109 | + |
| 110 | +// tag::stream-similarity-cutoff[] |
| 111 | +.Results |
| 112 | +[opts="header"] |
| 113 | +|=== |
| 114 | +| from | to |similarity |
| 115 | +| Arya | Karin | 7.681145747868608 |
| 116 | +| Zhen | Michael | 7.874007874011811 |
| 117 | +| Zhen | Praveena | 12.569805089976535 |
| 118 | +| Praveena | Michael | 12.727922061357855 |
| 119 | +| Michael | Karin | 15.033296378372908 |
| 120 | +| Praveena | Karin | 16.1245154965971 |
| 121 | +| Zhen | Karin | 16.30950643030009 |
| 122 | +| Praveena | Arya | 16.76305461424021 |
| 123 | +|=== |
| 124 | +// end::stream-similarity-cutoff[] |
| 125 | + |
| 126 | +We can see that those users with a high score have been filtered out. |
| 127 | +If we're implementing a k-Nearest Neighbors type query we might instead want to find the most similar `k` users for a given user. |
| 128 | +We can do that by passing in the `topK` parameter.
| 129 | + |
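Conceptually, a per-node top-k selection can be sketched in plain Python as shown here. The helper name `top_k` is hypothetical and the code only models the idea; the distances are copied from the stream results above.

```python
from collections import defaultdict

# (from, to, distance) rows copied from the stream results above.
pairs = [
    ("Arya", "Karin", 7.681145747868608),
    ("Zhen", "Michael", 7.874007874011811),
    ("Zhen", "Praveena", 12.569805089976535),
    ("Praveena", "Michael", 12.727922061357855),
    ("Michael", "Karin", 15.033296378372908),
    ("Praveena", "Karin", 16.1245154965971),
    ("Zhen", "Karin", 16.30950643030009),
    ("Praveena", "Arya", 16.76305461424021),
    ("Michael", "Arya", 17.406895185529212),
    ("Zhen", "Arya", 19.621416870348583),
]


def top_k(pairs, k):
    """Return, for each node, its k nearest neighbours by distance."""
    neighbours = defaultdict(list)
    for a, b, distance in pairs:
        # Each undirected pair is a candidate neighbour for both endpoints.
        neighbours[a].append((distance, b))
        neighbours[b].append((distance, a))
    return {node: sorted(c)[:k] for node, c in sorted(neighbours.items())}


for user, nearest in top_k(pairs, 1).items():
    print(user, nearest[0][1])
```

Because each node picks its own nearest neighbours, one node can appear in another's list without the reverse being true.
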
| 130 | +.The following will return a stream of users along with the most similar user to them i.e. k=1 |
| 131 | +[source, cypher] |
| 132 | +---- |
| 133 | +include::scripts/similarity-euclidean.cypher[tag=stream-topk] |
| 134 | +---- |
| 135 | + |
| 136 | +// tag::stream-topk[] |
| 137 | +.Results |
| 138 | +[opts="header",cols="1,1,1"] |
| 139 | +|=== |
| 140 | +| from | to | similarity |
| 141 | +| Arya | Karin | 7.681145747868608 |
| 142 | +| Karin | Arya | 7.681145747868608 |
| 143 | +| Michael | Zhen | 7.874007874011811 |
| 144 | +| Praveena | Zhen | 12.569805089976535 |
| 145 | +| Zhen | Michael | 7.874007874011811 |
| 146 | + |
| 147 | +|=== |
| 148 | +// end::stream-topk[] |
| 149 | + |
| 150 | +These results are not necessarily symmetrical.
| 151 | +For example, the person most similar to Praveena is Zhen, but the person most similar to Zhen is Michael. |
| 152 | + |
| 153 | +.Parameters |
| 154 | +[opts="header",cols="1,1,1,1,4"] |
| 155 | +|=== |
| 156 | +| Name | Type | Default | Optional | Description |
| 157 | +| data | list | null | no | A list of maps of the following structure: `{item: nodeId, weights: [weight, weight, weight]}` |
| 158 | +| top | int | 0 | yes | The number of similar pairs to return. If `0`, it will return as many as it finds.
| 159 | +| topK | int | 0 | yes | The number of similar values to return per node. If `0`, it will return as many as it finds.
| 160 | +| similarityCutoff | int | -1 | yes | The threshold for Euclidean distance. Values above this will not be returned.
| 161 | +| degreeCutoff | int | 0 | yes | The threshold for the number of items in the `weights` list. If the list contains fewer items than this, that node will be excluded from the calculation.
| 162 | +| concurrency | int | available CPUs | yes | The number of concurrent threads |
| 163 | +|=== |
| 164 | + |
| 165 | +.Results |
| 166 | +[opts="header",cols="1,1,6"] |
| 167 | +|=== |
| 168 | +| Name | Type | Description |
| 169 | +| item1 | int | The ID of one node in the similarity pair
| 170 | +| item2 | int | The ID of the other node in the similarity pair
| 171 | +| count1 | int | The size of the `weights` list of one node
| 172 | +| count2 | int | The size of the `weights` list of the other node
| 173 | +| intersection | int | The number of intersecting values in the two nodes' `weights` lists
| 174 | +| similarity | double | The Euclidean distance between the two nodes
| 175 | +|=== |
| 176 | + |
| 177 | +.The following will find the most similar user for each user and store a relationship between those users. |
| 178 | +[source, cypher] |
| 179 | +---- |
| 180 | +include::scripts/similarity-euclidean.cypher[tag=write-back] |
| 181 | +---- |
| 182 | + |
| 183 | +// tag::write-back[] |
| 184 | +.Results |
| 185 | +[opts="header"] |
| 186 | +|=== |
| 187 | +| nodes | similarityPairs | write | writeRelationshipType | writeProperty | min | max | mean | p95 |
| 188 | +| 5 | 5 |true | SIMILAR | score | 7.681121826171875 | 12.569793701171875 | 8.736004638671876 | 12.569793701171875 |
| 189 | +|=== |
| 190 | +// end::write-back[] |
| 191 | + |
| 192 | +We might then write a query to find out what types of cuisine people similar to us like: |
| 193 | + |
| 194 | +.The following will find the most similar user to `Praveena` and return their favourite cuisine |
| 195 | +[source, cypher] |
| 196 | +---- |
| 197 | +include::scripts/similarity-euclidean.cypher[tag=query] |
| 198 | +---- |
| 199 | + |
| 200 | +// tag::query[] |
| 201 | +.Results |
| 202 | +[opts="header",cols="1"] |
| 203 | +|=== |
| 204 | +| cuisine |
| 205 | +| Indian |
| 206 | +| French |
| 207 | +|=== |
| 208 | +// end::query[] |
| 209 | + |
| 210 | +.Parameters |
| 211 | +[opts="header",cols="1,1,1,1,4"] |
| 212 | +|=== |
| 213 | +| Name | Type | Default | Optional | Description |
| 214 | +| data | list | null | no | A list of maps of the following structure: `{item: nodeId, weights: [weight, weight, weight]}` |
| 215 | +| top | int | 0 | yes | The number of similar pairs to return. If `0`, it will return as many as it finds.
| 216 | +| topK | int | 0 | yes | The number of similar values to return per node. If `0`, it will return as many as it finds.
| 217 | +| similarityCutoff | int | -1 | yes | The threshold for Euclidean distance. Values above this will not be returned.
| 218 | +| degreeCutoff | int | 0 | yes | The threshold for the number of items in the `weights` list. If the list contains fewer items than this, that node will be excluded from the calculation.
| 219 | +| concurrency | int | available CPUs | yes | The number of concurrent threads |
| 220 | +| write | boolean | false | yes | Indicates whether results should be stored. |
| 221 | +| writeRelationshipType | string | SIMILAR | yes | The relationship type to use when storing results. |
| 222 | +| writeProperty | string | score | yes | The property to use when storing results. |
| 223 | +|=== |
| 224 | + |
| 225 | +.Results |
| 226 | +[opts="header",cols="1,1,6"] |
| 227 | +|=== |
| 228 | +| Name | Type | Description |
| 229 | +| nodes | int | The number of nodes passed in |
| 230 | +| similarityPairs | int | The number of pairs of similar nodes computed |
| 231 | +| write | boolean | Indicates whether results were stored |
| 232 | +| writeRelationshipType | string | The relationship type used when storing results. |
| 233 | +| writeProperty | string | The property used when storing results. |
| 234 | +| min | double | The minimum similarity score computed |
| 235 | +| max | double | The maximum similarity score computed |
| 236 | +| mean | double | The mean of the similarity scores computed
| 237 | +| stdDev | double | The standard deviation of the similarity scores computed
| 238 | +| p25 | double | The 25th percentile of the similarity scores computed
| 239 | +| p50 | double | The 50th percentile of the similarity scores computed
| 240 | +| p75 | double | The 75th percentile of the similarity scores computed
| 241 | +| p90 | double | The 90th percentile of the similarity scores computed
| 242 | +| p95 | double | The 95th percentile of the similarity scores computed
| 243 | +| p99 | double | The 99th percentile of the similarity scores computed
| 244 | +| p999 | double | The 99.9th percentile of the similarity scores computed
| 245 | +| p100 | double | The 100th percentile of the similarity scores computed
| 246 | + |
| 247 | +|=== |