@@ -179,7 +179,7 @@ this on `utop`.

# let pool = Task.setup_pool ~num_additional_domains:3
val pool : Task.pool = <abstr>
- ```
+ ```
We have created a new task pool with three new domains. The parent domain is
also part of this pool, thus making it a pool of four domains. After the pool is
set up, we can use this pool to execute all tasks we want to run in parallel. The
@@ -285,7 +285,7 @@ to be executed.
Parallel for also has an optional parameter `chunk_size`. It determines the
granularity of tasks when executing them on multiple domains. If no parameter
is given for `chunk_size`, a default chunk size is determined which performs
- well in most cases. Only if the default chunk size doesn't work well, it is
+ well in most cases. Only if the default chunk size doesn't work well is it
recommended to experiment with different chunk sizes. The ideal `chunk_size`
depends on a combination of factors (a short usage sketch follows below):

@@ -297,7 +297,7 @@ iterations divided by the number of cores. On the other hand, if the amount of
time taken is different for every iteration, the chunks should be smaller. If
the total number of iterations is a sizeable number, a `chunk_size` like 32 or
16 is safe to use, whereas if the number of iterations is low, say 10, a
- `chunk_size` of 1 would perform best.
+ `chunk_size` of 1 would perform best.

* **Machine:** Optimal chunk size varies across machines, and it is recommended
to experiment with a range of values to find out what works best on yours.
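
As a quick illustration (a minimal sketch, not taken from this tutorial: `pool`
is the task pool set up earlier, and the chunk size of 16 is only a starting
point to tune), an explicit `chunk_size` is passed alongside the other labelled
arguments of `Task.parallel_for`:

```ocaml
(* Sketch: fill an array in parallel with an explicit chunk_size.
   The value 16 is illustrative; tune it against the factors above. *)
open Domainslib

let parallel_init pool a f =
  Task.parallel_for pool ~chunk_size:16 ~start:0
    ~finish:(Array.length a - 1)
    ~body:(fun i -> a.(i) <- f i)
```
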
@@ -350,14 +350,14 @@ let parallel_matrix_multiply_3 pool m1 m2 m3 =
  let t = Array.make_matrix size size 0 in (* stores m1*m2 *)
  let res = Array.make_matrix size size 0 in

- Task.parallel_for pool ~chunk_size:(size/num_domains) ~start:0 ~finish:(size - 1) ~body:(fun i ->
+ Task.parallel_for pool ~start:0 ~finish:(size - 1) ~body:(fun i ->
    for j = 0 to size - 1 do
      for k = 0 to size - 1 do
        t.(i).(j) <- t.(i).(j) + m1.(i).(k) * m2.(k).(j)
      done
    done);

- Task.parallel_for pool ~chunk_size:(size/num_domains) ~start:0 ~finish:(size - 1) ~body:(fun i ->
+ Task.parallel_for pool ~start:0 ~finish:(size - 1) ~body:(fun i ->
    for j = 0 to size - 1 do
      for k = 0 to size - 1 do
        res.(i).(j) <- res.(i).(j) + t.(i).(k) * m3.(k).(j)
@@ -505,7 +505,7 @@ The above example would be essentially blocking indefinitely because the `send`
does not have a corresponding receive. If we instead create a bounded channel
with buffer size n, it can store up to n objects in the channel without a
corresponding receive, beyond which sending would block. We can try it
- with the same example as above just by changing the buffer size to 1.
+ with the same example as above just by changing the buffer size to 1.

```ocaml
open Domainslib
@@ -611,7 +611,7 @@ let _ =
  worker (update results) ();
  Array.iter Domain.join domains;
  Array.iter (Printf.printf "%d ") results
- ```
+ ```

We have created an unbounded channel `c` which will act as a store for all the
tasks. We'll pay attention to two functions here: `create_work` and `worker`.
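
Before looking at those two functions, here is a rough sketch of the shape such
a worker loop usually takes (the `msg` type and the `Work`/`Quit` constructor
names are illustrative, not the tutorial's definitions): a worker simply keeps
receiving from the shared channel until it is told to stop.

```ocaml
(* Sketch only: a worker drains work items from the shared channel and
   stops when it receives Quit. *)
open Domainslib

type msg = Work of int | Quit

let rec worker_sketch c f =
  match Chan.recv c with
  | Work n -> f n; worker_sketch c f
  | Quit -> ()
```
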
@@ -659,7 +659,7 @@ that if a lot more time is spent outside the function we'd like to parallelise,
the maximum speedup we could achieve would be lower.

Profiling serial code can help us discover the hotspots where we might want to
- introduce parallelism.
+ introduce parallelism.

```
Samples: 51K of event 'cycles:u', Event count (approx.): 28590830181
@@ -791,7 +791,7 @@ Shared Data Cache Line Table (2 entries, sorted on Total HITMs)
 ----------- Cacheline ---------- Total Tot ----- LLC Load Hitm ----- ---- Store Reference ---- --- Loa
Index Address Node PA cnt records Hitm Total Lcl Rmt Total L1Hit L1Miss Lc
 0 0x7f2bf49d7dc0 0 11473 13008 94.23% 1306 1306 0 1560 595 965 ◆
- 1 0x7f2bf49a7b80 0 271 368 5.48% 76 76 0 123 76 47
+ 1 0x7f2bf49a7b80 0 271 368 5.48% 76 76 0 123 76 47
```

As evident from the report, there's quite a lot of false sharing happening in
@@ -953,7 +953,7 @@ So far we have only found that there is an imbalance in task distribution
in the code, we'll need to change our code accordingly to make the task
distribution more balanced, which could increase the speedup.
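
How to rebalance depends on what the profile shows. One hedged option (the
names `pool` and `jobs` below are illustrative, not this chapter's benchmark
code) is to express the uneven pieces of work as many small `Task.async` tasks
and let the pool's scheduler spread them across the domains; another is to
shrink the `chunk_size` of a `parallel_for`, as discussed earlier.

```ocaml
(* Sketch: submit uneven pieces of work as many small async tasks so the
   pool's scheduler can spread them across domains, then await them all. *)
open Domainslib

let run_balanced pool jobs =
  jobs
  |> List.map (fun job -> Task.async pool job)
  |> List.iter (fun p -> ignore (Task.await pool p))
```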

- ---
+ ---

Performance debugging can be quite tricky at times. If you could use some help in
debugging your Multicore OCaml code, feel free to create an issue in the