-
Notifications
You must be signed in to change notification settings - Fork 19
Flusher step 2: CPU and NIC options #32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: devel
Are you sure you want to change the base?
Conversation
|
tests/gds_kernel_latency, brdw0/1, cuda9.0, driver 384.81, using Tesla P100: No Flusher CPU Flusher + 4 usec NIC Flusher + 8 usec |
|
hpgmg_async, brdw0/1, cuda9.0, driver 384.81, using Tesla P100, 2 processes:
|
|
@e-ago does GDS_FLUSHER_SERVICE=0 (no flusher) imply GDS_GPU_HAS_FLUSHER=1, i.e. using CUDA 9.1 (broken but still adding some overhead) internal flusher or nothing at all? |
|
@drossetti no. If GDS_GPU_HAS_FLUSHER is set to 1, then GDS_FLUSHER_SERVICE is ignored. On the contrary, GDS_FLUSHER_SERVICE=0 doesn’t imply GDS_GPU_HAS_FLUSHER=1. |
| // move flush to last wait in the whole batch | ||
| if (n_waits && no_network_descs_after_entry(n_descs, descs, last_wait)) { | ||
| gds_dbg("optimizing FLUSH to last wait i=%zu\n", last_wait); | ||
| move_flush = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
who is setting move_flush=true in the 'GPU support native flusher' case?
src/flusher.hpp
Outdated
| #define GDS_FLUSHER_PORT 1 | ||
| #define GDS_FLUSHER_QKEY 0 //0x11111111 | ||
|
|
||
| #define CUDA_CHECK(stmt) \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are you using the CUDA RT API? this is a big decision...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed, it was an oversight
src/gdsync.cpp
Outdated
| gqp->recv_cq.curr_offset = 0; | ||
|
|
||
| gds_dbg("created gds_qp=%p\n", gqp); | ||
| if(!(flags & GDS_CREATE_QP_FLUSHER)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
src/gdsync.cpp
Outdated
| } | ||
| qp_attr->send_cq = tx_cq; | ||
| gds_dbg("created send_cq=%p\n", qp_attr->send_cq); | ||
| if(!(flags & GDS_CREATE_QP_FLUSHER)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like you are using a negated logic for the flusher flag...
please refactor into a bool local var
src/gdsync.cpp
Outdated
| param->waitValue.flags |= CU_STREAM_WAIT_VALUE_FLUSH; | ||
|
|
||
| //No longer supported since CUDA 9.1 | ||
| //if (need_flush) param->waitValue.flags |= CU_STREAM_WAIT_VALUE_FLUSH; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not true, we have to query via ::CU_DEVICE_ATTRIBUTE_CAN_USE_WAIT_VALUE_FLUSH
src/flusher.hpp
Outdated
| #include "archutils.h" | ||
|
|
||
| #define GDS_FLUSHER_TYPE_CPU 1 | ||
| #define GDS_FLUSHER_TYPE_NIC 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'd rather have enum here
|
|
||
| #define GDS_FLUSHER_OP_CPU 2 | ||
| #define GDS_FLUSHER_OP_NIC 5 | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
enum here too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those constants are not related: they represent the number of ops required by NIC or CPU flusher
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
…f define. local bool variable during qp creation. if CUDA_VERSION >= 9020 then query CU_DEVICE_ATTRIBUTE_CAN_USE_WAIT_VALUE_FLUSH in case of native flusher
|
@drossetti I've pushed some changes:
|
drossetti
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the flusher is a big chunk of code.
I suggest to move to a more object oriented design and split the implementation in different .cpp files.
besides please reuse the memory allocation/registration functions already present in libgdsync
|
|
||
| #define GDS_FLUSHER_OP_CPU 2 | ||
| #define GDS_FLUSHER_OP_NIC 5 | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
| else | ||
| return false; | ||
| } | ||
| #define CHECK_FLUSHER_SERVICE() \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why a macro?
| } | ||
|
|
||
| static inline bool gds_flusher_service_active() { | ||
| if(gds_use_flusher == GDS_FLUSHER_CPU || gds_use_flusher == GDS_FLUSHER_NIC) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should not also check if flusher_thread!=NULL ? or wait for the thread to set some volatile flag signaling its livelihood ?
|
|
||
| #define ROUND_TO(V,PS) ((((V) + (PS) - 1)/(PS)) * (PS)) | ||
|
|
||
| bool gds_use_native_flusher() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this API reflect the a choice which has been made earlier, while its name implies an order to use the native flusher...
could be renamed as gds_is_native_flusher() or similar
| static gds_flusher_buf flack_d; | ||
| static int flusher_value=0; | ||
| static pthread_t flusher_thread; | ||
| static int gds_use_flusher = -1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like the current stateful C API.
There should be a way to create a singleton object, the flusher, using an object factory.
flusher should be an abstract base class. derived classes are specializations.
And functions should be methods of that class.
| } | ||
|
|
||
| static int gds_flusher_pin_buffer(gds_flusher_buf * fl_mem, size_t req_size, int type_mem) | ||
| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do you need a new memory allocation/registration function? why not using/extending those already here?
| gds_dbg("created gds_qp=%p\n", gqp); | ||
| if(!is_qp_flusher) | ||
| { | ||
| if(gds_flusher_init(pd, context, gpu_id)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gds_flusher_init() should return a flusher object which is stored in gds_qp.
you should convince the reviewer that there is value in abstracting the native flusher inside , or to simply special case in gdsync.c
|
The flusher implementation for the moment is in PR #51 |
There are 3 types of flusher: GPU native, CPU thread and NIC. It is possible to specify which one must be used by means of 2 env vars:
All the GDS_FLUSHER_SERVICE values have been tested with tests/gds_kernel_latency; here there is a report of the outputs with performance and the list of params posted in case of a wait operation.
Tested on ivy2/3 with cuda_20171220_23307802-inline-weak-membar-perf.
Note: GDR on ivy2/3 has poor performance.
In order to evaluate real performance, we should test the flusher on real-world applications using Async.
GDS_FLUSHER_SERVICE=0
GDS_FLUSHER_SERVICE=1 (CPU Thread) + 16 usec
GDS_FLUSHER_SERVICE=2 (NIC) + 20 usec