@e-ago (Collaborator) commented Jan 17, 2018

There are three types of flusher: GPU native, CPU thread, and NIC. Which one is used can be selected by means of two env vars:

  • GDS_GPU_HAS_FLUSHER: 1 enables the GPU native flusher (the service flusher is then ignored), 0 disables it. Since CUDA 9.1 it must always be 0
  • GDS_FLUSHER_SERVICE:
    • 0 : No flusher service (default)
    • 1 : CPU thread flusher service
    • 2 : NIC flusher service

All the GDS_FLUSHER_SERVICE values have been tested with tests/gds_kernel_latency; below is a report of the outputs, with performance numbers and the list of params posted in case of a wait operation.
Tested on ivy2/3 with cuda_20171220_23307802-inline-weak-membar-perf.
Note: GDR on ivy2/3 has poor performance.

In order to evaluate real performance, we should test the flusher on real-world applications using Async.

GDS_FLUSHER_SERVICE=0

[12893] GDS WARN  gds_post_ops() poll params
[12893] GDS INFO  gds_dump_params() param[0]:
[12893] GDS INFO  gds_dump_param() WAIT32 addr:0x204a0f9bc alias:0x7ffe8fb8afa0 value:00000000 flags:00000000

testing....
[1] batch 2: posted 20 sequences
pre-posting took 2301.00 usec
[0] 2048000 bytes in 0.04 seconds = 416.41 Mbit/sec
[0] 1000 iters in 0.04 seconds = 39.35 usec/iter
[1] 2048000 bytes in 0.04 seconds = 416.08 Mbit/sec
[1] 1000 iters in 0.04 seconds = 39.38 usec/iter

GDS_FLUSHER_SERVICE=1 (CPU thread): about +16 usec/iter vs no flusher

[12926] GDS WARN  gds_post_ops() poll params
[12926] GDS INFO  gds_dump_params() param[0]:
[12926] GDS INFO  gds_dump_param() WAIT32 addr:0x204a0f9bc alias:0x7ffc224f85f0 value:00000000 flags:00000000
[12926] GDS INFO  gds_dump_params() param[1]:
[12926] GDS INFO  gds_dump_param() WRITE32 addr:0x204a80000 alias:0x1 value:000003e7 flags:00000001
[12926] GDS INFO  gds_dump_params() param[2]:
[12926] GDS INFO  gds_dump_param() WAIT32 addr:0x23046d0000 alias:0x7ffc22513df0 value:000003e7 flags:00000001

testing....
[1] batch 2: posted 20 sequences
pre-posting took 2470.00 usec
[0] 2048000 bytes in 0.06 seconds = 293.14 Mbit/sec
[0] 1000 iters in 0.06 seconds = 55.89 usec/iter
[1] 2048000 bytes in 0.06 seconds = 292.90 Mbit/sec
[1] 1000 iters in 0.06 seconds = 55.94 usec/iter

GDS_FLUSHER_SERVICE=2 (NIC): about +20 usec/iter vs no flusher

[12961] GDS WARN  gds_post_ops() poll params
[12961] GDS INFO  gds_dump_params() param[0]:
[12961] GDS INFO  gds_dump_param() WAIT32 addr:0x204a0f9bc alias:0x204a60a04 value:00000000 flags:00000000
[12961] GDS INFO  gds_dump_params() param[1]:
[12961] GDS INFO  gds_dump_param() WRITE32 addr:0x23046c0000 alias:(nil) value:000003e7 flags:00000001
[12961] GDS INFO  gds_dump_params() param[2]:
[12961] GDS INFO  gds_dump_param() WRITE32 addr:0x204a40104 alias:0x7f7df36a3300 value:e7030000 flags:00000001
[12961] GDS INFO  gds_dump_params() param[3]:
[12961] GDS INFO  gds_dump_param() WRITE32 addr:0x204a60b00 alias:0x7fffe63c8600 value:08e60300 flags:00000000
[12961] GDS INFO  gds_dump_params() param[4]:
[12961] GDS INFO  gds_dump_param() WRITE32 addr:0x204a60b04 alias:0x1 value:036d1400 flags:00000001
[12961] GDS INFO  gds_dump_params() param[5]:
[12961] GDS INFO  gds_dump_param() WAIT32 addr:0x23046d0000 alias:0x7fffe63e3e00 value:000003e7 flags:00000001

[1] batch 2: posted 20 sequences
pre-posting took 2556.00 usec
[0] 2048000 bytes in 0.06 seconds = 275.28 Mbit/sec
[0] 1000 iters in 0.06 seconds = 59.52 usec/iter
[1] 2048000 bytes in 0.06 seconds = 275.09 Mbit/sec
[1] 1000 iters in 0.06 seconds = 59.56 usec/iter

@e-ago (Collaborator Author) commented Jan 25, 2018

tests/gds_kernel_latency, brdw0/1, cuda9.0, driver 384.81, using Tesla P100:

No Flusher

iters=1000 tx/rx_depth=1024

testing....
pre-posting took 1024.00 usec
[0] 2048000 bytes in 0.02 seconds = 744.56 Mbit/sec
[0] 1000 iters in 0.02 seconds = 22.00 usec/iter
[1] 2048000 bytes in 0.02 seconds = 743.54 Mbit/sec
[1] 1000 iters in 0.02 seconds = 22.03 usec/iter

CPU Flusher: about +4 usec/iter vs no flusher

pre-posting took 1400.00 usec
[0] 2048000 bytes in 0.03 seconds = 613.86 Mbit/sec
[0] 1000 iters in 0.03 seconds = 26.69 usec/iter
[1] 2048000 bytes in 0.03 seconds = 612.81 Mbit/sec
[1] 1000 iters in 0.03 seconds = 26.74 usec/iter

NIC Flusher: about +8 usec/iter vs no flusher

pre-posting took 1427.00 usec
[0] 2048000 bytes in 0.03 seconds = 540.98 Mbit/sec
[0] 1000 iters in 0.03 seconds = 30.29 usec/iter
[1] 2048000 bytes in 0.03 seconds = 539.94 Mbit/sec
[1] 1000 iters in 0.03 seconds = 30.34 usec/iter

@e-ago (Collaborator Author) commented Jan 25, 2018

hpgmg_async, brdw0/1, cuda9.0, driver 384.81, using Tesla P100, 2 processes:

CPU Flusher

size   gain vs no flusher   extra sec vs no flusher
4      -11.54%              +0.0003402
5      -11.92%              +0.0009222
6      -9.61%               +0.0020278
7      -5.33%               +0.0031404

NIC Flusher

size   gain vs no flusher   extra sec vs no flusher
4      -13.39%              +0.0003948
5      -14.39%              +0.0011136
6      -11.49%              +0.0024238
7      -5.56%               +0.0032774

@drossetti (Contributor) commented:
@e-ago does GDS_FLUSHER_SERVICE=0 (no flusher) imply GDS_GPU_HAS_FLUSHER=1, i.e., using the CUDA 9.1 internal flusher (broken, but still adding some overhead), or nothing at all?

@e-ago (Collaborator Author) commented Jan 29, 2018

@drossetti no. If GDS_GPU_HAS_FLUSHER is set to 1, then GDS_FLUSHER_SERVICE is ignored. Conversely, GDS_FLUSHER_SERVICE=0 does not imply GDS_GPU_HAS_FLUSHER=1.
That is, if GDS_FLUSHER_SERVICE=0 and GDS_GPU_HAS_FLUSHER=0, there is no flusher at all.

// move flush to last wait in the whole batch
if (n_waits && no_network_descs_after_entry(n_descs, descs, last_wait)) {
gds_dbg("optimizing FLUSH to last wait i=%zu\n", last_wait);
move_flush = true;
Contributor:
who is setting move_flush=true in the 'GPU supports native flusher' case?

src/flusher.hpp Outdated
#define GDS_FLUSHER_PORT 1
#define GDS_FLUSHER_QKEY 0 //0x11111111

#define CUDA_CHECK(stmt) \
Contributor:
are you using the CUDA RT API? this is a big decision...

Collaborator Author:
removed, it was an oversight

src/gdsync.cpp Outdated
gqp->recv_cq.curr_offset = 0;

gds_dbg("created gds_qp=%p\n", gqp);
if(!(flags & GDS_CREATE_QP_FLUSHER))
Contributor:
ditto

src/gdsync.cpp Outdated
}
qp_attr->send_cq = tx_cq;
gds_dbg("created send_cq=%p\n", qp_attr->send_cq);
if(!(flags & GDS_CREATE_QP_FLUSHER))
Contributor:
looks like you are using a negated logic for the flusher flag...
please refactor into a bool local var

src/gdsync.cpp Outdated
param->waitValue.flags |= CU_STREAM_WAIT_VALUE_FLUSH;

//No longer supported since CUDA 9.1
//if (need_flush) param->waitValue.flags |= CU_STREAM_WAIT_VALUE_FLUSH;
Contributor:
not true, we have to query via ::CU_DEVICE_ATTRIBUTE_CAN_USE_WAIT_VALUE_FLUSH
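The check the reviewer asks for can be sketched with the CUDA driver API (an illustrative fragment, not the PR's actual code; `dev` is assumed to be the active `CUdevice`, and the attribute is only available from CUDA 9.2 onwards, hence the guard):

```cpp
#include <cuda.h>

// Query at runtime whether waits on this device may carry the flush flag.
static bool can_use_wait_value_flush(CUdevice dev)
{
#if CUDA_VERSION >= 9020
    int attr = 0;
    if (cuDeviceGetAttribute(&attr,
            CU_DEVICE_ATTRIBUTE_CAN_USE_WAIT_VALUE_FLUSH, dev) != CUDA_SUCCESS)
        return false; // be conservative on query failure
    return attr != 0;
#else
    (void)dev;
    return false; // attribute not defined before CUDA 9.2
#endif
}
```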

src/flusher.hpp Outdated
#include "archutils.h"

#define GDS_FLUSHER_TYPE_CPU 1
#define GDS_FLUSHER_TYPE_NIC 2
Contributor:

I'd rather have an enum here.


#define GDS_FLUSHER_OP_CPU 2
#define GDS_FLUSHER_OP_NIC 5

Contributor:
enum here too

@e-ago (Collaborator Author) commented Feb 5, 2018:
Those constants are not related: they represent the number of ops required by NIC or CPU flusher

Contributor:
ok

…f define. local bool variable during qp creation. if CUDA_VERSION >= 9020 then query CU_DEVICE_ATTRIBUTE_CAN_USE_WAIT_VALUE_FLUSH in case of native flusher
@e-ago (Collaborator Author) commented Feb 5, 2018

@drossetti I've pushed some changes:

  • flusher env vars (GDS_GPU_HAS_FLUSHER and GDS_FLUSHER_SERVICE) merged into a single one (GDS_FLUSHER_TYPE)
  • enum with 4 different flusher types: GDS_FLUSHER_NONE=0, GDS_FLUSHER_NATIVE, GDS_FLUSHER_CPU, GDS_FLUSHER_NIC
  • in case of GDS_FLUSHER_NATIVE, move_flush reintroduced with CU_DEVICE_ATTRIBUTE_CAN_USE_WAIT_VALUE_FLUSH check (CUDA_VERSION >= 9020)
  • local bool variable during qp creation

@drossetti (Contributor) left a comment:
the flusher is a big chunk of code.
I suggest moving to a more object-oriented design and splitting the implementation into different .cpp files.
Besides, please reuse the memory allocation/registration functions already present in libgdsync.


#define GDS_FLUSHER_OP_CPU 2
#define GDS_FLUSHER_OP_NIC 5

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

else
return false;
}
#define CHECK_FLUSHER_SERVICE() \
Contributor:
why a macro?

}

static inline bool gds_flusher_service_active() {
if(gds_use_flusher == GDS_FLUSHER_CPU || gds_use_flusher == GDS_FLUSHER_NIC)
Contributor:
shouldn't this also check flusher_thread != NULL? Or wait for the thread to set some volatile flag signaling its liveness?


#define ROUND_TO(V,PS) ((((V) + (PS) - 1)/(PS)) * (PS))

bool gds_use_native_flusher()
Contributor:
this API reflects a choice which has been made earlier, while its name reads like a command to use the native flusher...
it could be renamed gds_is_native_flusher() or similar

static gds_flusher_buf flack_d;
static int flusher_value=0;
static pthread_t flusher_thread;
static int gds_use_flusher = -1;
Contributor:
I don't like the current stateful C API.
There should be a way to create a singleton object, the flusher, using an object factory.
flusher should be an abstract base class. derived classes are specializations.
And functions should be methods of that class.

}

static int gds_flusher_pin_buffer(gds_flusher_buf * fl_mem, size_t req_size, int type_mem)
{
Contributor:
why do you need a new memory allocation/registration function? why not using/extending those already here?

gds_dbg("created gds_qp=%p\n", gqp);
if(!is_qp_flusher)
{
if(gds_flusher_init(pd, context, gpu_id))
Contributor:
gds_flusher_init() should return a flusher object which is stored in gds_qp.
You should convince the reviewer that there is value in abstracting the native flusher here, or simply special-case it in gdsync.c

@e-ago (Collaborator Author) commented Apr 12, 2018

For the moment, the flusher implementation is in PR #51.
