Speed problem in distributed training

## 1. In file "kvstore_dist_server.h"
Change
```
// TODO(mli) try to remove this CopyFrom
response.vals.CopyFrom(static_cast<const float*>(stored.data().dptr_), len);
```
to
```
response.vals = ps::SArray<float>(stored.data().dptr<float>(), len, false);
```

In VGG16 training on two machines (8 GPUs in total), this change increases throughput by **20** samples/sec.
Is this change correct?
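
To make the question concrete, here is a minimal sketch of what the change trades away. It does not use ps-lite: `FloatView` is a hypothetical stand-in for an `SArray` that wraps a pointer without owning it, and `CopyResponse` / `AliasResponse` are made-up names for the two paths. It only shows that an aliased response observes any in-place update made to the stored buffer before the response is actually sent, while a copied response does not.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical non-owning view; stands in for an array that wraps
// a raw pointer without taking ownership of the memory.
struct FloatView {
  const float* data;
  std::size_t size;
};

// Copying path: the response holds an independent snapshot of the stored values.
std::vector<float> CopyResponse(const std::vector<float>& stored) {
  return std::vector<float>(stored.begin(), stored.end());
}

// Zero-copy path: the response only points at the stored values.
FloatView AliasResponse(const std::vector<float>& stored) {
  return FloatView{stored.data(), stored.size()};
}

// Returns true if a later in-place update of the store is visible
// through the alias but not through the copy.
bool AliasSeesLaterUpdate() {
  std::vector<float> stored = {1.0f, 2.0f, 3.0f};
  std::vector<float> copied = CopyResponse(stored);
  FloatView aliased = AliasResponse(stored);
  stored[0] = 42.0f;  // simulates an in-place update (no reallocation)
  return copied[0] == 1.0f && aliased.data[0] == 42.0f;
}
```

So whether the zero-copy version is safe seems to depend on whether the server can modify or free `stored` while the response is still waiting to be sent.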

## 2. In file "kvstore_dist.h"
Deleting line 275, `send_buf.WaitToWrite();`, speeds up training with kvstore='sync_device' or 'local'.
In the profiler, I can see that this WaitToWrite blocks all pushes until the whole backward pass has finished.
Is this WaitToWrite necessary?
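
A toy model of the behaviour seen in the profile, as a sketch only. It does not use the MXNet engine: the `ready` times and both function names are hypothetical, and the assumption is simply that a blocking wait delays every push until the last gradient is available, whereas without it each push can be issued as soon as its own gradient is ready.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// ready[i] is the (hypothetical) time at which layer i's gradient is produced.
// With a blocking wait, no push is issued before the last gradient is ready.
std::vector<int> PushStartBlocking(const std::vector<int>& ready) {
  int last = ready.empty() ? 0 : *std::max_element(ready.begin(), ready.end());
  return std::vector<int>(ready.size(), last);
}

// Without the wait, each push is issued as soon as its own gradient is ready,
// so communication can overlap with the rest of the backward pass.
std::vector<int> PushStartOverlapped(const std::vector<int>& ready) {
  return ready;
}
```

In this model, gradients ready at times 1, 2 and 5 are all pushed at time 5 with the blocking wait, and at times 1, 2 and 5 without it.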

speed problem in distribute training #8097

1. In file "kvstore_dist_server.py"

2. In file "kvstore_dist.py"

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

speed problem in distribute training #8097

Description

1. In file "kvstore_dist_server.py"

2. In file "kvstore_dist.py"

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions