Implements sendfile for redis. by ShooterIT · Pull Request #7096 · redis/redis

ShooterIT · 2020-04-14T17:00:18Z

I implement sendfile for redis, but now only support Linux and macOS since I only have these two operation systems. For performance, it may couldn't save time but can reduce CPU usage.

If we implement it on BSD systems, please notice that sendfile may return -1 and errno set is to EAGAIN even if some bytes have been sent successfully and the the len argument is set correctly when using a socket marked for non-blocking I/O just like on macOS.

antirez · 2020-04-15T08:39:54Z

Hello @ShooterIT, thanks for your PR. Now the window for such a change in Redis 6 is closed. But there could be space for this for Redis 7. I'm pinging @oranagra that may want to give a look to evaluate this code. I believe that it should be proven to have a performance advantage of any kind before using it btw. Cheers.

oranagra · 2020-04-15T08:59:35Z

I suppose there's no disadvantage in that and the cost of complexity is small.
not copying gigabytes of data two additional times is surely beneficial IMO.
would be nice to see some benchmark though.

ShooterIT · 2020-04-15T11:59:38Z

Yes, we should be rigorous for every change @antirez @oranagra

My two server machines are the same, Linux kernel is 3.10.0, CPU is E5-2620 v2 @ 2.10GHz.
The time taken by these two methods is almost the same, but the CPU utilization rates are different. The CPU usages are as follows when master only send RDB file to one replica.

read/write	sendfile
9.1%	5.3%

To be honest, I only verified the correctness on my MacBook for macOS. I need to do some performance tests.

ShooterIT · 2020-05-19T12:48:08Z

The Linux server machines' configurations are the same, as above.

I did benchmark on master redis meanwhile replicas received RDB file. Some tests are as follows. But I get CPU usage without benchmark.

Benchmark:
./src/redis-benchmark -t set -d 1024 -n 1000000 -p 6380

Config:

  50 parallel clients
  1024 bytes payload
  keep alive: 1
  host configuration "save":
  host configuration "appendonly": no
  multi-thread: no

One replica syncs from master.

One replica	read/write	sendfile
CPU	9.1%	5.3%
Benchmark	55805	59403

Two replicas sync from master.

Two replicas	read/write	sendfile
CPU	10.3%	5.7%
Benchmark	51448	54445

ShooterIT · 2020-05-20T02:47:49Z

There are some my tests, it have advantage at reducing CPU usage @antirez @oranagra
I think that is why some web server use sendfile such as Nginx.

For macOS, my macbook has poor configurations, so I can't do some large capacity tests. but I could exactly find redis will cost low cpu if use sendfile.

oranagra · 2020-05-20T05:34:31Z

@antirez i think this can be merged. i honestly don't see any risks, and the fact the test suite passes, seem to be enough.

BTW, @ShooterIT, how about using splice to do the same thing for readSyncBulkPayload on the replica side? care to implement and test that?

ShooterIT · 2020-05-20T12:51:31Z

hello @oranagra
I ever thought about splice, I simply considered it can't move data between socket and regular file.
But your remind gives me an inspiration that we can use a temporary pipe to implement it with call splice twice, And this way may not destroy the code structure. Very interesting! I will have a try, thanks.

oranagra · 2020-05-20T13:36:01Z

@ShooterIT i don't understand what you mean. readSyncBulkPayload reads from a socket and write so a file. i think it can use splice in exactly the same way you used sendfile to read from a file and write to a socket.

ShooterIT · 2020-05-20T14:02:07Z

AFAIK, one fd of splice arguments must be a pipe fd

antirez · 2020-05-22T10:27:16Z

What about merging this only into unstable and not into Redis 6? @oranagra @ShooterIT?

ShooterIT · 2020-05-22T10:59:38Z

That's OK @antirez
Stability is first, I can understand since there are too many new features in 6.0. Let more people verify it, it can be merged if there is no problem for a long time.

oranagra · 2020-05-22T11:53:13Z

I would have argued that this is a safe improvement.. But considering my recent simple fix exploded twice.. I can't.. So since this is not a critical issue, but rather just efficiency, not putting into 6 is ok.

antirez · 2020-05-22T11:54:56Z

@oranagra yep given the intrinsic issues with programming... better to be safe :-D Anyway we can promote it to Redis 6 after some time.

antirez · 2020-05-22T11:55:51Z

Merged.

oranagra · 2020-06-01T13:39:54Z

@ShooterIT, not sure if you noticed, but this change was released as part of redis 6.0.4.

Later we received some reports that sendfile introduces latency implications.
Apparently, at least on some platforms if not all, it has to wait for TCP ACK, which means that every call introduces a latency of a round trip (slowing down clients while it streams the RDB to the replicas).

We may have to revert it, but first, i'd like to check if you have anything to see if you happen to have any feedback on that?

In any case, thanks for contributing (quite a lot recently). much appreciated.

ShooterIT · 2020-06-02T04:31:54Z

Thanks for notifying me @oranagra
Don't care my feeling if there is a performance loss, but firstly. I wish I can understand it clearly.
There are two questions I need you help me.

sendfile will delay to send data actually if we set repl-disable-tcp-nodelay no ? What's more, AFAIK, replica may employ delayed ACK algorithm, that may have an effect on delaying.
I only do some performance tests on Linux 3.10.0. I don't find latency. Do you know what platform? We can only use it on linux if there is no latency on linux because it is used most widely.

ShooterIT · 2020-06-02T05:26:18Z

I think sendfile makes much sense of transfering big file since some operation systems implement it. Is it my incorrect usage?

oranagra · 2020-06-02T05:36:17Z

@ShooterIT AFAIK it's not an incorrect usage, but rather in inefficient blocking implementation in the kernel. Have a look at this paper.
It was reported by @JohnSully, who said that he first noticed the latency (to other clients) and then tracked it down to the use of sendfile.
I didn't bother to try to test it myself, but please note that even if the problem exists only on some platforms, it may still mean we have to avoid using it on all.

oranagra · 2020-06-02T06:24:20Z

one possible solution is maybe to use sendfile from a thread in a blocking manner, it does add some complexity to redis though.
but come to think of it, the use of sendfile, not only reduces memory copying, but i suppose also reduces cache pollution and thus cache misses.
would be nice to try and measure it if someone has the time.

ShooterIT · 2020-06-02T07:50:21Z

Thanks @oranagra I will read it ASAP.
Hello @JohnSully would you mind to provide some details of your tests?

daemyungkang · 2020-06-02T08:06:48Z

I'm not sure it is the same issue but Kafka also had a trouble with sendfile.
It had an issue when sendfile block the network thread when it sends uncached file.
https://issues.apache.org/jira/browse/KAFKA-7504

JohnSully · 2020-06-02T08:26:26Z

Hi @ShooterIT

I originally noticed the problems from test failures in KeyDB after merging 6.0.4 and narrowed it down to the sendfile change. The root cause was increased synchronization time which the test were sensitive too.

sendfile() is no longer in KeyDB, but I've included a checkpoint where it existed so you can repro.

Check out KeyDB: https://github.com/JohnSully/KeyDB
checkout e50313b20718bc8df0a53c11b0960e4bcb2177d4
make
./runtest --single integration/replication-active

[err]: Active replicas propogate in tests/integration/replication-active.tcl

The active replication tests don't confirm that synchronization has finished before starting which makes them quite sensitive to latency. Adding a "after 2000" delay before the first test resolves the failure as does removing sendfile() completely which I unfortunately had to do.

I did spend some time researching why sendfile() is unexpectedly slower, what I found was that because of its reliance on DMA it has to ensure the last TCP packet is ACK'd before returning (and therefore that the TCP buffer is not needed). This means that the call is now affected by TCP round trip latency. It seems to be designed to send bulk transfers not short bursts as is done in Redis/KeyDB.

JohnSully · 2020-06-02T08:34:27Z

The other part to point out is that the tests rely on loopback sockets which have notoriously different perf characteristics to real NICs. I'm not a kernel expert so its quite possible some NICs have more efficient sendfile implementations. Nonetheless loopback use is an important scenario for KeyDB and I assume for Redis as well.

oranagra · 2020-06-02T08:49:34Z

it would be nice to benchmark it on various platforms if someone has the time.
AFAIK the benchmark that was done already measured replication time and CPU usage. but we now wanna add to the mix the latency impact on other clients, and maybe the amount of cache misses.

in any case, even if there are severe implications on just one platform, we probably can't afford to keep it as it is, but maybe we'll decide that it is important enough to consider using in a thread, or another solution comes up.

ShooterIT · 2020-06-03T05:53:05Z

Thanks @JohnSully
I didn't find errors when I did tests on my machines as you said. But it will case terrible latency if sendfile has to ensure the last TCP packet is ACK'd before returning. And this paper also points out the blocking behavior. Actually this exceeds my cognition. I don't know much about kernel, and can't sure if there are different implementations.

Just as @oranagra said,

in any case, even if there are severe implications on just one platform, we probably can't afford to keep it as it is

@antirez @oranagra You could revert it. I'm truly sorry to bring latency implications.

There is a test to see sendfile cost by strace on my machines. We can find sendfile time-consuming has large fluctuations. But I am not sure it is blocking for waiting TCP ACK, maybe because reading data from disk is blocked, you may do tests on your machine @JohnSully

18:35:12.001289 sendfile(7, 10, [40435712], 16384) = 16384 <0.000015>
18:35:12.001349 sendfile(7, 10, [40452096], 16384) = 16384 <0.000016>
18:35:12.001408 sendfile(7, 10, [40468480], 16384) = 16384 <0.000016>
18:35:12.001461 sendfile(7, 10, [40484864], 16384) = 16384 <0.000022>
18:35:12.001521 sendfile(7, 10, [40501248], 16384) = 16384 <0.000015>
18:35:12.001582 sendfile(7, 10, [40517632], 16384) = 16384 <0.000015>
18:35:12.001639 sendfile(7, 10, [40534016], 16384) = 16384 <0.000019>
18:35:12.001696 sendfile(7, 10, [40550400], 16384) = 16384 <0.000015>
18:35:12.001749 sendfile(7, 10, [40566784], 16384) = 16384 <0.000016>
18:35:12.001801 sendfile(7, 10, [40583168], 16384) = 16384 <0.000022>
18:35:12.001860 sendfile(7, 10, [40599552], 16384) = 16384 <0.000015>
18:35:12.001916 sendfile(7, 10, [40615936], 16384) = 16384 <0.000015>
18:35:12.001968 sendfile(7, 10, [40632320], 16384) = 16384 <0.000021>
18:35:12.002029 sendfile(7, 10, [40648704], 16384) = 16384 <0.000015>
18:35:12.002089 sendfile(7, 10, [40665088], 16384) = 16384 <0.000016>
18:35:12.002148 sendfile(7, 10, [40681472], 16384) = 16384 <0.000016>
18:35:12.002204 sendfile(7, 10, [40697856], 16384) = 16384 <0.000023>
18:35:12.002265 sendfile(7, 10, [40714240], 16384) = 16384 <0.000015>
18:35:12.002325 sendfile(7, 10, [40730624], 16384) = 16384 <0.000015>
18:35:12.002377 sendfile(7, 10, [40747008], 16384) = 16384 <0.000018>
18:35:12.002434 sendfile(7, 10, [40763392], 16384) = 16384 <0.000016>

I came up with an idea, sendfile will not block if we fill the tcp send buffer on every calling sendfile because there must be a TCP ACK if replica fd is writable. Right? So I use setsockopt to limit send buffer PROTO_IOBUF_LEN (anetSetSendBuffer(NULL,slave->conn->fd,PROTO_IOBUF_LEN/2);) . I can observe sendfile time-consuming is stable as follows, but the time of transfering RDB got longer because of too many small packets. That can be observed by 'tcpdump'. But I think we can't increase transfer size of sendfile simply since reading data from disk is blocked in sendfile

13:11:00.011137 sendfile(7, 10, [7892992], 16384) = 16384 <0.000016>
13:11:00.011301 sendfile(7, 10, [7909376], 16384) = 16384 <0.000016>
13:11:00.011466 sendfile(7, 10, [7925760], 16384) = 16384 <0.000015>
13:11:00.011629 sendfile(7, 10, [7942144], 16384) = 16384 <0.000016>
13:11:00.011792 sendfile(7, 10, [7958528], 16384) = 16384 <0.000016>
13:11:00.011956 sendfile(7, 10, [7974912], 16384) = 16384 <0.000015>
13:11:00.012120 sendfile(7, 10, [7991296], 16384) = 16384 <0.000016>
13:11:00.012286 sendfile(7, 10, [8007680], 16384) = 16384 <0.000017>
13:11:00.012449 sendfile(7, 10, [8024064], 16384) = 16384 <0.000015>
13:11:00.012612 sendfile(7, 10, [8040448], 16384) = 16384 <0.000016>
13:11:00.012775 sendfile(7, 10, [8056832], 16384) = 16384 <0.000015>
13:11:00.012939 sendfile(7, 10, [8073216], 16384) = 16384 <0.000016>
13:11:00.013103 sendfile(7, 10, [8089600], 16384) = 16384 <0.000016>
13:11:00.013268 sendfile(7, 10, [8105984], 16384) = 16384 <0.000016>
13:11:00.013431 sendfile(7, 10, [8122368], 16384) = 16384 <0.000016>
13:11:00.013595 sendfile(7, 10, [8138752], 16384) = 16384 <0.000016>
13:11:00.013759 sendfile(7, 10, [8155136], 16384) = 16384 <0.000016>
13:11:00.013922 sendfile(7, 10, [8171520], 16384) = 16384 <0.000016>
13:11:00.014087 sendfile(7, 10, [8187904], 16384) = 16384 <0.000015>
13:11:00.014251 sendfile(7, 10, [8204288], 16384) = 16384 <0.000017>
13:11:00.014414 sendfile(7, 10, [8220672], 16384) = 16384 <0.000016>
13:11:00.014581 sendfile(7, 10, [8237056], 16384) = 16384 <0.000015>
13:11:00.014742 sendfile(7, 10, [8253440], 16384) = 16384 <0.000016>
13:11:00.014907 sendfile(7, 10, [8269824], 16384) = 16384 <0.000016>
13:11:00.015070 sendfile(7, 10, [8286208], 16384) = 16384 <0.000015>
13:11:00.015234 sendfile(7, 10, [8302592], 16384) = 16384 <0.000016>
13:11:00.015397 sendfile(7, 10, [8318976], 16384) = 16384 <0.000016>
13:11:00.015561 sendfile(7, 10, [8335360], 16384) = 16384 <0.000015>
13:11:00.015725 sendfile(7, 10, [8351744], 16384) = 16384 <0.000015>
13:11:00.015888 sendfile(7, 10, [8368128], 16384) = 16384 <0.000015>
13:11:00.016082 sendfile(7, 10, [8384512], 16384) = 16384 <0.000015>
13:11:00.016290 sendfile(7, 10, [8400896], 16384) = 16384 <0.000018>
13:11:00.016535 sendfile(7, 10, [8417280], 16384) = 16384 <0.000016>
13:11:00.016699 sendfile(7, 10, [8433664], 16384) = 16384 <0.000016>
13:11:00.016943 sendfile(7, 10, [8450048], 16384) = 16384 <0.000016>
13:11:00.017189 sendfile(7, 10, [8466432], 16384) = 16384 <0.000016>
13:11:00.017353 sendfile(7, 10, [8482816], 16384) = 16384 <0.000016>

Maybe using a thread to send RDB file is a good idea, but that brings much complexity especially many replicas need to receive RDB files at the same time.

JohnSully · 2020-06-03T05:59:00Z

When I saw this change I thought it was a really good idea. I’m a bit sad it didn’t work out as well as hoped. A separate thread to send the RDB would be cool.

I get a consistent repro on my Ubuntu 18.04 machine run in WSL2’s virtual machine but I’ve seen it in more pure linux setups as well.

ShooterIT · 2020-06-04T03:07:05Z

Hello @JohnSully would you mind to help verify my guess

I came up with an idea, sendfile will not block if we fill the tcp send buffer on every calling sendfile because there must be a TCP ACK if replica fd is writable. Right? So I use setsockopt to limit send buffer PROTO_IOBUF_LEN (anetSetSendBuffer(NULL,slave->conn->fd,PROTO_IOBUF_LEN/2);)

You could add the following code in https://github.com/antirez/redis/blob/unstable/src/replication.c#L1351, and do your tests in your environment.

#if HAVE_SENDFILE
                if(!server.tls_replication) anetSetSendBuffer(NULL,slave->conn->fd,PROTO_IOBUF_LEN/2);
#endif

JohnSully · 2020-06-04T03:44:21Z

@ShooterIT I repro'd without tls compiled and in tests without TLS enabled. Are you sure this will still have an effect?

ShooterIT · 2020-06-04T04:25:01Z

Hi @JohnSully That's OK. Now, TLS must be disable to enable sendfile.
For your repo, Maybe we need to add the following code in https://github.com/JohnSully/KeyDB/blob/unstable/src/replication.cpp#L1663

#if HAVE_SENDFILE
                if (!g_pserver->tls_replication && !g_pserver->fActiveReplica)
                    anetSetSendBuffer(NULL,slave->conn->fd,PROTO_IOBUF_LEN/2);
#endif

and add int anetSetSendBuffer(char *err, int fd, int buffsize); in src/anet.h

ShooterIT · 2020-06-04T04:31:03Z

I can pass the test that you told me by above modifications, although passed the test before on my environments.
https://github.com/antirez/redis/pull/7096#issuecomment-637378543

ShooterIT · 2020-06-04T09:22:58Z

Hi @JohnSully @oranagra @antirez I have some new discoveries.
I track down the implement of sendfile on linux , in linux source code
https://github.com/torvalds/linux/blob/master/fs/splice.c#L967
https://github.com/torvalds/linux/blob/master/fs/splice.c#L936
That means sendfile doesn't support nonblocking mode for out_fd, for this comment https://github.com/torvalds/linux/blob/master/fs/splice.c#L934, I consider it is wrong, should be
Block on output, we have to drain the direct pipe. Right?
I don't know much about linux kernel, so I am not sure about it, please help me to review this point.

What‘s more, I read man sendfile on my macOS, for error EAGIN, this is the description, same as FressBSD.

 [EAGAIN]           The socket is marked for non-blocking I/O and not all data was sent due to the socket buffer
                    being full.  If specified, the number of bytes successfully sent will be returned in *len.

In freeBSD source code,
https://github.com/freebsd/freebsd/blob/master/sys/kern/kern_sendfile.c#L776
https://github.com/freebsd/freebsd/blob/master/sys/kern/kern_sendfile.c#L811
Maybe the behavior of sendfile on BSD operation system is our want.

oranagra · 2020-06-04T10:38:59Z

@ShooterIT it seems you are right.
@antirez i think we should revert and possibly make another quick release. (note there are a few other important fixes in other PRs)

ShooterIT · 2020-06-04T12:21:14Z

Please revert it @antirez I am truly sorry.

JohnSully · 2020-06-05T00:08:35Z

Hi @ShooterIT It will pass the test but only because sendfile is disabled for active replication which that test enables. It doesn't fundamentally fix the issue of slow startup. Active replication is a bit more sensitive because each server will send an RDB to the other so latency is 2x normal.

I initially disabled it only for active replication when I hadn't yet understood the impact of sendfile().

antirez · 2020-06-06T09:43:46Z

@ShooterIT no problem, there are complex interactions, we never are aware, you did nothing wrong here, instead we maintainers made an error by letting a new feature in that was just a performance improvement, without giving it the time to stay enough into unstable. A big thank you to @JohnSully for the heads up on that. To start I can revert the feature just this way:

I reverted the implementation both on Redis 6.0 and unstable.

Implements sendfile for redis.

Implements sendfile for redis.

9cf500a

antirez added cpu efficiency replication labels May 22, 2020

antirez merged commit 4016155 into redis:unstable May 22, 2020

ShooterIT deleted the sendfile branch June 2, 2020 06:08

JackieXie168 pushed a commit to JackieXie168/redis that referenced this pull request Jun 18, 2020

Merge pull request redis#7096 from ShooterIT/sendfile

ba81143

Implements sendfile for redis.

madolson mentioned this pull request Aug 11, 2020

Send the RDB bulk data by function of sendfile. #5253

Closed

ShooterIT mentioned this pull request Sep 6, 2020

[BUG] Disk Based Replication Causing OOM in Master (v6.0.5) #7717

Closed

ShooterIT mentioned this pull request Oct 19, 2020

Send RDB by a separate thread when full synchronization #7902

Closed

Conversation

ShooterIT commented Apr 14, 2020

Uh oh!

antirez commented Apr 15, 2020

Uh oh!

oranagra commented Apr 15, 2020

Uh oh!

ShooterIT commented Apr 15, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ShooterIT commented May 19, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ShooterIT commented May 20, 2020

Uh oh!

oranagra commented May 20, 2020

Uh oh!

ShooterIT commented May 20, 2020

Uh oh!

oranagra commented May 20, 2020

Uh oh!

ShooterIT commented May 20, 2020 via email • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

antirez commented May 22, 2020

Uh oh!

ShooterIT commented May 22, 2020

Uh oh!

oranagra commented May 22, 2020

Uh oh!

antirez commented May 22, 2020

Uh oh!

antirez commented May 22, 2020

Uh oh!

oranagra commented Jun 1, 2020

Uh oh!

ShooterIT commented Jun 2, 2020

Uh oh!

ShooterIT commented Jun 2, 2020

Uh oh!

oranagra commented Jun 2, 2020

Uh oh!

oranagra commented Jun 2, 2020

Uh oh!

ShooterIT commented Jun 2, 2020

Uh oh!

daemyungkang commented Jun 2, 2020

Uh oh!

JohnSully commented Jun 2, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JohnSully commented Jun 2, 2020

Uh oh!

oranagra commented Jun 2, 2020

Uh oh!

ShooterIT commented Jun 3, 2020

Uh oh!

JohnSully commented Jun 3, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ShooterIT commented Jun 4, 2020

Uh oh!

JohnSully commented Jun 4, 2020

Uh oh!

ShooterIT commented Jun 4, 2020

Uh oh!

ShooterIT commented Jun 4, 2020

Uh oh!

ShooterIT commented Jun 4, 2020

Uh oh!

oranagra commented Jun 4, 2020

Uh oh!

ShooterIT commented Jun 4, 2020

Uh oh!

JohnSully commented Jun 5, 2020

Uh oh!

antirez commented Jun 6, 2020

Uh oh!

Reviewers

Assignees

ShooterIT commented Apr 15, 2020 •

edited

Loading

ShooterIT commented May 19, 2020 •

edited

Loading

ShooterIT commented May 20, 2020 via email •

edited

Loading

JohnSully commented Jun 2, 2020 •

edited

Loading

JohnSully commented Jun 3, 2020 •

edited

Loading