Skip to content

Implements sendfile for redis.#7096

Merged
antirez merged 1 commit intoredis:unstablefrom
ShooterIT:sendfile
May 22, 2020
Merged

Implements sendfile for redis.#7096
antirez merged 1 commit intoredis:unstablefrom
ShooterIT:sendfile

Conversation

@ShooterIT
Copy link
Member

I implement sendfile for redis, but now only support Linux and macOS since I only have these two operation systems. For performance, it may couldn't save time but can reduce CPU usage.

If we implement it on BSD systems, please notice that sendfile may return -1 and errno set is to EAGAIN even if some bytes have been sent successfully and the the len argument is set correctly when using a socket marked for non-blocking I/O just like on macOS.

@antirez
Copy link
Contributor

antirez commented Apr 15, 2020

Hello @ShooterIT, thanks for your PR. Now the window for such a change in Redis 6 is closed. But there could be space for this for Redis 7. I'm pinging @oranagra that may want to give a look to evaluate this code. I believe that it should be proven to have a performance advantage of any kind before using it btw. Cheers.

@oranagra
Copy link
Member

I suppose there's no disadvantage in that and the cost of complexity is small.
not copying gigabytes of data two additional times is surely beneficial IMO.
would be nice to see some benchmark though.

@ShooterIT
Copy link
Member Author

ShooterIT commented Apr 15, 2020

Yes, we should be rigorous for every change @antirez @oranagra

My two server machines are the same, Linux kernel is 3.10.0, CPU is E5-2620 v2 @ 2.10GHz.
The time taken by these two methods is almost the same, but the CPU utilization rates are different. The CPU usages are as follows when master only send RDB file to one replica.

read/write sendfile
9.1% 5.3%

To be honest, I only verified the correctness on my MacBook for macOS. I need to do some performance tests.

@ShooterIT
Copy link
Member Author

ShooterIT commented May 19, 2020

The Linux server machines' configurations are the same, as above.

I did benchmark on master redis meanwhile replicas received RDB file. Some tests are as follows. But I get CPU usage without benchmark.

Benchmark:
./src/redis-benchmark -t set -d 1024 -n 1000000 -p 6380

Config:

  50 parallel clients
  1024 bytes payload
  keep alive: 1
  host configuration "save":
  host configuration "appendonly": no
  multi-thread: no

One replica syncs from master.

One replica read/write sendfile
CPU 9.1% 5.3%
Benchmark 55805 59403

Two replicas sync from master.

Two replicas read/write sendfile
CPU 10.3% 5.7%
Benchmark 51448 54445

@ShooterIT
Copy link
Member Author

There are some my tests, it have advantage at reducing CPU usage @antirez @oranagra
I think that is why some web server use sendfile such as Nginx.

For macOS, my macbook has poor configurations, so I can't do some large capacity tests. but I could exactly find redis will cost low cpu if use sendfile.

@oranagra
Copy link
Member

@antirez i think this can be merged. i honestly don't see any risks, and the fact the test suite passes, seem to be enough.

BTW, @ShooterIT, how about using splice to do the same thing for readSyncBulkPayload on the replica side? care to implement and test that?

@ShooterIT
Copy link
Member Author

hello @oranagra
I ever thought about splice, I simply considered it can't move data between socket and regular file.
But your remind gives me an inspiration that we can use a temporary pipe to implement it with call splice twice, And this way may not destroy the code structure. Very interesting! I will have a try, thanks.

@oranagra
Copy link
Member

@ShooterIT i don't understand what you mean. readSyncBulkPayload reads from a socket and write so a file. i think it can use splice in exactly the same way you used sendfile to read from a file and write to a socket.

@ShooterIT
Copy link
Member Author

ShooterIT commented May 20, 2020 via email

@antirez
Copy link
Contributor

antirez commented May 22, 2020

What about merging this only into unstable and not into Redis 6? @oranagra @ShooterIT?

@ShooterIT
Copy link
Member Author

That's OK @antirez
Stability is first, I can understand since there are too many new features in 6.0. Let more people verify it, it can be merged if there is no problem for a long time.

@oranagra
Copy link
Member

I would have argued that this is a safe improvement.. But considering my recent simple fix exploded twice.. I can't.. So since this is not a critical issue, but rather just efficiency, not putting into 6 is ok.

@antirez
Copy link
Contributor

antirez commented May 22, 2020

@oranagra yep given the intrinsic issues with programming... better to be safe :-D Anyway we can promote it to Redis 6 after some time.

@antirez antirez merged commit 4016155 into redis:unstable May 22, 2020
@antirez
Copy link
Contributor

antirez commented May 22, 2020

Merged.

@oranagra
Copy link
Member

oranagra commented Jun 1, 2020

@ShooterIT, not sure if you noticed, but this change was released as part of redis 6.0.4.

Later we received some reports that sendfile introduces latency implications.
Apparently, at least on some platforms if not all, it has to wait for TCP ACK, which means that every call introduces a latency of a round trip (slowing down clients while it streams the RDB to the replicas).

We may have to revert it, but first, i'd like to check if you have anything to see if you happen to have any feedback on that?

In any case, thanks for contributing (quite a lot recently). much appreciated.

@ShooterIT
Copy link
Member Author

Thanks for notifying me @oranagra
Don't care my feeling if there is a performance loss, but firstly. I wish I can understand it clearly.
There are two questions I need you help me.

  1. sendfile will delay to send data actually if we set repl-disable-tcp-nodelay no ? What's more, AFAIK, replica may employ delayed ACK algorithm, that may have an effect on delaying.
  2. I only do some performance tests on Linux 3.10.0. I don't find latency. Do you know what platform? We can only use it on linux if there is no latency on linux because it is used most widely.

@ShooterIT
Copy link
Member Author

I think sendfile makes much sense of transfering big file since some operation systems implement it. Is it my incorrect usage?

@oranagra
Copy link
Member

oranagra commented Jun 2, 2020

@ShooterIT AFAIK it's not an incorrect usage, but rather in inefficient blocking implementation in the kernel. Have a look at this paper.
It was reported by @JohnSully, who said that he first noticed the latency (to other clients) and then tracked it down to the use of sendfile.
I didn't bother to try to test it myself, but please note that even if the problem exists only on some platforms, it may still mean we have to avoid using it on all.

@ShooterIT ShooterIT deleted the sendfile branch June 2, 2020 06:08
@oranagra
Copy link
Member

oranagra commented Jun 2, 2020

one possible solution is maybe to use sendfile from a thread in a blocking manner, it does add some complexity to redis though.
but come to think of it, the use of sendfile, not only reduces memory copying, but i suppose also reduces cache pollution and thus cache misses.
would be nice to try and measure it if someone has the time.

@ShooterIT
Copy link
Member Author

Thanks @oranagra I will read it ASAP.
Hello @JohnSully would you mind to provide some details of your tests?

@daemyungkang
Copy link
Contributor

I'm not sure it is the same issue but Kafka also had a trouble with sendfile.
It had an issue when sendfile block the network thread when it sends uncached file.
https://issues.apache.org/jira/browse/KAFKA-7504

@JohnSully
Copy link
Contributor

JohnSully commented Jun 2, 2020

Hi @ShooterIT

I originally noticed the problems from test failures in KeyDB after merging 6.0.4 and narrowed it down to the sendfile change. The root cause was increased synchronization time which the test were sensitive too.

sendfile() is no longer in KeyDB, but I've included a checkpoint where it existed so you can repro.

  1. Check out KeyDB: https://github.com/JohnSully/KeyDB
  2. checkout e50313b20718bc8df0a53c11b0960e4bcb2177d4
  3. make
  4. ./runtest --single integration/replication-active

[err]: Active replicas propogate in tests/integration/replication-active.tcl

The active replication tests don't confirm that synchronization has finished before starting which makes them quite sensitive to latency. Adding a "after 2000" delay before the first test resolves the failure as does removing sendfile() completely which I unfortunately had to do.

I did spend some time researching why sendfile() is unexpectedly slower, what I found was that because of its reliance on DMA it has to ensure the last TCP packet is ACK'd before returning (and therefore that the TCP buffer is not needed). This means that the call is now affected by TCP round trip latency. It seems to be designed to send bulk transfers not short bursts as is done in Redis/KeyDB.

@JohnSully
Copy link
Contributor

The other part to point out is that the tests rely on loopback sockets which have notoriously different perf characteristics to real NICs. I'm not a kernel expert so its quite possible some NICs have more efficient sendfile implementations. Nonetheless loopback use is an important scenario for KeyDB and I assume for Redis as well.

@oranagra
Copy link
Member

oranagra commented Jun 2, 2020

it would be nice to benchmark it on various platforms if someone has the time.
AFAIK the benchmark that was done already measured replication time and CPU usage. but we now wanna add to the mix the latency impact on other clients, and maybe the amount of cache misses.

in any case, even if there are severe implications on just one platform, we probably can't afford to keep it as it is, but maybe we'll decide that it is important enough to consider using in a thread, or another solution comes up.

@ShooterIT
Copy link
Member Author

Thanks @JohnSully
I didn't find errors when I did tests on my machines as you said. But it will case terrible latency if sendfile has to ensure the last TCP packet is ACK'd before returning. And this paper also points out the blocking behavior. Actually this exceeds my cognition. I don't know much about kernel, and can't sure if there are different implementations.

Just as @oranagra said,

in any case, even if there are severe implications on just one platform, we probably can't afford to keep it as it is

@antirez @oranagra You could revert it. I'm truly sorry to bring latency implications.

There is a test to see sendfile cost by strace on my machines. We can find sendfile time-consuming has large fluctuations. But I am not sure it is blocking for waiting TCP ACK, maybe because reading data from disk is blocked, you may do tests on your machine @JohnSully

18:35:12.001289 sendfile(7, 10, [40435712], 16384) = 16384 <0.000015>
18:35:12.001349 sendfile(7, 10, [40452096], 16384) = 16384 <0.000016>
18:35:12.001408 sendfile(7, 10, [40468480], 16384) = 16384 <0.000016>
18:35:12.001461 sendfile(7, 10, [40484864], 16384) = 16384 <0.000022>
18:35:12.001521 sendfile(7, 10, [40501248], 16384) = 16384 <0.000015>
18:35:12.001582 sendfile(7, 10, [40517632], 16384) = 16384 <0.000015>
18:35:12.001639 sendfile(7, 10, [40534016], 16384) = 16384 <0.000019>
18:35:12.001696 sendfile(7, 10, [40550400], 16384) = 16384 <0.000015>
18:35:12.001749 sendfile(7, 10, [40566784], 16384) = 16384 <0.000016>
18:35:12.001801 sendfile(7, 10, [40583168], 16384) = 16384 <0.000022>
18:35:12.001860 sendfile(7, 10, [40599552], 16384) = 16384 <0.000015>
18:35:12.001916 sendfile(7, 10, [40615936], 16384) = 16384 <0.000015>
18:35:12.001968 sendfile(7, 10, [40632320], 16384) = 16384 <0.000021>
18:35:12.002029 sendfile(7, 10, [40648704], 16384) = 16384 <0.000015>
18:35:12.002089 sendfile(7, 10, [40665088], 16384) = 16384 <0.000016>
18:35:12.002148 sendfile(7, 10, [40681472], 16384) = 16384 <0.000016>
18:35:12.002204 sendfile(7, 10, [40697856], 16384) = 16384 <0.000023>
18:35:12.002265 sendfile(7, 10, [40714240], 16384) = 16384 <0.000015>
18:35:12.002325 sendfile(7, 10, [40730624], 16384) = 16384 <0.000015>
18:35:12.002377 sendfile(7, 10, [40747008], 16384) = 16384 <0.000018>
18:35:12.002434 sendfile(7, 10, [40763392], 16384) = 16384 <0.000016>

I came up with an idea, sendfile will not block if we fill the tcp send buffer on every calling sendfile because there must be a TCP ACK if replica fd is writable. Right? So I use setsockopt to limit send buffer PROTO_IOBUF_LEN (anetSetSendBuffer(NULL,slave->conn->fd,PROTO_IOBUF_LEN/2);) . I can observe sendfile time-consuming is stable as follows, but the time of transfering RDB got longer because of too many small packets. That can be observed by 'tcpdump'. But I think we can't increase transfer size of sendfile simply since reading data from disk is blocked in sendfile

13:11:00.011137 sendfile(7, 10, [7892992], 16384) = 16384 <0.000016>
13:11:00.011301 sendfile(7, 10, [7909376], 16384) = 16384 <0.000016>
13:11:00.011466 sendfile(7, 10, [7925760], 16384) = 16384 <0.000015>
13:11:00.011629 sendfile(7, 10, [7942144], 16384) = 16384 <0.000016>
13:11:00.011792 sendfile(7, 10, [7958528], 16384) = 16384 <0.000016>
13:11:00.011956 sendfile(7, 10, [7974912], 16384) = 16384 <0.000015>
13:11:00.012120 sendfile(7, 10, [7991296], 16384) = 16384 <0.000016>
13:11:00.012286 sendfile(7, 10, [8007680], 16384) = 16384 <0.000017>
13:11:00.012449 sendfile(7, 10, [8024064], 16384) = 16384 <0.000015>
13:11:00.012612 sendfile(7, 10, [8040448], 16384) = 16384 <0.000016>
13:11:00.012775 sendfile(7, 10, [8056832], 16384) = 16384 <0.000015>
13:11:00.012939 sendfile(7, 10, [8073216], 16384) = 16384 <0.000016>
13:11:00.013103 sendfile(7, 10, [8089600], 16384) = 16384 <0.000016>
13:11:00.013268 sendfile(7, 10, [8105984], 16384) = 16384 <0.000016>
13:11:00.013431 sendfile(7, 10, [8122368], 16384) = 16384 <0.000016>
13:11:00.013595 sendfile(7, 10, [8138752], 16384) = 16384 <0.000016>
13:11:00.013759 sendfile(7, 10, [8155136], 16384) = 16384 <0.000016>
13:11:00.013922 sendfile(7, 10, [8171520], 16384) = 16384 <0.000016>
13:11:00.014087 sendfile(7, 10, [8187904], 16384) = 16384 <0.000015>
13:11:00.014251 sendfile(7, 10, [8204288], 16384) = 16384 <0.000017>
13:11:00.014414 sendfile(7, 10, [8220672], 16384) = 16384 <0.000016>
13:11:00.014581 sendfile(7, 10, [8237056], 16384) = 16384 <0.000015>
13:11:00.014742 sendfile(7, 10, [8253440], 16384) = 16384 <0.000016>
13:11:00.014907 sendfile(7, 10, [8269824], 16384) = 16384 <0.000016>
13:11:00.015070 sendfile(7, 10, [8286208], 16384) = 16384 <0.000015>
13:11:00.015234 sendfile(7, 10, [8302592], 16384) = 16384 <0.000016>
13:11:00.015397 sendfile(7, 10, [8318976], 16384) = 16384 <0.000016>
13:11:00.015561 sendfile(7, 10, [8335360], 16384) = 16384 <0.000015>
13:11:00.015725 sendfile(7, 10, [8351744], 16384) = 16384 <0.000015>
13:11:00.015888 sendfile(7, 10, [8368128], 16384) = 16384 <0.000015>
13:11:00.016082 sendfile(7, 10, [8384512], 16384) = 16384 <0.000015>
13:11:00.016290 sendfile(7, 10, [8400896], 16384) = 16384 <0.000018>
13:11:00.016535 sendfile(7, 10, [8417280], 16384) = 16384 <0.000016>
13:11:00.016699 sendfile(7, 10, [8433664], 16384) = 16384 <0.000016>
13:11:00.016943 sendfile(7, 10, [8450048], 16384) = 16384 <0.000016>
13:11:00.017189 sendfile(7, 10, [8466432], 16384) = 16384 <0.000016>
13:11:00.017353 sendfile(7, 10, [8482816], 16384) = 16384 <0.000016>

Maybe using a thread to send RDB file is a good idea, but that brings much complexity especially many replicas need to receive RDB files at the same time.

@JohnSully
Copy link
Contributor

JohnSully commented Jun 3, 2020

When I saw this change I thought it was a really good idea. I’m a bit sad it didn’t work out as well as hoped. A separate thread to send the RDB would be cool.

I get a consistent repro on my Ubuntu 18.04 machine run in WSL2’s virtual machine but I’ve seen it in more pure linux setups as well.

@ShooterIT
Copy link
Member Author

Hello @JohnSully would you mind to help verify my guess

I came up with an idea, sendfile will not block if we fill the tcp send buffer on every calling sendfile because there must be a TCP ACK if replica fd is writable. Right? So I use setsockopt to limit send buffer PROTO_IOBUF_LEN (anetSetSendBuffer(NULL,slave->conn->fd,PROTO_IOBUF_LEN/2);)

You could add the following code in https://github.com/antirez/redis/blob/unstable/src/replication.c#L1351, and do your tests in your environment.

#if HAVE_SENDFILE
                if(!server.tls_replication) anetSetSendBuffer(NULL,slave->conn->fd,PROTO_IOBUF_LEN/2);
#endif

@JohnSully
Copy link
Contributor

@ShooterIT I repro'd without tls compiled and in tests without TLS enabled. Are you sure this will still have an effect?

@ShooterIT
Copy link
Member Author

Hi @JohnSully That's OK. Now, TLS must be disable to enable sendfile.
For your repo, Maybe we need to add the following code in https://github.com/JohnSully/KeyDB/blob/unstable/src/replication.cpp#L1663

#if HAVE_SENDFILE
                if (!g_pserver->tls_replication && !g_pserver->fActiveReplica)
                    anetSetSendBuffer(NULL,slave->conn->fd,PROTO_IOBUF_LEN/2);
#endif

and add int anetSetSendBuffer(char *err, int fd, int buffsize); in src/anet.h

@ShooterIT
Copy link
Member Author

I can pass the test that you told me by above modifications, although passed the test before on my environments.
https://github.com/antirez/redis/pull/7096#issuecomment-637378543

@ShooterIT
Copy link
Member Author

Hi @JohnSully @oranagra @antirez I have some new discoveries.
I track down the implement of sendfile on linux , in linux source code
https://github.com/torvalds/linux/blob/master/fs/splice.c#L967
https://github.com/torvalds/linux/blob/master/fs/splice.c#L936
That means sendfile doesn't support nonblocking mode for out_fd, for this comment https://github.com/torvalds/linux/blob/master/fs/splice.c#L934, I consider it is wrong, should be
Block on output, we have to drain the direct pipe. Right?
I don't know much about linux kernel, so I am not sure about it, please help me to review this point.

What‘s more, I read man sendfile on my macOS, for error EAGIN, this is the description, same as FressBSD.

 [EAGAIN]           The socket is marked for non-blocking I/O and not all data was sent due to the socket buffer
                    being full.  If specified, the number of bytes successfully sent will be returned in *len.

In freeBSD source code,
https://github.com/freebsd/freebsd/blob/master/sys/kern/kern_sendfile.c#L776
https://github.com/freebsd/freebsd/blob/master/sys/kern/kern_sendfile.c#L811
Maybe the behavior of sendfile on BSD operation system is our want.

@oranagra
Copy link
Member

oranagra commented Jun 4, 2020

@ShooterIT it seems you are right.
@antirez i think we should revert and possibly make another quick release. (note there are a few other important fixes in other PRs)

@ShooterIT
Copy link
Member Author

Please revert it @antirez I am truly sorry.

@JohnSully
Copy link
Contributor

Hi @ShooterIT It will pass the test but only because sendfile is disabled for active replication which that test enables. It doesn't fundamentally fix the issue of slow startup. Active replication is a bit more sensitive because each server will send an RDB to the other so latency is 2x normal.

I initially disabled it only for active replication when I hadn't yet understood the impact of sendfile().

@antirez
Copy link
Contributor

antirez commented Jun 6, 2020

@ShooterIT no problem, there are complex interactions, we never are aware, you did nothing wrong here, instead we maintainers made an error by letting a new feature in that was just a performance improvement, without giving it the time to stay enough into unstable. A big thank you to @JohnSully for the heads up on that. To start I can revert the feature just this way:

I reverted the implementation both on Redis 6.0 and unstable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants