Posts Tagged ‘rust’

A lot of people have been asking me how to use protocols in Rust lately, so I thought I’d write up a little tutorial. Custom protocols are how to get the biggest benefits from Rust’s communication system, as this is how you get the biggest safety guarantees and expose the most opportunities for optimization. It’s also more labor intensive, so some of the other library features such as streams or port sets might be a better starting point. This post, however, introduces how to write your own protocols.

Protocols are created using the proto! syntax extension. Each protocol has a collection of states, and each state in a protocol has a collection of messages that are allowed. Every protocol must have at least states, though states needn’t have any messages. Thus, we will start with the simplest possible protocol definition:

proto! simple_proto {
    StartState:send { }
}

This creates a protocol called simple_proto, with a single state called StartState. Protocols always start in the first state listed. There is one other decoration, :send. This indicates that StartState is a send state. Protocols are always written from the client perspective, so a send state means it is a state in which the client may send a message. The protocol compiler will generate a dual state for the server, meaning there will be a version of StartState that is expecting to receive a message.

Given this specification, the protocol compiler will create a module called simple_proto, that includes types and functions for communicating according to this protocol. One generated function is called init, which is used to create a pipe in the start state. We can start the protocol like this:

let (server, client) = simple_proto::init();

This creates a pair of endpoints. The client endpoint will have the type simple_proto::client::StartStart, meaning it is in the client’s version of StartState. The server endpoint, on the other hand, has type simple_proto::server::StartState, which means it is in the server’s version of the StartState. The difference means that client is expecting to send a message, while server is expecting to receive a message.

We can’t really do anything further with this protocol, since there are no messages defined. Strictly speaking, the server could try to receive, but it would block forever since there is no way for the sender to send a message. Let’s fix this by adding a message.

proto! simple_proto {
    StartState:send {
        SayHello -> StartState
    }
}

Messages have the form Name(arguments) -> NextState. In this case, we added a message called SayHello, which carries no data with it. After sending a SayHello message, the protocol transitions to (or stays in, really) the StartState. We can now write some functions that communicate with each other. Here’s an example client program:

fn client(+channel: simple_proto::client::StartState) {
    import simple_proto::client;

    client::SayHello(channel);
}

Receive, by itself, is a little trickier. I recommend using the select macro instead. For now, you can get the select macro here. Once macro import is working, the select macro will be included in the standard library. For now, to use the select macro, save it in a file called select-macro.rs and add the following lines near the top of your program.

fn macros() {
    include!("select-macro.rs");
}

Once you’ve done this, you can write the server as follows.

fn server(+channel: simple_proto::server::StartState) {
    import simple_proto::SayHello;

    select! {
        channel => {
            SayHello -> _channel {
                io::println("Client says hello!");
            }
        }
    }
}

Select allows you to provide a set of endpoints to listen for messages on, followed by actions to take depending on which one receives a message. In this case, we only have one endpoint, called channel, which is in the server StartState state. After the =>, there is a block describing message patterns and code to execute if the pattern is matched. In this case, we only have one pattern, SayHello -> _channel. This mirrors the definition of the message in the protocol specification. It says “if we receive a SayHello message, bind an endpoint in the next protocol state to _channel and execute the code in the following block.” We use _channel for the next state because in this case we are not planning on sending or receiving any more messages.

Now let’s make this protocol a little more interesting by adding a new state and a message that carries data. We will do this by letting the client ask the server’s name and wait for a reply. The new protocol looks like this:

proto! simple_proto {
    StartState:send {
        SayHello -> StartState,
        WhatsYourName -> GettingName
    }

    GettingName:recv {
        MyNameIs(~str) -> StartState
    }
}

We’ve added a new message to StartState, which the client uses to ask the server’s name. After sending this, the protocol transitions to the GettingName state, where the client will wait to receive the MyNameIs message from the server. At this point, the protocol moves back to the StartState, and we can do it all over again. We’ve added an argument to MyNameIs, which means this message carries a string with it. Our client code now looks like this:

fn client(+channel: simple_proto::client::StartState) {
    import simple_proto::client;
    import simple_proto::MyNameIs;

    let channel = client::SayHello(channel);
    let channel = client::WhatsYourName(channel);
    select! {
        channel => {
            MyNameIs(name) -> _channel {
                io::println(fmt!("The server is named %s", *name));
            }
        }
    }
}

At a high level, this code says hello to the server, then asks for it’s name, then waits for the response and reports the server’s name to the user. It probably looks a little add that every line starts with let channel = .... This is because endpoints are single use. Any time you send or receive a message on an endpoint, the endpoint is consumed. Fortunately, all the send and receive functions return a new endpoint that you can use to continue the protocol.

The use of select! here is similar to how it was in the previous server example, except that we’ve added name to the MyNameIs pattern. This matches the ~str parameter in the protocol specification, and it binds the string sent by the server to name, so that we can print it out in the handler code.

For the new server, we need to add another clause to the message patterns:

fn server(+channel: simple_proto::server::StartState) {
    import simple_proto::{SayHello, WhatsYourName};
    import simple_proto::server::MyNameIs;

    select! {
        channel => {
            SayHello -> _channel {
                io::println("Client says hello!");
            },
            WhatsYourName -> channel {
                MyNameIs(channel, ~"Bob");
            }
        }
    }
}

In this case, if we receive a WhatsYourName message, we send a MyNameIs message on the new endpoint (called channel), which contains the string ~"Bob", which is what this server has decided to call itself. The client will eventually receive this string and show it to the user.

This covers the basic definition and usage of protocols. There are several other features, however. This includes polymorphic states and terminal states. Polymorphic states allow you to create protocols that work for different types. One common example is the stream protocol, which lets you send a whole bunch of messages of a given type:

proto! stream {
    Stream:send<T:send> {
        Send(T) -> Stream<T>
    }
}

We can add as many type parameters as we want to each of the states, with arbitrary bounds as well. You’ll probably want all your data types to be send-bounded though. Then, each time a message transitions to a polymorphic state, it must provide type parameters. You can see this on the right had side of the Send message.

Sometimes, we want to have a message that ends the protocol. For example, in our previous example, we might want a GoodBye message. One way to do this is to make a state with no messages, and step to that:

proto! simple_proto {
    StartState:send {
        GoodBye -> Done,
    }

    Done { }
}

However, this is a little verbose, and it also hides that fact that you really intended the protocol to end there. Thus, there is a special form that indicates sending a message ends the protocol. We write it like this:

proto! simple_proto {
    StartState:send {
        GoodBye -> !
    }
}

Stepping to ! represents closing the protocol and is analogous to how a function that returns ! actually never returns. When a message steps to !, neither the corresponding send function nor the receive function will return a new endpoint, meaning there is no way you could even attempt to send messages on this connection.

I hope this helps to get started with protocols in Rust. There are a few other features, but this covers the basics. Please feel free to ask questions!

The benchmarks in my last post had one thing in common: all communication was one sender to one receiver. It’s surprising how often this is sufficient, but sooner or later we are going to need a way to have multiple tasks sending to the same receiver. I’ve been experimenting with two ways of doing many senders to different receivers, and I now have some results to show.

The pipes library includes a select operation. This lets you listen on several receive endpoints simultaneously. Unfortunately, the single-use nature of endpoints makes select a little clunky to use. To help alleviate this, I added a port_set to the library. Port sets allow you to easily treat several receive endpoints as a unit. This allows send to still be very fast, but receive is a little bit slower due to the overhead setting up and tearing down the select operation. The current implementation for select is O(n) in the number of endpoints, so this works well for small numbers of tasks, but breaks down as things get bigger.

The other option is to slow down the sending end, using something I call a shared_chan. This is a send endpoint wrapped in an exclusive ARC. Now all the senders have to contend with each other, but the receive side is exactly as cheap as before. For cases where you have a lot of senders that send messages relatively infrequently, this will likely outperform the port_set approach, at least until select is faster.

Both of these are sufficient to run the msgsend benchmark that I talked about at the beginning of all of this. Here are the results, combined with the previous numbers.

Language Messages per second Comparison
Rust port_set 881,578 232.8%
Scala 378,740 100.0%
Rust port/chan (updated) 227,020 59.9%
Rust shared_chan 173,436 45.8%
Erlang (Bare) 78,670 20.8%
Erlang (OTP) 76,405 20.2%

The most obvious thing is that the port_set version is over twice as fast as Scala, the previous winner. I also re-ran the port/chan version for comparison, and it got a little bit faster. There has been quite a bit of churn in Rust recently, so it’s quite possible that these showed up here as better performance.

Writing the port_set version proved the most interesting to me. Relying on select ended up relaxing some of the ordering guarantees. Previously if we had Task A send a message to Task C and then send a message to Task B, and then have Task B wait to receive message to from Task A and then send a message to Task C, we could count on Task C seeing Task A’s message before seeing Task B’s message. With the port_set, this is no longer true, although we still preserve the order in messages sent by a single task. An easy way to work around this, however, was to rely on pipe’s closure reporting ability. The server could tell when a worker would no longer send any more messages because it would detect when the worker closed its end of the pipe.

I hinted in my last post that pipes in Rust have very good performance. This falls out of the fact that the protocol specifications provide very strong static guarantees about what sorts of things can happen at runtime. This allows, among other things, for message send/receive fastpath that requires only two atomic swaps.

Let’s start with the message ring benchmark. I posted results from this earlier. This benchmark spins up a bunch of tasks that arrange themselves in a while. Each task sends a message to their right-hand neighbor, and receives a message from the left-hand neighbor. This repeats for a while. At the end, we look at the total time taken divided by the number of messages. This gives us roughly the fastest we can send and receive a message, modulo some task spawning overhead. The existing port/chan system was able to send about 250,000 messages per second, or one message every 3.9 µs. Here are the results for pipes:

Sent 1000000 messages in 0.227634 seconds
  4.39301e+06 messages / second
  0.227634 µs / message

This is about 17x faster!

It would be a bit dishonest to stop here, however. I wrote this benchmark specifically to make any new implementation really shine. The question is whether faster message passing makes a difference on bigger programs.

To test this, I started by updating the Graph500 Parallel Breadth First Search benchmark. This code gets its parallelism from std::par::map, which in turn is built on core::future. Future has a very simple parallel protocol; it just spawns a task to compute something, which then sends a single message back to the spawner. Porting this was a relatively small change, yet it got measurable speedups. Here are the results.

Benchmark Port/chan time (s) Pipe time (s) Improvement (%)
Graph500 PBFS 0.914772 0.777784 17.6%

The Rust benchmark suite also includes several benchmarks from the Computer Language Benchmarks Game (i.e. the Programming Language Shootout). Some of these, such as k-nucleotide, use Rust’s parallelism features. I went ahead and ported this benchmark over to use pipes, and there are the results.

Benchmark Port/chan time (s) Pipe time (s) Improvement (%)
Shootout K-Nucleotide 4.335 3.125 38.7%

Not too shabby. I’ve been working on porting other benchmarks as well. Some are more difficult because they do not fit the 1:1 nature of pipes very well. In the case of the shootout-threadring benchmark, it actually got significantly slower when I moved to pipes. The thread ring benchmark seems to mostly be measuring the time to switch between tasks, as only one should be runnable at any given time. My hypothesis is that because message passing got faster, this test now hammers the scheduler synchronization code harder, leading to more slowdown due to contention. We’ll need more testing to know for sure. At any rate, scheduler improvements (such as work stealing, which Ben Blum will be working on) should improve this benchmark as well.

Other than that, I’ve been working on rewriting more Rust code to see how it works with pipes versus ports and chans. It has been particularly informative to try to transition parts of Servo over to using pipes.

About a month ago, I posted that I was going to be working on improving Rust’s message passing performance. I quickly threw together a prototype of a new communication system based on a shared queue protected by a mutex. This was about twice as fast as the existing system, because it removed the global mutex from the messaging paths. This prototype hurt expressiveness somewhat, and still it seemed we could do a lot better.

Rust has some extremely powerful features in its type system. The fact that it can deal with concepts like uniqueness, initialization status, copyability, and other traits mean we can encode some very powerful invariants. Thus, I took some inspiration from the Singularity OS and set out to see if I could encode something like channel contracts in Rust. The result is a proposal for a feature I’m calling pipes.

The way pipes work is that when you create a pipe you get two endpoints that are forever entangled together. One endpoint can send one message, and the other endpoint can receive that one message. Sending and receiving destroys the endpoint, but the operation also produces a new endpoint to continue the communication. Endpoints have a state associated with them, which specifies which messages can be sent or received. This information is encoding in the type system, so Rust can statically guarantee that no task will send a message that is not legal in the given state. Pipes are not copyable; they are always for 1:1 communication. However, endpoints can be sent between tasks.

Critical to pipes are the associated protocol specification. Protocols have two views: the client and the server. Protocols are always written from the perspective of the client. This decision was arbitrary, but in general it makes sense to only write down one side of the protocol. The other perspective is generated by reversing the direction of all the messages. Here’s an example of what I’m envisioning for a protocol specification.

proto! bank {
    login:send {
        login(username, password) -> login_response
    }

    login_response:recv {
        ok -> connected,
        invalid -> login
    }

    connected:send {
        deposit(money) -> connected,
        withdrawal(amount) -> withdrawal_response
    }

    withdrawal_response:recv {
        money(money) -> connected,
        insufficient_funds -> connected
    }
}

This describes the protocol you might use in an online banking situation. The protocol has four states (login, login_response, connected and withdrawal_response), each one annotated with whether the sender is allowed to send or receive in that state. In this case, a client would start out in the login state, where the client can attempt to login with a username and password. After sending a login message, the protocol enters the login_response state, where the server informs the client that either the login succeeded (in which case the protocol transitions to the connected state), or the login failed, in which case the protocol returns to the login state and the client can retry.

From the connected state, the client can try to deposit or withdrawal money. We assume that depositing money never fails, so sending a deposit message results in the protocol staying in the connected state. On the other hand, withdrawal can fail, for example, if the account does not have enough money. To model this, sending a withdrawal message results in the protocol going to the withdrawal_response state. Here, the client waits to either receive the requested money, or for a message saying there was not enough money in the account. In both cases, we end up back in the connected state.

Below is a code example showing how a client might use this protocol.

fn bank_client(+bank: bank::client::login) {
    import bank::*;

    let bank = client::login(bank, "theincredibleholk", "1234");
    let bank = alt recv(bank) {
      some(ok(connected)) {
        #move(connected)
      }
      some(invalid(_)) { fail "login unsuccessful" }
      none { fail "bank closed the connection" }
    };

    let bank = client::deposit(bank, 100.00);
    let bank = client::withdrawal(bank, 50.00);
    alt recv(bank) {
      some(money(m, _)) {
        io::println("Yay! I got money!");
      }
      some(insufficient_funds(_)) {
        fail "someone stole my money"
      }
      none {
        fail "bank closed the connection"
      }
    }
}

All of this code in this posts works on the latest Rust compiler as of this morning. I’ve also started transitioning some of our benchmarks to the new pipe system, and the results have been impressive. I’ll have a post diving into the performance of pipes soon.

For the last week or so, I’ve been mostly fighting with the performance of the Rust compiler. The rustc performance graph from the RustBot page tells a lot of the story:

Performance graph for rustc

The wildly varying performance is largely due to a change I made to move the code for addition on vectors out of the compiler and into the core library. The compiler used to generate very optimized vector code. For example, when doing v += [x] (that is, appending a single element to a vector in place), the compiler would avoid allocating a temporary vector for the [x]. In effect, the compiler generated what amounts to an inline call to vec::push.

My work involved adding an overloaded addition operator to Rust’s core library, and changing the compiler to use that version instead of its generated version. The landing of this change corresponds to about build number 70 or so on the graph, where there is a small but significant jump in the time rustc took. Due to the way we handle overloaded operators, the compiler must insert many more vector copies than it did before. It seemed too good to be true that I would only lose a small amount of performance.

And it was too good to be true. It turns out through a bunch of rebasing, I had managed to undo a small switch in is_binopable that controlled whether my new code ever got called. In the meantime, Sully has been doing a lot of work on the way Rust handles vectors, and he landed a change around build 90 in the performance graph above that ended up causing Rust to use the library version of vector append instead. This was more like the performance decrease I was expecting.

In order to fix this, I had to track down all the places in the compiler where we were using the addition operator on vectors, and replace them with things like vec::push, vec::push_all, vec::append and vec::append_one. This was tedious work, and I did it in stages. Each spot where the execution time improves in the graph is where I landed a couple more of these changes. Once this was finally done, we can see that rustc’s performance was even slightly better than it was before. I believe this is because the library versions of these functions are able to remove some copies that rustc could not remove before. Sadly, the code is much uglier than it was before. We are discussing the ability to specify modes on the self argument, which will allow the compiler to again avoid unnecessary copies when using the plus operator. This should allow us to make using the plus operator just as fast as the underlying library functions.

The moral of the story is that copies can kill performance. Rust warns often when it is inserting copies you may not have expected. Pay attention to these warnings and you will write much faster code.

Previously, I talked about a couple of ways to improve the message passing performance in Rust. The obvious bottleneck was that sending and receiving both involved taking a lock that was shared between all tasks. In order to remove this lock, I set out to write a new port/channel system that relied much less on the runtime library.

In order to do this, we first needed a variant of the atomic reference counter that allows us to share mutable data. We accomplish this by adding a mutex and condition variable. The mutex is your standard pthread mutex, while we’ve implemented our own condition variable that the Rust scheduler is aware of. Because we’re using a standard pthread mutex, using this exclusive access is unsafe and should be used with care; if you’re not careful, you could deadlock Rust. Fortunately, the API for Rust’s low-level locks and condition variables makes it harder to accidentally hold a lock for unbounded amounts of time.

Once we have this, ports and channels simply become a locked atomically reference-counted queue. There’s no more global lock, and things are far simpler because we only need the Rust runtime support for providing mutual exclusion and condition variable signalling. This part didn’t take long to implement, and I was eager to try it out. I spent a little while writing up a new benchmark that would really show the benefit of avoiding the global lock, and when I went to run it, things crashed.

It turns out, I had discovered a bug whereby vector addition could copy things that weren’t copyable. I saw two approaches to fixing this: fixing trans (the Rust to LLVM translation pass), or moving vector addition out of trans and into the library. The idea behind the second option is that if we did this, vector addition would be going through the existing and well-tested function call path, rather than a special vector addition codegen path. This seemed like the best option overall, since the first option felt more like a band aid for the root cause that we have too much functionality that is duplicated in subtly different ways.

Thus, I set out to move vector addition to libcore. This exposed some subtle semantics issues around const vectors, but these were mostly not too painful to work out. My firsts working versions were too slow to finish building the compiler in a reasonable amount of time. I was thinking we’d end up taking a 20% to 2x performance hit by doing this change, but thanks to some heroic optimization help from Niko, we got the performance to the point where in some cases the new code even performs ever so slightly better than the old code. In the course of doing these changes, I also discovered another bug, that led to us leaking memory, and in the course of fixing that, I discovered a way to make Rust segfault. Ah, the life of a compiler writer.

At any rate, we have fixes to all of these bugs in the works, and things are working well enough to run a benchmark. This test creates a ring of tasks, who each send a message to their neighbor on one side and receive from the other side. Here are the numbers for the old messaging system.

Sent 1000000 messages in 3.88114 seconds
  257656 messages / second
  3.88114 μs / message

And here are the numbers for the new system.

Sent 1000000 messages in 1.87881 seconds
  532253 messages / second
  1.87881 μs / message

As you can see, we’re about 1.9x faster than we were before.

This new system doesn’t yet have the exact same features as the old system, and it needs some help with ergonomics. My next task will be to work on adding these missing things and hopefully be able to incrementally replace the old code with the new, faster version.

The code for this post isn’t yet in the main Rust tree, but it should be landing soon as we squash the bugs we found in the process of doing this new message passing system.

Today we’re going to take a brief look at message passing performance in Rust. Rust is already pretty quick on this front, but there are some really obvious possibilities for making it even faster. It looks like I’ll be spending most of the summer working on this, so I thought it’d be good to start out with some baseline performance numbers. Rust’s testsuite has a benchmark called msgsend.rs, which is a microbenchmark for message passing performance. It is based on a benchmark from Paul Keeble that original compared Erlang and Scala. Here’s a table summarizing the results:

Language Messages per second Comparison
Scala 378,740 100.0%
Rust 175,373 46.3%
Erlang (Bare) 78,670 20.8%
Erlang (OTP) 76,405 20.2%

These numbers were generated on an Early 2011 MacBook Pro 2.3 GHz Intel Core i7 with 8GB RAM and Lion 10.7.4. Don’t read too much into these numbers; they are incredibly unscientific. The only real use for them will be to see if running the same benchmark for Rust in a few weeks yields a larger number. It’s also worth noting that my results disagree with the original author, who saw Erlang performing about 6 times faster than Scala. I suspect that Erlang’s message passing may not be as efficient on Mac OS X, but I do not know what system the tests were originally run on.

So, where do we go from here? I mentioned that there are some obvious first steps. The most obvious is that right now sending a message involves taking a global lock. In effect, this means we can only send one message in a time. Suppose we had four tasks, A, B, C and D, where A was sending to B and C to D. Intuitively, these two sends should be completely independent of each other, but this is not currently the case in Rust. This is one thing I hope to change.

It would be nice if we could do at least some of the communication without any locks at all, using a lock-free data structure. This is worth experimenting with, but we’ll have to wait for the benchmarks to show whether this is the best idea.

Somewhat orthogonal to message passing performance, I hope to be able to implement more of the communication system in Rust itself. When the communication system was first written, we had to put almost all of it in the runtime library that was written in C++. Rust is significantly more powerful now, and it seems like it would not take too much more work to write the communication system almost entirely in Rust.

In my last post, we saw how to get a parallel speedup on a breadth first search in Rust. One major flaw with this was that we had to use unsafe pointers all over the place to prevent from copying large data structures around. Given that Rust’s slogan is “a safe, concurrent, practical language,” it would be terribly sad to sacrifice safety in order to get concurrency. Fortunately, the tricks we were doing before were mostly safe. A human reviewer could see without too much trouble that the data we were sharing was immutable (so we wouldn’t have to worry about race conditions), and that we were careful to manage task and data lifetimes so we wouldn’t have a use-after-free error. Now the goal is to teach the compiler to verify the safety of these patterns on its own.

There are two safety properties we are concerned with:

  1. Only immutable (i.e. constant) data can be shared.
  2. Shared data must remain live as long as any task might access it.

For point (1), we have solved this by introducing a const kind. For the purposes of Rust, constant types are basically those which do not contain the mut annotation anywhere in them. Getting the details of this just right takes some care, and we will probably have to tweak the rules as time goes on. One current weakness is that only item functions (i.e. those without closures) count as const. Many Rust types already end up being const, which probably falls out of Rust’s immutable by default philosophy.

For (2), we’ve added a new module to the library called arc. This stands for atomic reference counting. This is a somewhat unfortunate name, as it is so similar to the more common automatic reference counting. We’d be happy to entertain suggestions for better names. The idea behind ARC is that it wraps over some piece of constant data. The ARC is non-copyable, but it can be sent (this required another tweak to Rust’s kind system to allow sendable resources). When you want to access the data the ARC is managing, you can ask it for a reference. Importantly, ARC provides a clone operation, which increments the reference count and returns another ARC pointing to the same data. When the ARC object goes out of scope, it decrements the reference count and potentially frees the data it is wrapping. ARCs can be safely shared between tasks because all reference counting is done using atomic operations. Additionally, Rust’s region system ensures that the ARC object outlives any references to the data inside of the ARC.

We were able to implement ARC entirely in the Rust Standard Library, with a little bit of runtime support for atomic operatons. ARC itself required some unsafe code in its implementation, but this fits with the “practical” part of Rust. We can implement limited parts of the program using unsafe primitives, and bless them as safe through manual inspection. Users can then build safe programs from the components in the standard library. Indeed, the actual parallel breadth first code no longer needs the unsafe keyword.

Along the way, Niko and I were also able to solve some annoying compiler bugs that make our standard library much more functional. The BFS code is also in the main Rust tree now, and later today I’m hoping to land a few more changes to the libraries to make building other parallel programs easier.

I’m happy to report that my parallel breadth first search program in Rust gets a 2.8 to 3x speedup now on my quad core MacBook Pro. While not in the ideal 4-8x range, I’m pretty satisfied with a 3x speedup. As you recall from the last post, the majority of the time was still spent doing memory allocation and copies. I tried several things to finally get a speedup, only one of which had much noticeable effect:

  1. Allocate and update the result storage in place
  2. Reuse tasks
  3. Share more with unsafe pointers

Allocate and update in place

Allocation in place was my hypothesis from the last post. Basically, you allocate a big mutable vector for the results ahead of time. Then, each of the worker tasks write their results directly into the finally location, rather than incurring a bunch of memory copy costs to aggregate the results as a second step. Part of the reason why this seems like a good idea comes from plot below of CPU usage over time.

Profile showing periodic CPU spikes

The benchmark searches for 64 randomly selected keys. This graph shows about one and a half of the searches. Each search consists of a parallel step to update all the vertex colors (from white to gray to black), and then a synchronization step to aggregation the results together. This repeats until no more nodes are gray. Traversing the whole graph should take about the diameter of the graph, which is usually about 4. This matches what we see on the graph. There are a bunch of spikes all the cores are busy, followed by some sequential synchronization time. Each search actually consists of 8 spikes because there is a parallel step to test if we’re done and a parallel step to do the actual work.

The idea was that the sequential time between each spike came from appending all the intermediate result vectors together. By allocating in place, we should have been able to avoid this cost. Unfortunately, I did not see a noticeable performance improvement by doing this.

Reusing tasks

My next idea was that maybe we were spending too much time in task creation and destruction. To test this, I wrote a task pool, which would use the same tasks to process multiple work items. By design, Rust’s task primitives are very cheap, so this shouldn’t be necessary in practice. Once again, this transformation did not noticeably affect performance. This is actually good news, because it means Rust delivers on its goal of making spawning tasks cheap enough that developers shouldn’t worry about spawning too many.

Share more with unsafe pointers

After two failed attempts, Patrick Walton and I took a more in depth look at the profiles. We discovered that compiling with the -g flag makes Instruments able to show us a lot more useful information about where we are spending our time. Patrick’s diagnosis was that we were spending too much time allocating and copying, and he advised I try to track down the expensive copies. The profiles showed much of the time was spent in the beginning and end of the parallel block, meaning the problems were probably do to copying and freeing things in the closure. There are two fairly large structures that this code used: the edge list and the color vector. I replaced these with unsafe pointers to ensure they were actually shared rather than copied. Finally, I saw a parallel speedup! The edge list was a couple hundred megabytes, so this explains why copying it was rather expensive.

Future directions

My experience again shows several ways we can improve Rust. The most obvious is that we need a way to share data between tasks; copying is just too expensive. For the most part, this isn’t a problem since we would only be sharing read-only data. The trickiness comes in when we have to decide when to reclaim the shared data. One way of doing this is to use the patient parent pattern. A parent can spawn several parallel tasks which each get read-only access to all of the parent’s data. The parent suspends execution until all of the children complete, at which point the parent would free the data through the usual means. Here the children do not worry about freeing these structures at all.

Another approach is to use atomic reference counting for data that is shared between tasks. This gives more flexibility, because the lifetimes of tasks are not strictly bounded by their parents. It does mean we pay the cost of atomic reference counting, but this will hopefully not be bad since we only have to update reference counts when data structures cross task boundaries.

Finally, there are some smaller changes that would be nice. I spent a while worrying about time spent appending vectors. It might be worth adding a tree-vector to the standard library, which gives logarithmic time element lookup but constant time appending. Since the root cause of my performance problems was a large implicit copy, it’d be nice to have the compiler warn on potentially large copies, or maybe even make it an error where the programmer must explicitly copy something if this is really desired. As another optimization, it might be worth keeping tasks around once they die and reusing them to hopefully reduce the overhead of creating tasks even further. I have a page about this in the Rust wiki, but we’ll need more evidence that this is a good idea before going for it.

I’ve continued tuning the Graph500 code I talked about yesterday. I still haven’t managed to achieve a speedup over the sequential version, but I’m not running about an order of magnitude faster than I was yesterday.

If you recall, the last profile from yesterday showed that we were spending an enormous amount of time waiting on a spinlock related to malloc and free. The problem was basically that the closure we were passing to par::mapi included the adjacency lists for the whole graph. In order to guarantee safety, Rust was copying the adjacency lists into each task. Since these are immutable, however, we should have been able to share them in place.

It turns out Rust has a way to let you do this by dropping into the unsafe portions of the language. Now, instead of duplicating the closure, we simply pass an unsafe pointer to the closure to the spawned tasks. These dereference it in place, and we avoid the copy. Now, instead of having a 50-100x slowdown, we’re only at a 4-5x slowdown. The profile looks like this:

Graph500 profile showing very little time in the malloc spinlocks.

The troublesome spinlock from before only accounts for 3.7% of our time. The 4th function, which has 5.3% of the time, is the main worker for the breadth first search. Ideally, this would account for the majority of the time. The top two functions, however, appear to do with allocating and zeroing memory.

I also tried using unsafe pointers to the input vectors, but this did not make a material difference in the performance. I suspect the bzero and memmove time mostly comes from aggregating the results from each of the parallel tasks together. The next obvious optimizations are to try to pre-allocate the result space and update these in place.