First, let me paraphrase something from my LiveJournal profile: These posts are my own, and in particular do not necessarily reflect my employer's positions, strategies, or opinions.
With that said, some say that the current geopolitical outlook is grim. Far be it from me to minimize present-day geopolitical problems, and I am not at all interested in comparing them to their counterparts in the "good old days". But neither do I wish to obsess over these problems. I will instead call attention to a few instances of global cooperation, current and past.
Last month, NASA's oldest active astronaut traveled to Kazakhstan's Baikonur Cosmodrome, entered a Soyuz capsule atop a Roscosmos rocket and flew to the International Space Station. For me, this is especially inspiring: If he can do that at age 69, I should certainly be able to continue doing my much less demanding job for many years to come.
Some decades ago, during the Cold War, I purchased an English translation of Gradshteyn and Ryzhik's classic "Table of Integrals, Series, and Products". Although computer-algebra systems have largely replaced this book, I used it heavily in the 1980s and early 1990s, and have consulted it as recently as the past few years. Thus, along with many others, I am indebted to the longstanding Russian tradition of excellence in mathematics.
In order to save mass-storage space and to reduce boot times, rcutorture runs out of a tiny initrd filesystem that consists only of a root directory and a statically linked init program based on nolibc. This init program binds itself to a randomly chosen CPU, spins for a short time, sleeps for a short time, and repeats, the point being to inject at least a little userspace execution for the benefit of nohz_full CPUs.
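For reference, the core of that init program might look something like the following minimal sketch. (The real program is generated by rcutorture's scripting and built against nolibc; this sketch instead uses ordinary glibc headers for clarity, and the CPU count and delay values shown are made up.)

#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
	cpu_set_t mask;
	struct timespec ts = { .tv_sec = 0, .tv_nsec = 100 * 1000 * 1000 };

	for (;;) {
		/* Bind to a randomly chosen CPU, here assuming 64 CPUs
		 * and ignoring errors from offline CPUs. */
		CPU_ZERO(&mask);
		CPU_SET(rand() % 64, &mask);
		sched_setaffinity(0, sizeof(mask), &mask);

		/* Spin for a short time to inject userspace execution... */
		for (volatile long i = 0; i < 100000000; i++)
			continue;

		/* ...then sleep for a short time, and repeat. */
		nanosleep(&ts, NULL);
	}
}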
This works very well most of the time. But what if you want to use a full userspace when torturing RCU, perhaps because you want to test runtime changes to RCU's many sysfs parameters, to run heavier userspace loads on nohz_full CPUs, or even to run BPF programs?
What you do is go back to the old way of building rcutorture's initrd.
Which raises the question as to what the new way might be.
What rcutorture does is to look for the tools/testing/selftests/rcutorture/initrd directory, creating and populating it with the default nolibc-based init program only if it does not already exist.
Which means that you can put whatever initrd file tree you wish into that initrd directory, and as long as it contains a /init program, rcutorture will happily bundle that file tree into an initrd in each of the resulting rcutorture kernel images.
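For example, a hypothetical BusyBox-based tree might look as follows, keeping in mind that only the /init program is actually required:

tools/testing/selftests/rcutorture/initrd/
    init            (executable: a binary or a script)
    bin/busybox     (whatever userspace you wish)
    lib/...         (shared libraries, if init is dynamically linked)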
The v2023.06.11a release of Is Parallel Programming Hard, And, If So, What Can You Do About It? is now available! The double-column version is also available from arXiv.org.
This release contains a new section on thermal throttling (along with a new cartoon), improvements to the memory-ordering chapter (including intuitive subsets of the Linux-kernel memory model), fixes to the deferred-processing chapter, additional clocksource-deviation material to the "What Time Is It?" section, and numerous fixes inspired by questions and comments from readers. Discussions with Yariv Aridor were especially fruitful. Akira Yokosawa contributed some quick quizzes and other upgrades of the technical discussions, along with a great many improvements to grammar, glossaries, epigraphs, and the build system. Leonardo Bras also provided some much-appreciated build-system improvements, and also started up continuous integration for some of the code samples.
Elad Lahav, Alan Huang, Zhouyi Zhou, and especially SeongJae Park contributed numerous excellent fixes for grammatical and typographical errors. SeongJae's fixes were from his Korean translation of this book.
Elad Lahav, Alan Huang, and Patrick Pan carried out some much-needed review of the code samples and contributed greatly appreciated fixes and improvements. In some cases, they dragged the code kicking and screaming into the 2020s. :-)
Under Construction
A correspondent closed out 2022 by sending me an off-list email asking whether or not a pair of Rust crates (rcu_clean and left_right) were really implementations of read-copy update (RCU), with an LWN commenter throwing in crossbeam's epoch crate for good measure. At first glance, these are simple yes/no questions that one should be able to answer off the cuff.
What Is An RCU?
Except that there is quite a variety of RCU implementations in the wild. Even if we remain within the cozy confines of the Linux kernel, we have: (1) The original "vanilla" RCU, (2) Sleepable RCU (SRCU), (3) Tasks RCU, (4) Tasks Rude RCU, and (5) Tasks Trace RCU. These differ not just in performance characteristics; in fact, it is not in general possible to mechanically convert (say) SRCU to RCU. The key attributes of RCU implementations are the marking of read-side code regions and data accesses on the one hand and some means of waiting on all pre-existing readers on the other. For more detail, see the 2019 LWN article, and for more background, see the Linux Foundation RCU presentations here and here.
The next sections provide an overview of the Linux-kernel RCU implementations' functional properties, with performance and scalability characteristics left as an exercise for the interested reader.
Vanilla RCU
Vanilla RCU has quite a variety of bells and whistles:
- Explicit, nestable read-side markers, rcu_read_lock(), rcu_read_unlock(), rcu_dereference(), and friends.
- Pointer-update function, rcu_assign_pointer().
- Synchronous grace-period-wait primitives, synchronize_rcu() and synchronize_rcu_expedited().
- An asynchronous grace-period-wait primitive, call_rcu(), along with a synchronous callback-wait primitive, rcu_barrier().
- Polled grace-period wait primitives.
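For those who would like to see a few of these bells and whistles in context, here is a minimal and quite conventional usage sketch, in which struct foo, do_something(), and update_lock are stand-ins invented for illustration:

struct foo {
	int a;
};
struct foo __rcu *global_foo;

void reader(void)
{
	struct foo *p;

	rcu_read_lock();			/* Begin read-side critical section. */
	p = rcu_dereference(global_foo);	/* Safely fetch the pointer. */
	if (p)
		do_something(p->a);
	rcu_read_unlock();			/* End read-side critical section. */
}

void updater(struct foo *newp)
{
	struct foo *oldp;

	spin_lock(&update_lock);
	oldp = rcu_dereference_protected(global_foo,
					 lockdep_is_held(&update_lock));
	rcu_assign_pointer(global_foo, newp);	/* Publish the new version. */
	spin_unlock(&update_lock);
	synchronize_rcu();	/* Wait for pre-existing readers to finish. */
	kfree(oldp);		/* Now safe to free the old version. */
}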
Sleepable RCU (SRCU)
A previous post in this series showed how you can use the --bootargs parameter and .boot files to supply kernel boot parameters to the kernels under test. This works, but, as is often the case with the Linux kernel, there is another way. This other way is Masami Hiramatsu's bootconfig facility, which is nicely documented in detail here. This blog post is a how-to guide on making use of bootconfig when running rcutorture.
The bootconfig facility allows kernel boot parameters to be built into the initrd or directly into the kernel itself, the latter being the method used here. This requires that the kernel build system be informed of the parameters. Suppose that these parameters are placed in a file named /tmp/dump_tree.bootparam as follows:
kernel.rcutree.dump_tree=1
kernel.rcutree.blimit=15
Note well the "kernel." prefix, which is required here. The other option is an "init." prefix, which would cause the parameter to instead be passed to the init process.
Then the following three Kconfig options inform the build system of this file:
CONFIG_BOOT_CONFIG=y
CONFIG_BOOT_CONFIG_EMBED=y
CONFIG_BOOT_CONFIG_EMBED_FILE="/tmp/dump_tree.bootparam"
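Assuming the usual rcutorture scripting, these Kconfig options can then be supplied to a run via kvm.sh's --kconfig argument, along the following lines (quoting may need adjustment for your shell):

tools/testing/selftests/rcutorture/bin/kvm.sh --kconfig 'CONFIG_BOOT_CONFIG=y CONFIG_BOOT_CONFIG_EMBED=y CONFIG_BOOT_CONFIG_EMBED_FILE="/tmp/dump_tree.bootparam"'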
A few years ago, I posted on the challenges of maintaining low weight as one ages. I have managed to stay near my target weight, with the occasional excursion in either direction, though admittedly more often up than down. My suspicion that maintaining weight would prove 90% as difficult as losing it has turned out to be all too well founded, as has the observation that exercise is inherently damaging to muscles (see for example here), especially as one's body's ability to repair itself decreases inexorably with age.
It can be helpful to refer back to those old college physics courses. One helpful formula is the well-worn Newtonian formula for kinetic energy, which is equal to half your mass times the square of your velocity. Now, the human body does not maintain precisely the same speed while moving (that is after all what bicycles are for), and the faster you are going, the more energy your body must absorb when decreasing your velocity by a set amount on each footfall. In fact, this amount of energy increases linearly with your average velocity. So you can reduce the energy absorption (and thus the muscle and joint damage) by decreasing your speed. And here you were wondering why old people often move much more slowly than do young people!
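For those who would like to see the arithmetic, suppose that each footfall reduces your speed from $v$ to $v - \Delta v$. The energy that your body must then absorb is

  \Delta E = \frac{1}{2} m v^2 - \frac{1}{2} m (v - \Delta v)^2 = m v \, \Delta v - \frac{1}{2} m (\Delta v)^2 \approx m v \, \Delta v

so that, for a fixed per-footfall speed reduction $\Delta v$, the absorbed energy does indeed grow linearly with your average velocity $v$.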
I had the privilege of presenting Unraveling Fence & RCU Mysteries (C++ Concurrency Fundamentals) to the CPP Summit. As the title suggests, this covered RCU from a C++ viewpoint. At the end, there were several excellent questions:
- How does the rcu_synchronize() wait-for-readers operation work?
- But the rcu_domain class contains lock() and unlock() member functions!!!
- Lockless things make me nervous!
There was limited time for questions, and each question's answer could easily have consumed the 50 minutes allotted for the entire talk. Therefore, I address these questions in the following sections.
How Does rcu_synchronize() Work?
There are a great many ways to make this work. Very roughly speaking, userspace RCU implementations usually have per-thread counters that are updated by readers and sampled by updaters, with the updaters waiting for all of the counters to reach zero. There are a large number of pitfalls and optimizations, some of which are covered in the 2012 Transactions On Parallel and Distributed Systems paper (non-paywalled draft). The most detailed discussion is in the supplementary materials.
More recent information may be found in Section 9.5 of Is Parallel Programming Hard, And, If So, What Can You Do About It?
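For concreteness, here is a toy C11 sketch of the counter-based approach, assuming a fixed set of pre-registered threads, each of which has stored its index into my_id. Production implementations such as liburcu add grace-period phases, carefully chosen memory ordering, and blocking in place of spinning, none of which appear here:

#include <stdatomic.h>

#define MAX_THREADS 128

static _Atomic int reader_count[MAX_THREADS];
static _Thread_local int my_id;	/* Set at (omitted) thread registration. */

void toy_rcu_read_lock(void)
{
	atomic_fetch_add(&reader_count[my_id], 1);
}

void toy_rcu_read_unlock(void)
{
	atomic_fetch_sub(&reader_count[my_id], 1);
}

void toy_rcu_synchronize(void)
{
	/*
	 * Wait until each thread's counter has been observed at zero.
	 * Lacking grace-period phases, this can also wait on readers
	 * that started after this function was invoked, which is safe
	 * but can needlessly delay the updater.
	 */
	for (int i = 0; i < MAX_THREADS; i++)
		while (atomic_load(&reader_count[i]))
			continue;	/* Spin. */
}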
The rcu_domain Class Contains lock() and unlock() Member Functions?
Indeed it does!
This version boasts an expanded index and API index, and also adds a number of improvements, perhaps most notably boldface for the most pertinent pages for a given index entry, courtesy of Akira Yokosawa. Akira also further improved the new ebook-friendly PDFs, expanded the list of acronyms, updated the build system to allow different perfbook formats to be built concurrently, adjusted for Ghostscript changes, carried out per-Linux-version updates, and did a great deal of formatting and other cleanup.
One of the code samples now uses C11 thread-local storage instead of the GCC __thread storage class, courtesy of Elad Lahav. Elad also added support for building code samples on QNX.
Johann Klähn, SeongJae Park, Xuwei Fu, and Zhouyi Zhou provided many welcome fixes throughout the book.
This release also includes a number of updates to RCU, memory ordering, locking, and non-blocking synchronization, as well as additional information on the combined use of synchronization mechanisms.
Device Communication and Memory Ordering
Andreas Hindborg talked about his work on a Rust-language PCI NVMe driver for the x86 architecture of the Linux kernel. I focus on the driver's interaction with device firmware in main memory. As is often the case, there is a shared structure in normal memory in which the device firmware reports the status of an I/O request. This structure has a special 16-bit field that is used to report that all of the other fields are now filled in, so that the Linux device driver can safely access them.

This of course requires some sort of memory ordering, both on the part of the device firmware and on the part of the device driver. One straightforward approach is for the driver to use smp_load_acquire() to access that 16-bit field, and only if the returned value indicates that the other fields have been filled in, access those other fields.
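In Linux-kernel C, and with a structure layout and names invented purely for illustration, this straightforward approach might look as follows:

struct fw_status_block {
	u64 result;		/* Filled in by device firmware. */
	u16 command_id;		/* Filled in by device firmware. */
	u16 done;		/* Written last by the firmware. */
};

/*
 * The acquire load of ->done orders the subsequent reads of the other
 * fields after it, so if ->done indicates completion, those reads are
 * guaranteed to observe the firmware's values.  FW_DONE and
 * handle_completion() are hypothetical.
 */
static bool poll_status(struct fw_status_block *s)
{
	if (smp_load_acquire(&s->done) != FW_DONE)
		return false;
	handle_completion(s->command_id, s->result);
	return true;
}

The firmware must of course supply the corresponding ordering on its side, writing the 16-bit field last with release semantics.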
To the surprise of several Linux-kernel developers in attendance, Andreas's driver instead does a volatile load of the entire structure, 16-bit field and all. If the 16-bit field indicates that all the fields have been updated by the firmware, it then does a second volatile load of the entire structure and then proceeds to use the values from this second load. Apparently, the performance degradation from these duplicate accesses, though measurable, is quite small. However, given the possibility of larger structures for which the performance degradation might be more profound, Andreas was urged to come up with a more elaborate Rust construction that would permit separate access to the 16-bit field, thus avoiding the redundant memory traffic.
In a subsequent conversation, Andreas noted that the overall performance degradation of this Rust-language prototype compared to the C-language version is at most 2.61%, and that in one case there is an as-yet-unexplained performance increase of 0.67%. Of course, given that both C and Rust are compiled, statically typed, non-garbage-collected, and handled by the same compiler back-end, it should not be a surprise that they achieve similar performance. An interesting question is whether the larger amount of information available to the Rust compiler will ever allow it to consistently provide better performance than does C.
Rust and Linux-Kernel Filesystems
Wedson Almeida Filho described his work on a Rust implementation of portions of the Linux kernel's 9p filesystem. The parts of this effort that caught my attention were the use of the Rust language's asynchronous facilities to implement some of the required state machines and the use of a Rust-language type-aware packet-decoding facility.

So what do these two features have to do with concurrency in general, let alone memory ordering in particular?
Not much. Sure, call_rcu() is (most of) the async equivalent of synchronize_rcu(), but that should be well-known and thus not particularly newsworthy.
But these two features might have considerable influence over the success or lack thereof for the Rust-for-Linux effort.
In the past, much of the justification for use of Rust in Linux has been memory safety, based on the fact that 70% of the Linux exploits have involved memory unsafety. Suppose that Rust manages to address that entire 70%, as opposed to relegating (say) half of it to Rust-language unsafe code. Eliminating 70% of the exploits would leave the other 30%, so use of Rust might at best provide a factor-of-three improvement.
Of course, a factor of three is quite good in absolute terms. However, in my experience, at least an order of magnitude is usually required to motivate, with high probability, the kind of large change represented by the C-to-Rust transition. Therefore, in order for me to rate the Rust-for-Linux project's success as highly probable, another factor of three is required.
One possible source of the additional factor of three might come from the Rust-for-Linux community giving neglected drivers some much-needed attention. Perhaps the 9p work would qualify, but it seems unlikely that the NVMe drivers are lacking attention.
Perhaps Rust async-based state machines and type-aware packet decoding will go some ways towards providing more of the additional factor of three. Or perhaps a single factor of three will suffice in this particular case. Who knows?
“Pinning” for the Linux Kernel in Rust
Benno Lossin and Gary Guo reported on their respective efforts to implement pinning in Rust for the Linux kernel (a more detailed report may be found here).

But perhaps you, like me, are wondering just what pinning is and why the Linux kernel needs it.
It turns out that the Rust language by default allows the compiler great freedom to rearrange data structures, similar to what would happen in C-language code if each and every structure were marked with the Linux kernel's __randomize_layout directive. In addition, by default, Rust-language data can be moved at runtime.
This apparently allows additional optimizations, but it is not particularly friendly to Linux-kernel idioms that take addresses of fields in structures. Such idioms include pretty much all of Linux's linked-list code, as well as things like RCU callbacks. And who knows, perhaps this situation helps to explain recent hostility towards linked lists expressed on social media. ;-)
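To see why movability is a problem, consider the address-of-field idiom underlying Linux's intrusive lists, sketched here with an invented structure:

struct foo {
	struct list_head list;	/* ->next/->prev point at neighboring
				 * list_head structures, and those
				 * neighbors point back at this one. */
	int data;
};

/*
 * If a list-resident struct foo were bytewise moved to a new address,
 * its neighbors' ->next and ->prev pointers would still reference the
 * old address, which is exactly the sort of memory unsafety that Rust
 * is intended to rule out.
 */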
The Rust language accommodates these idioms by allowing data to be pinned, thus preventing the compiler from moving it around.
However, pinning data has consequences, including the fact that such pinning is visible in Rust's type system. This visibility allows Rust compilers to complain when (for example) list_for_each_entry() is passed an unpinned data structure. Such complaints are important, given that passing unpinned data to primitives expecting pinned data would result in memory unsafety. The code managing the pinning is expected to be optimized away, but there are concerns that it would make itself all too apparent during code review and debugging.
Benno's and Gary's approaches were compared and contrasted, but my guess is that additional work will be required to completely solve this problem. Some attendees would like to see pinning addressed at the Rust-language level rather than at the library level, but time will tell what approach is most appropriate.
RCU Use Cases
Although RCU was not an official topic, for some reason it came up frequently during the hallway tracks that I participated in. Which is probably a good thing. ;-)

One question is exactly how the various RCU use cases should be represented in Rust. Rust's atomic-reference-count facility has been put forward, and it is quite possible that, in combination with liberal use of atomics, it will serve for some of the more popular use cases. Wedson suggested that Revocable covers another group of use cases, and at first glance, it appears that Wedson might be quite right. Still others might be handled by wait-wakeup mechanisms. There are still some use cases to be addressed, but some progress is being made.
Rust and Python
Some people suggested that Rust might take over from Python, adding that Rust solves many problems that Python is said to encounter in large-scale programs, and that Rust should prove more ergonomic than Python. The first claim is hard to argue with, given that Python seems to be most successful in smaller-scale projects that make good use of Python's extensive libraries. The second seems quite unlikely to me; in fact, it is hard for me to imagine that anything about Rust would seem in any way ergonomic to a dyed-in-the-wool Python developer.

One particularly non-ergonomic attribute of current Rust is likely to be its libraries, which, last I checked, were considerably less extensive than those of Python. Although the software-engineering shortcomings of large-scale Python projects can and have motivated conversion to Rust, it appears to me that smaller Python projects making heavier use of a wider variety of libraries might need more motivation.
Automatic conversion of Python libraries, anyone? ;-)
Conclusions
I found Kangrejos 2022 to be extremely educational and thought-provoking, and am very glad that I was able to attend. I am still uncertain about Rust's prospects in the Linux kernel, but the session provided some good examples of actual work done and problems solved. Perhaps Rust's async and type-sensitive packet-decoding features will help increase its probability of success, and perhaps there are additional Rust features waiting to be applied, for example, use of Rust iterators to simplify lockdep cycle detection. And who knows? Perhaps future versions of Rust will be able to offer consistent performance advantages. Stranger things have happened!

History
September 19, 2022: Changes based on LPC and LWN reports.

TL;DR: In complete consonance with a depressingly large fraction of 21st-century discourse, the Point and Counterpoint authors largely talk past each other.
On the Model of Computation: Point: We Must Extend Our Model of Computation to Account for Cost and Location
Dally makes a number of good points. First, he notes that past decades have seen a shift from computation being more expensive than memory references to memory references being more expensive than computation, and by a good many orders of magnitude. Second, he points out that in production, constant factors can be more important than asymptotic behavior. Third, he describes situations where hardware caches are not a silver bullet. Fourth, he calls out the necessity of appropriate computational models for performance programming, though to his credit, he does clearly state that not all programming is performance programming. Fifth, he calls out the global need for reduced energy expended on computation, motivated by projected increases in computation-driven energy demand. Sixth and finally, he advocates for a new computational model that builds on PRAM by (1) assigning different costs to computation and to memory references and (2) making memory-reference costs a function of a distance metric based on locations assigned to processing units and memories.

On the Model of Computation: Counterpoint: Parallel Programming Wall and Multicore Software Spiral: Denial Hence Crisis
Vishkin also makes some interesting points. First, he looks back to the days of Moore's-Law frequency scaling, calling for the establishment of a similar “free lunch” scaling in terms of numbers of CPUs. He echoes Dally's point about not all programming being performance programming, but argues that in addition to a “memory wall” and an “energy wall”, there is also a “parallel programming wall”, because programming multicore systems is “simply too difficult”. Second, Vishkin calls on hardware vendors to take more account of ease of use when creating computer systems, including systems with GPUs. Third, Vishkin argues that compilers should take up much of the burden of squeezing performance out of hardware. Fourth, he makes several arguments that concurrent systems should adapt to PRAM rather than adapting PRAM to existing hardware. Fifth and finally, he calls for funding for a “multicore software spiral” that he hopes will bring back free-lunch performance increases driven by Moore's Law, but based on increasing numbers of cores rather than increasing clock frequency.

Discussion
Figure 2.3 of Is Parallel Programming Hard, And, If So, What Can You Do About It? (AKA “perfbook”) shows the “Iron Triangle” of concurrent programming. Vishkin focuses primarily on the upper edge of this triangle, which is primarily about productivity. Dally focuses primarily on the lower left edge of this triangle, which is primarily about performance. Except that Vishkin's points about the cost of performance programming are all too valid, which means that defraying the cost of performance programming in the work-a-day world requires very large numbers of users. This is often accomplished by making performance-oriented concurrent software quite general, thus amortizing its cost over large numbers of use cases and users. This puts much performance-oriented concurrent software at the lower tip of the iron triangle, where performance meets generality.

One could argue that Vishkin's proposed relegation of performance-oriented code to compilers is one way to break the iron triangle of concurrent programming, but this argument fails because compilers are themselves low-level software (thus at the lower tip of the triangle). Worse yet, there have been many attempts to put the full concurrent-programming burden on the compiler (or onto transactional memory), but rarely to good effect. Perhaps SQL is the exception that proves this rule, but this is not an example that Vishkin calls out.
Both Dally and Vishkin correctly point to the ubiquity of multicore systems, so it is only reasonable to look at how these systems are handled in practice. After all, it is only fair to check their apparent assumption of “badly”. And Section 2.3 of perfbook lists several straightforward approaches: (1) Using multiple instances of a sequential application, (2) Using the large quantity of existing parallel software, ranging from operating-system kernels to database servers (SQL again!) to numerical software to image-processing libraries to machine-learning libraries, to name but a few of the areas that Python developers could teach us about, and (3) Applying sequential performance optimizations such that single-threaded performance suffices. Those taking door 1 or door 2 will need only a few parallel performance programmers, and it seems to have proven eminently feasible to have those few use a sufficiently accurate computational model. The remaining users simply invoke the software produced by these parallel performance programmers, and need not worry much about concurrency. The cases where none of these three doors apply are more challenging, but surmounting the occasional challenge is part of the programming game, not just the parallel-programming game.
One additional historical trend is the sharply decreasing cost of concurrent systems, both in terms of purchase price and energy consumption. Where an entry-level late-1980s parallel system cost millions of dollars and consumed kilowatts, an entry-level 2022 system can be had for less than $100. This means that there is no longer an overwhelming economic imperative to extract every ounce of performance from most year-2022 concurrent systems. For example, much smartphone code attains the required performance while running single-threaded, which means that additional performance from parallelism could well be a waste of energy. Instead, the smartphone automatically powers down the unused hardware, thus providing only the performance that is actually needed, and does so in an energy-efficient manner. The occasional heavy-weight application might turn on all the CPUs, but only such applications need the ministrations of parallel performance programmers and complex computational models.
Thus, Dally and Vishkin are talking past each other, describing different types of software produced by different types of developers, each of whom can use whatever computational model is appropriate to the task at hand.
Which raises the question of whether there might be a one-size-fits-all computational model. I have my doubts. In my 45 years of supporting myself and (later) my family by developing software, I have seen many models, frameworks, languages, and libraries come and go. In contrast, to the best of my knowledge, the speed of light and the sizes of the various types of atoms have remained constant. Don't get me wrong, I do hope that the trend towards vertical stacking of electronics within CPU chips will continue, and I also hope that this trend will result in smaller systems that are correspondingly less constrained by speed-of-light and size-of-atoms limitations. However, the laws of physics appear to be exceedingly stubborn things, and we would therefore be well-advised to work together. Dally's attempt to dictate current hardware conveniences to software developers is about as likely to succeed as is Vishkin's attempt to dictate current software conveniences to hardware developers. Both of these approaches are increasingly likely to fail should the trend towards special-purpose hardware continue full force. Software developers cannot always ignore the laws of physics, and by the same token, hardware developers cannot always ignore the large quantity of existing software and the large number of existing software developers. To be sure, in order to continue to break new ground, both software and hardware developers will need to continue to learn new tricks, but many of these tricks will require coordinated hardware/software effort.
Sadly, there is no hint in either Dally's or Vishkin's article of any attempt to learn what the many successful developers of concurrent software actually do. Thus, it is entirely possible that neither Dally's nor Vishkin's preferred concurrent computational model is applicable to actual parallel programming practice. In fact, in my experience, developers often use the actual hardware and software as the model, driving their development and optimization efforts from actual measurements and from intuitions built up from those measurements. Some might look down on this approach as being both non-portable and non-mathematical, but portability is often achieved simply by testing on a variety of hardware, courtesy of the relatively low prices of modern computer systems. As for mathematics, those of us who appreciate mathematics would do well to admit that it is often much more efficient to let the objective universe carry out the mathematical calculations in its normal implicit and ineffable manner, especially given the complexity of present day hardware and software.
Circling back to the cynics, there are no doubt some cynics that would summarize this blog post as “please download and read my book”. Except that in contrast to the hypothesized cynical summarization of Dally's and Vishkin's articles, you can download and read my book free of charge. Which is by design, given that the whole point is to promulgate information. Cue cynics calling out which of the three of us is the greater fool! ;-)