Experiences from Implementing Multiprocessor Support for an Industrial Operating System Kernel

Simon Kågström, Håkan Grahn, and Lars Lundberg


Department of Systems and Software Engineering, School of Engineering
Blekinge Institute of Technology, Ronneby, Sweden
{ska, hgr, llu}@bth.se

Abstract

The ongoing transition from uniprocessor to multiprocessor computers requires operating system support. There is a large body of specialized operating systems which require porting in order to work on multiprocessors. In this paper we describe the design and implementation of a multiprocessor port of a cluster operating system kernel. The multiprocessor support is implemented with a giant locking scheme, which allowed us to get an initial working version with only minor changes to the original code. We also discuss performance and implementation experiences.

1. Introduction

A current trend in the computer industry is the transition from uniprocessors to various kinds of multiprocessors. Apart from SMPs, many manufacturers are now presenting chip multiprocessors or simultaneous multithreaded CPUs [5] which allow more efficient use of chip area. This trend requires support from operating systems.

We are working on a project together with a producer of large industrial systems to provide multiprocessor support for an operating system kernel. The operating system is a proprietary, fault-tolerant industrial operating system, primarily used in telecommunications, with (soft) real-time response requirements, that runs on clusters of uniprocessor Intel IA-32 computers. The system can use either the Linux kernel or an in-house kernel, which performs better than Linux. In this paper, we describe the design and implementation of the initial multiprocessor support for the in-house kernel. More technical details about the implementation can be found in [1]. Some structure names and terms were changed to preserve the anonymity of our industrial partner.

The rest of the paper is structured as follows. Section 2 describes the operating system. Section 3 discusses the design of the multiprocessor support and Section 4 describes a performance evaluation. We thereafter discuss some experiences we made in Section 5 and conclude in Section 6.

2. The Operating System

2.1. General OS structure

Figure 1 shows the architecture of the operating system. The system exports a C++ or Java API to application programmers for the clusterware. The clusterware runs on top of one of the two kernels and provides access to a replicated RAM-resident database, cluster management, and a CORBA object broker. The cluster consists of processing nodes and gateway nodes. Processing nodes handle the workload and usually run the in-house kernel, while gateway nodes run Linux and act as front-ends to the cluster. The gateway nodes also provide logging support for the cluster nodes and do regular database backups to disk.

[Figure 1. The operating system architecture. The clusterware (node management, signalling) exports the C++ / Java API and runs on top of either the in-house kernel (processor nodes) or the Linux kernel (gateway nodes); the nodes are connected by a cluster-internal network, and the gateway nodes also connect to an external network.]

The operating system employs an asynchronous event-driven programming model, with short-lived processes activated by incoming workload. The system supports two types of processes, static and dynamic. Static processes are restarted on failure and can either be unique (one for the entire cluster) or replicated (one per node in the cluster, for lower communication costs). Dynamic processes are created when referenced by another process and usually run short jobs, e.g., checking and updating an entry in the database. Dynamic processes are often tied to database objects on the local node for fast access to the database.

2.2. The Process and Memory Model

The in-house kernel bases user programs on three basic entities: threads, processes, and containers. There is kernel-level support for threading, and threads define the basic unit of execution. The in-house kernel separates the concepts of process and protection domain. Processes are resource holders, while containers define the protection domain (an address space). To support the asynchronous programming model, the kernel supplies very fast creation and termination of processes. There are several mechanisms behind the fast process handling. Each code package (object code) is located at a unique virtual address range in the address space, and all code packages also reside permanently in memory. This allows fast setup of new containers since no new memory mappings are needed for object code, and it also means that there will never be any page faults on application code. Further, the in-house kernel does not use disk swapping, which simplifies page fault handling and reduces page fault latency.

The paging system uses a two-level paging structure on the IA-32, where 4KB pages are mapped into 4MB page tables which in turn are mapped into the complete 4GB address space in a page directory. A global page directory containing application code, kernel code, and kernel data is created during kernel startup, and this page directory can thereafter serve as the basis for subsequent page directories since containers share most of the address space. Containers start with only two memory pages allocated, one containing the page table and the other the first 4KB of the process stack. Because of this, the container can use the global page directory, only replacing the entry mapping the process stack, global variables, and part of the heap.
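
As a concrete illustration of the two-level structure described above, the following sketch shows how a 32-bit IA-32 virtual address is decomposed; the macro and type names are ours and do not appear in the kernel.

/* Sketch of IA-32 two-level address translation: a 32-bit virtual address
 * selects one of 1024 page directory entries (4MB each), one of 1024 page
 * table entries (one 4KB page each), and a 12-bit offset within the page,
 * covering the full 4GB address space. Names are illustrative only. */
#include <stdint.h>

#define PDIR_INDEX(va)  (((uint32_t)(va) >> 22) & 0x3ff)  /* which 4MB region */
#define PTAB_INDEX(va)  (((uint32_t)(va) >> 12) & 0x3ff)  /* which 4KB page   */
#define PAGE_OFFSET(va) ((uint32_t)(va) & 0xfff)          /* byte in the page */

typedef uint32_t pde_t;                          /* page directory entry */
typedef uint32_t pte_t;                          /* page table entry     */

struct page_directory { pde_t entry[1024]; };    /* maps the whole 4GB space */
struct page_table     { pte_t entry[1024]; };    /* maps one 4MB region      */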

3. Design of the Multiprocessor Support

For the first multiprocessor implementation, we employ a simple locking scheme where the entire kernel is protected by a single, "giant" lock (see Chapter 10 in [4]). The giant lock is acquired when the kernel is entered and released again on kernel exit. The advantage of the giant locking mechanism is that it requires only small changes to the code, since most of the uniprocessor semantics of the kernel can be kept. For the initial version, we deemed this important for correctness reasons and to get a working version early. However, the giant lock has performance shortcomings, especially for kernel-bound workloads.
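
A minimal sketch of the giant locking scheme follows; the spinlock and the entry/exit hook names are our own placeholders, not the kernel's actual symbols.

/* Giant locking sketch: one global spinlock is taken on every kernel entry
 * and released on kernel exit, so at most one CPU executes kernel code at a
 * time and the uniprocessor semantics of the kernel are preserved. */
static volatile int giant_lock;            /* 0 = free, 1 = held */

static void giant_acquire(void)
{
    /* Busy-wait until the atomic test-and-set succeeds (GCC builtin used
     * for illustration); this busy-waiting is the "spinning" time reported
     * in the evaluation. */
    while (__sync_lock_test_and_set(&giant_lock, 1))
        ;
}

static void giant_release(void)
{
    __sync_lock_release(&giant_lock);
}

void kernel_entry(void)    /* called when a CPU traps into the kernel */
{
    giant_acquire();
    /* ... original uniprocessor kernel code runs unchanged here ... */
}

void kernel_exit(void)     /* called just before returning to user mode */
{
    giant_release();
}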

3.1. CPU-local Data

Some structures in the kernel need to be accessed privately by each CPU. For example, the currently running thread and the kernel stack must be local to each CPU. To avoid having to convert every private structure into an array, we adopted an approach where each CPU always runs in a private address space, accessing CPU-local data in private memory. To achieve this, we reserve a 4KB virtual address range for CPU-local data and map this page to different physical pages for each CPU. We modified structure declarations to place the data in a special ELF section, which is then mapped and page-aligned by the boot loader.

The CPU-local page approach presents a few problems, however. First, some CPU-local structures are too large to fit in one page of memory. Second, handling of multithreaded processes must be modified with our approach, as described in the next section. The kernel stack, which is 128KB per CPU, is one structure which is too large to store in the CPU-local page. The address of the kernel stack is only needed at a few places, however, so we added a level of indirection to set the stack pointer register through a CPU-local pointer to the kernel stack top.
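
The sketch below illustrates the CPU-local page technique: data is placed in a dedicated ELF section that is linked at one fixed virtual range but backed by a different physical page on each CPU. The section name, attribute syntax, and identifiers are illustrative assumptions, and the actual stack switch would be done in the low-level entry code.

/* CPU-local data sketch: everything placed in the ".cpulocal" section ends
 * up in a single 4KB virtual range that the boot loader maps to a different
 * physical page on each CPU, so ordinary accesses are automatically private.
 * Section name and identifiers are ours, for illustration only. */
struct thread;                             /* opaque kernel thread type */

#define CPU_LOCAL __attribute__((section(".cpulocal")))

CPU_LOCAL struct thread *current_thread;   /* the thread running on this CPU */
CPU_LOCAL void *kernel_stack_top;          /* top of this CPU's 128KB kernel
                                              stack; the stack itself is too
                                              large for the CPU-local page   */

/* Conceptually, the assembly entry stub then loads the stack pointer through
 * the CPU-local pointer instead of from a fixed address:
 *
 *     movl kernel_stack_top, %esp
 */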

3.2. Multithreaded Processes

The CPU-local page presents a problem for multithreaded containers. Using a single address space for all CPUs on a multiprocessor would cause the CPU-local virtual page to map to the same physical page for all CPUs, i.e., the CPU-local variables would be the same for all CPUs. To solve this problem, a multithreaded container needs a separate page directory for every CPU executing threads in the container. Since multithreaded containers are fairly rare, we chose a lazy method for handling multithreaded containers. Our method allows singlethreaded containers to run with the same memory requirements as before, while multithreaded containers require one extra memory page per CPU which executes in the container. The method requires only minimal scheduler changes.

Figure 2 shows the handling of multithreaded containers on multiprocessors in the in-house kernel. The figure shows the container memory data structure, which (as before) has a container page directory pointer and an initial page directory entry, but is extended with an array of per-CPU page directory pointers. When the process starts up it will have only one thread, as in Figure 2a. The singlethreaded process starts without a private page directory and instead uses the global page directory. The global page directory is modified with a page table for the process stack, global variables, and part of the heap. This state will be kept as long as the process is singlethreaded and uses moderate amounts of heap or stack space.

[Figure 2. Handling of container address spaces in the in-house kernel for multiprocessor computers. Panels: (a) startup, singlethreaded; (b) single CPU, multithreaded; (c) multiple CPUs, multithreaded. The figure shows the container memory data structure (container page directory pointer, owner, per-CPU page directory pointers, initial page directory entry), the process stack page table, the global page directory, and the container-local and CPU-local page directories.]

When the process becomes multithreaded the first time, as shown in Figure 2b, a new container page directory is allocated and copied from the global page directory, with the current CPU set as owner. Apart from setting the owner, this step works exactly as in the uniprocessor version and, since thread stacks reside outside the 4MB process stack area, multithreaded processes will soon need a private address space even on uniprocessor hardware. As long as only one CPU executes in the process, only one page directory will be used. However, when another CPU schedules a thread in the process, a single page directory is no longer safe. The container page directory must be copied to a new CPU-local page directory, requiring one page directory per CPU active in the container. This is shown in Figure 2c.
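
To make this scheme concrete, the sketch below shows a possible shape of the container memory data structure from Figure 2 together with the lazy per-CPU page directory selection; the field names, helper functions, and constants (MAX_CPUS, alloc_page, global_page_directory) are our own assumptions, not the kernel's declarations.

/* Lazy per-CPU page directories for multithreaded containers (sketch). */
#include <stdint.h>
#include <string.h>

#define MAX_CPUS   2
#define PAGE_SIZE  4096

extern uint32_t *global_page_directory;   /* used by singlethreaded containers */
extern uint32_t *alloc_page(void);        /* placeholder page allocator        */

struct container_mem {
    uint32_t *container_pdir;       /* container page directory; NULL while the
                                       container is still singlethreaded       */
    int       owner;                /* CPU that owns container_pdir            */
    uint32_t *cpu_pdir[MAX_CPUS];   /* per-CPU page directory pointers         */
    uint32_t  initial_pde;          /* initial entry for stack/globals/heap    */
};

/* Called when CPU 'cpu' schedules a thread in the container. */
uint32_t *select_page_directory(struct container_mem *cm, int cpu)
{
    if (cm->container_pdir == NULL)        /* singlethreaded (Figure 2a):     */
        return global_page_directory;      /* keep using the global directory */

    if (cpu == cm->owner)                  /* owning CPU (Figure 2b) runs on  */
        return cm->container_pdir;         /* the container page directory    */

    if (cm->cpu_pdir[cpu] == NULL) {       /* another CPU enters (Figure 2c): */
        cm->cpu_pdir[cpu] = alloc_page();  /* one extra page for this CPU     */
        memcpy(cm->cpu_pdir[cpu], cm->container_pdir, PAGE_SIZE);
    }
    return cm->cpu_pdir[cpu];
}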

One complication with this scheme is page fault handling. If two or more CPUs run in a container, a page fault will be generated for the CPU-local page directory. We therefore modified the page fault handler to always update the container page directory beside the CPU-local page directory. However, there can still be inconsistencies if the fault is caused by the owner of the container page directory. A later access to the same page from another CPU will then cause a spurious page fault. We handle this situation lazily in the page fault handler by checking if the page was already mapped in the container page directory, in which case we just copy the entry to the faulting page directory. This situation is fairly uncommon since it only affects page directory faults, i.e., unmapped 4MB areas.
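
Building on the container_mem sketch above, the following illustrates the page fault handling just described for page directory (4MB) faults; make_page_table is a hypothetical helper that allocates and installs a page table.

/* Page directory fault handling sketch: the handler always updates the
 * container page directory in addition to the faulting CPU's directory,
 * and spurious faults are resolved by copying the already-present entry. */
extern uint32_t make_page_table(uint32_t fault_addr);   /* hypothetical helper */

void handle_pdir_fault(struct container_mem *cm, int cpu, uint32_t fault_addr)
{
    uint32_t idx = fault_addr >> 22;               /* 4MB page directory slot */
    uint32_t *faulting_pdir = (cpu == cm->owner) ? cm->container_pdir
                                                 : cm->cpu_pdir[cpu];

    if (cm->container_pdir[idx] != 0) {
        /* Spurious fault: the owner already mapped this 4MB area in the
         * container page directory; just copy the entry and return. */
        faulting_pdir[idx] = cm->container_pdir[idx];
        return;
    }

    /* Genuine fault: install a new page table in both the faulting CPU's
     * page directory and the container page directory. */
    uint32_t pde = make_page_table(fault_addr);
    faulting_pdir[idx] = pde;
    cm->container_pdir[idx] = pde;
}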

4. Evaluation

We have performed an initial evaluation of our multiprocessor implementation where we evaluate contention on our locking scheme as well as the performance of the multiprocessor port. We ran all performance measurements on a two-way 300MHz Pentium II SMP. For the performance evaluation, we constructed a benchmark application with two processes executing a loop. The loop performs system calls at configurable intervals so that we can vary the proportion of user to kernel execution. Apart from the benchmark processes, around 100 system threads handling database replication, logging, etc., were also running in the system. We measure the time from the start of the benchmark application until it finishes.

Consistent speedups on the multiprocessor are seen only when our benchmark application executes almost completely in user-mode, so the presented results refer to the case when the benchmark processes run only in user-mode. Executing the benchmark with the multiprocessor kernel on a uniprocessor gives a modest slowdown of around 2%, which suggests that our implementation can be used even on uniprocessor hardware. Running the benchmark on the multiprocessor gives a 20% speedup over the uniprocessor kernel, which was less than we expected. Since the two benchmark processes run completely in user-mode and do not interact with each other, we expected a speedup close to 2.0.

Table 1. Kernel/user time for UP and SMP.

          User-mode    Kernel     Spinning
   UP     64%          36%        < 0.1%
   SMP    55%-59%      20%-22%    20%-23%

Table 1 shows the lock contention during the benchmark run, both for the uniprocessor (where acquiring the lock always succeeds directly) and for the multiprocessor. From the table, we can see that the multiprocessor spends 20%-23% of the time spinning for the giant lock. Since the in-kernel time is serialized by the giant lock, the theoretical maximum speedup we can achieve on a dual processor system is (36+64)/(36+64/2) ≈ 1.47 according to Amdahl's law. There are several reasons why the speedup is only 1.2 for our benchmark. First, the in-kernel time is high due to running system processes. Second, some heavily accessed shared kernel structures, e.g., the ready queue, cause cache lines to move between processors, and third, as a consequence of the high in-kernel time, time is wasted spinning on the lock.
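
For clarity, this bound is the standard Amdahl's law estimate with the serialized (kernel) fraction taken from the uniprocessor row of Table 1:

    S(N) = \frac{1}{f_s + (1 - f_s)/N}, \qquad S(2) = \frac{1}{0.36 + 0.64/2} = \frac{1}{0.68} \approx 1.47

where f_s = 0.36 is the serialized in-kernel fraction and N = 2 is the number of processors.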

5. Implementation Experiences

The implementation of multiprocessor support for the in-house kernel was more time consuming than we had expected. The project has been ongoing part-time for two years, during which a single developer has performed the multiprocessor implementation, whereas we initially expected a first version within approximately six months. The reasons for the delay are manifold.

First, the development of a multiprocessor kernel is generally harder than that of a uniprocessor kernel because of inherent mutual exclusion issues. We also performed most of the implementation off-site, which made it harder to get assistance from the core developers. A highly specialized and complex build and setup process led us to spend a significant amount of time on configuration issues and build problems. Further, the code base consists of over 2.5 million lines of code, of which around 160,000 were relevant for our purposes. The complexity and volume of the code meant that we had to spend a lot of time to understand the code.

We did consider a number of other approaches apart from the giant lock for the initial implementation. Master-slave systems (refer to Chapter 9 in [4]) also allow only one processor in the kernel at a time. In master-slave systems, one "master" processor is dedicated to handling kernel operations while the other processors ("slaves") only run user-level applications. Similarly, the application kernel approach [2] executes all kernel operations on one processor, but allows the uniprocessor kernel to be kept as-is while multiprocessor support is added as a loadable kernel module. Neither of these approaches provides any additional performance benefit over giant locking, and incrementally improving the giant locking with finer-grained strategies is easier.

In the end, around 1% of the relevant code base was modified. We wrote around 2,300 lines of code in new files and modified 1,600 existing lines for the implementation. The new code implements processor startup and support for the locking scheme, whereas the modified lines implement CPU-local data, acquiring and releasing the giant lock, etc.

Future work related to the multiprocessor port will center around the following. Since the speedup with the giant lock is low, we will investigate a more fine-grained locking scheme, starting with coarse subsystem locks. First, we are planning to use a separate lock for low-level interrupt handling to get lower interrupt latency. Another area of possible improvement is the scheduler, where we will investigate dividing the common ready queue into one queue per processor as, e.g., recent versions of Linux do [3]. Finally, we would like to further explore CPU-affinity optimizations for short-lived processes. Currently, processes will not be moved from the processor they are started on, but for short-lived processes it could also be beneficial to restart them on the last processor they executed on.

6. Conclusions

In this paper, we have described the design and implementation of a multiprocessor port of a cluster operating system kernel. Since our focus was to get an initial version with low engineering effort, we chose a simple "giant" locking scheme where a single lock protects the entire kernel. Our model, where CPU-local variables are placed in a virtual address range mapped privately on different CPUs, limited the source code changes, and we also showed how this method can be applied to multithreaded processes with a very small additional memory penalty. Our experience illustrates that refactoring a large and complex uniprocessor kernel for multiprocessor operation is a substantial undertaking, but also that it is possible to implement multiprocessor support without intrusive changes to the original kernel, changing around 1% of the core parts of the kernel.

Acknowledgments

The authors would like to thank the anonymous reviewers and the PAARTS-group for their valuable comments. This work was partly funded by The Knowledge Foundation in Sweden under a research grant for the project "Blekinge - Engineering Software Qualities (BESQ)" (http://www.bth.se/besq).

References

[1] S. Kågström, H. Grahn, and L. Lundberg. The Design and Implementation of Multiprocessor Support for an Industrial Operating System Kernel. Technical Report 2005:3, ISSN 1103-1581, Blekinge Institute of Technology.
[2] S. Kågström, L. Lundberg, and H. Grahn. A novel method for adding multiprocessor support to a large and complex uniprocessor kernel. In Proceedings of the 18th International Parallel and Distributed Processing Symposium (IPDPS 2004), Santa Fe, NM, USA, April 2004.
[3] R. Love. Linux Kernel Development. Sams, Indianapolis, Indiana, 1st edition, 2003.
[4] C. Schimmel. UNIX Systems for Modern Architectures. Addison-Wesley, Boston, 1st edition, 1994.
[5] L. Spracklen and S. G. Abraham. Chip multithreading: Opportunities and challenges. In Proceedings of the 11th International Conference on High-Performance Computer Architecture (HPCA-11), pages 248–252, San Francisco, CA, USA, February 2005.
