Linux Programming by Example

Copyright

Preface
Audience
What You Will Learn
Small Is Beautiful: Unix Programs
Standards
Features and Power: GNU Programs
Summary of Chapters
Typographical Conventions
Where to Get Unix and GNU Source Code
Unix Code
GNU Code
Where to Get the Example Programs Used in This Book
About the Cover
Acknowledgments
Part I: Files and Users
Chapter 1. Introduction
Section 1.1. The Linux/Unix File Model
Section 1.2. The Linux/Unix Process Model
Section 1.3. Standard C vs. Original C
Section 1.4. Why GNU Programs Are Better
Section 1.5. Portability Revisited
Section 1.6. Suggested Reading
Section 1.7. Summary
Exercises
Chapter 2. Arguments, Options, and the Environment
Section 2.1. Option and Argument Conventions
Section 2.2. Basic Command-Line Processing
Section 2.3. Option Parsing: getopt() and getopt_long()
Section 2.4. The Environment
Section 2.5. Summary
Exercises
Chapter 3. User-Level Memory Management
Section 3.1. Linux/Unix Address Space
Section 3.2. Memory Allocation
Section 3.3. Summary
Exercises
Chapter 4. Files and File I/O
Section 4.1. Introducing the Linux/Unix I/O Model
Section 4.2. Presenting a Basic Program Structure
Section 4.3. Determining What Went Wrong
Section 4.4. Doing Input and Output
Section 4.5. Random Access: Moving Around within a File
Section 4.6. Creating Files
Section 4.7. Forcing Data to Disk
Section 4.8. Setting File Length
Section 4.9. Summary
Exercises
Chapter 5. Directories and File Metadata
Section 5.1. Considering Directory Contents
Section 5.2. Creating and Removing Directories
Section 5.3. Reading Directories
Section 5.4. Obtaining Information about Files
Section 5.5. Changing Ownership, Permission, and Modification Times
Section 5.6. Summary
Exercises
Chapter 6. General Library Interfaces — Part 1
Section 6.1. Times and Dates
Section 6.2. Sorting and Searching Functions
Section 6.3. User and Group Names
Section 6.4. Terminals: isatty()
Section 6.5. Suggested Reading
Section 6.6. Summary
Exercises
Chapter 7. Putting It All Together: ls
Section 7.1. V7 ls Options
Section 7.2. V7 ls Code
Section 7.3. Summary
Exercises
Chapter 8. Filesystems and Directory Walks
Section 8.1. Mounting and Unmounting Filesystems
Section 8.2. Files for Filesystem Administration
Section 8.3. Retrieving Per-Filesystem Information
Section 8.4. Moving Around in the File Hierarchy
Section 8.5. Walking a File Tree: GNU du
Section 8.6. Changing the Root Directory: chroot()
Section 8.7. Summary
Exercises
Part II: Processes, IPC, and Internationalization
Chapter 9. Process Management and Pipes
Section 9.1. Process Creation and Management
Section 9.2. Process Groups
Section 9.3. Basic Interprocess Communication: Pipes and FIFOs
Section 9.4. File Descriptor Management
Section 9.5. Example: Two-Way Pipes in gawk
Section 9.6. Suggested Reading
Section 9.7. Summary
Exercises
Chapter 10. Signals
Section 10.1. Introduction
Section 10.2. Signal Actions
Section 10.3. Standard C Signals: signal() and raise()
Section 10.4. Signal Handlers in Action
Section 10.5. The System V Release 3 Signal APIs: sigset() et al.
Section 10.6. POSIX Signals
Section 10.7. Signals for Interprocess Communication
Section 10.8. Important Special-Purpose Signals
Section 10.9. Signals Across fork() and exec()
Section 10.10. Summary
Exercises
Chapter 11. Permissions and User and Group ID Numbers
Section 11.1. Checking Permissions
Section 11.2. Retrieving User and Group IDs
Section 11.3. Checking as the Real User: access()
Section 11.4. Checking as the Effective User: euidaccess() (GLIBC)
Section 11.5. Setting Extra Permission Bits for Directories
Section 11.6. Setting Real and Effective IDs
Section 11.7. Working with All Three IDs: getresuid() and setresuid() (Linux)
Section 11.8. Crossing a Security Minefield: Setuid root
Section 11.9. Suggested Reading
Section 11.10. Summary
Exercises
Chapter 12. General Library Interfaces — Part 2
Section 12.1. Assertion Statements: assert()
Section 12.2. Low-Level Memory: The memXXX() Functions
Section 12.3. Temporary Files
Section 12.4. Committing Suicide: abort()
Section 12.5. Nonlocal Gotos
Section 12.6. Pseudorandom Numbers
Section 12.7. Metacharacter Expansions
Section 12.8. Regular Expressions
Section 12.9. Suggested Reading
Section 12.10. Summary
Exercises
Chapter 13. Internationalization and Localization
Section 13.1. Introduction
Section 13.2. Locales and the C Library
Section 13.3. Dynamic Translation of Program Messages
Section 13.4. Can You Spell That for Me, Please?
Section 13.5. Suggested Reading
Section 13.6. Summary
Exercises
Chapter 14. Extended Interfaces
Section 14.1. Allocating Aligned Memory: posix_memalign() and memalign()
Section 14.2. Locking Files
Section 14.3. More Precise Times
Section 14.4. Advanced Searching with Binary Trees
Section 14.5. Summary
Exercises
Part III: Debugging and Final Project
Chapter 15. Debugging
Section 15.1. First Things First
Section 15.2. Compilation for Debugging
Section 15.3. GDB Basics
Section 15.4. Programming for Debugging
Section 15.5. Debugging Tools
Section 15.6. Software Testing
Section 15.7. Debugging Rules
Section 15.8. Suggested Reading
Section 15.9. Summary
Exercises
Chapter 16. A Project that Ties Everything Together
Section 16.1. Project Description
Section 16.2. Suggested Reading
Part IV: Appendixes
Appendix A. Teach Yourself Programming in Ten Years
Why Is Everyone in Such a Rush?
References
Answers
Footnotes
Appendix B. Caldera Ancient UNIX License
Appendix C. GNU General Public License
Preamble
Terms and Conditions for Copying, Distribution and Modification
How to Apply These Terms to Your New Programs
Example Use
Index
Preface
One of the best ways to learn about programming is to read well-written programs. This book teaches the
fundamental Linux system call APIs—those that form the core of any significant program—by presenting code
from production programs that you use every day.

By looking at concrete programs, you can not only see how to use the Linux APIs, but you also can examine the
real-world issues (performance, portability, robustness) that arise in writing software.

While the book's title is Linux Programming by Example, everything we cover, unless otherwise noted, applies
to modern Unix systems as well. In general we use "Linux" to mean the Linux kernel, and "GNU/Linux" to mean
the total system (kernel, libraries, tools). Also, we often say "Linux" when we mean all of Linux, GNU/Linux, and
Unix; if something is specific to one system, we mention it explicitly.
Audience
This book is intended for the person who understands programming and is familiar with the basics of C, at least
on the level of The C Programming Language by Kernighan and Ritchie. (Java programmers wishing to read
this book should understand C pointers, since C code makes heavy use of them.) The examples use both the
1990 version of Standard C and Original C.

In particular, you should be familiar with all C operators, control-flow structures, variable and pointer declarations
and use, the string management functions, the use of exit(), and the <stdio.h> suite of functions for file
input/output.

You should understand the basic concepts of standard input, standard output, and standard error and the fact
that all C programs receive an array of character strings representing invocation options and arguments. You
should also be familiar with the fundamental command-line tools, such as cd, cp, date, ln, ls, man (and info if
you have it), rmdir, and rm, the use of long and short command-line options, environment variables, and I/O
redirection, including pipes.

We assume that you want to write programs that work not just under GNU/Linux but across the range of Unix
systems. To that end, we mark each interface as to its availability (GLIBC systems only, or defined by POSIX,
and so on), and portability advice is included as an integral part of the text.

The programming taught here may be at a lower level than you're used to; that's OK. The system calls are the
fundamental building blocks for higher operations and are thus low-level by nature. This in turn dictates our use of
C: The APIs were designed for use from C, and code that interfaces them to higher-level languages, such as C++
and Java, will necessarily be lower level in nature, and most likely, written in C. It may help to remember that "low
level" doesn't mean "bad," it just means "more challenging."
What You Will Learn
This book focuses on the basic APIs that form the core of Linux programming:

Memory management

File input/output

File metadata

Processes and signals

Users and groups

Programming support (sorting, argument parsing, and so on)

Internationalization

Debugging

We have purposely kept the list of topics short. We believe that it is intimidating to try to learn "all there is to
know" from a single book. Most readers prefer smaller, more focused books, and the best Unix books are all
written that way.

So, instead of a single giant tome, we plan several volumes: one on Interprocess Communication (IPC) and
networking, and another on software development and code portability. We also have an eye toward possible
additional volumes in a Linux Programming by Example series that will cover topics such as thread
programming and GUI programming.

The APIs we cover include both system calls and library functions. Indeed, at the C level, both appear as simple
function calls. A system call is a direct request for system services, such as reading or writing a file or creating a
process. A library function, on the other hand, runs at the user level, possibly never requesting any services from
the operating system. System calls are documented in section 2 of the reference manual (viewable online with the
man command), and library functions are documented in section 3.
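
At the source level the two kinds are indistinguishable. In this small sketch, write() is a system call and
printf() is a library function (one that, on GLIBC systems, itself eventually calls write()):

#include <stdio.h>      /* printf(): library function, manual section 3 */
#include <unistd.h>     /* write():  system call, manual section 2 */

int main(void)
{
    /* Library function: buffered, formatted output at user level */
    printf("hello via the C library\n");

    /* System call: a direct request to the kernel to write bytes
       to file descriptor 1 (standard output) */
    write(1, "hello via the kernel\n", 21);

    return 0;
}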

Our goal is to teach you the use of the Linux APIs by example: in particular, through the use, wherever possible,
of both original Unix source code and the GNU utilities. Unfortunately, there aren't as many self-contained
examples as we thought there'd be. Thus, we have written numerous small demonstration programs as well. We
stress programming principles: especially those aspects of GNU programming, such as "no arbitrary limits," that
make the GNU utilities into exceptional programs.

The choice of everyday programs to study is deliberate. If you've been using GNU/Linux for any length of time,
you already understand what programs such as ls and cp do; it then becomes easy to dive straight into how the
programs work, without having to spend a lot of time learning what they do.

Occasionally, we present both higher-level and lower-level ways of doing things. Usually the higher-level standard
interface is implemented in terms of the lower-level interface or construct. We hope that such views of what's
"under the hood" will help you understand how things work; for all the code you write, you should always use the
higher-level, standard interface.

Similarly, we sometimes introduce functions that provide certain functionality and then recommend (with a
provided reason) that these functions be avoided! The primary reason for this approach is so that you'll be able to
recognize these functions when you see them and thus understand the code using them. A well-rounded
knowledge of a topic requires understanding not just what you can do, but what you should and should not do.
Finally, each chapter concludes with exercises. Some involve modifying or writing code. Others are more in the
category of "thought experiments" or "why do you think..." We recommend that you do all of them—they will help
cement your understanding of the material.
Small Is Beautiful: Unix Programs
Hoare's law: "Inside every large program is a small program struggling to get out."

—C.A.R. Hoare

Initially, we planned to teach the Linux API by using the code from the GNU utilities. However, the modern
versions of even simple command-line programs (like mv and cp) are large and many-featured. This is particularly
true of the GNU variants of the standard utilities, which allow long and short options, do everything required by
POSIX, and often have additional, seemingly unrelated options as well (like output highlighting).

It then becomes reasonable to ask, "Given such a large and confusing forest, how can we focus on the one or two
important trees?" In other words, if we present the current full-featured program, will it be possible to see the
underlying core operation of the program?

That is when Hoare's law[1] inspired us to look to the original Unix programs for example code. The original V7
Unix utilities are small and straightforward, making it easy to see what's going on and to understand how the
system calls are used. (V7 was released around 1979; it is the common ancestor of all modern Unix systems,
including GNU/Linux and the BSD systems.)
[1] This famous statement was made at The International Workshop on Efficient Production of Large Programs in
Jablonna, Poland, August 10–14, 1970.

For many years, Unix source code was protected by copyrights and trade secret license agreements, making it
difficult to use for study and impossible to publish. This is still true of all commercial Unix source code. However,
in 2002, Caldera (currently operating as SCO) made the original Unix code (through V7 and 32V Unix) available
under an Open Source style license (see Appendix B, "Caldera Ancient UNIX License," page 655). This makes
it possible for us to include the code from the early Unix system in this book.
Standards
Throughout the book we refer to several different formal standards. A standard is a document describing how
something works. Formal standards exist for many things, for example, the shape, placement, and meaning of the
holes in the electrical outlet in your wall are defined by a formal standard so that all the power cords in your
country work in all the outlets.

So, too, formal standards for computing systems define how they are supposed to work; this enables developers
and users to know what to expect from their software and enables them to complain to their vendor when
software doesn't work.

Of interest to us here are:

1. ISO/IEC International Standard 9899: Programming Languages—C, 1990. The first formal standard
for the C programming language.

2. ISO/IEC International Standard 9899: Programming Languages—C, Second edition, 1999. The
second (and current) formal standard for the C programming language.

3. ISO/IEC International Standard 14882: Programming Languages—C++, 1998. The first formal
standard for the C++ programming language.

4. ISO/IEC International Standard 14882: Programming Languages—C++, 2003. The second (and
current) formal standard for the C++ programming language.

5. IEEE Standard 1003.1–2001: Standard for Information Technology—Portable Operating System
Interface (POSIX®). The current version of the POSIX standard; describes the behavior expected of Unix
and Unix-like systems. This edition covers both the system call and library interface, as seen by the C/C++
programmer, and the shell and utilities interface, seen by the user. It consists of several volumes:

Base Definitions. The definitions of terms, facilities, and header files.

Base Definitions—Rationale. Explanations and rationale for the choice of facilities that both are and
are not included in the standard.

System Interfaces. The system calls and library functions. POSIX terms them all "functions."

Shell and Utilities. The shell language and utilities available for use with shell programs and
interactively.

Although language standards aren't exciting reading, you may wish to consider purchasing a copy of the C
standard: It provides the final definition of the language. Copies can be purchased from ANSI[2] and from ISO.[3]
(The PDF version of the C standard is quite affordable.)
[2] http://www.ansi.org

[3] http://www.iso.ch

The POSIX standard can be ordered from The Open Group.[4] By working through their publications catalog to
the items listed under "CAE Specifications," you can find individual pages for each part of the standard (named
"C031" through "C034"). Each one's page provides free access to the online HTML version of the particular
volume.
[4] http://www.opengroup.org
The POSIX standard is intended for implementation on both Unix and Unix-like systems, as well as non-Unix
systems. Thus, the base functionality it provides is a subset of what Unix systems have. However, the POSIX
standard also defines optional extensions—additional functionality, for example, for threads or real-time support.
Of most importance to us is the X/Open System Interface (XSI) extension, which describes facilities from
historical Unix systems.

Throughout the book, we mark each API as to its availability: ISO C, POSIX, XSI, GLIBC only, or nonstandard
but commonly available.
Features and Power: GNU Programs
Restricting ourselves to just the original Unix code would have made an interesting history book, but it would not
have been very useful in the 21st century. Modern programs do not have the same constraints (memory, CPU
power, disk space, and speed) that the early Unix systems did. Furthermore, they need to operate in a multilingual
world—ASCII and American English aren't enough.

More importantly, one of the primary freedoms expressly promoted by the Free Software Foundation and the
GNU Project[5] is the "freedom to study." GNU programs are intended to provide a large corpus of well-written
programs that journeyman programmers can use as a source from which to learn.
[5] http://www.gnu.org

By using GNU programs, we want to meet both goals: show you well-written, modern code from which you will
learn how to write good code and how to use the APIs well.

We believe that GNU software is better because it is free (in the sense of "freedom," not "free beer"). But it's also
recognized that GNU software is often technically better than its Unix counterparts, and we
devote space in Section 1.4, "Why GNU Programs Are Better," page 14, to explaining why.

A number of the GNU code examples come from gawk (GNU awk). The main reason is that it's a program with
which we're very familiar, and therefore it was easy to pick examples from it. We don't otherwise make any
special claims about it.
Summary of Chapters
Driving a car is a holistic process that involves multiple simultaneous tasks. In many ways, Linux programming is
similar, requiring understanding of multiple aspects of the API, such as file I/O, file metadata, directories, storage
of time information, and so on.

The first part of the book looks at enough of these individual items to enable studying the first significant program,
the V7 ls. Then we complete the discussion of files and users by looking at file hierarchies and the way
filesystems work and are used.

Chapter 1, "Introduction," page 3,

describes the Unix and Linux file and process models, looks at the differences between Original C and
1990 Standard C, and provides an overview of the principles that make GNU programs generally better
than standard Unix programs.

Chapter 2, "Arguments, Options, and the Environment," page 23,

describes how a C program accesses and processes command-line arguments and options and explains
how to work with the environment.

Chapter 3, "User-Level Memory Management," page 51,

provides an overview of the different kinds of memory in use and available in a running process. User-level
memory management is central to every nontrivial application, so it's important to understand it early on.

Chapter 4, "Files and File I/O," page 83,

discusses basic file I/O, showing how to create and use files. This understanding is important for everything
else that follows.

Chapter 5, "Directories and File Metadata," page 117,

describes how directories, hard links, and symbolic links work. It then describes file metadata, such as
owners, permissions, and so on, as well as covering how to work with directories.

Chapter 6, "General Library Interfaces — Part 1," page 165,

looks at the first set of general programming interfaces that we need so that we can make effective use of a
file's metadata.

Chapter 7, "Putting It All Together: ls," page 207,

ties together everything seen so far by looking at the V7 ls program.

Chapter 8, "Filesystems and Directory Walks," page 227,

describes how filesystems are mounted and unmounted and how a program can tell what is mounted on the
system. It also describes how a program can easily "walk" an entire file hierarchy, taking appropriate action
for each object it encounters.

The second part of the book deals with process creation and management, interprocess communication with
pipes and signals, user and group IDs, and additional general programming interfaces. It then describes
internationalization with GNU gettext and, finally, several advanced APIs.

Chapter 9, "Process Management and Pipes," page 283,


looks at process creation, program execution, IPC with pipes, and file descriptor management, including
nonblocking I/O.

Chapter 10, "Signals," page 347,

discusses signals, a simplistic form of interprocess communication. Signals also play an important role in a
parent process's management of its children.

Chapter 11, "Permissions and User and Group ID Numbers," page 403,

looks at how processes and files are identified, how permission checking works, and how the setuid and
setgid mechanisms work.

Chapter 12, "General Library Interfaces — Part 2," page 427,

looks at the rest of the general APIs; many of these are more specialized than the first general set of APIs.

Chapter 13, "Internationalization and Localization," page 485,

explains how to enable your programs to work in multiple languages, with almost no pain.

Chapter 14, "Extended Interfaces," page 529,

describes several extended versions of interfaces covered in previous chapters, as well as covering file
locking in full detail.

We round the book off with a chapter on debugging, since (almost) no one gets things right the first time, and we
suggest a final project to cement your knowledge of the APIs covered in this book.

Chapter 15, "Debugging," page 567,

describes the basics of the GDB debugger, transmits as much of our programming experience in this area as
possible, and looks at several useful tools for doing different kinds of debugging.

Chapter 16, "A Project that Ties Everything Together," page 641,

presents a significant programming project that makes use of just about everything covered in the book.

Several appendices cover topics of interest, including the licenses for the source code used in this book.

Appendix A, "Teach Yourself Programming in Ten Years," page 649,

invokes the famous saying, "Rome wasn't built in a day." So too, Linux/Unix expertise and understanding
come only with time and practice. To that end, we have included this essay by Peter Norvig, which we
highly recommend.

Appendix B, "Caldera Ancient UNIX License," page 655,

covers the Unix source code used in this book.

Appendix C, "GNU General Public License," page 657,

covers the GNU source code used in this book.


Typographical Conventions
Like all books on computer-related topics, we use certain typographical conventions to convey information.
Definitions or first uses of terms appear in italics, like the word "Definitions" at the beginning of this sentence.
Italics are also used for emphasis, for citations of other works, and for commentary in examples. Variable items,
such as arguments or filenames, appear like this. Occasionally, we use a bold font when a point needs to be
made strongly.

Things that exist on a computer are in a constant-width font, such as filenames (foo.c) and command names (ls,
grep). Short snippets that you type are additionally enclosed in single quotes: 'ls -l *.c'.

$ and > are the Bourne shell primary and secondary prompts and are used to display interactive examples. User
input appears in a different font from regular computer output in examples. Examples look like this:

$ ls -1 Look at files. Option is digit 1, not letter l


foo
bar
baz

We prefer the Bourne shell and its variants (ksh93, Bash) over the C shell; thus, all our examples show only the
Bourne shell. Be aware that quoting and line-continuation rules are different in the C shell; if you use it, you're on
your own![6]
[6] Seethe csh(1) and tcsh(1) manpages and the book Using csh & tcsh, by Paul DuBois, O'Reilly & Associates,
Sebastopol, CA, USA, 1995. ISBN: 1-56592-132-1.

When referring to functions in programs, we append an empty pair of parentheses to the function's name:
printf(), strcpy(). When referring to a manual page (accessible with the man command), we follow the
standard Unix convention of writing the command or function name in italics and the section in parentheses after it,
in regular type: awk(1), printf(3).
Where to Get Unix and GNU Source Code
You may wish to have copies of the programs we use in this book for your own experimentation and review. All
the source code is available over the Internet, and your GNU/Linux distribution contains the source code for the
GNU utilities.
Unix Code
Archives of various "ancient" versions of Unix are maintained by The UNIX Heritage Society (TUHS),
http://www.tuhs.org.

Of most interest is that it is possible to browse the archive of old Unix source code on the Web. Start with
http://minnie.tuhs.org/UnixTree/. All the example code in this book is from the Seventh Edition
Research UNIX System, also known as "V7."

The TUHS site is physically located in Australia, although there are mirrors of the archive around the world—see
http://www.tuhs.org/archive_sites.html. This page also indicates that the archive is available for
mirroring with rsync. (See http://rsync.samba.org/ if you don't have rsync: It's standard on GNU/Linux
systems.)

You will need about 2–3 gigabytes of disk to copy the entire archive. To copy the archive, create an empty
directory, and in it, run the following commands:

mkdir Applications 4BSD PDP-11 PDP-11/Trees VAX Other

rsync -avz minnie.tuhs.org::UA_Root .


rsync -avz minnie.tuhs.org::UA_Applications Applications
rsync -avz minnie.tuhs.org::UA_4BSD 4BSD
rsync -avz minnie.tuhs.org::UA_PDP11 PDP-11
rsync -avz minnie.tuhs.org::UA_PDP11_Trees PDP-11/Trees
rsync -avz minnie.tuhs.org::UA_VAX VAX
rsync -avz minnie.tuhs.org::UA_Other Other

You may wish to omit copying the Trees directory, which contains extractions of several versions of Unix and
occupies around 700 megabytes of disk.

You may also wish to consult the TUHS mailing list to see if anyone near you can provide copies of the archive
on CD-ROM, to avoid transferring so much data over the Internet.

The folks at Southern Storm Software, Pty. Ltd., in Australia, have "modernized" a portion of the V7 user-level
code so that it can be compiled and run on current systems, most notably GNU/Linux. This code can be
downloaded from their web site.[7]
[7] http://www.southern-storm.com.au/v7upgrade.html

It's interesting to note that V7 code does not contain any copyright or permission notices in it. The authors wrote
the code primarily for themselves and their research, leaving the permission issues to AT&T's corporate licensing
department.
GNU Code
If you're using GNU/Linux, then your distribution will have come with source code, presumably in whatever
packaging format it uses (Red Hat RPM files, Debian DEB files, Slackware .tar.gz files, etc.). Many of the
examples in the book are from the GNU Coreutils, version 5.0. Find the appropriate CD-ROM for your
GNU/Linux distribution, and use the appropriate tool to extract the code. Or follow the instructions in the next
few paragraphs to retrieve the code.

If you prefer to retrieve the files yourself from the GNU ftp site, you will find them at
ftp://ftp.gnu.org/gnu/coreutils/coreutils-5.0.tar.gz.

You can use the wget utility to retrieve the file:

$ wget ftp://ftp.gnu.org/gnu/coreutils/coreutils-5.0.tar.gz Retrieve the distribution


... lots of output here as file is retrieved ...

Alternatively, you can use good old-fashioned ftp to retrieve the file:

$ ftp ftp.gnu.org Connect to GNU ftp site


Connected to ftp.gnu.org (199.232.41.7).
220 GNU FTP server ready.
Name (ftp.gnu.org:arnold): anonymous Use anonymous ftp
331 Please specify the password.
Password: Password does not echo on screen
230-If you have any problems with the GNU software or its downloading,
230-please refer your questions to <[email protected]>.
... Lots of verbiage deleted
230 Login successful. Have fun.
Remote system type is UNIX.
Using binary mode to transfer files.
ftp> cd /gnu/coreutils Change to Coreutils directory
250 Directory successfully changed.
ftp> bin
200 Switching to Binary mode.
ftp> hash Print # signs as progress indicator
Hash mark printing on (1024 bytes/hash mark).
ftp> get coreutils-5.0.tar.gz Retrieve file
local: coreutils-5.0.tar.gz remote: coreutils-5.0.tar.gz
227 Entering Passive Mode (199, 232, 41, 7, 86, 107)
150 Opening BINARY mode data connection for coreutils-5.0.tar.gz (6020616 bytes)
#################################################################################
#################################################################################
...
226 File send OK.
6020616 bytes received in 2.03e+03 secs (2.9 Kbytes/sec)
ftp> quit Log off
221 Goodbye.

Once you have the file, extract it as follows:

$ gzip -dc < coreutils-5.0.tar.gz | tar -xvpf - Extract files


... lots of output here as files are extracted ...
Systems using GNU tar may use this incantation:

$ tar -xvpzf coreutils-5.0.tar.gz Extract files


... lots of output here as files are extracted ...

In compliance with the GNU General Public License, here is the Copyright information for all GNU programs
quoted in this book. All the programs are "free software; you can redistribute it and/or modify it under the terms
of the GNU General Public License as published by the Free Software Foundation; either version 2 of the
License, or (at your option) any later version." See Appendix C, "GNU General Public License," page 657, for
the text of the GNU General Public License.

Coreutils 5.0 File Copyright dates


lib/safe-read.c Copyright © 1993–1994, 1998, 2002
lib/safe-write.c Copyright © 2002
lib/utime.c Copyright © 1998, 2001–2002
lib/xreadlink.c Copyright © 2001
src/du.c Copyright © 1988–1991, 1995–2003
src/env.c Copyright © 1986, 1991–2003
src/install.c Copyright © 1989–1991, 1995–2002
src/link.c Copyright © 2001–2002
src/ls.c Copyright © 1985, 1988, 1990, 1991, 1995–2003
src/pathchk.c Copyright © 1991–2003
src/sort.c Copyright © 1988, 1991–2002
src/sys2.h Copyright © 1997–2003
src/wc.c Copyright © 1985, 1991, 1995–2002
Gawk 3.0.6 File Copyright dates
eval.c Copyright © 1986, 1988, 1989, 1991–2000
Gawk 3.1.3 File Copyright dates
awk.h Copyright © 1986, 1988, 1989, 1991–2003
builtin.c Copyright © 1986, 1988, 1989, 1991–2003
eval.c Copyright © 1986, 1988, 1989, 1991–2003
io.c Copyright © 1986, 1988, 1989, 1991–2003
main.c Copyright © 1986, 1988, 1989, 1991–2003
posix/gawkmisc.c Copyright © 1986, 1988, 1989, 1991–1998, 2001–2003
Gawk 3.1.4 File Copyright dates
builtin.c Copyright © 1986, 1988, 1989, 1991–2004
GLIBC 2.3.2 File Copyright dates
locale/locale.h Copyright © 1991, 1992, 1995–2002
posix/unistd.h Copyright © 1991–2003
time/sys/time.h Copyright © 1991–1994, 1996–2003
Make 3.80 File Copyright dates
read.c Copyright © 1988–1997, 2002
Where to Get the Example Programs Used in This Book
The example programs used in this book can be found at http://authors.phptr.com/robbins.
About the Cover
"This is the weapon of a Jedi Knight ..., an elegant weapon for a more civilized age. For over a thousand
generations the Jedi Knights were the guardians of peace and justice in the Old Republic. Before the dark
times, before the Empire."

—Obi-Wan Kenobi

You may be wondering why we chose to put a light saber on the cover and to use it throughout the book's
interior. What does it represent, and how does it relate to Linux programming?

In the hands of a Jedi Knight, a light saber is both a powerful weapon and a thing of beauty. Its use demonstrates
the power, knowledge, control of the Force, and arduous training of the Jedi who wields it.

The elegance of the light saber mirrors the elegance of the original Unix API design. There, too, the studied,
precise use of the APIs and the Software Tools and GNU design principles lead to today's powerful, flexible,
capable GNU/Linux system. This system demonstrates the knowledge and understanding of the programmers
who wrote all its components.

And, of course, light sabers are just way cool!


Acknowledgments
Writing a book is lots of work, and doing it well requires help from many people. Dr. Brian W. Kernighan, Dr.
Doug McIlroy, Peter Memishian, and Peter van der Linden reviewed the initial book proposal. David J. Agans,
Fred Fish, Don Marti, Jim Meyering, Peter Norvig, and Julian Seward provided reprint permission for various
items quoted throughout the book. Thanks to Geoff Collyer, Ulrich Drepper, Yosef Gold, Dr. C.A.R. (Tony)
Hoare, Dr. Manny Lehman, Jim Meyering, Dr. Dennis M. Ritchie, Julian Seward, Henry Spencer, and Dr.
Wladyslaw M. Turski, who provided much useful general information. Thanks also to the other members of the
GNITS gang: Karl Berry, Akim DeMaille, Ulrich Drepper, Greg McGary, Jim Meyering, François Pinard, and
Tom Tromey, who all provided helpful feedback about good programming practice. Karl Berry, Alper Ersoy,
and Dr. Nelson H.F. Beebe provided valuable technical help with the Texinfo and DocBook/XML toolchains.

Good technical reviewers not only make sure that an author gets his facts right, they also ensure that he thinks
carefully about his presentation. Dr. Nelson H.F. Beebe, Geoff Collyer, Russ Cox, Ulrich Drepper, Randy
Lechlitner, Dr. Brian W. Kernighan, Peter Memishian, Jim Meyering, Chet Ramey, and Louis Taber acted as
technical reviewers for the entire book. Dr. Michael Brennan provided helpful comments on Chapter 15. Both the
prose and many of the example programs benefited from their reviews. I hereby thank all of them. As most
authors usually say here, "Any remaining errors are mine."

I would especially like to thank Mark Taub of Pearson Education for initiating this project, for his enthusiasm for
the series, and for his help and advice as the book moved through its various stages. Anthony Gemmellaro did a
phenomenal job of realizing my concept for the cover, and Gail Cocker's interior design is beautiful. Faye
Gemmellaro made the production process enjoyable, instead of a chore. Dmitry Kirsanov and Alina Kirsanova
did the figures, page layout, and indexing; they were a pleasure to work with.

Finally, my deepest gratitude and love to my wife, Miriam, for her support and encouragement during the book's
writing.

Arnold Robbins
Nof Ayalon
ISRAEL
Part I: Files and Users
Chapter 1 Introduction
Chapter 2 Arguments, Options, and the Environment
Chapter 3 User-Level Memory Management
Chapter 4 Files and File I/O
Chapter 5 Directories and File Metadata
Chapter 6 General Library Interfaces — Part 1
Chapter 7 Putting It All Together: ls
Chapter 8 Filesystems and Directory Walks
Chapter 1. Introduction
In this chapter

1.1 The Linux/Unix File Model page 4

1.2 The Linux/Unix Process Model page 10

1.3 Standard C vs. Original C page 12

1.4 Why GNU Programs Are Better page 14

1.5 Portability Revisited page 19

1.6 Suggested Reading page 20

1.7 Summary page 21

Exercises page 22

If there is one phrase that summarizes the primary GNU/Linux (and therefore Unix) concepts, it's "files and
processes." In this chapter we review the Linux file and process models. These are important to understand
because the system calls are almost all concerned with modifying some attribute or part of the state of a file or a
process.

Next, because we'll be examining code in both styles, we briefly review the major difference between 1990
Standard C and Original C. Finally, we discuss at some length what makes GNU programs "better,"
programming principles that we'll see in use in the code.

This chapter contains a number of intentional simplifications. The full details are covered as we progress through
the book. If you're already a Linux wizard, please forgive us.
1.1. The Linux/Unix File Model
One of the driving goals in the original Unix design was simplicity. Simple concepts are easy to learn and use.
When the concepts are translated into simple APIs, simple programs are then easy to design, write, and get
correct. In addition, simple code is often smaller and more efficient than more complicated designs.

The quest for simplicity was driven by two factors. From a technical point of view, the original PDP-11
minicomputers on which Unix was developed had a small address space: 64 Kilobytes total on the smaller
systems, 64K code and 64K of data on the large ones. These restrictions applied not just to regular programs
(so-called user level code), but to the operating system itself (kernel level code). Thus, not only "Small Is
Beautiful" aesthetically, but "Small Is Beautiful" because there was no other choice!

The second factor was a negative reaction to contemporary commercial operating systems, which were
needlessly complicated, with obtuse command languages, multiple kinds of file I/O, and little generality or
symmetry. (Steve Johnson once remarked that "Using TSO is like trying to kick a dead whale down a beach."
TSO is one of the obtuse mainframe time-sharing systems just described.)

1.1.1. Files and Permissions

The Unix file model is as simple as it gets: A file is a linear stream of bytes. Period. The operating system imposes
no preordained structure on files: no fixed or varying record sizes, no indexed files, nothing. The interpretation of
file contents is entirely up to the application. (This isn't quite true, as we'll see shortly, but it's close enough for a
start.)

Once you have a file, you can do three things with the file's data: read them, write them, or execute them.

Unix was designed for time-sharing minicomputers; this implies a multiuser environment from the get-go. Once
there are multiple users, it must be possible to specify a file's permissions: Perhaps user jane is user fred's boss,
and jane doesn't want fred to read the latest performance evaluations.

For file permission purposes, users are classified into three distinct categories: user: the owner of a file; group: the
group of users associated with this file (discussed shortly); and other: anybody else. For each of these categories,
every file has separate read, write, and execute permission bits associated with it, yielding a total of nine
permission bits. This shows up in the first field of the output of 'ls -l':

$ ls -l progex.texi
-rw-r--r-- 1 arnold devel 5614 Feb 24 18:02 progex.texi

Here, arnold and devel are the owner and group of progex.texi, and -rw-r--r-- are the file type and
permissions. The first character is a dash for regular file, a d for directories, or one of a small set of other
characters for other kinds of files that aren't important at the moment. Each subsequent group of three characters
represents read, write, and execute permission for the owner, group, and "other," respectively.

In this example, progex.texi is readable and writable by the owner, and readable by the group and other. The
dashes indicate absent permissions, thus the file is not executable by anyone, nor is it writable by the group or
other.

The owner and group of a file are stored as numeric values known as the user ID (UID) and group ID (GID);
standard library functions that we present later in the book make it possible to print the values as human-readable
names.
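
As a preview of those functions (Chapter 6 covers them in full), here is a minimal sketch that maps a file's
numeric IDs back to names with the standard stat(), getpwuid(), and getgrgid() calls:

#include <stdio.h>
#include <pwd.h>           /* getpwuid() */
#include <grp.h>           /* getgrgid() */
#include <sys/stat.h>      /* stat() */

int main(int argc, char **argv)
{
    struct stat sb;
    struct passwd *pw;
    struct group *gr;

    if (argc != 2 || stat(argv[1], &sb) < 0)
        return 1;

    pw = getpwuid(sb.st_uid);    /* may return NULL... */
    gr = getgrgid(sb.st_gid);    /* ...if the ID has no name */
    printf("%s %s\n", pw != NULL ? pw->pw_name : "?",
                      gr != NULL ? gr->gr_name : "?");
    return 0;
}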
A file's owner can change the permission by using the chmod (change mode) command. (As such, file permissions
are sometimes referred to as the "file mode.") A file's group can be changed with the chgrp (change group) and
chown (change owner) commands.[1]

[1] Some systems allow regular users to change the ownership on their files to someone else, thus "giving them away."
The details are standardized by POSIX but are a bit messy. Typical GNU/Linux configurations do not allow it.

Group permissions were intended to support cooperative work: Although one person in a group or department
may own a particular file, perhaps everyone in that group needs to be able to modify it. (Consider a collaborative
marketing paper or data from a survey.)

When the system goes to check a file access (usually upon opening a file), if the UID of the process matches that
of the file, the owner permissions apply. If those permissions deny the operation (say, a write to a file with
-r--rw-rw- permissions), the operation fails; Unix and Linux do not proceed to test the group and other
permissions.[2] The same is true if the UID is different but the GID matches; if the group permissions deny the
operation, it fails.
[2] The owner can always change the permission, of course. Most users don't disable write permission for themselves.
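
In outline, the check behaves like this hypothetical helper (a sketch only; the real kernel code also handles root
and supplementary group sets):

/* Sketch of the owner-then-group-then-other decision. mode holds the
   nine permission bits; want is the requested access: 4 for read,
   2 for write, 1 for execute. */
int may_access(unsigned mode, unsigned want,
               unsigned file_uid, unsigned file_gid,
               unsigned proc_uid, unsigned proc_gid)
{
    if (proc_uid == file_uid)             /* owner bits decide */
        return ((mode >> 6) & want) == want;
    if (proc_gid == file_gid)             /* otherwise group bits decide */
        return ((mode >> 3) & want) == want;
    return (mode & want) == want;         /* otherwise "other" bits decide */
}

Note that there is no falling through: whichever category matches first is the only one tested.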

Unix and Linux support the notion of a superuser: a user with special privileges. This user is known as root and
has the UID of 0. root is allowed to do anything; all bets are off, all doors are open, all drawers unlocked.[3]
(This can have significant security implications, which we touch on throughout the book but do not cover
exhaustively.) Thus, even if a file is mode ----------, root can still read and write the file. (One exception is
that the file can't be executed. But as root can add execute permission, the restriction doesn't prevent anything.)
[3] There are some rare exceptions to this rule, all of which are beyond the scope of this book.

The user/group/other, read/write/execute permissions model is simple, yet flexible enough to cover most
situations. Other, more powerful but more complicated, models exist and are implemented on different systems,
but none of them are well enough standardized and broadly enough implemented to be worth discussing in a
general-purpose text like this one.

1.1.2. Directories and Filenames

Once you have a file, you need someplace to keep it. This is the purpose of the directory (known as a "folder" on
Windows and Apple Macintosh systems). A directory is a special kind of file, which associates filenames with
particular collections of file metadata, known as inodes. Directories are special because they can only be updated
by the operating system, by the system calls described in Chapter 4, "Files and File I/O," page 83. They are also
special in that the operating system dictates the format of directory entries.

Filenames may contain any valid 8-bit byte except the / (forward slash) character and ASCII NUL, the character
whose bits are all zero. Early Unix systems limited filenames to 14 bytes; modern systems allow individual
filenames to be up to 255 bytes.

The inode contains all the information about a file except its name: the type, owner, group, permissions, size,
modification and access times. It also stores the locations on disk of the blocks containing the file's data. All of
these are data about the file, not the file's data itself, thus the term metadata.
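
The stat() system call, covered in Chapter 5, hands a program a copy of this metadata in a struct stat; a
small sketch:

#include <stdio.h>
#include <time.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    struct stat sb;

    if (argc != 2 || stat(argv[1], &sb) < 0)
        return 1;

    /* All of this is metadata: facts about the file, not its contents */
    printf("size:     %ld bytes\n", (long) sb.st_size);
    printf("mode:     %o\n", (unsigned) sb.st_mode & 07777);
    printf("modified: %s", ctime(&sb.st_mtime));
    return 0;
}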

Directory permissions have a slightly different meaning from those for file permissions. Read permission means the
ability to list the directory's contents; that is, to look through it to see what files it contains. Write permission is the ability
to create and remove files in the directory. Execute permission is the ability to go through a directory when
opening or otherwise accessing a contained file or subdirectory.
Note

If you have write permission on a directory, you can remove files in that directory, even if they don't
belong to you! When used interactively, the rm command notices this, and asks you for confirmation in
such a case.

The /tmp directory has write permission for everyone, but your files in /tmp are quite safe because
/tmp usually has the so-called sticky bit set on it:

$ ls -ld /tmp
drwxrwxrwt 11 root root 4096 May 15 17:11 /tmp

Note the t in the last position of the first field. On most directories this position has an x in it. With the
sticky bit set, only you, as the file's owner, or root may remove your files. (We discuss this in more
detail in Section 11.5.2, "Directories and the Sticky Bit," page 414.)

1.1.3. Executable Files

Remember we said that the operating system doesn't impose a structure on files? Well, we've already seen that
that was a white lie when it comes to directories. It's also the case for binary executable files. To run a program,
the kernel has to know what part of a file represents instructions (code) and what part represents data. This leads
to the notion of an object file format, which is the definition for how these things are laid out within a file on disk.

Although the kernel will only run a file laid out in the proper format, it is up to user-level utilities to create these
files. The compiler for a programming language (such as Ada, Fortran, C, or C++) creates object files, and then a
linker or loader (usually named ld) binds the object files with library routines to create the final executable. Note
that even if a file has all the right bits in all the right places, the kernel won't run it if the appropriate execute
permission bit isn't turned on (or at least one execute bit for root).

Because the compiler, assembler, and loader are user-level tools, it's (relatively) easy to change object file
formats as needs develop over time; it's only necessary to "teach" the kernel about the new format and then it can
be used. The part that loads executables is relatively small and this isn't an impossible task. Thus, Unix file formats
have evolved over time. The original format was known as a.out (Assembler OUTput). The next format, still
used on some commercial systems, is known as COFF (Common Object File Format), and the current, most
widely used format is ELF (Executable and Linking Format). Modern GNU/Linux systems use ELF.

The kernel recognizes that an executable file contains binary object code by looking at the first few bytes of the
file for special magic numbers. These are sequences of two or four bytes that the kernel recognizes as being
special. For backwards compatibility, modern Unix systems recognize multiple formats. ELF files begin with the
four characters "\177ELF".
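
The same test is easy to express at user level. Here is a sketch (the kernel's own version lives in its program
loader):

#include <stdio.h>
#include <string.h>

/* Report whether a file begins with the ELF magic number. */
int is_elf(const char *path)
{
    unsigned char magic[4];
    FILE *fp = fopen(path, "rb");
    int ok = 0;

    if (fp != NULL) {
        ok = fread(magic, 1, 4, fp) == 4
             && memcmp(magic, "\177ELF", 4) == 0;
        fclose(fp);
    }
    return ok;
}

int main(int argc, char **argv)
{
    if (argc == 2)
        printf("%s: %s\n", argv[1], is_elf(argv[1]) ? "ELF" : "not ELF");
    return 0;
}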

Besides binary executables, the kernel also supports executable scripts. Such a file also begins with a magic
number: in this case, the two regular characters #!. A script is a program executed by an interpreter, such as the
shell, awk, Perl, Python, or Tcl. The #! line provides the full path to the interpreter and, optionally, one single
argument:

#! /bin/awk -f

BEGIN { print "hello, world" }


Let's assume the above contents are in a file named hello.awk and that the file is executable. When you type
'hello.awk', the kernel runs the program as if you had typed '/bin/awk -f hello.awk'. Any additional
command-line arguments are also passed on to the program. In this case, awk runs the program and prints the
universally known hello, world message.

The #! mechanism is an elegant way of hiding the distinction between binary executables and script executables.
If hello.awk is renamed to just hello, the user typing 'hello' can't tell (and indeed shouldn't have to know) that
hello isn't a binary executable program.

1.1.4. Devices

One of Unix's most notable innovations was the unification of file I/O and device I/O.[4] Devices appear as files in
the filesystem, regular permissions apply to their access, and the same I/O system calls are used for opening,
reading, writing, and closing them. All of the "magic" to make devices look like files is hidden in the kernel. This is
just another aspect of the driving simplicity principle in action: We might phrase it as no special cases for user
code.
[4] This feature first appeared in Multics, but Multics was never widely used.

Two devices appear frequently in everyday use, particularly at the shell level: /dev/null and /dev/tty.

/dev/null is the "bit bucket." All data sent to /dev/null is discarded by the operating system, and attempts to
read from it always return end-of-file (EOF) immediately.
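
A short demonstration of both behaviors (error checking kept to a minimum):

#include <stdio.h>
#include <fcntl.h>      /* open() */
#include <unistd.h>     /* read(), write(), close() */

int main(void)
{
    char buf[16];
    int fd = open("/dev/null", O_RDWR);

    if (fd < 0)
        return 1;
    /* The write succeeds, but the data goes nowhere */
    printf("write returned %ld\n", (long) write(fd, "discarded", 9));
    /* The read returns 0 bytes: immediate end-of-file */
    printf("read returned %ld\n", (long) read(fd, buf, sizeof buf));
    close(fd);
    return 0;
}

Running it prints 9 for the write (all nine bytes "accepted") and 0 for the read (end-of-file).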

/dev/tty is the process's current controlling terminal—the one to which it listens when a user types the interrupt
character (typically CTRL-C) or performs job control (CTRL-Z).

GNU/Linux systems, and many modern Unix systems, supply /dev/stdin, /dev/stdout, and /dev/stderr
devices, which provide a way to name the open files each process inherits upon startup.

Other devices represent real hardware, such as tape and disk drives, CD-ROM drives, and serial ports. There
are also software devices, such as pseudo-ttys, that are used for networking logins and windowing systems.
/dev/console represents the system console, a particular hardware device on minicomputers. On modern
computers, /dev/console is the screen and keyboard, but it could be a serial port.

Unfortunately, device-naming conventions are not standardized, and each operating system has different names
for tapes, disks, and so on. (Fortunately, that's not an issue for what we cover in this book.) Devices have either a
b or c in the first character of 'ls -l' output:

$ ls -l /dev/tty /dev/hda
brw-rw---- 1 root disk 3, 0 Aug 31 02:31 /dev/hda
crw-rw-rw- 1 root root 5, 0 Feb 26 08:44 /dev/tty

The initial b represents block devices, and a c represents character devices. Device files are discussed further in
Section 5.4, "Obtaining Information about Files," page 139.
1.2. The Linux/Unix Process Model
A process is a running program.[5] Processes have the following attributes:
[5] Processes can be suspended, in which case they are not "running"; however, neither are they terminated. In any case,
in the early stages of the climb up the learning curve, it pays not to be too pedantic.

A unique process identifier (the PID)

A parent process (with an associated identifier, the PPID)

Permission identifiers (UID, GID, groupset, and so on)

An address space, separate from those of all other processes

A program running in that address space

A current working directory ('.')

A current root directory (/; changing this is an advanced topic)

A set of open files, directories, or both

A permissions-to-deny mask for use in creating new files

A set of strings representing the environment

A scheduling priority (an advanced topic)

Settings for signal disposition (an advanced topic)

A controlling terminal (also an advanced topic)

When the main() function begins execution, all of these things have already been put in place for the running
program. System calls are available to query and change each of the above items; covering them is the purpose of
this book.
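
For instance, several of these attributes can be queried with single calls, all covered in later chapters; a minimal
sketch:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char cwd[4096];

    printf("PID:  %ld\n", (long) getpid());    /* process ID */
    printf("PPID: %ld\n", (long) getppid());   /* parent's process ID */
    printf("UID:  %ld\n", (long) getuid());    /* real user ID */
    if (getcwd(cwd, sizeof cwd) != NULL)
        printf("cwd:  %s\n", cwd);             /* current working directory */
    return 0;
}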

New processes are always created by an existing process. The existing process is termed the parent, and the
new process is termed the child. Upon booting, the kernel handcrafts the first, primordial process, which runs the
program /sbin/init; it has process ID 1 and serves several administrative functions. All other processes are
descendants of init. (init's parent is the kernel, often listed as process ID 0.)

The child-to-parent relationship is one-to-one; each process has only one parent, and thus it's easy to find out the
PID of the parent. The parent-to-child relationship is one-to-many; any given process can create a potentially
unlimited number of children. Thus, there is no easy way for a process to find out the PIDs of all its children. (In
practice, it's not necessary, anyway.) A parent process can arrange to be notified when a child process terminates
("dies"), and it can also explicitly wait for such an event.

Each process's address space (memory) is separate from that of every other. Unless two processes have made
explicit arrangement to share memory, one process cannot affect the address space of another. This is important;
it provides a basic level of security and system reliability. (For efficiency, the system arranges to share the read-
only executable code of the same program among all the processes running that program. This is transparent to
the user and to the running program.)

The current working directory is the one to which relative pathnames (those that don't start with a /) are relative.
This is the directory you are "in" whenever you issue a 'cd someplace' command to the shell.
By convention, all programs start out with three files already open: standard input, standard output, and standard
error. These are where input comes from, output goes to, and error messages go to, respectively. In the course of
this book, we will see how these are put in place. A parent process can open additional files and have them
already available for a child process; the child will have to know they're there, either by way of some convention
or by a command-line argument or environment variable.

The environment is a set of strings, each of the form 'name=value'. Functions exist for querying and setting
environment variables, and child processes inherit the environment of their parents. Typical environment variables
are things like PATH and HOME in the shell. Many programs look for the existence and value of specific
environment variables in order to control their behavior.
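
For example, using the standard getenv() function (working with the environment is covered in Chapter 2):

#include <stdio.h>
#include <stdlib.h>     /* getenv() */

int main(void)
{
    /* Each process inherits its 'name=value' strings from its parent */
    const char *home = getenv("HOME");
    const char *path = getenv("PATH");

    printf("HOME is %s\n", home != NULL ? home : "<unset>");
    printf("PATH is %s\n", path != NULL ? path : "<unset>");
    return 0;
}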

It is important to understand that a single process may execute multiple programs during its lifetime. Unless
explicitly changed, all of the other system-maintained attributes (current directory, open files, PID, etc.) remain
the same. The separation of "starting a new process" from "choosing which program to run" is a key Unix
innovation. It makes many operations simple and straightforward. Other operating systems that combine the two
operations are less general and more complicated to use.
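
The separation is visible in the API itself: fork() starts the new process, and a separate exec call chooses the
program to run. A minimal sketch (the details are in Chapter 9):

#include <stdio.h>
#include <sys/wait.h>   /* wait() */
#include <unistd.h>     /* fork(), execlp() */

int main(void)
{
    if (fork() == 0) {
        /* child: same PID, open files, and directory; new program */
        execlp("ls", "ls", "-l", (char *) NULL);
        perror("execlp");       /* reached only if the exec failed */
        return 1;
    }
    wait(NULL);                 /* parent: wait for ls to finish */
    return 0;
}

The child keeps its PID, open files, and current directory across the execlp(); only the program changes.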

1.2.1. Pipes: Hooking Processes Together

You've undoubtedly used the pipe construct ('|') in the shell to connect two or more running programs. A pipe
acts like a file: One process writes to it using the normal write operation, and the other process reads from it using
the read operation. The processes don't (usually) know that their input/output is a pipe and not a regular file.

Just as the kernel hides the "magic" for devices, making them act like regular files, so too the kernel does the work
for pipes, arranging to pause the pipe's writer when the pipe fills up and to pause the reader when no data is
waiting to be read.

The file I/O paradigm with pipes thus acts as a key mechanism for connecting running programs; no temporary
files are needed. Again, generality and simplicity at work: no special cases for user code.
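
The pipe() system call itself appears in Part II; the following sketch shows the paradigm at the level of plain
read and write. Once the pipe exists, neither end needs any pipe-specific code:

#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    int pipefd[2];      /* pipefd[0] is the read end, pipefd[1] the write end */
    char buf[BUFSIZ];
    ssize_t n;

    if (pipe(pipefd) < 0) {
        perror("pipe");
        return 1;
    }
    if (fork() == 0) {                        /* child: the writer */
        close(pipefd[0]);
        write(pipefd[1], "hello\n", 6);       /* an ordinary write */
        close(pipefd[1]);
        _exit(0);
    }
    close(pipefd[1]);                         /* parent: the reader */
    n = read(pipefd[0], buf, sizeof buf);     /* an ordinary read */
    if (n > 0)
        write(1, buf, n);                     /* copy to standard output */
    close(pipefd[0]);
    return 0;
}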
1.3. Standard C vs. Original C
For many years, the de facto definition of C was found in the first edition of the book The C Programming
Language, by Brian Kernighan and Dennis Ritchie. This book described C as it existed for Unix and on the
systems to which the Bell Labs developers had ported it. Throughout this book, we refer to it as "Original C,"
although it's also common for it to be referred to as "K&R C," after the book's two authors. (Dennis Ritchie
designed and implemented C.)

The 1990 ISO Standard for C formalized the language's definition, including the functions in the C library (such as
printf() and fopen()). The C standards committee did an admirable job of standardizing existing practice and
avoided inventing new features, with one notable exception (and a few minor ones). The most visible change in
the language was the use of function prototypes, borrowed from C++.

Standard C, C++, and the Java programming language use function prototypes for function declarations and
definitions. A prototype describes not only the function's return value but also the number and type of its
arguments. With prototypes, a compiler can do complete type checking at the point of a function call:

extern int myfunc(struct my_struct *a,          Declaration
                  struct my_struct *b,
                  double c, int d);

int myfunc(struct my_struct *a,                 Definition
           struct my_struct *b,
           double c, int d)
{
    ...
}

...
struct my_struct s, t;
int j;

...
/* Function call, somewhere else: */
j = myfunc(& s, & t, 3.1415, 42);

This function call is fine. But consider an erroneous call:

j = myfunc(-1, -2, 0);                          Wrong number and types of arguments

The compiler can immediately diagnose this call as being invalid. However, in Original C, functions are declared
without the argument list being specified:

extern int myfunc(); Returns int, arguments unknown

Furthermore, function definitions list the parameter names in the function header, and then declare the parameters
before the function body. Parameters of type int don't have to be declared, and if a function returns int, that
doesn't have to be declared either:

myfunc(a, b, c, d)                              Return type is int
struct my_struct *a, *b;
double c;                                       Note, no declaration of parameter d
{
    ...
}

Consider again the same erroneous function call: 'j = myfunc(-1, -2, 0);'. In Original C, the compiler has
no way of knowing that you've (accidentally, we assume) passed the wrong arguments to myfunc(). Such
erroneous calls generally lead to hard-to-find runtime problems (such as segmentation faults, whereby the
program dies), and the Unix lint program was created to deal with these kinds of things.

So, although function prototypes were a radical departure from existing practice, their additional type checking
was deemed too important to be without, and they were added to the language with little opposition.

In 1990 Standard C, code written in the original style, for both declarations and definitions, is valid. This makes it
possible to continue to compile millions of lines of existing code with a standard-conforming compiler. New code,
obviously, should be written with prototypes because of the improved possibilities for compile-time error
checking.

1999 Standard C continues to allow original style declarations and definitions. However, the "implicit int" rule
was removed; functions must have a return type, and all parameters must be declared.

Furthermore, when a program called a function that had not been formally declared, Original C would create an
implicit declaration for the function, giving it a return type of int. 1990 Standard C did the same, additionally
noting that it had no information about the parameters. 1999 Standard C no longer provides this "auto-declare"
feature.
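
A small, contrived illustration may help. The following is accepted as Original C, and a C90 compiler accepts
it as well (typically with warnings); a C99 compiler rejects all three marked constructs:

main(argc, argv)      /* implicit int return type: removed in C99 */
char *argv[];         /* argc not declared, defaults to int: removed in C99 */
{
    exit(0);          /* exit() auto-declared as 'extern int exit()': removed in C99 */
}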

Other notable additions in Standard C are the const keyword, also from C++, and the volatile keyword,
which the committee invented. For the code you'll see in this book, understanding the different function
declaration and definition syntaxes is the most important thing.

For V7 code using original style definitions, we have added comments showing the equivalent prototype.
Otherwise, we have left the code alone, preferring to show it exactly as it was originally written and as you'll see it
if you download the code yourself.

Although 1999 C adds some additional keywords and features beyond the 1990 version, we have chosen to
stick to the 1990 dialect, since C99 compilers are not yet commonplace. Practically speaking, this doesn't matter:
C89 code should compile and run without change when a C99 compiler is used, and the new C99 features don't
affect our discussion or use of the fundamental Linux/Unix APIs.
1.4. Why GNU Programs Are Better
What is it that makes a GNU program a GNU program?[6] What makes GNU software "better" than other (free
or non-free) software? The most obvious difference is the GNU General Public License (GPL), which describes
the distribution terms for GNU software. But this is usually not the reason you hear people saying "Get the GNU
version of xyz, it's much better." GNU software is generally more robust, and performs better, than standard
Unix versions. In this section we look at some of the reasons why, and at the document that describes the
principles of GNU software design.
[6] This section is adapted from an article by the author that appeared in Issue 16 of Linux Journal. (See
http://www.linuxjournal.com/article.php?sid=1135 .) Reprinted and adapted by permission.

The GNU Coding Standards describes how to write software for the GNU project. It covers a range of topics.
You can read the GNU Coding Standards online at http://www.gnu.org/prep/standards.html. See the
online version for pointers to the source files in other formats.

In this section, we describe only those parts of the GNU Coding Standards that relate to program design and
implementation.

1.4.1. Program Design

Chapter 3 of the GNU Coding Standards provides general advice about program design. The four main issues
are compatibility (with standards and Unix), the language to write in, reliance on nonstandard features of other
programs (in a word, "none"), and the meaning of "portability."

Compatibility with Standard C and POSIX, and to a lesser extent, with Berkeley Unix is an important goal. But
it's not an overriding one. The general idea is to provide all necessary functionality, with command-line switches to
provide a strict ISO or POSIX mode.

C is the preferred language for writing GNU software since it is the most commonly available language. In the
Unix world, Standard C is now common, but if you can easily support Original C, you should do so. Although the
coding standards prefer C over C++, C++ is now commonplace too. One widely used GNU package written in
C++ is groff (GNU troff). With GCC supporting C++, it has been our experience that installing groff is not
difficult.

The standards state that portability is a bit of a red herring. GNU utilities are ultimately intended to run on the
GNU kernel with the GNU C Library.[7] But since the kernel isn't finished yet and users are using GNU tools on
non-GNU systems, portability is desirable, just not paramount. The standard recommends using Autoconf for
achieving portability among different Unix systems.
[7] This
statement refers to the HURD kernel, which is still under development (as of early 2004). GCC and GNU C Library
(GLIBC) development take place mostly on Linux-based systems today.

1.4.2. Program Behavior

Chapter 4 of the GNU Coding Standards provides general advice about program behavior. We will return to
look at one of its sections in detail, below. The chapter focuses on program design, formatting error messages,
writing libraries (by making them reentrant), and standards for the command-line interface.

Error message formatting is important since several tools, notably Emacs, use the error messages to help you go
straight to the point in the source file or data file at which an error occurred.

GNU utilities should use a function named getopt_long() for processing the command line. This function
provides command-line option parsing for both traditional Unix-style options ('gawk -F: ...') and GNU-style
long options ('gawk --field-separator=: ...'). All programs should provide --help and --version
options, and when a long name is used in one program, it should be used the same way in other GNU programs.
To this end, there is a rather exhaustive list of long options used by current GNU programs.

As a simple yet obvious example, --verbose is spelled exactly the same way in all GNU programs. Contrast
this to -v, -V, -d, etc., in many Unix programs. Most of Chapter 2, "Arguments, Options, and the Environment,"
page 23, is devoted to the mechanics of argument and option parsing.

1.4.3. C Code Programming

The most substantive part of the GNU Coding Standards is Chapter 5, which describes how to write C code,
covering things like formatting the code, correct use of comments, using C cleanly, naming your functions and
variables, and declaring, or not declaring, standard system functions that you wish to use.

Code formatting is a religious issue; many people have different styles that they prefer. We personally don't like
the FSF's style, and if you look at gawk, which we maintain, you'll see it's formatted in standard K&R style (the
code layout style used in both editions of the Kernighan and Ritchie book). But this is the only variation in gawk
from this part of the coding standards.

Nevertheless, even though we don't like the FSF's style, we feel that when modifying some other program,
sticking to the coding style already used is of the utmost importance. Having a consistent coding style is more
important than which coding style you pick. The GNU Coding Standards also makes this point. (Sometimes,
there is no detectable consistent coding style, in which case the program is probably overdue for a trip through
either GNU indent or Unix's cb.)

What we find important about the chapter on C coding is that the advice is good for any C coding, not just if you
happen to be working on a GNU program. So, if you're just learning C or even if you've been working in C (or
C++) for a while, we recommend this chapter to you since it encapsulates many years of experience.

1.4.4. Things That Make a GNU Program Better

We now examine the section titled Writing Robust Programs in Chapter 4, Program Behavior for All
Programs, of the GNU Coding Standards. This section provides the principles of software design that make
GNU programs better than their Unix counterparts. We quote selected parts of the chapter, with some examples
of cases in which these principles have paid off.

Avoid arbitrary limits on the length or number of any data structure, including file names, lines, files, and
symbols, by allocating all data structures dynamically. In most Unix utilities, "long lines are silently
truncated." This is not acceptable in a GNU utility.

This rule is perhaps the single most important rule in GNU software design—no arbitrary limits. All GNU
utilities should be able to manage arbitrary amounts of data.

While this requirement perhaps makes it harder for the programmer, it makes things much better for the user. At
one point, we had a gawk user who regularly ran an awk program on more than 650,000 files (no, that's not a
typo) to gather statistics. gawk would grow to over 192 megabytes of data space, and the program ran for
around seven CPU hours. He would not have been able to run his program using another awk implementation.[8]
[8] This situation occurred circa 1993; the truism is even more obvious today, as users process gigabytes of log files with
gawk .

Utilities reading files should not drop NUL characters, or any other nonprinting characters including those
with codes above 0177. The only sensible exceptions would be utilities specifically intended for interface to
certain types of terminals or printers that can't handle those characters.

It is also well known that Emacs can edit any arbitrary file, including files containing binary data!

Whenever possible, try to make programs work properly with sequences of bytes that represent multibyte
characters, using encodings such as UTF-8 and others.[9]

Check every system call for an error return, unless you know you wish to ignore errors. Include the system
error text (from perror or equivalent) in every error message resulting from a failing system call, as well as
the name of the file if any and the name of the utility. Just "cannot open foo.c" or "stat failed" is not
sufficient.

[9] Section 13.4, "Can You Spell That for Me, Please?", page 521, provides an overview of multibyte characters and
encodings.

Checking every system call provides robustness. This is another case in which life is harder for the programmer
but better for the user. An error message detailing what exactly went wrong makes finding and solving any
problems much easier.[10]
[10] The mechanics of checking for and reporting errors are discussed in Section 4.3, "Determining What Went Wrong,"
page 86.
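
Here is a minimal sketch of the recommended style. The names myprog and process() are placeholders of
ours, but the message carries all three pieces of information: the utility name, the file name, and the system's
own error text.

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

int process(const char *file)
{
    int fd;

    fd = open(file, O_RDONLY);
    if (fd < 0) {
        /* utility name, file name, and the system error text */
        fprintf(stderr, "myprog: cannot open %s: %s\n",
                file, strerror(errno));
        return -1;
    }
    /* ... read and process the file ... */
    close(fd);
    return 0;
}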

Finally, we quote from Chapter 1 of the GNU Coding Standards, which discusses how to write your program
differently from the way a Unix program may have been written.

For example, Unix utilities were generally optimized to minimize memory use; if you go for speed instead,
your program will be very different. You could keep the entire input file in core and scan it there instead of
using stdio. Use a smarter algorithm discovered more recently than the Unix program. Eliminate use of
temporary files. Do it in one pass instead of two (we did this in the assembler).

Or, on the contrary, emphasize simplicity instead of speed. For some applications, the speed of today's
computers makes simpler algorithms adequate.

Or go for generality. For example, Unix programs often have static tables or fixed-size strings, which make
for arbitrary limits; use dynamic allocation instead. Make sure your program handles NULs and other funny
characters in the input files. Add a programming language for extensibility and write part of the program in
that language.

Or turn some parts of the program into independently usable libraries. Or use a simple garbage collector
instead of tracking precisely when to free memory, or use a new GNU facility such as obstacks.

An excellent example of the difference an algorithm can make is GNU diff. One of our system's early
incarnations was an AT&T 3B1: a system with a MC68010 processor, a whopping two megabytes of memory
and 80 megabytes of disk. We did (and do) lots of editing on the manual for gawk, a file that is almost 28,000
lines long (although at the time, it was only in the 10,000-lines range). We used to use 'diff -c' quite frequently
to look at our changes. On this slow system, switching to GNU diff made a stunning difference in the amount of
time it took for the context diff to appear. The difference is almost entirely due to the better algorithm that GNU
diff uses.

The final paragraph mentions the idea of structuring a program as an independently usable library, with a
command-line wrapper or other interface around it. One example of this is GDB, the GNU debugger, which is
partially implemented as a command-line tool on top of a debugging library. (The separation of the GDB core
functionality from the command interface is an ongoing development project.) This implementation makes it
possible to write a graphical debugging interface on top of the basic debugging functionality.

1.4.5. Parting Thoughts about the "GNU Coding Standards"

The GNU Coding Standards is a worthwhile document to read if you wish to develop new GNU software,
enhance existing GNU software, or just learn how to be a better programmer. The principles and techniques it
espouses are what make GNU software the preferred choice of the Unix community.
1.5. Portability Revisited
Portability is something of a holy grail; always sought after, but not always obtainable, and certainly not easily.
There are several aspects to writing portable code. The GNU Coding Standards discusses many of them. But
there are others as well. Keep portability in mind at both higher and lower levels as you develop. We recommend
these practices:

Code to standards.

Although it can be challenging, it pays to be familiar with the formal standards for the language you're using.
In particular, pay attention to the 1990 and 1999 ISO standards for C and the 2003 standard for C++
since most Linux programming is done in one of those two languages.

Also, the POSIX standard for library and system call interfaces, while large, has broad industry support.
Writing to POSIX greatly improves the chances of successfully moving your code to other systems besides
GNU/Linux. This standard is quite readable; it distills decades of experience and good practice.

Pick the best interface for the job.

If a standard interface does what you need, use it in your code. Use Autoconf to detect an unavailable
interface, and supply a replacement version of it for deficient systems. (For example, some older systems
lack the memmove() function, which is fairly easy to code by hand or to pull from the GLIBC library; a
minimal sketch appears at the end of this section.)

Isolate portability problems behind new interfaces.

Sometimes, you may need to do operating-system-specific tasks that apply on some systems but not on
others. (For example, on some systems, each program has to expand command-line wildcards instead of
the shell doing it.) Create a new interface that does nothing on systems that don't need it but does the
correct thing on systems that do.

Use Autoconf for configuration.

Avoid #ifdef if possible. If not, bury it in low-level library code. Use Autoconf to do the checking for the
tests to be performed with #ifdef.
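
As promised above, here is a minimal hand-coded replacement for memmove(), the kind of file Autoconf
would arrange to compile only on systems that lack the real function. (This is a sketch of ours; the GLIBC
version is more elaborate.)

/* memmove.c --- replacement for deficient systems */

#include <stddef.h>

void *memmove(void *dest, const void *src, size_t n)
{
    char *d = dest;
    const char *s = src;

    if (d < s) {                /* copy forward */
        while (n-- > 0)
            *d++ = *s++;
    } else {                    /* copy backward, in case the regions overlap */
        d += n;
        s += n;
        while (n-- > 0)
            *--d = *--s;
    }
    return dest;
}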
1.6. Suggested Reading

1. The C Programming Language, 2nd edition, by Brian W. Kernighan and Dennis M. Ritchie. Prentice-
Hall, Englewood Cliffs, New Jersey, USA, 1989. ISBN: 0-13-110370-9.

This is the "bible" for C, covering the 1990 version of Standard C. It is a rather dense book, with lots of
information packed into a startlingly small number of pages. You may need to read it through more than
once; doing so is well worth the trouble.

2. C, A Reference Manual, 5th edition, by Samuel P. Harbison III and Guy L. Steele, Jr. Prentice-Hall,
Upper Saddle River, New Jersey, USA, 2002. ISBN: 0-13-089592-X.

This book is also a classic. It covers Original C as well as the 1990 and 1999 standards. Because it is
current, it makes a valuable companion to The C Programming Language. It covers many important
items, such as internationalization-related types and library functions, that aren't in the Kernighan and Ritchie
book.

3. Notes on Programming in C, by Rob Pike, February 21, 1989. Available on the Web from many sites.
Perhaps the most widely cited location is http://www.lysator.liu.se/c/pikestyle.html. (Many
other useful articles are available from one level up: http://www.lysator.liu.se/c/.)

Rob Pike worked for many years at the Bell Labs research center where C and Unix were invented and did
pioneering development there. His notes distill many years of experience into a "philosophy of clarity in
programming" that is well worth reading.

4. The various links at http://www.chris-lott.org/resources/cstyle/. This site includes Rob Pike's


notes and several articles by Henry Spencer. Of particular note is the Recommended C Style and Coding
Standards, originally written at the Bell Labs Indian Hill site.
1.7. Summary

"Files and processes" summarizes the Linux/Unix worldview. The treatment of files as byte streams and
devices as files, and the use of standard input, output, and error, simplify program design and unify the data
access model. The permissions model is simple, yet flexible, applying to both files and directories.

Processes are running programs that have user and group identifiers associated with them for permission
checking, as well as other attributes such as open files and a current working directory.

The most visible difference between Standard C and Original C is the use of function prototypes for stricter
type checking. A good C programmer should be able to read Original-style code, since many existing
programs use it. New code should be written using prototypes.

The GNU Coding Standards describe how to write GNU programs. They provide numerous valuable
techniques and guiding principles for producing robust, usable software. The "no arbitrary limits" principle is
perhaps the single most important of these. This document is required reading for serious programmers.

Making programs portable is a significant challenge. Guidelines and tools help, but ultimately experience is
needed too.
Exercises

1. Read and comment on the article "The GNU Project",[11] by Richard M. Stallman, originally written in
August of 1998.
[11] http://www.gnu.org/gnu/thegnuproject.html
Chapter 2. Arguments, Options, and the
Environment
In this chapter

2.1 Option and Argument Conventions page 24

2.2 Basic Command-Line Processing page 28

2.3 Option Parsing: getopt() and getopt_long() page 30

2.4 The Environment page 40

2.5 Summary page 49

Exercises page 50

Command-line option and argument interpretation is usually the first task of any program. This chapter examines
how C (and C++) programs access their command-line arguments, describes standard routines for parsing
options, and takes a look at the environment.
2.1. Option and Argument Conventions
The word arguments has two meanings. The more technical definition is "all the 'words' on the command line."
For example:

$ ls main.c opts.c process.c

Here, the user typed four "words." All four words are made available to the program as its arguments.

The second definition is more informal: Arguments are all the words on the command line except the command
name. By default, Unix shells separate arguments from each other with whitespace (spaces or TAB characters).
Quoting allows arguments to include whitespace:

$ echo here   are    lots  of     spaces
here are lots of spaces                         The shell "eats" the spaces
$ echo "here   are    lots  of     spaces"
here   are    lots  of     spaces               Spaces are preserved

Quoting is transparent to the running program; echo never sees the double-quote characters. (Double and single
quotes are different in the shell; a discussion of the rules is beyond the scope of this book, which focuses on C
programming.)

Arguments can be further classified as options or operands. In the previous two examples all the arguments were
operands: files for ls and raw text for echo.

Options are special arguments that each program interprets. Options change a program's behavior, or they
provide information to the program. By ancient convention, (almost) universally adhered to, options start with a
dash (a.k.a. hyphen, minus sign) and consist of a single letter. Option arguments are information needed by an
option, as opposed to regular operand arguments. For example, the fgrep program's -f option means "use the
contents of the following file as a list of strings to search for." See Figure 2.1.

Figure 2.1. Command-line components

Thus, patfile is not a data file to search, but rather it's for use by fgrep in defining the list of strings to search
for.

2.1.1. POSIX Conventions

The POSIX standard describes a number of conventions that standard-conforming programs adhere to. Nothing
requires that your programs adhere to these standards, but it's a good idea for them to do so: Linux and Unix
users the world over understand and use these conventions, and if your program doesn't follow them, your users
will be unhappy. (Or you won't have any users!) Furthermore, the functions we discuss later in this chapter relieve
you of the burden of manually adhering to these conventions for each program you write. Here they are,
paraphrased from the standard:

1. Program names should have no less than two and no more than nine characters.

2. Program names should consist of only lowercase letters and digits.

3. Option names should be single alphanumeric characters. Multidigit options should not be allowed. For
vendors implementing the POSIX utilities, the -W option is reserved for vendor-specific options.

4. All options should begin with a '-' character.

5. For options that don't require option arguments, it should be possible to group multiple options after a single
'-' character. (For example, 'foo -a -b -c' and 'foo -abc' should be treated the same way.)

6. When an option does require an option argument, the argument should be separated from the option by a
space (for example, 'fgrep -f patfile').

The standard, however, does allow for historical practice, whereby sometimes the option and the operand
could be in the same string: 'fgrep -fpatfile'. In practice, the getopt() and getopt_long()
functions interpret '-fpatfile' as '-f patfile', not as '-f -p -a -t ...'.

7. Option arguments should not be optional.

This means that when a program documents an option as requiring an option argument, that option's
argument must always be present or else the program will fail. GNU getopt() does provide for optional
option arguments since they're occasionally useful.

8. If an option takes an argument that may have multiple values, the program should receive that argument as a
single string, with values separated by commas or whitespace.

For example, suppose a hypothetical program myprog requires a list of users for its -u option. Then, it
should be invoked in one of these two ways:

myprog -u "arnold,joe,jane" Separate with commas


myprog -u "arnold joe jane" Separate with whitespace

In such a case, you're on your own for splitting out and processing each value (that is, there is no standard
routine), but doing so manually is usually straightforward; see the strtok() sketch at the end of this section.

9. Options should come first on the command line, before operands. Unix versions of getopt() enforce this
convention. GNU getopt() does not by default, although you can tell it to.

10. The special argument '--' indicates the end of all options. Any subsequent arguments on the command line
are treated as operands, even if they begin with a dash.

11. The order in which options are given should not matter. However, for mutually exclusive options, when one
option overrides the setting of another, then (so to speak) the last one wins. If an option that has arguments
is repeated, the program should process the arguments in order. For example, 'myprog -u arnold -u
jane' is the same as 'myprog -u "arnold, jane"'. (You have to enforce this yourself; getopt()
doesn't help you.)

12. It is OK for the order of operands to matter to a program. Each program should document such things.

13. Programs that read or write named files should treat the single argument '-' as meaning standard input or
standard output, as is appropriate for the program.

Note that many standard programs don't follow all of the above conventions. The primary reason is historical
compatibility; many such programs predate the codifying of these conventions.
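
Here is the strtok() sketch promised in item 8. Note that strtok() writes NUL bytes into its argument; that's
acceptable here, since the strings in argv are modifiable:

#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    char *value;

    if (argc < 2)
        return 1;

    /* split "arnold,joe,jane" or "arnold joe jane" into single values */
    for (value = strtok(argv[1], ", \t"); value != NULL;
             value = strtok(NULL, ", \t"))
        printf("value: %s\n", value);
    return 0;
}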

2.1.2. GNU Long Options

As we saw in Section 1.4.2, "Program Behavior", page 16, GNU programs are encouraged to use long options
of the form --help, --verbose, and so on. Such options, since they start with '--', do not conflict with the
POSIX conventions. They also can be easier to remember, and they provide the opportunity for consistency
across all GNU utilities. (For example, --help is the same everywhere, as compared with -h for "help," -i for
"information," and so on.) GNU long options have their own conventions, implemented by the getopt_long()
function:

1. For programs implementing POSIX utilities, every short (single-letter) option should also have a long
option.

2. Additional GNU-specific long options need not have a corresponding short option, but we recommend that
they do.

3. Long options can be abbreviated to the shortest string that remains unique. For example, if there are two
options --verbose and --verbatim, the shortest possible abbreviations are --verbo and --verba.

4. Option arguments are separated from long options either by whitespace or by an = sign. For example,
--sourcefile=/some/file or --sourcefile /some/file.

5. Options and arguments may be interspersed with operands on the command line; getopt_long() will
rearrange things so that all options are processed and then all operands are available sequentially. (This
behavior can be suppressed.)

6. Option arguments can be optional. For such options, the argument is deemed to be present if it's in the same
string as the option. This works only for short options. For example, if -x is such an option, given 'foo -
xYANKEES -y', the argument to -x is 'YANKEES'. For 'foo -x -y', there is no argument to -x.

7. Programs can choose to allow long options to begin with a single dash. (This is common with many X
Window programs.)

Much of this will become clearer when we examine getopt_long() later in the chapter.

The GNU Coding Standards devotes considerable space to listing all the long and short options used by GNU
programs. If you're writing a program that accepts long options, see if option names already in use might make
sense for you to use as well.
2.2. Basic Command-Line Processing
A C program accesses its command-line arguments through its parameters, argc and argv. The argc parameter
is an integer, indicating the number of arguments there are, including the command name. There are two common
ways to declare main(), varying in how argv is declared:

int main(int argc, char *argv[])          int main(int argc, char **argv)
{                                         {
    ...                                       ...
}                                         }

Practically speaking, there's no difference between the two declarations, although the first is conceptually clearer:
argv is an array of pointers to characters. The second is more commonly used: argv is a pointer to a pointer.
Also, the second definition is technically more correct, and it is what we use. Figure 2.2 depicts this situation.

Figure 2.2. Memory for argv

By convention, argv[0] is the program's name. (For details, see Section 9.1.4.3, "Program Names and argv[0],"
page 297.) Subsequent entries are the command line arguments. The final entry in the argv array is a NULL
pointer.

argc indicates how many arguments there are; since C is zero-based, it is always true that 'argv[argc] ==
NULL'. Because of this, particularly in Unix code, you will see different ways of checking for the end of arguments,
such as looping until a counter is greater than or equal to argc, or until 'argv[i] == 0' or while '*argv !=
NULL' and so on. These are all equivalent.
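
For concreteness, here are three equivalent loops; process() is a hypothetical stand-in for whatever the
program actually does with each argument:

int i;
char **p;

for (i = 1; i < argc; i++)              /* count up to argc */
    process(argv[i]);

for (i = 1; argv[i] != 0; i++)          /* stop at the NULL pointer */
    process(argv[i]);

for (p = argv + 1; *p != NULL; p++)     /* advance a pointer instead */
    process(*p);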

2.2.1. The V7 echo Program

Perhaps the simplest example of command-line processing is the V7 echo program, which prints its arguments to
standard output, separated by spaces and terminated with a newline. If the first argument is -n, then the trailing
newline is omitted. (This is used for prompting from shell scripts.) Here's the code:[1]
[1] See /usr/src/cmd/echo.c in the V7 distribution.

 1   #include <stdio.h>
 2
 3   main(argc, argv)                          int main(int argc, char **argv)
 4   int argc;
 5   char *argv[];
 6   {
 7       register int i, nflg;
 8
 9       nflg = 0;
10       if(argc > 1 && argv[1][0] == '-' && argv[1][1] == 'n') {
11           nflg++;
12           argc--;
13           argv++;
14       }
15       for(i=1; i<argc; i++) {
16           fputs(argv[i], stdout);
17           if (i < argc-1)
18               putchar(' ');
19       }
20       if(nflg == 0)
21           putchar('\n');
22       exit(0);
23   }

Only 23 lines! There are two points of interest. First, decrementing argc and simultaneously incrementing argv
(lines 12 and 13) are common ways of skipping initial arguments. Second, the check for -n (line 10) is simplistic.
-no-newline-at-the-end also works. (Compile it and try it!)

Manual option parsing is common in V7 code because the getopt() function hadn't been invented yet.

Finally, here and in other places throughout the book, we see use of the register keyword. At one time, this
keyword provided a hint to the compiler that the given variables should be placed in CPU registers, if possible.
Use of this keyword is obsolete; modern compilers all base register assignment on analysis of the source code,
ignoring the register keyword. We've chosen to leave code using it alone, but you should be aware that it has
no real use anymore.[2]
[2] When we asked Jim Meyering, the Coreutils maintainer, about instances of register in the GNU Coreutils, he gave us
an interesting response. He removes them when modifying code, but otherwise leaves them alone to make it easier to
integrate changes submitted against existing versions.
2.3. Option Parsing: getopt() and getopt_long()
Circa 1980, for System III, the Unix Support Group within AT&T noted that each Unix program used ad hoc
techniques for parsing arguments. To make things easier for users and developers, they developed most of the
conventions we listed earlier. (The statement in the System III intro(1) manpage is considerably less formal than
what's in the POSIX standard, though.)

The Unix Support Group also developed the getopt() function, along with several external variables, to make it
easy to write code that follows the standard conventions. The GNU getopt_long() function supplies a
compatible version of getopt(), as well as making it easy to parse long options of the form described earlier.

2.3.1. Single-Letter Options

The getopt() function is declared as follows:

#include <unistd.h>                                                    POSIX

int getopt(int argc, char *const argv[], const char *optstring);

extern char *optarg;
extern int optind, opterr, optopt;

The arguments argc and argv are normally passed straight from those of main(). optstring is a string of
option letters. If any letter in the string is followed by a colon, then that option is expected to have an argument.

To use getopt(), call it repeatedly from a while loop until it returns -1. Each time that it finds a valid option
letter, it returns that letter. If the option takes an argument, optarg is set to point to it. Consider a program that
accepts a -a option that doesn't take an argument and a -b option that does:

int oc;             /* option character */
char *b_opt_arg;

while ((oc = getopt(argc, argv, "ab:")) != -1) {
    switch (oc) {
    case 'a':
        /* handle -a, set a flag, whatever */
        break;
    case 'b':
        /* handle -b, get arg value from optarg */
        b_opt_arg = optarg;
        break;
    case ':':
        ...         /* error handling, see text */
    case '?':
    default:
        ...         /* error handling, see text */
    }
}

As it works, getopt() sets several variables that control error handling.


char *optarg

The argument for an option, if the option accepts one.

int optind

The current index in argv. When the while loop has finished, remaining operands are found in
argv[optind] through argv[argc-1]. (Remember that 'argv[argc] == NULL'.)

int opterr

When this variable is nonzero (which it is by default), getopt() prints its own error messages for invalid
options and for missing option arguments.

int optopt

When an invalid option character is found, getopt() returns either a '?' or a ':' (see below), and
optopt contains the invalid character that was found.

People being human, it is inevitable that programs will be invoked incorrectly, either with an invalid option or with
a missing option argument. In the normal case, getopt() prints its own messages for these cases and returns the
'?' character. However, you can change its behavior in two ways.

First, by setting opterr to 0 before invoking getopt(), you can force getopt() to remain silent when it finds a
problem.

Second, if the first character in the optstring argument is a colon, then getopt() is silent and it returns a
different character depending upon the error, as follows:

Invalid option

getopt() returns a '?' and optopt contains the invalid option character. (This is the normal behavior.)

Missing option argument

getopt() returns a ':'. If the first character of optstring is not a colon, then getopt() returns a '?',
making this case indistinguishable from the invalid option case.

Thus, making the first character of optstring a colon is a good idea since it allows you to distinguish between
"invalid option" and "missing option argument." The cost is that using the colon also silences getopt(), forcing
you to supply your own error messages. Here is the previous example, this time with error message handling:

int oc;             /* option character */
char *b_opt_arg;

while ((oc = getopt(argc, argv, ":ab:")) != -1) {
    switch (oc) {
    case 'a':
        /* handle -a, set a flag, whatever */
        break;
    case 'b':
        /* handle -b, get arg value from optarg */
        b_opt_arg = optarg;
        break;
    case ':':
        /* missing option argument */
        fprintf(stderr, "%s: option '-%c' requires an argument\n",
                argv[0], optopt);
        break;
    case '?':
    default:
        /* invalid option */
        fprintf(stderr, "%s: option '-%c' is invalid: ignored\n",
                argv[0], optopt);
        break;
    }
}

A word about flag or option variable-naming conventions: Much Unix code uses names of the form xflg for any
given option letter x (for example, nflg in the V7 echo; xflag is also common). This may be great for the
program's author, who happens to know what the x option does without having to check the documentation. But
it's unkind to someone else trying to read the code who doesn't know the meaning of all the option letters by
heart. It is much better to use names that convey the option's meaning, such as no_newline for echo's -n
option.

2.3.2. GNU getopt() and Option Ordering

The standard getopt() function stops looking for options as soon as it encounters a command-line argument
that doesn't start with a '-'. GNU getopt() is different: It scans the entire command line looking for options. As
it goes along, it permutes (rearranges) the elements of argv, so that when it's done, all the options have been
moved to the front and code that proceeds to examine argv[optind] through argv[argc-1] works correctly.
In all cases, the special argument '--' terminates option scanning.

You can change the default behavior by using a special first character in optstring, as follows:
optstring[0] == '+'

GNU getopt() behaves like standard getopt(); it returns options in the order in which they are found,
stopping at the first nonoption argument. This will also be true if POSIXLY_CORRECT exists in the
environment.
optstring[0] == '-'

GNU getopt() returns every command-line argument, whether or not it represents an option. In this
case, for each nonoption argument, the function returns the integer 1 and sets optarg to point to the string.

As for standard getopt(), if the first character of optstring is a ':', then GNU getopt() distinguishes
between "invalid option" and "missing option argument" by returning '?' or ':', respectively. The ':' in
optstring can be the second character if the first character is '+' or '-'.

Finally, if an option letter in optstring is followed by two colon characters, then that option is allowed to have
an optional option argument. (Say that three times fast!) Such an argument is deemed to be present if it's in the
same argv element as the option, and absent otherwise. In the case that it's absent, GNU getopt() returns the
option letter and sets optarg to NULL. For example, given—

while ((c = getopt(argc, argv, "ab::")) != 1)


...

—for -bYANKEES, the return value is 'b', and optarg points to "YANKEES", while for -b or '-b YANKEES', the
return value is still 'b' but optarg is set to NULL. In the latter case, "YANKEES" is a separate command-line
argument.

2.3.3. Long Options


The getopt_long() function handles the parsing of long options of the form described earlier. An additional
routine, getopt_long_only() works identically, but it is used for programs where all options are long and
options begin with a single '-' character. Otherwise, both work just like the simpler GNU getopt() function.
(For brevity, whenever we say "getopt_long()," it's as if we'd said "getopt_long() and
getopt_long_only().") Here are the declarations, from the GNU/Linux getopt(3) manpage:

#include <getopt.h>                                                    GLIBC

int getopt_long(int argc, char *const argv[],
                const char *optstring,
                const struct option *longopts, int *longindex);

int getopt_long_only(int argc, char *const argv[],
                     const char *optstring,
                     const struct option *longopts, int *longindex);

The first three arguments are the same as for getopt(). The next argument is a pointer to an array of struct
option, which we refer to as the long options table and which is described shortly. The longindex parameter,
if not set to NULL, points to a variable which is filled in with the index in longopts of the long option that was
found. This is useful for error diagnostics, for example.

2.3.3.1. Long Options Table

Long options are described with an array of struct option structures. The struct option is declared in
<getopt.h>; it looks like this:

struct option {
const char *name;
int has_arg;
int *flag;
int val;
};

The elements in the structure are as follows:


const char *name

This is the name of the option, without any leading dashes, for example, "help" or "verbose".

int has_arg

This describes whether the long option has an argument, and if so, what kind of argument. The value must
be one of those presented in Table 2.1.

Table 2.1. Values for has_arg

Symbolic constant       Numeric value   Meaning
no_argument             0               The option does not take an argument.
required_argument       1               The option requires an argument.
optional_argument       2               The option's argument is optional.

The symbolic constants are macros for the numeric values given in the table. While the numeric values work,
the symbolic constants are considerably easier to read, and you should use them instead of the
corresponding numbers in any code that you write.
int *flag

If this pointer is NULL, then getopt_long() returns the value in the val field of the structure. If it's not
NULL, the variable it points to is filled in with the value in val and getopt_long() returns 0. If the flag
isn't NULL but the long option is never seen, then the pointed-to variable is not changed.
int val

This is the value to return if the long option is seen or to load into *flag if flag is not NULL. Typically, if
flag is not NULL, then val is a true/false value, such as 1 or 0. On the other hand, if flag is NULL, then
val is usually a character constant. If the long option corresponds to a short one, the character constant
should be the same one that appears in the optstring argument for this option. (All of this will become
clearer shortly when we see some examples.)

Each long option has a single entry with the values appropriately filled in. The last element in the array should have
zeros for all the values. The array need not be sorted; getopt_long() does a linear search. However, sorting it
by long name may make it easier for a programmer to read.

The use of flag and val seems confusing at first encounter. Let's step back for a moment and examine why it
works the way it does. Most of the time, option processing consists of setting different flag variables when
different option letters are seen, like so:

while ((c = getopt(argc, argv, ":af:hv")) != -1) {


switch (c) {
case 'a':
do_all = 1;
break;
case 'f':
myfile = optarg;
break;
case 'h':
do_help = 1;
break;
case 'v':
do_verbose = 1;
break;
... Error handling code here
}
}

When flag is not NULL, getopt_long() sets the variable for you. This reduces the three cases in the
previous switch to one case. Here is an example long options table and the code to go with it:

int do_all, do_help, do_verbose;        /* flag variables */
char *myfile;

struct option longopts[] = {
    { "all",     no_argument,       & do_all,     1   },
    { "file",    required_argument, NULL,         'f' },
    { "help",    no_argument,       & do_help,    1   },
    { "verbose", no_argument,       & do_verbose, 1   },
    { 0, 0, 0, 0 }
};
...
while ((c = getopt_long(argc, argv, ":f:", longopts, NULL)) != -1) {
    switch (c) {
    case 'f':
        myfile = optarg;
        break;
    case 0:
        /* getopt_long() set a variable, just keep going */
        break;
    ...                                 Error handling code here
    }
}

Notice that the value passed for the optstring argument no longer contains 'a', 'h', or 'v'. This means that
the corresponding short options are not accepted. To allow both long and short options, you would have to
restore the corresponding cases from the first example to the switch.

Practically speaking, you should write your programs such that each short option also has a corresponding long
option. In this case, it's easiest to have flag be NULL and val be the corresponding single letter.

2.3.3.2. Long Options, POSIX Style

The POSIX standard reserves the -W option for vendor-specific features. Thus, by definition, -W isn't portable
across different systems.

If W appears in the optstring argument followed by a semicolon (note: not a colon), then getopt_long()
treats -Wlongopt the same as --longopt. Thus, in the previous example, change the call to be:

while ((c = getopt_long(argc, argv, ":f:W;", longopts, NULL)) != -1) {

With this change, -Wall is the same as --all and -Wfile=myfile is the same as --file=myfile. The use of
a semicolon makes it possible for a program to use -W as a regular option, if desired. (For example, GCC uses it
as a regular option, whereas gawk uses it for POSIX conformance.)

2.3.3.3. getopt_long() Return Value Summary

As should be clear by now, getopt_long() provides a flexible mechanism for option parsing. Table 2.2
summarizes the possible return values and their meaning.

Table 2.2. getopt_long() return values


Return code   Meaning
0             getopt_long() set a flag as found in the long option table.
1             optarg points at a plain command-line argument.
'?'           Invalid option.
':'           Missing option argument.
'x'           Option character 'x'.
-1            End of options.

Finally, we enhance the previous example code, showing the full switch statement:

int do_all, do_help, do_verbose;        /* flag variables */
char *myfile, *user;                    /* input file, user name */

struct option longopts[] = {
    { "all",     no_argument,       & do_all,     1   },
    { "file",    required_argument, NULL,         'f' },
    { "help",    no_argument,       & do_help,    1   },
    { "verbose", no_argument,       & do_verbose, 1   },
    { "user",    optional_argument, NULL,         'u' },
    { 0, 0, 0, 0 }
};
...
while ((c = getopt_long(argc, argv, ":ahvf:u::W;", longopts, NULL)) != -1) {
    switch (c) {
    case 'a':
        do_all = 1;
        break;
    case 'f':
        myfile = optarg;
        break;
    case 'h':
        do_help = 1;
        break;
    case 'u':
        if (optarg != NULL)
            user = optarg;
        else
            user = "root";
        break;
    case 'v':
        do_verbose = 1;
        break;
    case 0:     /* getopt_long() set a variable, just keep going */
        break;
#if 0
    case 1:
        /*
         * Use this case if getopt_long() should go through all
         * arguments. If so, add a leading '-' character to optstring.
         * Actual code, if any, goes here.
         */
        break;
#endif
    case ':':   /* missing option argument */
        fprintf(stderr, "%s: option `-%c' requires an argument\n",
                argv[0], optopt);
        break;
    case '?':
    default:    /* invalid option */
        fprintf(stderr, "%s: option `-%c' is invalid: ignored\n",
                argv[0], optopt);
        break;
    }
}

In your programs, you may wish to have comments for each option letter explaining what each one does.
However, if you've used descriptive variable names for each option letter, comments are not as necessary.
(Compare do_verbose to vflg.)

2.3.3.4. GNU getopt() or getopt_long() in User Programs

You may wish to use GNU getopt() or getopt_long() in your own programs and have them run on non-
Linux systems. That's OK; just copy the source files from a GNU program or from the GNU C Library (GLIBC)
CVS archive.[3] The source files are getopt.h, getopt.c, and getopt1.c. They are licensed under the GNU
Lesser General Public License, which allows library functions to be included even in proprietary programs. You
should include a copy of the file COPYING.LIB with your program, along with the files getopt.h, getopt.c,
and getopt1.c.
[3] See http://sources.redhat.com .

Include the source files in your distribution, and compile them along with any other source files. In your source
code that calls getopt_long(), use '#include <getopt.h>', not '#include "getopt.h"'. Then, when
compiling, add -I. to the C compiler's command line. That way, the local copy of the header file will be found
first.

You may be wondering, "Gee, I already use GNU/Linux. Why should I include getopt_long() in my
executable, making it bigger, if the routine is already in the C library?" That's a good question. However, there's
nothing to worry about. The source code is set up so that if it's compiled on a system that uses GLIBC, the
compiled files will not contain any code! Here's the proof, on our system:

$ uname -a                                Show system name and type
Linux example 2.4.18-14 #1 Wed Sep 4 13:35:50 EDT 2002 i686 i686 i386 GNU/Linux
$ ls -l getopt.o getopt1.o                Show file sizes
-rw-r--r--    1 arnold   devel      9836 Mar 24 13:55 getopt.o
-rw-r--r--    1 arnold   devel     10324 Mar 24 13:55 getopt1.o
$ size getopt.o getopt1.o                 Show sizes included in executable
   text    data     bss     dec     hex filename
      0       0       0       0       0 getopt.o
      0       0       0       0       0 getopt1.o

The size command prints the sizes of the various parts of a binary object or executable file. We explain the
output in Section 3.1, "Linux/Unix Address Space," page 52. What's important to understand right now is that,
despite the nonzero sizes of the files themselves, they don't contribute anything to the final executable. (We think
this is pretty neat.)
2.4. The Environment
The environment is a set of 'name=value' pairs for each program. These pairs are termed environment
variables. Each name consists of one to any number of alphanumeric characters or underscores ('_'), but the
name may not start with a digit. (This rule is enforced by the shell; the C API can put anything it wants to into the
environment, at the likely cost of confusing subsequent programs.)

Environment variables are often used to control program behavior. For example, if POSIXLY_CORRECT exists in
the environment, many GNU programs disable extensions or historical behavior that isn't compatible with the
POSIX standard.

You can decide (and should document) the environment variables that your program will use to control its
behavior. For example, you may wish to use an environment variable for debugging options instead of a
command-line argument. The advantage of using environment variables is that users can set them in their startup
file and not have to remember to always supply a particular set of command-line options.

Of course, the disadvantage to using environment variables is that they can silently change a program's behavior.
Jim Meyering, the maintainer of the Coreutils, put it this way:

It makes it easy for the user to customize how the program works without changing how the program is
invoked. That can be both a blessing and a curse. If you write a script that depends on your having a certain
environment variable set, but then have someone else use that same script, it may well fail (or worse, silently
produce invalid results) if that other person doesn't have the same environment settings.

2.4.1. Environment Management Functions

Several functions let you retrieve the values of environment variables, change their values, or remove them. Here
are the declarations:

#include <stdlib.h>

char *getenv(const char *name);                    ISO C: Retrieve environment variable
int setenv(const char *name, const char *value,    POSIX: Set environment variable
           int overwrite);
int putenv(char *string);                          XSI: Set environment variable, uses string
void unsetenv(const char *name);                   POSIX: Remove environment variable
int clearenv(void);                                Common: Clear entire environment

The getenv() function is the one you will use 99 percent of the time. The argument is the environment variable
name to look up, such as "HOME" or "PATH". If the variable exists, getenv() returns a pointer to the character
string value. If not, it returns NULL. For example:

char *pathval;

/* Look for PATH; if not present, supply a default value */
if ((pathval = getenv("PATH")) == NULL)
    pathval = "/bin:/usr/bin:/usr/ucb";

Occasionally, environment variables exist, but with empty values. In this case, the return value will be non-NULL,
but the first character pointed to will be the zero byte, which is the C string terminator, '\0'. Your code should
therefore check both that the return value is not NULL and, if you intend to use the value, that the string is not
empty. In any case, don't blindly use the returned value.
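
A short sketch of the careful style, treating an unset variable and an empty one the same way (TMPDIR is
just a convenient example):

char *tmpdir;

tmpdir = getenv("TMPDIR");              /* may be missing or empty */
if (tmpdir == NULL || *tmpdir == '\0')
    tmpdir = "/tmp";                    /* fall back to a sensible default */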

To change an environment variable or to add a new one to the environment, use setenv():

if (setenv("PATH", "/bin:/usr/bin:/usr/ucb", 1) != 0) {
/* handle failure */
}

It's possible that a variable already exists in the environment. If the third argument is true (nonzero), then the
supplied value overwrites the previous one. Otherwise, it doesn't. The return value is -1 if there was no memory
for the new variable, and 0 otherwise. setenv() makes private copies of both the variable name and the new
value for storing in the environment.

A simpler alternative to setenv() is putenv(), which takes a single "name=value" string and places it in the
environment:

if (putenv("PATH=/bin:/usr/bin:/usr/ucb") != 0) {
/* handle failure */
}

putenv() blindly replaces any previous value for the same variable. Also, and perhaps more importantly, the
string passed to putenv() is placed directly into the environment. This means that if your code later modifies this
string (for example, if it was an array, not a string constant), the environment is modified also. This in turn means
that you should not use a local variable as the parameter for putenv(). For all these reasons setenv() is
preferred.
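
The following sketch demonstrates the aliasing problem (DEBUG is a made-up variable name). Because
putenv() stores the array itself in the environment, the second strcpy() silently rewrites the variable's value:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    static char option[32];     /* static, NOT automatic: the environment
                                   points directly into this array */

    strcpy(option, "DEBUG=yes");
    if (putenv(option) != 0)
        return 1;

    strcpy(option, "DEBUG=no");                    /* changes the environment too! */
    printf("DEBUG is now %s\n", getenv("DEBUG"));  /* prints "DEBUG is now no" */
    return 0;
}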

Note

The GNU putenv() has an additional (documented) quirk to its behavior. If the argument string is just a
variable name, with no = character, then the named variable is removed from the environment. The GNU
env program, which we look at later in this chapter, relies on this behavior.

The unsetenv() function removes a variable from the environment:

unsetenv("PATH");

Finally, the clearenv() function clears the environment entirely:

if (clearenv() != 0) {
    /* handle failure */
}

This function is not standardized by POSIX, although it's available in GNU/Linux and several commercial Unix
variants. You should use it if your application must be very security conscious and you want it to build its own
environment entirely from scratch. If clearenv() is not available, the GNU/Linux clearenv(3) manpage
recommends using 'environ = NULL;' to accomplish the task.
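
A sketch of such a fallback follows; HAVE_CLEARENV stands in for an Autoconf-style feature-test macro
of your own:

#include <stdlib.h>

extern char **environ;

/* Start from an empty environment, then add back only what's needed. */
void fresh_environment(void)
{
#ifdef HAVE_CLEARENV
    if (clearenv() != 0)
        abort();
#else
    environ = NULL;             /* the fallback the manpage recommends */
#endif
    setenv("PATH", "/bin:/usr/bin", 1);
}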

2.4.2. The Entire Environment: environ

The correct way to deal with the environment is through the functions described in the previous section. However,
it's worth a look at how things are managed "under the hood."

The external variable environ provides access to the environment in the same way that argv provides access to
the command-line arguments. You must declare the variable yourself. Although standardized by POSIX,
environ is purposely not declared by any standardized header file. (This seems to evolve from historical
practice.) Here is the declaration:

extern char **environ; /* Look Ma, no header file! */ POSIX

Like argv, the final element in environ is NULL. There is no "environment count" variable that corresponds to
argc, however. This simple program prints out the entire environment:

/* ch02-printenv.c --- Print out the environment. */

#include <stdio.h>

extern char **environ;

int main(int argc, char **argv)
{
    int i;

    if (environ != NULL)
        for (i = 0; environ[i] != NULL; i++)
            printf("%s\n", environ[i]);

    return 0;
}

Although it's unlikely to happen, this program makes sure that environ isn't NULL before attempting to use it.

Variables are kept in the environment in random order. Although some Unix shells keep the environment sorted
by variable name, there is no formal requirement that this be so, and many shells don't keep them sorted.

As something of a quirk of the implementation, you can access the environment by declaring a third parameter to
main():

int main(int argc, char **argv, char **envp)
{
...
}

You can then use envp as you would have used environ. Although you may see this occasionally in old code,
we don't recommend its use; environ is the official, standard, portable way to access the entire environment,
should you need to do so.

2.4.3. GNU env

To round off the chapter, here is the GNU version of the env command. This command adds variables to the
environment for the duration of one command. It can also be used to clear the environment for that command or
to remove specific environment variables. The program serves double-duty for us, since it demonstrates both
getopt_long() and several of the functions discussed in this section. Here is how the program is invoked:

$ env --help
Usage: env [OPTION] ... [-] [NAME=VALUE] ... [COMMAND [ARG] ...]
Set each NAME to VALUE in the environment and run COMMAND.

-i, --ignore-environment start with an empty environment
-u, --unset=NAME remove variable from the environment
--help display this help and exit
--version output version information and exit

A mere - implies -i. If no COMMAND, print the resulting environment.

Report bugs to <[email protected]>.

Here are some sample invocations:

$ env - myprog arg1 Clear environment, run program with arg
$ env - PATH=/bin:/usr/bin myprog arg1 Clear environment, add PATH, run program
$ env -u IFS PATH=/bin:/usr/bin myprog arg1 Unset IFS, add PATH, run program

The code begins with a standard GNU copyright statement and explanatory comment. We have omitted both for
brevity. (The copyright statement is discussed in Appendix C, "GNU General Public License", page 657.
The --help output shown previously is enough to understand how the program works.) Following the copyright and
comments are header includes and declarations. The 'N_("string")' macro invocation (line 93) is for use in
internationalization and localization of the software, topics covered in Chapter 13, "Internationalization and
Localization," page 485. For now, you can treat it as if it were the contained string constant.

80 #include <config.h>
81 #include <stdio.h>
82 #include <getopt.h>
83 #include <sys/types.h>
84 #include <getopt.h>
85
86 #include "system.h"
87 #include "error.h"
88 #include "closeout.h"
89
90 /* The official name of this program (e.g., no 'g' prefix). */
91 #define PROGRAM_NAME "env"
92
93 #define AUTHORS N_ ("Richard Mlynarik and David MacKenzie")
94
95 int putenv();
96
97 extern char **environ;
98
99 /* The name by which this program was run. */
100 char *program_name;
101
102 static struct option const longopts[] =
103 {
104 {"ignore-environment", no_argument, NULL, 'i'},
105 {"unset", required_argument, NULL, 'u'},
106 {GETOPT_HELP_OPTION_DECL},
107 {GETOPT_VERSION_OPTION_DECL},
108 {NULL, 0, NULL, 0}
109 };

The GNU Coreutils contain a large number of programs, many of which perform the same common tasks (for
example, argument parsing). To make maintenance easier, many common idioms are defined as macros.
GETOPT_HELP_OPTION_DECL and GETOPT_VERSION_OPTION_DECL (lines 106 and 107) are two such. We examine
their definitions shortly. The first function, usage(), prints the usage information and exits. The _("string")
macro (line 115, and used throughout the program) is also for internationalization, and for now you should also
treat it as if it were the contained string constant.

111 void
112 usage (int status)
113 {
114 if (status != 0)
115 fprintf (stderr, _("Try '%s --help' for more information.\n"),
116 program_name);
117 else
118 {
119 printf (_("\
120 Usage: %s [OPTION]... [-] [NAME=VALUE]... [COMMAND [ARG] ...]\n"),
121 program_name);
122 fputs (_("\
123 Set each NAME to VALUE in the environment and run COMMAND.\n\
124 \n\
125 -i, --ignore-environment start with an empty environment\n\
126 -u, --unset=NAME remove variable from the environment\n\
127 "), stdout);
128 fputs (HELP_OPTION_DESCRIPTION, stdout);
129 fputs (VERSION_OPTION_DESCRIPTION, stdout);
130 fputs (_("\
131 \n\
132 A mere - implies -i. If no COMMAND, print the resulting environment.\n\
133 "), stdout);
134 printf (_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
135 }
136 exit (status);
137 }

The first part of main() declares variables and sets up the internationalization. The functions setlocale(),
bindtextdomain(), and textdomain() (lines 147–149) are all discussed in Chapter 13, "Internationalization
and Localization," page 485. Note that this program does use the envp argument to main() (line 140). It is the
only one of the Coreutils programs to do so. Finally, the call to atexit() on line 151 (see Section 9.1.5.3,
"Exiting Functions," page 302) registers a Coreutils library function that flushes all pending output and closes
stdout, reporting a message if there were problems. The next bit processes the command-line arguments, using
getopt_long().

139 int
140 main (register int argc, register char **argv, char **envp)
141 {
142 char *dummy_environ[1];
143 int optc;
144 int ignore_environment = 0;
145
146 program_name = argv[0];
147 setlocale (LC_ALL, "");
148 bindtextdomain (PACKAGE, LOCALEDIR);
149 textdomain (PACKAGE);
150
151 atexit (close_stdout);
152
153 while ((optc = getopt_long (argc, argv, "+iu:", longopts, NULL)) != -1)
154 {
155 switch (optc)
156 {
157 case 0:
158 break;
159 case 'i':
160 ignore_environment = 1;
161 break;
162 case 'u':
163 break;
164 case_GETOPT_HELP_CHAR;
165 case_GETOPT_VERSION_CHAR (PROGRAM_NAME, AUTHORS);
166 default:
167 usage (2);
168 }
169 }
170
171 if (optind != argc && !strcmp (argv[optind], "-"))
172 ignore_environment = 1;

Here are the macros, from src/sys2.h in the Coreutils distribution, that define the declarations we saw earlier
and the 'case_GETOPT_xxx' macros used above (lines 164–165):

/* Factor out some of the common --help and --version processing code. */

/* These enum values cannot possibly conflict with the option values
ordinarily used by commands, including CHAR_MAX + 1, etc. Avoid
CHAR_MIN - 1, as it may equal -1, the getopt end-of-options value. */
enum
{
GETOPT_HELP_CHAR = (CHAR_MIN - 2),
GETOPT_VERSION_CHAR = (CHAR_MIN - 3)
};

#define GETOPT_HELP_OPTION_DECL \
"help", no_argument, 0, GETOPT_HELP_CHAR
#define GETOPT_VERSION_OPTION_DECL \
"version", no_argument, 0, GETOPT_VERSION_CHAR
#define case_GETOPT_HELP_CHAR \
case GETOPT_HELP_CHAR: \
usage (EXIT_SUCCESS); \
break;

#define case_GETOPT_VERSION_CHAR(Program_name, Authors) \
case GETOPT_VERSION_CHAR: \
version_etc (stdout, Program_name, PACKAGE, VERSION, Authors); \
exit (EXIT_SUCCESS); \
break;

The upshot of this code is that --help prints the usage message and --version prints version information. Both
exit successfully. ("Success" and "failure" exit statuses are described in Section 9.1.5.1, "Defining Process Exit
Status," page 300.) Given that the Coreutils have dozens of utilities, it makes sense to factor out and standardize
as much repetitive code as possible.

Returning to env.c:

174 environ = dummy_environ;
175 environ[0] = NULL;
176
177 if (!ignore_environment)
178 for (; *envp; envp++)
179 putenv (*envp);
180
181 optind = 0; /* Force GNU getopt to re-initialize. */
182 while ((optc = getopt_long (argc, argv, "+iu:", longopts, NULL)) != -1)
183 if (optc == 'u')
184 putenv (optarg); /* Requires GNU putenv. */
185
186 if (optind != argc && !strcmp (argv[optind], "-")) Skip options
187 ++optind;
188
189 while (optind < argc && strchr (argv[optind], '=')) Set environment variables
190 putenv (argv[optind++]);
191
192 /* If no program is specified, print the environment and exit. */
193 if (optind == argc)
194 {
195 while (*environ)
196 puts (*environ++);
197 exit (EXIT_SUCCESS);
198 }

Lines 174–179 rebuild the environment from scratch. The global variable environ is set to point to an empty
local array, and each variable from the original environment is then reinstalled with putenv(). The envp
parameter maintains access to the original environment.

Lines 181–184 remove any environment variables as requested by the -u option. The program does this by
rescanning the command line and removing names listed there. Environment variable removal relies on the GNU
putenv() behavior discussed earlier: that when called with a plain variable name, putenv() removes the
environment variable.

After any options, new or replacement environment variables are supplied on the command line. Lines 189–190
continue scanning the command line, looking for environment variable settings of the form 'name=value'.
Upon reaching line 192, if nothing is left on the command line, env is supposed to print the new environment, and
exit. It does so (lines 195–197).

If arguments are left, they represent a command name to run and arguments to pass to that new command. This is
done with the execvp() system call (line 200), which replaces the current program with the new one. (This call
is discussed in Section 9.1.4, "Starting New Programs: The exec() Family," page 293; don't worry about the
details for now.) If this call returns to the current program, it failed. In such a case, env prints an error message
and exits.

200 execvp (argv[optind], &argv[optind]);
201
202 {
203 int exit_status = (errno == ENOENT ? 127 : 126);
204 error (0, errno, "%s", argv[optind]);
205 exit (exit_status);
206 }
207 }

The exit status values, 126 and 127 (determined on line 203), conform to POSIX. 127 means the program that
execvp() attempted to run didn't exist. (ENOENT means the file doesn't have an entry in the directory.) 126
means that the file exists, but something else went wrong.
2.5. Summary

C programs access their command-line arguments through the parameters argc and argv. The getopt()
function provides a standard way for consistent parsing of options and their arguments. The GNU version of
getopt() provides some extensions, and getopt_long() and getopt_long_only() make it possible
to easily parse long-style options.

The environment is a set of 'name=value' pairs that each program inherits from its parent. Programs can, at
their author's whim, use environment variables to change their behavior, in addition to any command-line
arguments. Standard routines (getenv(), setenv(), putenv(), and unsetenv()) exist for retrieving
environment variable values, changing them, or removing them. If necessary, the entire environment is
available through the external variable environ or through the char **envp third argument to main().
The latter technique is discouraged.
Exercises

1. Assume a program accepts options -a, -b, and -c, and that -b requires an argument. Write the manual
argument parsing code for this program, without using getopt() or getopt_long(). Accept -- to end
option processing. Make sure that -ac works, as do -bYANKEES, -b YANKEES, and -abYANKEES. Test
your program.

2. Implement getopt(). For the first version, don't worry about the case in which optstring[0] == ':'.
You may also ignore opterr.

3. Add code for optstring[0] == ':' and opterr to your version of getopt().

4. Print and read the GNU getopt.h, getopt.c, and getopt1.c files.

5. Write a program that declares both environ and envp and compares their values.

6. Parsing command-line arguments and options is a wheel that many people can't refrain from reinventing.
Besides getopt() and getopt_long(), you may wish to examine different argument-parsing packages,
such as:

The Plan 9 From Bell Labs arg(2) argument-parsing library,[4]
[4] http://plan9.bell-labs.com/magic/man2html/2/arg

Argp,[5]
[5] http://www.gnu.org/manual/glibc/html_node/Argp.html

Argv,[6]
[6] http://256.com/sources/argv

Autoopts,[7]
[7] http://autogen.sourceforge.net/autoopts.html

GNU Gengetopt,[8]
[8] ftp://ftp.gnu.org/gnu/gengetopt/

Opt,[9]
[9] http://nis-www.lanl.gov/~jt/software/opt/opt-3.19.tar.gz

Popt.[10] See also the popt(3) manpage on a GNU/Linux system.
[10] http://freshmeat.net/projects/popt/?topic_id=809

7. Extra credit: Why can't a C compiler completely ignore the register keyword? Hint: What operation
cannot be applied to a register variable?
Chapter 3. User-Level Memory Management
In this chapter

3.1 Linux/Unix Address Space page 52

3.2 Memory Allocation page 56

3.3 Summary page 80

Exercises page 81

Without memory for storing data, it's impossible for a program to get any work done. (Or rather, it's impossible
to get any useful work done.) Real-world programs can't afford to rely on fixed-size buffers or arrays of data
structures. They have to be able to handle inputs of varying sizes, from small to large. This in turn leads to the use
of dynamically allocated memory—memory allocated at runtime instead of at compile time. This is how the
GNU "no arbitrary limits" principle is put into action.

Because dynamically allocated memory is such a basic building block for real-world programs, we cover it early,
before looking at everything else there is to do. Our discussion focuses exclusively on the user-level view of the
process and its memory; it has nothing to do with CPU architecture.
3.1. Linux/Unix Address Space
For a working definition, we've said that a process is a running program. This means that the operating system has
loaded the executable file for the program into memory, has arranged for it to have access to its command-line
arguments and environment variables, and has started it running. A process has five conceptually different areas of
memory allocated to it:

Code

Often referred to as the text segment, this is the area in which the executable instructions reside. Linux and
Unix arrange things so that multiple running instances of the same program share their code if possible; only
one copy of the instructions for the same program resides in memory at any time. (This is transparent to the
running programs.) The portion of the executable file containing the text segment is the text section.

Initialized data

Statically allocated and global data that are initialized with nonzero values live in the data segment. Each
process running the same program has its own data segment. The portion of the executable file containing
the data segment is the data section.

Zero-initialized data

Global and statically allocated data that are initialized to zero by default are kept in what is colloquially
called the BSS area of the process.[1] Each process running the same program has its own BSS area. When
running, the BSS data are placed in the data segment. In the executable file, they are stored in the BSS
section.
[1] BSS is an acronym for "Block Started by Symbol," a mnemonic from the IBM 7094 assembler.

The format of a Linux/Unix executable is such that only variables that are initialized to a nonzero value
occupy space in the executable's disk file. Thus, a large array declared 'static char somebuf[2048];',
which is automatically zero-filled, does not take up 2 KB worth of disk space. (Some compilers have
options that let you place zero-initialized data into the data segment.)

Heap

The heap is where dynamic memory (obtained by malloc() and friends) comes from. As memory is
allocated on the heap, the process's address space grows, as you can see by watching a running program
with the ps command.

Although it is possible to give memory back to the system and shrink a process's address space, this is
almost never done. (We distinguish between releasing no-longer-needed dynamic memory and shrinking the
address space; this is discussed in more detail later in this chapter.)

It is typical for the heap to "grow upward." This means that successive items that are added to the heap are
added at addresses that are numerically greater than previous items. It is also typical for the heap to start
immediately after the BSS area of the data segment.

Stack

The stack segment is where local variables are allocated. Local variables are all variables declared inside
the opening left brace of a function body (or other left brace) that aren't defined as static.

On most architectures, function parameters are also placed on the stack, as well as "invisible" bookkeeping
information generated by the compiler, such as room for a function return value and storage for the return
address representing the return from a function to its caller. (Some architectures do all this with registers.)

It is the use of a stack for function parameters and return values that makes it convenient to write recursive
functions (functions that call themselves).

Variables stored on the stack "disappear" when the function containing them returns; the space on the stack
is reused for subsequent function calls.

On most modern architectures, the stack "grows downward," meaning that items deeper in the call chain are
at numerically lower addresses.
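
A minimal sketch can make this layout concrete by printing one representative address from each area.
(This program is our own illustration, not the ch03-memaddr.c program shown later in the chapter; error
checking is omitted for brevity.)

#include <stdio.h>
#include <stdlib.h>

int init_data = 42; /* initialized data segment */
int bss_data; /* BSS: zero-initialized by default */

int main(void)
{
    int stack_var; /* stack */
    char *heap_var = malloc(16); /* heap */

    printf("code:  %p\n", (void *) main); /* function pointer cast: fine on GNU/Linux */
    printf("data:  %p\n", (void *) & init_data);
    printf("bss:   %p\n", (void *) & bss_data);
    printf("heap:  %p\n", (void *) heap_var);
    printf("stack: %p\n", (void *) & stack_var);

    free(heap_var);
    return 0;
}

On a typical GNU/Linux system, the addresses usually print lowest to highest in the order code, data, BSS,
heap, stack, matching the layout just described.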

When a program is running, the initialized data, BSS, and heap areas are usually placed into a single contiguous
area: the data segment. The stack segment and code segment are separate from the data segment and from each
other. This is illustrated in Figure 3.1.

Figure 3.1. Linux/Unix process address space

Although it's theoretically possible for the stack and heap to grow into each other, the operating system prevents
that event, and any program that tries to make it happen is asking for trouble. This is particularly true on modern
systems, on which process address spaces are large and the gap between the top of the stack and the end of the
heap is a big one. The different memory areas can have different hardware memory protection assigned to them.
For example, the text segment might be marked "execute only," whereas the data and stack segments would have
execute permission disabled. This practice can prevent certain kinds of security attacks. The details, of course,
are hardware and operating-system specific and likely to change over time. Of note is that both Standard C and
C++ allow const items to be placed in read-only memory. The relationship among the different segments is
summarized in Table 3.1.

Table 3.1. Executable program segments and their locations

Program memory segment    Address space segment    Executable file section
Code                      Text                     Text
Initialized data          Data                     Data
BSS                       Data                     BSS
Heap                      Data
Stack                     Stack

The size program prints out the size in bytes of each of the text, data, and BSS sections, along with the total size
in decimal and hexadecimal. (The ch03-memaddr.c program is shown later in this chapter; see Section 3.2.5,
"Address Space Examination," page 78.)

$ cc ch03-memaddr.c -o ch03-memaddr Compile the program
$ ls -l ch03-memaddr Show total size
-rwxr-xr-x 1 arnold devel 12320 Nov 24 16:45 ch03-memaddr
$ size ch03-memaddr Show component sizes
text data bss dec hex filename
1458 276 8 1742 6ce ch03-memaddr
$ strip ch03-memaddr Remove symbols
$ ls -l ch03-memaddr Show total size again
-rwxr-xr-x 1 arnold devel 3480 Nov 24 16:45 ch03-memaddr
$ size ch03-memaddr Component sizes haven't changed
text data bss dec hex filename
1458 276 8 1742 6ce ch03-memaddr

The total size of what gets loaded into memory is only 1742 bytes, in a file that is 12,320 bytes long. Most of that
space is occupied by the symbols, a list of the program's variables and function names. (The symbols are not
loaded into memory when the program runs.) The strip program removes the symbols from the object file. This
can save significant disk space for a large program, at the cost of making it impossible to debug a core dump[2]
should one occur. (On modern systems this isn't worth the trouble; don't use strip.) Even after removing the
symbols, the file is still larger than what gets loaded into memory since the object file format maintains additional
data about the program, such as what shared libraries it may use, if any.[3]
[2] A core dump is the memory image of a running process created when the process terminates unexpectedly. It may be
used later for debugging. Unix systems named the file core, and GNU/Linux systems use core.pid, where pid is the
process ID of the process that died.
[3] The description here is a deliberate simplification. Running programs occupy much more space than the size program
indicates, since shared libraries are included in the address space. Also, the data segment will grow as a program allocates
memory.
Finally, we'll mention that threads represent multiple threads of execution within a single address space.
Typically, each thread has its own stack and a way to get thread-local data, that is, dynamically allocated data
for private use by the thread. We don't otherwise cover threads in this book, since they are an advanced topic.
3.2. Memory Allocation
Four library functions form the basis for dynamic memory management from C. We describe them first, followed
by descriptions of the two system calls upon which these library functions are built. The C library functions in turn
are usually used to implement other library functions that allocate memory and the C++ new and delete
operators.

Finally, we discuss a function that you will see used frequently, but which we don't recommend.

3.2.1. Library Calls: malloc(), calloc(), realloc(), free()

Dynamic memory is allocated by either the malloc() or calloc() functions. These functions return pointers to
the allocated memory. Once you have a block of memory of a certain initial size, you can change its size with the
realloc() function. Dynamic memory is released with the free() function.

Debugging the use of dynamic memory is an important topic in its own right. We discuss tools for this purpose in
Section 15.5.2, "Memory Allocation Debuggers," page 612.

3.2.1.1. Examining C Language Details

Here are the function declarations from the GNU/Linux malloc(3) manpage:

#include <stdlib.h> ISO C

void *calloc(size_t nmemb, size_t size); Allocate and zero fill
void *malloc(size_t size); Allocate raw memory
void free(void *ptr); Release memory
void *realloc(void *ptr, size_t size); Change size of existing allocation

The allocation functions all return type void *. This is a typeless or generic pointer; all you can do with such a
pointer is cast it to a different type and assign it to a typed pointer. Examples are coming up.

The type size_t is an unsigned integral type that represents amounts of memory. It is used for dynamic memory
allocation, and we see many uses of it throughout the book. On most modern systems, size_t is unsigned
long, but it's better to use size_t explicitly than to use a plain unsigned integral type.

The ptrdiff_t type is used for address calculations in pointer arithmetic, such as calculating where in an array a
pointer may be pointing:

#define MAXBUF ...

char *p;
char buf[MAXBUF];
ptrdiff_t where;

p = buf;
while (some condition) {
    ...
    p += something;
    ...
    where = p - buf; /* what index are we at? */
}
The <stdlib.h> header file declares many of the standard C library routines and types (such as size_t), and it
also defines the preprocessor constant NULL, which represents the "null" or invalid pointer. (This is a zero value,
such as 0 or '((void *) 0)'. The C++ idiom is to use 0 explicitly; in C, however, NULL is preferred, and we
find it to be much more readable for C code.)

3.2.1.2. Initially Allocating Memory: malloc()

Memory is allocated initially with malloc(). The value passed in is the total number of bytes requested. The
return value is a pointer to the newly allocated memory or NULL if memory could not be allocated. In the latter
event, errno will be set to indicate the error. (errno is a special variable that system calls and library functions
set to indicate what went wrong. It's described in Section 4.3, "Determining What Went Wrong," page 86.) For
example, suppose we wish to allocate a variable number of some structure. The code looks something like this:

struct coord { /* 3D coordinates */
    int x, y, z;
} *coordinates;
unsigned int count; /* how many we need */
size_t amount; /* total amount of memory */

/* ... determine count somehow... */

amount = count * sizeof(struct coord); /* how many bytes to allocate */

coordinates = (struct coord *) malloc(amount); /* get the space */

if (coordinates == NULL) {
    /* report error, recover or give up */
}
/* ... use coordinates ... */

The steps shown here are quite boilerplate. The order is as follows:

1. Declare a pointer of the proper type to point to the allocated memory.

2. Calculate the size in bytes of the memory to be allocated. This involves multiplying a count of objects
needed by the size of the individual object. This size in turn is retrieved from the C sizeof operator, which
exists for this purpose (among others). Thus, while the size of a particular struct may vary across
compilers and architectures, sizeof always returns the correct value and the source code remains correct
and portable.

When allocating arrays for character strings or other data of type char, it is not necessary to multiply by
sizeof(char), since by definition this is always 1. But it won't hurt anything either.

3. Allocate the storage by calling malloc(), assigning the function's return value to the pointer variable. It is
good practice to cast the return value of malloc() to that of the variable being assigned to. In C it's not
required (although the compiler may generate a warning). We strongly recommend always casting the
return value.

Note that in C++, assignment of a pointer value of one type to a pointer of another type does require a
cast, whatever the context. For dynamic memory management, C++ programs should use new and
delete, to avoid type problems, and not malloc() and free().
4. Check the return value. Never assume that memory allocation will succeed. If the allocation fails, malloc()
returns NULL. If you use the value without checking, it is likely that your program will immediately die from a
segmentation violation (or segfault), which is an attempt to use memory not in your address space.

If you check the return value, you can at least print a diagnostic message and terminate gracefully. Or you
can attempt some other method of recovery.

Once we've allocated memory and set coordinates to point to it, we can then treat coordinates as if it were
an array, although it's really a pointer:

int cur_x, cur_y, cur_z;
size_t an_index;

an_index = something;
cur_x = coordinates[an_index].x;
cur_y = coordinates[an_index].y;
cur_z = coordinates[an_index].z;

The compiler generates correct code for indexing through the pointer to retrieve the members of the structure at
coordinates[an_index].

Note

The memory returned by malloc() is not initialized. It can contain any random garbage. You should
immediately initialize the memory with valid data or at least with zeros. To do the latter, use memset()
(discussed in Section 12.2, "Low-Level Memory: The memXXX() Functions," page 432):

memset(coordinates, '\0', amount);

Another option is to use calloc(), described shortly.

Geoff Collyer recommends the following technique for allocating memory:

some_type *pointer;

pointer = malloc(count * sizeof(*pointer));

This approach guarantees that the malloc() will allocate the correct amount of memory without your having to
consult the declaration of pointer. If pointer's type later changes, the sizeof operator automatically ensures
that the count of bytes to allocate stays correct. (Geoff's technique omits the cast that we just discussed. Having
the cast there also ensures a diagnostic if pointer's type changes and the call to malloc() isn't updated.)
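
Combining Geoff's technique with the recommended cast, the earlier coordinates allocation becomes:

coordinates = (struct coord *) malloc(count * sizeof(*coordinates));
if (coordinates == NULL) {
    /* report error, recover or give up */
}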

3.2.1.3. Releasing Memory: free()


When you're done using the memory, you "give it back" by using the free() function. The single argument is a
pointer previously obtained from one of the other allocation routines. It is safe (although useless) to pass a null
pointer to free():

free(coordinates);
coordinates = NULL; /* not required, but a good idea */

Once free(coordinates) is called, the memory pointed to by coordinates is off limits. It now "belongs" to
the allocation subroutines, and they are free to manage it as they see fit. They can change the contents of the
memory or even release it from the process's address space! There are thus several common errors to watch out
for with free():

Accessing freed memory

If unchanged, coordinates continues to point at memory that no longer belongs to the application. This is
called a dangling pointer. In many systems, you can get away with continuing to access this memory, at
least until the next time more memory is allocated or freed. In many others though, such access won't work.

In sum, accessing freed memory is a bad idea: It's not portable or reliable, and the GNU Coding
Standards disallows it. For this reason, it's a good idea to immediately set the program's pointer variable to
NULL. If you then accidentally attempt to access freed memory, your program will immediately fail with a
segmentation fault (before you've released it to the world, we hope).

Freeing the same pointer twice

This causes "undefined behavior." Once the memory has been handed back to the allocation routines, they
may merge the freed block with other free storage under management. Freeing something that's already
been freed is likely to lead to confusion or crashes at best, and so-called double frees have been known to
lead to security problems.

Passing a pointer not obtained from malloc(), calloc(), or realloc()

This seems obvious, but it's important nonetheless. Even passing in a pointer to somewhere in the middle of
dynamically allocated memory is bad:

free (coordinates + 10); /* Release all but first 10 elements. */

This call won't work, and it's likely to lead to disastrous consequences, such as a crash. (This is because
many malloc() implementations keep "bookkeeping" information in front of the returned data. When
free() goes to use that information, it will find invalid data there. Other implementations have the
bookkeeping information at the end of the allocated chunk; the same issues apply.)

Buffer overruns and underruns

Accessing memory outside an allocated chunk also leads to undefined behavior, again because this is likely
to be bookkeeping information or possibly memory that's not even in the address space. Writing into such
memory is much worse, since it's likely to destroy the bookkeeping data.

Failure to free memory

Any dynamic memory that's not needed should be released. In particular, memory that is allocated inside
loops or recursive or deeply nested function calls should be carefully managed and released. Failure to take
care leads to memory leaks, whereby the process's memory can grow without bounds; eventually, the
process dies from lack of memory.

This situation can be particularly pernicious if memory is allocated per input record or as some other
function of the input: The memory leak won't be noticed when run on small inputs but can suddenly become
obvious (and embarrassing) when run on large ones. This error is even worse for systems that must run
continuously, such as telephone switching systems. A memory leak that crashes such a system can lead to
significant monetary or other damage.

Even if the program never dies for lack of memory, constantly growing programs suffer in performance,
because the operating system has to manage keeping in-use data in physical memory. In the worst case, this
can lead to behavior known as thrashing, whereby the operating system is so busy moving the contents of
the address space into and out of physical memory that no real work gets done.

While it's possible for free() to hand released memory back to the system and shrink the process address
space, this is almost never done. Instead, the released memory is kept available for allocation by the next call to
malloc(), calloc(), or realloc().

Given that released memory continues to reside in the process's address space, it may pay to zero it out before
releasing it. Security-sensitive programs may choose to do this, for example.
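
A sketch of the idea follows; secret and secret_len are hypothetical names for some sensitive buffer and its
size. (Be aware that an optimizing compiler may remove a memset() whose result is never read, so truly
security-critical code may need a platform-specific scrubbing function.)

memset(secret, '\0', secret_len); /* scrub the contents */
free(secret);
secret = NULL; /* not required, but a good idea */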

See Section 15.5.2, "Memory Allocation Debuggers," page 612, for discussion of a number of useful dynamic-
memory debugging tools.

3.2.1.4. Changing Size: realloc()

Dynamic memory has a significant advantage over statically declared arrays, which is that it's possible to use
exactly as much memory as you need, and no more. It's not necessary to declare a global, static, or automatic
array of some fixed size and hope that it's (a) big enough and (b) not too big. Instead, you can allocate exactly as
much as you need, no more and no less.

Additionally, it's possible to change the size of a dynamically allocated memory area. Although it's possible to
shrink a block of memory, more typically, the block is grown. Changing the size is handled with realloc().
Continuing with the coordinates example, typical code goes like this:

int new_count;
size_t new_amount;
struct coord *newcoords;

/* set new_count, for example: */
new_count = count * 2; /* double the storage */
new_amount = new_count * sizeof(struct coord);

newcoords = (struct coord *) realloc(coordinates, new_amount);

if (newcoords == NULL) {
    /* report error, recover or give up */
}

coordinates = newcoords;
/* continue using coordinates ... */

As with malloc(), the steps are boilerplate in nature and are similar in concept:
1. Compute the new size to allocate, in bytes.

2. Call realloc() with the original pointer obtained from malloc() (or from calloc() or an earlier call to
realloc()) and the new size.

3. Cast and assign the return value of realloc(). More discussion of this shortly.

4. As for malloc(), check the return value to make sure it's not NULL. Any memory allocation routine can
fail.

When growing a block of memory, realloc() often allocates a new block of the right size, copies the data from
the old block into the new one, and returns a pointer to the new one.

When shrinking a block of data, realloc() can often just update the internal bookkeeping information and
return the same pointer. This saves having to copy the original data. However, if this happens, don't assume you
can still use the memory beyond the new size!

In either case, you can assume that if realloc() doesn't return NULL, the old data has been copied for you into
the new memory. Furthermore, the old pointer is no longer valid, as if you had called free() with it, and you
should not use it. This is true of all pointers into that block of data, not just the particular one passed to realloc().

You may have noticed that our example code used a separate variable to point to the changed storage block. It
would be possible (but a bad idea) to use the same initial variable, like so:

coordinates = realloc(coordinates, new_amount);

This is a bad idea for the following reason. When realloc() returns NULL, the original pointer is still valid; it's
safe to continue using that memory. However, if you reuse the same variable and realloc() returns NULL,
you've now lost the pointer to the original memory. That memory can no longer be used. More important, that
memory can no longer be freed! This creates a memory leak, which is to be avoided.

There are some special cases for the Standard C version of realloc(): When the ptr argument is NULL,
realloc() acts like malloc() and allocates a fresh block of storage. When the size argument is 0,
realloc() acts like free() and releases the memory that ptr points to. Because (a) this can be confusing and
(b) older systems don't implement this feature, we recommend using malloc() when you mean malloc() and
free() when you mean free().

Here is another, fairly subtle, "gotcha."[4] Consider a routine that maintains a static pointer to some dynamically
allocated data, which the routine occasionally has to grow. It may also maintain automatic (that is, local) pointers
into this data. (For brevity, we omit error checking code. In production code, don't do that.) For example:
[4] It is derived from real-life experience with gawk .

void manage_table(void)
{
static struct table *table;
struct table *cur, *p;
int i;
size_t count;

...
table = (struct table *) malloc(count * sizeof(struct table));
/* fill table */
cur = & table[i]; /* point at i'th item */
...
cur->i = j; /* use pointer */
...
if (some condition) { /* need to grow table */
count += count/2;
p = (struct table *) realloc(table, count * sizeof(struct table));
table = p;
}

cur->i = j; /* PROBLEM 1: update table element */

other_routine(); /* PROBLEM 2: see text */


cur->j = k; /* PROBLEM 2: see text */
...
}

This looks straightforward; manage_table() allocates the data, uses it, changes the size, and so on. But there
are some problems that don't jump off the page (or the screen) when you are looking at this code.

In the line marked 'PROBLEM 1', the cur pointer is used to update a table element. However, cur was assigned
on the basis of the initial value of table. If some condition was true and realloc() returned a different
block of memory, cur now points into the original, freed memory! Whenever table changes, any pointers into
the memory need to be updated too. What's missing here is the statement 'cur = & table[i];' after table is
reassigned following the call to realloc().

The two lines marked 'PROBLEM 2' are even more subtle. In particular, suppose other_routine() makes a
recursive call to manage_table(). The table variable could be changed again, completely invisibly! Upon
return from other_routine(), the value of cur could once again be invalid.

One might think (as we did) that the only solution is to be aware of this and supply a suitably commented
reassignment to cur after the function call. However, Brian Kernighan kindly set us straight. If we use indexing,
the pointer maintenance issue doesn't even arise:

table = (struct table *) malloc(count * sizeof(struct table));

/* fill table */
...
table[i].i = j; /* Update a member of the i'th element */
...
if (some condition) { /* need to grow table */
    count += count/2;
    p = (struct table *) realloc(table, count * sizeof(struct table));
    table = p;
}

table[i].i = j; /* PROBLEM 1 goes away */

other_routine(); /* Recursively calls us, modifies table */
table[i].j = k; /* PROBLEM 2 goes away also */

Using indexing doesn't solve the problem if you have a global copy of the original pointer to the allocated data; in
that case, you still have to worry about updating your global structures after calling realloc().
Note

As with malloc(), when you grow a piece of memory, the newly allocated memory returned from
realloc() is not zero-filled. You must clear it yourself with memset() if that's necessary, since
realloc() only allocates the fresh memory; it doesn't do anything else.

3.2.1.5. Allocating and Zero-filling: calloc()

The calloc() function is a straightforward wrapper around malloc(). Its primary advantage is that it zeros the
dynamically allocated memory. It also performs the size calculation for you by taking as parameters the number of
items and the size of each:

coordinates = (struct coord *) calloc(count, sizeof(struct coord));

Conceptually, at least, the calloc() code is fairly simple. Here is one possible implementation:

void *calloc(size_t nmemb, size_t size)
{
    void *p;
    size_t total;

    total = nmemb * size;           Compute size

    p = malloc(total);              Allocate the memory

    if (p != NULL)                  If it worked ...
        memset(p, '\0', total);     Fill it with zeros

    return p;                       Return value is NULL or pointer
}

Many experienced programmers prefer to use calloc() since then there's never any question about the contents
of the newly allocated memory.

Also, if you know you'll need zero-filled memory, you should use calloc(), because it's possible that the
memory malloc() returns is already zero-filled. Although you, the programmer, can't know this, calloc() can
know about it and avoid the call to memset().

3.2.1.6. Summarizing from the GNU Coding Standards

To summarize, here is what the GNU Coding Standards has to say about using the memory allocation routines:

Check every call to malloc or realloc to see if it returned zero. Check realloc even if you are making
the block smaller; in a system that rounds block sizes to a power of 2, realloc may get a different block if
you ask for less space.

In Unix, realloc can destroy the storage block if it returns zero. GNU realloc does not have this bug: If
it fails, the original block is unchanged. Feel free to assume the bug is fixed. If you wish to run your program
on Unix, and wish to avoid lossage in this case, you can use the GNU malloc.

You must expect free to alter the contents of the block that was freed. Anything you want to fetch from
the block, you must fetch before calling free.

In three short paragraphs, Richard Stallman has distilled the important principles for doing dynamic memory
management with malloc(). It is the use of dynamic memory and the "no arbitrary limits" principle that makes
GNU programs so robust and more capable than their Unix counterparts.

We do wish to point out that the C standard requires realloc() to not destroy the original block if it returns
NULL.

3.2.1.7. Using Private Allocators

The malloc() suite is a general-purpose memory allocator. It has to be able to handle requests for arbitrarily
large or small amounts of memory and do all the bookkeeping when different chunks of allocated memory are
released. If your program does considerable dynamic memory allocation, you may thus find that it spends a large
proportion of its time in the malloc() functions.

One thing you can do is write a private allocator—a set of functions or macros that allocates large chunks of
memory from malloc() and then parcels out small chunks one at a time. This technique is particularly useful if
you allocate many individual instances of the same relatively small structure.

For example, GNU awk (gawk) uses this technique. From the file awk.h in the gawk distribution (edited slightly
to fit the page):

#define getnode(n) if (nextfree) n = nextfree, nextfree = nextfree->nextp;\
                   else n = more_nodes()

#define freenode(n) ((n)->flags = 0, (n)->exec_count = 0,\
                     (n)->nextp = nextfree, nextfree = (n))

The nextfree variable points to a linked list of NODE structures. The getnode() macro pulls the first structure
off the list if one is there. Otherwise, it calls more_nodes() to allocate a new list of free NODEs. The
freenode() macro releases a NODE by putting it at the head of the list.
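
Here is a minimal, self-contained sketch of the same free-list idea, written with functions instead of macros.
The NODE layout and the CHUNK size are ours, not gawk's:

#include <stdlib.h>

typedef struct node {
    struct node *nextp; /* links the free list */
    /* ... payload fields would go here ... */
} NODE;

static NODE *nextfree = NULL; /* head of the free list */

#define CHUNK 64 /* how many NODEs to allocate at once */

static NODE *more_nodes(void)
{
    NODE *block = (NODE *) malloc(CHUNK * sizeof(NODE));
    int i;

    if (block == NULL)
        return NULL;

    /* Chain all but the first node onto the free list. */
    for (i = 1; i < CHUNK - 1; i++)
        block[i].nextp = & block[i + 1];
    block[CHUNK - 1].nextp = nextfree;
    nextfree = & block[1];

    return & block[0]; /* hand the first node to the caller */
}

NODE *getnode(void)
{
    NODE *n = nextfree;

    if (n != NULL)
        nextfree = n->nextp;
    else
        n = more_nodes(); /* may return NULL if memory is exhausted */
    return n;
}

void freenode(NODE *n)
{
    n->nextp = nextfree; /* push back onto the free list */
    nextfree = n;
}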

Note

When first writing your application, do it the simple way: use malloc() and free() directly. If and
only if profiling your program shows you that it's spending a significant amount of time in the memory-
allocation functions should you consider writing a private allocator.

3.2.1.8. Example: Reading Arbitrarily Long Lines

Since this is, after all, Linux Programming by Example, it's time for a real-life example. The following code is
the readline() function from GNU Make 3.80 (ftp://ftp.gnu.org/gnu/make/make-3.80.tar.gz). It
can be found in the file read.c.

Following the "no arbitrary limits" principle, lines in a Makefile can be of any length. Thus, this routine's primary
job is to read lines of any length and make sure that they fit into the buffer being used.

A secondary job is to deal with continuation lines. As in C, lines that end with a backslash logically continue to the
next line. The strategy used is to maintain a buffer. As many lines as will fit in the buffer are kept there, with
pointers keeping track of the start of the buffer, the current line, and the next line. Here is the structure:

struct ebuffer
{
char *buffer; /* Start of the current line in the buffer. */
char *bufnext; /* Start of the next line in the buffer. */
char *bufstart; /* Start of the entire buffer. */
unsigned int size; /* Malloc'd size of buffer. */
FILE *fp; /* File, or NULL if this is an internal buffer. */
struct floc floc; /* Info on the file in fp (if any). */
};

The size field tracks the size of the entire buffer, and fp is the FILE pointer for the input file. The floc structure
isn't of interest for studying the routine.

The function returns the number of lines in the buffer. (The line numbers here are relative to the start of the
function, not the source file.)

1 static long
2 readline (ebuf) static long readline(struct ebuffer *ebuf)
3 struct ebuffer *ebuf;
4 {
5 char *p;
6 char *end;
7 char *start;
8 long nlines = 0;
9
10 /* The behaviors between string and stream buffers are different enough to
11 warrant different functions. Do the Right Thing. */
12
13 if (!ebuf->fp)
14 return readstring (ebuf);
15
16 /* When reading from a file, we always start over at the beginning of the
17 buffer for each new line. */
18
19 p = start = ebuf->bufstart;
20 end = p + ebuf->size;
21 *p = '\0';

We start by noticing that GNU Make is written in K&R C for maximal portability. The initial part declares
variables, and if the input is coming from a string (such as from the expansion of a macro), the code hands things
off to a different function, readstring() (lines 13 and 14). The test '!ebuf->fp' (line 13) is a shorter (and less
clear, in our opinion) test for a null pointer; it's the same as 'ebuf->fp == NULL'.

Lines 19–21 initialize the pointers, and insert a NUL byte, which is the C string terminator character, at the end of
the buffer. The function then starts a loop (lines 23–95), which runs as long as there is more input.

23 while (fgets (p, end - p, ebuf->fp) != 0)
24 {
25 char *p2;
26 unsigned long len;
27 int backslash;
28
29 len = strlen (p);
30 if (len == 0)
31 {
32 /* This only happens when the first thing on the line is a '\0'.
33 It is a pretty hopeless case, but (wonder of wonders) Athena
34 lossage strikes again! (xmkmf puts NULs in its makefiles.)
35 There is nothing really to be done; we synthesize a newline so
36 the following line doesn't appear to be part of this line. */
37 error (&ebuf->floc,
38 _("warning: NUL character seen; rest of line ignored"));
39 p[0] = '\n';
40 len = 1;
41 }

The fgets() function (line 23) takes a pointer to a buffer, a count of bytes to read, and a FILE * variable for
the file to read from. It reads one less than the count so that it can terminate the buffer with '\0'. This function is
good since it allows you to avoid buffer overflows. It stops upon encountering a newline or end-of-file, and if the
newline is there, it's placed in the buffer. It returns NULL on failure or the (pointer) value of the first argument on
success.

In this case, the arguments are a pointer to the free area of the buffer, the amount of room left in the buffer, and
the FILE pointer to read from.

The comment on lines 32–36 is self-explanatory; if a zero byte is encountered, the program prints an error
message and pretends it was an empty line. After compensating for the NUL byte (lines 30–41), the code
continues.

43 /* Jump past the text we just read. */
44 p += len;
45
46 /* If the last char isn't a newline, the whole line didn't fit into the
47 buffer. Get some more buffer and try again. */
48 if (p[-1] != '\n')
49 goto more_buffer;
50
51 /* We got a newline, so add one to the count of lines. */
52 ++nlines;

Lines 43–52 increment the pointer into the buffer past the data just read. The code then checks whether the last
character read was a newline. The construct p[-1] (line 48) looks at the character in front of p, just as p[0] is
the current character and p[1] is the next. This looks strange at first, but if you translate it into terms of pointer
math, *(p-1), it makes more sense, and the indexing form is possibly easier to read.

If the last character was not a newline, this means that we've run out of space, and the code goes off (with goto)
to get more (line 49). Otherwise, the line count is incremented.

54 #if !defined(WINDOWS32) && !defined(__MSDOS__)
55 /* Check to see if the line was really ended with CRLF; if so ignore
56 the CR. */
57 if ((p - start) > 1 && p[-2] == '\r')
58 {
59 --p;
60 p[-1] = '\n';
61 }
62 #endif

Lines 54–62 deal with input lines that follow the Microsoft convention of ending with a Carriage Return-Line
Feed (CR-LF) combination, and not just a Line Feed (or newline), which is the Linux/Unix convention. Note that
the #ifdef excludes the code on Microsoft systems; apparently the <stdio.h> library on those systems
handles this conversion automatically. This is also true of other non-Unix systems that support Standard C.

64 backslash = 0;
65 for (p2 = p - 2; p2 >= start; --p2)
66 {
67 if (*p2 != '\\')
68 break;
69 backslash = !backslash;
70 }
71
72 if (!backslash)
73 {
74 p[-1] = '\0';
75 break;
76 }
77
78 /* It was a backslash/newline combo. If we have more space, read
79 another line. */
80 if (end - p >= 80)
81 continue;
82
83 /* We need more space at the end of our buffer, so realloc it.
84 Make sure to preserve the current offset of p. */
85 more_buffer:
86 {
87 unsigned long off = p - start;
88 ebuf->size *= 2;
89 start = ebuf->buffer = ebuf->bufstart = (char *) xrealloc (start,
90 ebuf->size);
91 p = start + off;
92 end = start + ebuf->size;
93 *p = '\0';
94 }
95 }

So far we've dealt with the mechanics of getting at least one complete line into the buffer. The next chunk handles
the case of a continuation line. It has to make sure, though, that the final backslash isn't part of multiple
backslashes at the end of the line. It tracks whether the total number of such backslashes is odd or even by
toggling the backslash variable from 0 to 1 and back. (Lines 64–70.)

If the number is even, the test '! backslash' (line 72) will be true. In this case, the final newline is replaced with
a NUL byte, and the code leaves the loop.

On the other hand, if the number is odd, then the line contained an even number of backslash pairs (representing
escaped backslashes, \\ as in C), and a final backslash-newline combination.[5] In this case, if at least 80 free
bytes are left in the buffer, the program continues around the loop to read another line (lines 78–81). (The use
of the magic number 80 isn't great; it would have been better to define and use a symbolic constant.)
[5] This code has the scent of practical experience about it: It wouldn't be surprising to learn that earlier versions simply
checked for a final backslash before the newline, until someone complained that it didn't work when there were multiple
backslashes at the end of the line.

Upon reaching line 83, the program needs more space in the buffer. Here's where the dynamic memory
management comes into play. Note the comment about preserving p (lines 83–84); we discussed this earlier in
terms of reinitializing pointers into dynamic memory. end is also reset. Line 89 resizes the memory.

Note that here the function being called is xrealloc(). Many GNU programs use "wrapper" functions around
malloc() and realloc() that automatically print an error message and exit if the standard routines return NULL.
Such a wrapper might look like this:

extern const char *myname; /* set in main() */

void *xrealloc(void *ptr, size_t amount)
{
    void *p = realloc(ptr, amount);

    if (p == NULL) {
        fprintf(stderr, "%s: out of memory!\n", myname);
        exit(1);
    }

    return p;
}

Thus, if xrealloc() returns, it's guaranteed to return a valid pointer. (This strategy complies with the "check
every call for errors" principle while avoiding the code clutter that comes with doing so using the standard routines
directly.) In addition, this allows valid use of the construct 'ptr = xrealloc(ptr, new_size)', which we
otherwise warned against earlier.

Note that it is not always appropriate to use such a wrapper. If you wish to handle errors yourself, you shouldn't
use it. On the other hand, if running out of memory is always a fatal error, then such a wrapper is quite handy.

97 if (ferror (ebuf->fp))
98 pfatal_with_name (ebuf->floc.filenm);
99
100 /* If we found some lines, return how many.
101 If we didn't, but we did find _something_, that indicates we read the last
102 line of a file with no final newline; return 1.
103 If we read nothing, we're at EOF; return -1. */
104
105 return nlines ? nlines : p == ebuf->bufstart ? -1 : 1;
106 }

Finally, the readline() routine checks for I/O errors, and then returns a descriptive return value. The function
pfatal_with_name() (line 98) doesn't return.

3.2.1.9. GLIBC Only: Reading Entire Lines: getline() and getdelim()

Now that you've seen how to read an arbitrary-length line, you can breathe a sigh of relief that you don't have to
write such a function for yourself. GLIBC provides two functions to do this for you:

#define _GNU_SOURCE 1 GLIBC

#include <stdio.h>
#include <sys/types.h> /* for ssize_t */

ssize_t getline(char **lineptr, size_t *n, FILE *stream);
ssize_t getdelim(char **lineptr, size_t *n, int delim, FILE *stream);

Defining the constant _GNU_SOURCE brings in the declarations of the getline() and getdelim() functions.
Otherwise, they're implicitly declared as returning int. <sys/types.h> is needed so you can declare a variable
of type ssize_t to hold the return value. (An ssize_t is a "signed size_t." It's meant for the same use as a
size_t, but for places where you need to be able to hold negative values as well.)

Both functions manage dynamic storage for you, ensuring that the buffer containing an input line is always big
enough to hold the input line. They differ in that getline() reads until a newline character, and getdelim()
uses a user-provided delimiter character. The common arguments are as follows:
char **lineptr

A pointer to a char * pointer to hold the address of a dynamically allocated buffer. It should be initialized
to NULL if you want getline() to do all the work. Otherwise, it should point to storage previously
obtained from malloc().
size_t *n

An indication of the size of the buffer. If you allocated your own buffer, *n should contain the buffer's size.
Both functions update *n to the new buffer size if they change it.
FILE *stream

The location from which to get input characters.

The functions return -1 upon end-of-file or error. The strings hold the terminating newline or delimiter (if there
was one), as well as a terminating zero byte. Using getline() is easy, as shown in ch03-getline.c:

/* ch03-getline.c --- demonstrate getline(). */

#define _GNU_SOURCE 1
#include <stdio.h>
#include <sys/types.h>

/* main --- read a line and echo it back out until EOF. */

int main(void)
{
char *line = NULL;
size_t size = 0;
ssize_t ret;

while ((ret = getline(& line, & size, stdin)) != -1)
    printf(" (%lu) %s", size, line);

return 0;
}

Here it is in action, showing the size of the buffer. The third input and output lines are purposely long, to force
getline() to grow the buffer; thus, they wrap around:

$ ch03-getline Run the program
this is a line
(120) this is a line
And another line.
(120) And another line.
A llllllllllllllllloooooooooooooooooooooooooooooooonnnnnnnnnnnnnnnnnnngggg
gggggggg llliiiiiiiiiiiiiiiiiiinnnnnnnnnnnnnnnnnnnneeeeeeeeee
(240) A llllllllllllllllloooooooooooooooooooooooooooooooonnnnnnnnnnnnnnnng
nnnggggggggggg llliiiiiiiiiiiiiiiiiiinnnnnnnnnnnnnnnnnnnneeeeeeeeee
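
getdelim() is used the same way; here is a hypothetical variant of the program above (ours, not from the
book) that reads colon-separated fields instead of lines:

/* ch03-getdelim.c --- demonstrate getdelim() (hypothetical example). */

#define _GNU_SOURCE 1
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

int main(void)
{
    char *field = NULL; /* getdelim() allocates the buffer */
    size_t size = 0;
    ssize_t ret;

    while ((ret = getdelim(& field, & size, ':', stdin)) != -1)
        printf("%ld bytes: %s\n", (long) ret, field);

    free(field); /* release the buffer getdelim() allocated */
    return 0;
}

Each returned field includes the trailing ':' delimiter (if one was present), just as getline() includes the
trailing newline.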

3.2.2. String Copying: strdup()

One extremely common operation is to allocate storage for a copy of a string. It's so common that many
programs provide a simple function for it instead of using inline code, and often that function is named strdup():

#include <string.h>

/* strdup --- malloc() storage for a copy of string and copy it */

char *strdup(const char *str)
{
    size_t len;
    char *copy;

    len = strlen(str) + 1; /* include room for terminating '\0' */
    copy = malloc(len);

    if (copy != NULL)
        strcpy(copy, str);

    return copy; /* returns NULL if error */
}

With the 2001 POSIX standard, programmers the world over can breathe a little easier: This function is now part
of POSIX as an XSI extension:

#include <string.h> XSI

char *strdup(const char *str); Duplicate str

The return value is NULL if there was an error or a pointer to dynamically allocated storage holding a copy of str.
The returned value should be freed with free() when it's no longer needed.
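
Typical usage is simple. For example (using argv[1] purely as a sample string to copy):

char *copy;

if ((copy = strdup(argv[1])) == NULL) {
    /* report error, recover or give up */
}
/* ... use copy ... */
free(copy);
copy = NULL; /* not required, but a good idea */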

3.2.3. System Calls: brk() and sbrk()

The four routines we've covered (malloc(), calloc(), realloc(), and free()) are the standard, portable
functions to use for dynamic memory management.

On Unix systems, the standard functions are implemented on top of two additional, very primitive routines, which
directly change the size of a process's address space. We present them here to help you understand how
GNU/Linux and Unix work ("under the hood" again); it is highly unlikely that you will ever need to use these
functions in a regular program. They are declared as follows:
#include <unistd.h> Common
#include <malloc.h> /* Necessary for GLIBC 2 systems */

int brk(void *end_data_segment);
void *sbrk(ptrdiff_t increment);

The brk() system call actually changes the process's address space. The address is a pointer representing the
end of the data segment (really the heap area, as shown earlier in Figure 3.1). Its argument is an absolute logical
address representing the new end of the address space. It returns 0 on success or -1 on failure.

The sbrk() function is easier to use; its argument is the increment in bytes by which to change the address space.
By calling it with an increment of 0, you can determine where the address space currently ends. Thus, to increase
your address space by 32 bytes, use code like this:

char *p = (char *) sbrk(0);    /* get current end of address space */

if (brk(p + 32) < 0) {
    /* handle error */
}
/* else, change worked */

Practically speaking, you would not use brk() directly. Instead, you would use sbrk() exclusively to grow (or
even shrink) the address space. (We show how to do this shortly, in Section 3.2.5, "Address Space
Examination," page 78.)
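
For illustration, here is a sketch (ours) of the sbrk()-only idiom; like the fragment above, it is not something to use in production code:

char *p;

p = (char *) sbrk(32);          /* returns the old break on success */
if (p == (char *) -1) {
    /* handle error: the address space is unchanged */
}
/* else p points at the 32 newly added bytes */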

Even more practically, you should never use these routines. A program that uses them can't also use malloc(),
and that is a big problem, since many parts of the standard library rely on being able to use malloc(). Using
brk() or sbrk() is thus likely to lead to hard-to-find program crashes.

But it's worth knowing about the low-level mechanics, and indeed, the malloc() suite of routines is implemented
with sbrk() and brk().

3.2.4. Lazy Programmer Calls: alloca()

"Danger, Will Robinson! Danger!"

—The Robot

There is one additional memory allocation function that you should know about. We discuss it only so that you'll
understand it when you see it, but you should not use it in new programs! This function is named alloca(); it's
declared as follows:

/* Header on GNU/Linux, possibly not all Unix systems */ Common
#include <alloca.h>

void *alloca(size_t size);

The alloca() function allocates size bytes from the stack. What's nice about this is that the allocated storage
disappears when the function returns. There's no need to explicitly free it because it goes away automatically, just
as local variables do.
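
So that you'll recognize the pattern when you run across it in older code, here is a small illustrative sketch (ours, not a recommendation):

#include <alloca.h>
#include <string.h>

/* illustrative only --- do not imitate in new code */

void show_pattern(const char *name)
{
    char *tmp = (char *) alloca(strlen(name) + 1);  /* stack storage */

    strcpy(tmp, name);
    /* ... use tmp ... */
}   /* tmp's storage vanishes automatically upon return */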
At first glance, alloca() seems like a programming panacea; memory can be allocated that doesn't have to be
managed at all. Like the Dark Side of the Force, this is indeed seductive. And it is similarly to be avoided, for the
following reasons:

The function is nonstandard; it is not included in any formal standard, either ISO C or POSIX.

The function is not portable. Although it exists on many Unix systems and GNU/Linux, it doesn't exist on
non-Unix systems. This is a problem, since it's often important for code to be multiplatform, above and
beyond just Linux and Unix.

On some systems, alloca() can't even be implemented. All the world is not an Intel x86 processor, nor is
all the world GCC.

Quoting the manpage (emphasis added): "The alloca function is machine and compiler dependent. On
many systems its implementation is buggy. Its use is discouraged."

Quoting the manpage again: "On many systems alloca cannot be used inside the list of arguments of a
function call, because the stack space reserved by alloca would appear on the stack in the middle of the
space for the function arguments."

It encourages sloppy coding. Careful and correct memory management isn't hard; you just have to think
about what you're doing and plan ahead.

GCC generally uses a built-in version of the function that operates by using inline code. As a result, there are
other consequences of alloca(). Quoting again from the manpage:

The fact that the code is inlined means that it is impossible to take the address of this function, or to change
its behavior by linking with a different library.

The inlined code often consists of a single instruction adjusting the stack pointer, and does not check for
stack overflow. Thus, there is no NULL error return.

The manual page doesn't go quite far enough in describing the problem with GCC's built-in alloca(). If there's a
stack overflow, the return value is garbage. And you have no way to tell! This flaw makes GCC's alloca()
impossible to use in robust code.

All of this should convince you to stay away from alloca() for any new code that you may write. If you're going
to have to write portable code using malloc() and free() anyway, there's no reason to also write code using
alloca().

3.2.5. Address Space Examination

The following program, ch03-memaddr.c, summarizes everything we've seen about the address space. It does
many things that you should not do in practice, such as call alloca() or use brk() and sbrk() directly:

1  /*
2   * ch03-memaddr.c --- Show address of code, data and stack sections,
3   *                    as well as BSS and dynamic memory.
4   */
5
6  #include <stdio.h>
7  #include <malloc.h>   /* for definition of ptrdiff_t on GLIBC */
8  #include <unistd.h>
9  #include <alloca.h>   /* for demonstration only */
10
11 extern void afunc(void);    /* a function for showing stack growth */
12
13 int bss_var;                /* auto init to 0, should be in BSS */
14 int data_var = 42;          /* init to nonzero, should be data */
15
16 int
17 main(int argc, char **argv) /* arguments aren't used */
18 {
19     char *p, *b, *nb;
20
21     printf("Text Locations:\n");
22     printf("\tAddress of main: %p\n", main);
23     printf("\tAddress of afunc: %p\n", afunc);
24
25     printf("Stack Locations:\n");
26     afunc();
27
28     p = (char *) alloca(32);
29     if (p != NULL) {
30         printf("\tStart of alloca()'ed array: %p\n", p);
31         printf("\tEnd of alloca()'ed array: %p\n", p + 31);
32     }
33
34     printf("Data Locations:\n");
35     printf("\tAddress of data_var: %p\n", & data_var);
36
37     printf("BSS Locations:\n");
38     printf("\tAddress of bss_var: %p\n", & bss_var);
39
40     b = sbrk((ptrdiff_t) 32);    /* grow address space */
41     nb = sbrk((ptrdiff_t) 0);
42     printf("Heap Locations:\n");
43     printf("\tInitial end of heap: %p\n", b);
44     printf("\tNew end of heap: %p\n", nb);
45
46     b = sbrk((ptrdiff_t) -16);   /* shrink it */
47     nb = sbrk((ptrdiff_t) 0);
48     printf("\tFinal end of heap: %p\n", nb);
49 }
50
51 void
52 afunc(void)
53 {
54     static int level = 0;        /* recursion level */
55     auto int stack_var;          /* automatic variable, on stack */
56
57     if (++level == 3)            /* avoid infinite recursion */
58         return;
59
60     printf("\tStack level %d: address of stack_var: %p\n",
61         level, & stack_var);
62     afunc();                     /* recursive call */
63 }

This program prints the locations of the two functions main() and afunc() (lines 22–23). It then shows how the
stack grows downward, letting afunc() (lines 51–63) print the address of successive instantiations of its local
variable stack_var. (stack_var is purposely declared auto, to emphasize that it's on the stack.) It then shows
the location of memory allocated by alloca() (lines 28–32). Finally it prints the locations of data and BSS
variables (lines 34–38), and then of memory allocated directly through sbrk() (lines 40–48). Here are the results
when the program is run on an Intel GNU/Linux system:

$ ch03-memaddr
Text Locations:
        Address of main: 0x804838c
        Address of afunc: 0x80484a8
Stack Locations:
        Stack level 1: address of stack_var: 0xbffff864
        Stack level 2: address of stack_var: 0xbffff844    Stack grows downward
        Start of alloca()'ed array: 0xbffff860
        End of alloca()'ed array: 0xbffff87f               Addresses are on the stack
Data Locations:
        Address of data_var: 0x80496b8
BSS Locations:
        Address of bss_var: 0x80497c4                      BSS is above data variables
Heap Locations:
        Initial end of heap: 0x80497c8                     Heap is immediately above BSS
        New end of heap: 0x80497e8                         And grows upward
        Final end of heap: 0x80497d8                       Address spaces can shrink
3.3. Summary

Every Linux (and Unix) program has different memory areas. They are stored in separate parts of the
executable program's disk file. Some of the sections are loaded into the same part of memory when the
program is run. All running copies of the same program share the executable code (the text segment). The
size program shows the sizes of the different areas for relocatable object files and fully linked executable
files.

The address space of a running program may have holes in it, and the size of the address space can change
as memory is allocated and released. On modern systems, address 0 is not part of the address space, so
don't attempt to dereference NULL pointers.

At the C level, memory is allocated or reallocated with one of malloc(), calloc(), or realloc().
Memory is freed with free(). (Although realloc() can do everything, using it that way isn't
recommended.) It is unusual for freed memory to be removed from the address space; instead, it is reused
for later allocations.

Extreme care must be taken to

Free only memory received from the allocation routines,

Free such memory once and only once,

Free unused memory, and

Not "leak" any dynamically allocated memory.

POSIX provides the strdup() function as a convenience, and GLIBC provides getline() and
getdelim() for reading arbitrary-length lines.

The low-level system call interface functions, brk() and sbrk(), provide direct but primitive access to
memory allocation and deallocation. Unless you are writing your own storage allocator, you should not use
them.

The alloca() function for allocating memory on the stack exists, but is not recommended. Like being able
to recognize poison ivy, you should know it only so that you'll know to avoid it.
Exercises

1. Starting with the structure—

struct line {
size_t buflen;
char *buf;
FILE *fp;
};

—write your own readline() function that will read a line of any length. Don't worry about backslash
continuation lines. Instead of using fgets() to read lines, use getc() to read characters one at a time.

2. Does your function preserve the terminating newline? Explain why or why not.

3. How does your function handle lines that end in CR-LF?

4. How do you initialize the structure? With a separate routine? With a documented requirement for specific
values in the structure?

5. How do you indicate end-of-file? How do you indicate that an I/O error has occurred? For errors, should
your function print an error message? Explain why or why not.

6. Write a program that uses your function to test it, and another program to generate input data to the first
program. Test your function.

7. Rewrite your function to use fgets() and test it. Is the new code more complex or less complex? How
does its performance compare to the getc() version?

8. Study the V7 end(3) manpage (/usr/man/man3/end.3 in the V7 distribution). Does it shed any light on
how 'sbrk(0)' might work?

9. Enhance ch03-memaddr.c to print out the location of the arguments and the environment. In which part of
the address space do they reside?
Chapter 4. Files and File I/O
In this chapter

4.1 Introducing the Linux/Unix I/O Model page 84

4.2 Presenting a Basic Program Structure page 84

4.3 Determining What Went Wrong page 86

4.4 Doing Input and Output page 91

4.5 Random Access: Moving Around within a File page 102

4.6 Creating Files page 106

4.7 Forcing Data to Disk page 113

4.8 Setting File Length page 114

4.9 Summary page 115

Exercises page 115

This chapter describes basic file operations: opening and creating files, reading and writing them, moving around in
them, and closing them. Along the way it presents the standard mechanisms for detecting and reporting errors.
The chapter ends by describing how to set a file's length and force file data and metadata to disk.
4.1. Introducing the Linux/Unix I/O Model
The Linux/Unix API model for I/O is straightforward. It can be summed up in four words: open, read, write,
close. In fact, those are the names of the system calls: open(), read(), write(), close(). Here are their
declarations:

#include <sys/types.h> POSIX
#include <sys/stat.h> /* for mode_t */
#include <fcntl.h> /* for flags for open() */
#include <unistd.h> /* for ssize_t */

int open(const char *pathname, int flags, mode_t mode);
ssize_t read(int fd, void *buf, size_t count);
ssize_t write(int fd, const void *buf, size_t count);
int close(int fd);

In the next and subsequent sections, we illustrate the model by writing a very simple version of cat. It's so simple
that it doesn't even have options; all it does is concatenate the contents of the named files to standard output. It
does do minimal error reporting. Once it's written, we compare it to the V7 cat.

We present the program top-down, starting with the command line. In succeeding sections, we present error
reporting and then get down to brass tacks, showing how to do actual file I/O.
4.2. Presenting a Basic Program Structure
Our version of cat follows a structure that is generally useful. The first part starts with an explanatory comment,
header includes, declarations, and the main() function:

1  /*
2   * ch04-cat.c --- Demonstrate open(), read(), write(), close(),
3   *                errno and strerror().
4   */
5
6  #include <stdio.h>     /* for fprintf(), stderr, BUFSIZ */
7  #include <errno.h>     /* declare errno */
8  #include <fcntl.h>     /* for flags for open() */
9  #include <string.h>    /* declare strerror() */
10 #include <unistd.h>    /* for ssize_t */
11 #include <sys/types.h>
12 #include <sys/stat.h>  /* for mode_t */
13
14 char *myname;
15 int process(char *file);
16
17 /* main --- loop over file arguments */
18
19 int
20 main(int argc, char **argv)
21 {
22     int i;
23     int errs = 0;
24
25     myname = argv[0];
26
27     if (argc == 1)
28         errs = process("-");
29     else
30         for (i = 1; i < argc; i++)
31             errs += process(argv[i]);
32
33     return (errs != 0);
34 }
... continued later in the chapter ...

The myname variable (line 14) is used later for error messages; main() sets it to the program name (argv[0]) as
its first action (line 25). Then main() loops over the arguments. For each argument, it calls a function named
process() to do the work.

When given the filename - (a single dash, or minus sign), Unix cat reads standard input instead of trying to open
a file named -. In addition, with no arguments, cat reads standard input. ch04-cat implements both of these
behaviors. The check for 'argc == 1' (line 27) is true when there are no filename arguments; in this case,
main() passes "-" to process(). Otherwise, main() loops over all the arguments, treating them as files to be
processed. If one of them happens to be "-", the program then processes standard input.

If process() returns a nonzero value, it means that something went wrong. Errors are added up in the errs
variable (lines 28 and 31). When main() ends, it returns 0 if there were no errors, and 1 if there were (line 33).
This is a fairly standard convention, whose meaning is discussed in more detail in Section 9.1.5.1, "Defining
Process Exit Status", page 300.

The structure presented in main() is quite generic: process() could do anything we want to the file. For
example (ignoring the special use of "-"), process() could just as easily remove files as concatenate them!

Before looking at the process() function, we have to describe how system call errors are represented and then
how I/O is done. The process() function itself is presented in Section 4.4.3, "Reading and Writing", page 96.
4.3. Determining What Went Wrong
"If anything can go wrong, it will".

—Murphy's Law

"Be prepared."

—The Boy Scouts

Errors can occur anytime. Disks can fill up, users can enter invalid data, the server on a network from which a file
is being read can crash, the network can die, and so on. It is important to always check every operation for
success or failure.

The basic Linux system calls almost universally return -1 on error, and 0 or a positive value on success. This lets
you know that the operation has succeeded or failed:

int result;

result = some_system_call(param1, param2);
if (result < 0) {
    /* error occurred, do something */
} else {
    /* all ok, proceed */
}

Knowing that an error occurred isn't enough. It's necessary to know what error occurred. For that, each process
has a predefined variable named errno. Whenever a system call fails, errno is set to one of a set of predefined
error values. errno and the predefined values are declared in the <errno.h> header file:

#include <errno.h> ISO C

extern int errno;

errno itself may be a macro that acts like an int variable; it need not be a real integer. In particular, in threaded
environments, each thread will have its own private version of errno. Practically speaking, though, for all the
system calls and functions in this book, you can treat errno like a simple int.

4.3.1. Values for errno

The 2001 POSIX standard defines a large number of possible values for errno. Many of these are related to
networking, IPC, or other specialized tasks. The manpage for each system call describes the possible errno
values that can occur; thus, you can write code to check for particular errors and handle them specially if need be.
The possible values are defined by symbolic constants. Table 4.1 lists the constants provided by GLIBC.

Table 4.1. GLIBC values for errno


Name Meaning
E2BIG Argument list too long.
EACCES Permission denied.
EADDRINUSE Address in use.
EADDRNOTAVAIL Address not available.
EAFNOSUPPORT Address family not supported.
EAGAIN Resource unavailable, try again (may be the same value as EWOULDBLOCK).
EALREADY Connection already in progress.
EBADF Bad file descriptor.
EBADMSG Bad message.
EBUSY Device or resource busy.
ECANCELED Operation canceled.
ECHILD No child processes.
ECONNABORTED Connection aborted.
ECONNREFUSED Connection refused.
ECONNRESET Connection reset.
EDEADLK Resource deadlock would occur.
EDESTADDRREQ Destination address required.
EDOM Mathematics argument out of domain of function.
EDQUOT Reserved.
EEXIST File exists.
EFAULT Bad address.
EFBIG File too large.
EHOSTUNREACH Host is unreachable.
EIDRM Identifier removed.
EILSEQ Illegal byte sequence.
EINPROGRESS Operation in progress.
EINTR Interrupted function.
EINVAL Invalid argument.
EIO I/O error.
EISCONN Socket is connected.
EISDIR Is a directory.
ELOOP Too many levels of symbolic links.
EMFILE Too many open files.
EMLINK Too many links.
EMSGSIZE Message too large.
EMULTIHOP Reserved.
ENAMETOOLONG Filename too long.
ENETDOWN Network is down.
ENETRESET Connection aborted by network.
ENETUNREACH Network unreachable.
ENFILE Too many files open in system.
ENOBUFS No buffer space available.
ENODEV No such device.
ENOENT No such file or directory.
ENOEXEC Executable file format error.
ENOLCK No locks available.
ENOLINK Reserved.
ENOMEM Not enough space.
ENOMSG No message of the desired type.
ENOPROTOOPT Protocol not available.
ENOSPC No space left on device.
ENOSYS Function not supported.
ENOTCONN The socket is not connected.
ENOTDIR Not a directory.
ENOTEMPTY Directory not empty.
ENOTSOCK Not a socket.
ENOTSUP Not supported.
ENOTTY Inappropriate I/O control operation.
ENXIO No such device or address.
EOPNOTSUPP Operation not supported on socket.
EOVERFLOW Value too large to be stored in data type.
EPERM Operation not permitted.
EPIPE Broken pipe.
EPROTO Protocol error.
EPROTONOSUPPORT Protocol not supported.
EPROTOTYPE Protocol wrong type for socket.
ERANGE Result too large.
EROFS Read-only file system.
ESPIPE Invalid seek.
ESRCH No such process.
ESTALE Reserved.
ETIMEDOUT Connection timed out.
ETXTBSY Text file busy.
EWOULDBLOCK Operation would block (may be the same value as EAGAIN).
EXDEV Cross-device link.

Many systems provide other error values as well, and older systems may not have all the errors just listed. You
should check your local intro(2) and errno(2) manpages for the full story.

Note

errno should be examined only after an error has occurred and before further system calls are made.
Its initial value is 0. However, nothing changes errno between errors, meaning that a successful system
call does not reset it to 0. You can, of course, manually set it to 0 initially or whenever you like, but this
is rarely done.

Initially, we use errno only for error reporting. There are two useful functions for error reporting. The first is
perror():

#include <stdio.h> ISO C

void perror(const char *s);

The perror() function prints a program-supplied string, followed by a colon, and then a string describing the
value of errno:

if (some_system_call(param1, param2) < 0) {
    perror("system call failed");
    return 1;
}

We prefer the strerror() function, which takes an error value parameter and returns a pointer to a string
describing the error:

#include <string.h> ISO C

char *strerror(int errnum);

strerror() provides maximum flexibility in error reporting, since fprintf() makes it possible to print the
error in any way we like:

if (some_system_call(param1, param2) < 0) {
    fprintf(stderr, "%s: %d, %d: some_system_call failed: %s\n",
        argv[0], param1, param2, strerror(errno));
    return 1;
}

You will see many examples of both functions throughout the book.

4.3.2. Error Message Style

C provides several special macros for use in error reporting. The most widely used are __FILE__ and
__LINE__, which expand to the name of the source file and the current line number in that file. These have been
available in C since its beginning. C99 defines an additional predefined identifier, __func__, which represents the
name of the current function as a character string. The macros are used like this:

if (some_system_call(param1, param2) < 0) {
    fprintf(stderr, "%s: %s (%s %d): some_system_call(%d, %d) failed: %s\n",
        argv[0], __func__, __FILE__, __LINE__,
        param1, param2, strerror(errno));
    return 1;
}

Here, the error message includes not only the program's name but also the function name, source file name, and
line number. The full list of identifiers useful for diagnostics is provided in Table 4.2.

Table 4.2. C99 diagnostic identifiers

Identifier C version Meaning

__DATE__ C89 Date of compilation in the form "Mmm nn yyyy".
__FILE__ Original Source-file name in the form "program.c".
__LINE__ Original Source-file line number in the form 42.
__TIME__ C89 Time of compilation in the form "hh:mm:ss".
__func__ C99 Name of current function, as if declared const char __func__[] = "name".

The use of __FILE__ and __LINE__ was quite popular in the early days of Unix, when most people had source
code and could find the error and fix it. As Unix systems became more commercial, use of these identifiers
gradually diminished, since knowing the source code location isn't of much help to someone who only has a binary
executable.

Today, although GNU/Linux systems come with source code, said source code often isn't installed by default.
Thus, using these identifiers for error messages doesn't seem to provide much additional value. The GNU Coding
Standards don't even mention them.
4.4. Doing Input and Output
All I/O in Linux is accomplished through file descriptors. This section introduces file descriptors, describes how
to obtain and release them, and explains how to do I/O with them.

4.4.1. Understanding File Descriptors

A file descriptor is an integer value. Valid file descriptors start at 0 and go up to some system-defined limit.
These integers are in fact simple indexes into each process's table of open files. (This table is maintained inside the
operating system; it is not accessible to a running program.) On most modern systems, the size of the table is
large. The command 'ulimit -n' prints the value:

$ ulimit -n
1024

From C, the maximum number of open files is returned by the getdtablesize() (get descriptor table size)
function:

#include <unistd.h> Common

int getdtablesize(void);

This small program prints the result of the function:

/* ch04-maxfds.c --- Demonstrate getdtablesize(). */

#include <stdio.h>     /* for printf() */
#include <stdlib.h>    /* for exit() */
#include <unistd.h>    /* for getdtablesize() */

int
main(int argc, char **argv)
{
    printf("max fds: %d\n", getdtablesize());
    exit(0);
}

When compiled and run, the program, not surprisingly, prints the same value as ulimit does:

$ ch04-maxfds
max fds: 1024

File descriptors are held in normal int variables; it is typical to see declarations of the form 'int fd' for use with
I/O system calls. There is no predefined type for file descriptors.

In the usual case, every program starts running with three file descriptors already opened for it. These are
standard input, standard output, and standard error, on file descriptors 0, 1, and 2, respectively. (If not otherwise
redirected, each one is connected to your keyboard and screen.)

Obvious Manifest Constants. An Oxymoron?
When working with file-descriptor-based system calls and the standard input, output and error, it is
common practice to use the integer constants 0, 1, and 2 directly in code. In the overwhelming
majority of cases, such manifest constants are a bad idea. You never know what the meaning is of
some random integer constant and whether the same constant used elsewhere is related to it or not.
To this end, the POSIX standard requires the definition of the following symbolic constants in
<unistd.h>:

STDIN_FILENO The "file number" for standard input: 0.
STDOUT_FILENO The file number for standard output: 1.
STDERR_FILENO The file number for standard error: 2.

However, in our humble opinion, using these macros is overkill. First, it's painful to type 12 or 13
characters instead of just 1. Second, the use of 0, 1, and 2 is so standard and so well known that
there's really no grounds for confusion as to the meaning of these particular manifest constants.

On the other hand, use of these constants leaves no doubt as to what was intended. Consider this
statement:

int fd = 0;

Is fd being initialized to refer to standard input, or is the programmer being careful to initialize his
variables to a reasonable value? You can't tell.

One approach (as recommended by Geoff Collyer) is to use the following enum definition:

enum { Stdin, Stdout, Stderr };

These constants can then be used in place of 0, 1, and 2. They are both readable and easier to type.
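
For example (a brief sketch of our own):

#include <unistd.h>

enum { Stdin, Stdout, Stderr };

int main(void)
{
    /* the enum constant reads better than a bare 1 */
    write(Stdout, "hello, world\n", 13);
    return 0;
}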

4.4.2. Opening and Closing Files

New file descriptors are obtained (among other sources) from the open() system call. This system call opens a
file for reading or writing and returns a new file descriptor for subsequent operations on the file. We saw the
declaration earlier:

#include <sys/types.h> POSIX
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

int open(const char *pathname, int flags, mode_t mode);

The three arguments are as follows:


const char *pathname

A C string, representing the name of the file to open.


int flags

The bitwise-OR of one or more of the constants defined in <fcntl.h>. We describe them shortly.
mode_t mode

The permissions mode of a file being created. This is discussed later in the chapter; see Section 4.6,
"Creating Files," page 106. When opening an existing file, omit this parameter.[1]
[1] open() is one of the few variadic system calls.

The return value from open() is either the new file descriptor or -1 to indicate an error, in which case errno will
be set. For simple I/O, the flags argument should be one of the values in Table 4.3.

Table 4.3. Flag values for open()

Symbolic constant Value Meaning

O_RDONLY 0 Open file only for reading; writes will fail.
O_WRONLY 1 Open file only for writing; reads will fail.
O_RDWR 2 Open file for reading and writing.

We will see example code shortly. Additional values for flags are described in Section 4.6, "Creating Files,"
page 106. Much early Unix code didn't use the symbolic values. Instead, the numeric value was used. Today this
is considered bad practice, but we present the values so that you'll recognize their meanings if you see them.

The close() system call closes a file: The entry for it in the system's file descriptor table is marked as unused,
and no further operations may be done with that file descriptor. The declaration is

#include <unistd.h> POSIX

int close(int fd);

The return value is 0 on success, -1 on error. There isn't much you can do if an error does occur, other than
report it. Errors closing files are unusual, but not unheard of, particularly for files being accessed over a network.
Thus, it's good practice to check the return value, particularly for files opened for writing.
If you choose to ignore the return value, specifically cast it to void, to signify that you don't care about the result:

(void) close(fd); /* throw away return value */

The flip side of this advice is that too many casts to void tend to clutter the code. For example, despite the
"always check the return value" principle, it's exceedingly rare to see code that checks the return value of
printf() or bothers to cast it to void. As with many aspects of C programming, experience and judgment
should be applied here too.

As mentioned, the number of open files, while large, is limited, and you should always close files when you're
done with them. If you don't, you will eventually run out of file descriptors, a situation that leads to a lack of
robustness on the part of your program.

The system closes all open files when a process exits, but—except for 0, 1, and 2—it's bad form to rely on this.

When open() returns a new file descriptor, it always returns the lowest unused integer value. Always. Thus, if file
descriptors 0–6 are open and the program closes file descriptor 5, then the next call to open() returns 5, not 7.
This behavior is important; we see later in the book how it's used to cleanly implement many important Unix
features, such as I/O redirection and piping.
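
As a simplified preview of that technique, redirecting a program's own standard input could look like the following sketch; error handling is omitted, and the redirect_stdin() name is ours:

#include <fcntl.h>
#include <unistd.h>

/* redirect_stdin --- make the named file become standard input */

int redirect_stdin(const char *file)
{
    close(0);                       /* descriptor 0 is now the lowest free one */
    return open(file, O_RDONLY);    /* on success this returns 0: the new stdin */
}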

4.4.2.1. Mapping FILE * Variables to File Descriptors

The Standard I/O library functions and FILE * variables from <stdio.h>, such as stdin, stdout, and
stderr, are built on top of the file-descriptor-based system calls.

Occasionally, it's useful to directly access the file descriptor associated with a <stdio.h> file pointer if you need
to do something not defined by the ISO C standard. The fileno() function returns the underlying file descriptor:

#include <stdio.h> POSIX

int fileno(FILE *stream);

We will see an example later, in Section 4.4.4, "Example: Unix cat," page 99.
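
In the meantime, here is a small taste (a sketch of ours, not the book's example): the isatty() function, described in Section 6.4, combines naturally with fileno():

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* does our standard output refer to a terminal? */
    if (isatty(fileno(stdout)))
        printf("output is going to a terminal\n");
    else
        printf("output has been redirected\n");
    return 0;
}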

4.4.2.2. Closing All Open Files

Open files are inherited by child processes from their parent processes. They are, in effect, shared. In particular,
the position in the file is shared. We leave the details for discussion later, in Section 9.1.1.2, "File Descriptor
Sharing," page 286.

Since programs can inherit open files, you may occasionally see programs that close all their files in order to start
out with a "clean slate." In particular, code like this is typical:

int i;

/* leave 0, 1, and 2 alone */
for (i = 3; i < getdtablesize(); i++)
    (void) close(i);

Assume that the result of getdtablesize() is 1024. This code works, but it makes (1024 – 3) * 2 = 2042
system calls. 1020 of them are needless, since the return value from getdtablesize() doesn't change. Here is
a better way to write this code:

int i, fds;

for (i = 3, fds = getdtablesize(); i < fds; i++)
    (void) close(i);

Such an optimization does not affect the readability of the code, and it can make a difference, particularly on slow
systems. In general, it's worth looking for cases in which loops compute the same result repeatedly, to see if such
a computation can't be pulled out of the loop. In all such cases, though, be sure that you (a) preserve the code's
correctness and (b) preserve its readability!

4.4.3. Reading and Writing

I/O is accomplished with the read() and write() system calls, respectively:

#include <sys/types.h> POSIX
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

ssize_t read(int fd, void *buf, size_t count);
ssize_t write(int fd, const void *buf, size_t count);

Each function is about as simple as can be. The arguments are the file descriptor for the open file, a pointer to a
buffer to read data into or to write data from, and the number of bytes to read or write.

The return value is the number of bytes actually read or written. (This number can be smaller than the requested
amount: For a read operation this happens when fewer than count bytes are left in the file, and for a write
operation it happens if a disk fills up or some other error occurs.) The return value is -1 if an error occurred, in
which case errno indicates the error. When read() returns 0, it means that end-of-file has been reached.

We can now show the rest of the code for ch04-cat. The process() routine uses 0 if the input filename is "-",
for standard input (lines 50 and 51). Otherwise, it opens the given file:

36 /*
37  * process --- do something with the file, in this case,
38  *             send it to stdout (fd 1).
39  * Returns 0 if all OK, 1 otherwise.
40  */
41
42 int
43 process(char *file)
44 {
45     int fd;
46     ssize_t rcount, wcount;
47     char buffer[BUFSIZ];
48     int errors = 0;
49
50     if (strcmp(file, "-") == 0)
51         fd = 0;
52     else if ((fd = open(file, O_RDONLY)) < 0) {
53         fprintf(stderr, "%s: %s: cannot open for reading: %s\n",
54             myname, file, strerror(errno));
55         return 1;
56     }

The buffer buffer (line 47) is of size BUFSIZ; this constant is defined by <stdio.h> to be the "optimal" block
size for I/O. Although the value for BUFSIZ varies across systems, code that uses this constant is clean and
portable.

The core of the routine is the following loop, which repeatedly reads data until either end-of-file or an error is
encountered:

58     while ((rcount = read(fd, buffer, sizeof buffer)) > 0) {
59         wcount = write(1, buffer, rcount);
60         if (wcount != rcount) {
61             fprintf(stderr, "%s: %s: write error: %s\n",
62                 myname, file, strerror(errno));
63             errors++;
64             break;
65         }
66     }

The rcount and wcount variables (line 45) are of type ssize_t, "signed size_t," which allows them to hold
negative values. Note that the count value passed to write() is the return value from read() (line 59). While
we want to read fixed-size BUFSIZ chunks, it is unlikely that the file itself is a multiple of BUFSIZ bytes big. When
the final, smaller, chunk of bytes is read from the file, the return value indicates how many bytes of buffer
received new data. Only those bytes should be copied to standard output, not the entire buffer.

The test 'wcount != rcount' on line 60 is the correct way to check for write errors; if some, but not all, of the
data were written, then wcount will be positive but smaller than rcount.

Finally, process() checks for read errors (lines 68–72) and then attempts to close the file. In the (unlikely) event
that close() fails (line 75), it prints an error message. Avoiding the close of standard input isn't strictly necessary
in this program, but it's a good habit to develop for writing larger programs, in case other code elsewhere wants
to do something with it or if a child program will inherit it. The last statement (line 82) returns 1 if there were
errors, 0 otherwise.

68     if (rcount < 0) {
69         fprintf(stderr, "%s: %s: read error: %s\n",
70             myname, file, strerror(errno));
71         errors++;
72     }
73
74     if (fd != 0) {
75         if (close(fd) < 0) {
76             fprintf(stderr, "%s: %s: close error: %s\n",
77                 myname, file, strerror(errno));
78             errors++;
79         }
80     }
81
82     return (errors != 0);
83 }

ch04-cat checks every system call for errors. While this is tedious, it provides robustness (or at least clarity):
When something goes wrong, ch04-cat prints an error message that is as specific as possible. The combination
of errno and strerror() makes this easy to do. That's it for ch04-cat, only 88 lines of code!

To sum up, there are several points to understand about Unix I/O:

I/O is uninterpreted.

The I/O system calls merely move bytes around. They do no interpretation of the data; all interpretation is
up to the user-level program. This makes reading and writing binary structures just as easy as reading and
writing lines of text (easier, really, although using binary data introduces portability problems).

I/O is flexible.

You can read or write as many bytes at a time as you like. You can even read and write data one byte at a
time, although doing so for large amounts of data is more expensive than doing so in large chunks.

I/O is simple.

The three-valued return (negative for error, zero for end-of-file, positive for a count) makes programming
straightforward and obvious.

I/O can be partial.

Both read() and write() can transfer fewer bytes than requested. Application code (that is, your code)
must always be aware of this.
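
A common defensive idiom for writing, sketched here (the full_write() name is ours), loops until the whole buffer has been transferred or a real error occurs:

#include <unistd.h>

/* full_write --- keep calling write() until all count bytes are out */

ssize_t full_write(int fd, const char *buf, size_t count)
{
    size_t total = 0;
    ssize_t n;

    while (total < count) {
        n = write(fd, buf + total, count - total);
        if (n < 0)
            return -1;              /* errno describes the failure */
        total += n;                 /* partial write: advance and retry */
    }
    return total;
}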

4.4.4. Example: Unix cat

As promised, here is the V7 version of cat.[2] It begins by checking for options. The V7 cat accepts a single
option, -u, for doing unbuffered output.
[2] See /usr/src/cmd/cat.c in the V7 distribution. The program compiles without change under GNU/Linux.

The basic design is similar to the one shown above; it loops over the files named by the command-line arguments
and reads each file, one character at a time, sending the characters to standard output. Unlike our version, it uses
the <stdio.h> facilities. In many ways code using the Standard I/O library is easier to read and write, since all
buffering issues are hidden by the library.

1  /*
2   * Concatenate files.
3   */
4
5  #include <stdio.h>
6  #include <sys/types.h>
7  #include <sys/stat.h>
8
9  char stdbuf[BUFSIZ];
10
11 main(argc, argv)                       int main(int argc, char **argv)
12 char **argv;
13 {
14     int fflg = 0;
15     register FILE *fi;
16     register c;
17     int dev, ino = -1;
18     struct stat statb;
19
20     setbuf(stdout, stdbuf);
21     for( ; argc>1 && argv[1][0]=='-'; argc--,argv++) {
22         switch(argv[1][1]) {           Process options
23         case 0:
24             break;
25         case 'u':
26             setbuf(stdout, (char *)NULL);
27             continue;
28         }
29         break;
30     }
31     fstat(fileno(stdout), &statb);     Lines 31–36 explained in Chapter 5
32     statb.st_mode &= S_IFMT;
33     if (statb.st_mode!=S_IFCHR && statb.st_mode!=S_IFBLK) {
34         dev = statb.st_dev;
35         ino = statb.st_ino;
36     }
37     if (argc < 2) {
38         argc = 2;
39         fflg++;
40     }
41     while (--argc > 0) {               Loop over files
42         if (fflg || (*++argv)[0]=='-' && (*argv)[1]=='\0')
43             fi = stdin;
44         else {
45             if ((fi = fopen(*argv, "r")) == NULL) {
46                 fprintf(stderr, "cat: can't open %s\n", *argv);
47                 continue;
48             }
49         }
50         fstat(fileno(fi), &statb);     Lines 50–56 explained in Chapter 5
51         if (statb.st_dev==dev && statb.st_ino==ino) {
52             fprintf(stderr, "cat: input %s is output\n",
53                 fflg?"-": *argv);
54             fclose(fi);
55             continue;
56         }
57         while ((c = getc(fi)) != EOF)  Copy file contents to stdout
58             putchar(c);
59         if (fi!=stdin)
60             fclose(fi);
61     }
62     return(0);
63 }

Of note is that the program always exits successfully (line 62); it could have been written to note errors and
indicate them in main()'s return value. (The mechanics of process exiting and the meaning of different exit status
values are discussed in Section 9.1.5.1, "Defining Process Exit Status," page 300.)

The code dealing with the struct stat and the fstat() function (lines 31–36 and 50–56) is undoubtedly
opaque, since we haven't yet covered these functions, and won't until the next chapter. (But do note the use of
fileno() on line 50 to get at the underlying file descriptor associated with the FILE * variables.) The idea
behind the code is to make sure that no input file is the same as the output file. This is intended to prevent infinite
file growth, in case of a command like this:

$ cat myfile >> myfile Append one copy of myfile onto itself?

And indeed, the check works:

$ echo hi > myfile                     Create a file
$ v7cat myfile >> myfile               Attempt to append it onto itself
cat: input myfile is output

If you try this with ch04-cat, it will keep running, and myfile will keep growing until you interrupt it. The GNU
version of cat does perform the check. Note that something like the following is beyond cat's control:

$ v7cat < myfile > myfile
cat: input - is output
$ ls -l myfile
-rw-r--r-- 1 arnold devel 0 Mar 24 14:17 myfile

In this case, it's too late because the shell truncated myfile (with the > operator) before cat ever gets a chance
to examine the file!

In Section 5.4.4.2, "The V7 cat Revisited," page 150, we explain the struct stat code.
4.5. Random Access: Moving Around within a File
So far, we have discussed sequential I/O, whereby data are read or written beginning at the front of the file and
continuing until the end. Often, this is all a program needs to do. However, it is possible to do random access
I/O; that is, read data from an arbitrary position in the file, without having to read everything before that position
first.

The offset of a file descriptor is the position within an open file at which the next read or write will occur. A
program sets the offset with the lseek() system call:

#include <sys/types.h> /* for off_t */ POSIX
#include <unistd.h> /* declares lseek() and whence values */

off_t lseek(int fd, off_t offset, int whence);

The type off_t (offset type) is a signed integer type representing byte positions (offsets from the beginning)
within a file. On 32-bit systems, the type is usually a long. However, many modern systems allow very large files,
in which case off_t may be a more unusual type, such as a C99 int64_t or some other extended type.
lseek() takes three arguments, as follows:

int fd

The file descriptor for the open file.


off_t offset

A position to which to move. The interpretation of this value depends on the whence parameter. offset
can be positive or negative: Negative values move toward the front of the file; positive values move toward
the end of the file.
int whence

Describes the location in the file to which offset is relative. See Table 4.4.

Table 4.4. whence values for lseek()

Symbolic constant Value Meaning

SEEK_SET 0 offset is absolute, that is, relative to the beginning of the file.
SEEK_CUR 1 offset is relative to the current position in the file.
SEEK_END 2 offset is relative to the end of the file.

Much old code uses the numeric values shown in Table 4.4. However, any new code you write should use the
symbolic values, whose meanings are clearer.

The meaning of the values and their effects upon file position are shown in Figure 4.1. Assuming that the file has
3000 bytes and that the current offset is 2000 before each call to lseek(), the new position after each call is as
shown:

Figure 4.1. Offsets for lseek()

Negative offsets relative to the beginning of the file are meaningless; they fail with an "invalid argument" error.

The return value is the new position in the file. Thus, to find out where in the file you are, use

off_t curpos;
...
curpos = lseek(fd, (off_t) 0, SEEK_CUR);
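
Similarly, a quick way to learn an open file's size is to seek to the end (a sketch; note that this also moves the offset, so you may need to seek back afterwards):

off_t filesize;
...
filesize = lseek(fd, (off_t) 0, SEEK_END);    /* new offset == file size */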

The l in lseek() stands for long. lseek() was introduced in V7 Unix when file sizes were extended; V6 had
a simple seek() system call. As a result, much old documentation (and code) treats the offset parameter as if it
had type long, and instead of a cast to off_t, it's not unusual to see an L suffix on constant offset values:

curpos = lseek(fd, 0L, SEEK_CUR);

On systems with a Standard C compiler, where lseek() is declared with a prototype, such old code continues
to work since the compiler automatically promotes the 0L from long to off_t if they are different types.

One interesting and important aspect of lseek() is that it is possible to seek beyond the end of a