Apache
Server Administration
Charles Aulds
SYBEX®
Linux Apache Web Server
Administration
Charles Aulds
Copyright © 2001 SYBEX Inc., 1151 Marina Village Parkway, Alameda, CA 94501. World rights reserved.
No part of this publication may be stored in a retrieval system, transmitted, or reproduced in any way, includ-
ing but not limited to photocopy, photograph, magnetic, or other record, without the prior agreement and
written permission of the publisher.
ISBN: 0-7821-2734-7
SYBEX and the SYBEX logo are either registered trademarks or trademarks of SYBEX Inc. in the United
States and/or other countries.
Screen reproductions produced with FullShot 99. FullShot 99 © 1991-1999 Inbit Incorporated. All rights
reserved.
FullShot is a trademark of Inbit Incorporated.
Netscape Communications, the Netscape Communications logo, Netscape, and Netscape Navigator are
trademarks of Netscape Communications Corporation.
Netscape Communications Corporation has not authorized, sponsored, endorsed, or approved this publica-
tion and is not responsible for its content. Netscape and the Netscape Communications Corporate Logos are
trademarks and trade names of Netscape Communications Corporation. All other product names and/or
logos are trademarks of their respective owners.
TRADEMARKS: SYBEX has attempted throughout this book to distinguish proprietary trademarks from
descriptive terms by following the capitalization style used by the manufacturer.
The author and publisher have made their best efforts to prepare this book, and the content is based upon
final release software whenever possible. Portions of the manuscript may be based upon pre-release versions
supplied by software manufacturer(s). The author and the publisher make no representation or warranties of
any kind with regard to the completeness or accuracy of the contents herein and accept no liability of any kind
including but not limited to performance, merchantability, fitness for any particular purpose, or any losses or
damages of any kind caused or alleged to be caused directly or indirectly from this book.
10 9 8 7 6 5 4 3 2 1
Foreword
Linux and open-source software are synonymous in the minds of most people. Many cor-
porations fear Linux and reject it for mission-critical applications because it is open
source. They mistakenly believe that it will be less secure or less reliable because the code
is openly available and the system has been developed by a diverse collection of groups
and individuals from around the world. Yet those same organizations depend on open-
source systems every day, often without being aware of it.
The Internet is a system built on open-source software. From the very beginning, when
the U.S. government placed the source code of the Internet Protocol in the public domain,
open-source software has led the way in the development of the Internet. To this day, the
Internet and the applications that run on it depend on open-source software.
One of the greatest success stories of the Internet is the World Wide Web—the Internet’s
killer application. The leading Web server software is Apache, an open source product.
No library of Linux system administration books could be complete without a book on
Apache configuration and administration.
Linux and Apache are a natural combination—two reliable, powerful, open source prod-
ucts that combine to create a great Web server!
Craig Hunt
September 2000
Acknowledgments
If I ever believed that a technical book was the work of a single author, I no longer hold
that belief. In this short section, I would like to personally acknowledge a few of the many
people who participated in writing this book. A lot of credit goes to the Sybex production
and editing team, most of whom I didn’t work with directly and will never know.
Craig Hunt, editor of this series, read all of the material and helped organize the book,
giving it a continuity and structure that brings together all of the many pieces of the
Apache puzzle. Before I met Craig, however, I knew Maureen Adams, the acquisition
editor who recommended me for this book. Her confidence in my ability to accomplish
this gave me the resolve to go further than simply saying, “I believe that some day I might
write a book.” Associate Publisher Neil Edde’s can-do attitude and problem-solving skills
also helped the project over a few bumps in the road.
Also part of the Sybex team, production editor Dennis Fitzgerald kept the project on
schedule. Many times, prioritizing a long list of things that needed to be done was the first
step toward their accomplishment. Jim Compton, editor, provided invaluable editing
assistance, and often surprised me with his keen grasp of the technical material, many
times suggesting changes that went far beyond the merely syntactic or grammatical. Will
Deutsch was the technical editor for this book, and his research background and experi-
ence filled in more than a few gaps in my own store of knowledge.
Electronic publishing specialist Franz Baumhackl handled the typesetting and layout
promptly and skillfully, as usual.
I must thank my employer, Epic Data — Connectware Products Group, for allowing me
the freedom to work on this book. Particular thanks go to Linda Matthews, who was my
supervisor during most of the project.
I also appreciate the time my keen engineering friend, Carl Sewell, spent reviewing all of
the material I’d written, and I thank my Epic colleague Robert Schaap, whose knowledge
of Apache and comments on the use of the mod_rewrite module proved quite valuable.
Last, but certainly most of all, I want to thank my dear wife, Andrea, for her unwavering
support during what turned out to be a much harder endeavor than I anticipated. Finding
time to devote to this project was the biggest challenge I had to overcome, and she found
ways to give me that, taking on many of the household and outdoor chores that had been
my responsibility.
Contents at a Glance
Introduction . . . . . . . . . . . . . . . . . . . . . . . xviii
Appendices 533
Index . . . . . . . . . . . . . . . . . . . . . . . . . . 590
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . xviii
Appendices 533
Appendix A Apache Directives . . . . . . . . . . . . . . . . 535
Appendix B Online References . . . . . . . . . . . . . . . . 549
WWW and HTTP Resources . . . . . . . . . . . . . 550
General Apache Resources . . . . . . . . . . . . . . 551
Resources for Apache Modules . . . . . . . . . . . . 555
Index . . . . . . . . . . . . . . . . . . . . . . . . . . 590
Introduction
The first Internet Web servers were experimental implementations of the concepts, proto-
cols, and standards that underlie the World Wide Web. Originally a performance-oriented
alternative to these early Web servers, Apache has been under active development by a
large cadre of programmers around the world. Apache is the most widely used Web server
for commercial Web sites and it is considered by many Webmasters to be superior to com-
mercial Web server software.
Like Linux, Apache owes much of its incredible success to the fact that it has always been
distributed as open-source software. Apache is freely available under a nonrestrictive
license (which I’ll discuss in Chapter 1) and distributed in the form of source code, which
can be examined or modified. There’s nothing up the developers’ sleeves. While the sharing
of intellectual property has always appealed to those who program computers primarily for
the sheer love of it, until quite recently the motivations of the open-source community were
lost on the business world, which understood only the bottom line on the balance sheet.
Today, however, the situation is much different from what it was when Apache and Linux
were first introduced, and many companies now see open-source in terms of cost savings
or as a way of leveraging technology without having to develop it from scratch. The open
source software model seems to be with us to stay, and many companies have been struc-
tured to profit from it, by offering solutions and services based on computer programs they
didn’t create. While recent security and performance enhancements to its commercial rivals
have left Apache’s technical superiority in question, there is no doubt that Apache is a
robust product in the same class as commercial Web engines costing far more. Apache is
free, which enables anyone willing to make a moderate investment in inexpensive com-
puting equipment to host Web services with all the features of a world-class site.
Linux is an excellent platform upon which to run a Web server. A review of Web server
engines by Network Computing magazine made the point that, while some commercial
applications now (surprisingly) exhibit performance superior to that of Apache, the
underlying operating system plays a critical role in determining the overall reliability,
security, and availability of a Web server. This is particularly true in e-commerce appli-
cations. Apache was given high marks when coupled with the robustness provided by the
Linux operating system. While Apache is now available for non-Unix/Linux platforms,
the real value of Apache is realized on Unix-like operating systems. To an increasing
number of businesses today that means using Linux, with its unparalleled ability to com-
pete on a price/performance basis.
Whenever Linux is used to provide commercial-quality Web services, Apache is the first and
best choice of web server software. The intended reader of this book is someone who is
using both Apache and Linux for the same reasons: quality, reliability, features, and price.
Appendices
Four appendices present essential reference information about various aspects of Apache
administration.
Conventions
This book uses the following typographical conventions:
Program Font is used to identify the Linux and Apache commands and direc-
tives, file and path names, and URLs that occur within the body of the text and
in listings and examples.
Bold is used to indicate something that must be typed in as shown, such as com-
mand-line input in listings.
Italic is used in directive or command syntax to indicate a variable for which
you must provide a value. For example,
UserDir enabled usernames
means that in entering the UserDir directive with the enabled option, you would
need to supply real user names.
[ ] in a directive’s syntax enclose an item that is optional.
| is a vertical bar that means you should choose one keyword or another in a
directive’s syntax.
Linux Library
Part 1 How Things Work
Featuring:
■ A brief history of the World Wide Web and Apache
■ How the HyperText Transfer Protocol (HTTP) works
■ HTTP/1.0 response codes and other headers
■ Apache’s importance in the marketplace
■ Other Web servers: free and commercial alternatives to Apache
■ Major features of Apache
■ Features planned for Apache version 2.0
Chapter 1 An Overview of the World Wide Web
No book written about Apache, the most widely used Web server software on the
Internet today, would be complete without a discussion of the World Wide Web (WWW)
itself—how it came into existence, and how it works. Understanding the underlying tech-
nology is a key part of mastering any technical topic, and the technology that underlies
Apache is the World Wide Web. This chapter is an introductory overview of a vast subject.
The chapter begins with a history of the World Wide Web, introducing the Apache Web
server, and then moves through an explanation of how the Web works, with a short intro-
ductory tour to the inner workings of the HyperText Markup Language (HTML) and the
HyperText Transfer Protocol (HTTP). We’ll look at new features of the HTTP/1.1 version
of the protocol and use three common tools to observe the protocols in action.
made up the first Apache server. In the true spirit and style of what is best about the
A time line of the early history of the World Wide Web and Apache:

March 1989: Tim Berners-Lee proposes a hypertext-based computer system.
October 1990: Work begins on the project that Berners-Lee names World Wide Web.
March 1991: The National Science Foundation lifts restrictions on the commercial use of the Internet.
January 1992: First version of line-mode browser released to public.
February 1993: NCSA releases first version of Mosaic for X Window System.
October 1993: Number of known HTTP servers exceeds 200.
March 1994: Marc Andreessen, et al. form Mosaic Communications (later renamed Netscape).
February 1995: Apache Group (later renamed the Apache Software Foundation) founded.
April 1995: First version of Apache released.
August 1995: Netcraft survey shows 18,957 HTTP servers.
January 1996: Netcraft survey shows 74,709 HTTP servers.
April 1996: Apache is the most widely used Web server on the Internet.
April 1997: The number of HTTP servers exceeds 1 million.
Web page: “Cool!” No competing scheme for exchanging information that ignored the
“cool” factor stood a chance against the Web.
At the heart of the design of the Web is the concept of the hyperlink. The clickable links
on a Web page can point to resources located anywhere in the world. The designers of the
first hypertext information system started with this concept. For this concept to work on
a major scale, three pieces of the Web had to be invented. First, there had to be a univer-
sally accepted method of uniquely defining each Web resource. This naming scheme is the
Uniform Resource Locator (URL), described in the accompanying sidebar. The second
piece was a scheme for formatting Web-delivered documents so that a named resource
could become a clickable link in another document. This formatting scheme is the Hyper-
Text Markup Language (HTML). The third piece of the Web is some means to bring
everything together into one huge information system. That piece of the puzzle is the net-
work communication protocol that links any client workstation to any of millions of web
servers: the HyperText Transfer Protocol (HTTP).
A hyperlink embedded in an HTML-formatted page is only one way to use a URL, but it is
the hyperlink that gave rise to the Web. If we had to resort to exchanging URLs by hand-
writing them on napkins, there would be no Web. Most of us think of the Web in terms
of visiting Web sites, but the mechanism is not one of going somewhere, it is one of
retrieving a resource (usually a Web page) across a network using the unique identifier for
the resource: its URL.
URLs can also be manually entered into a text box provided for that purpose in a Web
browser, or saved as a bookmark for later point-and-click retrieval. Most e-mail pro-
grams today allow URLs to be included in the message body so that the recipient can
simply click on them to retrieve the named resource. Some e-mail packages allow you to
embed images in the message body using URLs. When the message is read, the image is
retrieved separately; it could reside on any Internet server, not necessarily the sender’s
machine.
What is a URL?
Each URL is composed of three parts: a mechanism (or protocol) for retrieving the
resource, the hostname of a server that can provide the resource, and a name for the
resource. The resource name is usually a filename preceded by a partial path, which in
Apache is relative to the path defined as the DocumentRoot. Here’s an example of a URL:
http://www.apache.org/docs/misc/FAQ.html
This URL identifies a resource on a server whose Internet name is www.apache.org.
The resource has the filename FAQ.html and probably resides in a directory named
misc, which is a subdirectory of docs, a subdirectory of the directory the server knows
as DocumentRoot, although as we’ll see later, there are ways to redirect requests to
other parts of the file system. The URL also identifies the Hypertext Transfer Protocol
(HTTP) as the protocol to be used to retrieve the files. The http:// protocol is so widely
used that it is the default if nothing is entered for the protocol. The only other common
retrieval method you’re likely to see in a URL is ftp://, although your particular
browser probably supports a few others, including news:// and gopher://.
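To make the mapping concrete, here is a sketch of how the pieces of that URL might map onto the server’s file system (the DocumentRoot shown is only a typical default, not necessarily the one used by www.apache.org):

URL:  http://www.apache.org/docs/misc/FAQ.html
      protocol  hostname        resource (path and filename)
File: /usr/local/apache/htdocs/docs/misc/FAQ.html
      (assuming DocumentRoot is /usr/local/apache/htdocs)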
A URL can also invoke a program such as a CGI script written in Perl, which might
look like this:
http://jackal.hiwaay.net/cgi-bin/comments.cgi
It was the Web browser, with its ability to render attractive screens from HTML-formatted
documents, that initially caught the eye of the public. Beneath the pretty graphical inter-
face of the browser, however, the Web is an information-delivery system consisting of
client and server software components that communicate over a network. These compo-
nents communicate using the HyperText Transfer Protocol (HTTP). The following sec-
tions describe this client/server relationship and the HTTP protocol used to move Web
data around the world. This provides an introduction to the subject of the book, Apache,
which is the foremost implementation of the HTTP server component.
Although the Web server’s primary purpose is to distribute information from a central
computer, modern Web servers perform other tasks as well. Before the file transfer, most
modern Web servers send descriptive information about the requested resource, instructing
the client how to interpret or format the resource. Many Web servers perform user authen-
tication and data encryption to permit applications like online credit card purchasing.
Another common feature of Web servers is that they provide database access on behalf of
the client, eliminating the need for the client to use a full-featured database client appli-
cation. Apache provides all of these features.
The one protocol that all Web servers and browsers must support is the Hypertext Transfer Protocol (HTTP).
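In the earliest version of the protocol, known as HTTP/0.9, a request was nothing more than a single line naming the method and the resource. A minimal sketch of such a request (the filename welcome.html is used here only for illustration) looks like this:
GET /welcome.html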
Upon receiving this request, a server responds by sending a document stored in the file
welcome.html, if it exists in the server’s defined DocumentRoot directory, or an error
response if it does not. Today’s Web servers still respond to HTTP/0.9 requests, but only the
very oldest browsers in existence still form their requests in that manner. HTTP/0.9 was
officially laid to rest in May 1996 with the release of Request for Comments (RFC) 1945
(“Hypertext Transfer Protocol—HTTP/1.0”), which formally defined HTTP version 1.0.
The most important addition to the HTTP protocol in version 1.0 was the use of headers
that describe the data being transferred. It is these headers that instruct the browser how to
treat the data. The most common header used on the Web is certainly this one:
Content-Type: text/html
This header instructs the browser to treat the data that follows it as text formatted using
the HyperText Markup Language (HTML). HTML formatting codes embedded in the
text describe how the browser will render the page. Most people think of HTML when
they think of the Web. We’re all familiar with how an HTML document appears in a
browser, with its tables, images, clickable buttons and, most importantly, clickable links
to other locations. The use of HTML is not limited to applications written for the Web.
The most popular electronic mail clients in use today all support the formatting of mes-
sage bodies in HTML.
The important thing to remember is that the Web’s most commonly used formatting spec-
ification (HTML) and the network transfer protocol used by all Web servers and
browsers (HTTP) are independent. Neither relies exclusively on the other or insists on its
use. Of the two, HTTP is the specification most tightly associated with the Web and needs
to be part of all World Wide Web server and browser software.
The shakeout in the Web browser market, reducing the field of major competitors to just
Method    Purpose
HEAD      Identical to GET except that the server does not return a message body to the
          client. Essentially, this returns only the HTTP header information.
POST      Instructs the server to receive information from the client; used most often to
          receive information entered into Web forms.
PUT       Allows the client to send the resource identified in the request URL to the server.
          The server, if it will accept the PUT, opens a file into which it saves the
          information it receives from the client.
TRACE     Initiates a loopback of the request message for testing purposes, allowing the
          client to see exactly what is being seen by the server.
DELETE    Requests that the server delete the resource identified in the request URL.
CONNECT   Instructs a Web proxy to tunnel a connection from the client to the server,
          rather than proxying the request.
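As a quick sketch of how the methods differ in practice, a HEAD request entered by hand (the hostname is hypothetical) asks the server for the response headers only, with no message body:
$ telnet www.example.com 80
HEAD / HTTP/1.0
The request is ended with a blank line; the server then returns the status line and headers and closes the connection. The mechanics of entering requests this way are covered in the telnet section later in this chapter.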
Using lwp-request
If you’ve installed the collection of Perl modules and utility scripts collectively known as
libwww-perl, you can use the lwp-request script that comes with that package to test HTTP
connections. With this script, you can specify different request methods and display options.
The following example illustrates the use of the -e argument to display response headers
(more on headers shortly) with the -d argument to suppress the content in the response:
# lwp-request -e -d http://jackal.hiwaay.net/
Cache-Control: max-age=604800
Connection: close
Date: Wed, 21 Jun 2000 14:17:36 GMT
Accept-Ranges: bytes
Server: Apache/1.3.12 (Unix) mod_perl/1.24
Content-Length: 3942
Content-Type: text/html
ETag: "34062-f66-392bdcf1"
libwww-perl consists of several scripts, supported by the following standard Perl modules
(available separately, although most easily installed as part of the libwww-perl bundle):
URI Support for Uniform Resource Identifiers
Net::FTP Support for the FTP protocol
MIME::Base64 Required for authentication headers
Digest::MD5 Required for Digest authentication
HTML::HeadParser Support for HTML headers
Even though you may not actually use the functionality of one of these modules, they
must be properly installed on your machine to use the utility scripts provided with
libwww-perl. Use the following commands to install everything at once, on a Linux system
on which you have the CPAN.pm module:
# cpan
cpan> install Bundle::LWP
Among the utilities provided with libwww-perl, the most important (and the one most
useful for examining the exchange of headers in an HTTP transaction) is lwp-request.
Another that I find very useful, however, is lwp-download, which can be used to retrieve
a resource from a remote server. Note that besides the HTTP URL shown in this example,
you can also use an FTP URL (a sketch follows the example):
# lwp-download http://jackal.hiwaay.net
Saving to 'index.html'...
3.85 KB received
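For instance, retrieving a file by FTP (the host and path here are hypothetical) works the same way:
# lwp-download ftp://ftp.example.com/pub/archive/somefile.tar.gz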
CPAN
The best way to maintain the latest versions of all Perl modules is to use the CPAN.pm
module. This powerful module is designed to ensure that you have the latest avail-
able versions of Perl modules registered with the Comprehensive Perl Archive
Network or CPAN (http://cpan.org). CPAN archives virtually everything that has to
do with Perl, including software as source code and binary ports, along with docu-
mentation, code samples, and newsgroup postings. The CPAN site is mirrored at
over 100 sites around the world, for speed and reliability. You generally choose the
mirror nearest you geographically.
The CPAN.pm Perl module completely automates the processes of comparing your
installed modules against the latest available in the CPAN archives, downloading
modules, building modules (using the enclosed makefiles) and installing them. The
module is intelligent enough to connect to any one of the CPAN mirror sites and
(using FTP) can download lists of the latest modules for comparison against your
local system to see whether you have modules installed that need upgrading. Once
you install it, CPAN.pm even updates itself! Not only does the module automate the
process of updating and installing modules, it makes the process almost bulletproof.
I have never experienced problems with the module.
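For example, once the module is installed, a quick way to see which of your installed modules have newer versions available in the archive is the CPAN shell's r command (the output, of course, varies from system to system):
# cpan
cpan> r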
Another powerful Perl tool for observing the HTTP protocol is HttpSniffer.pl.
Although not as convenient as lwp-request, because it does require setup and a separate
client component (usually a Web browser), HttpSniffer.pl allows you to “snoop” on
a real-world HTTP exchange, and it is more useful when you need to examine header
exchanges with a browser (during content negotiation, for example).
Using HttpSniffer.pl
If you are using a fairly up-to-date version of Perl (at least version 5.004), you should con-
sider a utility called HttpSniffer.pl to monitor the headers that are exchanged between
a client browser and a Web server. HttpSniffer.pl acts as an HTTP tunnel, connecting
directly to a remote server, and forwarding connections from client browsers, displaying
the headers (or writing them to a log file) exchanged between the client and server.
Download HttpSniffer.pl directly from its author’s Web site at www.schmerg.com.
You can run the program on any platform running Perl 5.004 (or later). Figure 1.2
shows a typical session. The command window in the foreground shows how I invoked
HttpSniffer.pl, pointing it at my Web server, jackal.hiwaay.net, with the -r argu-
ment. HttpSniffer.pl, by default, receives connections on TCP port 8080, and forwards
them to the specified remote host. The browser in the background (running on the same
HttpSniffer.pl is not only an invaluable debugging tool, it is also the best way to learn
the purpose of HTTP headers, by watching the actual headers that are part of an HTTP
exchange. If you have access to a proxy server, on a remote server, or through Apache’s
mod_proxy (discussed in Chapter 13), you can point HttpSniffer.pl at the proxy, and
then configure your client browser to connect to HttpSniffer.pl as an HTTP proxy
server. That way, you can use your browser to connect to any remote host, as you nor-
mally would, and all requests will be redirected (or proxied) by HttpSniffer.pl. Be pre-
pared for lots of output, though. Generally, you should invoke HttpSniffer.pl with a
line like the following (the -l argument causes all of the output from the command to be
written into the text file specified):
# HttpSniffer.pl -r jackal.hiwaay.net -l /tmp/httpheaders.txt
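With the sniffer running, point the client browser at the local port HttpSniffer.pl listens on rather than at the Web server itself. Assuming the default port of 8080 and a browser on the same machine, that means entering a URL such as:
http://localhost:8080/
Each request and response then passes through HttpSniffer.pl, which displays or logs the headers as they go by.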
The only problem with HttpSniffer.pl and lwp-request is that they are not available
on every Linux system. But telnet is. I use telnet in all of the following examples
because every Linux administrator has access to it and can duplicate these examples.
However, if you have HttpSniffer.pl or lwp-request, I encourage you to use them for
testing.
Using telnet
You can connect directly to a Web server and enter the HTTP request manually with the
Linux telnet command, which allows you to connect to a specific Transmission Control
Protocol (TCP) port on the remote system. Not only will this allow you to see the com-
plete exchange of messages between the client and server, it also gives you complete con-
trol of the session and provides a valuable tool for troubleshooting your Web server.
Enter the following telnet command at the shell prompt, replacing somehost.com with
the name of any server accessible from your workstation and known to be running a Web
server:
telnet somehost.com 80
This command instructs telnet to connect to TCP port 80, which is the well-known port
reserved for HTTP connections. You should receive some confirmation of a successful
connection, but you will not receive data immediately from the remote server. If the pro-
cess listening on Port 80 of the remote system is an HTTP server (as it should be), it sends
nothing upon receiving a connection, because it is waiting for data from the client. This
behavior is defined by the HTTP specification.
The examples that follow are actual traces from my Linux server, which hosts a fully
operational Apache server. I telnet to localhost, which is a special reserved hostname for
the local system. You can do the same, if the system on which you are executing telnet
also hosts an HTTP server. (If you stay with me through Chapter 5, you’ll have a working
system on which to test these commands.) Until then, you can connect to any Web server
on the Internet to perform these tests.
$ telnet localhost 80
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
At this point, telnet has an open connection to the remote HTTP server, which is waiting
for a valid HTTP request. The simplest request you can enter is
GET /
This requests the default Web page for the directory defined as the DocumentRoot. A properly
configured HTTP server should respond with a valid page. Our request, which makes no
mention of the HTTP version we wish to use, causes the server to assume we are using
HTTP/0.9.
</BODY>
</HTML>
The server, which assumes you are a client that understands only HTTP/0.9, simply
sends the requested resource (in this case, the default page for my Web site). In the fol-
lowing example, I’ve issued the same request, but this time my GET line specifies
HTTP/1.0 as the version of HTTP I’m using. Notice this time that the server will not
respond as soon as you type the request and press Enter. It waits for additional informa-
tion (this is normal HTTP/1.0 behavior). Two carriage-return/line-feed character pairs
are required to indicate the end of an HTTP/1.0 request.
$ telnet localhost 80
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
GET / HTTP/1.0
HTTP/1.1 200 OK
Date: Thu, 16 Dec 1999 08:56:36 GMT
Server: Apache/1.3.9 (Unix) mod_perl/1.19
<HTML>
<HEAD>
<TITLE>Charles Aulds's Home Page</TITLE>
Deleted Lines
</HTML>
The response categories contain more than 40 individual response codes. Each is accom-
panied by a short comment that is intended to make the code understandable to the user.
To see a full list of these codes, go to the HTML Writers Guild at www.hwg.org/lists/
hwg-servers/response_codes.html.
When using telnet to test an HTTP connection, it is best to replace the GET request
$ telnet localhost 80
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
GET / HTTP/1.1
177
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<HTML><HEAD>
<TITLE>400 Bad Request</TITLE>
</HEAD><BODY>
<H1>Bad Request</H1>
Your browser sent a request that this server could not understand.<P>
client sent HTTP/1.1 request without hostname (see RFC2068 section 9, and
14.23)
<HR>
<ADDRESS>Apache/1.3.9 Server at Jackal.hiwaay.net Port 80</ADDRESS>
</BODY></HTML>
The response code header clearly indicates that our request failed. This is because HTTP/1.1
requires the client browser to furnish a hostname if it chooses to use HTTP/1.1. Note that the
choice of HTTP version is always the client’s. This hostname will usually be the same as the
hostname of the Web server. (Chapter 6 discusses virtual hosting, in which a single Web server
answers requests for multiple hostnames.)
In addition to warning the client about a failed request, the server makes note of all
request failures in its own log file. The failed request in Listing 1.2 causes the following
error to be logged by the server:
[Wed May 14 04:58:18 2000] [client 192.168.1.2] client sent HTTP/1.1 request
without hostname (see RFC2068 section 9, and 14.23): /
Request redirection is an essential technique for many Web servers, as resources are
moved or retired. (Chapter 10 shows how to use Apache’s tools for aliasing and redirec-
tion.) Listing 1.3 illustrates a redirected request.
# telnet localhost 80
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
GET /~caulds HTTP/1.0
<HR>
If the browser specifies HTTP/1.1 in the request line, the very next line must identify a
hostname for the request, as in Listing 1.4.
# telnet localhost 80
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
GET / HTTP/1.1
Host: www.jackal.hiwaay.net
HTTP/1.1 200 OK
Date: Thu, 16 Dec 1999 11:03:20 GMT
Server: Apache/1.3.9 (Unix) mod_perl/1.19
Last-Modified: Tue, 14 Dec 1999 17:19:11 GMT
ETag: "dd857-ea1-38567c0f"
Accept-Ranges: bytes
Content-Length: 3745
Content-Type: text/html
<HTML>
<HEAD>
<TITLE>Charles Aulds's Home Page</TITLE>
Deleted Lines
If our server answers requests for several virtual hosts, the Host: header of the request
would identify the virtual host that should respond to the request. Better support for vir-
tual site hosting is one of the major enhancements to the HTTP protocol in version 1.1.
In addition to the response code, the server returns a number of descriptive headers. For
example, the test shown in Listing 1.4 produced seven additional headers after the
response code header: Date, Server, Last-Modified, ETag, Accept-Ranges, Content-
Length, and Content-Type. The following sections briefly outline these and other HTTP
headers.
General Headers Headers that carry information about the messages being transmitted
between client and server are lumped into the category of general headers. These headers
do not provide information about the content of the messages being transmitted between
the client and server. Instead, they carry information that applies to the entire session and
to both client request and server response portions of the transaction.
Cache-Control Specifies directives to proxy servers (Chapter 13).
Connection Allows the sender to specify options for this network connection.
Date Standard representation of the date and time the message was sent.
Pragma Used to convey non-HTTP information to any recipient that under-
stands the contents of the header. The contents are not part of HTTP.
Trailer Indicates a set of header fields that can be found in the trailer of a
multiple-part message.
Transfer-Encoding Indicates any transformations that have been applied to
the message body in order to correctly transfer it.
Upgrade Used by the client to specify additional communication protocols it
supports and would like to use if the server permits.
Via Tacked onto the message by proxies or gateways to show that they handled
the message.
Warning Specifies additional information about the status or transformation of
a message which might not be reflected in the message itself.
Request Headers Request headers are used to pass information from HTTP client to
server; these headers always follow the one mandatory line in a request, which contains
the URI of the request itself. Request headers act as modifiers for the actual request,
allowing the client to include additional information that qualifies the request, usually
specifying what constitutes an acceptable response.
Accept Lists all MIME media types the client is capable of accepting.
Accept-Charset Lists all character sets the client is capable of accepting.
Accept-Encoding Lists all encodings (particularly compression schemes) the
client is capable of accepting.
Accept-Language Lists all languages the client is willing to accept.
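As an illustrative sketch (the hostname and header values are hypothetical), a typical browser request might carry several of these headers at once:
GET /index.html HTTP/1.1
Host: www.example.com
Accept: text/html, image/gif, image/jpeg, */*
Accept-Language: en-us
Accept-Encoding: gzip
Accept-Charset: iso-8859-1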
Response Headers The server uses response headers to pass information in addition to
the request response to the requesting client. Response headers usually provide informa-
tion about the response message itself, and not necessarily about the resource being sent
to satisfy a client request. Increasingly, response headers serve to provide information
used by caching gateways or proxy server engines. The response headers will be an impor-
tant part of the discussion on proxy caching (Chapter 13).
Accept-Ranges Specifies units (usually bytes) in which the server will accept
range requests.
Age The sender’s estimate of the time (in seconds) since the response was generated
at the origin server; used chiefly by caches.
ETag Contains the current value of the requested entity tag.
Location Contains a URI to which the client request should be redirected.
Proxy-Authenticate Indicates the authentication scheme and parameters
applicable to the proxy for this request.
Retry-After Used by the server to indicate how long a URI is expected to be
unavailable.
Server Contains information about the software used by the origin server to
handle the request. Apache identifies itself using this header. In Listing 1.4, notice
that the Server header also reports the version of mod_perl in use.
Vary Indicates that the resource has multiple sources that may vary according to
the supplied list of request headers.
WWW-Authenticate Used with a 401-Unauthorized response code to indicate that
the requested URI needs authentication and specifies the authorization scheme
required (usually a username/password pair) and the name of the authorization
realm.
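For example, the relevant lines of a response demanding Basic authentication (the realm name is hypothetical) look like this:
HTTP/1.1 401 Unauthorized
WWW-Authenticate: Basic realm="Protected Area"
The browser typically responds by prompting the user for a username and password and then repeating the request with an Authorization header.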
Entity Headers Entity headers contain information directly related to the resource
being provided to the client in fulfillment of the request, in other words, the response mes-
sage content or body. This information is used by the client to determine how to render
the resource or which application to invoke to handle it (for example, the Adobe Acrobat
reader). Entity headers contain metainformation (or information about information), the
subject of Chapter 16.
Allow Informs the client of valid methods associated with the resource.
Content-Encoding Indicates the encoding (usually compression) scheme
applied to the contents.
Content-Language Indicates the natural language of the contents.
Content-Length Contains the size of the body of the HTTP message.
Content-Location Supplies the resource location for the resource in the message
body, usually used when the resource should be requested using another URI.
Content-MD5 Contains the MD5 digest of the message body, used to verify the
integrity of the resource.
NOTE More details about these and other headers are available in the HTTP
specification RFC 2616.
In Sum
This chapter looked at the World Wide Web, its origins and history, and described briefly
how it functions. An essential part of the design of the Web is the standard set of proto-
cols that allow applications to interoperate with any of the millions of other systems that
make up the Web. The essential protocol that enables the Web to exist is the HyperText
Transfer Protocol (HTTP), which defines how data is communicated between Web clients
and servers. I demonstrated a couple of ways to view HTTP headers and listed those
headers that are defined by the HTTP specification (or RFC). The chapter concluded with
a discussion of the important enhancements to the HTTP protocol that were added in its
current version, HTTP/1.1. This information provides the foundation for understanding
what Apache does and how it does it.
Although Apache is quite well established as the leading Web server on the Internet, it is
by no means the only Web server to compete for that status. The next chapter provides
a brief look at the most important of its competitors, in order to place Apache in its
proper context as it stands out as the best of the breed, even among champion contenders.
I’ll also discuss the important changes that are being made to Apache for the upcoming
2.0 release. These changes will help Apache maintain its dominance of the
Internet Web server market.
Chapter 2 Apache and Other Servers
Chapter 1 presented a brief historical overview of the World Wide Web and the
technologies that make it possible. Fast-forward to the present, and there are a number of
good servers for the Web. This chapter provides a very brief description of the best of those
and compares their architectures to the architecture used by Apache.
I generally don’t like one-size-fits-all systems, and I try to avoid products that are marketed
as the best solution for everyone’s needs. Apache is an exception to this rule, largely
because it is easily customized by design. While Apache runs well on commercial Unix plat-
forms and Microsoft Windows NT, it truly shines on the open-source Unix variants.
Apache is the number one choice for a Web server for both Linux and FreeBSD, and in this
chapter, I’ll tell you why.
The first part of the chapter takes a look at the major Web servers in use on the Internet. The
chapter continues with a look at the present state of Apache, including its current feature set
and features planned for the next release, and ends with a discussion of why Apache is an
excellent and exciting choice to run an Internet Web site.
Looking at my data, I noticed that a large percentage of my Apache sites are running the
binary versions of Apache provided with a canned Linux distribution (mostly Red Hat
and Debian). I concluded that the majority of Internet Web sites today are hosted on Intel
Pentium systems running either Apache on Linux or Microsoft IIS 4.0 on NT, and Apache
holds the lion’s share of the spoils.
Alternatives to Apache
The surveys say that while Apache leads the pack, it is not the only server in widespread
use. This section examines the features and architectures of several other Web servers.
Mathopd
Minimization is taken to the extreme with Mathopd (available from its author at
http://mathop.diva.nl/). The number of options and features in Mathopd is deliber-
ately small. The server is made available only for Unix and Linux operating systems.
Why would anyone want to run Mathopd? The code is designed to handle a very large
number of simultaneous connections. Like the thttpd server, Mathopd uses the select()
system call in Unix, rather than spawning a number of processes or threads to handle
multiple client connections. The result is a very fast Web server, designed to handle the
basic functions required by HTTP/1.1 and occupying a very small memory footprint on
a Unix machine.
A cinch to install and configure, and optimized for the maximum possible speed in serving
static documents to a large number of connecting clients, Mathopd at first seemed a very
attractive alternative to Apache. However, Mathopd offers no user authentication, secure
connections, or support for programming. Upon reflection, I realized that the server was
too limiting for most administrators, without the ability to add functionality, and almost
no one has data pipes sufficiently large to require the speed of Mathopd. What it does,
though, it does better than anyone.
Boa
The last server I’ll mention in the free software category is Boa (www.boa.org), a respect-
able alternative to Apache for those administrators who are looking for greater speed and
system security and are willing to sacrifice some functionality to get it. Boa is another of
the nonforking single-process servers that use the select() system call to multitask I/O.
Boa turns in very good numbers for CGI scripts; probably some of the best numbers (mea-
sured in transactions handled per second) that you’ll get on a Linux Web server. The per-
formance gain apparently comes from the fact that output from CGI scripts spawned by
Boa is sent directly to the client. This is unlike most Web servers, which receive data
output from CGI programs and send it to the Web client (browser).
Stronghold
For those sites that require strong security based on the Secure Sockets Layer (SSL), using
a commercial server often seems an attractive alternative to open-source Apache. There
are good reasons for these e-commerce Web sites to use commercial software. Probably
the best reason to choose a commercial solution is for the support offered by the vendor.
If you go the commercial route, you should take full advantage of that product support.
You are paying not so much for the product as for that company’s expertise in setting up
an SSL Web site. You should expect all the handholding necessary from these companies
in getting your site up and running. Another advantage of a commercial SSL product is
that most include a license to use the cryptographic technology patented by RSA Security,
Inc. The patent that requires licensing of RSA technology applies only in the United
States, and expires in September 2000, so this may be of no relevance to you. If, however,
you are maintaining a Web site in the U.S. and wish to use SSL, you may need to secure
such a license, and purchasing a commercial SSL product is one way to do that. There are
alternatives, though, that I’ll discuss in detail in Chapter 15.
If you are seriously considering a commercial SSL Web product, Stronghold should be
NOTE America Online, Inc. (which owns Netscape Communications) and Sun
Microsystems, Inc. formed the Sun-Netscape Alliance, which now sells Netscape
Enterprise Server as the iPlanet Enterprise Server, Enterprise Edition
(www.iplanet.com). A rose by another name?
Many IT managers in the past liked Netscape Enterprise Server because it is backed by
Netscape Communications, and the support offered by the company can be valuable. In
my opinion, however, the odds of finding documentation that addresses your problem, or
a savvy techie who’s willing to offer truly useful advice, or better still, someone who has
overcome the problem before, are much better with an open-source application like
Apache. Online resources (like those listed in Appendix B) are often every bit as valuable
as the support offered by a commercial vendor.
Roxen
Roxen is actually not a single Web server product; the name is used to refer to a line of
Internet server products offered by Idonex AB of Linköping, Sweden (www.roxen.com).
Roxen Challenger is the Web server and is available for free download. Roxen Chal-
lenger, however, is part of a larger set of integrated Web site development tools called
Roxen Platform. Roxen SiteBuilder is a workgroup environment that lets a group of Web
site developers collaborate in designing a Web site. Like most modern development sys-
tems, SiteBuilder concentrates on separating site display and content.
At a cost of $11,800, Roxen Platform requires a serious financial commitment even
though the Challenger Web server is free. Without the costly developer’s tools, Roxen
Challenger offers no advantages over Apache, which is far more widely used and, as a
result, better supported.
Zeus
The Zeus Web server from Zeus Technology of Cambridge, England (www.zeus.co.uk)
is an excellent commercial Web server for Linux. Zeus consistently turns in superlative
numbers in benchmark tests (like the SPECWeb96 Web server benchmarks published by
the Standard Performance Evaluation Corporation, www.spec.org/osg/web96).
The original version of Zeus was designed for raw speed, with a minimum of overhead
(features and functions). That version of Zeus is still available as version 1.0. Subsequent
releases of the product include a full list of advanced functions expected in a modern e-
commerce Web server. Zeus competes well with Apache in nearly every area, including
speed, functionality, configurability, and scalability. The one area in which Zeus cannot
best Apache is cost. Zeus Web Server version 3 currently costs $1699, with a discounted
price to qualified academic and charitable organizations of $85.
Two features of Zeus that have traditionally appealed to Web server administrators are
its Apache/NCSA httpd compatibility (support for .htaccess files, for example) and the
fact that it can be completely configured from a Web browser. Zeus is especially popular
with Web hosting services and ISPs that host customer Web sites, and the company
increasingly targets this market. Zeus is available for Unix and Linux platforms.
IBM
Most of the Web servers discovered in my survey that did not fall into one of the big three
(Apache, Microsoft, Netscape) were running on some type of IBM hardware, indicated
by Lotus-Domino. Most of them are really running a special version of Apache. Several
Standards Compliance Apache offers full compliance with the HTTP/1.1 standard
(RFC 2616). Apache has strong support for all the improvements made to the HTTP
protocol in version 1.1, such as support for virtual hosts, persistent connections, client
file uploading, enhanced error reporting, and resource caching (in proxy or gateway
servers).
Apache also supports sophisticated content negotiation by HTTP/1.1 browsers,
allowing multiple formats for a single resource to be served to meet the requirements of
different clients. Multiple natural language support is a good example of how this is
commonly used. Chapter 16, “Metainformation and Content Negotiation,” discusses
content negotiation.
Scalability Apache provides support for large numbers of Web sites on a single
machine. Virtual hosting is the subject of Chapter 6 and is of particular interest to anyone
who needs to host several Web sites on a single server. Many commercial Web hosting
services take full advantage of Apache’s low cost and strong support for virtual hosting.
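As a minimal sketch (the IP address, hostnames, and paths are hypothetical; Chapter 6 covers the directives in detail), name-based virtual hosting amounts to a few lines in httpd.conf:
NameVirtualHost 192.168.1.10

<VirtualHost 192.168.1.10>
    ServerName www.site-one.example
    DocumentRoot /home/httpd/site-one
</VirtualHost>

<VirtualHost 192.168.1.10>
    ServerName www.site-two.example
    DocumentRoot /home/httpd/site-two
</VirtualHost>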
Dynamic Shared Objects Apache also supports Dynamic Shared Objects (DSOs).
This permits loading of extension modules at runtime. Features can be added or removed
without recompiling the server engine. Throughout the book, when explaining how to
install a module, I will demonstrate how to compile it as a DSO and enable it for use when
Apache is started. There are a few modules that cannot be dynamically linked to Apache
and must be compiled into the Apache runtime, but not many. The DSO mechanism will
be preserved in future releases of Apache, and learning to compile and use DSO modules
is a critical skill for Apache administrators.
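As a sketch of what that looks like in practice (mod_example is a hypothetical module), the apxs utility compiles a module as a DSO, installs it, and activates it in one step:
# apxs -i -a -c mod_example.c
The -a option adds (or uncomments) a LoadModule line in httpd.conf, such as the following, so that Apache loads the module at startup:
LoadModule example_module libexec/mod_example.so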
Customizability Apache can be fully customized by writing modules using the Apache
module API. Currently, these can be written in C or Perl. The code to implement a min-
imal module is far smaller than one might think. Source code is completely available for
examination, or alteration. The Apache license permits almost any use, private or com-
mercial.
Another important feature is customizable logging, including the ability to write to mul-
tiple logs from different virtual servers. Apache logging is the subject of Chapter 12.
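A sketch of what that customization looks like (the log path and format nickname are arbitrary; Chapter 12 covers the details):
LogFormat "%h %l %u %t \"%r\" %>s %b" common
CustomLog /var/log/httpd/site-one_access_log common
Each <VirtualHost> block can contain its own CustomLog directive, giving every virtual server a separate log file.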
Also customizable in Apache are HTTP response headers for cache control and error
reporting to the client browser. See Chapter 13 on enhancing Apache performance for a
discussion of mod_headers.
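For instance (the header value shown is only an illustration), the Header directive provided by mod_headers can set or override a response header such as Cache-Control:
Header set Cache-Control "max-age=3600"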
Potential Use as a Caching Proxy Server Apache is not designed for general proxy
use, but by using a module called mod_proxy, you can make it a very efficient caching
proxy server. In other words, Apache can cache files received from remote servers and
serve them directly to clients who request these resources, without downloading them
again from the origin server. Caching for multiple clients (on a local area network, for
example) can greatly speed up Web retrieval for clients of the proxy server, and reduce the
traffic on an Internet connection. Chapter 13, “Enhancing the Performance of Apache,”
discusses the use of mod_proxy.
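A minimal sketch of the relevant httpd.conf lines (the cache path and sizes are arbitrary examples; Chapter 13 covers them properly) looks something like this:
ProxyRequests On
CacheRoot /var/cache/httpd
CacheSize 102400
CacheMaxExpire 24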
Security Apache’s security features are the subject of Chapters 14 and 15. They include
Further Benefits
None of the major features outlined for the current Apache release is unique to Apache.
The feature set alone, while impressive, is not enough to justify a decision to choose
Apache over other excellent alternatives. There are, however, other benefits to Apache.
Apache has been ranked (by Netcraft) the number one Web server on the Internet since
April 1996, and as this book goes to press, Apache powers an estimated 60% of all Web
sites reachable through the Internet. While its popularity alone doesn’t indicate its supe-
riority, it does say that a lot of successful, high-volume sites have been built using Apache.
That represents a huge vote of confidence in the software. It also means Apache is thor-
oughly tested. Its security, reliability, and overall performance are demonstrated, docu-
mented, and unquestionable.
Apache has unparalleled support from a tremendous group of individuals. Some are pro-
grammers; most are end users and administrators. For a software system as widely used
as Apache, regardless of the nature of your problem, the odds are that someone, some-
where has encountered it and can offer some insight into its resolution. While it might
seem logical to assume that support for no-cost software will necessarily be inferior to
that provided by commercial software vendors, I haven’t found that to be true at all. As
a professional network administrator, the most difficult problems I’ve had to solve were
nearly all related to commercial software (for which I usually paid dearly) and often
involved licensing servers and product keys. The usual answer from Tech Support is “you
need to upgrade to the next revision level.” Trust me, you won’t have these problems with
Apache.
Apache is under intense active development at all times, and yet many Web sites continue
to operate just fine with Apache engines many revisions behind the current release. I
believe it is the not-for-profit motivation of its developers that is responsible for this
degree of dependability in each revision. There is simply no reason for Apache devel-
opers to rush to market with incomplete, bug-ridden releases. The result is a tremendous
benefit to administrators who are already stressed trying to roll out product upgrades on
an almost continuous basis.
The most compelling reason to use the Apache Web server is that, by design, Apache is
highly configurable and extensible by virtue of its support for add-on modules. The
Apache Application Program Interface (API) gives programmers access to Apache data
structures and the ability to write routines to extend the Apache core functionality. It is
possible, of course, to write modifications to any server for which the source code is freely
available, but only Apache makes this easy with a well-documented API that doesn’t
require a module programmer to understand the Apache core source code. The upshot of
all of this is that there are a wide variety of third-party modules available for Apache.
You’ll learn about the most important of these in relevant chapters throughout this book.
From these modules, you can pick and choose the ones you need and forget the rest. Most
of the standard modules provided with the basic server as distributed by the Apache Soft-
ware Foundation are optional and can be removed from the server core if statically
linked, or simply not used if they are compiled separately as dynamically loadable mod-
ules. It’s a great alternative to programs bloated with functions that are never used.
is not blocked waiting for connections but can be performing other tasks rather than sitting
in many of the super-fast servers. A number of criteria should be used to determine the
applicability of Web server software to the needs of the business, and speed is only one
of these.
to the Apache programmer, the implications of this change directly affect all Apache
In Sum
In this chapter, we looked at what Web server software powers the Internet and deter-
mined that 60 percent of all Internet-accessible Web servers are running Apache. Only on
the very largest Internet sites does Apache yield prominence to commercial engines, for
reasons that probably have less to do with the suitability of Apache than with the fact that
many large firms are still reluctant to rely on open-source software (an attitude that is rap-
idly eroding). The major Web servers that compete with Apache have some strong fea-
tures, but a comparison of feature sets shows why Apache is dominant.
These first two chapters have served as an extended introduction to Apache and its foun-
dations. Beginning in the next chapter, we’ll (metaphorically) roll up our sleeves and start
getting our fingernails dirty—that is, we’ll install the server on a Linux system. Then, in
succeeding chapters, we’ll move on to various aspects of configuring Apache.
Linux Library
Part 2 Essential Configuration
Featuring:
■ Downloading, compiling, and installing Apache from source code
■ Installing precompiled Apache binary files
■ The role of Apache directives in the httpd.conf file
■ General server directives
■ Container directives
■ Setting up user home directories
■ How modules work
■ Linking modules statically or as dynamic shared objects
■ Using apxs
■ IP-based virtual hosting
■ Name-based virtual hosting
■ Virtual hosting guidelines
Installing Apache
3
The previous two chapters presented an overview of the Web and its history, and
they introduced Apache as well as other Web servers commonly used on the Internet. The
topics of installing, configuring, and administering Apache begin here, in this chapter.
One of the important things to realize about installing Apache is that there are two com-
pletely different ways to do it. You can choose to download the source code and compile
it on your own machine, or you can take the easier route and download binary files that
have already been compiled for your machine and operating system.
Both methods of installation have merit, and both are discussed in this chapter, with step-
by-step examples of the procedures that you should use on your own Linux system. The
installation of a basic Apache server is a straightforward process. Follow the instructions
in this chapter, regardless of which method of installation you choose, and soon you’ll have
a working Apache server, ready to configure.
Because the Apache source code is freely available, anyone with the necessary programming
skills can modify it to customize the code. The vast majority of us, however, don’t write customized Apache
code. Instead, we benefit from the code improvements made by others.
Compiling Apache from the source code makes it possible to add user-written modifica-
tions (or patches) to the code. Patches are essentially files that contain changes to a source
code base and are usually created by “diffing” modified source to the original; in other
words, comparing the modified and original source files and saving the differences in a
file distributed as a patch. Another user acquires the patch, applies it to the same source
code base to reproduce the modifications, and then compiles the altered source.
Patches make it possible for nonprogrammers to make (often quite sophisticated) changes
to source code and then compile it themselves. Without the ability to patch the source and
compile it yourself, you need to search for precompiled binaries that already include the
necessary patches. Depending on your particular platform, it might be difficult to locate
binaries that include the patches you require.
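As a quick, hedged illustration of the mechanics (the file and directory names here are
hypothetical, but diff and patch are standard utilities on every Linux system), a patch is
typically created by comparing a modified source tree against a pristine copy, and applied
by feeding the resulting file to patch from inside the tree to be modified:
# diff -ru apache_1.3.12.orig apache_1.3.12 > my_changes.patch
# cd apache_1.3.12
# patch -p1 < ../my_changes.patch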
Another reason to compile from source code is that it allows you to take advantage of
compiler optimizations for your hardware platform and operating system. This consid-
eration is by no means as important as it once was, because chances are you can easily
find binaries for your particular system. Figure 3.1 shows the binary distributions of
Apache available from the Apache Project Web site for a variety of platforms. In the
unlikely circumstance that your operating system is missing from this list, you can always
download and compile the Apache source yourself.
It is not necessary to compile source code on your own hardware to optimize the resulting
binary. Most binaries are already optimized for a given type of hardware. For example, to run
on an Intel 486 or Pentium system, download an i386 binary, or an i686 binary for the Pen-
tium II or Pentium III processor. A compiler designed to optimize code to run on an Intel pro-
cessor was probably used to create the binary. It is unlikely that your compiler will produce
code that performs significantly better. Some companies offer Linux distributions that are
optimized for performance on Pentium-class Intel processors (Mandrake Linux is one such
distribution: www.linux-mandrake.com). If the fastest possible system performance is your
goal, you should consider such a Linux distribution teamed with more or faster hardware.
One word of warning about using binaries is in order. Often, the available binaries lag
behind new releases. If you want to stay on the “bleeding edge” of changes, you must use
source code distributions, which is not always the best decision for production servers.
In sum:
■ Use an Apache binary distribution when you need a basic Apache server with the
Apache modules included in that distribution. All standard Apache modules are
included with these binary distributions, compiled separately from the server as
DSO modules. You can pick and choose the ones you want, using only those that
you require, and disabling the others to conserve the memory required to run
Apache. If all the functionality you require is available in the set of standard
Apache modules, and your operating system is supported, you have nothing to
lose by installing one of these. Even if you require a few modules not included
with the binary distribution, most of these are easily compiled separately from the
Apache server itself, without requiring the Apache source. A few, however,
require that the Apache source be patched, and will require that you have the
source code available on your system. A good example is mod_ssl, which is dis-
cussed in Chapter 15. It is impossible to install these modules without the Apache
source code; you won’t find them in an Apache binary distribution.
■ Compile the Apache server source code whenever you need functionality that
requires patching the original source code (Secure Sockets Layer, or SSL, is an
example of such a module or server extension). You will also need to compile the
Apache source if you intend to write your own modules.
Figure 3.1 Apache binary distributions
If you can work with precompiled binaries, feel free to skip the material on compiling
Apache. It will always be here if you need it in the future. If you have decided to compile
the Apache source code, take a look at the next section; otherwise, you can jump ahead
to the “Installing the Apache Binaries” section.
Change directory to the location where you intend to unpack the Apache source code
and compile the server. A common location for source code on Linux systems is the
/usr/local/src directory, and that’s a pretty logical choice. If you want to place the
Apache source in a subdirectory of /usr/local/src, do the following:
# cd /usr/local/src
From this directory, invoke the Linux tar utility to decompress the archive and extract
the files. Tar will automatically create the necessary directories. When the operation is
finished, you will have the Apache source saved in the directory /usr/local/src/
apache_1.3.12:
# tar xvzf /home/caulds/apache_1.3.12.tar.gz
apache_1.3.12/
apache_1.3.12/src/
apache_1.3.12/src/ap/
apache_1.3.12/src/ap/.indent.pro
apache_1.3.12/src/ap/Makefile.tmpl
apache_1.3.12/src/ap/ap.dsp
apache_1.3.12/src/ap/ap.mak
… many files extracted
Compiling Apache
Old (pre-1.3) versions of Apache could only be compiled the old-fashioned way: by man-
ually editing the Configuration.tmpl file, running the ./configure command, and then
running the make utility. An editor was used to customize the compiler flags (EXTRA_CFLAGS,
LIBS, LDFLAGS, INCLUDES) stored in the template as needed for a given
system. Thank goodness there is now a better way.
All recent versions of Apache include the APACI configuration utility. Although some
administrators insist that configuring the Apache compilation manually gives them better
control over the compiler switches and installation options, I disagree. APACI is the instal-
lation method preferred by the Apache development team; it is the easiest way to compile
Apache, and it is the best way to maintain your Apache source code, especially if you’ve
altered it by applying source patches and a number of third-party modules (Chapter 5). It
is probably best to learn only one way to configure Apache compilation options, and if
you’re going to learn only one method, make it APACI.
Using APACI
With Apache version 1.3, a new configuration module was introduced with the Apache
source distribution. The APache AutoConf-style Interface (APACI) is a configuration
utility similar to the GNU Autoconf package, although it is not based on that popular
GNU utility. APACI provides an easy way to configure the Apache source prior to com-
pilation in order to specify certain compiler options and the inclusion (or exclusion) of
Apache modules. Like GNU Autoconf, APACI also performs a number of tests in order
to ascertain details about your system hardware and operating system that are relevant to
the Apache source compilation.
APACI does not compile the Apache source; its purpose is to create the files that specify
how that compilation is performed. Its most important task is to create the makefiles that
are used by the Linux make utility to direct the C compiler how to proceed, and also where
to place the compiled programs when make is instructed to perform an install.
The Apache source code is written in C language compliant with the specifications
codified by the American National Standards Institute, or ANSI-C. For that reason,
you will need an ANSI-C–compliant compiler to complete the install. This is not a big
deal, because your Linux distribution includes the GNU C compiler (gcc), which is the
ANSI-C compiler recommended by the Apache Software Foundation. If APACI is
unable to locate a suitable compiler, you will be notified, and the configuration will
abort. You can then install gcc from your Linux CD-ROM or from www.gnu.org. The
Free Software Foundation makes binary distributions available for Linux and a large
number of Unix platforms or you can download and compile the source code yourself,
although compiling gcc can turn into a time-consuming exercise. Binary distributions
of gcc are well optimized so it is unlikely that you can build a more efficient C compiler.
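If you aren’t sure whether gcc is already installed, one quick check from the shell (the
version details reported will vary with your distribution) is:
# gcc -v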
If configure encounters problems, it notifies you and suggests
how to correct them. On most systems running a fairly recent version of
Linux, this will not occur. Once configure determines that it can build Apache on your
system, it then identifies the best possible combination of options for that system. The
information it gathers and the decisions it makes about configuring Apache for your
system are written into a special file that you’ll find stored in src/Configuration.apaci.
In this file it stores information specific to your system (including build options you
specify to configure).
The last step that the configure script takes is to run a second script, which you’ll find
as src/Configure. This script takes the information from src/Configuration.apaci
and uses it to create a set of files that control the actual compilation and installation of
Apache (using the make utility on your Linux system). You’ll find these makefiles created
in a number of the Apache source directories.
You will usually run configure with a number of options (command-line arguments) to
customize your Apache configuration. In fact, if you run configure with no command-
line arguments, it will report, “Warning: Configuring Apache with default settings. This
is probably not what you really want,” and it probably isn’t. The next few sections will
show you how to specify additional options to configure, or override its default values.
This is a procedure you’ll return to many times, whenever you need to alter your Apache con-
figuration or change its functionality by adding new modules. The following configure
statement configures the Apache 1.3.12 source for compilation. Note that this is a single Linux
command; the backslash (\) character is used to continue the command on a new
line. It’s a handy trick for manually entering long command lines, and can also be used to
improve the readability of shell script files.
# ./configure --prefix=/usr/local/apache \
> --enable-module=most \
> --disable-module=auth_dbm \
> --enable-shared=max
The --prefix argument in the example above tells Apache to install itself in the directory
/usr/local/apache. (This is the default installation location for Apache, so in this case
the option is unnecessary.) However, there are many times you may want to install into
an alternate directory—for example, if you want to install a second Apache ver-
sion alongside one that already exists (I have five versions of Apache on my server for
testing purposes). Another reason you may want to install Apache into an alternate direc-
tory is to preserve the default locations used by a Linux distribution. For example, assume
the version of Apache that comes with your Linux distribution is installed in /etc/apache
instead of the default /usr/local/apache directory. Use --prefix to install Apache in
the /etc/apache directory. (For standard file location layouts, see the discussion on the
config.layout file below.)
Linux systems can use --enable-module=all to enable all modules in the standard dis-
tribution. The --enable-module=most option enables all the standard modules in the
Apache distribution that are usable on all platforms supported by Apache. Table 3.1 lists
the modules that are not installed when you specify --enable-module=most, along with the
reason they are not used. Red Hat Linux 7.0 users will not be able to compile Apache with
mod_auth_dbm and should use the --disable-module=auth_dbm directive to disable use
of that module. Users of other Linux distributions (or earlier Red Hat distributions) who
wish to use the module can omit the directive. The mod_auth_dbm module is discussed in
detail in Chapter 14. Table 3.2 later in this chapter lists all of the standard modules
included in the 1.3.12 release of Apache.
mod_example This module is only for programmers and isn’t required on production servers.
The extension of Apache Server through the use of modules has always been part of
its design, but it wasn’t until release 1.3 that Apache supported dynamic loadable
modules. These dynamic shared objects are available in Apache on Linux and other
operating systems that support the necessary system functions for a program to load
a module into its address space with a system call. This is similar to the way dynamic
link library (or DLL) files work in Microsoft Windows; in fact, DLLs are used to provide
this functionality in the Windows version of Apache.
The use of DSO modules in Apache has several advantages. First, the server can be
far more flexible because modules can be enabled or disabled at runtime, without the
need to relink the Apache kernel. The exclusion of unnecessary modules reduces the
size of the Apache executable, which can be a factor when many server instances are
run in a limited memory space.
On Linux systems, the only significant disadvantage to the use of DSO modules is
that the server is approximately 20 percent slower to load at startup time, because of
the system overhead of resolving the symbol table for the dynamic links. This is not
generally a factor unless Apache is run in inetd mode (see Chapter 4), where a new
instance of httpd is spawned to handle each incoming client connection.
In most cases, Linux administrators should build their Apache server to make maxi-
mum use of DSO modules.
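To make that runtime flexibility concrete, a module built as a DSO is enabled in
httpd.conf with a pair of lines like the following sketch (mod_rewrite is used here only
as an example); comment the lines out and restart the server, and the module is no
longer loaded:
LoadModule rewrite_module libexec/mod_rewrite.so
AddModule mod_rewrite.c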
# ./configure --prefix=/usr/local/apache \
> --enable-module=most \
> --disable-module=auth_dbm \
> --enable-shared=max
Configuring for Apache, Version 1.3.9
+ using installation path layout: Apache (config.layout)
Creating Makefile
Creating Configuration.apaci in src
+ enabling mod_so for DSO support
Creating Makefile in src
+ configured for Linux platform
+ setting C compiler to gcc
+ setting C pre-processor to gcc -E
+ checking for system header files
+ adding selected modules
o rewrite_module uses ConfigStart/End
+ using -lndbm for DBM support
enabling DBM support for mod_rewrite
o dbm_auth_module uses ConfigStart/End
The configure script essentially creates a set of instructions to the compiler for compiling
the source files into a working system. It uses information you provide, along with other
information about the capabilities of your system, such as what function libraries are
available. The result is primarily a set of makefiles, which instruct the Linux make utility
how to compile source files, link them to required function libraries, and install them in
their proper locations.
# cat config.status
#!/bin/sh
##
## config.status -- APACI auto-generated configuration restore script
##
## Use this shell script to re-run the APACI configure script for
## restoring your configuration. Additional parameters can be supplied.
##
SSL_BASE="/usr/local/src/openssl-0.9.5" \
./configure \
"--with-layout=Apache" \
"--prefix=/usr/local/apache" \
"--enable-module=most" \
"--disable-module=auth_dbm" \
"--enable-module=ssl" \
"--activate-module=src/modules/extra/mod_define.c" \
"--enable-shared=max" \
"$@"
There are a few lines here that have been added since I showed the minimal set of options
required to compile a full working Apache server. The SSL_BASE line, which actually pre-
cedes the invocation of the configure utility, sets an environment variable that points to the
OpenSSL source. This environment variable will be used later by the Secure Sockets Layer
(SSL) module, which is enabled by the line --enable-module=ssl. This will be covered in
full detail in Chapter 15. The --activate-module line is used to compile a third-party
module and statically link it into Apache from a source file previously placed in the location
designated for these “extra” modules. You can also use another option, --add-module, to
copy a module source file into this directory before compiling and statically linking it to the
server. This option saves you only the copy step, however, so it isn’t terribly useful:
--add-module=/home/caulds/mod_include/mod_include.c
A great benefit of the config.status file is that it saves your hard-won knowledge.
You can rerun the last configure command at any time, simply by ensuring that this file
is executable by its owner (probably root), and invoking it as follows:
# chmod u+x config.status
# ./config.status
Although the config.status file contains many lines, all of them (except for comments
and the last line) end in a backslash character, which indicates that the lines should be
concatenated and passed as a single command to the shell interpreter. The last line, “$@”,
concatenates to the end of the command line any argument passed to config.status
when it is executed. You might run config.status, for example, with an additional
option:
# ./config.status "--activate-module=src/modules/auth_mysql/libauth_mysql.a"
In this case, the portion of the command line enclosed in quotes is substituted for $@ in
the config.status script and concatenated to the command line passed to /bin/sh for
processing.
You can modify config.status and rerun it to add, remove, or change the order of the
arguments. This order is often significant. For example, I discovered that, to use the
--enable-shared option (which specifies compilation of modules as Dynamic Shared
Objects), you must include this option after all --enable-module and --activate-module
arguments. I learned this the hard way. But once I did learn how to do it right, I had the
config.status file to retain that information for later use. Unfortunately, determining the
precedence of configure options is largely a matter of trial and error.
I prefer to copy the config.status file to another filename. This ensures that the file I use
to configure Apache won’t be accidentally overwritten if I choose to run configure to test
other options. After running configure, you may wish to do something like the following:
# cp config.status build.sh
# chmod u+x build.sh
This creates a brand-new file (a shell script), named build.sh, which can be edited and
then executed to reconfigure Apache. I have used the same build.sh over and over again
during the course of writing this book, with several versions of Apache, modifying it as
needed to enable or disable modules or install locations.
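Once build.sh exists, my usual rebuild cycle is nothing more than the following three
commands (shown here as a sketch; the make steps are covered later in this chapter):
# ./build.sh
# make
# make install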
sysconfdir: $prefix/conf
datadir: $prefix
iconsdir: $datadir/icons
htdocsdir: $datadir/htdocs
cgidir: $datadir/cgi-bin
includedir: $prefix/include
localstatedir: $prefix
runtimedir: $localstatedir/logs
logfiledir: $localstatedir/logs
proxycachedir: $localstatedir/proxy
</Layout>
Each line of config.layout defines a directory pathname. Some of the paths are derived
from others previously defined in the file. You might note from this layout that all the
paths are derived from the one identified as prefix. Therefore, simply by running
configure with the --prefix argument to change this location, you automatically
change all of the default paths for the Apache installation.
You can specify a named layout when running configure by using the --with-layout
argument. For example, if you chose to use the same file locations that Red Hat Linux
uses, run configure with the --with-layout=RedHat argument:
# ./configure --with-layout=RedHat
The following example uses path variables as configure arguments to install all of
Apache’s user executables in /usr/bin and all system executables in /usr/sbin, which is
where the Red Hat layout puts them. All other layout options are read from the Apache
layout in config.layout. The following command accomplishes the same thing as the
custom layout shown later, in Listing 3.5:
# ./configure --bindir=/usr/bin --sbindir=/usr/sbin
For those readers who are using the Red Hat Linux distribution, the Apache that is pro-
vided as a Red Hat Package (RPM) uses a layout that looks like this:
# RedHat 5.x layout
<Layout RedHat>
prefix: /usr
exec_prefix: $prefix
bindir: $prefix/bin
sbindir: $prefix/sbin
libexecdir: $prefix/lib/apache
mandir: $prefix/man
sysconfdir: /etc/httpd/conf
datadir: /home/httpd
iconsdir: $datadir/icons
htdocsdir: $datadir/html
cgidir: $datadir/cgi-bin
includedir: $prefix/include/apache
localstatedir: /var
runtimedir: $localstatedir/run
logfiledir: $localstatedir/log/httpd
proxycachedir: $localstatedir/cache/httpd
</Layout>
Note that the Red Hat layout not only changes the Apache prefix variable (which moves
every path derived from it) but also sets several paths to absolute locations outside the
prefix entirely. The Red Hat layout actually tries to put files into
more standard directories. Rather than storing Apache binaries in a special directory (like
/usr/local/apache/bin), Red Hat places them in the Linux directories that are actually
reserved for them, /usr/bin and /usr/sbin. Likewise, Red Hat prefers to keep Apache
configuration files under /etc, a directory in which you’ll find configuration files for a
large number of other Linux utilities, such as FTP, DNS, sendmail, and others.
To see the full set of paths that a given layout will produce, run configure with the
--show-layout argument:
# ./configure --show-layout
Configuring for Apache, Version 1.3.9
+ using installation path layout: Apache (config.layout)
Installation paths:
prefix: /usr/local/apache
exec_prefix: /usr/local/apache
bindir: /usr/local/apache/bin
sbindir: /usr/local/apache/bin
libexecdir: /usr/local/apache/libexec
mandir: /usr/local/apache/man
sysconfdir: /usr/local/apache/conf
datadir: /usr/local/apache
iconsdir: /usr/local/apache/icons
htdocsdir: /usr/local/apache/htdocs
cgidir: /usr/local/apache/cgi-bin
includedir: /usr/local/apache/include
localstatedir: /usr/local/apache
runtimedir: /usr/local/apache/logs
logfiledir: /usr/local/apache/logs
proxycachedir: /usr/local/apache/proxy
Compilation paths:
HTTPD_ROOT: /usr/local/apache
SHARED_CORE_DIR: /usr/local/apache/libexec
DEFAULT_PIDLOG: logs/httpd.pid
DEFAULT_SCOREBOARD: logs/httpd.scoreboard
DEFAULT_LOCKFILE: logs/httpd.lock
DEFAULT_XFERLOG: logs/access_log
DEFAULT_ERRORLOG: logs/error_log
TYPES_CONFIG_FILE: conf/mime.types
SERVER_CONFIG_FILE: conf/httpd.conf
ACCESS_CONFIG_FILE: conf/access.conf
RESOURCE_CONFIG_FILE: conf/srm.conf
SSL_CERTIFICATE_FILE: conf/ssl.crt/server.crt
At the very least, --show-layout is a convenient way to find out where Apache puts all the
files. Because it expands variables, it is more readable than looking directly in the file.
Whether you should edit the default layout in config.layout to modify those settings directly is debatable. Many adminis-
trators consider it safer to use the information displayed by --show-layout to build a new
layout as described in the next section. But whether you work with the default layout or a
copy, editing a named layout has the advantage that you can change the default path values
that configure uses without specifying your changes as arguments to the configure com-
mand. All the settings are visible in one place, and since you modify only those you want
to change in a given layout, you don’t have to do a lot of work in most cases.
As noted, many administrators consider it inherently risky to edit the default layout
directly; they prefer to leave the original layout values intact and work on a copy. Apache’s
use of named layouts makes it easy to follow this approach. You might add a layout to
config.layout like the one shown in Listing 3.5.
Listing 3.5 A Custom Path Layout
<Layout MyLayout>
prefix: /usr/local/apache
exec_prefix: $prefix
# Use all Apache layout options,
# but install user and system
# executables as Red Hat does
bindir: /usr/bin
sbindir: /usr/sbin
# end of changes from Apache layout
libexecdir: $exec_prefix/libexec
mandir: $prefix/man
sysconfdir: $prefix/conf
datadir: $prefix
iconsdir: $datadir/icons
htdocsdir: $datadir/htdocs
cgidir: $datadir/cgi-bin
includedir: $prefix/include
localstatedir: $prefix
runtimedir: $localstatedir/logs
logfiledir: $localstatedir/logs
proxycachedir: $localstatedir/proxy
</Layout>
To use the new custom layout, run configure with the --with-layout argument, and
compile:
# ./configure --with-layout=MyLayout
Making Apache
Upon completion of the configuration phase, you have constructed a set of makefiles in
various places within your Apache source tree. The make command is used to begin the
actual compilation phase of the install:
# make
===> src
make[1]: Entering directory `/usr/local/src/apache_1.3.9'
make[2]: Entering directory `/usr/local/src/apache_1.3.9/src'
The final step of the install is to call make again, this time with the install argument,
which moves all the compiled binaries and support files to their default locations (or loca-
tions you specified in the configuration step above). Most files are copied into directories
relative to the Apache root directory that you specified with the --prefix argument:
# make install
make[1]: Entering directory `/usr/local/src/apache_1.3.9'
===> [mktree: Creating Apache installation tree]
-- Lines deleted --
+--------------------------------------------------------+
| You now have successfully built and installed the |
| Apache 1.3 HTTP server. To verify that Apache actually |
| works correctly you now should first check the |
| (initially created or preserved) configuration files |
| |
| /usr/local/apache/conf/httpd.conf |
| |
| and then you should be able to immediately fire up |
| Apache the first time by running: |
| |
| /usr/local/apache/bin/apachectl start |
| |
| Thanks for using Apache. The Apache Group |
| http://www.apache.org/ |
+--------------------------------------------------------+
With the appearance of the message above, you have installed an Apache system that
should run after making a few simple changes to its configuration file.
An optional step you can take is to reduce the size of the Apache executable using the
Linux strip command. This command removes symbolic information that is used only
by debuggers and other developer tools. For a production version of the Apache kernel,
this information can be stripped to reduce the memory required by the server. The reduc-
tion is slight, but if you are running a number of Apache processes, the savings do add up.
Running strip on a freshly compiled Apache 1.3.9 executable reduced its size by about
14 percent. Be aware that once you strip symbol information from a binary file, you can
no longer run debugging tools if you have problems running that program.
# ls -al httpd
-rwxr-xr-x 1 root root 461015 Dec 6 11:23 httpd
# strip httpd
# ls -al httpd
-rwxr-xr-x 1 root root 395516 Dec 6 11:46 httpd
If you compiled Apache from source code, feel free to skip down to the section “Running
the Server.” That’s where you’ll learn to start the server.
Table 3.3 Apache Modules Provided with the Red Hat RPM and with the Apache Binary
Distribution
libproxy.so X X
mod_access X X
mod_actions X X
mod_alias X X
mod_asis X X
mod_auth X X
mod_auth_anon X X
mod_auth_db X
mod_auth_dbm X
mod_auth_digest X
mod_autoindex X X
mod_bandwidth X
mod_cern_meta X X
mod_cgi X X
mod_digest X X
mod_dir X X
mod_env X X
mod_example X
mod_expires X X
mod_headers X X
mod_imap X X
mod_include X X
mod_info X X
mod_log_agent X
mod_log_config X X
mod_log_referer X
mod_mime X X
mod_mime_magic X X
mod_mmap_static X
mod_negotiation X X
mod_put X
mod_rewrite X X
mod_setenvif X X
mod_speling X X
mod_status X X
mod_throttle X
mod_unique_id X X
mod_userdir X X
mod_usertrack X X
mod_vhost_alias X X
* installed from separate RPMs by the Red Hat Linux installation program
The Red Hat Package Manager (RPM) can manage all the packages installed on your system;
it can use newer packages to upgrade
those already installed, it can cleanly uninstall packages, and it can even verify the installed
files against the RPM database. Verification is useful because it detects changes that might
have been made accidentally or deliberately by an intruder.
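A few representative rpm commands illustrate these capabilities; the package name shown
is the one used by the Red Hat Apache RPM, so substitute whatever rpm -qa reports on
your system:
# rpm -qi apache
# rpm -ql apache
# rpm -V apache
(The -qi argument displays package information, -ql lists the files the package installed,
and -V verifies those files against the RPM database.)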
True to the spirit of open-source software, Red Hat donated RPM to the public domain,
and many other Linux distributions have the ability to load RPM files. Red Hat, SuSE,
Mandrake, TurboLinux, and Caldera OpenLinux are all “RPM Linux Distributions.”
Although other package managers exist for Linux, RPM is the most widely used, and
more packages are available as RPMs than any other format.
NOTE If your Linux doesn’t support RPM, you can add that support yourself. In
keeping with the spirit of open source, and as a way of encouraging other Linux
developers to use the RPM package manager, the source is no longer distributed
by Red Hat (although you may be able to find it on their Web site). The source files
are available from the www.rpm.org FTP server, ftp://ftp.rpm.org/pub/rpm.
This site also contains a wealth of information about using the Red Hat Package
manager. You’ll find not only the source for all versions of RPM ever released, but
also precompiled binary distributions for Intel 386 and Sparc platforms. For most
versions of Linux, adding your own RPM support will not be necessary.
The best source for RPMs that I’ve ever found is the RPM repository on rpmfind.net
(http://rufus.w3.org/linux/RPM). Figure 3.3 illustrates the RPM site after we’ve chosen
the option to view the index by name. There are numerous packages for Apache 1.3.12,
so to make a choice we need more information about them. Figure 3.4 shows the detailed
display for apache-1_3_12-2_i386.rpm.
Before installing the Apache 1.3.12 RPM on my Linux system, I removed the existing Apache
RPM that was installed when I loaded the Red Hat distribution. Run rpm with the -qa argu-
ment, which says “query all installed packages,” to determine which Apache RPMs are
installed. Pipe the output to grep to display only those lines containing the string apache:
# rpm -qa |grep apache
apache-1.3.6-7
apache-devel-1.3.6-7
The -e argument to rpm erases an RPM package. It removes all files installed with the
RPM package, unless those files have been modified. Uninstalling an RPM also removes
all directories created when installing the RPM, unless those directories are not empty
after the RPM files are removed.
In this example, removing the installed RPMs failed. The error warns that other packages
were installed after, and are dependent on, the Apache RPM:
# rpm -e apache-1.3.6-7
error: removing these packages would break dependencies:
webserver is needed by mod_perl-1.19-2
webserver is needed by mod_php-2.0.1-9
webserver is needed by mod_php3-3.0.7-4
To remove the Apache-1.3.6-7 RPM, it is necessary to first remove the three RPMs listed
as dependent on that RPM, which I did with the following commands (if the package
removal happens without error, the rpm command returns no output):
# rpm -e mod_perl-1.19-2
# rpm -e mod_php-2.0.1-9
# rpm -e mod_php3-3.0.7-4
# rpm -e apache-1.3.6-7
Once all the RPMs are removed, install the new Apache RPM using rpm with the -i argu-
ment in the following manner:
# ls -al apache*.rpm
-rw-r--r-- 1 caulds caulds 833084 Jan 17 09:41 apache-1_3_12-2_i386.rpm
# rpm -i apache-1_3_12-2_i386.rpm
This RPM is designed to install Apache in the /home/httpd and /etc/httpd directories,
which is where you’ll find it on standard Red Hat systems. The RPM installs all the
required configuration files, with values that allow the server to start:
# cd /home/httpd
# ls
cgi-bin html icons
The RPM even provides a default HTML page in the default DocumentRoot directory
(/home/httpd/html). This page allows your server to be accessed immediately after
installation:
# ls html
index.html manual poweredby.gif
A listing of the /home/httpd/html directory shows two files and a subdirectory. The
index.html file contains the HTML page the newly installed server will display by
default; it is a special filename used to indicate the default HTML page in a directory. The
poweredby.gif file is a graphic the server displays on the default page. The directory
manual contains HTML documentation for the new Apache server. Access the manual
from a Web browser using http://localhost/manual.
The Apache configuration files, logs, and loadable modules are all found elsewhere on the
file system (in /etc/httpd):
# cd /etc/httpd
# ls
conf logs modules
# ls conf
access.conf httpd.conf magic srm.conf
# ls modules
httpd.exp mod_bandwidth.so mod_include.so mod_setenvif.so
libproxy.so mod_cern_meta.so mod_info.so mod_speling.so
mod_access.so mod_cgi.so mod_log_agent.so mod_status.so
mod_actions.so mod_digest.so mod_log_config.so mod_unique_id.so
mod_alias.so mod_dir.so mod_log_referer.so mod_userdir.so
mod_asis.so mod_env.so mod_mime.so mod_usertrack.so
mod_auth.so mod_example.so mod_mime_magic.so mod_vhost_alias.so
mod_auth_anon.so mod_expires.so mod_mmap_static.so
mod_auth_db.so mod_headers.so mod_negotiation.so
mod_autoindex.so mod_imap.so mod_rewrite.so
The RPM also writes the Apache executable httpd into a directory reserved for system
executable binaries:
# ls -al /usr/sbin/httpd
-rwxr-xr-x 1 root root 282324 Sep 21 09:46 /usr/sbin/httpd
Binary Distributions
The last means of installing Apache is almost as easy as the RPM method. Binary distri-
butions of Apache, compiled for a large number of operating systems and hardware plat-
forms, are available from the Apache Software Foundation and can be downloaded from
www.apache.org/dist/binaries/linux. You may need to look elsewhere if your hard-
ware or OS is quite old (an old Linux kernel on a 486, for example). The page listing
Apache for Linux distributions is shown in Figure 3.5.
When downloading binary distributions for Intel microprocessors, you need to make sure
you download a version that was compiled to run on your specific processor family. For
example, the i686 family includes the Pentium II, PII Xeon, Pentium III and PIII Xeon, as
well as the Celeron processors. The i586 family includes the Pentium and Pentium with
MMX CPUs, and i386 generally indicates the 80386 and 80486 families. A binary compiled for the
i386 family will run on any of the processors mentioned above, including the latest Pen-
tium CPUs, but it will not be as fast as code compiled specifically for a processor gener-
ation. If you are downloading a binary distribution for a Pentium II or Pentium III, look
for an i686 distribution; if you are downloading for an 80486, you must get the i386
binaries.
There is a handy Linux utility that will query the system’s processor and return its hard-
ware family type. Enter /bin/uname -m to obtain this information (the m is for machine
type). When run on my server machine, which has an old Pentium 200 MMX chip, I got
this result:
# uname -m
i586
NOTE For every binary package on the Web site, there is a README file to
accompany it. You can view or download this file for information about the binary
distribution; for example, who compiled it and when, as well as what compiler
options and default locations for files were built into the Apache executable.
2. Make sure you are in the directory where you downloaded the binary distribution
(or move the downloaded file elsewhere and change to that directory). After the
installation process is complete, you will probably want to delete the directory
that was created to hold the installation files. All the files you need to run Apache
from the binary are moved from that directory to their intended locations:
# cd /home/caulds
# pwd
/home/caulds
# ls apache*
apache_1_3_9-i686-whatever-linux2_tar.gz
3. Uncompress and extract the distribution with tar to create a new directory tree
containing all the files from the distribution:
# tar xvzf apache_1_3_12-i686-whatever-linux2_tar.gz
# ls bindist
bin cgi-bin conf htdocs icons include libexec logs man proxy
# ls bindist/bin
ab apxs htdigest httpd rotatelogs
apachectl dbmmanage htpasswd logresolve
5. The binary distribution includes a shell script for installing the files in their proper
locations (the locations that the Apache daemon expects to find them). Run the
shell script as follows to create the Apache folders. After it runs, you should find
everything neatly installed under /usr/local/apache:
# ./install-bindist.sh
Installing binary distribution for platform i686-whatever-linux2
into directory /usr/local/apache ...
[Preserving existing configuration files.]
[Preserving existing htdocs directory.]
Ready.
+--------------------------------------------------------+
| You now have successfully installed the Apache 1.3.12 |
| HTTP server. To verify that Apache actually works |
| correctly you should first check the (initially |
| created or preserved) configuration files: |
| |
| /usr/local/apache/conf/httpd.conf |
| |
| You should then be able to immediately fire up |
| Apache the first time by running: |
| |
| /usr/local/apache/bin/apachectl start |
| |
| Thanks for using Apache. The Apache Group |
| http://www.apache.org/ |
+--------------------------------------------------------+
You can actually start the Apache server from the httpd file in the bin directory (the last
listing above), but it has been compiled with default values that will not allow it to
operate from this location. You can verify that it is operational, though, by entering a
command such as the following, which will cause httpd to start, display its version infor-
mation, and quit:
# ./bindist/bin/httpd -v
Server version: Apache/1.3.12 (Unix)
Server built: Feb 27 2000 19:52:12
Running httpd with the -V argument (note the capital V) goes a step further and also
displays the values that were compiled into the server; after the version lines, the output
continues with a series of -D entries like these:
-D HAVE_SHMGET
-D USE_SHMGET_SCOREBOARD
-D USE_MMAP_FILES
-D USE_FCNTL_SERIALIZED_ACCEPT
-D HTTPD_ROOT="/usr/local/apache1_3_12"
-D SUEXEC_BIN="/usr/local/apache1_3_12/bin/suexec"
-D DEFAULT_PIDLOG="logs/httpd.pid"
-D DEFAULT_SCOREBOARD="logs/httpd.scoreboard"
-D DEFAULT_LOCKFILE="logs/httpd.lock"
-D DEFAULT_XFERLOG="logs/access_log"
-D DEFAULT_ERRORLOG="logs/error_log"
-D TYPES_CONFIG_FILE="conf/mime.types"
-D SERVER_CONFIG_FILE="conf/httpd.conf"
-D ACCESS_CONFIG_FILE="conf/access.conf"
-D RESOURCE_CONFIG_FILE="conf/srm.conf"
The -l argument displays the modules that are compiled into httpd (also referred to
as statically linked). One module, httpd_core, is always statically linked into httpd. A
second module (the shared object module, mod_so) is statically linked when dynamic
loading of modules is required. For this server, all other modules are available to the
server only if dynamically loaded at runtime:
# /usr/local/apache/bin/httpd -l
Compiled-in modules:
http_core.c
mod_so.c
The -t option runs a syntax test on configuration files but does not start the server. This
test can be very useful, because it indicates the line number of any directive in the
httpd.conf file that is improperly specified:
# /usr/local/apache/bin/httpd -t
Syntax OK
Every configuration option for a basic Apache server is stored in a single file. On most stan-
dard Apache systems, you’ll find the configuration file stored as /usr/local/apache/
conf/httpd.conf. If you have Apache loaded from a Red Hat Linux distribution CD or an
RPM distribution, you’ll find the file in an alternate location preferred by Red Hat, /etc/
httpd/conf/httpd.conf. When Apache is compiled, this location is one of the config-
urable values that are hard-coded into it. Unless explicitly told to load its configuration
from another file or directory, Apache will attempt to load the file from its compiled-in path
and filename.
This compiled-in value can be overridden by invoking the Apache executable with the -f
option, as shown in Chapter 4. This can be handy for testing alternate configuration files,
or for running more than one server on the system, each of which loads its own unique
configuration.
Finally, you can run httpd with no arguments to start the server as a system daemon.
Some simple modifications will probably have to be made to the default httpd.conf pro-
vided when you install Apache, although only very minor changes are actually required
to start the server. In all likelihood, the first time you start Apache, you’ll receive some
error telling you the reason that Apache can’t be started. The most common error new
users see is this:
httpd: cannot determine local host name.
Use the ServerName directive to set it manually.
If you get an error when starting Apache the first time, don’t panic; it is almost always
fixed by making one or two very simple changes to Apache’s configuration file. In fact,
you should expect to make a few changes before running Apache. To do this, you’ll
modify Apache configuration directives, the subject of the next chapter. Chances are, the
directives you need to learn about and change are those covered in the “Defining the
Main Server Environment” section of Chapter 4. If your server won’t start, you need to
follow the instructions there.
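For the specific “cannot determine local host name” error shown above, the usual one-line
remedy is to add a ServerName directive to httpd.conf, substituting your machine’s fully
qualified domain name or IP address for the placeholder used in this sketch:
ServerName www.example.com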
If Apache finds an httpd.conf file that it can read for an acceptable initial configuration,
you will see no response, which is good news. To find out if the server is actually running,
attempt to connect to it using a Web browser. Your server should display a demo page
to let you know things are working. Figure 3.6 shows the demo page from a Red Hat
system.
You can also determine whether the server is running in a slightly more complicated way, by
using the Linux process status (or ps) command to look for the process in the Linux process
table, as shown below:
# ps -ef | grep httpd
root 8764 1 0 13:39 ? 00:00:00 ./httpd
nobody 8765 8764 0 13:39 ? 00:00:00 ./httpd
nobody 8766 8764 0 13:39 ? 00:00:00 ./httpd
nobody 8767 8764 0 13:39 ? 00:00:00 ./httpd
nobody 8768 8764 0 13:39 ? 00:00:00 ./httpd
nobody 8769 8764 0 13:39 ? 00:00:00 ./httpd
Figure 3.6 The demonstration Web page installed with the Apache RPM
This list is more interesting than it might appear at first. I used the e argument to ps to
display all system processes, the f argument to display the full output format, and then
grep to display only those lines containing the string httpd. Note that only one of the
httpd processes is owned by root (the user who started Apache); the next few httpd
processes in the list are all owned by nobody. This is as it should be. The first process is
the main server, which never responds to user requests. It was responsible for creating the
five child processes. Note from the third column of the output that all of these have the
main server process (denoted by a process ID of 8764) as their parent process. They were
all spawned by the main server, which changed their owner to the nobody account. It is
these processes that respond to user requests.
Stopping the Apache server is a bit more difficult. When you start Apache, it writes the
process ID (or PID) of the main server process into a text file where it can later be used
to identify that process and control it using Linux signals. By default, this file is named
httpd.pid, and is written in the logs directory under the server root. On my system:
# cat /usr/local/apache/logs/httpd.pid
8764
You’ll note that the number saved in the file is the PID of the main Apache server process
we saw in the process status listing earlier. To shut the server down, extract the contents
of the httpd.pid file and pass them to the kill command. This is the line that kills Apache:
# kill `cat /usr/local/apache/logs/httpd.pid`
Using Apachectl
Apache comes with a utility to perform the basic operations of controlling the server. This
utility, called apachectl, is actually a short shell script that resides in the bin directory
under ServerRoot. It does nothing more than simplify processes you can perform by
hand, and for that reason, doesn’t require a lot of explanation. All of the functionality
provided by apachectl is discussed in Chapter 11; for now, I’ll show you how to start
and stop Apache using this handy utility.
Start the server by invoking apachectl with the start argument. This is better than
simply running httpd, because the script first checks to see if Apache is already running,
and starts it only if it finds no running httpd process.
# /usr/local/apache/bin/apachectl start
/usr/local/apache/bin/apachectl start: httpd started
Stopping the server is when apachectl comes in really handy. Invoked with the stop
argument, apachectl locates the httpd.pid file, extracts the PID of the main server, and
then uses kill to stop the process (and all of its child processes). It is exactly what you
did earlier using ps and kill, but it is much easier. That’s what apachectl is, really, an
easy-to-use wrapper for shell commands.
# /usr/local/apache/bin/apachectl stop
/usr/local/apache/bin/apachectl stop: httpd stopped
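Although Chapter 11 covers apachectl in full, two other arguments are worth knowing
right away; both are part of the standard script. The configtest argument runs the same
syntax check as httpd -t, and restart tells a running server to reread its configuration
(or simply starts the server if it isn’t running):
# /usr/local/apache/bin/apachectl configtest
# /usr/local/apache/bin/apachectl restart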
If you want to install multiple versions of Apache, you should specify different values for --prefix
when running configure. When installing version 1.3.12, I instructed configure to place
it in a directory other than the default /usr/local/apache:
# configure --prefix=/usr/local/apache1_3_12
Now, the newly installed 1.3.12 version will have its own configuration file, its own
Apache daemon executable (httpd), and its own set of DSO modules.
And, if you want to run multiple copies of the same Apache server version, but with alter-
nate configurations, you can use the -f argument to httpd. This argument lets you choose
a configuration file that is read by the Apache daemon at startup and contains all the set-
tings that define the configuration for each particular server.
# httpd -f /usr/local/conf/test.conf
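A related trick for keeping alternate configurations in a single file is the <IfDefine>
container directive; a minimal sketch of such a container (enclosing, in this case, a
directive that makes the server listen on the standard SSL port) looks like this:
<IfDefine SSL>
Listen 443
</IfDefine>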
The directives in this container will be read only if the variable SSL is defined. In other
words, you want the server to listen for connections on TCP port 443 (the standard port
for SSL) only if you defined SSL when you started the daemon. Do this by invoking the
Apache daemon, httpd, with the -D argument, like so:
# /usr/local/apache/bin/httpd -D SSL
You can do the same to store alternate configurations in a single file, by setting your own
defines for the different blocks you want to be active.
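As an illustrative sketch (the symbol name here is arbitrary), you might wrap debugging-
related directives in their own container and activate them only when needed:
<IfDefine DEBUGCONF>
LogLevel debug
</IfDefine>
# /usr/local/apache/bin/httpd -D DEBUGCONF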
In Sum
This chapter has presented three methods of installing Apache:
1. From a package prepared for a Linux package manager, such as RPM
2. From the downloadable binaries available from the Apache Foundation
3. By compiling the binaries yourself from the source files
Although it’s more difficult and time-consuming, compiling from source is the way I
prefer to install Apache, because it permits the greatest flexibility in configuration or cus-
tomization. Complete instructions were given on how to obtain, compile, and install
Apache from source code. Many sites will prefer to install ready-made binaries, however,
and these offer the quickest and most painless way to install Apache and upgrade it when
the time comes. Full instructions on using the Apache Foundation’s binary archives and
third-party RPM packages were given. In the next chapter, I’ll describe the Apache con-
figuration file (httpd.conf) and the most important of the core directives that can be used
in that file to customize your Apache server. The core directives are always available in
every Apache server, and there is nothing in that chapter that does not apply to your
Apache server. It is probably the most important reading you’ll do in this book.
The Apache Core Directives
4
Every directive is associated with a specific module; the largest module is the core module,
which has special characteristics. This module cannot be unlinked from the Apache kernel
and cannot be disabled; the directives it supplies are always available on any Apache server.
All of the directives presented in this chapter are from the core Apache module, and all of
the most important directives from the core module are covered. The most important other
modules and their directives are covered in relevant chapters throughout the book. (For
example, mod_proxy and its directives are presented in Chapter 13’s discussion of using
Apache as a proxy server.) Apache’s online documentation includes a comprehensive ref-
erence to all the modules and directives; Appendix D shows how to make effective use of
this documentation.
The core module provides support for basic server operations, including options and
commands that control the operation of other modules. The Apache server with just the
core module isn’t capable of much at all. It will serve documents to requesting clients
(identifying all as having the content type defined by the DefaultType directive). While
all of the other modules can be considered optional, a useful Apache server will always
include at least a few of them. In fact, nearly all of the standard Apache modules are used
on most production Apache servers, and more than half are compiled into the server by
the default configuration.
In this chapter, we’ll see how directives are usually located in a single startup file
(httpd.conf). I’ll show how the applicability of directives is often confined to a specific
scope (by default, directives have a general server scope). Finally, I’ll show how directives
can be overridden on a directory-by-directory basis (using the .htaccess file).
Appendix A is a tabular list of all directives enabled by the standard set of Apache modules.
For each directive the table includes the context(s) in which the directive is permitted, the
Overrides statement that applies to it (if any), and the module required to implement the
directive. Appendix D is a detailed guide to using the excellent Apache help system, which
should be your first stop when you need to know exactly how a directive is used. In con-
figuring Apache, you will need to make frequent reference to these appendices.
Early versions of Apache read their configuration from three separate files:
■ The main server configuration file, httpd.conf
■ The resource configuration file, srm.conf
■ The access permissions configuration file, access.conf
The Apache Software Foundation decided to merge these into a single file, and in all current
releases of Apache, the only configuration file required is httpd.conf. Although there are
legitimate reasons to split the Apache configuration into multiple files (particularly when
hosting multiple virtual hosts), I find it very convenient to place all my configuration direc-
tives into a single file. It greatly simplifies creating backups, and maintaining revision his-
tories. It also makes it easy to describe your server configuration to a colleague—just e-mail
them a copy of your httpd.conf!
TIP To follow along with the descriptions in this chapter, you might find it
useful to open or print the httpd.conf on your system to use for reference. On
most systems, the file is stored as /usr/local/apache/conf/httpd.conf. If you have
Apache loaded from a Red Hat Linux distribution CD or a Red Hat Package Man-
ager (RPM) distribution, you’ll find the file as /etc/httpd/conf/httpd.conf.
Nearly everything you do to change the Apache configuration requires some
modification of this file.
For convenience, the httpd.conf file is divided into three sections. Although these divi-
sions are arbitrary, if you try to maintain these groupings, your configuration file will be
much easier to read. The three sections of the httpd.conf are:
Section 1: The global environment section contains directives that control the
operation of the Apache server process as a whole. This is where you place direc-
tives that control the operation of the Apache server processes, as opposed to
directives that control how those processes handle user requests.
Section 2: The main or default server section contains directives that define the
parameters of the “main” or “default” server, which responds to requests that
aren’t handled by a virtual host. These directives also provide default values for
the settings of all virtual hosts.
Section 3: The virtual hosts section contains settings for virtual hosts, which allow
Web requests to be sent to different IP addresses or hostnames and have them han-
dled by the same Apache server process. Virtual hosts are the subject of Chapter 6.
The Virtual Host Context: Although a virtual host is actually defined by the con-
tainer directive <VirtualHost>, for the purpose of defining directive contexts, it
is considered separately because many virtual host directives actually override
general server directives or defaults. As discussed in Chapter 6, the virtual host
attempts to be a second server in every respect, running on the same machine and,
to a client that connects to the virtual host, appearing to be the only server run-
ning on the machine.
The .htaccess Context: The directives in an .htaccess file are treated almost
identically to directives appearing inside a <Directory> container in httpd.conf.
The main difference is that directives appearing inside an .htaccess file can be
disabled by using the AllowOverride directive in httpd.conf.
For each directive, Appendix A lists the context in which it can be used and the overrides
that enable or disable it. For example, looking at the information for the Action directive,
you can see that it is valid in all four contexts but is subject to being overridden, when
used in an .htaccess file, by a FileInfo override. That is, if the FileInfo override is not
in effect for a directory, an Action directive appearing inside an .htaccess file in that
directory is disabled. This is because the Action directive is controlled by the FileInfo
override.
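To make the override mechanism concrete, consider a sketch like the following (the
directory path is only an example). With FileInfo in the AllowOverride list, an Action
directive in an .htaccess file beneath that directory will be processed; with AllowOverride
None, Apache won’t even read .htaccess files in that tree:
<Directory "/home/httpd/html">
    AllowOverride FileInfo
</Directory>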
The Apache server is smart enough to recognize when a directive is being specified out of
scope. You’ll get the following error when you start the server, for example, if you attempt to use
the Listen directive in a <Directory> context:
# /usr/local/apache/bin/httpd
Syntax error on line 925 of /usr/local/apache1_3_12/conf/httpd.conf:
Listen not allowed here
httpd could not be started
doing that, you should understand the purpose of the four directives in this section. These
directives, while simple to understand and use, all have a server-wide scope, and affect the
way many other directives operate. Because of the importance of these directives, you
should take care to ensure that they are set properly.
Typically this directory will contain the subdirectories bin/, conf/, and logs/. In lieu of
defining the server root directory using the ServerRoot configuration directive, you can
also specify the location with the -d option when invoking httpd:
/usr/local/apache/bin/httpd -d /etc/httpd
While there’s nothing wrong with using this method of starting the server, it is usually
best reserved for testing alternate configurations and for cases where you will run mul-
tiple versions of Apache on the same server simultaneously, each with its own configura-
tion file.
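In httpd.conf itself, the server root is set with a ServerRoot directive; with the default installation prefix, it reads:
ServerRoot "/usr/local/apache"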
Paths for all other configuration files are taken as relative to this directory. For example,
the following directive causes Apache to write error messages into /usr/local/apache/
logs/error.log:
ErrorLog logs/error.log
On my own server, I commented out the default DocumentRoot directive and added a new DocumentRoot directive of my own:
# DocumentRoot "/usr/local/apache/htdocs"
DocumentRoot "/home/httpd/html"
Note that a full path to the directory must be used whenever the directory is outside the
server root. Otherwise, a relative path can be given. (The double quotes are usually
optional, but it’s a good idea to always use them. If the string contains spaces, for
example, it must be enclosed in double quotes.)
When you change the DocumentRoot, you must also alter the <Directory> container
directive that groups all directives that apply to the DocumentRoot and subdirectories:
# <Directory "/usr/local/apache/htdocs">
<Directory "/home/httpd/html">
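The ownership and permission commands referred to below would be something like this sketch (the cgi-bin path and the 760 mode are assumptions drawn from the surrounding description):
# chown -R nobody.webteam /home/httpd/cgi-bin
# chmod -R 760 /home/httpd/cgi-bin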
I use the commands shown above to assign ownership of the cgi-bin directory (and all of its subdirectories and files) to a user named nobody and the group webteam. The default behavior of Apache on Linux is to run under the nobody user account. The group name is arbitrary, but it is to this group that I assign membership for those user accounts that are permitted to create or modify server scripts.
The second line ensures that the file owner has full read, write, and execute permission,
that members of the webteam group have read and write access, and that all other users
have no access to the directory or the files it contains.
The DefaultType Directive
A very rarely used directive in the general server scope, DefaultType can redefine the
default MIME content type for documents requested from the server. If this directive is
not used, all documents not specifically typed elsewhere are assumed to be of MIME type
text/html. Chapter 16 discusses MIME content types.
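For example, to have documents of unknown type treated as plain text rather than HTML, you could add:
DefaultType text/plain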
When the ServerType directive is set to inetd, there will be no httpd process that binds itself to network sockets and listens for client
connections. Instead, the Linux inetd process is configured to listen for client connec-
tions on behalf of Apache and spawn httpd processes, as required, to handle arriving
connections. This is similar to the way Linux handles services like File Transfer Pro-
tocol (FTP).
The Apache inetd mode of operation is not recommended for most Apache installations,
although it results in a more efficient use of resources if the server load is very light (a few
hundred connections per day), or when the available memory is extremely limited (64MB
or less RAM). The Apache server processes spend most of their time in an idle (waiting)
state, so not running these processes continuously frees resources (particularly memory)
that would otherwise be tied up.
The downside is that, since the system has to create a new listener process for each client
connection, there is a delay in processing Web requests. The use of dynamically loadable
modules increases the time required for Apache to load and begin responding to user
requests. This delay is not usually significant when Apache starts its pool of processes in
standalone mode, but in inetd mode, where Apache starts the processes after the request
is received, the delay can be noticeable. This is particularly true if a large number of DSO
modules have to be loaded and mapped into the Apache kernel’s address space. When
using Apache in inetd mode, you should avoid using dynamic modules and instead stat-
ically link the necessary modules, and eliminate those modules that you aren’t using by
commenting out or deleting the associated directives in httpd.conf.
NOTE Some Apache administrators prefer to use inetd and TCP wrappers for
all server processes. The Apache Software Foundation questions the security
benefits of this practice and does not recommend the use of TCP wrappers with
the Apache server.
Setting Up Apache for inetd Setting up Apache to run in the inetd mode is not quite
as simple as running the server in the default standalone mode. Besides adding the
ServerType inetd directive to httpd.conf, you must ensure that the Linux system is con-
figured to respond to Web requests and spawn httpd server processes as required. The
Linux /etc/services file must contain lines for the TCP ports on which Apache requests
will be received. For standard HTTP requests on TCP port 80, the /etc/services file
should contain the following line:
http 80/tcp
If you are running Apache with Secure Sockets Layer (SSL), you should also include a line
for the default SSL port:
https 443/tcp
Additionally, for each of the lines in /etc/services that apply to Apache, you must have
a corresponding line in the /etc/inetd.conf file. For the two lines above, you would
make sure /etc/inetd.conf contains the following lines:
http stream tcp nowait nobody /usr/local/apache/bin/httpd
https stream tcp nowait nobody /usr/local/apache/bin/httpd -DSSL
The first argument on each line is the service name and must match an entry in /etc/
services. These lines give the inetd server process a full command path and optional
arguments to run the Apache server for each defined service. The process will be started
with the user ID (UID) of the unprivileged nobody account. The user nobody owns the
Apache process, so you should ensure that file and directory permissions permit user
nobody to access all resources needed by the server.
Before these changes take effect, you must restart inetd or send the HUP (hangup) signal to the running inetd process, as in this example:
# ps -ef | grep inetd
root 352 1 0 08:17 ? 00:00:00 inetd
# kill -HUP 352
In the following example, I’ve chosen to run Apache as www, a special Web-specific user
account that I create on all my Web servers. For ease of administration, Apache resources
on my server are usually owned by user www and group wwwteam.
User www
I place all the Web developers’ accounts in this group and also change the Apache con-
figuration to run server processes owned by this group.
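The matching Group directive, naming the group mentioned above, would be:
Group wwwteam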
As with the User directive, a standalone Apache server must be started as root to
use the Group directive. Otherwise, the server can’t change the group ownership of any
child processes it spawns.
The BindAddress directive can also accept a fully qualified domain name instead of an IP address; if that name cannot be resolved when Apache is started, the DNS query will fail and the Apache server will not start.
This directive is very limited. It can be used only once in an Apache configuration. If mul-
tiple directives exist, only the last is used. It cannot specify port values, nor can it be used
to specify multiple IP addresses (other than the special case of * or ALL). For these reasons,
the Listen directive (described shortly) is much more flexible and should usually be used
instead.
BindAddress 192.168.1.1
This example of the BindAddress directive (which is valid only in the server-wide context)
causes the Apache server to bind to, or listen for, connections on a single network inter-
face (designated by the IP address assigned to that port). By default, Apache listens for
connections on all network interfaces on the system. This directive can be used, for
example, with an Apache server on an intranet to force it to listen only for connections
on the system’s local area network address, ignoring connection attempts on any other
network adapters that may exist (particularly those accessible from the Internet).
Like the BindAddress directive, the Port directive is limited to a single TCP port for the
server and cannot be used to set different port values for different network interfaces.
Also, only one Port directive is used in httpd.conf. If more than one exists, the last one
overrides all others. However, while the BindAddress directive should be avoided, using
the Port directive is a good practice. This is because the Port directive serves a second
purpose: the Port value is used with the value of the ServerName directive to generate
URLs that point back to the system itself. These self-referential URLs are often generated
automatically by scripts (Chapter 8) or Server-Side Include pages (Chapter 7). While it is
acceptable to rely on the default value of the Port directive (80), if you wish to create self-
referential URLs that use any port other than 80, you must specify a Port directive. For
example:
Port 443
This directive defines the default port on which Apache listens for connections as TCP
port 443, the standard port for Secure Sockets Layer. (SSL is the subject of Chapter 15.)
Note that subsequent Listen directives can cause Apache to accept connections on
other TCP ports, but whenever the server creates a URL to point back to itself (a “self-
referential URL”), the Port directive will force it to include 443 as the designated port for
connections.
If Listen specifies only a port number, the server listens to the specified port on all system
network interfaces. If a single IP address and a single port number are given, the server lis-
tens only on that port and interface.
Multiple Listen directives may be used to specify more than one address and port to
listen to. The server will respond to requests from any of the listed addresses and ports.
The Options Directive 95
For example, to make the server accept connections on both port 80 and port 8080, use
these directives:
Listen 80
Listen 8080
To make the server accept connections on two specific interfaces and port numbers, iden-
tify the IP address of the interface and the port number separated by a colon, as in this
example:
Listen 192.168.1.3:80
Listen 192.168.1.5:8080
Although Listen is very important in specifying multiple IP addresses for IP-based virtual
hosting (discussed in detail in Chapter 6), the Listen directive does not tie an IP address
to a specific virtual host. Here’s an example of the Listen directive used to instruct Apache to accept connections on two interfaces, each of which uses a different TCP port:
Listen 192.168.1.1:80
Listen 216.180.25.168:443
I use this configuration to accept ordinary HTML requests on Port 80 on my internal net-
work interface; connections on my external interface (from the Internet) are accepted
only on TCP Port 443, the default port for Secure Sockets Layer (SSL) connections (as
we’ll see in Chapter 15).
The following examples should clarify the rules governing the merging of options. In the
first example, only the option Includes will be set for the /web/docs/spec directory:
<Directory /web/docs>
Options Indexes FollowSymLinks
</Directory>
<Directory /web/docs/spec>
Options Includes
</Directory>
In the example below, only the options FollowSymLinks and Includes are set for the /
web/docs/spec directory:
<Directory /web/docs>
Options Indexes FollowSymLinks
</Directory>
<Directory /web/docs/spec>
Options +Includes -Indexes
</Directory>
Using either -IncludesNOEXEC or -Includes disables server-side includes. Also, the use of a plus or minus sign to specify an option has no effect if no options list is already in effect. Thus it is always a good idea to ensure that at least one Options directive that
covers all directories is used in httpd.conf. Options can be added to or removed from
this list as required in narrower scopes.
WARNING Be aware that the default setting for Options is All. For that reason,
you should always ensure that this default is overridden for every Web-accessible
directory. The default configuration for Apache includes a <Directory> container
to do this; do not modify or remove it.
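In the stock httpd.conf, that container typically looks something like this (your distribution’s file may differ slightly):
<Directory />
Options FollowSymLinks
AllowOverride None
</Directory>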
The Container Directives
The scope of an Apache directive is often restricted using special directives called con-
tainer directives. In general, these container directives are easily identified by the enclosing <> brackets. (The conditional directives <IfDefine> and <IfModule> use the same bracket syntax but are not container directives.) Container directives require a closing directive that has the same name and begins with a slash character (much like HTML tags).
A container directive encloses other directives and specifies a limited scope of applica-
bility for the directives it encloses. A directive that is not enclosed in a container directive
is said to have global scope and applies to the entire Apache server. A global directive is
overridden locally by the same directive when it is used inside a container. The following
sections examine each type of container directive.
The directives you enclose in the <VirtualHost> container will specify the correct host
name and document root for the virtual host. Naturally, the server name should be a
value that customers of the Web site expect to see when they connect to the virtual host.
Additionally, the file served to the customers needs to provide the expected information.
In addition to these obvious directives, almost anything else you need to customize for the
virtual host can be set in this container. For example:
<VirtualHost 192.168.1.4>
ServerAdmin [email protected]
DocumentRoot /home/httpd/wormsdocs
ServerName www.worms.com
ErrorLog logs/worms.log
TransferLog logs/worms.log
</VirtualHost>
The example above defines a single virtual host. In Chapter 6, we’ll see that this is one
form of virtual host, referred to as IP-based. The first line defines the Internet address (IP)
for the virtual host. All connections to the Apache server on this IP address are handled
by the virtual server for this site, which might be only one of many virtual sites being
hosted on the same server. Each directive defines site-specific values for configuration
parameters that, outside a <VirtualHost> container directive, normally refer to the entire
server. The use of each of these in the general server context has already been shown.
<Directory> containers are always evaluated so that the shortest match (widest scope) is
applied first, and longer matches (narrower scope) override those that may already be in
effect from a wider container. For example, the following container disables all overrides
for every directory on the system (/ and all its subdirectories):
<Directory />
AllowOverride None
</Directory>
If the httpd.conf file includes a second <Directory> container that specifies a directory
lower in the file system hierarchy, the directives in the container take precedence over
those defined for the file system as a whole. The following container enables FileInfo
overrides for all directories under /home (which hosts all user home directories on most
Linux systems):
<Directory /home/*>
AllowOverride FileInfo
</Directory>
The <Directory> container can also be matched against regular expressions by using the
‘~’ character to force a regular expression match:
<Directory ~ "^/home/user[0-9]{3}">
The <DirectoryMatch> directive is specifically designed for regular expressions, however,
and should normally be used in place of this form. This container directive is exactly like
<Directory>, except that the directories to which it applies are matched against regular
expressions. The following example applies to all request URLs that specify a resource that
begins with /user, followed by exactly three digits. (The ^ character denotes “beginning of
string,” and the {3} means to match the previous character; in this case any member of the
character set [0–9]).
<DirectoryMatch "^/user[0-9]{3}">
order deny,allow
deny from all
allow from .foo.com
</DirectoryMatch>
This container directive would apply to a request URL like the following:
http://jackal.hiwaay.net/user321
Many Apache configuration directives accept regular expressions for matching pat-
terns. Regular expressions are an alternative to wildcard pattern matching and are
usually an extension of a directive’s wildcard pattern matching capability. Indeed, I
have heard regular expressions (or regexps) described as “wildcards on steroids.”
A brief sidebar can hardly do justice to the subject, but to pique your interest, here are
a few regexp tags and what they mean:
^ and $ Two special and very useful tags that mark the beginning and end of a
line. For example, ^# matches the # character whenever it occurs as the
first character of a line (very useful for matching comment lines), and #$
would match # occurring as the very last character on a line. These
pattern-matching operators are called anchoring operators and are said
to “anchor the pattern” to either the beginning or the end of a line.
* and ? The character * matches the preceding character zero or more times,
and ? matches the preceding pattern zero or one time. These operators
can be confusing, because they work slightly differently from the same
characters when used as “wildcards.” For example, the expression fo*
will match the pattern foo or fooo (any number of o characters), but it
also matches f, which has zero o’s. The expression ca? will match the c
in score, which seems a bit counterintuitive because there’s no a in the
word, but the a? says zero or one a character. Matching zero or more
occurrences of a pattern is usually important whenever that pattern is
optional. You might use one of these operators to find files that begin
with a name that is optionally followed by several digits and then an
extension. Matching for ^filename\d*\.gif will match filename001.gif and filename2.gif, but also simply filename.gif. The \d matches any digit (0–9); in other words, we are matching zero or more digits.
+ Matches the preceding character one or more times, so ca+ will not
match score, but will match scare.
. The period character matches any single character except the newline
character. In effect, when you use it, you are saying you don’t care
what character is matched, as long as some character is matched. For
example x.y matches xLy but not xy; the period says the two must be
separated by a single character. The expression x.*y says to match an
x and a y separated by zero or more characters.
The only way to develop proficiency in using regexps is to study examples and experiment with them. Entire books have been written on the power of regular expressions for pattern matching and replacement (well, at least one), and a number of useful regexp references and tutorials are available online and in print.
Like the <Directory> container, <Files> can also be matched against regular expres-
sions by using the ~ character to force a regular expression match. The following line, for
example, matches filenames that end in a period character (escaped with a backslash)
immediately followed by the characters xml. The $ in regular expressions denotes the end
of the string. Thus we are looking for file names with the extension .xml.
<Files ~ "\.xml$">
Directives go here
</Files>
The <FilesMatch> directive is specifically designed for regular expressions, however, and
should normally be used in place of this form.
<FilesMatch> is exactly like the <Files> directive, except that the specified files are
defined by regular expressions. All graphic images might be defined, for example, using:
<FilesMatch "\.(gif|jpe?g|png)$">
some directives
</FilesMatch>
This regular expression matches filenames with the extension gif or jpg or jpeg or png.
(The or is denoted by the vertical bar ‘|’ character.) Notice the use of the ? character after
the e, which indicates zero or one occurrences of the preceding character (e). In other
words, a match is made to jp, followed by zero or one e, followed by g.
You can also use extended regular expressions by adding the ~ character, as described for the <Directory> and <Files> container directives; but a special container directive,
<LocationMatch>, is specifically designed for this purpose and should be used instead.
<LocationMatch> is exactly like the <Location> container directive, except that the
URLs are specified by regular expressions. The following container applies to any URL
that contains the substring /www/user followed immediately by exactly three digits; for
example, /www/user911:
<LocationMatch "/www/user[0-9]{3}">
order deny,allow
deny from all
allow from .foo.com
</LocationMatch>
<Limit> encloses directives that apply only to the HTTP methods specified. In the fol-
lowing example, user authentication is required only for requests using the HTTP
methods POST, PUT, and DELETE:
<Limit POST PUT DELETE>
require valid-user
</Limit>
<LimitExcept> encloses directives that apply to all HTTP methods except those speci-
fied. The following example shows how authentication can be required for all HTTP
methods other than GET:
<LimitExcept GET>
require valid-user
</LimitExcept>
Perl Sections
If you are using the mod_perl module, it is possible to include Perl code to automatically
configure your server. Sections of the httpd.conf containing valid Perl code and enclosed
in special <Perl> container directives are passed to mod_perl’s built-in Perl interpreter.
The output of these scripts is inserted into the httpd.conf file before it is parsed by the
Apache engine. This allows parts of the httpd.conf file to be generated dynamically, pos-
sibly from external data sources like a relational database on another machine.
Since this option absolutely requires the use of mod_perl, it is discussed in far more detail
with this sophisticated module in Chapter 8.
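As a rough sketch of what a Perl section looks like (this assumes mod_perl is compiled in, and the address, hostnames, and paths are purely illustrative; see Chapter 8 for the real details), the following generates a virtual host for each name in a list:
<Perl>
# Generate one <VirtualHost> entry per name; everything here is illustrative.
for my $name (qw(vhost1 vhost2)) {
    push @{ $VirtualHost{'192.168.1.1'} }, {
        ServerName   => "$name.example.com",
        DocumentRoot => "/home/httpd/$name",
    };
}
</Perl>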
the request. <Directory> containers are always evaluated from widest to nar-
rowest scope, and directives found in .htaccess files override those in
<Directory> containers that apply to the same directory.
2. Directives found in <DirectoryMatch> containers and <Directory> containers
that match regular expressions are evaluated next. Directives that apply to the
request override those in effect from <Directory> or .htaccess files (item 1 of
this list).
3. After directives that apply to the directory in which the resource resides, Apache
applies directives that apply to the file itself. These come from <Files> and
<FilesMatch> containers, and they override directives in effect from <Directory>
containers. For example, if an .htaccess file contains a directive that denies the
requester access to a directory, but a directive in a <Files> container specifically
allows access to the file, the request will be granted, because the contents of the
<Files> container override those of the <Directory> container.
4. Finally, any directives in <Location> or <LocationMatch> containers are applied.
These directives are applied to the request URL and override directives in all other
containers. If a directive in a <Location> container directly conflicts with the
same directive in either a <Directory> or a <Files> container, the directive in the
<Location> container will override the others.
Containers with narrower scopes always override those with a wider scope. For example,
directives contained in <Directory /home/httpd/html> override those in <Directory
/home/httpd> for the resources in its scope. If two containers specify exactly the same
scope (for example, both apply to the same directory or file), the one specified last takes
precedence.
The following rather contrived example illustrates how the order of evaluation works.
<Files index.html>
allow from 192.168.1.2
</Files>
<Directory /home/httpd/html>
deny from all
</Directory>
In this example, the <Directory> container specifically denies access to the /home/
httpd/html directory to all clients. The <Files> directive (which precedes it in the
httpd.conf file) permits access to a single file index.html inside that directory, but only
to a client connecting from IP address 192.168.1.2. This permits the display of the HTML
page by that client, but not any embedded images; these can’t be accessed, because the
<Files> directive does not include them in its scope. Note also that the order of the con-
tainers within the configuration file is not important; it is the order in which the con-
tainers are resolved that determines which takes precedence. Any <Files> container
directives will always take precedence over <Directory> containers that apply to the
same resource(s).
Although nearly everything can be configured in httpd.conf, editing this file is not always the most efficient configuration method. Most Apache
administrators prefer to group directory-specific directives, particularly access-control
directives, in special files located within the directories they control. This is the purpose
of Apache’s .htaccess files. In addition to the convenience of having all the directives
that apply to a specific group of files located within the directory that contains those files,
.htaccess files offer a couple of other advantages. First, you can grant access to modify
.htaccess files on a per-directory basis, allowing trusted users to modify access permis-
sions to files in specific directories without granting those users unrestricted access to the
entire Apache configuration. Second, you can modify directives in .htaccess files
without having to restart the Apache server (which is the only way to read a modified
httpd.conf file).
By default, the Apache server searches for the existence of an .htaccess file in every direc-
tory from which it serves resources. If the file is found, it is read and the configuration direc-
tives it contains are merged with other directives already in effect for the directory. Unless
the administrator has specifically altered the default behavior (using the AllowOverride
directive as described below) all directives in the .htaccess file override directives already
in effect. For example, suppose httpd.conf contained the following <Directory> section:
<Directory /home/httpd/html/Special>
order deny,allow
deny from all
</Directory>
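An .htaccess file placed in the Special directory could then relax that restriction; a sketch of such a file (the subnet shown is an assumption) might be:
order deny,allow
deny from all
allow from 192.168.1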
Here, we’ve used a wildcard expression to specify a range of IP addresses (possibly the
Web server’s local subnet) that can access resources in the Special directory.
Limit Allow the use of directives that control access based on the browser hostname or network address.
Options Allow the use of special directives, currently limited to the directives Options and XBitHack.
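Per-user Web directories are enabled with the UserDir directive described below; to serve files from a /WWW subdirectory of each user’s home directory, the line added to httpd.conf would presumably be:
UserDir WWW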
Once I’ve added this line to Apache’s httpd.conf file and restarted the server, each user
on my system can now place files in a /WWW subdirectory of their home directory that
Apache can serve. Requests to a typical user’s Web files look like:
http://jackal.hiwaay.net/~caulds/index.html
The UserDir directive specifies a filename or pattern that is used to map a request for a
user home directory to a special repository for that user’s Web files. The UserDir directive
can take one of three forms:
A relative path This is normally the name of a directory that, when found in the
user’s home directory, becomes the DocumentRoot for that user’s Web resources:
UserDir public_html
This is the simplest way to implement user home directories, and the one I rec-
ommend because it gives each user a Web home underneath their system home
directories. This form takes advantage of the fact that ~account is always Linux
shorthand for “user account’s home directory”. By specifying users’ home direc-
tories as a relative path, the server actually looks up the user’s system home (in the
Linux /etc/passwd file) and then looks for the defined Web home directory
beneath it.
WARNING Be careful when using the relative path form of the UserDir direc-
tive. It can expose directories that shouldn’t be accessible from the Web. For
example, when using the form http://servername/~root/, the Linux shortcut
for ~root maps to a directory in the file system reserved for system files on most
Linux systems. If you had attempted to designate each user’s system home direc-
tory as their Web home directory (using UserDir /), this request would map to
the /root directory. When using the relative directory form to designate user
Web home directories, you should lock out any accounts that have home direc-
tories on protected file systems (see “Enabling/Disabling Mappings” below). The
home directory of the root account (or superuser) on Linux systems should be
protected. If someone was able to place an executable program in one of root’s
startup scripts (like .profile or .bashrc), that program would be executed the
next time a legitimate user or administrator logged in using the root account.
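An absolute path A UserDir directive can also name an absolute path; each user then gets a directory with the same name as their account beneath that path, presumably along these lines:
UserDir /home/httpd/userstuff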
This example would give each user their own directory with the same name as
their user account underneath /home/httpd/userstuff. This form gives each
user a Web home directory that is outside their system home directory. Main-
taining a special directory for each user, outside their system home directory, is
not a good idea if there are a lot of users. They won’t be able to maintain their
own Web spaces, as they could in their respective home directories, and the entire
responsibility will fall on the administrator. Use the absolute form for defining
user Web home directories only if you have a small number of users, preferably
where each is knowledgeable enough to ensure that their Web home directory is
protected from other users on the system.
An absolute path with placeholder An absolute pathname can contain the *
character (called a placeholder), which is replaced by the username when deter-
mining the DocumentRoot path for that user’s Web resources. Like the absolute
path described above, this form can map the request to a directory outside the
user’s system home directory:
UserDir /home/httpd/*/www
Apache substitutes the username taken from the request URL of the form http://servername/~username/ to yield the path to each user’s Web home directory:
/home/httpd/username/www
If all users have home directories under the same directory, the placeholder in the
absolute path can mimic the relative path form, by specifying:
UserDir /home/*/www
The behavior of the lookup is slightly different, though, using this form. In the rel-
ative path form, the user’s home directory is looked up in /etc/passwd. In the
absolute path form, this lookup is not performed, and the user’s Web home direc-
tory must exist in the specified path. The advantage of using the absolute path in
this manner is that it prevents URLs like http://servername/~root from map-
ping to a location that Web clients should never access.
The disadvantage of using the “absolute path with placeholder” form is that it forces all
Web home directories to reside under one directory that you can point to with the abso-
lute path. If you needed to place user Web home directories in other locations (perhaps
even on other file systems) you will need to create symbolic links that point the users’
defined Web home directories to the actual location of the files. For a small to medium-
sized system, this is a task that can be done once for each user and isn’t too onerous, but
for many users, it’s a job you might prefer to avoid.
The use of the UserDir directive is best illustrated by example. Each of the three forms of the directive described above would map a request for
http://jackal.hiwaay.net/~caulds/index.html
to the corresponding location under the user’s Web home directory. The UserDir directive can also name a URL rather than a local path; in that case, Apache generates a URL redirect request that sends the requester to a resource on a separate server, such as:
http://server2.hiwaay.net/~caulds/docfiles/index.html
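The directive producing this redirect would presumably be the URL form of UserDir with a placeholder:
UserDir http://server2.hiwaay.net/~*/docfiles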
Enabling/Disabling Mappings
Another form of the UserDir directive uses the keywords enabled or disabled in one of
three ways:
UserDir disabled
turns off all username-to-directory mappings. This form is usually used prior to a UserDir enabled directive that explicitly lists users for which mappings are performed.
UserDir enabled <usernames>
Allowing a user-written CGI script to run with the permissions of the Web server could be disastrous. Such a script would have the
same access privileges that the Web server itself uses, and this is normally not a good
thing. To protect the Web server from errant or malicious user-written CGI scripts, and
to protect Web users from one another, user CGI scripts are usually run from a program
called a CGI wrapper. A CGI wrapper is used to run a CGI process under different user
and group accounts than those that are invoking the process. In other words, while ordi-
nary CGI processes are run under the user and group account of the Apache server (by
default that is user nobody and group nobody), using a CGI wrapper, it is possible to
invoke CGI processes that run under different user and group ownership.
There are several such CGI wrappers, but one such program, called suEXEC, is a standard
part of Apache in all versions after version 1.2 (though not enabled by the default installa-
tion). SuEXEC is very easy to install, and even easier to use. There are two ways in which
suEXEC is useful to Apache administrators. The most important use for suEXEC is to
allow users to run CGI programs from their own directories that run under their user and
group accounts, rather than that of the server.
The second way in which suEXEC is used with Apache is with virtual hosts. When used
with virtual hosts, suEXEC changes the user and group accounts under which all CGI
scripts defined for each virtual host are run. This is used to give virtual host administra-
tors the ability to write and run their own CGI scripts without compromising the security
of the primary Web server (or any other virtual host).
Listing 4.1 A build.sh Script for Building Apache 1.3.12 with suEXEC Support
CFLAGS="-DUSE_RANDOM_SSI -DUSE_PARSE_FORM" \
./configure \
"--enable-rule=EAPI" \
"--with-layout=Apache" \
"--prefix=/usr/local/apache" \
"--enable-module=most" \
"--enable-module=ssl" \
"--enable-shared=max" \
"--enable-suexec" \
"--suexec-caller=www" \
"--suexec-docroot=/home/httpd/html" \
"--suexec-logfile=/usr/local/apache/logs/suexec_log" \
"--suexec-userdir=public_html" \
"--suexec-uidmin=100" \
"--suexec-gidmin=100" \
"--suexec-safepath=/usr/local/bin:/usr/bin:/bin" \
"$@"
To build and install Apache, with suEXEC, I enter three lines in the Apache source
directory:
# ./build.sh
# make
# make install
After building and installing Apache with suEXEC support, you should test it by invoking
httpd with the -l argument. If suEXEC is functional, the result will look like this:
# ./httpd -l
Compiled-in modules:
http_core.c
mod_so.c
suexec: enabled; valid wrapper /usr/local/apache/bin/suexec
If Apache is unable to find suEXEC, or if it does not have its user setuid execution bit
set, suEXEC will be disabled:
# ./httpd -l
Compiled-in modules:
http_core.c
mod_so.c
suexec: disabled; invalid wrapper /usr/local/apache1_3_12/bin/suexec
Apache will still start, even if suEXEC is unavailable, but suEXEC will be disabled. You have to keep an eye on this; unfortunately, when suEXEC is disabled, no warning is given when Apache is started and nothing is written into Apache’s error log; the log notes only the case in which suEXEC is enabled. To verify, check Apache’s error log (which is in logs/error.log under the Apache installation directory, unless you’ve overridden this default value). If all is OK, the error log will contain the following line, usually immediately after the line indicating that Apache has been started:
[notice] suEXEC mechanism enabled (wrapper: /usr/local/apache/bin/suexec)
If suEXEC is not enabled when Apache is started, verify that you have the suexec
wrapper program, owned by root, in Apache’s bin directory:
# ls -al /usr/local/apache/bin/suexec
-rws--x--x 1 root root 10440 Jun 28 09:59 suexec
Note the s in the user permissions. This indicates that the setuid bit is set—in other
words, the file, when executed, will run under the user account of the file’s owner. For
example, the Apache httpd process that invokes suexec will probably be running under
the nobody account. The suexec process it starts, however, will run under the root
account, because root is the owner of the file suexec. Only root can invoke the Linux
setuid and setgid system functions to change the ownership of processes it spawns as
children (the CGI scripts that run under its control). If suexec is not owned by root, and
does not have its user setuid bit set, correct this by entering the following lines while
logged in as root:
# chown root /usr/local/apache/bin/suexec
# chmod u+s /usr/local/apache/bin/suexec
If you wish to disable suEXEC, the best way is to simply remove the user setuid bit:
# chmod u-s /usr/local/apache/bin/suexec
This not only disables suEXEC, but it also renders the suEXEC program a bit safer
because it will no longer run as root (unless directly invoked by root).
Using suEXEC
While suEXEC is easy to set up, it’s even easier to use. Once it is enabled in your running
Apache process, any CGI script that is invoked from a user’s Web directory will execute
under the user and group permissions of the owner of the Web directory. In other words,
if I invoke a script with a URL like http://jackal.hiwaay.net/~caulds/cgi-bin/
somescript.cgi, that script will run under caulds’s user and group account. Note that
all CGI scripts that will run under the suEXEC wrapper must be in the user’s Web direc-
tory (which defaults to public_html but can be redefined by the --suexec-userdir con-
figuration) or a subdirectory of that directory.
For virtual hosts, the user and group accounts under which CGI scripts are run are
defined by the User and Group directives found in the virtual host container:
<VirtualHost 192.168.1.1>
ServerName vhost1.hiwaay.net
ServerAdmin [email protected]
DocumentRoot /home/httpd/NamedVH1
User vh1admin
Group vh1webteam
</VirtualHost>
If a virtual host does not contain a User or Group directive, the values for these are inher-
ited from the primary Web server (usually user nobody and group nobody). Note that all
CGI scripts that will run under suEXEC for a virtual host described above must reside
beneath the DocumentRoot (they can be in any subdirectory beneath DocumentRoot, but
they cannot reside outside it).
I first granted ownership of this directory to the user and group that Apache processes run under. On Linux systems, these are the unprivileged user and group both named nobody. (Most Unix systems provide the same user and group; FreeBSD, which does not provide either of these, is the most notable exception.) I changed the ownership of the directory recursively, so that all subdirectories and files would be accessible to the user and group nobody:
# chown -R nobody.nobody /usr/doc/MySQL-3.22.29
I symbolically linked the top-level HTML file to one that Apache will read when the
requested URL names only the directory, and not a particular file (that is, where it
matches one of the names specified in DirectoryIndex):
# ln -s manual_toc.html index.html
Using a symbolic link, rather than copying the file or renaming it, ensures that only one
copy of the file exists, but can be accessed by either name. The last step was the insertion
of two Alias directives into httpd.conf. Place these in a manner that seems logical to
you, probably somewhere in the section of the file labeled 'Main' server configuration,
so that you can easily locate the directives at a later date.
Alias /MySQL/ "/usr/doc/MySQL-3.22.29/"
Alias /ApacheDocs/ "/usr/local/apache/docs/"
Any user can now access these sets of documentation on my server using these URLs:
http://jackal.hiwaay.net/MySQL/
http://jackal.hiwaay.net/ApacheDocs/
A URL like these actually maps to a directory on the server (for example, a directory named dirname beneath the directory defined in the Apache configuration as DocumentRoot). It is only through a standard Apache module named mod_dir that a specific page is served to clients that send a request URL that maps to a directory. Without mod_dir, a request URL that does not specify a single resource would be invalid and would produce an HTTP 404 (Not Found) error.
The mod_dir module serves two important functions. First, whenever a request is received
that maps to a directory but does not have a trailing slash (/) as in:
http://jackal.hiwaay.net/dirname
mod_dir sends a redirection request to the client indicating that the request should be
made, instead, to the URL:
http://jackal.hiwaay.net/dirname/
This requires a second request on the part of the client to correct what is, technically, an
error in the original request. Though the time required to make this second request is usu-
ally minimal and unnoticed by the user, whenever you express URLs that map to direc-
tories rather than files, you should include the trailing slash for correctness and efficiency.
The second function of mod_dir is to look for and serve a file defined as the index file for
the directory specified in the request. That page, by default, is named index.html. This can
be changed using mod_dir’s only directive, DirectoryIndex, as described below. The name
of the file comes from the fact that it was originally intended to provide the requestor with
an index of the files in the directory. While providing directory indexes is still useful, the file
is used far more often to serve a default HTML document, or Web page, for the root URL;
this is often called the home page. Remember that this behavior is not a given; mod_dir must
be included in the server configuration and enabled for this to work.
The last change I made was to add a second filename to the DirectoryIndex directive. I
added an entry for index.htm to cause the Apache server to look for files of this name,
which may have been created on a system that follows the Microsoft convention of a
three-character filename extension. The files are specified in order of preference from left
to right, so if it finds both index.html and index.htm in a directory, it will only serve
index.html.
# DirectoryIndex index.html
DirectoryIndex index.html index.htm
Figure 4.1 A plain directory listing
In addition to this plain directory listing, mod_autoindex also allows the administrator full
control over every aspect of the directory listing it prepares. This is called fancy indexing.
You can enable fancy indexing by adding the following directive to httpd.conf:
IndexOptions FancyIndexing
The default httpd.conf provided with the Apache distribution uses many of the direc-
tives that I’ll describe in the following sections to set up the default fancy directory for use
on your server. Figure 4.2 shows what this standard fancy directory looks like when dis-
playing the results of a directory request.
Index Options
IndexOptions can also be used to set a number of other options for configuring directory
indexing. Among these are options to specify the size of the icons displayed, to suppress
the display of any of the columns besides the filename, and whether or not clicking the
column heading sorts the listing by the values in that column. Table 4.1 depicts all pos-
sible options that can be used with the IndexOptions directive.
IconsAreLinks Makes icons part of the clickable anchor for the filename.
IconHeight=pixels Sets the height (in pixels) of the icons displayed in the listing. Like the HTML tag <IMG HEIGHT=n …>.
IconWidth=pixels Sets the width (in pixels) of the icons displayed in the listing. Like the HTML tag <IMG WIDTH=n …>.
NameWidth=n Sets the width (in characters) of the filename column in the listing, truncating characters if the name exceeds this width. Specifying NameWidth=* causes the filename column to be as wide as the longest filename in the listing.
Options are always inherited from parent directories. This behavior is overridden by
specifying options with a + or – prefix to add or subtract the options from the list of
options that are already in effect for a directory. Whenever an option is read that does not
contain either of these prefixes, the list of options in effect is immediately cleared. Con-
sider this example:
IndexOptions +ScanHTMLTitles -IconsAreLinks SuppressSize
If this directive appears in an .htaccess file for a directory, regardless of the options
inherited by that directory from its higher-level directories, the net effect will be the same
as this directive:
IndexOptions SuppressSize
Specifying Icons
In addition to IndexOptions, mod_autoindex provides other directives that act to con-
figure the directory listing. You can, for example, provide a default icon for unrecognized
resources. You can change the icon or description displayed for a particular resource,
either by its MIME type, filename, or encoding type (GZIP-encoded, for example). You
can also specify a default field and display order for sorting; or identify a file whose con-
tent will be displayed at the top of the directory.
The AddIcon Directive AddIcon specifies the icon to display for a file when fancy
indexing is used to display the contents of a directory. The icon is identified by a relative
URL to the icon image file. Note that the URL you specify is embedded directly in the for-
matted document that is sent to the client browser, which then retrieves the image file in
a separate HTTP request.
The name argument can be a filename extension, a wildcard expression, a complete file-
name, or one of two special forms. Examples of the use of these forms follow:
AddIcon /icons/image.jpg *jpg*
AddIcon (IMG, /icons/image.jpg) .gif .jpg .bmp
The second example above illustrates an alternate form for specifying the icon. When
parentheses are used to enclose the parameters of the directive, the first parameter is the
alternate text to associate with the resource; the icon to be displayed is specified as a relative
URL to an image file. The alternate text, IMG, will be displayed by browsers that are not
capable of rendering images. A disadvantage of using this form is that the alternate text
cannot contain spaces or other special characters. The following form is not acceptable:
AddIcon ("JPG Image", /icons/image.jpg) .jpg
There are two special expressions that can be used in place of a filename in the AddIcon
directive to specify images to use as icons in the directory listing. ^^BLANKICON^^ is used
to specify an icon to use for blank lines in the listing, and ^^DIRECTORY^^ is used to specify
an icon for directories in the listing:
AddIcon /icons/blankicon.jpg ^^BLANKICON^^
AddIcon /icons/dir.pcx ^^DIRECTORY^^
There is one other special case that you should be aware of. The parent of the directory
whose index is being displayed is indicated by the “..” filename. You can change the icon
associated with the parent directory with a directive like the following:
AddIcon /icons/up.gif ..
NOTE The Apache Software Foundation recommends using AddIconByType
rather than AddIcon whenever possible. Although there appears to be no real dif-
ference between these (on a Linux system, the MIME type of a file is identified by
its filename extension), it is considered more proper to use the MIME type that
Apache uses for the file, rather than directly examining its filename. There are
often cases, however, when no MIME type has been associated with a file and you
must use AddIcon to set the image for the file.
The AddIconByType Directive AddIconByType specifies the icon to display in the direc-
tory listing for files of certain MIME content types. This directive works like the AddIcon
directive just described, but it relies on the determination that Apache has made of the
MIME type of the file (as discussed in Chapter 16, Apache usually determines the MIME
type of a file based on its filename).
AddIconByType /icons/webpage.gif text/html
AddIconByType (TXT, /icons/text.gif) text/*
This directive is used almost exactly like AddIcon. When parentheses are used to enclose the
parameters of the directive, the first parameter is the alternate text to associate with the
resource; the icon to be displayed is specified as a relative URL to an image file. The last
parameter, rather than being specified as a filename extension, is a MIME content type (look
in conf/mime.types under the Apache home for a list of types that Apache knows about).
Specifying a Default Icon A special directive, DefaultIcon, is used to set the icon that
is displayed for files with which no icon has been associated with either of the other direc-
tives mentioned above. The directive simply identifies an image file by relative URL:
DefaultIcon /icons/unknown.pcx
The AddAlt Directive The AddAlt directive specifies an alternate text string to be displayed for a file, instead of an icon, in text-only browsers. Like its AddIcon counterpart, the
directive specifies a filename, partial filename, or wildcard expression to identify files:
AddAlt "JPG Image" /icons/image.jpg *jpg*
AddAlt "Image File" .gif .jpg .bmp
Note that it is possible to use a quoted string with the AddAlt directive, which can contain
spaces and other special characters. This is not possible when specifying alternate text
using the special form of AddIcon as shown above.
The AddAltByType Directive AddAltByType sets the alternate text string to be displayed
for a file based on the MIME content type that Apache has identified for the file. This
directive works very much like its counterpart, AddIconByType.
AddAltByType "HTML Document" text/html
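The AddDescription Directive AddDescription sets the text shown in the Description column of the listing for matching files, identified by a filename, partial filename, or wildcard expression; for example (the description text here is only illustrative):
AddDescription "Home Page" index.html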
Note that this example sets a description to apply to all files named index.html. To apply
the description to a specific file, use its full and unique pathname:
AddDescription "My Home Page" /home/httpd/html/index.html
AddDescription can also be used with wildcarded filenames to set descriptions for entire
classes of files (identified by filename extension in this case):
AddDescription "PCX Image" *.pcx
AddDescription "TAR File" *.tgz *.tar.gz
When multiple descriptions apply to the same file, the first match found will be the one
used in the listing; so always specify the most specific match first:
AddDescription "Powered By Apache Logo" poweredby.gif
AddDescription "GIF Image" *.gif
In addition to AddDescription, there is one other way that mod_autoindex can deter-
mine values to display in the Description column of a directory listing. If IndexOptions
ScanHTMLTitles is in effect for a directory, mod_autoindex will parse all HTML files
in the directory, and extract descriptions for display from the <TITLE> elements of the doc-
uments. This is handy if the directory contains a relatively small number of HTML docu-
ments, or is infrequently accessed. Enabling this option requires that every HTML
document in the directory be opened and examined. For a large number of files, this can
impose a significant workload, so the option is disabled by default.
Files identified by the HeaderName directive must be of the major MIME content type text.
If the file is identified as type text/html (generally by its extension), it is inserted ver-
batim; otherwise it is enclosed in <PRE> and </PRE> tags. A CGI script can be used to gen-
erate the information for the header (either as HTML or plain text), but you must first
associate the CGI script with a MIME main content type (usually text), as follows:
AddType text/html .cgi
HeaderName HEADER.cgi
The ReadmeName directive works almost identically to HeaderName to specify a file (again
relative to the URI used to access the directory being indexed) that is placed in the listing
just before the closing </BODY> tag.
Ignoring Files
The IndexIgnore directive specifies a set of filenames that are ignored by mod_autoindex
when preparing the index listing of a directory. The filenames can be specified by wildcards:
IndexIgnore FOOTER*
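The default httpd.conf distributed with Apache ships with a broader IndexIgnore list along these lines (the exact list varies by version):
IndexIgnore .??* *~ *# HEADER* README* RCS CVS *,v *,t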
Example
In order to illustrate typical uses of some of the mod_autoindex directives discussed, I
created an .htaccess in the same directory that was illustrated in Figure 4.2. This file
contains the following directives, all of which are used by mod_autoindex to customize
the index listing for the directory. The result of applying these directives is shown in
Figure 4.3.
IndexOptions +ScanHTMLTitles
AddIcon /icons/SOUND.GIF .au
AddDescription "1-2-Cha-Cha-Cha" DancingBaby.avi
AddAltByType "This is a JPG Image" image/jpeg
HeaderName HEADER.html
ReadmeName README.txt
The IndexOptions directive is used to enable the extraction of file descriptions from the
<TITLE> tags of HTML formatted files (technically, files of MIME content type text/
html). In the illustration, you’ll see that it did that for the file indexOLD.html. If this file
had its original name, index.html, the index listing would not have been generated;
instead, index.html would have been sent (by mod_dir) to the client.
I’ve also provided an example of adding an icon using the AddIcon directive and a file
description using AddDescription. The results of these directives can be easily seen in
Figure 4.3. The alternate text for JPEG images (added with the AddAltByType directive)
is not displayed in the figure but would be seen in place of the image icon in text-only
browsers. It will also appear in a graphical browser in a pop-up dialog box when the
cursor is paused over the associated icon. This gives the page developer a handy way to
add help text to a graphics-rich Web page, which can be particularly useful when the icon
or image is part of an anchor tag (clickable link) and can invoke an action.
The last two directives I added to the .htaccess file for this directory specify an HTML-
formatted file to be included as a page header and a plain text file to be included as a page
footer. These both consist of a single line, also visible in Figure 4.3. The header file con-
tains HTML-formatting tags (<H3> … </H3>) that cause it to be rendered in larger, bolder
characters. There is no reason that either the header or footer could not be much longer
and contain far more elaborate formatting. Use your imagination.
In Sum
This chapter has covered a lot of ground, because so much of Apache’s functionality is
incorporated into the configuration directives provided by its core modules. We began
with the essential concept of directive context, the scope within which particular direc-
tives are valid. We then looked at the directives used to configure the basic server envi-
ronment and how the server listens for connections. These directives are fundamental to
Apache’s operation, and every administrator needs to be familiar with them.
Later sections of the chapter explored the directives used to create and manage user home
directories. These are not only an essential function for any ISP installation of an Apache
server, they are also widely used in intranets.
The next chapter moves beyond the core module to the use of third-party modules and the
techniques you can use to incorporate them into your Apache server.
Apache Modules
5
I’ve already discussed the importance of modules to Apache’s design philosophy.
Without the concept of extension by module, it is unlikely that Apache would have gar-
nered the level of third-party support that directly led to its phenomenal success in the early
days of the Web. Apache owes much of that success to the fact that any reasonably profi-
cient programmer can produce add-on modules that tap directly into the server’s internal
mechanisms. As administrators, we benefit greatly from the availability of these third-
party modules.
At one time, it was thought that commercial Web servers, with the support that “commer-
cial” implies, would eventually eclipse the open-source Apache server. It seemed com-
pletely logical that when a company began to get serious about the Web, it needed to look
for a serious Web engine, a commercial server—not some piece of unsupported free soft-
ware downloaded from the Internet. But as we’ve seen, Apache took the top spot from its
commercial rivals and has continued to widen that lead, even while most Unix-based appli-
cations slowly gave ground to their NT competitors. Apache owes much of its success to
a vibrant, innovative, and completely professional community of users and developers that
you can be a part of. Apache is as fully supported as any commercial product. Virtually any
feature or function you can desire in a Web server is available as an Apache module, usually
offered by its author at no cost to all Apache users.
This chapter looks at the types of modules available, how the module mechanism works,
how to link modules to Apache as dynamic shared objects (DSOs), and where to find third-
party modules. It concludes with a step-by-step example of installing a module.
Post-Read-Request: Called as soon as the server has read and parsed the incoming request; this
phase of the cycle always includes the determination of which virtual
host will handle the request. This phase sets up the server to handle the request.
Modules that register callbacks for this phase of the request cycle include mod_
proxy and mod_setenvif, which get all the information they need from the
request URL.
URL Translation: At this stage the URL is translated into a filename. Modules
like mod_alias, mod_rewrite or mod_userdir, which provide URL translation
services, generally do their main work here.
Header Parsing: This phase is obsolete (superseded by the Post-Read-Request
phase); no standard modules register functions to be called during this phase.
Access Control: This phase checks client access to the requested resource, based on
the client’s network address, returning a response that either allows or denies the
user access to the server resource. The only module that acts as a handler for the
Access Control phase of the request cycle is mod_access (discussed in Chapter 14).
Authentication: This phase verifies the identity of the user, either accepting or
rejecting credentials presented by that user, which may be as simple as a username/
password pair. Examples of modules that do their work during this phase are
mod_auth and mod_auth_dbm.
Authorization: Once the user’s identity has been verified, the user’s authorization
is checked to determine if the user has permission to access the requested resource.
Although authenticating (identifying) the user and determining that user’s autho-
rization (or level of access) are separate functions, they are usually performed by
the same module. The modules listed as examples for the Authentication phase
also register callbacks for the Authorization phase.
MIME type checking: Determines the MIME type of the requested resource,
which can be used to determine how the resource is handled. A good example is
mod_mime.
FixUp: This is a catch-all phase for actions that need to be performed before the
request is actually fulfilled. mod_headers is one of the few modules on my system
that register a callback for this request phase.
Response or Content: This is the most important phase of the Request cycle; the one
in which the requested resource is actually processed. This is where a module is reg-
istered to handle documents of a specific MIME type. The mod_cgi module is
registered, for example, as the default handler for documents identified as CGI scripts.
Logging: After the request has been processed, a module can register functions to
log the actions taken. While any module can register a callback to perform actions
during this phase (and you can easily write your own) most servers will use only
mod_log_config (covered in Chapter 12) to take care of all logging.
Cleanup: Functions registered here are called when an Apache child process shuts
down. Actions that would be defined to take place during this phase include the
closing of open files and perhaps of database connections. Very few modules actu-
ally register a callback for this request phase. In fact, none of the standard
modules use it.
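Modules written in Perl can register for these same phases through mod_perl, using nothing more than a pair of configuration directives along these lines (Apache::MyModule and myhandler are placeholder names used only for this sketch):
PerlModule Apache::MyModule
PerlPostReadRequestHandler Apache::MyModule::myhandler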
The first line preloads the module into the Apache:: namespace. The second line registers
the myhandler function within that module as a callback during the PostReadRequest
phase of the request cycle. When a request comes in, Apache will ensure that myhandler,
which has already been loaded and compiled by mod_perl, is called. The function will
have access to Apache’s internal data structures and functions through the Perl Apache
API calls (each of which, in turn, calls a function from the Apache API).
You’ll learn more about working with mod_perl in Chapter 8. One of the best and most
complete sets of online documentation for any Apache module is that available for mod_
perl at perl.apache.org/guide/.
Installing Third-Party Modules
There is no rigid specification to which Apache modules from third-party sources must
adhere. There is no standard procedure for installing and using Apache modules. There
are guidelines, however, that define a “well-behaved” Apache module, and most mod-
ules are fairly standard and therefore quite simple to install and configure.
Most third-party modules, though, are better compiled outside the Apache source tree. In
other words, they are compiled in a completely separate directory from the Apache
source, as dynamic shared object (DSO) modules, and are loaded at runtime by Apache.
Although the module source can be placed inside the Apache source tree and the APACI
configuration utility instructed to compile it as a DSO, I strongly recommend against
doing this. If you intend to use a module as a DSO, it can be compiled on its own, outside
the Apache source tree, using a utility called apxs, which is provided with the Apache dis-
tribution. One advantage of compiling with apxs is that the resulting module, which will
have the extension .so for shared object, is a stand-alone module that can be used with
different versions of the server. This allows you to upgrade modules without recompiling
Apache, as you must do when a module is compiled within the Apache source tree using
APACI. More importantly, using DSO modules compiled with apxs allows you to
upgrade the Apache server without having to rerun the configuration for each module,
specifying the new Apache source tree.
There are nearly as many installation procedures as there are modules. Some install inside
the Apache tree; most can be compiled separately from Apache. Some simply compile a
DSO and leave you to manually edit httpd.conf; some configure httpd.conf for you.
Read the INSTALL file carefully before compiling any module, at least to get some idea of
how the installation proceeds and what options are available. In general, though, the best
way to compile and install Apache modules is to use the utility Apache has provided spe-
cifically for this purpose, apxs. Because most third-party modules are best compiled as
DSOs using apxs, that is the method I describe in this chapter. The only modules I rec-
ommend installing as statically linked modules are those that come with the standard
Apache distribution. These are automatically linked to Apache during the server instal-
lation unless at least one --enable-shared argument is passed to configure. Chapter 3
describes how standard modules are chosen and identified as statically linked or DSO
modules.
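For reference, an APACI build that compiles most of the standard modules as DSOs might be configured with something like the following (a sketch only; Chapter 3 covers these options in detail):
# ./configure --prefix=/usr/local/apache --enable-module=most --enable-shared=max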
                      Statically Linked Module               DSO Module
Module loaded when:   Module always loaded, even if          Module loaded only when
                      disabled and unused.                   specified in httpd.conf.
Recommended when:     The Apache configuration is simple,    Server configuration changes
                      requiring few add-on modules and       frequently or when modules are
                      few changes, and when fastest          frequently changed, upgraded,
                      possible loading is important.         or installed for testing.
http_core.c
mod_so.c
This example shows the most basic httpd daemon, which must always have the core
module linked into it, and optionally, the mod_so module that provides support for DSO
modules. All other module support is dynamically linked at runtime to the httpd process.
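You can verify which modules are statically compiled into your own httpd binary with its -l switch; on a minimal DSO-enabled build, the output looks something like this (the exact list depends on how you built the server):
# /usr/local/apache/bin/httpd -l
Compiled-in modules:
  http_core.c
  mod_so.c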
The mod_so module supplies the server with a new directive, LoadModule, that is used in the
Apache configuration file to designate a module for dynamic loading. When the server
reads a LoadModule directive from the configuration during its initialization, mod_so will
load the module and add its name to the list of available Apache modules. The module does
not become available, however, until an AddModule directive specifically enables it. The
AddModule directive is a core directive and is not specific to DSO modules. All modules,
even those that are statically linked into the Apache kernel, must be explicitly enabled with
an AddModule directive. Only DSO modules, however, require the LoadModule directive.
A DSO module exposes an external name for itself that does not necessarily have to match
the name of the shared object file. For example, if a module calling itself firewall_
module is stored in a file mod_firewall.so in the /libexec directory under the Apache
ServerRoot, it is enabled for use by the inclusion of the following two lines in the Apache
configuration file:
LoadModule firewall_module libexec/mod_firewall.so
AddModule mod_firewall.c
The LoadModule directive (supplied by mod_so) links the named module to the httpd pro-
cess, and then adds the module to the list of active modules. The module is not available,
however, until enabled by the AddModule directive, which makes the module’s structure,
its internal functions, and any directives it supports, available to Apache. As noted above,
the LoadModule directive has no meaning for statically linked modules, but an AddModule
line is required for all modules before they can be used. This permits the disabling of even
a statically linked module by simply commenting out or removing its associated
AddModule line in httpd.conf.
These two directives do not have to be located together; and in most cases, they are not.
Somewhere near the beginning of your httpd.conf file, in the general server configura-
tion, you should find a group of LoadModule directives, followed by a section consisting
of AddModule directives. Always remember that when you add a LoadModule directive,
you must add a corresponding AddModule directive to enable the loaded module.
If you disable a module by simply commenting out its AddModule directive, you will be
loading a module that is never used; and that, of course, is wasteful. Conversely, if you
have an AddModule directive without a corresponding LoadModule directive, the module
must be statically linked, or you will get a configuration error when you start the server
because you will be attempting to enable a module the server knows nothing about. Gen-
erally, you should add and delete the LoadModule and AddModule directives in pairs.
The order in which DSO modules are loaded determines the order in which they are called
by Apache to handle URLs. As Apache loads each module, it adds the name to a list. DSO
modules are always processed in the reverse of the order in which they are loaded, so the
first modules loaded are the last ones processed. This is a very important thing to remember.
In later chapters, you’ll encounter some modules that must be processed in the correct order
to avoid conflicts. When a module must be processed before another, make sure its
AddModule line is placed after the other module in the httpd.conf file.
The internal list of modules can be erased with the ClearModuleList directive and then
reconstructed with a series of AddModule directives. If you compiled Apache to use DSO modules,
you’ll find that it does exactly that in the httpd.conf it created, which begins like this:
ClearModuleList
AddModule mod_vhost_alias.c
AddModule mod_env.c
AddModule mod_log_config.c
AddModule mod_mime.c
... lines deleted ...
You can refer to this section to see the processing order of modules, but change it only
with very good reason; altering Apache’s default ordering of the AddModule lines can
cause undesirable and unpredictable results. There are times, however, when the pro-
cessing order of modules needs to be changed. You may, for example, want to use mul-
tiple modules that provide the same functionality but in a specific order. In Chapter 14,
I’ll describe how the processing order of authentication modules is often important. There
are other cases where some modules fail to function completely if other modules precede
them. In Chapter 10, we’ll see an example.
If you do venture to change the file, remember the rule of first loaded, last processed.
Using apxs
Since the release of Apache version 1.3, Apache has been packaged with a Perl script
called apxs (for APache eXtenSion). This relatively simple utility is used to compile and
install third-party modules. One important benefit of using apxs rather than placing the
module in the Apache source tree and compiling it with the APACI configure script is
that apxs can handle modules consisting of more than one source file; configure cannot.
A few modules have special installation requirements; these modules generally come with
detailed instructions (usually in a file named INSTALL) that should be followed carefully.
Generally, modules that cannot be installed using the procedures detailed in this section
are those that must make modifications to the Apache source. The OpenSSL module
(mod_ssl), discussed in Chapter 15, is one such module. As you’ll see, during its instal-
lation this module makes extensive patches and additions to the Apache source and
requires a recompilation of Apache to work.
With those exceptions, however, nearly every Apache module can be compiled with apxs.
apxs is the preferred way to compile most third-party modules, and you should become
quite familiar with its use.
You can invoke apxs with combinations of the following arguments to control its actions.
-g Generates a template for module developers; when supplied with a module
name (using the -n switch) this option creates a source code directory with that
name and installs a makefile and sample module C source code file within it. The
sample C program is a complete module that can actually be installed; however,
it does nothing but print out a line indicating that it ran. Example:
# apxs -g -n mod_MyModule
-q Queries the apxs script for the values of one or more of its defaults. When the
apxs script is created during an APACI installation, default values for the fol-
lowing variables are hard-coded into the script: TARGET, CC, CFLAGS, CFLAGS_
SHLIB, LD_SHLIB, LDFLAGS_SHLIB, LIBS_SHLIB, PREFIX, SBINDIR, INCLUDEDIR,
LIBEXECDIR, SYSCONFDIR. Examples:
# /usr/local/apache/bin/apxs -q TARGET
httpd
# /usr/local/apache/bin/apxs -q CFLAGS
-DLINUX=2 -DMOD_SSL=204109 -DUSE_HSREGEX -DEAPI -DUSE_EXPAT -I../lib/
expat-lite
# /usr/local/apache/bin/apxs -q PREFIX
/usr/local/apache
TIP The default value for any apxs hard-coded variable can be overridden by
specifying a new value with the -S switch, for example:
# apxs -S PREFIX="/usr/local/apachetest" -c -n MyModule.so
-c Compiles and links a DSO module, given the name of one or more source
files (and, optionally, a list of supporting libraries). Using the -c argument to apxs
enables the following options:
-o outputfile Specifies the name of the resulting module file rather than
determining it from the name of the input file.
-D name=value Specifies compiler directives to be used when compiling the
module.
-Wl,flags Passes flags to the linker. Each flag must be specified as it would
appear if it were a command-line argument, and the comma is mandatory:
# apxs -c -Wl,-t MyModule.c
-i Installs a DSO module that has already been created with apxs -c into
its correct location, which is determined by the PREFIX variable hard-coded
into apxs, if not overridden with a -S switch. Using the -i apxs argument
enables two others:
-a Modifies the Apache configuration file (httpd.conf) to add LoadModule
and AddModule directives to enable the newly installed module.
-A Use this argument to add the lines, but leave them commented out so
they don’t take effect when Apache is started.
-e Works like -i but only edits the Apache configuration (used with -a or -A); it does not copy the module file into place.
-n Explicitly names the module when its name is not the same as that of the DSO file. Example:
# apxs -i -a -n mod_MyModule MyModule.so
The -c and -i arguments to apxs are usually combined. The following line will compile
a DSO from a single source file, install it, and modify the Apache configuration to load
it the next time Apache is started:
# apxs -c -i -a MyModule.c
Where to Find Modules
The central directory of third-party modules is the Apache Module Registry at modules.apache.org, which provides a short
description of each one’s function, along with information about the author and, most
importantly, a link to the site where the latest version of the module is maintained for
download. Figure 5.1 shows the search form for this site.
TIP To request a list of all the modules available on the site, simply enter an
empty search string.
Example of Installing a Module
The mod_random module redirects clients to a random URL from a list provided either in
Apache configuration directives or in a text file. You could use this module, if you’re the
serious sort, to implement a simple load-balancing scheme, randomly redirecting clients
to different servers. Or, you may (like me) simply use the module for fun.
1. Begin by downloading the module from the author’s site (modules.apache.org
links to it, but if you need the URL it’s http://www.tangent.org/mod_random).
Download the latest archive of the module, which was mod_random-0_9_tar.gz
when I snagged it. Unpack the archive into a location like /usr/local/src:
# pwd
/usr/local/src
# tar xvfz /home/caulds/mod_random-0_9_tar.gz
mod_random-0.9/
mod_random-0.9/ChangeLog
mod_random-0.9/INSTALL
mod_random-0.9/LICENSE
mod_random-0.9/Makefile
mod_random-0.9/README
mod_random-0.9/TODO
mod_random-0.9/VERSION
mod_random-0.9/mod_random.c
As you can see, there’s not a lot to the module; the only file you really need is the C source
code (mod_random.c). Everything else is simply nonessential support files and documenta-
tion. This working core of the module consists of only about 100 lines of easy-to-follow C
source code and is worth a glance if you intend to write your own simple module in C.
Installing and configuring the module took me about five minutes; if the author has done his
part, there’s absolutely no reason for anyone to be afraid of a third-party Apache module!
2. Make sure that the directory into which you extracted the files is the working
directory:
# cd mod_random-0.9
# ls -al
total 14
drwxr-xr-x 2 1001 root 1024 Dec 11 17:48 .
drwxr-xr-x 17 root root 1024 Mar 15 13:24 ..
-rw-r--r-- 1 1001 root 30 Dec 11 17:47 ChangeLog
-rw-r--r-- 1 1001 root 779 Dec 11 17:47 INSTALL
-rw-r--r-- 1 1001 root 1651 Dec 11 17:47 LICENSE
-rw-r--r-- 1 1001 root 820 Dec 11 17:47 Makefile
-rw-r--r-- 1 1001 root 738 Dec 11 17:47 README
-rw-r--r-- 1 1001 root 72 Dec 11 17:47 TODO
-rw-r--r-- 1 1001 root 4 Dec 11 17:47 VERSION
-rw-r--r-- 1 1001 root 3342 Dec 11 17:47 mod_random.c
3. At this point, you should read the installation instructions (INSTALL) and glance at
the contents of the makefile (if one has been provided). The makefile contains
instructions for a command-line compilation and installation, and it probably even
contains lines for stopping, starting, and restarting the Apache server. These lines
are added by the template-generation (-g) argument to apxs, described in the last
section. After demonstrating the manual use of apxs to install mod_random, I’ll show
how the Linux make utility can be used to simplify the already simple procedure.
4. Although you can break this up into a couple of steps, I found it convenient to
compile (-c) and install (-i) the module, and configure Apache to use it (-a) all
in one command:
# /usr/local/apache/bin/apxs -c -i -a -n random mod_random.c
gcc -DLINUX=2 -DMOD_SSL=204109 -DUSE_HSREGEX -DEAPI -DUSE_EXPAT -I../lib/
expat-lite -fpic -DSHARED_MODULE -I/usr/local/apache/include -c mod_
random.c
gcc -shared -o mod_random.so mod_random.o
cp mod_random.so /usr/local/apache/libexec/mod_random.so
chmod 755 /usr/local/apache/libexec/mod_random.so
[activating module `random' in /usr/local/apache/conf/httpd.conf]
5. Make sure that the installation procedure modified httpd.conf to use the new
module. I checked using the Linux grep utility to extract mod_random entries
from httpd.conf:
# grep mod_random /usr/local/apache/conf/httpd.conf
LoadModule random_module libexec/mod_random.so
AddModule mod_random.c
7. Then I checked server-info to ensure that mod_random is ready to rock (Figure 5.3).
This interesting server status page is explored in more detail in Chapter 11.
8. One part of any module configuration is always manual, and that is editing the
Apache configuration to make use of the module, usually by specifying the
module as a handler, and usually by including directives supplied by the module.
Our mod_random is no exception. I added the following section to my httpd.conf
file to take full advantage of all the module’s features:
# Brian Aker's mod_random configuration
#
<Location /randomize>
SetHandler random
RandomURL http://www.acme.com/
RandomURL http://www.apple.com/macosx/inside.html
RandomURL http://www.asptoday.com/
RandomURL http://atomz.com/
RandomFile /usr/local/apache/conf/random.conf
</Location>
12. I was immediately redirected to one of the sites I’d specified for random selection
in httpd.conf.
You may or may not eventually have a use for the mod_random module. But the basic pro-
cedure demonstrated in this example will be the same for any module you decide to add:
download the archived file; extract it into your working directory; compile and install it
(after reading the INSTALL file for instructions); check your httpd.conf file to verify
that the module has been added; manually edit the configuration file to specify your new
module as a handler; and finally test the configuration.
A makefile supplied with the module can automate the tasks I just described. You can use the included Makefile (if one exists) to perform the
steps I described above, but the additional convenience it offers is only slight. If you’ll
examine the makefile included with mod_random (Listing 5.1), you’ll see that it does nothing
but invoke the same commands I demonstrated above, using apxs to do the real work.
Listing 5.1 The makefile supplied with mod_random
##
## Makefile -- Build file for mod_random Apache module
##
# the used tools
APXS=/usr/local/apache/bin/apxs
APACHECTL=/usr/local/apache/bin/apachectl
# additional defines, includes and libraries
#DEF=-Dmy_define=my_value
#INC=-Imy/include/dir
#LIB=-Lmy/lib/dir -lmylib
# the default target
all: mod_random.so
# compile the shared object file
mod_random.so: mod_random.c
$(APXS) -c $(DEF) $(INC) $(LIB) mod_random.c
# install the shared object file into Apache
install: all
$(APXS) -i -a -n 'random' mod_random.so
# cleanup
clean:
-rm -f mod_random.o mod_random.so
# install and activate shared object by reloading Apache to
# force a reload of the shared object file
reload: install restart
# the general Apache start/restart/stop procedures
start:
	$(APACHECTL) start
restart:
	$(APACHECTL) restart
stop:
	$(APACHECTL) stop
The entire process of compiling and installing mod_random, using the supplied makefile,
can be summarized as follows:
make Compiles mod_random.so with apxs.
make install Uses apxs to copy mod_random.so to Apache and modify
server config.
make restart Restarts Apache using apachectl.
NOTE On the surface, the makefile appears to be the simplest way to install
third-party modules, and it often is; but this method depends on the existence of
a properly configured makefile. The standard makefile also depends on the
values of several environment variables to work properly. If these aren’t set on
your machine (or if you run multiple Apache configurations), the makefile will not
work as expected. This is a good reason to bypass the makefile and invoke the
proper apxs commands manually.
In Sum
From the very beginning, the Apache Web server was designed for easy expandability
by exposing a set of functions that allowed programmers to write add-in modules easily.
Support for dynamic shared objects was added with the release of Apache 1.3. DSO
allows modules to be compiled separately from the Apache server and loaded by the
server at runtime if desired, or omitted by an administrator who wants to reduce the
amount of memory required for each loaded copy of Apache.
The modular architecture of Apache is an important factor in the popularity of the server.
Because of its fairly uncomplicated programmers’ interface for extending the server’s capa-
bilities, a large number of modules are available (at no cost) from third-party sources.
Virtual Hosting
6
The term virtual hosting refers to maintaining multiple Web sites on a single server
machine and differentiating those sites by hostname aliases. This allows companies sharing
a single Web server to have their Web sites accessible via their own domain names, as
www.company1.com and www.company2.com, without requiring the user to know any extra
path information. With the number of Web sites on the Internet constantly increasing, the
ability to host many Web sites on a server efficiently is a critical feature of a first-class Web
server engine. Apache provides full support for virtual hosting and is a superb choice of
Web engine for hosting large numbers of virtual Web sites (or virtual hosts).
This chapter outlines the three basic methods of configuring a single Apache engine to sup-
port multiple Web sites: IP-based virtual hosts, name-based virtual hosts, and dynamic vir-
tual hosting. Much of the discussion focuses on the virtual hosting functionality that is built
into the Apache core rather than supplied by an add-on module. Apache supports
two types of virtual hosts:
■ IP-based virtual hosts are identified by the IP address on which client requests are
received. Each IP-based virtual host has its own unique IP address and responds to
all requests arriving on that IP address.
■ Name-based virtual hosts take advantage of a feature of HTTP/1.1 designed to elim-
inate the requirement for dedicating scarce IP addresses to virtual hosts. As mentioned
in Chapter 1, HTTP/1.1 requests must have a Host header that identifies the name of
the server that the client wants to handle the request. For servers not supporting
virtual hosts, this is identical to the ServerName value set for the primary server.
The Host header is also used to identify a virtual host to service the request, and
virtual hosts identified by the client Host header are thus termed name-based vir-
tual hosts.
Apache was one of the first servers to support virtual hosts right out of the box. Since ver-
sion 1.1, Apache has supported both IP-based and name-based virtual hosts. This chapter
examines both IP-based and name-based virtual hosts in detail.
The chapter also introduces the concept of dynamic virtual hosting, which uses another
module, mod_vhost_alias. Dynamic virtual hosts are virtual hosts whose configura-
tion is not fixed, but is determined (using a predefined template) from the request URL.
The advantage of dynamic virtual hosts is that literally thousands of these can be sup-
ported on a single server with only a few lines of template code, rather than having to
write a custom configuration for each.
In general, you will want to use IP-based virtual hosts whenever you must support
browsers that aren’t HTTP/1.1-compliant (the number of these in use is rapidly dwin-
dling), and when you can afford to dedicate a unique IP address for each virtual host (the
number of available IP addresses is also dwindling). Most sites will prefer to use name-
based virtual hosts. Remember, though, that with name-based virtual hosting, non-
HTTP/1.1 browsers will have no way to specify the virtual hosts they wish to connect to.
IP-Based Virtual Hosting
IP-based virtual hosts are defined by the IP address used to access them, and each IP-based
virtual host must have a unique IP address. Since no server machine has more than a few
physical network interfaces, it is likely that multiple IP-based virtual hosts will share the
same network interface, using a technique called network interface aliasing. You’ll see
how to do this on a Linux server later in this section.
Secure Sockets Layer (SSL is the subject of Chapter 15) requires each SSL Web server on the
Internet to have a unique IP address associated with its well-known hostname. Most site
hosting services and ISPs that provide SSL Web sites for their customers do so by using IP-
based virtual hosting, usually by aliasing multiple IP addresses to a small number of actual
network interfaces on each server. This has created a demand for IP-based virtual hosts—
even though its use was once declining in favor of name-based virtual hosting—and a com-
mensurate increase in demand for IP addresses to support IP-based virtual hosting.
IP-virtual hosts are quite easy to set up. Use the <VirtualHost IPaddr> container direc-
tive to enclose a group of directives that apply only to the virtual host specified (and iden-
tified by a unique IP address).
To create two IP-based virtual hosts on my Apache server, I placed the following section in
my httpd.conf file, making sure that this section followed any global scope directives. In
other words, any directives I wanted to apply to the Apache daemon processes or to the pri-
mary server and to provide default values for all virtual hosts are placed at the top of the file,
and they are the first read when Apache is started. For the following definitions to work, the
two IP addresses (192.168.1.4 and 192.168.1.5) must be valid IP addresses for the server,
either on separate interfaces or (as in my case) on the same interface using interface aliasing.
<VirtualHost 192.168.1.4>
ServerName vhost1.hiwaay.net
DocumentRoot /home/httpd/html/vhost1
</VirtualHost>
<VirtualHost 192.168.1.5>
ServerName vhost2.hiwaay.net
DocumentRoot /home/httpd/html/vhost2
</VirtualHost>
These are quite simple definitions. Appendix A lists all the directives that can be used
within a virtual host scope, but here I defined only a ServerName for the virtual host and
a path to the DocumentRoot for each virtual host. You can connect to the first virtual host
using the following URL:
http://192.168.1.4/
Keep in mind, though, that with IP-based virtual hosts, the hostname is irrelevant (except
to human users). Apache uses only the IP address to determine which virtual host will be
used to serve a connection. With name-based virtual hosting, as we’ll see, the hostname
is the determining factor in deciding which virtual host is used to serve a connection.
Figure 6.1 illustrates the complete request/resolution process for IP-based virtual hosting.
Later in the chapter, you’ll compare this to a similar diagram for name-based virtual
hosting.
[Figure 6.1 The request/resolution process for IP-based virtual hosting: with BindAddress *, a user who links to http://vhost1.hiwaay.net connects to 192.168.1.4:80 and sends GET / HTTP/1.1, and Apache serves the request from the <VirtualHost 192.168.1.4> container, whose ServerName is vhost1.hiwaay.net and whose DocumentRoot is /home/httpd/html/vhost1.]
A special form of the <VirtualHost> directive is used to define a default virtual host:
<VirtualHost _default_:*>
DocumentRoot /home/httpd/html/defaultvh
</VirtualHost>
Here, I’ve defined a virtual host that will respond to all requests that are sent to any port
that is not already assigned to another <VirtualHost> on any valid IP address. It is also
possible to specify a single port to be used by a _default_ virtual host, for example:
<VirtualHost _default_:443>
DocumentRoot /home/httpd/html/securedefault
</VirtualHost>
<VirtualHost _default_:*>
DocumentRoot /home/httpd/html/defaultvh
</VirtualHost>
This example shows that more than one _default_ virtual host can be defined. The first
<VirtualHost _default_> container defines a special default virtual host that is used for
unrecognized connections on the Secure Sockets Layer TCP port 443. Connections coming
in on that port are served the documents found in /home/httpd/html/securedefault. The
second <VirtualHost _default_> container handles unrecognized connections on all
other ports. It provides those connections access to the documents in /home/httpd/html/
defaultvh. Because the specific port 443 is already assigned to another virtual host, the
second <VirtualHost _default_> directive ignores port 443.
Most organizations, however, will probably opt to use name-based virtual hosts or, because of
limited IP address space, use them out of necessity.
To add virtual IP addresses to the network interface on a Linux server, log in as root and
use the ifconfig command. In the following example I add two new virtual Ethernet
interfaces for the server’s one physical Ethernet interface (eth0). These IP addresses do
not have to be sequential (as they are here), but they must be on the same network subnet:
# /sbin/ifconfig eth0:0 192.168.1.4
# /sbin/ifconfig eth0:1 192.168.1.5
To confirm this configuration change, I entered the ifconfig command without arguments.
The output is shown in Listing 6.1. As expected, the new virtual interfaces (eth0:0
and eth0:1) appear with the same hardware address (HWaddr 00:60:08:A4:E8:82) as the
physical Ethernet interface.
Listing 6.1 The Linux ifconfig command, showing physical and virtual network interfaces
# /sbin/ifconfig
eth0 Link encap:Ethernet HWaddr 00:60:08:A4:E8:82
inet addr:192.168.1.1 Bcast:192.168.1.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:463 errors:0 dropped:0 overruns:0 frame:0
TX packets:497 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
Interrupt:11 Base address:0x6100
NOTE The last interface shown in Listing 6.1, lo, is that of the loopback
address, which is a special virtual network interface, used primarily for testing,
that is always available on a Linux system with networking enabled. The special
IP address 127.0.0.1 is reserved on all Linux systems for this virtual interface.
I created an IP-based virtual host for each new virtual network interface I created on the
server, as shown below:
<VirtualHost 192.168.1.4>
ServerName vhost1.hiwaay.net
DocumentRoot /home/httpd/html/vhost1
</VirtualHost>
<VirtualHost 192.168.1.5>
ServerName vhost2.hiwaay.net
DocumentRoot /home/httpd/html/vhost2
</VirtualHost>
NOTE ARP is used with either Ethernet or Token Ring networks. The discussion
below is based on Ethernet but applies equally to Token Ring networks, although the
MAC address of a Token Ring node will differ from the Ethernet addresses shown.
If another machine on the network is already using one of the addresses, Linux refuses to configure the
Ethernet interface with the conflicting address and notifies the administrator of the conflict.
In other words, the machine that’s already using the IP address gets to keep it, and
new machines trying to use the interface politely defer to the incumbent.
Figure 6.2 illustrates how ARP allows other workstations (in this case an NT 4 workstation)
to discover the Ethernet address of a Linux workstation that has been configured (as
described above) to communicate using three IP addresses on the same Ethernet interface.
Name-Based Virtual Hosting
An HTTP/1.0 client can get by with a minimal request consisting of a single line, such as GET / HTTP/1.0.
The server that receives this request knows only the IP address of the interface on which
it was received; it has no way of knowing which DNS name the client used to determine
that IP address. To comply with HTTP/1.1, a second header must be present even in a
minimal request, to identify the host that should process the request. This is usually the
primary Apache server, but it may be any virtual host that has been defined in the Apache
configuration. An HTTP/1.1 request would look like this:
GET / HTTP/1.1
Host: jackal.hiwaay.net
The hostname (and, optionally, the TCP port) that is placed in the Host header by the client
browser is determined from the URL of the request itself. In cases where the Web server
is using IP-based virtual hosting, or supports no virtual hosts, the Host header is usually
ignored. But when name-based virtual hosts are used, the Host header can be very important.
The Host header can be used to identify a specific virtual host that has a matching hostname.
NOTE Failure to specify the Host header is an error if the client identifies itself
as HTTP/1.1-compliant. A client that does not want to send this header must not
specify HTTP/1.1 in its request. Netscape Communicator 4.7 sends the HTTP/1.0
header but also sends the Host field. This is not an error; but it would be an error
for Netscape to send the HTTP/1.1 header and omit the Host field. I suspect that
Netscape prefers to identify itself as an HTTP/1.0 client because some other
behavior of the HTTP/1.1 specification is not fully implemented in Netscape and
can't be relied on.
Name-based virtual hosts are, like IP-based virtual hosts, handled by the Apache core rather
than by a separate module. They make use of a special directive, NameVirtualHost. This directive desig-
nates an IP address for name-based virtual hosting. When NameVirtualHost is used, the
IP address it specifies becomes available only as a name-based virtual host. It is no longer
accessible by non-HTTP/1.1 clients and cannot be used for IP-based virtual hosting.
When Apache encounters the NameVirtualHost directive while reading httpd.conf, it sets
up a virtual host table for the IP address specified. Only a single NameVirtualHost directive
should exist for each IP address, designating that IP address for virtual hosting. Any number
of <VirtualHost> directives can identify the same IP address, however. As it parses
httpd.conf, Apache adds virtual hosts to the virtual host table for each IP address when-
ever it encounters a <VirtualHost> directive that specifies the same IP address as one ear-
lier designated for virtual hosting. After parsing httpd.conf, Apache has a complete list
of all virtual hosts for each IP address specified in NameVirtualHost directives.
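The directive accepts an optional port as well as an address; if your virtual hosts listen on a nonstandard port, include the port in both the NameVirtualHost directive and the matching <VirtualHost> containers, as in this sketch:
NameVirtualHost 192.168.1.1:8080
<VirtualHost 192.168.1.1:8080>
ServerName namedvh1.hiwaay.net
DocumentRoot /home/httpd/html
</VirtualHost>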
When it receives a request on any IP address specified by a NameVirtualHost directive,
Apache searches the associated list of virtual hosts for that IP address. When it finds a virtual
host that has a ServerName directive matching the Host header of the incoming
request, Apache responds to the request using the configuration defined in that virtual
host’s container. This process was illustrated earlier, in Figure 6.1.
In name-based virtual hosting, illustrated in Figure 6.3, the virtual host selected to service
a request is always determined from the Host header of the request. If no match is found
for the virtual host requested by the client, the first virtual host defined for the IP address
is served by default. This virtual host is called the primary virtual host. Don’t confuse this
with the primary server, which is defined by the directives outside all virtual host con-
tainers. Each request for a name-based virtual host must match an IP address that has
been previously designated for virtual hosting with the NameVirtualHost directive. Only
name-based virtual hosts will be served on an address so designated; the primary server
(that is, the configuration defined outside the VirtualHost directives) will never serve any
client connecting on an IP address designated for virtual hosting.
If Apache receives an HTTP/1.0 request sent to an IP address that you identified for name-
based virtual hosting (using a NameVirtualHost directive), but the Host header is unrecog-
nized (or missing), the primary virtual host always handles the request. The <VirtualHost
_default_> directive can never be used as a name-based virtual host, because the
<VirtualHost> directive for name-based virtual hosts must always contain a valid IP
address.
With name-based virtual hosting, you should include directives for the main server that
apply to all virtual hosts rather than trying to use the main server as a repository for direc-
tives that apply to the “default” name-based virtual host. The directives for that host
should be placed in the virtual host container for the primary virtual host. Remember that
the first virtual host you define in httpd.conf for an IP address previously designated for
name-based virtual hosting is the primary virtual host for that address.
On my system (Listing 6.2), I defined two very simple name-based virtual hosts.
Listing 6.2 Two simple name-based virtual hosts
NameVirtualHost 192.168.1.1
<VirtualHost 192.168.1.1>
UseCanonicalName off
ServerName namedvh1.hiwaay.net
DocumentRoot /home/httpd/html/
</VirtualHost>
<VirtualHost 192.168.1.1>
UseCanonicalName off
ServerName namedvh2.hiwaay.net
DocumentRoot /home/httpd/html/NamedVH2
</VirtualHost>
The NameVirtualHost directive at the top of Listing 6.2 designates 192.168.1.1 for name-based
virtual hosting, which means the primary server can no longer be reached on this IP address.
Instead, the server creates a virtual host list for this IP address and adds the names specified
by the ServerName directive as it processes the virtual host configurations that follow. When
this server is loaded, it will have a virtual host list for IP address 192.168.1.1 consisting of the
virtual hosts namedvh1.hiwaay.net and namedvh2.hiwaay.net. As requests arrive, the name
specified in the Host header of each request is compared against this list. When a match is
found, the server knows which virtual host will service that request.
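You can watch this selection happen by issuing a request by hand: telnet to the designated address, type a request line and a Host header, then press Enter on a blank line to end the request (a quick test sketch using the hosts defined above):
$ telnet 192.168.1.1 80
GET / HTTP/1.1
Host: namedvh2.hiwaay.net

Changing only the Host header to namedvh1.hiwaay.net should return content from the other virtual host’s DocumentRoot.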
Remember that (in contrast to IP-based virtual hosting) any request received on the IP
address 192.168.1.1 that does not properly identify one of its name-based virtual hosts
will be served by the first named virtual host. In Listing 6.2, this is namedvh1.hiwaay.net,
which becomes sort of a default name-based host for IP address 192.168.1.1. For that
reason, I explicitly set its DocumentRoot to match that of my primary server. I did this
mainly to make the configuration file more readable; it is not necessary to set this value,
because virtual hosts inherit the value of this directive, along with that of all other direc-
tives, from the primary server.
The second virtual host in Listing 6.2 has a separate DocumentRoot, and to HTTP/1.1
browsers that connect to http://namedvh2.hiwaay.net, it appears to be completely dif-
ferent from any other Web site on this server; the only hint that it’s one virtual host among
potentially many others is that it has the same IP address as other Web servers. This is not
apparent, however, to users who know the server only by its hostname. When setting up
name-based hosts that all apply to the same IP address, you should enter the hostnames
as Canonical Name or CNAME records in the DNS server for the domain. This will place
them in the DNS as aliases for the one true hostname, which should exist as an Address (or A)
record in the DNS.
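In a BIND zone file for hiwaay.net, the arrangement might look something like this (a sketch; the hostnames and address are the ones used in these examples):
jackal      IN  A      192.168.1.1
namedvh1    IN  CNAME  jackal
namedvh2    IN  CNAME  jackal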
There’s one more point to note about the example above. The namedvh2.hiwaay.net vir-
tual host can only be reached by browsers that send an HTTP/1.1 Host request header.
It can’t be reached at all by browsers that are unable to send this header. If you need to
provide access to name-based hosts from browsers that don’t support Host, read the next
section.
One way to do this uses the ServerPath directive, as in the following example:
<VirtualHost 192.168.1.1>
ServerName SomethingBogus.com
DocumentRoot /home/httpd/
</VirtualHost>
<VirtualHost 192.168.1.1>
ServerName www.innerdomain.com
ServerPath /securedomain
DocumentRoot /home/httpd/domain
</VirtualHost>
Here, I’ve defined a virtual host with the ServerName www.innerdomain.com directive.
HTTP/1.1 clients can connect directly to http://www.innerdomain.com. HTTP/1.0 clients
will by default reach the SomethingBogus.com virtual host (even though they don’t specify
it) because it is the first defined, but they can access the innerdomain.com host using a URL
that matches the ServerPath, like http://www.innerdomain.com/securedomain. Note,
though, that they are selecting the virtual host not with a Host header that matches its
ServerName, but with a URL that matches the ServerPath. Actually, it really doesn’t matter
what hostname a non-HTTP/1.1 client uses as long as it connects on 192.168.1.1 and uses
the trailing /securedomain in its request URL.
Now, if you publish the URL http://www.innerdomain.com, HTTP/1.1 clients will have
no trouble reaching the new virtual host; but you need some way to tell non-HTTP/1.1
clients that they need to use another URL, and that’s the purpose of the first virtual host.
As the first virtual host in the list, it will be the default page served to clients that don’t
use a Host header to designate a name-based virtual host. Choose a ServerName for this
host that no client will ever connect to directly; this virtual host is a "fall-through" that
will only serve requests from clients that don’t provide a valid Host header. In the
DocumentRoot directory for this virtual host, you should place a page that redirects non-
HTTP/1.1 clients to http://www.innerdomain.com/securedomain, similar to this:
<HTML>
<TITLE>
Banner Page for non-HTTP/1.1 browser users.
</TITLE>
<BODY>
If you are using an older, non-HTTP/1.1 compatible browser,
please bookmark this page:
<BR>
<A HREF=/securedomain>http://www.innerdomain.com/securedomain
</A>
</BODY>
</HTML>
Also, in order to make this work, always make sure you use relative links (e.g., file.html
or ../icons/image.gif) in the www.innerdomain.com virtual host’s pages. For HTTP/1.1
clients, these will be relative to www.innerdomain.com; for HTTP/1.0 clients, they will be
relative to www.innerdomain.com/securedomain.
Dynamic Virtual Hosting
Web sites hosted for ISP customers are usually name-based hosts, each with its own unique
Internet hostname and DNS entry. ISPs that provide this service to thousands of customers
need a solution for hosting huge numbers of virtual hosts. Even name-based hosting is
difficult to set up and maintain for so many virtual sites when an administrator has to set
each one up individually, even if only a few lines in httpd.conf are required for each.
Another technique, called dynamically configured mass virtual hosting, is used for very
large numbers of Web sites. A standard module provided with the Apache distribution,
mod_vhost_alias, implements dynamically configured hosts by specifying templates
for DocumentRoot and ScriptAlias that are used to create the actual paths to these direc-
tories after examining the incoming URL.
The entire purpose of mod_vhost_alias is to create directory paths for DocumentRoot
and ScriptAlias based on the request URL. It is a very simple module that is controlled by
only four directives, two for name-based and two for IP-based dynamic virtual hosting.
These directives implement name-based dynamic virtual hosting:
VirtualDocumentRoot Specifies how the module constructs a path to the
DocumentRoot for a dynamic virtual host from the request URL.
VirtualScriptAlias Works like ScriptAlias to construct a path to a direc-
tory containing CGI scripts from the request URL.
These implement IP-based dynamic virtual hosting:
VirtualDocumentRootIP Like VirtualDocumentRoot, but constructs the path
to the dynamic virtual host’s DocumentRoot from the IP address on which the
request was received.
VirtualScriptAliasIP Like VirtualScriptAlias, but constructs the path to a
directory of CGI scripts from the IP address on which the request was received.
Since mod_vhost_alias constructs paths for dynamic hosts as requests arrive at the
server, DocumentRoot and ScriptAlias essentially become variables that change
depending on the virtual host the client is trying to reach. Thus they do not have to be
explicitly specified for each virtual host in httpd.conf. In fact, no virtual host needs to
be specified in httpd.conf; the administrator has only to ensure that a directory exists for
each virtual host on the server. If the directory doesn’t exist, the requester gets the stan-
dard Not Found message (or, if you are being user-friendly, your customized Not Found
message).
Each of the directives uses a set of specifiers to extract tokens from the request URL and
then embed them into one of two paths, either the path to DocumentRoot or the path to
ScriptAlias for the dynamic virtual host. The specifiers that can be used are listed in
Table 6.1.
Table 6.1   mod_vhost_alias specifiers

Specifier   Meaning
%N          The Nth part of the server name. If the full server name is
            jackal.hiwaay.net, then %1 resolves to jackal, %2 to hiwaay, and so on.
%N+         The Nth part of the server name, and all parts following. If the full
            server name is jackal.hiwaay.net, then %2+ resolves to hiwaay.net.
%-N         The Nth part, counting backwards from the end of the string. If the full
            server name is jackal.hiwaay.net, then %-1 resolves to net, and %-2
            resolves to hiwaay.
%-N+        The Nth part, counting backwards, and all parts preceding it. If the full
            server name is jackal.hiwaay.net, then %-2+ resolves to jackal.hiwaay.
Each of the parts that can be extracted from the server name can be further broken down
by specifying a subpart, using the specifier %N.M, where N is the main part, and M is the sub-
part. If the directive being evaluated refers to a hostname, for example, each part of the
name is separated by the dot (.) character; the subparts are the individual characters of each
part. A URL beginning with http://caulds.homepages.hiwaay.net would yield the fol-
lowing parts:
%1 = caulds
%2 = homepages
%3 = hiwaay
%4 = net
Each of these parts can be further broken down into subparts, in this fashion:
%1.1 = c
%1.2 = a
%1.3 = u
...and so on.
A simple example should illustrate how this works. The mod_vhost_alias module
translates the VirtualDocumentRoot directive specified below into a DocumentRoot path
as illustrated in Figure 6.4. The purpose of the UseCanonicalName directive is explained
in the next section.
UseCanonicalName off
VirtualDocumentRoot /home/httpd/%1/%p
This example uses two of the specifiers that create a VirtualDocumentRoot. The first
specifier (%1) returns the first portion of the server name. In this case the server name is
provided by the client in a Host header of the HTTP request (as described in the discus-
sion of UseCanonicalName). The second specifier (%p) returns the TCP port of the request
for the dynamic virtual host—in this case, the Secure Sockets Layer port 443, because this
Apache server has been configured to listen for connections on this port. To run CGI
scripts from each dynamic virtual host, use a VirtualScriptAlias in exactly the same
way to specify a dynamically constructed path to a directory containing these scripts.
As Figure 6.4 shows, the request URL http://secure.jackal.hiwaay.net:443/login.html is translated into the path /home/httpd/secure/443/login.html, with %1 supplying secure and %p supplying 443.
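A companion VirtualScriptAlias for the same layout might look like this (the path is only illustrative):
VirtualScriptAlias /home/httpd/%1/%p/cgi-bin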
In the next example, an ISP has given its users their own virtual hosts and organized the
user home directories into subdirectories based on the first two characters of the user ID.
Figure 6.5 shows how the original request URL is mapped to a pathname using parts and
subparts.
UseCanonicalName off
VirtualDocumentRoot /home/httpd/users/%2.1/%2.2/%2/%1
As Figure 6.5 shows, the request URL http://www.caulds.myisp.com/welcome.html is mapped to /home/httpd/users/c/a/caulds/www/welcome.html, with %2.1 supplying c, %2.2 supplying a, %2 supplying caulds, and %1 supplying www.
When using virtual hosts with Apache, you need to give special consideration to the
hostname that Apache will use to refer to each virtual host. The next section covers the
UseCanonicalName directive, which is particularly important for virtual hosting.
On my local network, I connect to my Web server using its unqualified name, with a URL
like http://jackal. This URL would not work for someone on the Internet, so when my
server composes a self-referential URL, it always uses a fully qualified hostname and
(optionally) the TCP port number. The UseCanonicalName directive controls how
Apache determines the system’s hostname when constructing this self-referential URL.
There are three possible ways this directive can be used:
UseCanonicalName on Apache constructs a canonical name for the server using
information specified in the ServerName and Port server configuration directives
to create a self-referential URL.
UseCanonicalName off Apache uses the hostname and port specified in the
Host header supplied by HTTP/1.1 clients to construct a self-referential URL
for the server. If the client uses HTTP/1.0 and does not supply a Host header,
Apache constructs a canonical name from the ServerName and Port directives.
The UseCanonicalName off form of the directive is usually used with name-based
virtual hosts.
UseCanonicalName DNS Apache constructs a self-referential URL for the server
using the hostname determined from a reverse-DNS lookup performed on the IP
address to which the client connected. This option is designed primarily for use
with IP-based virtual hosts, though it can be used in a server context. It has no
effect in a name-based virtual host context. The UseCanonicalName DNS form of
the directive should only be used with IP-based virtual hosts.
In addition to controlling how self-referential URLs are constructed, the UseCanonicalName
directive is also used to set two variables that are accessible by CGI scripts through their
“environment,” SERVER_NAME and SERVER_PORT. If you look at a CGI script that displays the
environment variables, you can easily see how modifying the UseCanonicalName directive
affects the value of these two variables. Chapter 8 includes such a script, in the section on
CGI programming.
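If you want to see the effect without jumping ahead, a minimal shell CGI along these lines (the filename showname.cgi is arbitrary) prints just those two variables:
#!/bin/sh
# showname.cgi -- display the server name and port Apache passes to CGI programs
echo "Content-type: text/plain"
echo
echo "SERVER_NAME: $SERVER_NAME"
echo "SERVER_PORT: $SERVER_PORT"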
Notice that the second portion of each directive specifies a pathname constructed from
the IP address on which the HTTP request was received. Therefore the %4 in both direc-
tives is filled with the fourth part of the request IP address (the fourth number in the tra-
ditional dotted quad IP address format). If a request arrives on an interface whose IP
address is 127.129.71.225, the paths specified by VirtualDocumentRootIP and
VirtualScriptAliasIP directories are translated, respectively, into the following
directories:
/home/httpd/vhost/225
/home/httpd/vhost/cgi-bin/225
These directories need to be created on the server for the server to produce a meaningful
response. Since each of the parts of an IP address can take a value from 1 to 254, this
scheme permits up to 254 IP-based virtual hosts. The following directives would allow
64516 (254 × 254) virtual hosts, with pathnames like /home/httpd/vhost/116/244/,
but would also require an IP address for each. I show this for illustration only; you’d
never find something like this being done in the real world.
UseCanonicalName DNS
VirtualDocumentRootIP /home/httpd/vhost/%3/%4
VirtualScriptAliasIP /home/httpd/vhost/cgi-bin/%3/%4
Also note from these examples that no ServerName directive is used to assign each virtual
host its name. If the server needs to form a self-referential URL to refer to any of these vir-
tual hosts, the UseCanonicalName DNS directive instructs it to perform a reverse DNS
lookup to determine the server name from the IP address. It is not necessary for Apache
to perform this reverse DNS lookup to serve requests from the virtual host.
Guaranteeing Sufficient File Descriptors
The Apache server needs a file descriptor for each open log file and for each open network connection. Web connections are rarely open for very long, as clients connect, retrieve resources, and disconnect. Apache's log files, however, normally stay open for as long as the Apache server is running, in order to minimize the overhead required to open the file, write to it, and close it. This creates a problem when the number of file handles available to the Apache process is limited and a large number of virtual hosts are being supported. Each virtual host has at least two open logs, error.log and access.log.
File descriptors are usually constrained by three system limits. The first is called the soft
resource limit. A process cannot use a greater number of file descriptors than this limit,
but a user can increase the soft limit using the ulimit command up to the hard resource
limit. A user with root privileges can increase the hard resource limit up to the kernel
limit. The kernel limit is an absolute resource limit imposed by the running Linux kernel.
Recent versions of the Linux kernel have such a high kernel limit it can be considered
unlimited in most environments.
The hard limit and soft limits on the number of file descriptors a process can have open
are both set to 1024 in 2.2.x kernels. In Linux 2.0 (and older) kernels, these were set to
256. Use the ulimit command to determine the hard and soft limits on your Linux system, as follows. The number of file descriptors a process can have open at one time is shown as “open files.”
[caulds@jackal caulds]$ ulimit -Sa
core file size (blocks) 1000000
data seg size (kbytes) unlimited
file size (blocks) unlimited
max memory size (kbytes) unlimited
stack size (kbytes) 8192
cpu time (seconds) unlimited
max user processes 256
pipe size (512 bytes) 8
open files 1024
virtual memory (kbytes) 2105343
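To query just the open-files values, you can also ask ulimit for the soft and hard limits directly (the -n flag restricts the report to open files):
$ ulimit -Sn     # soft limit on open file descriptors
$ ulimit -Hn     # hard limit on open file descriptors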
NOTE I ran the commands as a nonprivileged user. Running them as root pro-
duces the same result.
Although it is unlikely that you will ever bump up against Linux’s limits on the number
of open file descriptors, you should be aware that they exist, especially if you intend to
support a large number of virtual hosts, each with its own log files. If you do need to increase the number of file descriptors available to Apache, do one of the following things:
■ Reduce the number of log files. Simply by having each virtual host write both its
error and access logs to the same file, you can reduce the number of required file
descriptors by half, though you may not want to do this, because it lumps all log-
ging into a single disk file that fills very rapidly, making it more difficult to locate,
isolate, and resolve errors encountered by the server.
■ Increase the file descriptor limit to the system’s hard limit prior to starting
Apache, by using a script like this:
#!/bin/sh
ulimit -S -n 1024
/usr/local/apache/bin/apachectl start
■ Add the following lines to one of your system startup scripts (probably /etc/rc.d/rc.local):
# Increase system-wide file descriptor limit.
echo 8192 > /proc/sys/fs/file-max
echo 24576 > /proc/sys/fs/inode-max
Then start Apache with a script that raises the per-process limit accordingly:
#!/bin/sh
ulimit -S -n 4096
/usr/local/apache/bin/apachectl start
NOTE Most situations simply do not call for such a large number of open file
descriptors for a single running application or process.
The potential problem is that Apache must know at least one IP address for the virtual
host, and we haven’t provided it. When Apache starts and reads these lines from its
httpd.conf file, it performs a DNS lookup for the IP address of the hostname given in the
<VirtualHost> directive. If for some reason DNS is unavailable, the lookup will fail, and
Apache will disable this particular virtual host. In versions earlier than 1.2, Apache will
then abort.
We no longer require Apache to perform a DNS lookup for the value provided by
<VirtualHost>, but we haven’t provided a second important piece of information
required for every virtual host, the ServerName. Apache determines the ServerName in
this case by performing a reverse-DNS lookup on 192.168.1.4 to find the associated host-
name. This reliance on a DNS query when Apache is started means we haven’t solved our
problem yet. The addition of a ServerName directive for the virtual host eliminates the
dependence on DNS to start the virtual host. The virtual host specification should read:
<VirtualHost 192.168.1.4>
ServerName vhost1.hiwaay.net
ServerAdmin [email protected]
DocumentRoot /home/httpd/html/vhost1
</VirtualHost>
TIP When setting up virtual host configurations, it is often helpful to use the
httpd -S command. This will not start the server, but it will dump out a descrip-
tion of how Apache parsed the configuration file. Careful examination of the IP
addresses and server names may help uncover configuration mistakes.
In Sum
Virtual hosting is used to maintain multiple Web sites on a single server machine. The sites are usually identified by unique hostname aliases in the DNS. Virtual hosts can be either IP-based (in which the IP address on which the request was received identifies the virtual host to handle the request) or name-based (in which the client designates the virtual host to handle the request using the HTTP/1.1 Host header).
The mod_vhost_alias module provides a way to create dynamic virtual hosts, in which the server knows nothing about the virtual host until a request arrives. All information about a dynamic virtual host is derived from the URL of the request or the IP address on which the request arrived. Dynamic virtual hosts are usually used to support large numbers of virtual hosts on a single server with only minimal configuration changes to the Apache server. Dynamic virtual hosts can also be either IP- or name-based, although IP-based dynamic virtual hosts are rarely used because of their requirement that each host have a unique IP address.
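As a reminder of how little configuration is involved, a minimal name-based dynamic virtual hosting setup might look something like this (the directory layout is an assumption, not taken from this chapter's examples):
# %0 interpolates the complete requested hostname
UseCanonicalName off
VirtualDocumentRoot /home/httpd/vhosts/%0/html
VirtualScriptAlias  /home/httpd/vhosts/%0/cgi-bin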
Up until this point, I’ve shown how to set up a working Apache server, but now the focus
of the book will change toward determining how that server will respond to requests and
how the content it delivers can be customized. In other words, we’ll be looking at more
than just the Apache engine, which is fairly simple. We’ll be looking at requests and
responses, and at customizing the responses returned by Apache, whether through configuration changes, additional modules, or programming. The next chapter discusses one
of the simpler, but very efficient, techniques for Web page customization, Server-Side
Includes.
Part 3 Advanced Configuration Options
Featuring:
■ Configuring Apache to run Server-Side Includes (SSI)
■ HotWired’s Extended SSI (XSSI)
■ Java Server-Side Includes (JSSI)
■ The Common Gateway Interface (CGI) and FastCGI
■ The mod_perl Perl accelerator
■ Using PHP and ASP for Apache
■ Java tools for Apache: Apache JServ, Java Server Pages (JSP),
and Resin
■ Aliasing and redirection with mod_alias
■ URL rewriting with mod_rewrite
■ Controlling Apache manually via the command line
■ GUI configuration tools
Chapter 7 Server-Side Includes
Server-Side Includes (SSI) offer the simplest way to add dynamic content to a Web
page. When the Web server receives a request for a page that may contain SSI commands, it
parses the page looking for those commands. If it finds any, they are processed by the Apache
module that implements SSI (usually mod_include). The results of this processing—which
may be as simple as the document’s last-modified date or as complex as the result of running
a CGI script—replace the SSI code in the HTML document before it is sent to the requesting
user. SSI commands are actually HTML comments (enclosed in <!-- and --> tags) that have
special meaning to the SSI processing module. A page that contains SSI commands adheres
to the requirements for HTML, and the SSI commands are ignored (as comments) if they
happen to reach a client browser without being parsed, processed, and replaced by the server.
Apache has included SSI for a very long time. Although it is implemented as an optional
module, this module is compiled into the server by default, and it is available in nearly every
Apache server. For simple functions, like automatically including the last date of modifica-
tion of the enclosing HTML document in the document itself, using SSI is far simpler and
more efficient than writing a CGI program to take care of the task. I believe every Apache
server should be configured to handle server-parsed documents whenever necessary.
SSI is not powerful enough to replace a programming language for generating complete
HTML pages, or for database querying, or any of the fun stuff that requires true program-
ming (although it does allow a page to call CGI scripts that can handle those more complex
tasks). SSI can’t come close to replacing any of the techniques discussed in Chapters 8 and 9
for Web programming, and SSI shouldn’t be considered an alternative to any of them. I
prefer to think of SSI as a built-in feature of Apache that can be used to augment these
techniques.
The version of SSI included with Apache is XSSI (for eXtended Server-Side Includes).
XSSI has been in use for so long that it is generally considered standard SSI. There are
extensions to XSSI, the most prominent of which are the HotWired extensions discussed
in detail later in this chapter. Another version you may hear of is SSI+, which adds a few
tags primarily of interest to Win32 programmers, the most important of which is an ODBC
tag used to retrieve data from databases using the Microsoft Open Database Connectivity
drivers. At the end of this chapter, Java developers can learn about another option, Java
Server-Side Includes (JSSI).
2. Use an Options directive to enable Includes for the directory (or directories) in
which you plan to place your server-parsed pages:
Options Includes
NOTE I first tried to set Options +Includes to say “enable the Includes
option” but, much to my surprise, this did not work! The + operator adds options
to an Options list that already exists. Since I had no Options list already set for my
DocumentRoot directory, the statement had no effect. It was necessary for me to
remove the + for the Options directive to take effect.
3. Specify a MIME content type for files with the .shtml extension:
AddType text/html .shtml
4. Add an Apache handler for .shtml files:
AddHandler server-parsed .shtml
NOTE The choice of .shtml as the extension for SSI files is conventional but
not strictly necessary. You just need to specify the same extension in both the
AddType and AddHandler statements, and save all HTML files containing SSI
commands with that extension.
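Taken together, the additions to httpd.conf might look roughly like the following (substitute your own DocumentRoot for the directory shown here):
<Directory "/usr/local/apache/htdocs">
    Options Includes
</Directory>
AddType text/html .shtml
AddHandler server-parsed .shtml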
After making these changes, restart Apache so they take effect:
# /usr/local/apache/bin/apachectl restart
/usr/local/apache/bin/apachectl restart: httpd restarted
SSI Tags
Part of the beauty of SSI is that it is implemented through such a simple mechanism: embedded HTML tags that have special meaning only to the SSI parser. SSI commands are legitimate HTML comments that appear between the HTML comment tags <!-- and --> and would be ignored by the client browser if they weren't parsed and removed by the Web server. SSI commands have the following general syntax:
<!--#command attribute=value attribute=value ... -->
Most SSI commands require at least one attribute=value pair. Only a few SSI com-
mands (such as printenv) can be used without an attribute=value pair. To prevent
confusion in interpreting the SSI line, it is a good practice to enclose the value in double
quotes, even if that value is a nonstring data type like an integer. The comment terminator
(-->) at the end of the line should be offset with white space. (This is not always required,
but I had problems running SSI when I failed to separate the final SSI token from the com-
ment terminator.)
SSI commands are parsed in-place and do not need to be placed at the beginning of the
line; you can use an SSI command to replace a single word in the middle of a sentence. In
Listing 7.1 and its output (Figure 7.1) you’ll see how SSI commands can be used to insert
values right in the middle of a line of text.
Table 7.1 Format strings used with the <config> SSI tag
String   Meaning
%%       Escapes a % character
%M       Minute (00–59)
%p       A.M. or P.M.
%S       Second (00–59)
%Z       The time zone (CST)
Listing 7.1 is an example of a Web page, formatted in HyperText Markup Language (HTML), that uses most of the time format tags from Table 7.1. Figure 7.1 shows how that page will look when viewed in a Web browser. HTML, as you may remember from Chapter 2, is a standard method of formatting documents for display, or rendering. By definition, a Web browser must be able to interpret some version of HTML. Most modern browsers support HTML 4, which includes nearly every element or tag one might conceivably require (version 4 is described at www.w3.org/TR/html4/). HTML is a work in progress, and variants of it have been spawned (with names like Extended HTML or Dynamic HTML), but all Web-browser software supports basic HTML.
<HTML>
<HEAD>
<TITLE>SSI "config" Element Test Page</TITLE>
</HEAD>
<BODY>
<center>
<H1>SSI "config" Element Test Page</H1>
</center>
<!--#config errmsg="mod_include unable to parse your code!" -->
<!--#config timefmt="%A" -->
Today is <!--#echo var="DATE_LOCAL"-->.
Some familiarity with HTML will be necessary to understand the SSI examples in this
chapter, but the important tags, and the only ones that will be explained in detail, are the
SSI tags. These can be identified in this example as those enclosed in slightly modified
brackets, <!--# -->. The first SSI tag in Listing 7.1, <!--#config, changes the default
error message to be displayed when the SSI parser encounters a problem. One of the two
include file tags, which attempt to bring HTML formatted documents into this one as
a page footer, is incorrect and will cause this error message to be displayed. The other footer
is correct, so you can see the result in Figure 7.1. Note that HTML tags in that footer are
properly rendered (as a button and as an embedded hyperlink).
This example also serves to illustrate the use of the config timefmt SSI tag to display the
current system time and date. Compare the SSI tags against the output, glancing back at
Table 7.1, and you can pretty easily see how these work.
As you can see, at least one statement in the HTML could not be parsed. But which one?
Where did the links to my e-mail come from? And why are there two separate references
to a footer.html file? Not surprisingly, the answers to all those questions are related.
The e-mail links are part of my standard page footer, displayed by calling my
footer.html file. One of the #include statements is correct and displays the footer page,
but the other has incorrect syntax and displays the error message. You’ll see exactly what
the error is when we look at the #include tag later in the chapter.
DOCUMENT_NAME The filename of the SSI document requested by the user.
DOCUMENT_URI The URL path of the SSI document requested by the user.
LAST_MODIFIED The last modification date of the SSI document requested by the
user. (When displayed by echo, will be formatted according to “config timefmt”.)
Listing 7.2 illustrates how the echo tag is used to display the values of all four of the SSI-
specific variables shown above, along with several selected variables from the CGI envi-
ronment. Figure 7.2 shows the results in a browser. The two time variables (DATE_LOCAL
and DATE_GMT) are displayed using the SSI default format, but could be tailored by pre-
ceding them with a “config timefmt” tag, as described in the last section.
<HTML>
<HEAD>
<TITLE>SSI Variable Include Test Page</TITLE>
</HEAD>
<BODY>
<center>
<H1>SSI Variable Include Test Page</H1>
</center>
<FONT SIZE=+1>
<ul>
Special mod_include Includes:
<ul>
for example:
<!--#exec cmd="/usr/bin/parser.sh rfc2626.html" -->
or a CGI script:
<!--#exec cgi="/cgi-bin/mycgi.cgi" -->
If the script returns a Location: HTTP header instead of output, this header is translated
into an HTML anchor (an embedded hyperlink). Listing 7.3 is an example of the exec tag
at work. The CGI script that it calls consists of only three lines; while it could do many
other things, it simply returns a Location: string (SSI is smart enough to translate this
into an anchor tag or hyperlink):
#!/usr/bin/perl -Tw
# This is anchor.cgi
use CGI;
print "Location: http://www.apache.org\n\n";
<HTML>
<HEAD>
<TITLE>SSI "exec Tag with Location:" Test Page</TITLE>
</HEAD>
<BODY>
<center>
<H1>SSI "exec Tag with Location:" Test Page</H1>
</center>
<br>
Clickable hyperlink: <!--#exec cgi="/cgi-bin/anchor.cgi" -->
<p>
<!--#include file="footer.html"-->
</BODY>
</HTML>
If an IncludesNOEXEC option is in effect for the directory containing the SSI file being
parsed, the exec tag will be ignored. The directive Options IncludesNOEXEC should be
in the .htaccess file in the directory or in the httpd.conf file.
WARNING For security reasons, you should avoid the use of the <exec cgi>
SSI tag, which will execute a file anywhere in the file system. This violates an
accepted Apache standard practice, that CGI scripts reside only in special pro-
tected directories and have specific filename extensions. Instead, use <include
virtual>, which can execute only standard CGI scripts that are accessible only
through a URL that is acceptable to Apache. This allows Apache to apply the secu-
rity measures applied to ordinary CGI scripts.
virtual The virtual variable is set to the filename or path relative to Apache’s
DocumentRoot. Use this when you want to specify a file using a partial URL.
The fsize and flastmod tags are examples of what I like best about SSI: They both have
very simple syntax and offer a very efficient way of doing what they do. Moreover, nei-
ther tries to do too many things, but each of them comes in very handy when you need
it. The next section illustrates them both in the same example (Listing 7.4) because they
are used in exactly the same manner. Figure 7.4 then shows how both tags are rendered
by a browser.
TIP Use the config tag as described above to format the file size printed by the
fsize tag.
The flastmod tag inserts the last modification date of a file into the document being parsed at the location of the flastmod tag. Like fsize, the file is specified in one of the following two ways:
file Identifies a filename and path relative to the directory containing the SSI document being parsed.
virtual The virtual variable is set to the filename or path relative to Apache's DocumentRoot. Use this when you want to specify a file using a partial URL.
TIP The format of the date printed by the flastmod tag is controlled using the
config tag as described earlier.
Listing 7.4 is an example of a document that makes use of both the SSI fsize and
flastmod tags. By referring to Figure 7.4, you can easily determine the use of each of these
tags. Note that the first fsize tag uses the file keyword to indicate that the referenced
file is relative to the directory in which the SSI document resides (in this case they must
be in the same directory). The second fsize tag makes use of the virtual keyword to
indicate that the file is relative to the Apache DocumentRoot (the file must be in the docs
subdirectory of that directory).
Listing 7.4 A Test Document for the SSI fsize and flastmod Tags
<HTML>
<HEAD>
<TITLE> SSI fsize and flastmod Test Page</TITLE>
</HEAD>
<BODY>
<center>
<H1>SSI Test Page</H1>
<H3>Testing fsize and flastmod</H3>
</center>
<!--#config sizefmt="bytes" -->
<!--#config timefmt="%I:%M %P on %B %d, %Y" -->
<p>Size of this file (bytes): <!--#fsize file="SSItest6.shtml" -->
<br>Last modification of this file:
<!--#flastmod file="SSItest6.shtml" -->
<p>Size of mod_fastcgi.html (bytes):
<!--#fsize virtual="/docs/mod_fastcgi.html" -->
<br>Size of mod_fastcgi.html (KB):
<!--#fsize virtual="/docs/mod_fastcgi.html" -->
<!--#include file="footer.html"-->
</BODY>
</HTML>
Figure 7.4 The SSI fsize and flastmod test document displayed in a browser
system path as possible.) When used in this fashion, mod_include constructs a URL from the include virtual command, and embeds the results of this URL (what would be returned if the URL was called directly by the client) into the calling document. If the resource indicated by the URL itself includes SSI commands, these are resolved, which allows include files to be nested.
Regardless of the calling method, the included resource can also be a CGI script, and
include virtual is the preferred way to embed CGI-generated output in server-parsed
documents (always use this method rather than exec cgi, which the SSI developers do
not recommend). Incidentally, if you need to pass information to a CGI script from an
SSI document, you must use include virtual; it isn’t possible using exec cgi.
Also, attempting to set environment variables (such as QUERY_STRING) from within an SSI
page in order to pass data to a CGI script won’t work. This sets a variable accessible only
to mod_include and doesn’t alter the environment variable with the same name. Instead,
pass variables to CGI scripts by appending ?variable=value to the query string of the
calling URL, as shown in Listing 7.5. This example demonstrates how a CGI script is called and passed a variable and value, and how the results are embedded in the HTML document passed to the browser. Figure 7.5 shows the resulting document displayed in a browser.
<HTML>
<HEAD>
<TITLE>include virtual Test Page</TITLE>
</HEAD>
<BODY>
<center>
<H1>Test of include virtual SSI Tag</H1>
</center>
<!--#include virtual="/cgi-bin/test1.cgi?testvar=Testing+for+Carl" -->
<!--#include file="footer.html"-->
</BODY>
</HTML>
Listing 7.6 The CGI Script Used with the SSI include Tag Test Document
#!/usr/bin/perl -Tw
#This is test1.cgi
#
#queries a table for a value
use strict;
use CGI qw(:standard);
use CGI::Carp;
my $output=new CGI;
my $TEST = param('testvar') || '';  # avoid "my ... if", which has undefined behavior in Perl
print $output->header;
print h3("Variable passed to and returned from CGI script:");
print h4("$TEST");
print $output->end_html;
Flow Control
The so-called flow control elements of SSI implement only the most basic execution con-
trol element, an if/else operator; they don’t provide the functions of execution branching
or nesting found in a real programming language. Here’s the basic implementation of the
if tag in SSI:
<!--#if expr="test_condition" -->
HTML-formatted text
<!--#elif expr="test_condition" -->
HTML-formatted text
<!--#else -->
even more HTML-formatted text
<!--#endif -->
Note that expr is a keyword and must be present. The if expr element works like the if
statement in a true programming language. The test condition is evaluated and, if the
result is true, the text between it and the next elif, else, or endif tag is included in the output stream, and any subsequent elif or else clauses are ignored. If the result is false, the next elif is evaluated in the same way.
SSI test conditions are almost always simple string comparisons, and return True or False
based on the result of one of the following possible operations:
Syntax                      Value
string                      True if string is not empty; False otherwise
string1 = string2           True if string1 is equal to string2
string1 != string2          True if string1 is not equal to string2
string1 < string2           True if string1 is alphabetically less than string2
string1 <= string2          True if string1 is alphabetically less than or equal to string2
string1 > string2           True if string1 is alphabetically greater than string2
string1 >= string2          True if string1 is alphabetically greater than or equal to string2
condition1 && condition2    True if both conditions are True (the AND operator)
condition1 || condition2    True if either condition is True (the OR operator)
Generally, you will be looking only for the existence of a match (using the = operator)
when working with regular expressions:
<!--#if expr="$DOCUMENT_URI = /^cgi-bin/" -->
However, you can also test for an expression that is not matched by negating the results
of the match using the != operator:
<!--#if expr="$DOCUMENT_URI != /^cgi-bin/" -->
Use parentheses for clarity when expressing SSI tags with several comparisons:
<!--#if expr="($a = test1) && ($b = test2)" -->
The following example evaluates to True if the request URI begins with either /cgi-bin/
or /cgi-vep/, False otherwise:
<!--#if expr="($DOCUMENT_URI=/^\/cgi-bin/) || ($DOCUMENT_URI=/^\/cgi-vep/)" -->
Listing 7.7 illustrates a very practical use of the if/else tag in SSI. If the IP address of the
connecting host, which is stored in the environment variable REMOTE_ADDR, matches the
regular expression in the first if expr expression, it indicates that the client is on the
Apache server’s subnet, and the user is presented with some information that external
users will never see. If the REMOTE_ADDR does not match in this expression, the user is not
on the local subnet, and the text in the else clause is sent to the requester. This contains
a line to simply tell remote users that some aspects of the page are invisible to them. In real
life, you’d probably keep them from knowing even that, instead presenting them with a
document intended for their eyes. Figure 7.6 shows how the results of Listing 7.7 are dis-
played in a browser.
<HEAD>
<TITLE>SSI File Include Test Page</TITLE>
</HEAD>
<BODY>
<center>
<H1>SSI File Include Test Page</H1>
<!--#if expr="$REMOTE_ADDR = /^192.168.1./" -->
XBitHack On Tests every text/html document within the scope of the directive to see if it should be handled as server-parsed by mod_include. If the user-execute bit is set, the document is parsed as an SSI document. If an XBitHack On directive were applied to the directory in the following example, index.html would not be identified as server-parsed until the chmod statement was issued to set the execute bit for the user:
# ls -al index.html
-rw-r--r-- 1 www www 3844 Jan 28 14:58 index.html
# chmod u+x index.html
# ls -al index.html
-rwxr--r-- 1 www www 3844 Jan 28 14:58 index.html
XBitHack Full Works just like XBitHack On except that, in addition to testing
the user-execute bit, it also tests to see if the group-execute bit is set. If it is, then
the Last-Modified date set in the response header is the last modified time of the
file. If the group-execute bit is not set, no Last-Modified header is sent to the
requester. This XBitHack feature is used when you want proxies to cache server-
parsed documents; normally, you would not want to do this if the document con-
tains data (from a CGI include, for example) that changes upon every invocation.
Here’s an example of setting the group-execute bit:
# ls -al index.html
-rw-r--r-- 1 www www 3916 Mar 10 08:25 index.html
# chmod g+x index.html
# ls -al index.html
-rw-r-xr-- 1 www www 3916 Mar 10 08:25 index.html
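The directive itself is simply placed in a directory context; a minimal sketch (the directory path here is an assumption, not taken from the book's configuration):
<Directory "/home/httpd/html">
    XBitHack Full
</Directory>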
NOTE The XBitHack directive, discussed in the last section, is fully imple-
mented in the HotWired Extended XSSI. Do not, however, expect it to be available
when using any other SSI implementation (like Apache JSSI).
A disclaimer on HotWired’s Web site states that the module has not been thoroughly
tested when compiled as a DSO module, but that is the way I compiled and tested it, and
I had no problems. I was unable to use HotWired’s instructions for compiling the module
from within the Apache source directory, disabling the standard mod_include and
enabling the HotWired version. Instead, I recommend compiling the module outside the
Apache source directory, as a DSO, and replacing the mod_include.so that came with
Apache. I used the following command line to compile the module, install it into Apache’s
libexec, and add the LoadModule and AddModule lines to Apache’s httpd.conf. The
LoadModule and AddModule lines will already exist if you are using the standard Apache
mod_include; they will be created, otherwise:
/usr/local/apache/bin/apxs -i -a mod_include.so
Another way to do this is simply to replace the standard Apache mod_include.c found
in the src/modules/standard directory in the Apache source tree with the HotWired
version and recompile Apache. I actually used both methods. I replaced the standard mod_
include.c in Apache, but rather than recompiling the entire server, I chose to make mod_
include.so with apxs and simply restarted the server. The next time I compile Apache,
I’ll be compiling the HotWired version of mod_include.
<FORM METHOD="GET"
ACTION="http://jackal.hiwaay.net/SSItest5.shtml">
<LABEL>Enter var1: </LABEL>
<INPUT type="text" name="var1" id="var1"><BR>
<LABEL>Enter var2: </LABEL>
<INPUT type="text" name="var2" id="var2"><BR>
<INPUT TYPE="submit" VALUE="Submit">
</FORM>
Listing 7.8 illustrates how the parse_form tag is used to create two SSI variables (form_
var1 and form_var2) from information entered by a user in a Web form, and made avail-
able to the SSI page through the variable QUERY_STRING. The purpose of the parse_form
tag is to easily convert user input into variables that can be used by other SSI code. These
variables might be used, for example, in an <include virtual> SSI tag to specify an
external document for inclusion in the HTML sent to the browser:
<!--#parse_form -->
<HTML>
<HEAD>
<TITLE>HotWired's XSSI Test Page</TITLE>
</HEAD>
<BODY>
<center>
to HotWired's XSSI echo tag. Note that the first time we echo FOO, the variable has not been set, and the default attribute is used by echo. Then, using SSI's set var= tag, we set the value to <p>. The second time it is echoed, the value is interpreted by the browser as a page tag. The third time we echo the value of FOO, using escape=html, the < and > characters are replaced with the HTML entities &lt; and &gt; before the value of FOO is sent to the browser. Figure 7.8 shows the result of HotWired XSSI parsing the echo test document.
<HTML>
<HEAD>
<TITLE>HotWired's XSSI Test Page</TITLE>
</HEAD>
<BODY>
<center>
<H1>HotWired's XSSI Test Page</H1>
</center>
<p>First: <!--#echo var="FOO" default="<b>Not Set</b>" -->
<!--#set var="FOO" value="<p>" -->
<p>Second: <!--#echo var="FOO" -->
<p>Third: <!--#echo var="FOO" escape="html" -->
<!--#include file="footer.html"-->
</BODY>
</HTML>
<HTML>
<HEAD>
<TITLE>HotWired's XSSI Test Page</TITLE>
</HEAD>
<BODY>
<center>
Figure 7.9 The HotWired random tag test page displayed in a browser
WARNING Unlike Apache mod_include, Apache JSSI does not implement the
IncludesNOEXEC feature, nor does it support an exec tag. The only way to run
external programs from Apache JSSI is through the <SERVLET> tag.
Although JSSI is a nice add-on if you are already running Java servlets, it does not justify the
complexity involved in installing servlet capability in Apache. SSI is simply not the best use
of Java. Java Server Pages are the ticket, now and for the future. If you run JSP, then Apache
JSSI is a simple installation and well worth the time spent to install it. If you don’t already
have servlet capability, look for a better reason than Apache JSSI to install it.
1. Unpack the Apache JSSI distribution archive; I unpacked it in the /usr/local/src directory:
# pwd
/usr/local
# ls /home/caulds/ApacheJSSI*
/home/caulds/ApacheJSSI-1_1_2_tar.gz
# tar xvzf /home/caulds/ApacheJSSI-1_1_2_tar.gz
ApacheJSSI-1.1.2/
ApacheJSSI-1.1.2/CHANGES
ApacheJSSI-1.1.2/docs/
2. Change to the src/java subdirectory under the newly created ApacheJSSI source
directory and type make to create the ApacheJSSI.jar Java archive (jar) file:
# cd ApacheJSSI-1.1.2/src/java
# make
# ls -al ApacheJSSI.jar*
-rw-r--r-- 1 root root 70135 Mar 9 12:21 ApacheJSSI.jar
-rw-r--r-- 1 root root 68859 Mar 9 12:01 ApacheJSSI.jar.ORIG
3. The newly created ApacheJSSI.jar file appears first, and the file that came with
the Apache JSSI distribution is shown with the .ORIG extension appended. A .jar
file can reside anywhere on your system, as long as the Java servlet runner can
locate it. I place mine in the lib subdirectory of my Java Servlet Development Kit
(JSDK):
# cp ApacheJSSI.jar /usr/local/JSDK2.0/lib
Wherever you choose to place the ApacheJSSI.jar file, you will point to it,
explicitly, from one of your Apache JServ configuration files. For now, just locate
it in a place that makes sense to you.
NOTE If you are unable to make the ApacheJSSI.jar file, don’t worry; the
Apache JSSI archive contains an ApacheJSSI.jar file that you can probably use
with your Apache setup without running make. When I first ran make, the file I cre-
ated was identical to the one provided with Apache JSSI. When I did it a second
time, the file differed in size (it was about 2% larger) but worked perfectly. I think
the difference is that I had changed JDK or JSDK versions. Keep in mind that you
aren’t compiling a system binary; you are compiling Java pseudocode, which
should compile and run the same on different system architectures. The Java Vir-
tual Machine (VM) interprets this pseudocode at runtime into machine-specific
binary code, which does differ drastically between machines.
4. The next steps are small changes to the Apache JServ configuration files to permit
Apache JServ to locate and run the Apache JSSI servlet. The first change is to the
main Apache JServ configuration file, which is actually called from the main
Apache httpd.conf by an Include line like this one taken from my system:
Include /usr/local/apache/conf/jserv/jserv.conf
5. In jserv.conf, ensure that the ApJServAction line for the Apache JSSI servlet is
uncommented; this line works like AddHandler in httpd.conf to define a servlet
as the proper handler for files with the .jhtml extension (again, that’s an arbi-
trary choice of extension, but as good as any other). Note that a different servlet
is specified to run JSP (.jsp) pages; the other ApJServAction lines in my config-
uration aren’t used and are commented out:
# excerpted from: /usr/local/apache/conf/jserv/jserv.conf
# Executes a servlet passing filename with proper extension in PATH_TRANSLATED
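For reference, on a typical Apache JServ installation the uncommented ApJServAction line looks something like the following (the /servlets mount point is an assumption; use whatever servlet mount your own jserv.conf defines):
ApJServAction .jhtml /servlets/org.apache.servlet.ssi.SSI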
6. Point to the Apache JSSI classes (the ApacheJSSI.jar file) in one of your
Apache JServ servlet zones. You may have a number of different servlet zones,
each probably corresponding to a different application, but you should always
have a root servlet zone defined, which is the default zone, and defined by the
file zone.properties. It is this file that I edited to add a class repository for
Apache JSSI (note that a repository can be a directory like /usr/local/apache/
servlets which contains individual class files, or it can point to an archive of
classes in a .jar file):
# excerpted from: /usr/local/apache/conf/jserv/zone.properties:
# List of Repositories
#######################
# The list of servlet repositories controlled by this servlet zone
# Syntax: repositories=[repository],[repository]...
# Default: NONE
# Note: The classes you want to be reloaded upon modification should be
# put here.
repositories=/usr/local/apache/servlets
repositories=/usr/local/JSDK2.0/lib/ApacheJSSI.jar
7. Apache JSSI only sets the following variables for use by SSI: DATE_GMT, DOCUMENT_
NAME, and LAST_MODIFIED. To enable Apache JSSI to work with the entire set of
standard SSI tags, it is necessary to pass the Apache JSSI servlet an initial
argument.
8. The exact method of doing this varies between different implementations of the
Java servlet engine. For Apache JServ, I added a single line to my zone.properties
file, at the very end, in a section reserved for passing initialization arguments to Java
servlets (the file contains simple syntax examples and instructions):
# Aliased Servlet Init Parameters
servlet.org.apache.servlet.ssi.SSI.initArgs=SSISiteRoot=/home/httpd/html
9. This line passes the SSI servlet its SSISiteRoot initialization argument, which tells it the document root to use when resolving server-side includes.
You’re now ready to install the following simple test application and crank her up.
<HTML>
<HEAD>
<TITLE>Java Server-Side Include (JSSI) Test Page</TITLE>
</HEAD>
<BODY>
<center>
<H1>Java Server-Side Include (JSSI) Test Page</H1>
</center>
<h3>Traditional SSI Includes:</h3>
<ul><b>
DATE_LOCAL: <!--#echo var="DATE_LOCAL"--> <br>
DATE_GMT: <!--#echo var="DATE_GMT"--> <br>
DOCUMENT_NAME: <!--#echo var="DOCUMENT_NAME"--> <br>
DOCUMENT_URI: <!--#echo var="DOCUMENT_URI"--> <br>
LAST_MODIFIED: <!--#echo var="LAST_MODIFIED"--> <br>
</b>
<SERVLET CODE="HelloWorld.class">
Your Web server has not been configured to support servlet tags!
</SERVLET>
<!--#include file="footer.html" -->
Note that I included some regular SSI so that you can see that the SSI tags work with Apache
JSSI. Three of the SSI variables used are available; two are simply not set by Apache JSSI.
The include tag works pretty much as you would expect. The real tag of interest is
<SERVLET>, which runs the servlet shown in Listing 7.12, displaying the output formatted
by the servlet to the client browser. Figure 7.10 shows how the output will look in the user’s
browser.
import java.io.*;
import javax.servlet.*;
import javax.servlet.http.*;
/**
* This is a simple example of an HTTP Servlet. It responds to
* the GET and HEAD methods of the HTTP protocol.
*/
public class HelloWorld extends HttpServlet
{
/**
* Handle the GET and HEAD methods by building a simple web
* page. HEAD is just like GET, except that the server returns
* only the headers (including content length) not the body we
* write.
*/
public void doGet (HttpServletRequest request,
HttpServletResponse response)
throws ServletException, IOException
{
// Set the content type and obtain a writer for the response body
response.setContentType("text/html");
PrintWriter out = response.getWriter();
String title = "Example JSSI Servlet";
out.println(title);
out.println("<H1>" + title + "</H1>");
out.println("<H2> Congratulations, if you are reading this, <br>"
+ "Java Server-Side Include (JSSI) 1.1.2 is working!<br>");
out.close();
}
}
Figure 7.10 The result of the Apache JSSI example displayed in a browser
What’s going on in this example? When you request a file with the .jhtml extension, the
Apache JServ module (mod_jserv.so) loads and runs the proper servlet to handle the file.
We could define any servlet to handle .jhtml files, but we defined org.apache.servlet
.ssi.SSI, which resides in a Java archive named ApacheJSSI.jar. The servlet is loaded
and run in a Java virtual machine created by Apache JServ, and it runs the servlet classes,
passing them the .jhtml file. The file is parsed, the standard SSI tags resolved, and any
Java classes defined in <SERVLET> tags are run and the output pasted back into the .jhtml
file, which is sent on to the requesting browser after being parsed. Notice the doGet
method, which is automatically called whenever the servlet is invoked by an HTTP
request using the GET method. This is provided, in accordance with the specification for
Java servlets, by the HttpServlet class from which our HelloWorld class is derived (illus-
trating class inheritance).
For the little it accomplishes, that’s a pretty expensive piece of code. Apache JSSI is far
more overhead than I required for my trivial application. Nevertheless, you can see that
a very powerful servlet could be used here, perhaps with a remote database query or
something equally complex.
In Sum
Server-Side Includes or SSI (often called server-parsed HTML) provides one of the sim-
plest ways to produce dynamic Web pages without true programming. SSI is imple-
mented through special SSI tags, but otherwise the instructions consist of standard
HTML text. SSI is usually used to provide features like displaying the current time, the
date and time of the last file modification, or including standard text from other doc-
uments. Although SSI is rarely an alternative to a real programming language, it can be
used for tasks like querying or updating a database, sending an e-mail message, or using conditional statements to determine whether certain actions are taken or whether or not specific text is displayed.
In the next chapter, we begin a journey through the most popular programming techniques used by Web site designers today. As an Apache administrator, you require a working familiarity with each, and in particular, knowledge of how they interface with Apache. The next two chapters will tell you what you need to know to install and use the programming methodologies that power the majority of the dynamic Web sites on the Internet.
Chapter 8 Scripting/Programming with CGI and Perl
In the early days of the Web, programming usually meant enhancing a Web site by
adding simple user interactivity, or providing access to some basic services on the server
side. Essentially, programming for the Web in those days meant interpreting input from the
user and generating specific content for that user dynamically (“on-the-fly”). A simple Web
program might take user input and use it to control a search engine or a database query.
Web programming has evolved from those very simple programs that added user interac-
tivity and automation to Web pages. Today, Web-based applications are often full-fledged
production systems, complete electronic storefronts, or front ends to complex, powerful
databases. Such applications are often implemented using the three-tier business com-
puting model, where the application or Web server usually makes up the middle tier, and
the Web browser is often used as the bottom tier or user interface. The top tier of this model
usually consists of large database server systems and has no direct interaction with the end
user (or bottom tier).
Changes in the requirements for Web programming are a direct result of the changing role
of the Internet Web server. The Web is no longer simply a delivery medium for static Web
pages. Chances are, if you are writing programs for the Web today, they are likely to be an
integral part of someone’s Internet business strategy. Your ability to program an application,
even a very simple application, is probably critical to the success of your Web project.
Although larger Web sites generally have both a Webmaster and a content provider, often
with sharply divided responsibilities, at many sites the two roles have been increasingly
merged. I don’t know any sharp Apache administrator who isn’t keenly interested in pro-
gramming techniques for the Web. Like the topic of security (discussed in Chapters 14
and 15 of this book), programming is one of those formerly peripheral topics that have
become an integral part of the Apache administrator’s required knowledge base.
This is the first of two chapters on scripting/programming for the Apache server. There
are a number of good programming methodologies for the Web; no single language is
clearly superior to all the rest, and each has its adherents and, in many cases, religious
zealots. There will always be someone who will try to tell you that there’s only one way
to program a Web-based application, and if you aren’t using that technology, you’re
behind the times. Don’t believe it. Your choice of programming language or methodology
shouldn’t be based on what is most popular at the moment, but rather should fit your par-
ticular need, as well as the skills you already possess. When selecting a programming
methodology, you must look at what your competencies are, and what you enjoy most;
all the programming tools discussed for Web programming in this chapter and the next
are quite adequate for developing commercial-quality Web applications.
In this chapter we’ll cover what is still the most widespread approach—using the Common
Gateway Interface (CGI) or its newer variant, FastCGI, and the Perl scripting language.
Chapter 9 will look at some of the newer tools and techniques available, including PHP,
Apache JServ, ASP, JSP, and Resin. Each of these tools can be used to successfully create
real-world Web-based applications. They can all be used on the same server, and a single
Web application might make use of more than one tool.
The goal of these chapters is not to teach you “how to program” in the languages covered;
entire books have been written on those topics. The focus instead is on how the tool is
used with Apache. A simple programming example for each tool will serve to show the
basics of how it is used. The examples I provide are simple, but not trivial. Each demon-
strates how to extract data from a database using a simple Structured Query Language
(SQL) query. In essence, each is a full three-tier application, providing a simple user-input
form along with a mid-tier server program that takes the user input and uses it to query
the third tier, a database engine, that might be on a completely separate server.
For additional information on all of the programming methodologies mentioned in
this chapter and the next one, be sure to see the “Programming Resources” section of
Appendix B. The best of these provide numerous examples of working code. Since I
believe studying program examples is undoubtedly the best way to learn programming
techniques, I have provided working examples in each topic discussed in the next two
chapters.
The Common Gateway Interface (CGI)
There does not appear to be a big demand for any of the changes under consideration for the new version.
For most purposes, CGI can be considered a fairly static mechanism. Learning CGI means
that you won’t soon have to learn a new programming methodology or see your Web-
based application suddenly become obsolete.
Any program that can be executed from the command line on the server can be used with
CGI. This includes compiled programs written in C or C++, or even COBOL or Fortran.
Scripting languages like Perl, Tcl, or shell scripting languages are the most popular ways
to write CGI programs. Scripts are usually much quicker to write than compiled pro-
grams. Since the client browser provides the user interface for Web applications, the
scripts contain only the basic code required for data I/O and are smaller and easier to
maintain. Minor code changes to scripts don’t require compilation and linking, which
speeds up and simplifies code design, testing, and maintenance.
As a general-purpose programming interface, CGI offers some advantages over propri-
etary Web programming interfaces like Netscape’s NSAPI, Microsoft’s ISAPI, and even
the Apache programming interface. Although these interfaces offer the programmer sub-
stantially better performance and easier access to the inner workings of the Web server,
CGI is far more widely used for several reasons. The first is that CGI is independent of
both server architecture and programming language, allowing the programmer great
freedom to choose the language best suited for a particular programming task. I regularly
use a combination of C, Tcl, Perl, and even shell scripts for CGI programming tasks.
CGI also offers complete process isolation. A CGI program runs in its own process
address space, independently of the Web server, and it communicates only input and
output with the server. Running CGI programs outside the program space of Apache not
only protects the server from errant CGI processes (even the most serious errors in a CGI
program cannot affect the Web server), it also provides protection against deliberate
attempts to compromise the security or stability of the server.
Last, but certainly not least, CGI offers the tremendous advantage of being a simple
interface to learn and use. For most programming tasks, CGI offers more than enough
functionality and adequate performance, without imposing heavy demands on the
programmer.
Scripting languages are so popular for CGI Web programming tasks that many texts
simply refer to CGI programs as CGI scripts. In fact, the Perl language owes much of its
popularity to its early adoption by Web site administrators for CGI applications. In the
past few years it has seen wide acceptance as a general-purpose scripting language, espe-
cially where cross-platform compatibility is a strong concern. Many programmers con-
sider Perl to be the de facto standard for writing CGI scripts. Actually, nearly any
language can be used to write CGI programs, including compiled languages like C. But
Perl is the most popular, and it’s the one I’ve chosen to best illustrate the use of CGI. Just
remember that CGI is not limited to scripting languages, and Perl is not limited to Web
programming.
The CGI examples provided in this section are all written in scripting languages, but there is no reason that a compiled language like C could not be used in exactly the same way.
In Linux, when a program is invoked, it is passed a set of data called the process envi-
ronment. This is a list of name=value pairs. Typically, one process that is invoked by
another inherits a copy of that process’s environment (it is said to inherit the envi-
ronment of its parent process). This provides one way for a process to pass data to
a process it creates. By tailoring its own environment before starting a process, the
parent process can control the environment of the process it invokes.
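A trivial shell illustration of this inheritance (the script and variable names are purely hypothetical):
#!/bin/sh
# child.sh -- prints a variable it can only have inherited from its parent's environment
echo "GREETING is: $GREETING"
Invoking it as GREETING="hello" ./child.sh places the variable in the child's environment for that one invocation, and the script prints GREETING is: hello.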
When Apache receives a request for a resource that it recognizes as a CGI script or pro-
gram, it spawns the process by making calls to the Linux operating system. The process
is completely independent of Apache, with one important exception: Its standard output
pipe remains connected to the Apache server process, so that Apache receives all output
from the program that is directed to standard output (or stdout). If the CGI program is
a Linux shell script, like the example below, the echo statement is used to send text to
stdout. (Later in this chapter, in the section “A Script to Return the Environment,” we’ll
add a little formatting to this script to generate the output shown in Figure 8.1.)
#!/bin/sh
echo "Content-type: text/plain"
echo
echo "Environment variables defined:"
echo
env
Apache does not communicate directly with the CGI process it spawns. Instead, as the parent of the process, it has some degree of control over the environment in which the process runs. In order to pass data to the process it creates, Apache places its data in environment variables that can be read by the process. In our simple CGI process, the shell command env reads the environment and reports it to its standard output file handle (stdout). Apache receives this output through the pipe it maintains to the script's stdout
handle and sends it to the requesting user.
To test this script, create a new file in a directory defined by a ScriptAlias directive in
httpd.conf, and place in the file the statements shown above. (Chapter 4 shows how to use
ScriptAlias.) You must also ensure that the file has an extension associated in httpd.conf
with an Apache handler as described in the next section. In the default httpd.conf file
provided with the Apache distribution, you will find the following line:
#AddHandler cgi-script .cgi
Removing the leading # character that marks this line as a comment causes Apache to
treat all files with a name ending in the .cgi extension as CGI scripts, and they will be
executed using the CGI mechanism. Under Linux, it is not necessary to identify each type
of script by a different extension, and I use the .cgi extension to identify all CGI scripts
on my systems, without regard to the actual content of the file. The first line of all scripts
should contain the full pathname of the script processor, preceded by the hash-bang (#!)
characters, as in our example:
#!/bin/sh
NOTE Every resource served by Apache that is not associated with a specific
handler is processed by a handler named (not surprisingly) default-handler, pro-
vided by the core module.
Defining Directories
The most common way to define resources for execution as CGI programs is to designate
one or more directories as containers for CGI programs. Security is enhanced when CGI
programs reside in a limited number of specified CGI directories. Access to these direc-
tories should be strictly controlled, and careful attention paid to the ownership and per-
missions of files that are stored there.
Two slightly different directives provide a means of identifying a directory as a container
for CGI scripts: ScriptAlias (introduced in Chapter 4) and ScriptAliasMatch. Both
directives work like a simple Alias directive to map a request to a directory that may not
exist under DocumentRoot, and they designate a directory as a container for CGI scripts.
ScriptAlias is simpler, so we’ll look at it first.
The following line, found in the standard Apache distribution, defines a directory to con-
tain CGI scripts:
ScriptAlias /cgi-bin/ "/usr/local/apache/cgi-bin/"
Apache does not concern itself with protecting those resources.
Finally, make sure that the user execute bit is set (avoid setting the group or other execute bits). On all Apache servers that I've administered, I've created a www group account that includes the user accounts of all the members of the Web team. A directory listing of one of my CGI directories is shown below. You can see that the CGI scripts are all owned by the nobody user (that is, the Apache httpd process running as nobody), although members of the www group have full read-write privileges, and all other users are strictly disallowed.
# ls -al
total 31
drwxr-x--- 2 nobody www 1024 Apr 20 16:39 .
drwxr-x--- 7 www www 1024 Mar 25 13:20 ..
-rwxrw---- 1 nobody www 743 Feb 25 15:10 CGIForm.cgi
-rwxrw---- 1 nobody www 685 Feb 25 16:40 CGIForm2.cgi
-rwxrw---- 1 nobody www 2308 Feb 9 16:20 CGITest1.cgi
-rwxrw---- 1 nobody www 738 Feb 29 16:04 JavaForm.cgi
-rwxrw---- 1 nobody www 987 Feb 9 11:34 MySQLTest1.cgi
-rwxrw---- 1 nobody www 987 Feb 9 17:06 MySQLTest2.cgi
-rwxrw---- 1 nobody www 736 Mar 1 15:06 PHPForm.cgi
-rwxrw---- 1 nobody www 15100 Feb 9 09:11 cgi-lib.pl
-rwxrw---- 1 nobody www 349 Feb 9 11:24 environ.cgi
-rwxrw---- 1 nobody www 443 Feb 26 13:57 environ.fcgi
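Based on the description that follows, the ScriptAliasMatch equivalent of the ScriptAlias line shown earlier would look something like this (the exact regular expression is an assumption):

ScriptAliasMatch ^/cgi-bin(.*) "/usr/local/apache/cgi-bin$1"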
Here, any request URL that begins with /cgi-bin (followed by any other characters) will
be mapped to the file system using the fixed path /usr/local/apache/cgi-bin with the
content of the first back-reference to the regular expression match appended. The back-
reference $1 is filled with the contents of that part of the request URL that matched the
portion of the regular expression contained in parentheses. In this case, it should always
match a slash followed by a valid filename containing the CGI script.
In general, use ScriptAliasMatch only when you find it impossible to phrase your
URL match as a plain string comparison. I have never found it necessary to use
ScriptAliasMatch, and I consider regular expressions unduly complicated for this
purpose.
Defining Files
Although the simplest and most commonly used means of identifying files as CGI scripts
is to place them into directories reserved for scripts, you can also identify individual files
as CGI scripts. To do this, use the AddHandler directive, which maps an Apache handler
to files that end with certain filename extensions. The following line, for example, defines
the standard cgi-script handler to be used for processing all files ending with the exten-
sions .pl or .cgi. Typically CGI scripts will be given the .cgi extension, but since CGI
scripts can be written in more than one language, you may prefer to retain the .pl exten-
sion to more easily identify scripts written in Perl.
AddHandler cgi-script .cgi .pl
The AddHandler directive is valid only in a directory scope, either within a <Directory>
container in httpd.conf or as part of an .htaccess file. It cannot be used as a global direc-
tive, and therefore can’t be used to define all files with a certain extension as CGI scripts,
regardless of where they occur.
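As a sketch, a directory-scoped AddHandler (the path here is hypothetical) would look something like the following; note that Options +ExecCGI is also needed before Apache will execute scripts outside a ScriptAlias’d directory:

<Directory "/home/httpd/html/scripts">
    Options +ExecCGI
    AddHandler cgi-script .cgi .pl
</Directory>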
Defining Methods
Although you are unlikely to ever need it, the Script directive, provided by the mod_
actions module, invokes a CGI script whenever the requesting client uses a specified
HTTP request method. The request method must be GET, POST, PUT, or DELETE.
The following Script directive calls a CGI script to handle all user DELETE requests:
Script DELETE /cgi-bin/deleteit.cgi
The related Action directive, also provided by mod_actions, associates a CGI script with a
MIME type so that the script is invoked whenever a resource of that type is requested:
Action text/html /home/httpd/cgi-bin/ParseMe.cgi
This example defines a particular CGI script as the handler for all HTML files. When any
HTML file is requested, the file will first be passed through the script ParseMe.cgi, which
does a string search for dirty language and replaces it with more acceptable text.
The following environment variables are specific to the request being fulfilled by the
gateway program:
SERVER_PROTOCOL The name and revision of the information protocol this
request came in with. Format: protocol/revision, as HTTP/1.1.
SERVER_PORT The port number to which the request was sent.
REQUEST_METHOD The method with which the request was made. For HTTP,
this is GET, HEAD, POST, etc.
PATH_INFO The extra path information, as given by the client. In other
words, scripts can be accessed by their virtual pathname,
followed by extra information at the end of this path. The
extra information is sent as PATH_INFO. The server should
decode this information if it comes from a URL before it is
passed to the CGI script.
PATH_TRANSLATED The server provides a translated version of PATH_INFO,
which takes the path and does any virtual-to-physical
mapping to it.
SCRIPT_NAME A virtual path to the script being executed, used for self-
referencing URLs.
QUERY_STRING The information that follows the question mark in the URL
that referenced this script. This query information should
not be decoded in any fashion. This variable should always
be set when there is query information, regardless of
command-line decoding.
REMOTE_HOST The hostname making the request. If the server does not
have this information, it should set REMOTE_ADDR and leave
this unset.
REMOTE_ADDR The IP address of the remote host making the request.
AUTH_TYPE If the server supports user authentication, and the script is
protected, this is the protocol-specific authentication
method used to validate the user.
REMOTE_USER Set only if the CGI script is subject to authentication. If the
server supports user authentication, and the script is
protected, this is the username they have authenticated as.
REMOTE_IDENT If the HTTP server supports RFC 931 identification, then this
variable will be set to the remote username retrieved from the
server. Usage of this variable should be limited to logging
only, and it should be set only if IdentityCheck is on.
SCRIPT_FILENAME The absolute path to the CGI script.
SERVER_ADMIN The e-mail address provided in Apache’s ServerAdmin
directive.
The following variables are not defined by the CGI specification but are added by the
mod_rewrite module, if it is used:
SCRIPT_URI The absolute URL, including the protocol, hostname, port, and
request.
SCRIPT_URL The URL path to the script that was called.
REQUEST_URI The URL path received from the client that led to the script that
was called.
In addition to the variables shown above, header lines from the client request are also
placed into the environment. These are named with the prefix HTTP_ followed by the
header name. Any - characters in the header name are changed to _ characters. The server
may choose to exclude any headers it has already processed and placed in the environ-
ment, such as Authorization or Content-type.
As a good example of how this works, consider the User-Agent request header. A CGI
script will find the value of this header, extracted from the user request, in the environ-
ment variable HTTP_USER_AGENT.
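A minimal Perl sketch of reading that variable from a CGI script:

#!/usr/bin/perl
# Report the browser's User-Agent string back to the client
print "Content-type: text/plain\n\n";
print "Your browser identified itself as: $ENV{'HTTP_USER_AGENT'}\n";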
The SetEnv Directive Sets the value of an environment variable to be passed to CGI
scripts, creating the variable if it doesn’t already exist:
SetEnv PATH /usr/local/bin
This changes the value of the PATH variable passed to CGI scripts to include only a single
path. All programs called by the CGI script must reside in this path (or be called by their
full pathname).
The UnsetEnv Directive Removes one or more environment variables from the environ-
ment before it is passed to CGI scripts:
UnsetEnv PATH
You might remove the PATH variable from the CGI environment to avoid the possibility
of a malicious hacker planting a Trojan horse somewhere in the PATH where it would be
executed instead of a legitimate program the script was trying to call. In general, however,
the PATH that is passed to CGI scripts (inherited from the Apache httpd process that called
the script) should contain only protected directories that nonprivileged users cannot write
to. Many site administrators prefer to remove the PATH and reference all external scripts
or utility programs by their full pathname. This is certainly safe, but it is much better to
protect the directories that are included in the PATH variable passed to CGI scripts.
The PassEnv Directive Specifies one or more environment variables from the server’s
environment to be passed to CGI scripts:
PassEnv USER
The PassEnv directive cannot be used to create a new ENV variable; it can only designate
a variable in the httpd process’s environment that is to be included in the environment
that CGI scripts inherit. In this case, we are passing the value of USER, which indicates the
Linux user ID under which the Apache httpd process is running (by default, this is UID
-1, corresponding to user nobody). You might wish to have a script abort with an error
message if this value is not what the script expects.
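A minimal Perl sketch of that kind of check (the expected value nobody is an assumption that depends on your configuration, and it requires PassEnv USER in httpd.conf):

#!/usr/bin/perl
# Abort unless the server passed the USER value we expect
unless (defined $ENV{'USER'} && $ENV{'USER'} eq 'nobody') {
    print "Content-type: text/plain\n\n";
    print "Refusing to run: unexpected server user.\n";
    exit;
}
print "Content-type: text/plain\n\n";
print "Running as $ENV{'USER'}\n";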
A CGI script can perform its own tests on request header information and take the necessary
actions. Replace the BrowserMatch directive, for example, with lines in your CGI program
that test the User-Agent header of the request to identify the browser used to send the
request, and take action accordingly.
The SetEnvIf Directive Defines one or more environment variables based on an
attribute that is associated only with the current request being processed. In most cases
this attribute is one of the HTTP request headers (such as Remote_Addr, User_Agent,
Referer). If not, the attribute is tested to see if it is the name of an environment variable
set (by other SetEnv or SetEnvIf directives) earlier in the processing cycle for the current
request (or in a wider scope, such as the server scope).
The syntax of the SetEnvIf directive is
SetEnvIf attribute regex envvar[=value] [...]
If the attribute matches regex, then envvar is set to a value defined in =value (if it exists)
or set to 1 otherwise. If the attribute does not match regex, no action is performed.
CGI scripts can be written so that, when the environment variable IS_ROBOT is present
(indicating that the script’s output will go to a Web-indexing robot), the output is tailored
for indexing engines. Web indexing robots generally ignore and don’t download embedded
graphics or banner ads; therefore, the page returned to robots should be text-rich and
packed with key words and phrases for the indexing engine.
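A sketch of how the two pieces might fit together; the robot-matching pattern is a hypothetical example, not an exhaustive list. In httpd.conf:

SetEnvIf User-Agent "(Googlebot|Slurp|spider|crawler)" IS_ROBOT=1

And in the CGI script:

#!/usr/bin/perl
print "Content-type: text/html\n\n";
if ($ENV{'IS_ROBOT'}) {
    # text-rich output, packed with key words, for indexing engines
    print "<HTML><BODY><H1>Zip Code Lookup</H1>Keyword-rich text here.</BODY></HTML>\n";
} else {
    # full page, including graphics and banner ads, for ordinary browsers
    print "<HTML><BODY><IMG SRC=\"/images/banner.gif\"><H1>Zip Code Lookup</H1></BODY></HTML>\n";
}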
the Unix vulnerabilities had received so much exposure. Now that NT is more widely
used in a server role, it is also suffering (perhaps unfairly) from the perception that its net-
work security model is weak. When it comes to security, neither Linux nor NT has an
indisputable advantage over the other; both platforms contain vulnerabilities that can be
exploited by a malicious attacker. I believe that the Linux community is more open about
security risks, though, and it acts more quickly to solve those that are discovered.
A properly written CGI script is no more insecure than any other Web program. A few
simple guidelines can be very helpful in writing secure CGI scripts.
The Apache server and any pro-
cesses it creates should be owned by a user account with limited privileges. This
is covered in detail in Chapter 14.
Avoid passing user input of any kind to the shell for processing. Perl scripts pass
data to the shell for processing in several ways. Perl spawns a new shell process to
execute commands enclosed in backtick characters (` `) or included as arguments to
system() or exec() function calls. This should be avoided. The following examples
illustrate how user data might end up being interpreted by a shell process:
illustrate how user data might end up being interpreted by a shell process:
system("/usr/lib/sendmail -t $foo_address < $input_file");
or
$result=`/usr/lib/sendmail -t $foo_address < $input_file`;
In both of these lines, the shell is passed user input as an argument to the sendmail
process. In both examples, the shell that processes the line can be tricked into exe-
cuting part of $input_file as a separate process. If a malicious person were able
to trick your system into running a line like this:
rm *
you could be in trouble. That is the main reason why the Apache processes that
respond to user requests should never run as root. The code below shows a better
way to pass data to a process. Note that, while the shell is used to run sendmail,
the user input is passed to the sendmail process through a pipe, and the shell
never sees the contents of the variable $stuff:
open(MAIL, "|/usr/lib/sendmail -t");
print MAIL "To: $recipient\n";
print MAIL $stuff;
close(MAIL);
In all CGI scripts, explicitly set the value of the PATH environment variables,
rather than simply accepting the value inherited from the Apache process. I rec-
ommend setting this value to a single directory in which you place scripts or other
executable programs you trust. I’ve already shown one way to do this using the
SetEnv and UnSetEnv directives. You can also do the same thing from within CGI
scripts if, for example, you don’t have access privileges that allow you to modify
the httpd.conf to modify the environment for all CGI scripts. The following line,
when included in a Perl CGI script, clears all environment variables and resets the
value of PATH to a “safe” directory:
delete @ENV{qw(IFS CDPATH ENV BASH_ENV)};
$ENV{"PATH"} = "/usr/local/websafe";
Alternatively, set PATH to a null value and call all external programs from your
CGI script using their full pathname. Basically, before doing a system call, clear
the PATH by issuing a statement like the following:
$ENV{"PATH"} = "";
Taint checking derives its name from the fact that Perl considers any data that your script
receives from an outside source, such as unmodified or unexamined user input from a
Web form, to be tainted. Perl will not allow tainted variables to be used in any command
that requires your script to fork a subshell. In other words, if taint checking is enabled and
you attempt to fork a shell and pass it data that Perl regards as tainted, Perl aborts your
script, reporting an error similar to the following:
Insecure dependency in `` while running with -T switch at temp.pl line 4,
<stdin> chunk 1.
A Web programmer often needs to use external programs, passing data that was received
as input from an unknown user. One of the most common examples of this is using a mail
transport agent (on Linux, this is most likely the ubiquitous sendmail utility) to e-mail
data using input received from a client. The following line is the most commonly cited
example of an absolute CGI scripting no-no:
system("/usr/sbin/sendmail -t $useraddr < $file_requested");
This takes a user-entered address and filename and mails the requested file to the user.
What’s wrong with this? By inserting a ; character into $file_requested, an attacker can
easily trick the shell into believing it is being passed one command, separated from a
second distinct command by this special shell metacharacter. The shell will often be quite
happy to run the second command, which might try to do something nasty on behalf of
your attacker.
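For instance, a hypothetical malicious value for $file_requested might look like the following; the shell would run sendmail with the first fragment and then execute the rm as a second command:

myfile.txt; rm -rf /home/httpd/html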
If Perl is so careful not to use tainted input from the client, how is it possible to pass any
input safely? There are basically two ways.
The first way is to avoid passing data directly to the shell. This works because most
hackers are trying to exploit the shell itself and trick it into running unauthorized com-
mands on their behalf. You can avoid the use of the shell by opening a system pipe to the
program intended to accept the input. Replace the system command above with the fol-
lowing line:
open(PIPE, "| /usr/sbin/sendmail -t");
In this example, the shell never sees the user’s input, which is piped directly to the
sendmail executable. This means that attempts to exploit the shell are thwarted.
The second way is to “untaint” the data. To do this, use a regular expression pattern
match to extract data from the tainted variable using () groups and back-references to
create new variables. Perl will always consider new variables created from data extracted
from a tainted variable in this manner to be untainted. Of course, Perl has no way of
knowing whether the new variables have been examined carefully to ensure that they
present no security risk when passed to the shell, but it gives the programmer the benefit
of the doubt. Perl assumes that any programmer who has applied a regular expression
match to tainted variables has also taken enough care to remove dangerous metacharac-
ters from the variable. It is the programmer’s responsibility to make sure this assumption
is a correct one.
For the e-mail example above, you could untaint the $file_requested variable using the
following section of Perl code:
if ($file_requested =~ /(\w{1}[\w-\.]*)\@([\w-\.]+)/) {
$file_requested = "$1\@$2";
} else {
warn ("DATA SENT BY $ENV{'REMOTE_ADDR'} IS NOT A VALID E-MAIL ADDRESS:
$file_requested: $!");
$file_requested = ""; # successful match did not occur
}
In this example, the variable is matched to ensure that it conforms to the proper format
for an e-mail address. The regular expression in the first line takes a little work to inter-
pret. First, remember the regular expression rules that {} braces enclose a number spec-
ifying how many times the previous character must be repeated to make a match, that []
brackets enclose sets of alternative characters to be matched, and that \w refers to a word
character (defined as characters in the set [a-zA-Z0-9_]). The first line can thus be read as
“if the content of $file_requested matches any string containing at least one word char-
acter, followed by any number of word characters, dashes, or periods, followed by the lit-
eral character @ followed by at least one pattern consisting of word characters, dashes or
periods, then perform the following block.” The parentheses are used to enclose sections
of the regular expression that are later substituted into $n back-references, where n cor-
responds to the number of the parenthetical match. In the next line, the first set of paren-
theses (which matches that portion of the variable to the left of the @ character) is later
substituted into $1; the second set of parentheses (matching the portion of the variable to
the right of the @ character) is substituted for $2. The result then replaces the old value
of $file_requested, which, having been processed by a regular expression, is now
marked as untainted for future use by Perl.
The else clause of the if statement handles those situations where the regular expression
fails to match $file_requested, which means that the variable does not have the
expected format of an e-mail message. In this case, the script will print a warning, which
will be written to Apache’s error log, along with the IP address of the remote host that
submitted the tainted data and a copy of that data. This information might be helpful in
locating a hacker trying to exploit a CGI weakness on the server. Immediately after log-
ging the failure to match, the Perl script empties the $file_requested variable, essen-
tially discarding the user’s input.
Avoid the temptation to untaint your Perl variables without doing any real checking. This
would have been easy to do in the previous example with two lines of code:
$file_requested =~ /(.*)/;
$file_requested = $1;
This fragment matches anything the user enters and simply overwrites the variable with
its existing contents, but Perl assumes that a check for malicious input has been per-
formed and untaints the variable. Absolutely nothing has actually been done, however.
The programmer who does this should probably just turn off taint checking rather than
resort to this kind of deception. It is likely to lull other programmers into a false assump-
tion that since the script is taint-checked, it must be safe.
The ScriptLog directive designates a log file that captures the output of CGI scripts, which
is primarily useful for debugging. The log file can be given as an absolute pathname or as a
pathname relative to the ServerRoot:
ScriptLog /var/log/cgilog
or
ScriptLog logs/cgilog
The Apache httpd process owner should have write access to the log you specify. Note
that ScriptLog is valid only in a server context; in other words, you cannot place the
directive within a container directive. In particular, you cannot specify different log files
for different virtual hosts.
Since the output of all CGI scripts will be logged (not just error messages), your logfile
will tend to grow rapidly. The ScriptLogLength directive is useful for limiting the size of
the logfile. The maximum byte size set with this directive limits the size to which the log
file will grow (the default value of ScriptLogLength is 1MB). The following line would
set the maximum log file size to half a megabyte:
ScriptLogLength 524288
One other directive is used to control CGI logging. ScriptLogBuffer can be used to limit
the size of entries written to the CGI log. This can be especially useful in limiting the
growth of the log when the entire contents of PUT or POST requests (in which the client
browser sends data to the server) are being logged. Since the contents of these two HTTP
request methods are unlimited, they can quickly fill a log file. The default value of this
directive is 1KB (1024 bytes). The following line will limit entries written to the CGI log
to one-fourth that size:
ScriptLogBuffer 256
Using CGI.pm
Lincoln Stein’s CGI.pm is a very large Perl module that uses Perl 5 objects to perform
simple Web-related tasks, such as the HTML tagging required by many HTML elements
(headers, forms, tables, etc.). The module also manages the CGI interface to the Web
server by providing a mechanism for capturing user input into a Perl hash (Perl’s
associative array type). The related %ENV hash contains environment variables and their
values as easy-to-access data pairs. For example, in Perl, you can access (or dereference)
the value of the environment variable QUERY_STRING using $ENV{QUERY_STRING}.
The module also provides some of the more advanced features of CGI scripting, including
support for file uploads, cookies, cascading style sheets, server PUSH, and frames. The
CGI.pm Perl module is designed to be used with standard CGI or Apache mod_perl (dis-
cussed in a later section) and simplifies the use of these Web programming techniques, but
does not replace either. The module is far too extensive to cover in detail here, but my CGI
examples throughout this chapter make use of it, and illustrate some of the basic CGI.pm
methods (or functions, for those not yet thinking in object terms). Speaking of object-
orientation, though, CGI.pm makes internal methods (or functions) accessible either as
Perl 5 objects or as traditional functions. With CGI.pm, you can choose to use either form,
or both, if you wish. CGI.pm even emulates the ReadParse function from cgi-lib.pl
(a Perl/CGI library that many Web programmers cut their teeth on). This means “legacy”
Perl/CGI scripts don’t have to be rewritten to use CGI.pm.
You can obtain CGI.pm from the Comprehensive Perl Archive Network (or CPAN) search
site at
http://search.cpan.org/
or
ftp://ftp-genome.wi.mit.edu/pub/software/WWW/
#!/usr/bin/perl
#Environ.cgi - Show environment variables set by the server
#
print "Content-type: text/html\n\n";
print "<HTML><HEAD><TITLE>Environment Variables</TITLE></HEAD><BODY>";
print "<H2>Environment Variables:</H2>";
print "<HR>\n";
foreach $evar( keys (%ENV)){
print "<B>$evar:</B> $ENV{$evar}<BR>";
}
print "</BODY></HTML>\n";
The first line of the script designates Perl as the script interpreter; in other words, this is
a Perl script (the .cgi extension says nothing about the contents of the file, but it ensures
that Apache spawns the file using CGI). The output of the script (the print statements)
is redirected to the requester’s browser in the form of an HTTP response. Note that the
first response is the Content-type HTTP header, which causes the browser to render the
rest of the output as HTML-formatted text. This header is followed by two consecutive
newline characters (\n\n), an HTTP convention used to separate HTTP headers from the
HTTP content or payload. Figure 8.1 shows the page as rendered by a Web browser.
If you know a little Perl, you’ll realize that the script accesses a hash (Perl’s associative
array type) named %ENV, iterating through the hash, displaying each hash entry
key and value. The %ENV hash contains the environment inherited by all Perl scripts; access
to the environment, therefore, requires no special function in Perl—it is provided without
charge by the Perl interpreter.
This script is extremely handy to have in your CGI directory. Take a moment now to
install it and execute it from a Web browser. This will allow you to use it to view the envi-
ronment provided through CGI, and that environment will change as you add certain
modules or use different request methods. I can access this file on my server using
http://jackal.hiwaay.net/cgi-bin/Environ.cgi
As an experiment, you can pass a variable to any CGI script using a request URL such as:
http://jackal.hiwaay.net/cgi-bin/Environ.cgi?somevar=somevalue
When you try that on your server, you should see the additional data you passed in the
environment variable QUERY_STRING: somevar=somevalue. That’s how information is
passed to CGI scripts when the GET request method is used. More often, however, data
is sent to the Web server with the POST request method. When POST is used, Apache uses
the script’s standard input handle (stdin) to send the data to the script in a data stream.
A CGI script that handles POST requests is a bit more difficult to write, but utilities like
the Perl CGI.pm (discussed earlier) module make this much easier. Remember, CGI.pm is
not required for using Perl with CGI and Apache; it is a convenience, but one well worth
taking the time to learn.
MySQL, a Simple Relational Database for Linux
Many of the Web programming examples in this book make a simple query of a rela-
tional database management system using the structured query language (SQL).
With slight modification, these examples will work with virtually any RDBMS you can
run on Linux.
You can download the very latest development releases or the latest stable produc-
tion release, as well as binary distributions, or contributed RPMs, from http://
mysql.com/. I have downloaded the source code for MySQL and found it easy to
compile and install, but since I wanted only a running SQL engine and had no interest
in customizing the code, I have since taken to installing MySQL from RPMs. You can
get these from the mysql.com site, but I prefer to use the old standby, RPMFind.net.
MySQL’s version numbers change pretty rapidly; the version I installed may be
behind the current release by the time you read this. It’s a database server and per-
forms a pretty mundane role, when you get right down to it. I don’t worry too much
about having the latest version; I’m happy as long as I’ve got a stable SQL engine that
is always available.
You’ll need to get several pieces for a whole system. For the 3.22.29 version I last
installed, for example, I downloaded the following four RPMs:
I took one more step, however, to make the excellent documentation that is provided
easily available from my server through a Web browser. Adding the following line to
my httpd.conf allows me to read the MySQL docs using the URL http://jackal
.hiwaay.net/MySQL.
Alias /MySQL "/usr/doc/MySQL-3.22.29"
The -p argument in the command line above (and in subsequent examples) causes
MySQL to prompt for the user’s password. In this case, the MySQL user’s identity is that
of the invoking user (and I was logged in as root when I invoked these commands).
MySQL is started and the database is opened like this:
# mysql -p zipcodes
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 105 to server version: 3.22.29
I created a single table in the database. Named zips, it consists of three fields: a 25-
character string for the city name, a two-character string for the state abbreviation, and
a five-character string for the postal Zip code, which is the primary index into the table
and, therefore, cannot be empty (NOT NULL).
mysql> create table zips (city char(25), state char(2),
-> zip char(5) NOT NULL, PRIMARY KEY (zip));
The 46,796 rows of the database were retrieved from a text file using the following
MySQL command line:
mysql> LOAD DATA LOCAL INFILE "zips.txt" INTO TABLE zips
-> FIELDS TERMINATED BY ', ' ENCLOSED BY '"';
Query OK, 46796 rows affected (2.66 sec)
Records: 46796 Deleted: 0 Skipped: 0 Warnings: 0
This specifies the field delimiter as a comma followed by a space character and tells
MySQL that the string fields in the original file are enclosed in quotes. Even on my old
Pentium 200 MMX server, this database was loaded (and indexed) in less than three sec-
onds (impressive).
For each Web programming language and technique in this chapter and the next one, I’ll
present an example that accepts a five-digit Zip code entered by a user in a Web form,
looks up the associated city and state from the database, and returns the result to the user.
This will demonstrate not only how to program an application to accept data from a Web
client, but also how to interface the application to a common database system, make a
query of a database, and return the results to the requester. Although it is quite simple,
the application demonstrates the basics of Web programming, particularly for database
access, one of the most common tasks that must be performed by Web servers on behalf
of the end user.
The input form will be the same for each Web programming example that accesses the
Zipcodes MySQL database, a very simple HTML form that takes a single input, the U.S.
Postal Service Code to be looked up in a database. The Web form used to get user input
is shown in Figure 8.2.
The HTML for the form is also quite simple, as you can see in Listing 8.2.
I didn’t actually write the HTML you see above; I used the CGI.pm module to do much of the
work for me. Along with the features noted earlier, this module provides the ability to
create most of the features of an HTML document. CGI.pm can do this using either “tra-
ditional” or object-oriented programming mechanisms. The first method uses standard
Perl function calls. For those programmers who aren’t completely comfortable with pro-
gramming through objects, this may seem the simplest and most intuitive way to use
CGI.pm. The function-oriented CGI script in Listing 8.3 generated the HTML for the
simple input form shown in Listing 8.2.
Listing 8.3 Using CGI.pm to Generate the HTML for a Web Form
#!/usr/bin/perl -Tw
use CGI qw(:standard);   # import the function-oriented interface
use CGI::Carp;
print header;
print start_html("Zip Code Database Query Form");
print "<H1>Zip Code MySQL Database Query Form</H1>\n";
&print_prompt($query);
&print_tail;
print end_html;
sub print_prompt {
my($query) = @_;
print startform(-method=>"POST",-action=>"http://Jackal.hiwaay.net/cgi-bin/zipcodes.cgi");
To use CGI.pm in object-oriented style, you create a CGI object and then make use of
methods and properties that it exposes. This is the form I recommend, for two reasons:
First, it is the modern programming paradigm; second, nearly all good examples you’ll find
for using CGI.pm, including those in the CGI.pm documentation, use this style. Listing 8.4
is the same script, but written to use the CGI object methods rather than functions.
#!/usr/bin/perl -Tw
use CGI;
use CGI::Carp;
$query = new CGI;   # create the CGI object used throughout
print $query->header;
print $query->start_html("Zip Code Database Query Form");
print "<H1>Zip Code MySQL Database Query Form</H1>\n";
&print_prompt($query);
&print_tail;
print $query->end_html;
sub print_prompt {
my($query) = @_;
print $query->startform(-method=>"POST",-action=>"http://Jackal.hiwaay.net/cgi-bin/zipcodes.cgi");
print "<EM>Enter a 5-digit Zip Code:</EM><BR>";
print $query->textfield(-name=>'zip', -size=>6);
print "<P>",$query->reset(‘Clear’);
print $query->submit(-value=>‘Submit Search’);
print $query->endform;
print "<HR>\n";
}
sub print_tail {
print <<END;
<ADDRESS>Charles Aulds</ADDRESS><BR>
<A HREF="/">Home Page</A>
END
}
Notice several things about both examples above. First, the CGI scripts are designed only
to return an HTML page to the client browser; they contain no code for data manipula-
tion, either I/O or computation. It might seem far easier to write the HTML and save it
on the server as filename.html. In this case, it probably is … but when you are required
to generate your HTML dynamically or on-the-fly, CGI.pm will repay the effort you take to
learn the module. Use perldoc CGI to generate the excellent documentation for the module,
full of good examples.
Also note that many functions (or methods) have defaults. In the case of the header func-
tion (or method) above, I used the default, which sends the following HTTP header to
the client:
Content-Type: text/html
You can override the default to specify your own content type using either of these forms,
which are equivalent:
print header('mimetype/subtype');
or
print $query->header('mimetype/subtype');
The third point to note is how easily an HTML form can be created with CGI.pm. With
CGI.pm, it isn’t necessary to know the HTML tags used by the browser to render the
HTML page. Comparing the CGI scripts above with the generated HTML, you can easily
see how the script generated the form. CGI.pm is best learned in exactly that fashion, com-
paring a script with its output. Later in this chapter, I’ll demonstrate how CGI.pm is used
to receive user-generated input.
Finally, note that, even if you are using CGI.pm, you can use print statements to output
anything else from your script. For some of the simpler lines, I did just that, using print
to output tagged HTML.
CGI::Carp
Carp is a module that was developed to increase the usefulness of the standard warn-
ings and error messages returned when things go wrong in a Perl script. When you
include the Carp module in a Perl script by placing the line
use Carp;
somewhere near the beginning of the file, standard error messages are displayed
with the name of the module in which the error occurred. This is handy whenever you
are running Perl scripts that, in turn, call other scripts.
Errors returned from CGI scripts (normally written to the script’s standard error) are
automatically diverted into the Apache log. The problem is that the errors written
there are not time-stamped and, even worse, sometimes don’t identify the CGI script
in which the error occurred. For example, I deliberately broke a CGI script by calling
a nonexistent function, and then ran the script from a Web browser, which wrote the
following line into Apache’s error.log file:
Died at /home/httpd/cgi-bin/CGIForm.cgi line 12.
This isn’t bad; it tells me the script name and even the line number where the error
occurred. When I added the line
use CGI::Carp;
to the beginning of the script, however, the same error caused the following lines to
be written. They give a time and date stamp, as well as identifying the script where
the error or warning occurred, and listing the calling subroutine(s) when these apply:
[Fri Feb 25 14:37:55 2000] CGIForm.cgi: Died at /home/httpd/cgi-bin/
CGIForm.cgi line 12.
[Fri Feb 25 14:37:56 2000] CGIForm.cgi: main::print_me() called at /
home/httpd/cgi-bin/CGIForm.cgi line 8.
Listing 8.5 shows the actual CGI script that performs the database lookup, taking one
parameter, a postal (Zip) code entered by the user in the Web form. This script also loads
CGI.pm, which makes it very easy to receive user form input. CGI.pm provides a Perl function
(or method) called param. This function can be called with the name of a field in a Web form
to retrieve the value entered by the user in that field. In this example, the value entered by
the user in the zip field is obtained by calling param('zip').
#!/usr/bin/perl -w
# queries a MySQL table for a value and returns it in
# an HTML-formatted document
#
use strict;
use DBI;
use CGI qw(:standard);
use CGI::Carp;
#
# Create a new CGI object
my $output=new CGI;
#
# What did the user enter in the query form?
# What did the user enter in the query form?
my $zipentered=param('zip') if (param('zip') );
my($server, $sock, $db);
#
# Connect to mysql database and return a database handle ($dbh)
my $dbh=DBI->connect("DBI:mysql:zipcodes:jackal.hiwaay.net","root","mypass");
#
# Prepare a SQL query; return a statement handle ($sth)
my $sth=$dbh->prepare("Select * from zips where zip=$zipentered");
#
# Execute the prepared statement to retrieve matching rows
$sth->execute;
my @row;
print $output->header;
print $output->start_html("Zip Code");
print h1("ZipCODE");
#
# Return rows into a Perl array
while (@row=$sth->fetchrow_array() ) {
print "The US Postal Service Zip Code <font size=+1><b>$row[2]</b></font> is
for: <font size=+2><b>$row[0], $row[1]</b></font>\n";
}
print "<p>\n";
print "<h3>GATEWAY_INTERFACE=$ENV{GATEWAY_INTERFACE}</h3>";
print $output->end_html;
#
# Call the disconnect() method on the database handle
# to close the connection to the MySQL database
$dbh->disconnect;
The database query is performed using the DBI.pm module (see the accompanying discus-
sion). Although DBI has a number of functions, the most basic use of the module is to
create a connection, compose and send a query, and close the connection. The comments
in Listing 8.5 serve to explain what each line is doing. Using this example, you should be
able to quickly write your own CGI script to connect to and query a relational database
on your own server. You may not choose to use MySQL as I did, but by changing one line
of this script (the line that calls the DBI->connect method) you can make this same script
work with nearly all major relational database servers.
DBI.pm
Certainly one of the most useful of all Perl modules is the DBI module (DBI.pm). DBI
stands for “DataBase Independent,” and the DBI module provides a standard set of
programming functions for querying a wide variety of databases. While a full discus-
sion of the DBI module is beyond the scope of this chapter, I have used DBI to
illustrate database querying from standard CGI scripts.
You install the DBI module once. Then you install a DataBase Dependent (or DBD)
module for each database that you intend to access. As with most Perl modules, the lat-
est versions of the DBD modules are available from the Comprehensive Perl Archive
Network (search.cpan.org). Better yet, using the CPAN.pm module, download and
install them in one easy step. To install the latest DBI.pm module and the MySQL DBD
module, I entered two commands after loading CPAN.pm (as described in Chapter 1):
# cpan
cpan> install DBI
cpan> install DBD::mysql
One very interesting module that I use quite frequently is DBD::CSV, which allows you
to create a flat text file consisting of rows of comma-separated values and work with
it as you would a true relational database. Each line of the file is a separate data
record or row, and the values on the line, separated by commas, are separate data
fields. Using DBD::CSV allows you to develop database applications without having
access to a true relational database. When you have things the way you want them,
you simply modify your application to use a true database-dependent driver (by load-
ing a new DBD module).
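A minimal sketch of that approach (directory name, file contents, and the sample query are assumptions); with DBD::CSV installed, a table name maps to a comma-separated text file under the f_dir directory:

#!/usr/bin/perl
use strict;
use DBI;

# Work with comma-separated text files in /tmp/csvdb as if they were tables
my $dbh = DBI->connect("DBI:CSV:f_dir=/tmp/csvdb")
    or die "Cannot connect: $DBI::errstr";

# The "zips" file would hold city,state,zip rows, mirroring the MySQL table
my $sth = $dbh->prepare("SELECT city, state FROM zips WHERE zip = '35801'");
$sth->execute;
while (my @row = $sth->fetchrow_array) {
    print "$row[0], $row[1]\n";
}
$sth->finish;
$dbh->disconnect;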
FastCGI
CGI was the first general-purpose standard mechanism for Web programming, and for a
long time it remained the most used application programmer’s interface to the Web
server. But it has always been hampered by a performance bottleneck: Every time a CGI
application is called, the Web server spawns a new subsystem or subshell to run the pro-
cess. The request loads imposed on many modern servers are so large that faster mecha-
nisms have been developed, which now largely overshadow CGI. Among the first of these
was FastCGI, a standard that allows a slightly modified CGI script to load once and
remain memory-resident to respond to subsequent requests.
FastCGI consists of two components. The first is an Apache module, mod_fastcgi.so,
that modifies or extends the Web server so that it can properly identify and execute pro-
grams designed to run under FastCGI. The second component is a set of functions that are
linked to your FastCGI programs. For compiled languages, these are provided as a shared
library; for Perl, these functions are added using the FCGI.pm Perl module.
To make the functions exported from these libraries available to your program, you
include a C header file or, in scripting languages like Tcl or Perl, place a line at the begin-
ning of the script to include code necessary to enable FastCGI support in the script.
FastCGI libraries are available for C, Perl, Java, Tcl, and Python. In this section I’ll dem-
onstrate how to make the necessary modifications to the Apache server and how to
modify the CGIForm.cgi Perl script to allow it to run as a FastCGI script.
2. Move to the new source directory created when the distribution was extracted, and
use a single command line to invoke the apxs utility to compile and install the module:
# cd mod_fastcgi_2.2.4
# /usr/local/apache/bin/apxs -i -a -o mod_fastcgi.so -c *.c
3. Then verify that the following lines have been added to httpd.conf:
LoadModule fastcgi_module libexec/mod_fastcgi.so
AddModule mod_fastcgi.c
#!/usr/bin/perl -Tw
#Environ.cgi - Show environment variables set by the server
use FCGI; # Imports the library; this line required
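# The remainder here is a sketch: the FastCGI response loop wraps the
# original Environ.cgi body, following the same pattern as Listing 8.7 below
while (FCGI::accept() >= 0) {
    print "Content-type: text/html\n\n";
    print "<HTML><HEAD><TITLE>Environment Variables</TITLE></HEAD><BODY>";
    print "<H2>Environment Variables:</H2><HR>\n";
    foreach $evar ( keys (%ENV) ) {
        print "<B>$evar:</B> $ENV{$evar}<BR>";
    }
    print "</BODY></HTML>\n";
}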
You might note a new environment variable that appears when you call this script from
a browser:
FCGI_APACHE_ROLE=RESPONDER
This indicates that FastCGI is operating in one of three different application roles it can
assume. The Responder role provides the functionality of ordinary CGI, which cannot
operate in the other two roles that FastCGI can assume. The first of these alternate roles
is the Filter role, in which a FastCGI script is used to process a file before it is returned
to the client. The other role is the Authorizer role, in which a FastCGI application is used
to make decisions about whether or not to grant a user’s request. In this role, FastCGI acts
as an authorization module, like those described in Chapter 14. Both of these roles are too
complex for discussion here, and neither is used often. If you’re interested in exploring
either of them further, your first stop to learn more should be www.fastcgi.com.
Note that other Perl modules can still be used in the same fashion as in ordinary CGI scripts.
Listing 8.7 illustrates this. It’s a FastCGI rewrite of the zipcodes MySQL query script,
rewritten to take advantage of the efficiency of FastCGI.
#!/usr/bin/perl -Tw
use CGI;
use CGI::Carp;
use FCGI; # Imports the library; required line
$query = new CGI;
# Response loop
while (FCGI::accept >= 0) {
print $query->header;
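    # sketch of the remainder: inside the response loop the script would run
    # the same DBI connect/prepare/execute and row-printing code as Listing 8.5,
    # then finish the page and wait for the next request
    print $query->start_html("Zip Code");
    # ... database query and output as in Listing 8.5 ...
    print $query->end_html;
}
FastCGI programs can also be written in compiled languages such as C, using the same accept-loop pattern: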
#include <fcgi_stdio.h>
void main(void)
{
    int i = 0;
    while(FCGI_Accept() >= 0) {
        printf("Content-type: text/html\r\n\r\n");
        printf("<H1>Hello World!</H1>");
        printf("<p>You've requested this FastCGI page %d times.\n", i++);
    }
}
The mod_perl Perl Accelerator
Notice the #include statement that is necessary to use FastCGI. The program goes into
a loop that is processed once every time a call to FCGI_Accept() returns with a result
of zero or greater. I set an integer counter outside the loop, which is incremented during
the processing of the loop. Can you see how a different value of the counter is returned
for each request for this FastCGI program?
My favorite Apache module, mod_perl, eliminates virtually all the overhead associated
with Perl/CGI and puts Perl in the same league with the very fastest server-side Web pro-
gramming techniques. Add to this a tremendous wealth of modules that mod_perl
enables, and mod_perl becomes a virtual gold mine for Web program authors and Apache
administrators.
mod_perl starts by linking the Perl runtime library into the Apache server, thereby giving
each running copy of Apache its own Perl interpreter. This is accomplished in one of two
ways: the Perl function libraries can be statically linked to the Apache httpd process (which
requires recompiling Apache from the source code), or the Perl libraries can be linked into
the mod_perl DSO module that is loaded into Apache’s address space at runtime. If the
DSO option is chosen, you have a choice of obtaining the DSO as an RPM or compiling
it yourself. All of these methods of installing mod_perl are discussed in the next few
sections. Embedding the interpreter completely eliminates the need to start a new instance
of the Perl interpreter in its own Linux process each time a Perl CGI script is called, which
significantly improves the response time and total runtime of scripts. That improvement
translates into a dramatic increase in the number of client requests that can be serviced
in a given time.
The really cool thing is that mod_perl runs nearly all Perl scripts without modification.
The only thing you have to do is specify mod_perl as the Apache handler for the scripts,
instead of the default mod_cgi. On my server, I set up mod_cgi to handle requests to /cgi-
bin and mod_perl to handle all requests to /perl.
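A sketch of what that looks like in httpd.conf (the alias and paths here are assumptions); the Apache::Registry handler described below does the actual work:

Alias /perl/ "/usr/local/apache/perl/"
<Location /perl>
    SetHandler perl-script
    PerlHandler Apache::Registry
    Options +ExecCGI
</Location>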
The advantages of using mod_perl don’t stop there, however. An integral part of mod_
perl is Apache::Registry, which is probably the most valuable Perl module for use with
Apache. Used together, these modules increase the execution speed of Perl scripts dra-
matically. The Apache::Registry module creates its own namespace and compiles Perl
scripts that are called through mod_perl into that namespace. It associates each script
with a time-stamp. The next time that script is used, if the source files aren’t newer than
the compiled bytecodes in the Apache::Registry namespace, the script is not recom-
piled. Some Perl code that is called frequently (like CGI.pm) is only compiled once, the first
time it is used!
Two other important features that mod_perl adds to Apache are of special interest to
hard-core programmers. The first is a set of functions that give Perl scripters access to the
Apache internal functions and data structures. This permits Apache modules to be
written completely in Perl, rather than in C, and a large number of such modules exist.
(I’ll describe a few shortly.) The second programming feature is a set of easy-to-use han-
dler directives for all the phases of the Apache request processing cycle. These permit the
specification of Perl modules to handle virtually any task, without explicitly adding the
module to the Apache configuration.
Installing mod_perl
Installing mod_perl is a bit more complex than installing most Apache modules, mainly
because it consists of so many components. As described, the actual mod_perl Apache
module (compiled from C source) is only one part of the puzzle. In addition, the Perl inter-
preter library needs to be linked either to the Apache kernel or directly to the mod_perl
module (if it’s compiled as a DSO). A number of supporting Perl scripts are also included
with mod_perl and provide essential pieces of its functionality. There are two easy ways
to acquire all the parts and install them simply. The first is to obtain mod_perl as an RPM,
which I recommend only if you are using a copy of Apache that has been installed using
an RPM, most likely as a part of a standard Linux distribution.
If you are using Apache source that you compiled or installed from a binary distribution,
you should consider using the CPAN.pm module to acquire and install the latest source dis-
tribution of mod_perl, and all the required support files. If you use this method, however,
I recommend that you decide between statically linking mod_perl to Apache and com-
piling it as a DSO module. In either case, you should reconfigure, recompile, and reinstall
the module directly from the source code downloaded and extracted to your hard disk by
CPAN.pm. Instructions for both methods are given.
Traditionally, mod_perl has experienced problems when built as a DSO module on some
platforms. The documentation for the module warns that it may not work, and says that
for some platforms it will not work. However, I have compiled mod_perl as a DSO
(libperl.so) and it has worked fine for me under Linux, running every CGI script I have
without problems, even during intense activity driven by the ApacheBench benchmarking
utility. And since mod_perl is provided as a DSO with the Red Hat Linux distribution, the
module probably doesn’t deserve its reputation for presenting problems when run as a
DSO (at least not on Linux). I recommend that you compile mod_perl as a DSO and
reserve static linking as a means of solving any problems that occur. I’m willing to bet you
will have no problems using the module as a DSO.
in their “proper” locations, but this is more trouble than it is worth, negating the benefits
of the RPM.
If you have a Red Hat Linux system, or you installed Apache from a downloaded RPM,
chances are you already have mod_perl installed. You can tell if you have the module
installed by running rpm -qa | grep mod_perl. If you simply want to reinstall or upgrade
the module, I recommend you download and install the mod_perl RPM. If you’ve
installed Apache from a binary distribution for your platform, or compiled it from the
source, however, don’t use the RPM method described here. Instead, use the CPAN install
method to retrieve the source as a Perl or bundled distribution.
The only place you really need to look for Linux RPMs is Rpmfind.Net (http://
rpmfind.net). This site hosts a database and repository of thousands of RPMs, built on
systems around the world. Figure 8.3 shows the mod_perl 1.22 RPM that I downloaded
for installation on my server. This RPM contains mod_perl built as a loadable module
(DSO) that installs as /usr/lib/apache/libperl.so. If you are installing mod_perl for
the first time, you will need to configure Apache manually to load and use the module as
described below.
Installing the RPM is as easy as downloading the package and entering the following
command line:
# rpm -i mod_perl-1.22-2.i386.rpm
You should have the latest versions of all these modules installed on your machine (as
well as an up-to-date version of Perl, at least 5.004, although 5.005 is much better). Don’t
despair if there seem to be far too many pieces of the puzzle; it’s easy to snag them all
with CPAN.pm, where you can download mod_perl and all those other pieces as a single
bundle and install them all at one time.
You can install the mod_perl bundle from CPAN using the following command line:
# perl -MCPAN -e 'install mod_perl'
I do not recommend this method of using the CPAN.pm module, however, particularly
since many of the installation scripts (makefiles) require some user input. Always invoke
CPAN.pm in the interactive mode (as described in the sidebar earlier in the chapter) and use
the install command, as shown below. The components of the mod_perl package that
are already installed on your machine are flagged as “up to date” and will be skipped
during the installation.
# cpan
cpan shell -- CPAN exploration and modules installation (v1.54)
ReadLine support enabled
A lot of stuff happens at this point (the trace was over 1500 lines long). Don’t let it scare
you; it’s mostly information that you can safely ignore, but some parts of the install will
require user input. For example, in installing the libnet portion, you will be asked for
some information about your network, including the names of mail servers, your domain
name, and that sort of thing. Do your best to answer these, but don’t fret if you can’t
answer a question; the module will work without all that information; accept the default
response for any question you can’t answer and things will be just fine. Everything is built
relative to your Apache source code, so when you’re asked to supply that, make sure you
enter the correct path to the src directory under the Apache source directory:
Please tell me where I can find your apache src
[] /usr/local/src/apache_1.3.9/src
Configure mod_perl with /usr/local/src/apache_1.3.9/src ? [y]
Shall I build httpd in /usr/local/src/apache_1.3.9/src for you? [y]
During the installation procedure (which takes quite some time), the CPAN.pm module
downloads mod_perl and all of the support modules listed above, and it runs Makefile
scripts for each to compile and install them. When the process is complete, I usually
reenter the install line as a test. If all went well, instead of 1500 lines of output, you
should see this:
cpan> install mod_perl
mod_perl is up to date.
Unless specifically reconfigured, the CPAN.pm module creates a source and a build direc-
tory under a .cpan directory beneath the current user’s home directory. I always use the
root account when I use CPAN to download and install Perl packages, so all my source
archives and builds fall under /root/.cpan, and I’ve found no reason to change this. It
seems like a good location, well protected, and I don’t recommend that you change these
default locations.
In addition to the sources directory, you should find a build directory, which the
CPAN.pm module uses to build Perl packages. It does this by extracting the files from the
source archive and running the provided Makefile.PL to configure the application
(which may query you for information it needs) and then running make and then
make install. This is the same procedure that you would use if installing the package by
hand. Most modules will not require you to go into the build directory and run the install
by hand, but because of the number of configuration options for mod_perl, that is exactly
what you should do after running CPAN.pm.
You should find a directory under build that matches the name of the Perl package you
installed using CPAN.pm. In my case, the directory for mod_perl version 1.24 was /root/
.cpan/build/mod_perl-1.24. In this directory, enter the following to configure mod_
perl, adding several configuration options:
# perl Makefile.PL \
> APACHE_SRC=/usr/local/src/apache_1.3.9/src \
> USE_APACI=1 \
> PREP_HTTPD=1 \
> DO_HTTPD=1 \
> EVERYTHING=1
Note that this is a single command line; the backslash at the end of the first five lines con-
catenates the following line. The most important option here is PREP_HTTPD, which causes
the source files to be moved into the Apache source tree (defined by APACHE_SRC) but not
built. After you enter the above command line, the configuration will proceed, printing
many lines of information. The EVERYTHING variable instructs the configuration to
include all features of mod_perl, even those considered experimental. This will enable, for
example, features like support for Server-Side Include parsing. Using EVERYTHING=1 is the
equivalent of specifying all of the following:
# ALL_HOOKS=1 \
> PERL_SSI=1 \
> PERL_SECTIONS=1 \
> PERL_STACKED_HANDLERS=1 \
> PERL_METHOD_HANDLERS=1 \
> PERL_TABLE_API=1
You can now compile and install Apache by typing the following command while still in
the mod_perl source directory:
# make
# make install
This is the way mod_perl is documented, and it works perfectly well, but if you compile
the module from this location you will need to remember that subsequent Apache builds
need to be run from the mod_perl directory, instead of from the Apache source tree. This
is awkward, but there is a better way. If you prefer to rebuild Apache from the Apache
source tree as described in Chapter 3, use the following instructions.
After running the perl Makefile.PL just shown, you’ll find a new directory, src/
modules/perl, inside the Apache source directory. It contains everything APACI needs to
compile this module into the Apache server. However, to compile Apache and include the
mod_perl module, you will need to modify the LIBS and CFLAGS variables and add an
--activate-module option when running the Apache configure utility. The following
script (run from within the top-level directory of the Apache source tree) is the one I use
to configure, compile, and install basic Apache with mod_perl support:
#!/bin/sh
LIBS=`perl -MExtUtils::Embed -e ldopts` \
CFLAGS=`perl -MExtUtils::Embed -e ccopts` \
./configure \
"--activate-module=src/modules/perl/libperl.a" \
"$@"
make
make install
The configure invocation is a single command, spanning several lines joined by the
trailing \ character. When the make step completes, you should find the DSO, compiled
as libperl.so, residing in the apaci directory. The size of this file (nearly a megabyte on
my machine) seems excessive, but remember that it has the entire Perl interpreter linked
into it, which largely accounts for the size.
# ls -al apaci/libperl.so
-rwxr-xr-x 1 root root 947311 Apr 8 15:24 apaci/libperl.so
You can reduce the size of the file somewhat by stripping unnecessary debugging symbols
(using the Linux strip command):
# strip apaci/libperl.so
# ls -al apaci/libperl.so
-rwxr-xr-x 1 root root 872676 Apr 8 15:30 apaci/libperl.so
A reduction of only eight percent seems modest, but worth the little time it took. The last
step is to run make install to install the module:
[root@jackal mod_perl-1.24]# make install
(cd ./apaci && make)
make[1]: Entering directory `/root/.cpan/build/mod_perl-1.24/apaci’
lines deleted
Appending installation info to /usr/lib/perl5/i586-linux/5.00405/perllocal.pod
This last step uses apxs to install the module into Apache's libexec directory, and
even modifies the Apache httpd.conf file to use it, by adding the following two lines
to that file:
LoadModule perl_module libexec/libperl.so
AddModule mod_perl.c
The second directive, PerlPassEnv, takes the name of an existing variable from the main
server’s environment (usually the environment of the user who started Apache, typically
root). This environment variable will be included in the environment set up for the CGI
script:
PerlPassEnv USER
If you are not passing data to your mod_perl scripts through the environment, you can
instruct mod_perl not to set up an environment to be inherited by CGI scripts. The speed
gain and memory savings are usually not substantial enough to warrant disabling this fea-
ture, but it can be done for an extra performance boost:
PerlSetupEnv Off
If mod_perl has been compiled with the proper switches (PERL_SSI=1 or EVERYTHING=1,
as discussed earlier in the chapter), Apache's mod_include module can support Perl
callbacks as SSI tags. The syntax for an SSI line that uses this special tag is
<!--#perl sub=subkey -->
The subkey can be a call to an external Perl script or package name (which calls the han-
dler method of the package by default), or it can specify an anonymous function
expressed as sub{}:
<!--#perl sub="SomePackage" arg="first" arg="second" -->
<!--#perl sub="sub {for (0..4) { print \"some line\n\" }}" -->
To find only the most recent version of each, go to the CPAN search site and search for
modules matching the string Apache. As a shortcut, you can use the following URL:
search.cpan.org/search?mode=module&query=Apache
The Apache::Registry Module Of all the support modules that mod_perl relies on,
Apache::Registry is without question the most important. The module is so important
that mod_perl can’t be installed or used without it. The two work hand in hand, and
many of the functions we think of as part of mod_perl are actually performed by the
Apache::Registry Perl module. The module performs two tasks that greatly extend
mod_perl. First, it provides a full emulation of the CGI environment, allowing CGI
scripts to be run under mod_perl without modification. Remember, mod_perl only pro-
vides a Perl programmer’s interface to Apache and embeds a Perl interpreter into the
Apache kernel. It is these functions of mod_perl that Apache::Registry uses to provide
CGI emulation to Apache and its Perl interpreter. Although it is a separate Perl module,
Apache::Registry is inseparable from mod_perl, and each depends on the other.
The second essential Apache::Registry function is called script caching. Perl CGI scripts
are automatically loaded into a special namespace managed by Apache::Registry and
maintained there, rather than being unloaded from memory after they are used. This
means that a CGI program is loaded and compiled only the first time it is called, and sub-
sequent calls to the same program are run in the cached code. This greatly increases the
throughput of the Perl engine, as I’ll show in a later section on benchmarking mod_perl.
Although Apache::Registry provides the functions of CGI emulation and script
caching, these are usually attributed to mod_perl, and for that reason I won’t refer
again to Apache::Registry. Whenever I refer to mod_perl, I’m speaking of mod_perl
with Apache::Registry and other support modules of lesser importance. Without
these, mod_perl doesn’t do much for the typical Apache administrator but is merely a
programmer’s interface.
The PerlSendHeader On line causes mod_perl to generate common HTTP headers just as
mod_cgi does when processing standard CGI. By default, mod_perl sends no headers. It
is a good idea to always enable the PerlSendHeader directive, especially when using
unmodified standard CGI scripts with mod_perl.
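In practice, PerlSendHeader is usually set inside the <Location> block that hands a URL over to Apache::Registry. The following is a minimal sketch of such a block; the /perl location matches the Alias example discussed next, and the exact paths are assumptions about your layout:
<Location /perl>
    SetHandler perl-script
    PerlHandler Apache::Registry
    Options +ExecCGI
    PerlSendHeader On
</Location>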
I used the following Alias directive to assign a directory outside Apache’s DocumentRoot
to the /perl request URL:
Alias /perl/ /home/httpd/perl/
The directory so defined does not have to be located under the Apache DocumentRoot,
and in my case, it is not. I used the following Linux command lines to create the new
directory, copy the environ.cgi script to it, and then change the ownership of the direc-
tory and its contents to the nobody account and the group ownership to www. On your
system, ensure that the file is owned by the same user account under which the Apache
httpd processes run, and that the group ownership is set to a group that includes your
Web administrators.
# mkdir /home/httpd/perl
# cp /home/httpd/cgi-bin/environ.cgi /home/httpd/perl
# chown -R nobody.www /home/httpd
Requesting /perl/environ.cgi from a browser now shows the environment that mod_perl
sets up for the script, similar to the following:
REQUEST_METHOD = GET
QUERY_STRING =
HTTP_USER_AGENT = Mozilla/4.7 [en] (WinNT; I)
PATH = /bin:/usr/bin:/usr/ucb:/usr/bsd:/usr/local/bin
HTTP_CONNECTION = Keep-Alive
HTTP_ACCEPT = image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png,
*/*
REMOTE_PORT = 2102
SERVER_ADDR = 192.168.1.1
HTTP_ACCEPT_LANGUAGE = en
MOD_PERL = mod_perl/1.24
SCRIPT_NAME = /perl/environ.cgi
SCRIPT_FILENAME = /home/httpd/perl/environ.cgi
HTTP_ACCEPT_ENCODING = gzip
SERVER_NAME = Jackal.hiwaay.net
REQUEST_URI = /perl/environ.cgi
HTTP_ACCEPT_CHARSET = iso-8859-1,*,utf-8
SERVER_PORT = 80
HTTP_HOST = Jackal
SERVER_ADMIN = [email protected]
A Few Other Important Apache Modules Two other Apache modules for Perl,
Apache::ASP and Apache::DBI, are worth mentioning but are too ambitious to cover in
detail here. Both allow you to add extensive functionality to Apache. Because each of
them relies on the mechanisms of mod_perl, they also offer efficiency and speed. They are
available, and documented, from the perl.apache.org and search.cpan.org sites.
Apache::ASP This Perl module provides Active Server Pages (ASP), a popular
Microsoft-developed technology that originated with the Microsoft IIS server. The
Microsoft Win32 version of ASP for IIS allows the embedding of Perl, VBScript, and
JScript code in HTML documents. Using Apache::ASP, programmers already proficient
in ASP on IIS servers can leverage this knowledge by programming ASP pages in Perl.
Although VBScript and JScript cannot be used with Apache::ASP, there is an effort
underway to bring these to non-Microsoft platforms in a product called OpenASP (see
“ASP for Apache” in Chapter 9).
Apache::DBI This module (which should not be confused with the standard DBI.pm
module) enables the caching of database handles in the same way that Perl code is cached
and reused by Perl scripts. The section “Persistent Database Connections” later in this
chapter discusses Apache::DBI in detail.
Both EmbPerl and HTML::Mason are embedded Perl systems; both are complete development
systems in their own right and can be used to develop complete Web-based applications.
HTML::Mason The newer HTML::Mason Perl module appears, on the surface, to work
remarkably like its cousin, EmbPerl, but there are some critical differences between the
two. While EmbPerl tends to take a grass-roots approach, starting with HTML and
enhancing it, Mason takes a top-down view of things. Before the HTML or Perl code ever
comes into play, Mason starts with a master plan, in which Web pages are composed of
the output of components. These components are usually mixtures of HTML, embedded
Perl, and special Mason commands. The emphasis is on site design and page structure,
rather than on simply embedding Perl functions in HTML documents. This approach
encourages code and design component reuse. Mason is full of functions to facilitate the
reuse of code, either by simple inclusion in a document, or through filters (which modify
the output of a component) and templates (which work somewhat like style sheets to
apply a format to an entire directory of pages).
HTML::Mason is not strictly an Apache add-on. It will work in stand-alone mode or CGI
mode, but the developers highly recommend that it be used with Apache supporting the
mod_perl module. HTML::Mason is well documented at the author’s Web site, which also
hosts a small library of user-written components that can be downloaded and used or
examined to learn the use of the product.
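To give a feel for what a component looks like, here is a minimal sketch of a Mason component; the file name, the argument, and its default are invented for illustration:
% # greeting.mas - a trivial Mason component (hypothetical)
<h2>Hello, <% $name %>!</h2>
<%args>
$name => 'world'
</%args>
The <%args> block declares the component's arguments (with optional defaults), and <% ... %> interpolates a Perl expression into the output.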
mod_perl eliminates the overhead of starting a new Perl interpreter for each request and
uses caching to achieve the rest. You'll see two ways of using mod_perl's caching engine
more efficiently, and then the results of a few benchmark tests I ran to verify that mod_perl
does, indeed, deliver on its promise of greatly increasing your Perl script performance.
To take advantage of persistent connections, load Apache::DBI when the server starts (for
example, with a use Apache::DBI line in a startup script or a PerlModule Apache::DBI
directive). Then remove all use DBI lines from scripts that should use cached database
handles. This will prevent calls to DBI functions from being handled by the standard DBI
module; all such calls will instead be handled automatically by the Apache::DBI module.
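As a sketch of what a script looks like after this change (the database name, credentials, table, and query are placeholders), it simply calls DBI->connect and receives a cached handle from Apache::DBI:
#!/usr/bin/perl
# Runs under Apache::Registry; Apache::DBI was preloaded at server startup,
# so DBI->connect transparently returns a persistent, cached handle.
use strict;
my $dbh = DBI->connect("DBI:mysql:mydblog", "someuser", "somepassword")
    || die $DBI::errstr;
my ($count) = $dbh->selectrow_array("SELECT COUNT(*) FROM access_log");
print "Content-type: text/plain\n\n";
print "Rows logged so far: $count\n";
# Do not call $dbh->disconnect; Apache::DBI overrides disconnect so the
# connection remains open for the next request.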
You should never attempt to open a database connection during Apache’s startup
sequence (for example, from a mod_perl startup script). This may seem like a logical way
to open database connections for later use, but the database handles created this way are
shared among the httpd child server processes, rather than being opened one-per-httpd-
process. This can create conflicts between httpd processes trying to use the same database
handle simultaneously.
Preloading Modules
Remember that the main Apache httpd process creates child httpd processes to handle
all user requests, and these child processes inherit the namespace of the main process.
Each child process inherits its own copy of the Perl modules loaded by the main server to
support mod_perl. As requests for CGI scripts are fulfilled, each process will also load
its own copy of Apache::Registry and maintain its own cache of compiled Perl scripts.
These child httpd processes are usually killed after answering a fixed number of
requests (configured using the MaxRequestsPerChild directive). This can mitigate
problems associated with potential memory leaks, but it also destroys each httpd pro-
cess’s Apache::Registry cache and requires that each be built again from scratch. This
can happen thousands of times during the lifetime of the main server process.
To prevent cached Perl code from being destroyed along with the child process that
loaded it, mod_perl provides two configuration directives that enable Perl scripts to be
preloaded into the namespace of the main server and inherited by all the child processes
it creates.
The first of these is PerlRequire, which specifies a single Perl script to load when Apache
starts up:
PerlRequire startup.pl
Generally, this script contains use statements that load other Perl code. This directive is
used to preload external modules that are common to a number of Perl scripts:
# contents of startup.pl
use Apache::Registry;
use CGI;
use CGI::Carp;
use Apache::DBI;
The script specified must exist in one of the directories specified in the @INC array
(described in the next section).
The second directive that can be used for this purpose is PerlModule, which can specify
a single module to preload when Apache starts. The startup.pl script shown can also be
rewritten entirely in httpd.conf with these four directives:
PerlModule Apache::Registry
PerlModule CGI
PerlModule CGI::Carp
PerlModule Apache::DBI
The advantage of using PerlModule is that an external startup script is not required. A
limitation of PerlModule, however, is that no more than 10 modules can be preloaded
using PerlModule directives. For most sites, this limitation is not significant, but if you
need to preload more than 10 modules, you will need to use a startup script.
The last two directories will appear only if mod_perl is used. The last directory specified
in this array provides a convenient location for storing Perl scripts that are intended for
use only with mod_perl. Although Perl fills @INC from values compiled into Perl when you
run a program, you can add directories to this array with the use lib statement, which
is best placed in a startup script to ensure that all processes inherit the modified @INC
array. To add the directory /usr/local/perlstuff to the @INC array, add a line like the
following somewhere at the beginning of your startup script:
use lib '/usr/local/perlstuff';
Unlike ordinary scripts, code loaded through use and require statements is not automat-
ically reloaded if it is changed. For this reason, you should use these statements only to
load code that will not change, particularly if you are loading it into the main server
namespace, which is not refreshed as child processes expire and are killed.
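Putting these pieces together, a single startup script can both extend @INC and preload the common modules. This is only a sketch (the extra library path is an assumption); note that a file loaded with PerlRequire must end by returning a true value:
# startup.pl - loaded once by the main server via PerlRequire
use lib '/usr/local/perlstuff';   # assumed site-specific library directory
use Apache::Registry;
use Apache::DBI;                  # load before DBI so connections are cached
use CGI;
use CGI::Carp;
1;                                # PerlRequire expects a true return value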
Benchmarking mod_perl
To find out just how much faster mod_perl runs than a traditional Perl script invoked
through the CGI interface, I used the excellent ApacheBench benchmarking tool that
comes packaged with Apache (you’ll find it as ab in the bin directory where Apache is
installed).
Listing 8.9 shows a very simple CGI script that does nothing but return the system envi-
ronment.
#!/usr/bin/perl
#ReportWho.cgi - Show environment variables set by the server
print "Content-type: text/html\n\n";   # header needed when run as ordinary CGI
print "<HTML><HEAD><TITLE>Environment Variables</TITLE></HEAD><BODY>";
print "<H2>Environment Variables:</H2>";
print "<HR>\n";
foreach $evar( keys (%ENV)){
print "<B>$evar:</B> $ENV{$evar}<BR>";
}
print "</BODY></HTML>\n";
I used ab to execute this script as ordinary CGI with the following command line:
# ab -n 10000 -c 20 192.168.1.1:80/cgi-bin/environ.cgi
Here, -n represents the number of requests to make, and -c indicates the number of con-
current connections to my server that would be opened by ab.
I collected statistics on 10,000 requests to /cgi-bin/environ.cgi, and then I executed
the following command line to collect the same statistics on /perl/environ.cgi. These
requests are handled by mod_perl and Apache::Registry.
# ab -n 10000 -c 20 192.168.1.1:80/perl/environ.cgi
The results of my benchmark test (Tables 8.1 and 8.2) show that the number of requests
mod_perl can answer is roughly 350 percent of the figure for unmodified Apache. (Both
test runs used server port 80 and a concurrency level of 20, and neither produced any
failed requests.)
If the additional efficiency that could be obtained through persistent database connection
sharing were introduced, these numbers would be even more impressive. That’s why I ran
the second set of tests shown in Tables 8.3 and 8.4.
The results in these tables show an even more remarkable improvement in speed using
mod_perl. This example queries the zipcodes MySQL database 1000 times, using this
command line:
# ab -n 1000 -c 20 192.168.1.1:80/cgi-bin/zipcodes.cgi?zip="35016"
I then ran the same test through mod_perl, using this command:
# ab -n 1000 -c 20 192.168.1.1:80/perl/zipcodes.cgi?zip="35016"
This test really gave mod_perl a chance to shine. It not only takes advantage of the
embedded Perl interpreter, which eliminates the shell process creation overhead associated
with CGI, but also allows Apache::Registry to open a database connection and pass the
database handle to processes that ordinarily would have to open and close their own con-
nections. With absolutely no attempt to optimize mod_perl, I saw an increase of nearly
1400% in the number of connections served per second. I’m no benchmarking expert, and
these results are from something less than a controlled scientific experiment, but they were
enough to convince me that mod_perl runs circles around conventional CGI.
(Both test runs again used server port 80 and a concurrency level of 20, requesting the
document paths /cgi-bin/zipcodes.cgi?zip="35801" and /perl/zipcodes.cgi?zip="35801";
neither produced any failed requests.)
Figure 8.4 shows the mod_perl status page (provided by the Apache::Status module),
which was invoked on my server using the URL http://jackal.hiwaay.net/perl-status.
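If your configuration does not already map the status page, a minimal block like the following enables it; the /perl-status location is the conventional choice, and on a production server you would normally also restrict access to it:
<Location /perl-status>
    SetHandler perl-script
    PerlHandler Apache::Status
</Location>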
The first part of the display shows the embedded Perl version, along with the server string
that identifies many of the capabilities of the server, as well as the date when the server
was started. The lower part of the display is a menu of links to more detailed information
about specific areas. Particularly check out the Loaded Modules page, shown in Figure 8.5.
This page lists all of the Perl modules that have been compiled and are held in cache by
Apache::Registry. When Apache is first started, this page shows the 21 modules loaded
into cache and available to be run by the embedded Perl interpreter.
These are the modules preloaded by mod_perl and required to implement and support it.
After running a single script (the Perl/CGI zipcode query), I checked this page again, and
discovered that the following scripts had been loaded, compiled, and cached:
Apache::Registry
CGI
CGI::Carp
CGI::Util
DBI
DBD::mysql
These are the modules that were called for use in my query program, except for
Apache::Registry, which is loaded at the same time as the first CGI request. (Remember
that in httpd.conf, we specified Apache::Registry as the handler for scripts called with
the URL /perl/*.)
The modules CGI.pm and DBI.pm have been loaded, as well as the database-dependent
module for MySQL. These have actually been loaded for only one httpd daemon. If that
daemon receives another request for a Perl/CGI script that needs these modules, they do
not have to be loaded or compiled, and there is no need to spawn a Perl process to run
them because there is one already running (mod_perl’s embedded Perl interpreter). This
gives Perl scripts blinding speed under mod_perl.
Use the mod_perl status pages to ensure that the module is properly installed and is caching
your scripts as they are executed. Particularly, ensure that the most frequently accessed
scripts are preloaded when Apache starts. By viewing the status pages immediately after
starting Apache, you can see which scripts are preloaded, compiled, and cached for arriving
requests.
mod_perl also provides a set of Perl functions that correspond to functions in the Apache
API. (These were formerly accessible only to programs written in C.)
The following configuration directives are defined by mod_perl, each corresponding to a
different phase in the Apache request processing cycle:
PerlHandler - Perl Content Generation handler
PerlTransHandler - Perl Translation handler
PerlAuthenHandler - Perl Authentication handler
PerlAuthzHandler - Perl Authorization handler
PerlAccessHandler - Perl Access handler
PerlTypeHandler - Perl Type check handler
PerlFixupHandler - Perl Fixup handler
PerlLogHandler - Perl Log handler
PerlCleanupHandler - Perl Cleanup handler
PerlInitHandler - Perl Init handler
PerlHeaderParserHandler - Perl Header Parser handler
PerlChildInitHandler - Perl Child init handler
PerlChildExitHandler - Perl Child exit handler
PerlPostReadRequestHandler - Perl Post Read Request handler
PerlDispatchHandler - Perl Dispatch handler
PerlRestartHandler - Perl Restart handler
A caution may be in order here: you are entering real programmers' territory. In practice,
though, all of these handlers are very easy to use. They allow you to specify Perl code to
perform functions at various stages during the handling of an HTTP request without
having to use the specialized functions of the Apache API (although those are still avail-
able). The most important of these directives is PerlHandler, which defines a Perl module
that is called by Apache during the Content Generation phase immediately after a docu-
ment is retrieved from disk. The module defined by PerlHandler can do whatever it
wants with that document (for example, it is this handler that is used to parse an SSI doc-
ument). Previously, I showed how to use this directive to define Apache::Registry as the
handler for scripts identified (by the /perl/ in their URL) to be run under mod_perl.
Listing 8.10 illustrates a very simple Perl logging program to write request information to
a MySQL database. The $r in this example is an object that represents the HTTP request
headers and is extracted from another object that is passed to the script by mod_perl,
which contains everything Apache knows about the HTTP request being processed.
package Apache::LogMySQL;
use strict;
use Apache::Constants qw(OK);
use Apache::Util qw(ht_time);
# use DBI;   # needed only if Apache::DBI was not preloaded in startup.pl
sub handler {
my $orig = shift;
my $r = $orig->last;
my $date = ht_time($orig->request_time, '%Y-%m-%d %H:%M:%S', 0);
my $host = $r->get_remote_host;
my $method = $r->method;
my $url = $orig->uri;
my $user = $r->connection->user;
my $referer = $r->header_in('Referer');
my $browser = $r->header_in('User-agent');
my $status = $orig->status;
my $bytes = $r->bytes_sent;
my $dbh =
DBI->connect("DBI:mysql:mydblog:jackal.hiwaay.net","root","password")
|| die $DBI::errstr;
# The table and column names below are assumed for illustration;
# adjust the INSERT statement to match your own logging table.
my $sth = $dbh->prepare("INSERT INTO access_log
(date,host,method,url,user,browser,referer,status,bytes)
VALUES (?,?,?,?,?,?,?,?,?)") || die $dbh->errstr;
$sth->execute($date,$host,$method,$url,$user,
$browser,$referer,$status,$bytes) || die $dbh->errstr;
return OK;
}
1;
__END__
If this file is saved as LogMySQL.pm under the Apache package directory (/usr/lib/
perl5/site_perl/5.005/Apache on my system), it can be specified as a handler for the
logging phase of Apache’s HTTP request cycle with the single directive:
PerlLogHandler Apache::LogMySQL
Each time a request is handled, at the Log Handler phase, this program is called. Note
that it creates its own namespace (Apache::LogMySQL). There's not a lot to know about
this application, except that $r refers to the Apache request object, and all the informa-
tion required for the log is retrieved from that object. A special function, ht_time() in the
Apache::Util module, is used to format the request timestamp that is logged. Also note
the commented use DBI line; that line is required only if use Apache::DBI was not
specified in a startup.pl script so that database connections will be shared. In this
example, since Apache::DBI is used, each time this handler script calls DBI->connect, it
is handed a database handle for a connection already opened (by Apache::DBI) to use.
The handle is returned to Apache::DBI's cache when the script finishes and is reused for
later requests.
This example is a bare skeleton of what is required to set up a Perl handler. Although it
is a real example, it is minimal. You should evaluate DBI logging modules already written
(Apache::DBILogConfig or Apache::DBILogger) before you write your own, although
you may want to do it just for fun. Look for Apache logging modules at http://
search.cpan.org/.
To use <Perl> sections, you need only create variables with the same names as valid con-
figuration directives and assign values to these, either as scalars or as Perl lists, which are
interpreted later as space-delimited strings. In other words, if you wanted to create a Port
directive and assign it the value 80, you could use the following <Perl> section:
<Perl>
$Port=80;
</Perl>
When Apache is started and this configuration file is parsed, these variables are converted
to regular configuration directives that are then treated as though they were read directly
from httpd.conf. A couple of examples will illustrate how this works. Here, a Perl sec-
tion is used to configure some general server directives:
<Perl>
@PerlModule = qw(Apache::Include Apache::DBI CGI);
$User="wwwroot";
$Group="wwwgroup";
$ServerAdmin="[email protected]";
__END__ # All text following this token ignored by preprocessor
</Perl>
The following example illustrates how hashes are used to store the contents of container
directives; nested containers are stored as nested Perl hashes. The directive values inside
this hash are only illustrative:
<Perl>
$Directory{"/secure/"} = {
    Options => "None",
    AllowOverride => "AuthConfig"
};
</Perl>
Of course, the Perl sections in these examples offer no benefit over the use of ordinary
configuration directives. The real benefit would be in cases where Perl code dynamically
creates (potentially hundreds of) virtual hosts. Suppose, for example, that we had a text
file that consisted of virtual host definitions, one per line, stored as sites.conf. This is
a very simple example that does virtually no sanity checking, but it could be used to
generate a number of IP-based virtual hosts. Whenever virtual hosts in the list need to be
added, deleted, or modified, the change is made to sites.conf, and httpd.conf doesn't
need to be changed.
<Perl>
open SITECONF, "< /usr/local/apache/conf/sites.conf" or die "$!";
while (<SITECONF>) {
chomp;
next if /^\s*#/ || /^\s*$/; # Skip comments & blank lines
my @fields = split(/:/,$_,-1);
die "Bad sites.conf file format" unless scalar(@fields)==6;
my ($sitename, $sadmin, $ip, $http_dir, $errlog, $tfrlog)= @fields;
$VirtualHost{$ip} = {
ServerName => $sitename,
ServerAdmin => $sadmin,
DocumentRoot => "/home/httpd/".$http_dir,
ErrorLog => "logs/".$errlog,
TransferLog => "logs/".$tfrlog
};
}
close SITECONF;
__END__
</Perl>
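The split in the loop above implies a colon-delimited sites.conf with six fields per line: server name, administrator address, IP address, document directory (relative to /home/httpd), error log, and transfer log. A line might look like this (the values are invented for illustration):
www.example.com:[email protected]:192.168.1.10:example:example-error_log:example-access_log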
If you choose to use Perl sections to configure virtual hosts dynamically, remember that
you can run httpd -S to display the virtual host configuration.
In Sum
The earliest method of interfacing external programs to the Apache Web server is the
Common Gateway Interface, once the de facto standard for programming Web applica-
tions. CGI remains a viable technique for Apache programming.
The biggest drawback of traditional CGI (poor performance) has been largely eliminated,
first by the use of FastCGI (implemented in the Apache module mod_fastcgi) and more
recently by mod_perl. Both of these Apache add-ons eliminate the overhead of starting a
new Perl interpreter process each time a CGI script is called. The mod_perl module goes
a step farther and uses a caching mechanism to ensure that scripts, once compiled into Perl
pseudo-code, are available for subsequent invocation without requiring recompilation.
This chapter has shown how to modify Apache to use these programming techniques, and
it has illustrated the use of each with a simple, yet useful, application that queries a rela-
tional database using user-entered data as a search key. The next chapter examines pro-
gramming techniques that are somewhat newer than CGI, each of which has garnered a
lot of attention and a large number of devotees in the past few years.
9
Other Apache Scripting/Programming Tools