Copyright ©1996, Que Corporation. All rights reserved. No part of this book may be used or reproduced in any form or by any means, or stored in a database or retrieval system without prior written permission of the publisher except in the case of brief quotations embodied in critical articles and reviews. Making copies of any part of this book for any purpose other than your own personal use is a violation of United States copyright laws. For information, address Que Corporation, 201 West 103rd Street, Indianapolis, IN 46290 or at support@mcp.com.

Notice: This material is excerpted from Running A Perfect Web Site with Apache, ISBN: 0-7897-0745-4. The electronic version of this material has not been through the final proof reading stage that the book goes through before being published in printed form. Some errors may exist here that are corrected before the book is published. This material is provided "as is" without any warranty of any kind.

Chapter 06 - Managing an Internet Web Server

This chapter deals with making your server robust, efficient, automated, and secure.

One of the biggest strengths of the Apache Web server is that it is highly tunable: just about every feature that imposes any sort of extra server load is optional, which means you can sacrifice features for speed if you need to. That said, Apache is designed for speed and efficiency even with all its features enabled; for all but the most CGI-intensive sites, you'll probably saturate a full T1's worth of bandwidth before exhausting the resources of a well-constructed Linux/Pentium box.

Apache has also been designed to give site administrators control over where to draw the line between security and functionality. For some sites with many internal users, such as an Internet service provider, being able to control the policies toward what functionality can be used where is important. Meanwhile, a Web design shop might want complete flexibility, even if it means that an errant CGI script could expose a security hole or do damage. In fact, many people feel that CGI in general is one big security hole, but we'll get to that in a bit.

In this chapter, you learn about:

  • Server child process control

  • Increasing efficiency

  • Hardware issues

  • Log file rotation

  • Security issues

Make no mistake, this is a somewhat technical chapter. You may want to come back to this chapter later.

Server Child Process Control

As described earlier, Apache employs the concept of a "swarm" of semi-persistent daemons, sometimes also called "children," running and answering queries simultaneously. While the size of that swarm varies, there are limits as to how large it can get, and how quickly or slowly it can grow. This is critical; one of the main performance problems with older servers that executed a fork() system call at every request was that there was no way to control the total number of simultaneous daemons, so when the machine's main memory was exhausted and it started swapping, the machine would simply become unusable. This was colloquially called daemon-spamming.

Some other servers out there let you specify a fixed number of processes, with the "fork for every request" behavior kicking in if all the children are busy when a new request comes in. This is also not the best model. Not only do many people set that fixed number too high (having 30 children running when only 5 are needed, which can hinder performance), but this model also removes the protection against daemon-spamming.

So, the Apache model is to start out with a certain number of persistent processes, and make sure you always keep some number (actually, a range somewhere between a minimum and a maximum) of processes "spare" to handle a wave of simultaneous requests. If you have to launch a few more processes to maintain the minimum number of spares, no problem. If you find yourself with more idle servers than your maximum number of spares, the excess idle ones can be killed. There is a maximum number of processes, beyond which no more will be launched, to protect the machine against daemon-spamming.

This algorithm is configured using the following configuration directives:

     StartServers    10
     MinSpareServers 5
     MaxSpareServers 10
     MaxClients      150

The numbers given above are the defaults. This says that when Apache launches, 10 children (StartServers) are automatically launched, regardless of the request load at start. If all 10 children are busy, more are forked until there are enough running to answer all requests as fast as they are received, with at least 5 (MinSpareServers) but not more than 10 (MaxSpareServers) free servers to deal with "spikes" in requests (i.e., when a sudden burst of requests comes in well within half a second of each other). Incidentally, these spikes are often caused by browsers that open a separate TCP connection for each inline image in a page in an attempt to improve perceived performance to the user, often at the expense of the server and network.

Usually a stable number of simultaneous child processes is reached, but if the requests are just pouring in (you've installed the Pamela Anderson Fan Club page on your site, for example), then you might reach the MaxClients limit. At that point, requests will queue into your kernel's "listen" queue, waiting to get served. If still more pour in, your visitors will eventually see a "connection refused" message. However, this is still preferable to leaving the number of simultaneous processes unlimited, since the server would just launch children with wild abandon and start daemon-spamming, resulting in nobody getting any response from the server at all.
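The policy described above can be sketched as a small shell function. This is only an illustration of the decision rule, not Apache's actual code; the function name spare_decision and its argument order are made up for the example.

```shell
#!/bin/sh
# Hypothetical sketch of the spare-server policy; not Apache source code.
# Arguments: idle children, total children, MinSpareServers,
#            MaxSpareServers, MaxClients.
# Prints one of: spawn, kill, hold.
spare_decision() {
    idle=$1
    total=$2
    min_spare=$3
    max_spare=$4
    max_clients=$5

    if [ "$idle" -lt "$min_spare" ] && [ "$total" -lt "$max_clients" ]; then
        echo spawn      # too few spares and still under the hard limit
    elif [ "$idle" -gt "$max_spare" ]; then
        echo kill       # more idle children than we are willing to keep
    else
        echo hold       # within the spare range, or capped by MaxClients
    fi
}

spare_decision 2 40 5 10 150     # prints: spawn
spare_decision 14 40 5 10 150    # prints: kill
spare_decision 0 150 5 10 150    # prints: hold
```

The last call shows the MaxClients cap at work: there are no spare children at all, yet nothing more is launched, and further requests wait in the kernel's listen queue.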

It is recommended that you do not adjust MaxClients, as 150 is a good number for most systems. However, you might be itching to see how many requests you can handle with that multiprocessor Sun Sparc 1000 with a gigabyte of RAM; in that case, setting MaxClients much higher makes sense. On the opposite end of the spectrum, you might be running the Web server on a machine with limited memory or CPU resources, and you might want to make sure that Apache doesn't consume all of the machine's resources, at the cost of possibly not being able to serve all requests that come to your site. In that context, setting MaxClients lower makes sense.

The Scoreboard File and MMAP

Because this multiprocess model required some decent communication between the "parent" and child processes, the most cross-platform method of performing that communication was chosen: a "scoreboard" file, where each child had a chunk of space in the file to which it was authorized to write, and the parent httpd process watched that file to get a status report and make decisions about whether to launch more child processes or kill idle processes.

At first, this file was located in the /tmp directory, but after hearing of problems regarding UNIX setups that regularly clear out /tmp directories (causing the server to go haywire), the scoreboard file has since been moved into the log directory. You can configure where this goes exactly with a "ScoreBoardFile" directive.

There is a program in the /support subdirectory in the Apache distribution called httpd_monitor. It can be run against the scoreboard file to give a picture of the state of all the child processes and whether they are just starting, active, sleeping, or dead. It can give you a good idea of whether your settings for MaxSpareServers and MinSpareServers are decent. Consider it a close equivalent to the UNIX system command iostat.

There is a more efficient mechanism under some UNIX variants for this, however, using the mmap() system call. For those platforms that support it, Apache 1.1 now uses this functionality. Unfortunately, it means that there is no longer any scoreboard file to run httpd_monitor against. If you want the file back, find the #define for HAVE_MMAP in the section of conf.h relevant to your OS, remove it, and recompile. Apache 1.1 also implements a similar system for System V-based UNIX variants using "shared memory."

Apache is in the stages of being instrumented pretty deeply, which will, hopefully, result in some sort of real-time statistics interface through an HTTP query to the server itself.

Increasing Efficiency in the Server Software

There are many ways to increase performance over the standard setup, including smarter ways to configure your resources, features that can be turned off for better performance, and even things at the operating system and hardware level that can be addressed. All of this makes a difference between a regular Web server and a high performance Web server.

Most non-hardware improvements fall into three categories: that which reduces the load on the CPU, that which reduces the amount of I/O to the disk, and that which reduces the memory requirements.

Server Side Includes

Server side includes can cause both an increased disk access load and an increased CPU load. The CPU penalty comes from having to parse the HTML file looking for the includes; parsing a file is a fair amount more intensive than just reading it and spitting it out to the socket. The disk access penalty comes from having to make two, three, four, or more separate disk accesses to pull together the page to get served. For example, a typical SSI document might have to have a header and footer pulled into memory to get served. That's three disk accesses to pull the document together, instead of one. If the inlined HTML files were large, the difference would not be as large in relative terms, but because they are usually small files, the disk access penalty is large in relative terms. The problem is compounded by any CGI script that might be included as well; if you had an SSI page with two CGI scripts included, you'd probably take at least twice the performance hit you would with a single CGI script that rendered the whole page in the first place.

.htaccess Files

Searching directories for .htaccess files is fairly painful; since they work hierarchically, when a request is made for /path/path2/dir1/dir2/foo, Apache will look for an .htaccess file in every directory along that path. In this case, that's at least five. This is a significant disk access load that's best to avoid if possible.
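To see where the "at least five" comes from, the following sketch lists every location that would be checked for the example request. It is purely illustrative: the function name htaccess_candidates is made up, and it treats the argument as a plain absolute filesystem path.

```shell
#!/bin/sh
# List the directories that would be checked for a .htaccess file when
# serving the file at the given absolute path (illustrative only).
htaccess_candidates() {
    case $1 in
        /*) ;;                   # absolute path: carry on
        *)  return 1 ;;          # anything else is not handled here
    esac
    dir=$(dirname "$1")          # directory containing the requested file
    list=""
    while [ "$dir" != "/" ]; do
        # Prepend, so the output runs from the root downward.
        list="$dir/.htaccess
$list"
        dir=$(dirname "$dir")
    done
    printf '%s\n%s' "/.htaccess" "$list"
}

htaccess_candidates /path/path2/dir1/dir2/foo
```

For the example path this prints five candidate files, one for the root and one for each of the four directories beneath it, and every one of them costs a disk lookup on every request.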

To solve this, you should put anything controlled via your .htaccess files into the access.conf configuration file or even the srm.conf file. If you have to look for .htaccess files in subdirectories and you can narrow it down to a specific subdirectory, it's possible to have the server only look for .htaccess files in that subdirectory by the use of AllowOverride.

For example, suppose your document root was in /www/htdocs, and you want to turn off the searching for .htaccess files, except for in /www/htdocs/dir1/dir2 and everywhere below. You would put something like the following into your access.conf configuration file:

     <Directory /www/htdocs>
     Options All
     AllowOverride None
     </Directory>
     <Directory /www/htdocs/dir1/dir2>
     Options All
     AllowOverride All
     </Directory>

It's important that they are listed in that order, so that the more specific settings in the second <Directory> section are applied after those of the first and override them.

Using .asis Files For Server-Push Animations

.asis files, as you read about earlier, are distinguished by having their HTTP headers directly embedded in the file itself. They are a useful optimization for certain types of files, like server-push animations, which demand the ability to set their own headers and are usually dished out by CGI scripts. The usual server-push CGI script has the additional overhead of assembling the images on the fly, whereas with an .asis file, the whole stream can be linked into one file, reducing the I/O hit. Using .asis also helps the memory and CPU performance situation.

The only thing one loses is the ability to do "timed" pushes, where there is a lapse of time between frames implemented as a "sleep()." But because server-push is also bandwidth-limited, many consider that to be a dubious feature.
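As a rough sketch of how such a file might be assembled, the function below concatenates a set of GIF frames into a single .asis file. It assumes the multipart/x-mixed-replace format used by Netscape-style server push, which this chapter does not spell out; the function name build_asis and the boundary string are arbitrary choices for the example.

```shell
#!/bin/sh
# Build a server-push .asis file from a list of GIF frames.
# Usage: build_asis out.asis frame1.gif frame2.gif ...
# Assumes the multipart/x-mixed-replace layout; "FrameBoundary" is an
# arbitrary boundary string chosen for this sketch.
build_asis() {
    out=$1
    shift
    boundary="FrameBoundary"
    {
        # Headers embedded in the file, then the blank line that ends them.
        printf 'Content-type: multipart/x-mixed-replace;boundary=%s\n\n' "$boundary"
        for frame in "$@"; do
            printf '%s\n' "--$boundary"
            printf 'Content-type: image/gif\n\n'
            cat "$frame"
            printf '\n'
        done
        printf '%s\n' "--$boundary--"      # closing boundary ends the stream
    } > "$out"
}
```

Calling build_asis anim.asis frame1.gif frame2.gif frame3.gif once, ahead of time, replaces the work a CGI script would otherwise repeat on every request.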

Quick Tips on Hardware Issues

While this book is mostly about a piece of software, there are definitely some decisions one can make about the hardware setup to support the server, particularly heavily loaded servers.

Separate Disks for Logs and Data

One of the biggest causes of a disk access bottleneck is having your Web logs and data on the same disk. Put your log files on a separate disk; in fact, if you can, put them on a separate SCSI chain altogether. If your Web server has a higher load average than you'd like, but there's still idle CPU time, you're probably disk access bound. This situation is fairly easy to correct.

Disk Caching in Memory

In many cases, the biggest hit in servicing an HTTP request comes from physically pulling the file from disk and pumping it to memory and then off to the client. If you have disk caching on, you can dramatically cut the response time for simple, small, non-CGI requests.

Kernel Modifications

Many older UNIX variants, and even some recent ones, have certain basic TCP/IP implementation problems that cause heavily visited sites to take forever to respond to requests, or even just freeze up, even though the actual load on the machine might be very low. Because HTTP/1.0 does not keep a connection open between requests, a page with 10 inlined images necessitates 11 separate TCP connections (simultaneously, on some browsers), so the TCP implementation in a UNIX kernel really gets a workout on a heavily loaded Web server. Most were not designed to handle 100 connections per second, though now that is becoming a design requirement.

Because every operating system is different, it doesn't make sense to go into too much depth here. However, some of the things to watch for are:

  • SOMAXCONN - usually a kernel option in a configuration file, or for those operating systems that come with source, sometimes a compile-time option. This is the maximum number of socket connections that can be maintained in the listen() queue. On many operating systems, this is set to somewhere around five, maybe 16. There is no harm, and a lot of good, in increasing this to 64, 128, or even 256. You don't want it much larger than that; the data structures can get too big and take up more memory than they should, but 128 is a good number for moderate to heavy sites.

  • Network mbufs - again, this is usually either a compile-time or runtime option for most operating systems. Increasing this will cause the runtime memory requirements of the kernel to increase slightly, but will allow for better network throughput.

The Apache Project maintains a page on specific details about tuning for different platforms at http://www.apache.org/docs/perf.html.

Log File Rotation

Certainly one goal for the site administrator should be to automate the rotation of access and error logs. Even a lightly loaded server will generate a couple megabytes of log activity per day. Left unchecked, your disk space could dry up fast.

The most basic element of log file rotation is to get the Web server to stop writing to the old log and start writing to another without disrupting service to the outside users. The most straightforward way to accomplish this is by renaming the log just slightly, and sending a HUP signal to the parent process. "Just slightly" means renaming it to "access_log.0" or something similar on the same hard disk, on the same partition. Why? Each child has a file descriptor open to the log file. When you rename the file, the file descriptor will still point to the same actual log right up until the time the child receives an "echo" of the SIGHUP from the parent process. When that happens, the file descriptor is closed, a new one is obtained, and the new "access_log" gets created. This is pretty much the only way to guarantee not losing traffic reports while rotating logs.

Here is an example script that performs such a rotation:

#!/bin/sh
logdir="/usr/local/etc/httpd/logs"   # name of the log directory
acclog="access_log"                  # name of the access log
errlog="error_log"                   # name of the error log
pidfile="$logdir/httpd.pid"          # file that stores the parent's
                                     # process ID

mv "$logdir/$acclog" "$logdir/$acclog.0"
mv "$logdir/$errlog" "$logdir/$errlog.0"
kill -HUP `cat "$pidfile"`

This needs to be run as the same user that launched the HTTP daemon originally, for example "root." You may want to write additional scripts to place these ".0" files into an archive of some sort; my favorite approach is to use the year and month as subdirectories, so that the logs for January 1st, 1996, go into a file named "1996/01/01" somewhere under a directory with a lot of room. That way, it's easy to archive a whole period somewhere else (to DAT tape or CD-ROM), or to remove it, just by moving a directory.
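A minimal sketch of such an archiving step follows. The function name archive_log and the archive location are hypothetical, and the sketch departs slightly from the naming above by appending the log's name to the day, so that the access and error logs for the same day do not overwrite each other. It stamps the file with the date on which it runs, so a job that runs just after midnight would need adjusting.

```shell
#!/bin/sh
# Move a rotated log into an archive laid out as YEAR/MONTH/DAY.name.
# Usage: archive_log /path/to/access_log.0 /path/to/archive
# The archive root and the DAY.name file naming are hypothetical choices.
archive_log() {
    rotated=$1
    archive_root=$2
    name=$(basename "$rotated" .0)             # e.g. access_log
    month_dir="$archive_root/$(date +%Y/%m)"   # e.g. .../1996/01
    mkdir -p "$month_dir" || return 1
    mv "$rotated" "$month_dir/$(date +%d).$name"
}
```

For example, archive_log /usr/local/etc/httpd/logs/access_log.0 /archive/weblogs, run on January 1st, 1996, would leave the log in /archive/weblogs/1996/01/01.access_log (the /archive/weblogs path is only an example).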

Security Issues

The security of your server is, no doubt, one of your biggest concerns as a Web site administrator. Running a Web server is, by its nature, a security risk. For that matter, so is plugging your machine into a network at all. However, there is a lot that can be done to make your Web server more secure, both from external forces (people trying to break into your site) and internal forces (your own Web site users either mistakenly or willingly opening up holes).

CGI Issues

The biggest cause for concern about protecting your site from external threats is CGI scripts. Most CGI scripts are shell-based, using either Perl or C-shell interpreted programs rather than compiled programs. Thus, many attacks have occurred by exploiting "features" in this system. CGI security is covered in the CGI section of this book, so this section won't go into too much detail about how to make CGI scripts themselves safe. There are a couple of important things you should know, however, as an administrator.

When a CGI script is run, it is running with the user-ID of the server child process. In the default case, this is "nobody." To adequately protect yourself, you may want to consider the "nobody" user an untrustworthy user on your site, making sure that user does not have read permission to files you want to keep private and does not have write permission anywhere sensitive. Certain CGI scripts will demand write access to certain files (for example, for a "guestbook" application). So if you want to enable those types of applications, it's best to specify a directory to which CGI scripts can write without worrying about a malicious or misdirected script overwriting data that it shouldn't.
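One quick way to audit this is to list everything in a tree that is world-writable, since the "nobody" user can write to all of it. The sketch below only checks the "other" write bit; files owned by "nobody" itself, or writable through a group that "nobody" belongs to, would need a separate check. The function name world_writable is made up for the example.

```shell
#!/bin/sh
# List files and directories under a tree that any user, including the
# server's "nobody" user, could write to (the "other" write bit is set).
world_writable() {
    find "$1" \( -type f -o -type d \) -perm -0002 -print
}
```

Running world_writable /www/htdocs should ideally print nothing except the one directory you have deliberately set aside for CGI scripts to write to.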

Furthermore, site administrators can limit the use of CGI to specific directories using the ScriptAlias directive. Alternatively, if one has turned on .cgi as a file extension for CGI scripts, one can use the Options ExecCGI directive in access.conf to further control its use.

An example of this follows. If you want to allow for CGI to be used everywhere on the site (with a document root of /home/htdocs) except for the "users" subdirectory because you don't trust your users with CGI scripts, then your access.conf should look something like the following:

<Directory /home/htdocs/>
Options Indexes FollowSymLinks Includes MultiViews ExecCGI
AllowOverride None
</Directory>

<Directory /home/htdocs/users/>
Options Indexes SymLinksIfOwnerMatch IncludesNOEXEC MultiViews
AllowOverride None
</Directory>

Because ExecCGI isn't in the "Options" list for the second directory, no one can use CGI scripts there.

Unfortunately, there really is no middle ground between allowing CGI scripts and disallowing them. Currently, most languages used for CGI programs do not have security concepts built into them, so rules like "don't touch the hard disk" or "don't send the /etc/passwd file in e-mail to an outside user" need to be enforced in the same manner as if you had an actual UNIX user who needed the same restrictions applied to him or her. Maybe this will change when Sun's Java language gets more use on the server side, or when people use raw interpreted languages less and higher-level programming tools more often.

Server Side Includes

As you can see from the previous example, there was another change between the "trusted" part of the server and the "untrusted" part: the Includes argument to "Options" was changed to IncludesNOEXEC. This allows your untrusted users to use server side includes without allowing the "#include" of CGI scripts or the #exec command to be run. The #exec command is particularly troublesome in an untrusted environment because it basically gives shell-level access to an HTML author.

Symbolic Links

In an untrusted environment, UNIX symbolic links also are a concern for the Web site administrator. A malicious user could very easily create a symbolic link from a directory where he has write permission to an object or resource, even outside the document root, to which all he needs is read permission. For example, one could create a link to the /etc/passwd file, and then release that onto the Web, exposing your site to potential crack attempts, particularly if your operating system does not use shadow passwords.


There was a recent incident involving the Alta Vista search engine (http://www.altavista.digital.com), in which a search for words common to password files (bin, root, ftp, and so on) turned up references to actual password files that had, intentionally or not, been left public. These included a few with the encrypted passwords, which were easy enough to break with a few hours of CPU time on most workstations.

To protect against this, the site administrator has two options: to only allow symbolic linking if the owner of the link and the owner of the linked-to resource are the same by using SymLinksIfOwnerMatch, or to disallow symbolic links altogether by not specifying FollowSymLinks or SymLinksIfOwnerMatch.
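Whichever option you choose, it can be useful to audit what is already there. The following sketch lists symbolic links under a document root whose targets resolve outside it. It relies on "readlink -f", which is widely available (GNU coreutils, recent BSD and macOS) but is not guaranteed on every UNIX; the function name links_leaving_root is made up for the example.

```shell
#!/bin/sh
# List symbolic links under a document root whose targets resolve to a
# location outside that root. Assumes "readlink -f" is available.
links_leaving_root() {
    root=$(cd "$1" && pwd -P) || return 1
    find "$root" -type l -print | while IFS= read -r link; do
        target=$(readlink -f "$link") || continue
        case $target in
            "$root"|"$root"/*) ;;               # stays inside the tree
            *) printf '%s -> %s\n' "$link" "$target" ;;
        esac
    done
}
```

A link such as /home/htdocs/users/joe/pw pointing at /etc/passwd (a hypothetical example) would show up in the output of links_leaving_root /home/htdocs.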

Also note that both <Directory> segments in the previous example included AllowOverride None. That is the most conservative setting; if you want to allow certain things to be tunable in those directories using .htaccess files, then you can specify them with the AllowOverride directive. However, stating "none" is the safest policy.

Publicly-Writeable Spaces

The last security threat that is specific to Web servers is that of allowing publicly-writeable spaces to be served up via HTTP. For example, many sites out there allow their FTP "incoming" directory to be accessed via the Web directly. This can be a security hole if someone were to place a malicious CGI script there or a server side include file which calls "#exec" to do some damage. If you decide you need to take the risk of providing this service, there are some things you can do to protect yourself.

First, the most conservative setting you should set for the "Options" directive is:

Options Indexes

You could use "None," but "Indexes" really doesn't introduce any additional security problems, as long as you're comfortable with others being able to download anything that has been submitted. In the light of recent legislation by the U.S. government regarding "indecent" materials, you may not want to take this risk either.

Second, make sure you set AllowOverride None so that people can't upload an .htaccess file into your directory and modify all your settings and security policies.

Third, make sure that the FTP daemon you are using does not allow the execute bit to be set. By preventing that, you prevent the execution of uploaded CGI scripts. If you are using XBitHack to activate your server side includes, then you can prevent those from being run as well. This is mainly a backup for setting the Options as above, which should protect you against these threats anyway.

These same rules apply if you have CGI scripts that generate their own uniquely addressable HTML or CGI files. For example, if the "guestbook.cgi" program constantly appends the submitted personal information to a "guestbook.html" file, then the contents of that HTML file must be considered unsafe. This can be improved if the CGI script double-checks what's getting written and removes "dangerous" code, such as server side includes.
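As a minimal sketch of that kind of double-checking, the filter below rewrites the opening of any server side include directive so that it is displayed as text instead of being executed. It handles only that one pattern; a real guestbook would more likely escape every "<", ">", and "&" in submitted text. The function name neutralize_ssi is made up for the example.

```shell
#!/bin/sh
# Filter for visitor-supplied text: turn the opening of any server side
# include directive ("<!--#") into harmless literal text before the
# entry is appended to a served file.
neutralize_ssi() {
    sed 's/<!--#/\&lt;!--#/g'
}

printf '%s\n' 'Nice site! <!--#exec cmd="cat /etc/passwd" -->' | neutralize_ssi
```

The example prints the entry with the directive's opening "<" replaced by "&lt;", which a browser renders as ordinary text and the server no longer recognizes as an include.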

