Copyright ©1996, Que Corporation. All rights reserved. No part of this book may be used or reproduced in any form or by any means, or stored in a database or retrieval system without prior written permission of the publisher except in the case of brief quotations embodied in critical articles and reviews. Making copies of any part of this book for any purpose other than your own personal use is a violation of United States copyright laws. For information, address Que Corporation, 201 West 103rd Street, Indianapolis, IN 46290 or at support@mcp.com.

Notice: This material is excerpted from Running A Perfect Web Site with Apache, ISBN: 0-7897-0745-4. The electronic version of this material has not been through the final proof reading stage that the book goes through before being published in printed form. Some errors may exist here that are corrected before the book is published. This material is provided "as is" without any warranty of any kind.

Chapter 13 - CGI Scripts and Server APIs

By now, you know enough to get your server up and running. You can create and publish HTML documents and display images. You can sit and watch people from around the world log onto your server, but you still seem to have this nagging feeling that there's something missing. Maybe your server should be doing...something. You know, something like searching documents or even displaying one of those cute little access counters that tell how many people have linked to your page. Your server just seems a little reactive and not interactive.

It's time to talk about creating dynamic documents on your server. One of the earliest means of creating documents on-the-fly is through the use of the Common Gateway Interface (CGI). Using CGI scripts, your server can interact with third-party applications that can execute document searches, query databases, or return a dynamically-created HTML page. You can give your server that customized and professional feel.

There are other means of creating interactive services on your Web servers. Server Side Includes are customized capabilities offered by Web server applications that allow you to produce dynamic documents but without the resource overhead of CGI script processing. In addition, some servers offer the ability to directly extend the capability of the server program itself using a technology known as Application Programming Interfaces (APIs).

This chapter covers the following topics:

  • An introduction to the CGI standard and CGI scripting
  • Use of Server Side Includes
  • Use of Web server APIs

Introduction to CGI

Normally, Web servers respond to requests from Web browsers in the form of HTML documents and images. The browser sends a URL to the server and the server sends the file, whether it's an HTML document, GIF or JPEG graphic, sound file, or movie, to the browser via an HTTP connection. Sometimes, the browser sends a URL that does not point to a document but instead points to an application. The server activates this application which then responds to the browser with the requisite information. This application is a CGI script. This section covers how this script interacts with the Web server and browser.

One important feature of HTML 2.0 is the capability for Web designers to use the language to create interactive forms. These forms collect data entered by the user; the Web browser processes this data and sends it via an HTTP request to a Web server. Usually, the Web server will receive requests for HTML documents or graphic images. However, the HTML form implies that a specific action is requested of the server. With this type of request, the server knows not to interpret the form data itself but to pass the information to the CGI script specified in the HTML form page.

The CGI script is actually a third-party application developed in a language such as C, C++, PERL, Visual Basic, or really any language supported by the operating system in which the server is running. However, some languages lend themselves to CGI scripting more than others and we'll discuss those later in this chapter.

How the CGI Works

The process through which the Common Gateway Interface works is quite simple.

  1. The browser accumulates data from an HTML form and prepares it for transmission to the server.
  2. The server reads the URL enclosed in the browser request and activates the application.
  3. The server relays the information from the HTML form to the CGI application.
  4. The CGI script processes the form data and prepares a response. This processing can include a database query, a numerical calculation, or an imagemap request. The response is usually in the form of an HTML document. However, the response is cleverly phrased by the CGI application to convince the Web browser that it originated from the server.
  5. The CGI application passes the response to the server which immediately redirects it to the Web client. The server does not usually affect the output of the CGI application.

This process is outlined in figure 13.1. Note that the server merely passes information to the script. The script receives the data from the Web server through some mechanism unique to the language in which the script was developed. As long as this mechanism is in place, any programming language can be used to implement a CGI script.

Fig. 13.1 - The CGI script works with the Web server to respond to certain Web browser requests.

Client-Server HTTP Header Formats

Web browsers communicate with Web servers via the HTTP protocol. Not only does this protocol specify the format of the messages exchanged, but it also defines the manner in which the server and browser exchange information. For example, a Netscape Navigator client might send the following text to a Web server for a simple file request:

GET /article1.html HTTP/1.0
Accept: text/html
Accept: image/gif
Accept: image/jpeg
User-Agent: Mozilla/2.0b5 (Windows; I; 32bit)
 ...a blank line...

This message header informs the server that the browser is looking for the file article1.html and intends to use version 1.0 of the HTTP specification. The browser then informs the server as to which file formats it can interpret. In the above message, this list is truncated from what browsers usually express, but the server is informed that the client can interpret several text and graphics MIME types. The browser then informs the server as to its brand of client; in this example, the browser identifies itself as Netscape Navigator. Finally, the browser passes a blank line to complete the request.

The server will respond with a message generally like the following:

HTTP/1.0 200 OK
Date: Thursday, 01-Feb-96 19:15:32 GMT
Server: Apache 1.0.3 
MIME-version: 1.0
Last-modified: Friday, 15-Dec-95 17:54:01 GMT
Content-type: text/html
Content-length: 7562
 ...a blank line...
<HTML><HEAD><TITLE>Article....

In this response, the server provides enough information to allow the browser to process the requested data. The server denotes that it too is providing data using the HTTP v.1.0 protocol. Furthermore, it returns an HTTP code of 200 OK, which tells the browser to relax: the requested file was not only found but is being returned in this message. The date and server type are described in the header. The server type is included because the browser may rely on certain features that other servers do not offer. The server tells the Web client which version of MIME encoding is being used so that the browser can process the data. The browser is also informed as to the MIME type of the data and the size of the file; this last datum is important as it allows the browser to inform the user as to the progress of the data transfer. Finally, a blank line (actually, two carriage return/line feed pairs, i.e. CRLFCRLF) separates the HTTP headers from the body of text.
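The header/body split described above can be sketched in a few lines of Python. This is only an illustration, not how any particular browser is implemented; the function name split_http_message and the shortened sample message are assumptions made for the example.

```python
def split_http_message(raw: bytes):
    """Split a raw HTTP/1.0 message into start line, headers, and body."""
    # The first blank line (CRLFCRLF) separates the headers from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("latin-1").split("\r\n")
    start_line = lines[0]
    headers = {}
    for line in lines[1:]:
        # Split on the first colon only, so values such as dates stay intact.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return start_line, headers, body


# A shortened, made-up response in the same shape as the one shown above.
raw = (b"HTTP/1.0 200 OK\r\n"
       b"Server: Apache 1.0.3\r\n"
       b"Content-type: text/html\r\n"
       b"Content-length: 13\r\n"
       b"\r\n"
       b"<HTML></HTML>")
status, headers, body = split_http_message(raw)
print(status)
print(headers["content-type"], len(body))
```

Comparing len(body) with the Content-length header is what lets a client report how much of a transfer is complete.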

The server needs to be flexible enough to provide the file in a format that is accessible to the client. For example, the server would need to provide a GIF file if a browser, which could only process GIF files, requests a file that is offered in JPEG.


For information about HTTP specifications see http://www.ics.uci.edu/pub/ietf/http/.

As mentioned previously, the HTTP server doesn't usually process output from a CGI application; the response is merely funneled through the server back to the browser. The message, however, must be configured so as to conform to the HTTP message header specifications. We will discuss later in this chapter ways that you can program your CGI script to insert an HTTP header at the beginning of your response to ensure correct processing by a Web browser.
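As a rough sketch of what such a script's output looks like, the Python fragment below writes a Content-type header, a blank line, and then the document. The function name cgi_response and the sample page are hypothetical; the server is assumed to add the status line and the remaining headers itself.

```python
def cgi_response(body: str, content_type: str = "text/html") -> str:
    """Return CGI output: one header line, a blank line, then the body."""
    # The empty line after the header marks the end of the header section.
    return f"Content-type: {content_type}\n\n{body}"


page = "<HTML><BODY><H1>Hello from CGI</H1></BODY></HTML>"
# A CGI script hands its response to the server on standard output.
print(cgi_response(page), end="")
```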

HTML Forms and CGI

By using an HTML form page, you can allow users to enter data that is processed by a CGI script. As discussed earlier in this book, users can enter text and specify options using forms developed with HTML. The types of data input options are as follows:

  • Multiline text entry fields
  • Pop-up selection menus
  • Radio buttons
  • Check boxes

Figure 13.2 shows an example of an HTML form that can be used to transfer data to a CGI application. Note that this sample page contains a text area, check boxes, and radio buttons. The HTML code for this page is shown in listing 13.1.

Listing 13.1 Transfer Data to a CGI Application with This Form

<HTML>
<HEAD>
<TITLE>
Forms Test
</TITLE>
</HEAD>
<BODY>
<FORM ACTION="http://hoohoo.ncsa.uiuc.edu/cgi-bin/post-query" 
   METHOD=POST>
A normal text field:
<TEXTAREA NAME="comments1"></TEXTAREA><p>
<HR>
<DL>Please indicate your favorite holiday:
<DD>
<INPUT TYPE="radio" NAME="holiday" VALUE="Christmas">Christmas
<DD>
<INPUT TYPE="radio" NAME="holiday" VALUE="Thanksgiving">Thanksgiving
<DD>
<INPUT TYPE="radio" NAME="holiday" VALUE="Easter">Easter
<DD>
<INPUT TYPE="radio" NAME="holiday" VALUE="NYDay">New Year's Day
</DL>
<DL>Please put a check next to the applications you own:
<DD>
<INPUT TYPE="checkbox" NAME="msword" VALUE="No" CHECKED>Microsoft Word
<DD>
<INPUT TYPE="checkbox" NAME="photoshop" VALUE="No">Adobe Photoshop
<DD>
<INPUT TYPE="checkbox" NAME="netscape" VALUE="No">Netscape
<DD>
<INPUT TYPE="checkbox" NAME="excel" VALUE="No">Microsoft Excel
</DL>
<INPUT TYPE="submit" VALUE="Submit This Form">
</FORM>
</BODY>
</HTML>

Fig. 13.2 - You can use several types of HTML forms to retrieve information from Web users.

Note that all of the form elements in the above code use the NAME attribute. The idea is that the user enters text in a field or checks a radio button; this data is assigned to a variable named by the NAME attribute. The CGI script uses this data by referencing the corresponding variable name. For example, the response from a post-query script to the above example is shown in figure 13.3.

Fig. 13.3 - A post-query script is useful for displaying the values of an HTML form.


A post-query script is a generic term for any script that merely echoes back the results of an HTML form submission. In the standard NCSA httpd software distribution, a simple CGI script entitled post-query reflects the values of the entered text data. Post-query scripts are one of the simplest implementations of CGI scripting and are useful for debugging HTML form pages.
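To give a feel for how little such a script has to do, here is a minimal Python sketch of the echoing step. It is not the NCSA post-query program; the function name echo_form and the sample submission are made up for illustration, and the form data is assumed to arrive as one URL-encoded string.

```python
from html import escape
from urllib.parse import parse_qsl


def echo_form(encoded: str) -> str:
    """Render each NAME=VALUE pair of a URL-encoded submission as HTML."""
    pairs = parse_qsl(encoded, keep_blank_values=True)
    # Escape the values so that user input cannot inject markup into the page.
    items = "".join(
        f"<LI>{escape(name)} = {escape(value)}\n" for name, value in pairs
    )
    return (
        "<HTML><BODY><H1>Query Results</H1>\n"
        f"<UL>\n{items}</UL>\n</BODY></HTML>"
    )


# A hypothetical submission of the form in listing 13.1.
print(echo_form("comments1=Hello+there&holiday=Easter&msword=No"))
```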

Two alternative methods of transferring form data to a CGI script are POST and GET. These are the possible values of the METHOD attribute in the opening <FORM> tag. The amount of data you can pass back to the server using GET is limited; POST allows transfer of much more data. This results from the fact that a request made through the GET method concatenates all the HTML form variables into a single string; this string is appended to the URL in the HTTP message that identifies the CGI script, and browsers and servers limit how long a URL can be. Requests made through the POST method instead send the form parameters in the body of the request, which the server passes to the script separately from the URL.
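The difference between the two methods is easiest to see side by side. The following Python sketch encodes the same two form fields and shows where the encoded string would travel in each case; the field values and the /cgi-bin/post-query path are illustrative assumptions, and the requests are trimmed to the lines that matter here.

```python
from urllib.parse import urlencode

form = {"holiday": "Easter", "comments1": "Hello there"}
encoded = urlencode(form)  # NAME=VALUE pairs joined with '&'

# GET: the encoded string rides in the URL after a '?'.
get_request = f"GET /cgi-bin/post-query?{encoded} HTTP/1.0\r\n\r\n"

# POST: the encoded string travels in the request body instead.
post_request = (
    "POST /cgi-bin/post-query HTTP/1.0\r\n"
    "Content-type: application/x-www-form-urlencoded\r\n"
    f"Content-length: {len(encoded)}\r\n"
    "\r\n"
    f"{encoded}"
)

print(get_request)
print(post_request)
```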

The CGI Environment

In order to get the CGI application to run on any operating system, there needs to be some mechanism to convey the form data from the HTTP server to the CGI application. With UNIX, this is done through the use of environment variables, standard input and output. With Web servers running under the MacOS, AppleEvents are used to convey data to and from the CGI script and Web server. With Windows 3.1, Windows 95, and Windows NT, CGI variables are exchanged using a Windows private profile file in key-value format.
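For the UNIX case, a script typically looks at the request method to decide where its data is. The Python sketch below assumes the usual CGI arrangement (GET data in the QUERY_STRING variable, POST data on standard input with its size in CONTENT_LENGTH); the function name read_form is hypothetical, and the environment and input stream are passed in so the example can run without a server.

```python
import io
from urllib.parse import parse_qs


def read_form(environ, stdin):
    """Return form fields as a dict of lists, read from wherever the method put them."""
    method = environ.get("REQUEST_METHOD", "GET").upper()
    if method == "POST":
        # Read exactly CONTENT_LENGTH bytes; the server may not close the stream.
        length = int(environ.get("CONTENT_LENGTH") or 0)
        encoded = stdin.read(length).decode("ascii")
    else:
        encoded = environ.get("QUERY_STRING", "")
    return parse_qs(encoded, keep_blank_values=True)


# Simulated requests; a real script would pass os.environ and sys.stdin.buffer.
get_env = {"REQUEST_METHOD": "GET", "QUERY_STRING": "holiday=Easter"}
print(read_form(get_env, io.BytesIO(b"")))

body = b"holiday=Easter&msword=No"
post_env = {"REQUEST_METHOD": "POST", "CONTENT_LENGTH": str(len(body))}
print(read_form(post_env, io.BytesIO(body)))
```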

CGI Variables

The variables described in this section are passed from the server to the CGI application; they describe the server, the request, and the client that made it. Your CGI application can use these variables to display information about the server, the user, the user's browser, or the user's connection to the server. The CGI environment variable is included in parentheses where applicable.

  • Server Software (SERVER_SOFTWARE)
  • The name and software version of the Web server answering the request and launching the CGI application.

    Example: Apache 1.0.3

  • Server Name (SERVER_NAME)
  • The server's host name or IP address.

    Example: www.mcp.com

  • Server Port (SERVER_PORT)
  • The port number that received the request.

    Example: 80

  • Server Admin (SERVER_ADMIN)
  • The administrative contact for the web server, as obtained from the web server config files. Useful for giving a feedback email address in case of unrecoverable problems.

  • CGI Version (GATEWAY_INTERFACE)
  • The version of the CGI standard with which the server complies.

    Example: CGI/1.1

  • Request Protocol (SERVER_PROTOCOL)
  • The name and version of the protocol used by the client for this request.

    Example: HTTP/1.0

  • Request Method (REQUEST_METHOD)
  • The HTTP method specified in the request.

    Examples: GET, HEAD, POST


    Most of the headers in an HTTP request are made available to the CGI environment by taking the name of the header, turning it into all-capitals, replacing any hyphens with underscores, and prepending HTTP_ to the front of it. For example, if the browser sends Foo: Bar, the CGI script will see HTTP_FOO as a variable, and Bar as the value of that variable. This is the case with some of the variables below.

  • Referrer (HTTP_REFERER)
  • The URL of the document from which the CGI script was referred, if the browser sends it.

    Example: http://www.anywhere.com/cgi-test.html

  • From (HTTP_FROM)
  • The e-mail address of the Web browser user.


    The From variable is not used by every browser because of privacy concerns although it is included in the HTTP specification.

  • User Agent (HTTP_USER_AGENT)
  • This variable contains the description of the browser software. Although not every browser sends it, it is useful for tailoring CGI output to specific browsers.

    Example: Mozilla/2.0b6 (Windows; I; 32bit)

  • Logical Path (PATH_INFO)
  • This is the part of the URL that follows the name of the CGI script but precedes any ?. For example, http://host/script.cgi/path?foo would set PATH_INFO to /path, if the script was really script.cgi.

  • Physical Path (PATH_TRANSLATED)
  • If a logical path is specified in the client message, the server can try to map it as a path onto the document tree of the running Web server. In the above example, the server would see whether a request to /path would have mapped to an actual resource on the server, and return the full pathname to that (for example, /web/htdocs/path/).

  • Script URI (SCRIPT_NAME)
  • The name of the CGI script specified by the request.

    Example: http://host/cgi-bin/foo?argument would result in a SCRIPT_NAME of /cgi-bin/foo.

  • Script Filename (SCRIPT_FILENAME)
  • The actual filename of the CGI program on the file system.

    Example: http://host/cgi-bin/foo?argument could result in a SCRIPT_FILENAME of /usr/local/etc/httpd/cgi-bin/foo.

  • Query String (QUERY_STRING)
  • The encoded version of the query data. This data follows the ? in the URL and is usually the result of a query from an HTML form.

    Example: http://host/phonebook.cgi?Joe%20Smith+5551321 would result in Joe%20Smith+5551321.

  • Remote Host (REMOTE_HOST)
  • The host name of the machine running the Web browser that made the request, if available.

    Example: s115.slipper.net

  • Remote Address (REMOTE_ADDR)
  • The IP address of the Web browser making the request.

    Example: 167.142.100.115

  • Authentication Method (AUTH_TYPE)
  • The protocol-specific method of authentication used to validate the user if the document is protected and the server supports authentication. This corresponds with the AuthType directive in Apache.

  • Authenticated User Name (REMOTE_USER)
  • The name of the authenticated user if the document is protected and the server supports authentication.

  • Content Type (CONTENT_TYPE)
  • The MIME type/subtype of the HTML form data contained in a PUT or POST request.

    Example: text/plain

  • Content Length (CONTENT_LENGTH)
  • The number of bytes of data contained in a PUT or POST request. This tells the CGI script how much data to read from the request.

    Example: 42

  • Accept (HTTP_ACCEPT)
  • The list of MIME types accepted by the client. You can pass parameters for some of the MIME type/subtype combinations.

    Example: text/plain, text/html, image/gif
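To tie several of these variables together, the Python sketch below shows one way a server might derive SCRIPT_NAME, PATH_INFO, QUERY_STRING, and the HTTP_ variables from a single request. It is an illustration under simplifying assumptions, not Apache's actual code: the function name cgi_variables is hypothetical, the server is assumed to know already which leading part of the URL names the script, and every header is given the HTTP_ prefix.

```python
def cgi_variables(url_path: str, script_name: str, headers: dict) -> dict:
    """Derive a few CGI variables for one request (simplified sketch)."""
    path, _, query = url_path.partition("?")
    if not path.startswith(script_name):
        raise ValueError("request does not address this script")
    env = {
        "SCRIPT_NAME": script_name,
        "PATH_INFO": path[len(script_name):],  # extra path after the script name
        "QUERY_STRING": query,                 # passed along still encoded
    }
    for name, value in headers.items():
        # Foo-Bar: baz becomes HTTP_FOO_BAR=baz
        env["HTTP_" + name.upper().replace("-", "_")] = value
    return env


env = cgi_variables(
    "/cgi-bin/foo/extra/path?Joe%20Smith+5551321",
    "/cgi-bin/foo",
    {"User-Agent": "Mozilla/2.0b6 (Windows; I; 32bit)", "Foo": "Bar"},
)
for key in sorted(env):
    print(f"{key}={env[key]}")
```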

Apache Extensions to the CGI environment

There are a few environment variables beyond those specified in the CGI 1.1 specification which Apache supports.

DOCUMENT_ROOT

As one would guess, this is the document root for the server, as specified in the server configuration files. This is useful, for example, for allowing PERL scripts to find a common definitions file without having to hardcode the path location in each PERL script. You might have a lib or data directory off of your document root where you store extra PERL libraries or dynamic data, so to reference it in a PERL script you'd say, for example,

require "$ENV{'DOCUMENT_ROOT'}/lib/common.pl";

This way you can move your document tree around without having to worry about having hardcoded paths - everything would be based around the DocumentRoot as specified in the server configuration files.

REDIRECT_...

Apache supports custom error responses. It is usually useful, if that custom error response is a CGI script, to be able to get some information about the original request which caused the error. So, Apache takes each CGI variable from the old request and prefixes it with REDIRECT_ into the new environment. So for example:

QUERY_STRING -> REDIRECT_QUERY_STRING
PATH_INFO -> REDIRECT_PATH_INFO

The original variable names are then set to whatever values are appropriate for the new request.

There are two more special environment variables defined in this instance: REDIRECT_URL and REDIRECT_STATUS. REDIRECT_URL is the URL of the original request that triggered the custom error response, while REDIRECT_STATUS is the error code which triggered this response. So for example, you might have

ErrorDocument 500 /error-handler.cgi

In this case, if a request for a hypothetical /cgi-bin/buggy.pl fails with a server error, error-handler.cgi runs with REDIRECT_URL set to /cgi-bin/buggy.pl and REDIRECT_STATUS set to 500.
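The renaming step can be sketched in Python as follows, assuming the original request's variables are available as a dictionary. The function name redirect_environment and the sample URL are hypothetical; following Apache's own documentation of custom error responses, REDIRECT_URL here carries the URL of the request that failed.

```python
def redirect_environment(original: dict, status: int, failed_url: str) -> dict:
    """Copy each original CGI variable under a REDIRECT_ prefix."""
    env = {f"REDIRECT_{name}": value for name, value in original.items()}
    env["REDIRECT_STATUS"] = str(status)  # the error code that triggered the handler
    env["REDIRECT_URL"] = failed_url      # the URL of the request that failed
    return env


old = {"QUERY_STRING": "a=1", "PATH_INFO": "/x"}
for key, value in sorted(redirect_environment(old, 500, "/cgi-bin/buggy.pl").items()):
    print(f"{key}={value}")
```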

HTTP_COOKIE

If you are running with the mod_cookie module, you will see another environment variable you wouldn't normally see, HTTP_COOKIE. This is the legendary Netscape cookie functionality, where the server gives the web browser a token (cookie) when they first talk to each other, and then the browser sends it with every request. This token is unique and random, but usually guaranteed to be persistent for at least a user's "session", so it is possible to map this token to a "user". Usually it'll look something like this:

HTTP_COOKIE = s=myhost20434482411973732

What the key actually is isn't important - it's basically just a random number that mod_cookie makes up when it needs one. The important thing is that it can be used as a key in a database, or logged for tracking purposes. For example, if you see a CGI script hit 10 times, 3 times with one cookie, 3 with another, and 4 with another, you can be pretty sure you only had 3 people using that script, instead of 10 people once or 1 person ten times.
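The counting argument above amounts to grouping hits by cookie value. A minimal Python sketch, assuming you have already pulled the HTTP_COOKIE value out of each logged hit (the cookie strings below are invented):

```python
from collections import Counter

# One entry per hit on the script: 3 hits, 3 hits, and 4 hits.
cookie_log = ["s=a1"] * 3 + ["s=b2"] * 3 + ["s=c3"] * 4

hits_per_visitor = Counter(cookie_log)
print(len(hits_per_visitor), "visitors,", sum(hits_per_visitor.values()), "hits")
# prints "3 visitors, 10 hits"
```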

Notes on certain CGI variables

There are three variables that may or may not be set depending on other factors in the server configuration. Those are:

REMOTE_HOST

This is the DNS-resolved hostname matching the IP number of the client making the request from the server. If the server was compiled with -DMINIMAL_DNS, or if the directive HostnameLookups is set to Off, that variable will be set to the same value as REMOTE_ADDR, which is just the IP number. If your CGI program requires the hostname, and you have DNS resolution turned off for performance reasons, you can get the hostname by performing a gethostbyaddr call in C, or PERL, or whatever its equivalent is in other languages. Also, not every IP address is set to respond to reverse-DNS lookups, so even if the server is normally resolving every IP number, you might not be able to get a hostname for that number.
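In Python the equivalent call is socket.gethostbyaddr. The sketch below falls back to it only when the server did not resolve the name, and falls back again to the bare IP number when no reverse-DNS entry exists. The function name remote_host is hypothetical, and the demonstration uses a stub resolver so that it does not depend on a live DNS server.

```python
import socket


def remote_host(environ, resolve=socket.gethostbyaddr):
    """Return the client's hostname, resolving it ourselves if the server did not."""
    addr = environ.get("REMOTE_ADDR", "")
    host = environ.get("REMOTE_HOST", "")
    if host and host != addr:
        return host  # the server already resolved the name
    if not addr:
        return ""
    try:
        # gethostbyaddr returns (hostname, alias list, address list).
        return resolve(addr)[0]
    except OSError:
        return addr  # no reverse-DNS entry; settle for the IP number


def stub_resolver(addr):
    """Stand-in for a DNS lookup so the example runs offline."""
    return ("s115.slipper.net", [], [addr])


env = {"REMOTE_ADDR": "167.142.100.115", "REMOTE_HOST": "167.142.100.115"}
print(remote_host(env, resolve=stub_resolver))
```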

REMOTE_USER

This variable is only set if the script was placed under password-based authentication - this is the username that was used to get access. A very common bug report is "my CGI scripts aren't getting REMOTE_USER set!" - when in actuality what happened was that the CGI script was in a different directory (say, /cgi-bin/) from the other password-protected pages, and cgi-bin wasn't protected in the same way as those other pages. Since browsers cache passwords, and also since there's very rarely any user-interface level distinction between viewing a protected page and viewing an unprotected page, it's easy to understand why this may seem confusing. But don't worry, that's why you own this book.

REMOTE_IDENT

This is the string returned by a lookup to the client's machine using the ident protocol, as defined in RFC 1413. This will only return something if the IdentityCheck directive is set to On in the server config files, and of course if the remote site is actually running an ident daemon.

Setting Extra Variables

If you compiled in support for the module mod_env, there are two directives available to you for adding further information into the CGI environment.

PassEnv

PassEnv will let you pass through any environment variable from the shell environment from which the server is launched. For example

PassEnv USER

will pass along the contents of the USER variable from the shell that launched the server to the CGI environment.

SetEnv

This will let you explicitly set a particular variable in the environment. For example

SetEnv LIBDIR /www/lib

might be the best way to pass on to your CGI scripts where their libraries are, just as

SetEnv DEBUG 3

might be the best way to set the debugging level for your scripts.
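On the script side, these settings are read like any other environment variable. A small Python sketch, assuming the two SetEnv lines above are in effect; the function name script_settings and the fallback values are choices made for this example:

```python
import os


def script_settings(environ=os.environ):
    """Read LIBDIR and DEBUG as set by SetEnv, with safe fallbacks."""
    libdir = environ.get("LIBDIR", "")
    try:
        debug = int(environ.get("DEBUG", "0"))
    except ValueError:
        debug = 0  # treat a non-numeric DEBUG value as "debugging off"
    return libdir, debug


# Simulate the environment the server would build from the directives.
print(script_settings({"LIBDIR": "/www/lib", "DEBUG": "3"}))
```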

Server APIs

Besides the CGI and SSI specifications, there is one other major means by which you can add functionality to your server. Both Apache and Netscape support the notion of an API to the server - this is so code can be written to a published interface, compiled, and linked into the server such that the server and extra code become one actual program. This can give a tremendous boost to performance, and it can also allow the content creator to control certain deep aspects of the server that CGI does not allow.

Both the Apache API and the Netscape API are written in C, so modules to those servers must be written in C as well.

The Apache API provides a very generalized interface to the functionality of the server. Almost all the functionality beyond the core of the server is implemented through this API - all user authentication functions, the CGI interface, all access control functions, all "URL-munging" functions like Alias and Redirect. Apache modules can be written to implement "handlers" for certain data types - this is how functionality such as internal imagemap handling and Server Side Includes is implemented. Even logfile functionality is implemented as a module. This modular approach makes it very easy to "drop in" new functionality on top of or in place of older functionality. It allows server owners to tune their servers for optimum performance - for example, if you don't want to use the internal imagemap functionality, you can compile Apache without that module, and save yourself a couple dozen kilobytes per running child process, which may be significant at higher load levels.

Examples of functionality that people have implemented using an API, and that normal CGI could not provide either at all or in an efficient manner, include:

  • Totally configurable logging into any syntax (mod_log_config)
  • Authentication to an mSQL database (mod_auth_msql). This can be used as a prototype for other database interfaces as well.
  • Automatic cookie maintenance for session tracking (mod_cookie)
  • Radical new Server-Side Include functionality, beyond existing NCSA-style SSI syntax.

The specifications for the Apache API can be found at http://www.apache.org/docs/API.html, or on the CD-ROM included with this book.

The NSAPI specification is available at http://www.netscape.com/newsref/std/server_api.html

Using HTTP Cookies

Cookies (as briefly mentioned above, under the "HTTP_COOKIE" CGI environment variable description) are a new HTTP mechanism proposed by Netscape Communications; as of this writing, cookies are also supported by Microsoft Internet Explorer and a couple of other browsers. Cookies are designed to communicate state information to the browser from the server. This is in contrast to the standard HTTP process, where server information outside of the HTTP response is not communicated to the browser.

When a browser first visits a "cookie-enabled" Web site, the site can send back a Set-Cookie: header in the HTTP response. On subsequent visits to that site, or to particular other sites within that domain, the browser sends one or more Cookie: headers in the request, and the server can change the cookies sent by sending yet more Set-Cookie headers back to the client. Possible applications include client preferences, such as user accounts and personal information, for online shopping services.

The syntax for a cookie header is as follows:

Set-Cookie: name=Value; expires=Date; path=Path; domain=Domain_Name; secure

The name=Value pair is the only required attribute and identifies the cookie. You can set an expiration date with the expires attribute; after that date, the cookie becomes invalid. The domain attribute is used to validate the cookie; while searching its cookie list for valid entries, the browser matches the domain attribute against the domain name of the host it is about to contact. This lets a single cookie be returned to several hosts within the same domain. Similarly, the path attribute is used to validate the cookie request. The cookie is transferred if the path of the requested URL matches the cookie's path attribute. The secure keyword tells the browser to transfer the cookie only if the connection is made using the Secure Sockets Layer protocol.
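The attribute rules can be sketched in Python as two small functions: one that assembles a Set-Cookie header, and one that decides whether a stored cookie should accompany a request. Both function names, the sample cookie, and the example.com hosts are hypothetical, and the matching follows the tail-match (domain) and prefix-match (path) rules of the Netscape proposal in simplified form.

```python
def format_set_cookie(name, value, expires=None, path=None, domain=None, secure=False):
    """Assemble a Set-Cookie header; only name=value is required."""
    parts = [f"{name}={value}"]
    if expires:
        parts.append(f"expires={expires}")
    if path:
        parts.append(f"path={path}")
    if domain:
        parts.append(f"domain={domain}")
    if secure:
        parts.append("secure")
    return "Set-Cookie: " + "; ".join(parts)


def cookie_applies(host, request_path, cookie_domain, cookie_path,
                   secure=False, is_ssl=False):
    """Decide whether a stored cookie should be sent with this request."""
    tail = cookie_domain.lstrip(".")
    domain_ok = host == tail or host.endswith("." + tail)  # tail match on the host
    path_ok = request_path.startswith(cookie_path)         # prefix match on the path
    secure_ok = is_ssl or not secure                       # secure cookies need SSL
    return domain_ok and path_ok and secure_ok


print(format_set_cookie("session", "abc123", path="/cart", domain=".example.com"))
print(cookie_applies("shop.example.com", "/cart/view", ".example.com", "/cart"))
```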


The Cookie specifications are available at http://www.netscape.com/newsref/std/cookie_spec.html. As of this writing, the implementation of the cookie mechanism has not been finalized; consult the specifications before attempting to utilize cookies. It looks like cookies will make it into the HTTP specifications, but with guidelines on the user interface issues.

