Copyright ©1996, Que Corporation. All rights reserved. No part of this book may be used or reproduced in any form or by any means, or stored in a database or retrieval system without prior written permission of the publisher except in the case of brief quotations embodied in critical articles and reviews. Making copies of any part of this book for any purpose other than your own personal use is a violation of United States copyright laws. For information, address Que Corporation, 201 West 103rd Street, Indianapolis, IN 46290 or at support@mcp.com.

Notice: This material is excerpted from Running A Perfect Web Site with Apache, ISBN: 0-7897-0745-4. The electronic version of this material has not been through the final proof reading stage that the book goes through before being published in printed form. Some errors may exist here that are corrected before the book is published. This material is provided "as is" without any warranty of any kind.

Chapter 17 - Database Access and Applications Integration

The previous two chapters really lay the groundwork for one of the most exciting and interesting uses of the Web: using a World Wide Web client as an interface to the wealth of data contained within a structured database. It's in this capacity that the flexibility and usability of the Web really shines through, making the difference between a static, unchanging site and a site that responds dynamically to the user's preferences and environment.

This is also the fastest-growing area of the Web. At first, as people were just getting accustomed to htmL and the Web, most sites featured fairly static htmL pages. Even if well-designed, the ultimate effect was a very easy to use and splashy catalog with the ability to reference other sites. Database integration changes all of that. Databases allow the user to search through reams of data far too large to browse through manually, even in summary form. They allow for different views of the same data, creating and utilizing data relationships on-the-fly. And most importantly, by using "live" data on the Web, the site is always kept up-to-date, accurate, and interesting.

In this chapter, you learn about:

Getting Started

The different uses for database interfaces on the Web is as complex and interesting as the Web itself. Customers visiting a well-wired retail site can easily check the status of their orders, browse through a customized list of inventory items, or request to be notified via e-mail when a product becomes available. A tourist planning a trip to a city may search for available suites at a resort, register online for activities that interest them, or download the menu - and the pages which later show the scenic points of the city are customized to the user's interests. The site is now much more useful than its equivalent paper counterpart, since it reflects the data on the site in a manner that is customized to each particular user.

All of the data just described can be easily stored within a traditional database. Full-text indexing methods do not really apply to these cases, since the queries and updates use structured data. Customers wanting to purchase a product do not want to retrieve documents that describe the marketing techniques for the product, which a full-text index may return.


As a general rule of thumb, if your data is structured and well-ordered, such as a mailing list or a parts list, then you should consider using a database. If your data has little or no structure and includes a significant amount of free-flow text, you might be better off using a text indexing method. If your data has both free-form text and structured fields, such as a collection of documents with author/content headers, then you might want to consider a hybrid approach. Some databases have support for free-form text or binary data built in.

The integration of the database to your Web site is done through the CGI interface. You write programs that take user input, do some extra processing, and return a Web page back to the user that shows the effects of the user's work. However, the difference is the "extra" processing; rather than just execute a program or perform a fairly simple task, you will usually access a database engine. The database engine, or server, takes your input, performs a search or other operation on the data within the database, and returns the results of that operation. The exact details of how this is done vary widely based on the type of database to which you connect.

If your data is not already in an accessible database, then the first step toward making your data available on the Web is to load it into a database of some sort. There are a large number of databases on the market, including commercial, shareware, and freeware solutions. The databases range from simple structured tables with indexes on various fields - such as the cbase library from Citadel - up to fully scaleable, high-performance SQL databases from companies such as Informix, Oracle, Sybase, and so on. Database software for UNIX machines may be as cheap as $100, or may cost hundreds of thousands of dollars. Most of the reliable, good-performance database engines for small UNIX machines cost thousands of dollars, but are usually easier to program and will handle higher user loads more reliably than their freeware or shareware cousins. In general, the more complex your data and the more users it needs to be able to handle at once, the more expensive the commercial product will need to be.


If your application only needs to look up the value of a single field that matches the user's input exactly, you might not need a database. Why not use PERL's associative arrays to rapidly access a record if you know the record's name?

Once the data is stored within an accessible database, the next step is to choose an interface language that can accept input in CGI format, query or modify the database, and then return results in the htmL language. Because PERL seems to be the de facto language for use with Web servers, many database solutions involve custom PERL libraries that can interface with several popular database languages. These PERL dialects are available at ftp://ftp.demon.co.uk (and mirrored at ftp://ftp.cis.ufl.edu/pub/perl/scripts/db/), and can assist with the interface to Informix, InterBase, Oracle, Postgres, SyBase, Unify, Ingres, X.500 directories, and common C-Tree and NDBM formats.

Most commercial database packages may also be accessed through the development tools that are sold to access the database. For example, Informix's embedded SQL-within-C language (ESQL/C) can be used directly to program database interfaces, or can be used in cooperation with a PERL wrapper to handle post-processing. In many instances, using existing database tools is the easiest route for database programmers who are already familiar with the database tools from that vendor.

Control Flow and a Sample Application

Because of the design of the Web, designing a database interface for the Web is different than traditional procedural or event-driven database applications. The program must be designed to work with pages; your database program is called when a form is submitted or a link is chosen, and must generate a new htmL page to be returned to the user.


Turn off Crawler Access - One of the great features of a database-driven Web environment is the great flexibility and custom appearance that can be generated for a user. However, this sometimes drives Web indexing programs, such as Alta Vista and WebCrawler or even internal harvesting applications, crazy as they try and index every single "page" on your server since they cannot detect that each page is actually the output of a program. For the benefit of both the search engines on the Net and your own performance and throughput, use the /robots.txt server control file to limit the scope of the crawlers to your static data and away from your database applications. If you have non-form links that can modify the database, this step is critical to maintaining database integrity.

As an example of a database interface on the Web, we'll walk through a fairly simple "query-add-delete-edit" database application. This typical application can be used as a basis to build up more complex database systems, involving more advanced topics such as relational table joining and data integrity concerns.

This Web database application has several database activities, which are fairly typical of database programs on the Web:

With a handful of fairly simple CGI programs, this database now has a client-server interface that can run on a variety of platforms, ranging from high-end workstations to very inexpensive ASCII terminals.

Programming Hints

Following are a few programming hints that will make your Web applications easier to use and more effective.

Use "Hidden" Input Fields

Hidden fields can be a great benefit to the database designer. Hidden fields are fields that are included in htmL forms and are submitted with the rest of the user's data entry, but the user never sees them and cannot normally change the values (see the caution regarding "Security" later in this section). A programmer may use hidden fields to pass information along to later database programs that is not of interest to the user, such as the record IDs of the row being updated, user format preferences, generation numbers, temporary file names, and so on.

In the above example, the database programmer may have chosen to implement all six programs in a single CGI script, and rely on hidden fields to indicate which of the six functions should be executed on the current input information.


Although the htmL specification requires the user's browser to submit all data in the order that it was received, not all Web browsers do that. However, all browsers that I could find either send the fields correctly (that is, in order) or in reverse order. Therefore, a hidden field that needs to be known before any of the rest of the data can be processed (such as a "which function to perform" tag) should be included at the very top and very bottom of an input form.

Whistle While You Work

Because the CGI program is doing more complex tasks than retrieving a file off a disk, there may be a much longer delay when loading Web pages, particularly if the user enters complex search criteria or the database machine is heavily loaded. If the delay is too long, the user might cancel the download, hit back, and try again, which may frustrate both the user and the rest of the system as query after query is started and then canceled. To avoid this situation and to let the user know that the system is indeed processing his input, output a header immediately that indicates that the database is working and will return results shortly. Even a simple "Working..." message as the first line of the document will help avoid this problem.

However, you must also check to make sure that your Web server will not save up the output of your program until it is finished. Some Web servers do this under some circumstances so that they can inform the user's browser how long the resulting document will be. With such servers, use the "pass through" or "http direct" option for your queries; the "Content-Length:" field of the HTTP header is optional and your program may choose to skip it if the length of the data is not known.

The default type of service for the Apache server is to buffer 8 kilobytes of CGI script output before passing it off to the network layer for transmission. To avoid this buffering, you must make your script an nph script. nph stands for nonparsed headers, and it literally means that the server removes itself as the middleman and the CGI script talks directly to the client. Currently the only way to designate a script to be an nph script, one must name it such that the first four letters of the script's name are nph- (i.e., /www/htdocs/nph-foo.cgi.

Use POST and GET Submission Types Correctly

When implementing an htmL Form, the form designer may choose whether the user's data will be submitted using the "GET" or "POST" methods. According to the htmL 2.0 specification, the "GET" method should be used when submitting the form has no side effects on the database (such as queries, or the start of an update), and the "POST" method should be used when submission will change the state of the database (such as add, update, or delete operations). This is not just a trivial distinction: most browsers will confirm from the user before submitting a form twice using "POST," and will silently submit a form twice with "GET." Obviously, you don't want the user to accidentally "add" the same record twice to the database.


One htmL guideline that is commonly available on the Web incorrectly states that the difference between the two should be the length of the data being submitted, and to use "POST" when dealing with possible long sets of data. This is not how the htmL 2.0 designers intended the two methods to be used nor how browsers are implemented, but it does reflect an important practical consideration: many Web servers cannot handle "Get" methods when dealing with longer data sets. Check your Web documentation or experiment if unsure, but try to use Get and Post in the spirit that they were intended.

At the time of this writing, the America Online browser cannot handle POST forms at all. Therefore, database programmers who wish to provide functionality to AOL users cannot use POST in their applications. Hopefully, this will be fixed on AOL's end before too much longer, and Webmasters may always use the appropriate method.

Always Maintain Security

Implementing secure CGI programs is important since the programs can possibly alter data on the machine itself. However, with database programming, the problem is magnified greatly since the programs work directly with data that can be fairly easily corrupted if the programmer is not careful. An important thing to remember is that the user can alter any of the data on a form, even in ways not usually possible using a normal Web browser. For example, a radio button may be returned to the CGI script as a value that is not legal and never appeared in any of the choices to the user. The values of "hidden" fields may be changed by a malicious user at any time. Fields that have a length limitation that the browser normally enforces may be submitted with longer strings (this one is particularly painful if using a language like C and directly writing user values into arrays of fixed length). As a database programmer who is concerned about the integrity of the data, you cannot assume that any of the data that comes from the user's browser will fit the constraints of the calling page.

As an example of such a potential pitfall, consider a Web interface where a user may update his or her address information online. The application starts off requesting the user's name and address password, verifies it, and returns a form with the user's ID number in a hidden "recorded" field and the data entry fields populated with the user's existing address information. The user modifies the information, submits the form, and the database program updates the record number in the "recordid" field to reflect the changes that the user made.

The problem here is that the user may change the ID number manually to any other ID number in the system, and the update program will then update the wrong user's record. A better solution would be to store the ID number in a special format and then encrypt the ID number using a key known only to the server. If the ID number is modified by a potential data vandal, the format will not match up after decryption and the database server may reject the request.

User Authentication

The Web is rapidly advancing in its ability to have servers automatically recognize and verify which user is accessing a CGI program, but such automatic solutions to this problem are still too sparsely implemented to be able to be used in a generic environment. However, database programmers may still make use of password fields or htmL's almost-official browser authentication to verify users at the beginning of the application, and have the browser keep track of authentication credentials as the user moves through the application.

User Preferences

If your site already has users who are logging in or registering themselves, then one of the best things that your site can do is track their interests and preferences to customize the site for that user. This makes your site come "alive" and become more interesting and useful to all of the users, since they get information that is customized and appropriate for their interests. It also makes your site more memorable and more likely to be revisited and passed around through word-of-mouth, which is am important consideration for sites that are seeking popularity.

For example, a real-estate server might remember that a user is interested in a certain price range, location, and style, as well as the last date that they visited the site. When the user returns to the site, the home page might include a listing of new properties added since his last visit, each one in the style and price range that interests the user. When touring a new property online, the descriptions might change slightly to emphasize the features that interest the user.

Browsers as Database Front Ends

Creating a Web interface to a database may do more than just give users outside your company access to your information - it may also provide an easy, cross-platform solution for in-house use. Once a good, robust gateway is written between the Web and your existing database server, your users can immediately have cross-platform access and update ability to that database by using their Web browser as a database client. Almost every hardware platform and operating system has a World Wide Web browser available, which eliminates the need for your MIS staff to worry about creating a client for Macintosh, Windows, and UNIX. In addition, if you use large numbers of ASCII terminals, then your database is still accessible through plain-text Web clients, such as Lynx and Panda.

The largest shortcoming to using this approach is that you are restricted to the level of functionality that all of your Web browsers support. All data and forms are displayed inside the browser rather than as a separate application. If your application needs its own menu bar rather than links within the page, or floating palettes, spreadsheets, graphical editing, or any of a number of custom controls, it will be difficult to implement using the Web and standard htmL.

Since the formal htmL 2.0 specification does not offer database programmers enough flexibility and control to be able to implement "real" clients for databases, many people decide to go with extensions or extra features that are offered on particular Web browsers. These features either completely or partially close the gap between what the database programmer wants the database interface to do, and what the browser can actually handle.

One good example of such an extension is Java and JavaScript. Created by Sun Microsystems, Java allows you to program in one language that works across many platforms, allowing you to "break" the htmL rules when a needed functionality is not supported by htmL. For example, you might use a Java "applet" to allow the user to draw a line on a graphical image of a map, connecting two cities in the U.S. that he wants to visit. This provides an easier method of collecting this data than asking the user to type in the two cities' names. Other browser extensions include the ability to handle validation callbacks after filling in an entry field for data validation, ability to search graphical images, use multiple editing windows, and so on. However, users who are not using a browser that supports such extensions either will not be able to use the database interface, or will need to use some sort of a "fallback" work-around.

There are new products on the market that create a hybrid between a platform-specific client interface and a Web interface. One such product is InterAp from Stac Electronics (http://www.stac.com/Homepages/Software/intcont.html), which allows you to create advanced applications that can interoperate with the Web and other Microsoft Windows applications. Similar applications are available from other vendors, and Microsoft has announced increased Internet functionality to many of its products and services.

Advanced Topics and a Look Ahead

There are many issues dealing with the peculiar workings of the Web that cause some sticky situations for database programmers. Most of these issues apply only when using the Web to update databases, rather than merely searching them online, and most of the issues are traditional database issues cast into a different light because of the Web interface.

Use of the "Back" Button

Because the user may click on the "Back" button at any time, implementing user flow paths becomes a rather complex issue. The database programmer must keep in mind that at any screen, the user can choose the "Back" icon and go back to a prior page, abandoning the current flow path without the server's knowledge. Unless steps are taken, this may lead to double-updating of records, or the re-writing of older data on top of newer data. The user might also be able to go "back" after deleting a record and attempt illegal operations, such as then attempting to update that record. The database programs should be able to handle such odd situations gracefully.

The database designer should also be able to handle the user submitting the same request multiple times, and intelligently ignore subsequent requests if appropriate. For example, if the user add a record to a database, clicks on "Back," and then re-submits the same form to add a record, the system should tell the user that the item has already been added into the system and his current request has been ignored.

Concurrency Control

When designing an interface to a database that includes the ability to update or change the data within the system, there are peculiarities that are particular to the Web which the database programmer needs to keep in mind. In general, most of the issues revolve around how Web browsers retrieve information: when the user clicks on a link or submits a Form, the browser opens up a connection, downloads the new page from the Web server, and then closes the connection. From the server's point of view, the server does not know what the client does after the page is downloaded; if the user clicks on "Back," or even quits his browser application, the server is not informed of this fact.

This behavior makes implementing concurrency control difficult, since the server cannot afford to lock a record while the user is editing it, for fear that the user is not really editing the record at all, but has gone off to browse someplace else or has quit the application. Also, dealing with cursors of "things to do" can be problematic as the items in the current working set are added and deleted by other users, and coordinating multiple users working on a common working set requires a lot of programming acrobatics, and usually results in some training issues to the end users.

One partial solution to this problem is to use the "client pull" features available on some Web browsers. The browser presents the page to the user for his actions, and allows the user to modify that page for a reasonable period of time. During that period, the database considers that data "checked out" by the user doing the modifications, and will not permit others to modify it or put it on their "to do" lists. After a certain period of time, the server will automatically "check in" the object, and the version that the user has checked out is now "stale." However, the browser will also load, automatically, a new version of the page, which should indicate to the user that his update period has expired and he needs to re-load this page in order to check the object back out. There is still a period during which the user can check out the page and then quit the application, but the user is at least notified when their lock becomes stale. Browsers which do not support client pull will behave normally, except that they will not receive notification of their outdated lock status.

Heavy Server State Management

Some database applications require the database process to build up a rather complex "state" within the server, such as a cursor of currently "active" rows from a query. This state can be expensive to rebuild on every page fetch from the server, which slows down the responsiveness of the site and adds greatly to the server load. Even the basic overhead of re-establishing a connection with the database server every single time a page is fetched may cause a noticeable impact on performance.

One fairly straightforward solution is to have the page "teach" the browser the information the server will need to re-use by using "hidden" buttons within a Form. Remember that the user does not see the value or contents of hidden buttons, but they are returned by the browser at the time the form is submitted. For example, a list of "customers to visit today" may be submitted on every page submission, allowing the user to "walk" down the list without re-querying the database to rebuild the list after every update. However, some browsers are limited in the number of items that can be on a form or do not work correctly for very long lists, so this technique might not be usable when the lists can get long. Also remember that not all browsers return items on a form in the same order, in spite of the htmL 2.0 specification.

Another solution is to use temporary files located on the server to store the server's state. This has the advantage of being able to store lists of almost any size, and allows the user to keep the list updated regardless of moving back and forth using the "Back" button. The server would send a hidden field (or a parameter to the CGI script) that indicates which temporary file on the server to use. However, since there is no way for the server to know when the user is finished with the list, the server must delete the list on its own after some period of time. And because the user may bookmark a page and re-visit it later, the server should be able to handle a page submission after it has already deleted the user's temporary list, usually by rebuilding the state within the list.

One fairly complicated solution is use a fairly complex server to "reconnect" the database server process with the client's browser across multiple page retrievals. That is, the database process continues running even after it has completed the request for an "expensive" page. If the user finishes that page within a given period of time, that process can continue onward with the state is has already built up, without the overhead of rebuilding it. There are commercial applications that will perform the "reconnect," and are usually phased-in after a database site becomes so heavily used that performance issues are starting to effect user satisfaction with the interface.

Custom HTTP Servers

If your application is expected to get very heavy usage (millions of hits a week, or a large amount of processing on a smaller number of hits), one way to improve performance is to bypass the standard Web server. Rather than using programs that are called from the standard Web server (such as CGI programs called from the Apache server), a custom program handles all HTTP requests and transactions directly, speaking and understanding the raw HTTP protocol. The custom server must handle many of the nit-picky details of content negotiation itself that are usually handled by the HTTP server, but the overall effect of going this route might be better performance and easier handling of machine resources. However, it requires a significant understanding of Internet programming and the HTTP 1.0 specification.

For example, the standard Apache server can be run on the normal HTTP port on a machine to handle most of the general requests and for pages that require some of the Apache server's features. Once the user accesses pages that are the performance-critical sections of the application or accesses sections that need additional functionality not easily added to a standard server, the browser connects to a second server on the same machine, on port 8080. The second server is a simple, single-minded process dedicated to maintaining the database interface. Because it does not have all of the features of the full-blown Apache server, it is significantly smaller (in one common case, only 67K) and executes significantly faster than running the same processes off Apache using CGI scripts.

Because writing your own HTTP server is a fairly complicated and difficult task, an interesting option made available recently is to use the extensibility of the Apache server, as of release 1.0.0. This release of the server permits programmers to add their own modules at link time to provide additional or alternate functionality to the standard Apache server. Rather than writing an entirely new custom server to handle your database applications, modules may be added to the Apache server to permit it to communicate directly with the database and to maintain connections while awaiting new requests. At the same time, modules that are not directly needed in the performance-critical points of your overall Web application may be left out of the custom server, resulting in a smaller, quicker custom executable, without having to re-write all of HTTP.

Licensing Issues

Most commercial UNIX databases are sold with a certain number of "user licenses," which, by copyright law, must still be obeyed by the Web database programmer. Policies for what counts as a "single user" vary from vendor to vendor; some vendors count by the number of UNIX IDs that are in use, which means that the single "Web" user can take care of hundreds of transactions at once. Some vendors count it as the number of people who access the program, which would make the database program unusable in a general access Internet environment, but might work for an in-house local Intranet application. Finally, other vendors count it as the number of simultaneous server processes running, in which case care must be taken not to exceed this number during "peak" or busy periods on the Web server. When setting up your server, it is important that you find out very clearly from your vendor how your database server handles such legal and technical issues.

Commercial products are available that will control the number of simultaneous connections, and allow users to "wait in line" for the next available open license. This not only allows you to keep under your licensed user limit, but also allows the server hardware to be shared with other applications so that the Web usage cannot overwhelm the machine.

Commercial Products

The field for commercial plug-ins to the Web that will make database integration easier and more effective is exploding. All major database companies have announced upcoming products which will make it easier to access databases over the Web. The best place to find up-to-date information is on those vendor's home pages on the Web or in the comp.databases.* newsgroups.

Some third-party packages that may help you with your development are:

This market is exploding, but if you have extensive technical experience programming with your database package in your environment, rolling your own solution using your existing database tools is frequently just as easy as some of the more complex packages, and allows you to customize the system to meet your particular needs.

Connecting databases live to the Web is one of the most interesting and useful applications of Web technology, and frequently lifts up a popular and well-visited site from the other static and unchanging sites that might be in the same field.


QUE Home Page

For technical support For our books And software contact support@mcp.com

Copyright © 1996, Que Corporation


Table of Contents

16 - Usage Statistics and Maintaining htmL

18 - Financial Transactions