Copyright ©1996, Que Corporation. All rights reserved. No part of this book may be used or reproduced in any form or by any means, or stored in a database or retrieval system without prior written permission of the publisher except in the case of brief quotations embodied in critical articles and reviews. Making copies of any part of this book for any purpose other than your own personal use is a violation of United States copyright laws. For information, address Que Corporation, 201 West 103rd Street, Indianapolis, IN 46290 or at support@mcp.com.
Notice: This material is excerpted from Running A Perfect Web Site with Apache, ISBN: 0-7897-0745-4. The electronic version of this material has not been through the final proof reading stage that the book goes through before being published in printed form. Some errors may exist here that are corrected before the book is published. This material is provided "as is" without any warranty of any kind.
The previous chapters covered the basics of setting up a Web server, writing HTML, and creating forms and scripts. The last chapters in this book use the tools acquired in the first part of the book to build or explain several useful applications that can be built with Web servers.
This chapter starts off with an introduction to search engines: first a mechanism for searching simple databases, and then more sophisticated indexing and retrieval software that searches an entire Web server and presents the resulting list in a hypertext document, in accordance with the tradition of the Web. Specifically, you look at two popular freeware Web search engines, namely, ICE and Glimpse.
This is the age of information sharing. Data is of little use if it cannot be shared and shared alike. The second part of this chapter presents techniques for using Web technology as a workgroup tool for information sharing. Several vehicles for conducting workgroup discussion are presented, including list servers, newsgroups, and Web conferencing systems. This chapter also discusses methods for creating annotation capability.
In this chapter, you learn the following:
In the last chapter, you saw how to write a simple script to search an online phone book for names and numbers. Although this can be considered a simple database application, it differs from what is normally thought of as a database because users can view but not enter information. Creating Web database applications that can modify, add, and delete information from databases is covered in Chapter 17, "Database Access and Applications Integration." This chapter is more concerned with search and retrieve applications that are used as guide maps around the vast Web.
Even though an organization centrally maintains many types of data, that data often still needs to be made available to hundreds or even thousands of users, either internally or externally. Examples of this type of data include a company phone and address book, a product catalog that maps product numbers to titles, or a list of regional sales offices and contacts. All of these types of information can be stored in a relational database, but if the goal is to quickly and easily make information available, there's really no need for anything more than a simple text file. A simple Web search routine can achieve the desired result without all the maintenance headaches that are associated with a relational database. However, there are performance limitations related to the size of such a simple database, so if your applications will be used for data sets with more than a couple hundred records at most, you may want to consider a more industrial-strength solution.
In the previous chapter, the phone book example demonstrated how to search a text file containing names and phone numbers. At the heart of the search is the grep command, which simply looks for pattern matches in a file. One of the benefits of this approach is that the text file need not be in any certain format. The grep command just reads each line of the file for a match; it doesn't care how many columns there are or what characters are used to separate fields. Consequently, the phone book script from the previous chapter can be used to search any text file database. That script has been generalized from the phone book example and is reprinted here for convenience. Figure 15.1 shows the resulting search form.
You can make searches case-sensitive by removing the -i option
from the grep command.

Listing 15.1 Simple PERL Search Script
#!/bin/perl
# search.pl
# The #! line must come first in the file so that the perl interpreter is invoked
# Define the location of the database
$DATABASE="/usr/local/etc/httpd/cgi-bin/phone.txt";
# Define the path to cgiparse
$CGIPATH="/usr/local/etc/httpd/cgi-bin";
# Convert form data to variables (the QUERY field is assumed to arrive as $QUERY)
eval `$CGIPATH/test/cgiparse -form -prefix $`;
# Determine the age of the database
$mod_date=int(-M $DATABASE);
# Display the age of the database and generate the search form
# (a blank line must follow the Content-type header)
print <<EOM;
Content-type: text/html

<TITLE>Database Search</TITLE>
<BODY>
<H1>Database Search</H1>
The database was updated $mod_date days ago.<p>
<FORM ACTION="/cgi-bin/search.pl" METHOD="POST">
Search for: <INPUT TYPE="TEXT" NAME="QUERY">
<INPUT TYPE="SUBMIT" VALUE="Search">
</FORM>
<p><hr><p>
EOM
# Do the search only if a query was entered
if (length($QUERY)>0) {
print <<EOM;
Search for <B>$QUERY</B> yields these entries:
<PRE>
EOM
# Caution: the query is passed to the shell unescaped
$answer = `grep -i $QUERY $DATABASE`;
# Inform user if search is unsuccessful
if (!$answer) { print "Search was unsuccessful\n" ;}
else { print "$answer\n" ; }
print <<EOM;
</PRE>
</BODY>
EOM
}
Fig. 15.1 - This generalized database search form is used with the preceding search script to search any text file database.
To use the script for data other than the phone book, simply change the name and location of the text file containing the desired information. Because the script uses the generic grep command, it can be used with almost any text file for any purpose.
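Because the listing hands the query to grep through the shell, characters such as ; or | in a query reach the shell unchanged. The following is a minimal sketch of the same case-insensitive line search done inside PERL instead. The helper name search_lines is hypothetical, and unlike grep, this sketch treats the query as literal text rather than as a pattern.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines of $file that contain $query,
# ignoring case. \Q...\E makes the query match as literal text, so
# characters such as ; | or * have no special meaning.
sub search_lines {
    my ($file, $query) = @_;
    open(my $fh, '<', $file) or return ();
    my @hits = grep { /\Q$query\E/i } <$fh>;
    close($fh);
    return @hits;
}

# Example: print matches from the phone book path used in Listing 15.1.
# Nothing is printed if the file does not exist.
print search_lines('/usr/local/etc/httpd/cgi-bin/phone.txt', 'smith');
```

If you need grep-style patterns, you can drop the \Q and \E, at the cost of having to handle queries that are not valid patterns.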
If the egrep (or grep -E on some systems) command is available on your system, consider using it instead of grep because it's often faster.

To take advantage of the simple search routine above, you must have some text file data to start with. If your data is currently in another format, such as a proprietary database, you must first convert it to an ASCII text file. You can easily create the necessary text file by exporting the data from the native format to ASCII text. Almost all databases include the capability to export to text files.
For easiest use of the search script, export data so that there is
exactly one record per line. This produces the neatest output from the
script.

After the text file has been created, you simply need to specify its path in the search script.
With a few simple modifications, you can generically use the script to search one of many databases that all have different paths. This can be done most efficiently in one of two ways. You can allow the database to be chosen by selecting one of several hyperlinks, in which case extra path information in the URL can be used to specify the database. On the other hand, you can allow the user to choose which database to search in a fill-in form.
Suppose you want users to be able to choose between several different divisional phone books. One way to do this is to include a pre-search page on which the user selects the database by clicking the appropriate hyperlink. Each link calls the same database search script, but each link includes extra path information containing the path to the database. The following HTML code demonstrates how the hyperlinks are constructed:
<H2>Company Phonebooks</H2>
<A HREF="/cgi-bin/search.pl/db/IAphone.txt">Iowa Locations</A>
<A HREF="/cgi-bin/search.pl/db/CAphone.txt">California Locations</A>
<A HREF="/cgi-bin/search.pl/db/KSphone.txt">Kansas Locations</A>
The name of the search script in this example is /cgi-bin/search.pl and the databases are named /db/IAphone.txt, and so on. The search script itself needs to be modified to use the extra path information.
First, the name of the database to search is now specified in the extra path information rather than hard-coded into the script. Therefore, the line at the top of the script that specifies the path to the data needs to read the extra path information. This is done by reading the PATH_INFO environment variable. In PERL, the syntax for this is
$DATABASE=$ENV{"PATH_INFO"};
Second, the ACTION attribute of the form, which is generated inside the script, needs to specify the path to the database, as well. This way, after the user performs the initial query, the correct database is still in use. This is done by changing the <FORM ACTION...> line to the following:
<FORM ACTION="/cgi-bin/search.pl$DATABASE">
No slash (/) is necessary to separate the script name (/cgi-bin/search.pl) from the extra path information because $DATABASE already begins with a slash.

These are the two modifications necessary to implement choosing a database via hyperlinks. The hyperlinks to other databases are now included in the search form also. The resulting form is shown in figure 15.2. The complete modified script code is included below. Only new or changed lines have been commented.
Listing 15.2 Modified PERL Search Script
#!/bin/perl
# search2.pl
# Get database name from extra path info.
$DATABASE=$ENV{"PATH_INFO"};
$CGIPATH="/usr/local/etc/httpd/cgi-bin";
eval `$CGIPATH/test/cgiparse -form -prefix $`;
$mod_date=int(-M $DATABASE);
# Show the current database and list other available databases.
# The <FORM ACTION ...> line now includes the database name
# as extra path info.
print <<EOM;
Content-type: text/html

<TITLE>Database Search</TITLE>
<BODY>
<H1>Database Search</H1>
Current database is $DATABASE.
It was updated $mod_date days ago.<P>
You can change to one of the following databases at any time:<P>
<A HREF="/cgi-bin/search2.pl/db/IAphone.txt">Iowa Locations</A><BR>
<A HREF="/cgi-bin/search2.pl/db/CAphone.txt">California Locations</A><BR>
<A HREF="/cgi-bin/search2.pl/db/KSphone.txt">Kansas Locations</A><P>
<FORM ACTION="/cgi-bin/search2.pl$DATABASE" METHOD="POST">
Search for: <INPUT TYPE="TEXT" NAME="QUERY">
<INPUT TYPE="SUBMIT" VALUE=" Search ">
</FORM>
<p><hr><p>
EOM
if (length($QUERY)>0) {
print <<EOM;
Search for <B>$QUERY</B> yields these entries:
<PRE>
EOM
$answer = `grep -i $QUERY $DATABASE`;
if (!$answer) { print "Search was unsuccessful\n" ;}
else { print "$answer\n" ; }
print <<EOM;
</PRE>
</BODY>
EOM
}
Fig. 15.2 - Now the interface allows the user to use hyperlinks to select a new search database.
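Because the database path now comes from the URL, anyone can type extra path information that points at a file you never meant to publish. One way to guard against this, sketched below under the assumption that the set of databases is small and known in advance, is to accept only paths that appear on a fixed list. The names %ALLOWED and checked_database are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical allow-list: the only database paths the script will open.
my %ALLOWED = map { $_ => 1 } qw(
    /db/IAphone.txt
    /db/CAphone.txt
    /db/KSphone.txt
);

# Return the requested path if it is on the allow-list, otherwise undef.
sub checked_database {
    my ($path_info) = @_;
    return undef unless defined $path_info;
    return $ALLOWED{$path_info} ? $path_info : undef;
}

my $DATABASE = checked_database($ENV{"PATH_INFO"});
print(defined($DATABASE) ? "Using $DATABASE\n" : "Unknown database\n");
```

A script using this check would print an error page instead of running the search whenever checked_database returns undef.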
Depending on the application, it may be more convenient for users to choose their database via a form rather than via hyperlinks. The initial form uses option buttons to choose the desired database, and after that the chosen database is active for all searches. Figure 15.3 shows the initial form used to select the database.
Fig. 15.3 - In this form, you select the search database and then proceed to the search form.
Listing 15.3 shows the form's code.
Listing 15.3 Using Option Buttons Instead of Hyperlinks
<TITLE>Database Search</TITLE>
<BODY>
<H1>Database Search</H1>
Choose your database from the list below:<P>
<FORM ACTION="/cgi-bin/search3.pl" METHOD="POST">
<INPUT TYPE="RADIO" NAME="DATABASE" VALUE="/db/IAphone.txt" CHECKED> Iowa Locations<BR>
<INPUT TYPE="RADIO" NAME="DATABASE" VALUE="/db/CAphone.txt"> California Locations<BR>
<INPUT TYPE="RADIO" NAME="DATABASE" VALUE="/db/KSphone.txt"> Kansas Locations<P>
<INPUT TYPE="SUBMIT" VALUE=" Submit ">
</FORM>
<p><hr><p>
The initial selection form passes the path of the chosen database in the input field named "DATABASE", so only two modifications are necessary to the original search script that receives this information. First, the path to the database is now read from the initial selection form, so a separate line defining $DATABASE is no longer necessary. Second, the search form must have a way to keep track of the current database. This is conveniently accomplished by including a hidden input field in the search form named "DATABASE". This way, whether the search form is called from itself or from the initial selection form, it always knows the path to the correct database. The code for the search script is in listing 15.4. Only the new or changed lines are commented. The resulting search form appears in figure 15.4.
Listing 15.4 Search Script Code
#!/bin/perl
# search3.pl
$CGIPATH="/usr/local/etc/httpd/cgi-bin";
eval `$CGIPATH/test/cgiparse -form -prefix $`;
# $DATABASE is now defined as a form variable
$mod_date=int(-M $DATABASE);
# A hidden field <INPUT TYPE="HIDDEN"
# NAME="DATABASE" ...> stores the database path.
print <<EOM;
Content-type: text/html

<TITLE>Database Search</TITLE>
<BODY>
<H1>Database Search</H1>
The current database is $DATABASE.
The database was updated $mod_date days ago.<p>
<FORM ACTION="/cgi-bin/search3.pl" METHOD="POST">
<INPUT TYPE="HIDDEN" NAME="DATABASE" VALUE="$DATABASE">
Search for: <INPUT TYPE="TEXT" NAME="QUERY">
<INPUT TYPE="SUBMIT" VALUE=" Search ">
</FORM>
<p><hr><p>
EOM
if (length($QUERY)>0) {
print <<EOM;
Search for <B>$QUERY</B> yields these entries:
<PRE>
EOM
$answer = `grep -i $QUERY $DATABASE`;
if (!$answer) { print "Search was unsuccessful\n" ;}
else { print "$answer\n" ; }
print <<EOM;
</PRE>
</BODY>
EOM
}
Fig. 15.4 - Once the search database is selected in a separate form, this form is used to perform the search.
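The script writes form values straight back into the page, both in the hidden field and in the "Search for" line. If a value contains a quotation mark or an angle bracket, the generated page breaks. A small sketch of one remedy follows, using a hypothetical html_escape helper that you would call on any value before printing it.

```perl
use strict;
use warnings;

# Hypothetical helper: escape the characters that are special in HTML so
# that user-supplied text cannot close an attribute or open a new tag.
sub html_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;    # must run first so later entities survive
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

my $DATABASE = '/db/IAphone.txt';
print '<INPUT TYPE="HIDDEN" NAME="DATABASE" VALUE="',
      html_escape($DATABASE), '">', "\n";
```

Escaping only affects how the value is displayed. The path still needs to be checked before it is opened or passed to grep.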
The previous examples searched only one file at a time. However, grep is flexible enough to search multiple files and directories simultaneously.
In the previous example, the user was allowed to choose between several different phone directories. However, it's also possible to search several files at the same time. The script is easily modified to do this because the grep command can search multiple files simultaneously. Instead of specifying one file in the $DATABASE variable, specify a wildcard pattern that matches the phone text files in their directory (/db). So the line beginning $DATABASE= in the original script (search.pl) changes to the following:
$DATABASE="/db/*.txt";
The grep command now searches all files in the /db directory that correspond to the specified wildcard pattern for the desired information.
Taking it a step further, the grep command can also accept multiple files in different directories. For example, you can specify the following database files:
$DATABASE="/db/phone*.txt /db2/address*.txt";
Now, the grep command searches all TXT files in the /db directory beginning with phone and all TXT files in the /db2 directory beginning with address.
By combining the grep command with a directory command, it's even possible to recursively search subdirectories. To do this, change the $DATABASE line to the following:
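$DATABASE=join(" ", split(/\n/, `find /db -name '*.txt'`));
Because the find command operates recursively on directories, $DATABASE contains the names of all TXT files under the /db directory and its subdirectories. The find command prints one file name per line, so the names are joined with spaces before they are handed to grep. Otherwise, the shell would treat each line after the first as a separate command.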
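The same recursive search can also be done without calling find or grep at all. The sketch below uses the File::Find module that ships with PERL 5. The helper names find_txt_files and search_files are hypothetical, and the query is matched as literal text rather than as a grep pattern.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: collect every file under $dir whose name ends in .txt.
sub find_txt_files {
    my ($dir) = @_;
    my @files;
    find(sub { push @files, $File::Find::name if -f $_ && /\.txt$/ }, $dir);
    my @sorted = sort @files;
    return @sorted;
}

# Hypothetical helper: return "file:line" strings for every line in the
# given files that contains $query (case-insensitive, literal match).
sub search_files {
    my ($query, @files) = @_;
    my @hits;
    for my $file (@files) {
        open(my $fh, '<', $file) or next;
        while (my $line = <$fh>) {
            push @hits, "$file:$line" if $line =~ /\Q$query\E/i;
        }
        close($fh);
    }
    return @hits;
}

# Example: search the /db tree from the text, if it exists on this machine.
print search_files('smith', find_txt_files('/db')) if -d '/db';
```

Prefixing each hit with its file name mirrors what grep does when it is given more than one file.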
Although most Web browsers today have forms capability, not all do. To allow these browsers to search for information, it's common to offer an alphabetical or numerical index of data as an alternative to entering a form-based query. Typically, you create a hyperlink for each letter of the alphabet and specify an URL for each hyperlink that performs the appropriate search. For example, in a phone book listing where last names are listed first, you could search for capital Cs at the beginning of a line to get a listing of all last names beginning with C. To create a hypertext index that can submit this type of search automatically, use the following code:
<H1>Phone Book Index</H1>
Click on a letter to see last names beginning with that letter.<P>
<A HREF="/cgi-bin/search?%5EA">A</A>
<A HREF="/cgi-bin/search?%5EB">B</A>
...
<A HREF="/cgi-bin/search?%5EZ">Z</A>
The queries in this example begin with the caret (%5E = ^) to force grep to look for the specified character at the beginning of a line.
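Rather than typing 26 nearly identical links by hand, you can let a script generate them. This is a minimal sketch in which the helper name index_links is hypothetical and the script path /cgi-bin/search is taken from the example above. Note that the URL encoding of the caret is %5E.

```perl
use strict;
use warnings;

# Build one hyperlink per letter. %5E is the URL encoding of the caret (^),
# which anchors the grep pattern to the start of a line.
sub index_links {
    return join(" ", map { qq{<A HREF="/cgi-bin/search?%5E$_">$_</A>} } 'A' .. 'Z');
}

print "<H1>Phone Book Index</H1>\n", index_links(), "\n";
```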

So far, you have only looked at searching collections of simple text files. However, one of the most useful utilities on any Web server is the capability to search for words anywhere on the server, including plain text and HTML files. It's theoretically possible to simply grep all HTML and TXT files under the document root (and other aliased directories), but this can be very time-consuming if more than a handful of documents are present.
The solution to the problem of searching a large Web server is similar to that used by other types of databases. You maintain a compact index that summarizes the information present in the Web server's content area. As data is added to the server, you just keep updating the index file. The usual method of maintaining the integrity of the index file is to run a nightly (or more frequent) indexing program that generates a full-text index of the entire server in a more compact format than the data itself.
A popular indexing and searching solution on the Web is ICE, written in PERL by Christian Neuss in Germany. It's freely available on the Internet at http://www.igd.fhg.de/~neuss/me.html and is included on the WebmasterCD. In the discussion that follows, you learn how ICE works and how it can be modified to include even more features. By default, ICE includes the following features:
Because it's written in PERL, ICE can run on any platform for which
PERL is available, including UNIX, Windows NT, Mac, DOS, and OS/2. However,
some modifications are necessary because of the differences in file systems
employed by these operating systems.

ICE presents results in a convenient hypertext format. Results are displayed using both document titles (as specified by HTML <TITLE> tags) and physical file names. Search results are scored, or weighted, based on the number of occurrences of the search word or words inside documents.
The heart of ICE is a PERL program that reads every file on the Web server and constructs a full-text index. The index builder, ice-idx.pl in the default distribution, has a simple method of operation. The server administrator specifies the locations and extensions (TXT, HTML, and so on) of files to be indexed. When you run ice-idx.pl, it reads every file in the specified directories and stores the index information in one large index file (by default, index.idx). The words in each file are alphabetized and counted for use in scoring the search results when a search is made. The format of the index file is simple:
@ffilename
@ttitle
word1 count1
word2 count2
word3 count3
...
@ffilename
@ttitle
word1 count1
...
The index builder is typically run nightly or at some other regular interval so that search results are always based on updated information. Normally, ICE indexes the entire contents of directories specified by the administrator, but it can be modified to index only new or modified files, as determined by the last modification dates on files. This saves a little time, although ICE zips right along as it is. On a fast UNIX workstation, ICE indexes 2 to 5M of files in under 15 seconds, depending on the nature of the files. Assuming an average HTML file size of 10K, that's 200 to 500 separate documents.
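To make the index format concrete, here is a small sketch that builds the entry for a single document. It is not ICE's actual code. The helper name index_entry is hypothetical, and the sketch assumes that each @f line, @t line, and word/count pair sits on its own line and that only alphabetic words are counted.

```perl
use strict;
use warnings;

# Hypothetical helper: build one index entry for a document, in the
# "@f / @t / word count" layout described above. Not ICE's actual code.
sub index_entry {
    my ($filename, $html) = @_;
    my ($title) = $html =~ m{<TITLE>\s*(.*?)\s*</TITLE>}is;
    $title = '' unless defined $title;
    (my $text = $html) =~ s/<[^>]*>/ /g;        # drop HTML tags
    my %count;
    $count{lc $_}++ for $text =~ /[A-Za-z]+/g;  # words only, no numbers
    my $entry = "\@f$filename\n\@t$title\n";
    $entry .= "$_ $count{$_}\n" for sort keys %count;
    return $entry;
}

print index_entry('/docs/pets.html',
    '<TITLE>Pets</TITLE><BODY>Bad dog. Good dog!</BODY>');
```

Each word appears once per document with its count, which is why the index is usually smaller than the documents themselves.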
UNIX users can run the index builder nightly using the UNIX cron facility for scheduling regular events. On some systems, a system administrator must add your name to the cron.allow file before you can use the crontab command. The following is a typical cron entry for ICE that runs the index builder nightly at 9:34 PM (21:34):
34 21 * * * /usr/local/etc/index/ice-idx.pl
Windows NT users can use the native at command to schedule the indexing utility.
It's often a good idea to schedule cron jobs at odd times
because many other jobs run on the hour by necessity or convention. Running
jobs on the hour that don't have to run at this time increases the load
on the machine unnecessarily.

Searching an index file is much faster than searching an entire Web server using grep or a similar utility; however, there is a definite space/performance tradeoff. Because ICE stores the contents of every document in the index file, the index file can theoretically grow as large as the sum of all the files indexed! The actual compression ratio is closer to 2:1 for HTML because ICE ignores HTML formatting tags, numbers, and special characters. In addition, typical documents use many words multiple times, but ICE stores them only once, along with a word count.
When planning your Web server, be sure to include enough space for
index files if you plan to offer full-featured searching.

The HTML code that produces the ICE search form is actually generated from within a script (ice-form.pl), which calls the main search engine (ice.pl) to do most of the search work. The search simply reads the index file previously generated by the index builder. As the search engine reads consecutively through the file, it simply outputs the names and titles of all documents containing the search word or words. The search form itself and the search engine can be modified to produce output in any format desired by editing the PERL code.
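The sequential read described above can be sketched in a few lines. Again, this is an illustration of the idea rather than ICE's own code. The helper name search_index is hypothetical, and the index is assumed to hold one @f line, one @t line, and then one word/count pair per line for each document.

```perl
use strict;
use warnings;

# Hypothetical helper: scan index text in the "@f / @t / word count" layout
# and return [filename, title, score] for each document containing $word,
# highest score first. A sketch of the idea, not ICE's actual code.
sub search_index {
    my ($index_text, $word) = @_;
    $word = lc $word;
    my ($file, $title, @results);
    for my $line (split /\n/, $index_text) {
        if    ($line =~ /^\@f(.*)/) { $file  = $1; }
        elsif ($line =~ /^\@t(.*)/) { $title = $1; }
        elsif ($line =~ /^(\S+) (\d+)$/ && $1 eq $word) {
            push @results, [$file, $title, $2];
        }
    }
    my @sorted = sort { $b->[2] <=> $a->[2] } @results;
    return @sorted;
}

my $index = "\@f/a.html\n\@tAlpha\ndog 1\ncat 2\n\@f/b.html\n\@tBeta\ndog 3\n";
printf "%s (%s) score %d\n", @$_ for search_index($index, 'dog');
```

Using the word count as the score is the simplest possible weighting. Because every search reads the whole index, search time grows with the size of the index file.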
The ICE search engine is powerful and useful by itself. However, there's always room for improvement. This section discusses several modifications you can make to ICE to implement various additional useful features.
A very useful feature of ICE is the ability to specify an optional directory context in the search form. This way, you can use the same ICE code to conduct both local and global searches. For example, suppose you're running an internal server that contains several policy manuals and you want each of them to be searchable individually, as well as together. You can simply require that users of the system enter the optional directory context themselves; however, a more convenient way is to replace the optional directory context box with option buttons that users can use to select the desired manual.
A more programming-intensive method is to provide a link to the search page on the index page of each manual. The URL in the link can already include the optional directory context so that users don't have to enter this themselves. This way, when a user clicks the link to the search page from within a given manual section, the search form automatically includes the correct directory context. For example, you can tell the ICE search to look only in the /benefits directory by including the following hyperlink on the Benefits page:
<A HREF="/cgi-bin/ice-form.pl?context=%2Fbenefits">Search this manual</A>
The slash of /benefits must be encoded in its ASCII representation
(%2F) for the link to work properly.
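If the links are generated by a script, the encoding can be done automatically. The sketch below uses a hypothetical url_encode helper that replaces every character other than letters, digits, and a few safe punctuation marks with a percent sign and its two-digit hexadecimal code. It assumes single-byte characters.

```perl
use strict;
use warnings;

# Hypothetical helper: percent-encode every character except letters,
# digits, and - _ . ~ so the value is safe inside a query string.
sub url_encode {
    my ($value) = @_;
    $value =~ s/([^A-Za-z0-9\-_.~])/sprintf("%%%02X", ord($1))/ge;
    return $value;
}

my $context = '/benefits';
print '<A HREF="/cgi-bin/ice-form.pl?context=', url_encode($context),
      '">Search this manual</A>', "\n";
```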

For this to work, you need to make the following necessary modifications to ice-form.pl:
<INPUT TYPE="TEXT" NAME="CONTEXT" VALUE="$CONTEXT">
If the size of your index file grows larger than two or three megabytes, searches will take several seconds to complete due to the time required to read through the entire index file during each search. A simple way to improve this situation is to build several smaller index files, say, one for each major directory on your server, rather than one large one. However, this means you can no longer conduct a single, global search of your server.
A more attractive way to break up the large index file is to split it up into several smaller ones, where each small index file still contains an index for every file searched, but only those words beginning with certain letters. For example, ice-a.idx contains all words beginning with "a," ice-b.idx contains all words beginning with "b," and so on. This way, when a query is entered, the search engine is able to narrow down the search immediately based on the first letter of the query.
In the event that your server outgrows the first-letter indexing scheme,
the same technique can be used to further break up files by using unique
combinations of the first two letters of a query, and so on.

To break up the large index file alphabetically, you need to modify the ICE index builder (ice-idx.pl) to write to multiple index files while building the index. The search engine (ice.pl) also needs to be modified to auto-select the index file based on the first letter of the query.
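Both modifications depend on the same mapping from a word to an index file name. The following sketch shows that mapping and how a document's word counts might be grouped by file. The helper names and the catch-all file ice-other.idx are assumptions made for illustration, not part of ICE.

```perl
use strict;
use warnings;

# Hypothetical helper: name of the index file that holds a given word,
# following the ice-a.idx, ice-b.idx, ... naming used above. Words that
# do not start with a letter go to an assumed catch-all file.
sub index_file_for {
    my ($word) = @_;
    my $first = lc substr($word, 0, 1);
    return $first =~ /^[a-z]$/ ? "ice-$first.idx" : "ice-other.idx";
}

# Hypothetical helper: group (word => count) pairs for one document by the
# index file each word belongs in.
sub split_by_letter {
    my (%count) = @_;
    my %by_file;
    for my $word (sort keys %count) {
        push @{ $by_file{ index_file_for($word) } }, "$word $count{$word}";
    }
    return %by_file;
}

my %files = split_by_letter(bad => 3, dog => 3, bark => 1);
printf "%s: %s\n", $_, join(", ", @{ $files{$_} }) for sort keys %files;
```

The index builder would write each group to its file, and the search engine would call the same index_file_for function on the query to decide which file to open.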
Although ICE allows the use of AND and OR operators to modify searches, it only looks for words meeting these requirements anywhere in the same document. It would be nice to be able to specify how close to each other the words must appear, as well. The difficulty with this kind of a search is that the ICE index doesn't specify how close to each other words are in a document. There are two ways to overcome this.
First, you can modify the index builder to store word position information, as well as word count. For example, if the words "bad" and "dog" each occur three times in a file, their index entries might look like the following:
bad 3 26 42 66
dog 3 4 9 27
In this case, 3 is the number of occurrences, and the remaining numbers indicate that dog is the 4th, 9th, and 27th word in the file. When a search for bad dog begins, the search engine first checks if both bad and dog are in any documents and then whether any of the word positions for bad are exactly one less than any of those for dog. In this case, that is true, as bad occurs in position 26 and dog occurs in position 27.
There's another way to search for words near each other. After a search begins and files containing both words are found, those files can simply be read by the search program word-by-word, looking for the target words near each other. Using this method, the index builder itself doesn't have to be modified. However, the first method usually results in faster searches because the extra work is done primarily by the index builder rather than by the search engine in real-time.
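The position check used by the first method is simple to express in code. In this sketch, the helper name near and its $distance parameter are hypothetical. A distance of 1 reproduces the "exactly one less" test from the bad dog example, and larger values allow the words to be farther apart.

```perl
use strict;
use warnings;

# Hypothetical helper: given two lists of word positions, return 1 if some
# position in the second list falls within $distance words after a position
# in the first list, otherwise 0. A distance of 1 means "immediately follows".
sub near {
    my ($first, $second, $distance) = @_;
    my %in_second = map { $_ => 1 } @$second;
    for my $pos (@$first) {
        for my $step (1 .. $distance) {
            return 1 if $in_second{ $pos + $step };
        }
    }
    return 0;
}

# Positions from the example: "bad" at 26, 42, 66 and "dog" at 4, 9, 27.
my $found = near([26, 42, 66], [4, 9, 27], 1);
print $found ? "phrase found\n" : "no phrase\n";
```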
For larger databases of text (say, more than a couple dozen megabytes of text) a more scalable and robust solution than ICE can be found in a publicly available package from the University of Arizona called Glimpse.
Glimpse is available from http://glimpse.cs.arizona.edu/.

The program is written in C and is thus not as instantly portable as the PERL-coded ICE, but the developers have ported it to a large number of different UNIX platforms. Glimpse consists of two main tools: the indexer, called glimpseindex, and the search command, called glimpse. You first build the index by pointing glimpseindex at a particular directory, with a command-line option describing where to store the index files. A typical command line to the indexer looks something like the following:
glimpseindex -H "searchdir" "documentroot"
The documentroot is the root tree of your Web files. The indexer builds a number of files that begin with the prefix .glimpse_ - these files are stored in the searchdir directory, which you might not want to store in the main Web directory.
To run a query against the index, you use glimpse with some command-line options, as the following shows:
glimpse -H "searchdir" "query"
For example,
glimpse -H /www/searchdir radio
finds all files that contain the word "radio." Glimpse supports some powerful options. You can search for words with a given number of spelling errors (that is, one spelling error allows "ratio," two allows "ration"), combine logical AND and OR clauses, and even limit the scope of the results to file names that match particular patterns.
One of the most significant advantages of glimpse, though, is that its algorithms allow a variable size for the index, with the caveat that smaller indexes result in slower searches. So if you have a fast machine but relatively little spare disk space, you can set the index so that it is smaller. The allowed ranges are tiny (index is 2 - 3% the size of the indexed data), small (7 - 9%), or medium (20 - 30%). This is configured using different command-line options with glimpseindex.
Integrating this functionality into your server is done using another package available from the Glimpse folks called GlimpseHTTP. The most significant part of this package is a CGI script called aglimpse. Put the aglimpse file in the right place (perhaps your cgi-bin directory?) and edit the configuration options at the top of the script. These options set things such as the home of your .glimpse_ files, the document root for your Web server, and so on. With this you should now have an efficient, full-featured search engine for your Web server.
At some point, you may find that ICE, Glimpse, and other search engines that build indexes of file-system contents break down, either because you start heavily using server-side includes or because you start having a large amount of content that only CGI scripts access. At this point, instead of gathering data from the file system for the index, you really want to gather data from the Web server itself. This robot-based searching is the basis for several commercial Internet search engines out there, such as AltaVista, Infoseek, Lycos, and WebCrawler. You can also get this functionality for your own server using Harvest.
The Harvest project's home page is http://harvest.cs.colorado.edu/.

Harvest is a reasonably complex system, and while indexing content over the network is one of its primary goals, it is far from a drop-in replacement for Glimpse. One of the bigger goals for the Harvest project is to create an infrastructure for "distributing" indexes, using a network of Gatherers and Brokers. The Gatherers slurp content from Web sites, create indexes of the content they find, and hand those off to the Brokers, which then publicize the existence of those indexes and handle queries. It is worthwhile to note that as of this writing, Netscape's new Catalog Server interfaces with this infrastructure, providing information about the content it indexes in the Summary Object Interchange Format (SOIF), just as used by Harvest.
Harvest is mentioned here as the "future" of server indexing, and if you desire functionality beyond that provided by ICE or Glimpse, you should really invest the time and effort to explore the Harvest project.
While the Net seems vastly endless in its repertoire of solutions to choose from, it becomes more and more incumbent upon you to thoroughly study the feature sets of the various search systems while deciding on one that best suits your Web site with respect to operating system, Web server, volume and value of content, security, and so on. The following list should serve as a basic checklist of things to consider before deciding on any one of the solutions:
The following table shows a list of available commercial, shareware, and freeware search systems that may be used on a Web site. It is important to note that this list is, by no means, exhaustive.
| Product | Company | More Information |
|---|---|---|
| Excite | Architext Software | http://www.excite.com |
| Livelink Search | OpenText Corp. | http://www.opentext.com |
| Verity | Verity, Inc. | http://www.verity.com |
| CompasSearch | CompasWare Development, Inc. | http://www.compasware.com |
| NetAnswer | Dataware Technologies, Inc. | http://www.dataware.com |
| Fulcrum Search Server | Fulcrum Technologies, Inc. | http://www.fultech.com |
A very desirable enhancement to a search system is to include some sort of summary of each document presented in the search results. Infoseek, Lycos, and AltaVista do exactly this by displaying the first couple of sentences of each document on their search results pages. This helps users quickly find the documents most relevant to their topic of interest.
To include summary content, store the first 50 - 100 words in every document in the index file created by the index builder. Doing this, however, requires yet more storage space for the index file, and therefore may not be desirable. You could also have the script that displays results itself open the files returned in the search and grab the first 50 - 100 words from the file, but that imposes something of a burden on the file system if the script is heavily used. Once again, you have a tradeoff between efficiency and disk space, which is in many ways a good thing.
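Either approach needs a routine that pulls the first few words out of a document. The sketch below shows one way to do it. The helper name summary is hypothetical, and the tag stripping is deliberately crude, so text inside <TITLE> tags is counted as part of the document.

```perl
use strict;
use warnings;

# Hypothetical helper: strip HTML tags and return the first $limit words
# of a document as a one-line summary, followed by " ..." if it was cut.
sub summary {
    my ($html, $limit) = @_;
    (my $text = $html) =~ s/<[^>]*>/ /g;
    my @words = split ' ', $text;
    my $cut = @words > $limit;
    @words = @words[0 .. $limit - 1] if $cut;
    return join(' ', @words) . ($cut ? ' ...' : '');
}

print summary('<H1>Benefits</H1><P>This manual describes the company plan.', 4), "\n";
```

Called from the index builder, the result would be stored alongside each document's entry. Called from the results script, it would be run on each file returned by the search.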
The World Wide Web was originally developed as a medium for scientific and technical exchange. One of the important elements of that exchange is the sharing of ideas about other people's work. This has been common on UseNet news for many years now, but articles are limited largely to plain ASCII text. The Web, with its superior hypertext presentation, presents opportunities for richer exchange but has developed as a remarkably one-sided communications medium thus far. This is unfortunate for those who want to take advantage of the Web's superior document capabilities along with the flexibility and interactivity of UseNet.
The Web has developed primarily as a one-way medium simply because the great majority of Web servers and clients have not supported any kind of interactive behavior; Web servers can only serve documents, and Web clients can only browse documents. However, these limitations are not fundamental to either the HTTP protocol or HTML. The ingredients necessary for worldwide annotation of Web documents and posting new documents to servers are already in place, but these have not yet been implemented. There are, however, a few exceptions, which are discussed in the following sections.
The most notable exception is NCSA Mosaic, which supported a feature called group annotations in the first few versions. This feature allows users to post text-only annotations to documents by sending annotations to a group annotation server, which NCSA provided with earlier versions of their Web server. Group annotations, however, have been abandoned in later versions of Mosaic in favor of the HTTP 1.0 protocol, which supports group annotations in a different manner.
The second exception is CGI scripting, which allows data to be received rather than sent by a server. The data is usually simple text, such as a query or form information, but it can also be an entire document, such as an HTML file, spreadsheet, or even an executable program. The new file upload capabilities for forms, as supported by Netscape 2.0 and other browsers, are a great step in the right direction.
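As a rough illustration of a server receiving data through CGI, the following Python sketch reads URL-encoded form fields the way a CGI program normally gets them: from the `QUERY_STRING` environment variable for a GET request, or from standard input (with the size given by `CONTENT_LENGTH`) for a POST request. The function name `read_form` is hypothetical, and the sketch assumes simple form fields only; it does not handle file uploads.

```python
import os
import sys
from urllib.parse import parse_qs


def read_form(environ, stdin):
    """Return submitted form fields as a dict of name -> first value."""
    method = environ.get("REQUEST_METHOD", "GET").upper()
    if method == "POST":
        # The server passes the request body on standard input.
        length = int(environ.get("CONTENT_LENGTH") or 0)
        raw = stdin.read(length)
    else:
        # For GET, the data arrives in the URL's query string.
        raw = environ.get("QUERY_STRING", "")
    return {name: values[0] for name, values in parse_qs(raw).items()}


def main():
    form = read_form(os.environ, sys.stdin)
    # A CGI response starts with headers, then a blank line, then the body.
    print("Content-Type: text/plain")
    print()
    for name, value in sorted(form.items()):
        print(f"{name} = {value}")


if __name__ == "__main__":
    main()
```

A script like this only echoes what it receives. A real annotation or posting script would go on to validate the fields and store them somewhere on the server.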
Because HTTP and HTML already support most (if not all) of the ingredients necessary for a more interactive Web, it's probably only a matter of time before these features are incorporated into browsers and servers alike. In the meantime, however, prototypes of what the future holds have been constructed using news, e-mail, and CGI scripts.
UseNet news makes available today in plain ASCII text some of what the Web will do tomorrow in HTML. In fact, the NNTP protocol is in many ways superior to the HTTP protocol for the purpose of disseminating small messages very widely, for it pushes messages to other news servers instead of waiting for them to be pulled on demand like HTTP.
News can be used effectively as either a private or a public tool for information exchange. Public newsgroups are the most familiar, with worldwide distribution and the ability for anyone to post articles to these groups. By running your own news server, you can also create entirely private newsgroups or semi-private groups, which the public can read but not post to. The ability to control who can read news and who can post to a local server makes news a useful tool for workgroup discussion.
Many Web browsers can both read and post news. This simplifies the
use of both news and hypertext in an organizational context by providing
a common interface for viewing both kinds of documents.

While news is an excellent medium for conducting entirely private (inside a corporate network) or entirely public conversations (UseNet), it's not as well suited for allowing discussions between a select group of individuals located all over the world. It's possible to create a special news server for this purpose and use password security to ensure that only the intended group of people can read or post news to the server. However, users of the system are inconvenienced because most newsreaders expect to connect to one news server only. Users who already connect to another news server to receive public news have to change the configuration information in their newsreader to connect to the special server. Fortunately, there are other answers to this problem.
E-mail is a more flexible method of having semi-private discussions among people all around the world. Using a mailing list server (list server), it is possible to create a single Internet e-mail address for a whole group of people. When an item is sent to the mailing list address, it's forwarded to all members of the list. This approach has several advantages over running a news server, in addition to the previously mentioned convenience issue.
A popular UNIX-based list server is Majordomo, available from http://www.greatcircle.com/majordomo/.

First, e-mail is the most widely accessible of all Internet services. Individuals are more likely to have e-mail access than any other Internet service. Second, e-mail is something that users typically check regularly for new messages. Consequently, there is less effort involved in receiving "news" or discussion items from a mailing list than in checking for news in a separate newsreader. The same applies to posting news, which tends to encourage use of the system.
Through various e-mail gateways, it's possible to do almost anything
by e-mail that can be done on FTP, Gopher, news, or the Web, only slower.

A very nice complement to a mailing list is a mailing list archive, which stores past items on the mailing list. Public mailing list archives are frequently found on FTP sites, but they can also be stored on the Web. A really powerful tool called Hypermail converts a mailing list archive into a hypertext list of messages, neatly organized to show message threads. Mail archives converted with Hypermail can be sorted by author, subject, or date.
Because Hypermail uses the standard UNIX mailbox format, it can even
be used to convert your personal mail on a UNIX workstation to hypertext.
This allows you to read mail in hypertext, a capability not yet supported
by most e-mail readers. The ability to read and write HTML in an e-mail
reader is a useful and interesting addition to e-mail and the Web.

More information on Hypermail can be obtained from http://www.eit.com/software/hypermail/hypermail.html. It is reasonably easy to compile and install on any UNIX workstation.
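To give a feel for the kind of conversion involved, here is a minimal Python sketch that turns a list of mail messages into a hypertext index sorted by date, author, or subject. It is not Hypermail's actual code or output format. The function names and the numbered file names (`0000.html` and so on) are hypothetical, and the sketch assumes that every message carries a valid `Date` header with a time zone offset.

```python
import html
from email.utils import parsedate_to_datetime


def _subject_key(msg):
    """Sort key that ignores leading 'Re:' so replies stay with their thread."""
    subject = (msg["Subject"] or "").strip().lower()
    while subject.startswith("re:"):
        subject = subject[3:].strip()
    return subject


SORT_KEYS = {
    "date": lambda msg: parsedate_to_datetime(msg["Date"]),
    "author": lambda msg: (msg["From"] or "").lower(),
    "subject": _subject_key,
}


def render_index(messages, by="date"):
    """Return an HTML list linking to one (hypothetical) page per message."""
    # Number the messages by archive position so links stay stable
    # no matter which sort order is chosen.
    numbered = sorted(enumerate(messages), key=lambda pair: SORT_KEYS[by](pair[1]))
    lines = ["<ul>"]
    for number, msg in numbered:
        subject = html.escape(msg["Subject"] or "(no subject)")
        author = html.escape(msg["From"] or "(unknown)")
        lines.append(
            f'<li><a href="{number:04d}.html">{subject}</a> <i>{author}</i></li>'
        )
    lines.append("</ul>")
    return "\n".join(lines)
```

The messages can come from anywhere. For a UNIX mailbox file, Python's standard `mailbox.mbox` class yields message objects that work with `render_index()` as written.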
While e-mail and news are both valuable tools for workgroup discussion, they still lack an important feature: the ability to make comments on a document in the document itself. In the paper world, this is accomplished with the infamous red pen. However, the equivalent of the editor's pen in the world of hypertext markup is just beginning to appear. The ultimate in annotation is the ability to attach comments, or even files of any type, anywhere inside an HTML document. For now, however, it's at least possible to add comments to the end of an HTML page. Several people are working on annotation systems using existing Web technology. The following sections take a brief look at a few of them.
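Adding a comment to the end of a page is simple enough to sketch directly. The Python function below is a hypothetical illustration and is not taken from any of the systems described in the following sections. It assumes that a comment consists of an author name and plain text, and that it belongs just before the closing `</body>` tag when the page has one.

```python
import html
import re


def append_comment(page, author, text):
    """Return the page with a comment block added at the end of its body."""
    # Escape user input so a comment cannot inject its own markup.
    block = (
        "<hr>\n"
        f"<p><b>{html.escape(author)}</b> wrote:</p>\n"
        f"<blockquote>{html.escape(text)}</blockquote>\n"
    )
    # Find the last closing body tag, whatever its letter case.
    matches = list(re.finditer(r"</body\s*>", page, re.IGNORECASE))
    if not matches:
        # No closing tag: fall back to appending at the very end.
        return page + "\n" + block
    pos = matches[-1].start()
    return page[:pos] + block + page[pos:]
```

A CGI script could read the page from disk, pass it through `append_comment()` with the submitted form fields, and write the result back. A real system would also need file locking so that two comments posted at the same moment do not overwrite each other.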
Not to be confused with Hypermail, HyperNews does not actually use the NNTP protocol, but it allows a similar discussion format and is patterned after UseNet. You can see examples of HyperNews and find out more about it at http://union.ncsa.uiuc.edu/HyperNews/get/hypernews.html.
A similar system originating at CERN allows new "proposals," or comments, to be submitted in response to a given document. This is a practical way for a group of engineers, for example, to discuss a document. Some degree of security is possible by requiring users to have a valid user name and password before they can post comments. This can be combined with user authorization procedures to control who can see documents, as well. More information is available from http://www.w3.org/pub/WWW/WIT/.
The glaring deficiency of the Web, namely that it has been a one-way street, has not gone unnoticed, however. Quite a few systems are available that employ the traditional client-server architecture to implement Web conferencing.
One commercially available Web conferencing product is WebNotes for Windows NT, a product of OS TEchnologies Corporation. WebNotes is a client-server solution where the "client" is any HTML-capable Web browser (Mosaic, Netscape, and so on). The WebNotes server software maintains threads of discussion topics, remembers which messages each user has already seen, and allows users to post discussion material either as text or as HTML documents with inline graphics. It also employs a text search engine that lets users retrieve discussions with a search query.
More information and a live demonstration of WebNotes can be found
on OS TEchnologies' home page at http://www.ostech.com.

Other Web conferencing systems that can be found on the Net include, but are not limited to:
Some of these systems also let users upload files to the server, so that they can upload image files and include inline graphics in their messages.
Many of the annotation-like systems on the Web today are academic in nature. At Cornell, a test case involving a computer science class allows students to share thoughts and questions about the class via the Web. Documentation on the Cornell system is available at the following Web site:
http://dri.cornell.edu/pub/davis/annotation.html
The Cornell site also has useful links to related work on the Web. Some of the related systems that have been developed use custom clients to talk to an annotation database separate from the Web server itself, much like the early versions of Mosaic. This architecture may well be the future of annotations and the Web.
On the lighter side, take a peek at MIT's Discuss->WWW Gateway to get a behind-the-scenes look into an American hall of higher education. For a particularly novel and entertaining use of the Web, take a peek at the Professor's Quote Board at the following site:
http://www.mit.edu:8008/bloom-picayune.mit.edu/pqb/
In this chapter, you learned how to write a simple query system to search a textual database and to publish the result on the Web. You also took a look at how to use existing search engines to implement search functionality on a Web server and search through the entire Web server's HTML content. The later section of the chapter gave you an overview of the future applications of the World Wide Web and encouraged you to look at the Web not as a one-way street but as a fully interactive solution over the Internet. In the future, the Web browser will take on the role of the universal client and act as a front end to all kinds of servers, such as Web, news, e-mail, FTP, bulletin board, database, and other client-server applications, and even operating systems.