Chapter 16 - Usage Statistics and Maintaining HTML
Setting up your server is just the start of your work. Administering, maintaining, and monitoring usage are important tasks that will keep you busy after your server is operational. The rapid growth of Internet usage, and the growing interest in analyzing that usage, means that there is a lot of work for you to do. Among your major tasks will be setting up the mechanism for monitoring usage statistics, monitoring and presenting those statistics, and checking that the HTML code is correct.
Once your Web server is up and running, keeping it running smoothly and maintaining a responsive, professional-looking site takes ongoing effort. Maintenance activities are associated both with the server itself and with the data files on the server.
Your customers - those people who put up the content of the WWW site and expect it to provide some return on investment - need to know if other people are coming to the site and what they do when they are there. Fortunately, with some effort, you can tell them. A tremendous amount of information about the client systems and activities of the WWW site is captured and available for analysis. You can find out how often your server is being accessed, which files are being accessed most, which clients are accessing your server and how often - almost everything but how much money the user has to spend. Programs designed to gather server usage statistics can convert any or all of this data into graphical summaries quite easily.
When the volume of information on your server becomes large, checking that all hyperlinks operate correctly and that all intended files have been linked to your server becomes more and more difficult. As the number of related documents grows, the only practical way to do this is to use automated programs that check your documents for you.
In this chapter, you learn:
Understanding Usage Logs
When your Web server is running, every document or file request is logged as a separate entry in the server's log file. By default, this file is named logs/access.log under the server root directory, as defined in the HTTPD configuration file. Errors are logged separately in logs/error.log. The access and error logs are very similar, but are discussed separately for clarity.
The Log Formats
All major Web servers produce logs in a format known as CLF, for Common Logfile Format. Quite a few utilities, both freeware and commercial, are available to analyze these logs on most major platforms. The format includes just the basic information about the request (see below). Notably missing from this format are the type of browser used, the "referring" URL, and any cookies used. Diagnostic information about errors is not included either; that is recorded in a separate error log.
The Access Log
Most server programs either have a default directory for storing log files or allow you to configure where the log files should be kept. Usually, the files "access_log" and "error_log" are kept in a subdirectory under the main server directory called "logs". You may have configured the server to log elsewhere, however. Information in the access log includes (in this order):
Note that if the value for any of this data is not available, a "-" is put in its place. For example, if the request is for a space not protected by a password, the field for the login name is replaced with a dash. Extra information, such as the browser type or referring URL, can be recorded to a log file using one of the specialized logging modules for Apache, as described in chapter 7. Because there are no standard tools for analyzing that extra information, it is not covered further here. The following is an excerpt from an access log generated by Apache on the Apache Web site.
Listing 16.1 - Sample Logfile from the www.apache.org Web Site
sf-110.sfo.com - - [01/Mar/1996:01:42:47 -0800] "GET / HTTP/1.0" 200 3920
dyn79.ppp.pacific.net.sg - - [01/Mar/1996:01:42:49 -0800] "GET /docs/ HTTP/1.0" 200 1770
as1s12.erols.com - - [01/Mar/1996:01:42:52 -0800] "GET /images/apache_pb.gif HTTP/1.0" 200 2326
dyn79.ppp.pacific.net.sg - - [01/Mar/1996:01:42:57 -0800] "GET /images/apache_sub.gif HTTP/1.0" 200 6083
port46.fishnet.net - - [01/Mar/1996:01:42:59 -0800] "GET /images/apache_pb.gif HTTP/1.0" 200 2326
narfi.ifi.uio.no - - [01/Mar/1996:01:43:01 -0800] "GET /docs/directives.html HTTP/1.0" 200 3907
sf-110.sfo.com - - [01/Mar/1996:01:43:02 -0800] "GET /dist/ HTTP/1.0" 200 1833
sf-110.sfo.com - - [01/Mar/1996:01:43:03 -0800] "GET /icons/blank.gif HTTP/1.0" 304 -
sf-110.sfo.com - - [01/Mar/1996:01:43:03 -0800] "GET /icons/back.gif HTTP/1.0" 304 -
sf-110.sfo.com - - [01/Mar/1996:01:43:03 -0800] "GET /icons/text.gif HTTP/1.0" 304 -
sf-110.sfo.com - - [01/Mar/1996:01:43:03 -0800] "GET /icons/dir.gif HTTP/1.0" 304 -
sf-110.sfo.com - - [01/Mar/1996:01:43:03 -0800] "GET /icons/tar.gif HTTP/1.0" 304 -
ts20-04.tor.inforamp.net - - [01/Mar/1996:01:43:04 -0800] "GET /images/apache_pb.gif HTTP/1.0" 200 2326
dyn79.ppp.pacific.net.sg - - [01/Mar/1996:01:43:07 -0800] "GET /images/apache_home.gif HTTP/1.0" 200 1465
sf-110.sfo.com - - [01/Mar/1996:01:43:17 -0800] "GET /docs/ HTTP/1.0" 200 1770
sf-110.sfo.com - - [01/Mar/1996:01:43:24 -0800] "GET /docs/compat_notes.html HTTP/1.0" 200 3593
narfi.ifi.uio.no - - [01/Mar/1996:01:43:30 -0800] "GET /docs/ HTTP/1.0" 200 1770
sf-110.sfo.com - - [01/Mar/1996:01:43:32 -0800] "GET /docs/install.html HTTP/1.0" 304 -
194.158.228.97 - - [01/Mar/1996:01:43:43 -0800] "GET /images/apache_pb.gif HTTP/1.0" 200 2326
narfi.ifi.uio.no - - [01/Mar/1996:01:44:03 -0800] "GET / HTTP/1.0" 200 3920
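Each entry in the sample carries seven pieces of data: the client host, two identity fields (the remote identity and the authenticated login name, both "-" here), the bracketed timestamp, the quoted request line, the status code, and the number of bytes sent. As a minimal sketch, assuming each entry sits on a single line, a short Python function can split an entry into named fields. The field names and the function name `parse_clf_line` are labels chosen for this illustration, not part of any server's tooling.

```python
import re

# Field names are illustrative labels for the seven fields of an entry.
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_clf_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # A "-" means the value was not available; map it to None.
    for key in ("ident", "user"):
        if entry[key] == "-":
            entry[key] = None
    entry["status"] = int(entry["status"])
    entry["size"] = None if entry["size"] == "-" else int(entry["size"])
    return entry

sample = ('sf-110.sfo.com - - [01/Mar/1996:01:43:03 -0800] '
          '"GET /icons/blank.gif HTTP/1.0" 304 -')
entry = parse_clf_line(sample)
print(entry["host"], entry["request"], entry["status"], entry["size"])
```

Mapping "-" to None keeps missing values from being mistaken for real login names or byte counts when the fields are tallied later.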
From these pieces, it is possible to put together a wide variety of statistics on your server usage, including:
Because every document access is recorded, log files can grow very quickly. This is compounded by the fact that inline image files are processed as separate requests; for example, a request for a document with three inline gifs actually shows up as four separate requests - one for the document and three for the gifs. Even on a lightly loaded server, the access log can grow to several megabytes each month. More heavily loaded Web servers can generate several megabytes per day. (Here's a clue: the average logfile entry is probably around 80 bytes - multiply that by your hit rate and you'll see how much space is being consumed. At 10,000 requests a day, for example, that comes to roughly 800,000 bytes of log per day.)
If you want to save historical log data, it is a good idea to move the current logfile to an archive and compress it. Compression can typically cut a logfile down to 10 percent of its original size or less. You might want to do this automatically at the beginning of each month, each week, or even at midnight every night.
The Error Log
The error log is the place where the Web server records any problems it encounters. Some of these problems are rather noteworthy - such as an access that failed because the requested object was missing, or because there was a system error. Others are merely warning messages which may or may not indicate serious problems. In any event, the error log is worth consulting on a regular basis to make sure things are working right and your customers see a professionally maintained Web site. The format of the error log is pretty simple: a timestamp and the warning message. For example:
[Sat Apr 6 00:58:54 1996] send lost connection to client 202.229.54.36
This is an example of a simple warning - for some reason the server lost the connection. It's impossible to tell whether the server or the client was at fault; this often appears with older browsers when the user stops loading the page, for example. However, it is useful debugging information if you are having problems. If you see a very large number of these, comparable in number to the entries in your access log itself, there might be problems with your Internet access provider. Other log entries that describe this condition are:
[Sat Apr 6 01:00:41 1996] read timed out for async02.acm.org
[Sat Apr 6 10:54:44 1996] request lost connection to client remote209.compusmart.ab.ca
These will typically be the majority of the entries, if your site is well-maintained. There are other, more ominous warning messages, however. For example:
[Sat Apr 6 09:19:49 1996] access to /export/pub/apache/robots.txt failed for homer-bbn.infoseek.com, reason: File does not exist
In this case, someone was trying to access a file which did not exist. This may indicate that your pages have a broken link somewhere. If you are using the configurable logging module and are recording the "Referer" field from the request, you can find the page containing the link immediately; if you aren't, you can try to see where this user went before requesting the missing object.
However, a warning like this does not always indicate a problem with your site. For one thing, some browsers have bugs which occasionally cause them to request the wrong object, usually because their rules for resolving relative URL links into full URLs are broken. Furthermore, you may have rearranged the content on your site into a new hierarchy without providing redirection, so browsers or proxy caches with older copies of your pages ask for the old objects. With Web crawling search engines like Infoseek or WebCrawler, these old references may be requested several months after the change.
The request above in particular, however, is for a file named robots.txt. This file is the commonly accepted way of informing "robots" of your site's policies regarding what content, if any, may be indexed. The entry therefore most likely means only that a search engine's robot looked for a policy file that this site does not provide.
If you have user authentication turned on anywhere, you'll also see occasional warnings about "user not found" or "password mismatch" when someone enters the wrong user/password combination. For example:
[Sat Apr 6 10:22:08 1996] access to /www/private/ failed for ppp66.isp.net, reason: DBM user george not found
If your CGI script has problems, any messages sent to the "STDERR" pipe will be recorded in the logfile. For example:
[Fri Apr 5 10:01:24 1996] foo.cgi: Insecure dependency in eval while running with -T switch at lib/CGI/Imagemap.pm line 317.
This is a warning from the PERL 5 interpreter invoked by this CGI script. Another major CGI-related error message is "malformed header from script", for example:
[Fri Apr 5 10:01:24 1996] access to /www/htdocs/foo.cgi failed for test09.host.com, reason: malformed header from script
This will occur if the script is not following the rules for CGI output. Other warning messages are largely self-descriptive.
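Because each error-log entry is a timestamp followed by a message, a rough tally of failure reasons is easy to script. The following Python sketch assumes one entry per line in the "[timestamp] message" layout shown above and groups entries by the text after "reason:". The function name `count_error_reasons` and the "(no reason given)" bucket are inventions of this example.

```python
import re
from collections import Counter

# Assumed layout: "[timestamp] message", one entry per line.
ENTRY = re.compile(r'^\[(?P<time>[^\]]+)\]\s*(?P<message>.*)$')
REASON = re.compile(r'reason:\s*(?P<reason>.+)$')

def count_error_reasons(lines):
    """Count entries by the text following "reason:", if any."""
    counts = Counter()
    for line in lines:
        entry = ENTRY.match(line.strip())
        if entry is None:
            continue  # not a timestamped entry
        reason = REASON.search(entry.group("message"))
        key = reason.group("reason") if reason else "(no reason given)"
        counts[key] += 1
    return counts

log = [
    "[Sat Apr 6 00:58:54 1996] send lost connection to client 202.229.54.36",
    "[Sat Apr 6 09:19:49 1996] access to /export/pub/apache/robots.txt "
    "failed for homer-bbn.infoseek.com, reason: File does not exist",
    "[Fri Apr 5 10:01:24 1996] access to /www/htdocs/foo.cgi failed "
    "for test09.host.com, reason: malformed header from script",
]
for reason, total in count_error_reasons(log).most_common():
    print(total, reason)
```

A sudden rise in one reason, such as "File does not exist", is a cue to look at the individual entries behind the count.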
Sifting Usage Data
The access log is a great record of your server's activity, but it's pretty tough to get anything meaningful out of the raw data. You need to sift through and sort the log files and turn them into valuable demographics that illustrate the usage of the Web site. This will justify the investment in the successful pages and help you understand how to improve the less successful ones. This information can also be used to demonstrate the quality of the server or justify the need to upgrade. There is a wealth of tools and products available for sifting and analyzing the access log file. They range from simple operating system commands to sophisticated relational databases.
Quick and Dirty Analysis with UNIX and DOS
Although there are a number of programs available to analyze access logs, a few simple searches can answer many questions quickly, without your having to write a line of code. For starters, look at the basic search tools available under UNIX and DOS.
Searching in UNIX
On a UNIX workstation, the simplest way to search a text file is to use the time-honored grep utility. For example, to find all access log entries containing fred, enter:
grep 'fred' access_log
grep has many powerful capabilities in addition to basic searching. For example, you could search for all lines except those containing becky using the -v option:
grep -v 'becky' access.log
Other useful grep options include:
-i - Turns off case-sensitive searching
-c - Returns only a count of all matching lines
-n - Displays the line number of each matching line
You can also combine options by putting multiple option flags behind the dash. For example, to search for all lines not containing the word 'becky' while ignoring case, type:
grep -vi 'becky' access.log
Searching in DOS
The DOS FIND command performs nearly the same function as UNIX's grep command. To search for all instances of nasa.gov in the access log, enter:
FIND /I "nasa.gov" ACCESS.LOG
Although the DOS FIND command does not have as many options as grep, it has enough for simple log-file searching, including:
/v - Displays all lines not containing the search string
/c - Returns only a count of all matching lines
/n - Displays the line number with each matching line
/i - Does a case-insensitive search
Because the log files are just ASCII text, you can also open your logs in a word processor and use the search features that are part of that particular program. You can also write macros to search for particular strings of text, such as certain error codes, to help you scan through your logs faster.
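Where neither grep nor FIND is at hand, the same options are simple to imitate in a script. This Python sketch mirrors the invert, ignore-case, and line-number behavior described above; the function name `search_lines` and its parameters are made up for the illustration, and it does plain substring matching rather than grep's regular expressions.

```python
def search_lines(lines, pattern, invert=False, ignore_case=False):
    """Return (line_number, line) pairs, roughly like grep -n.

    invert=True behaves like grep -v or FIND /V;
    ignore_case=True behaves like grep -i or FIND /I.
    """
    needle = pattern.lower() if ignore_case else pattern
    result = []
    for number, line in enumerate(lines, start=1):
        haystack = line.lower() if ignore_case else line
        found = needle in haystack
        if found != invert:
            result.append((number, line))
    return result

# Made-up lines standing in for log entries.
lines = ["fred GET /a", "Becky GET /b", "becky GET /c"]
print(search_lines(lines, "becky"))
print(len(search_lines(lines, "becky", invert=True, ignore_case=True)))
```

Taking the length of the result gives the equivalent of grep -c or FIND /C.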
Useful Search Patterns
Now it's time to put grep to work looking for useful data in the access log. Without writing a line of programming code, you can see:
Sifting by Address
Suppose you get a couple of calls one day from users wanting to know why they can't get to the weather map anymore. You ask for their addresses and discover that they never should have had access in the first place. What do you do now? To verify their claims and assess the damage, you can start by simply searching for their addresses in the log file. Suppose the unauthorized users are from iam.illegal.com and ur.illegal.com. To see what they've looked at besides the weather map, you can simply search for illegal.com by entering:
FIND "illegal.com" access.log
The result is a fascinating chronicle of unauthorized activity. If there are too many lines to count, use FIND /C or grep -c to do the dirty work for you, and e-mail the results to your boss on a good day.
This scenario is not all that unlikely, by the way. Basic Web server security is good, but only as good as the rules that are made for it. More often than not, problems arise when people make assumptions or generalizations that turn out to be false. You may think, for example, that all addresses in a certain subnet (beginning with 127.34.26, for example) are located on your network, only to find out later that the first 20 addresses belong to another company. The trick here is just to be aware of what you're doing when you're doing it. Taking the "easy way out" can sometimes open up more of a hole in your security than you really intended.
If you're running a restricted-access Web server, you might want to check now and then to make sure that no one has gotten in from the outside. You can do this easily by looking for all accesses not from your site:
FIND /V "widgets.com" ACCESS.LOG
In this case, anything returned by the search indicates a possible security breach. If you're running on a UNIX machine, run a grep -v command analogous to the previous DOS FIND command as a cron job every week and mail yourself the results so that you don't forget to check now and then.
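A plain substring search treats any line that mentions widgets.com anywhere as local, including a request path or a host such as widgets.com.example.net. As a hedged sketch of a tighter check, the Python function below looks only at the first field of each entry (the client host) and tests whether it ends with the site's domain. The function name `outside_hosts` and the sample entries are made up for this illustration.

```python
def outside_hosts(lines, domain):
    """Return the sorted client hosts that are not inside `domain`."""
    domain = domain.lower()
    suffix = "." + domain
    outsiders = set()
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        host = fields[0].lower()
        if host != domain and not host.endswith(suffix):
            outsiders.add(host)
    return sorted(outsiders)

# Hypothetical entries in the access-log layout used in this chapter.
entries = [
    'red.widgets.com - - [25/Oct/1994:15:02:11 -0800] "GET / HTTP/1.0" 200 3920',
    'iam.illegal.com - - [25/Oct/1994:15:03:40 -0800] "GET /weather/map.gif HTTP/1.0" 200 6083',
    'widgets.com.example.net - - [25/Oct/1994:15:04:02 -0800] "GET / HTTP/1.0" 200 3920',
]
print(outside_hosts(entries, "widgets.com"))
```

Clients logged as bare IP addresses, because their names could not be resolved, will also show up as outsiders under this check and need to be reviewed by hand.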
Sifting by File or Directory
Perhaps you've recently added a new feature to your Web site and want to see how much attention it's getting. Just search your logs for the directory or file name and you're in business. To see how many times your What's New page has been read in the current logging period, simply enter:
FIND "whatsnew.htm" ACCESS.LOG
Or if you've added a whole new directory of stuff (called "/stuff"), try:
FIND "/stuff" ACCESS.LOG
Computing Total Accesses
One measure of your Web server's utilization or exposure is the total number of document requests. This is not necessarily a measure of effectiveness because many people who visit your site may spend only a few seconds there and travel on. This is especially true now because of the Web's notoriety. In fact, the ratio of seriously interested patrons to tourists may even be lower than the percentage of sales resulting from direct-mail campaigns. Fortunately, Web space is a lot cheaper.
Nonetheless, the number of documents requested, or "hits," is of major interest. If nothing else, measuring your server's growth in utilization can give you a good indication of when you'll have to buy more powerful hardware. Without running a more advanced usage statistics program, you can get a good feel for your server's growth simply by counting the total number of document accesses. In general, you want to exclude gif files, however, because inline gifs show up as separate document requests, which distorts the true number of HTML pages accessed. Of course, if providing images is a major part of your site, you may not want to exclude them from the count. But, for example, to find out how many HTML pages have been accessed on your server, not counting the gif files, enter:
FIND /C /V ".gif" ACCESS.LOG
To see how many accesses occur during some specified time period, run this command at regular intervals (every six hours, for example) and compute the difference between successive counts. For more regular time periods, however, such as days and hours, you can use the next technique.
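The gif exclusion can also be done on the requested path itself rather than on the whole line, which avoids dropping an entry merely because ".gif" appears somewhere else in it. The Python sketch below assumes the quoted request has the form "GET path HTTP/1.0", as in Listing 16.1; the function name `count_page_requests` and its `skip_suffixes` parameter are hypothetical.

```python
def count_page_requests(lines, skip_suffixes=(".gif",)):
    """Count entries whose requested path does not end in a skipped suffix."""
    count = 0
    for line in lines:
        parts = line.split('"')
        if len(parts) < 3:
            continue  # no quoted request on this line
        request = parts[1].split()
        if len(request) < 2:
            continue  # request line has no path
        path = request[1].lower()
        if not path.endswith(skip_suffixes):
            count += 1
    return count

entries = [
    'sf-110.sfo.com - - [01/Mar/1996:01:42:47 -0800] "GET / HTTP/1.0" 200 3920',
    'as1s12.erols.com - - [01/Mar/1996:01:42:52 -0800] "GET /images/apache_pb.gif HTTP/1.0" 200 2326',
    'narfi.ifi.uio.no - - [01/Mar/1996:01:43:01 -0800] "GET /docs/directives.html HTTP/1.0" 200 3907',
]
print(count_page_requests(entries))
```

Passing a longer tuple, for example (".gif", ".jpg"), would exclude other image types in the same way.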
Computing Accesses During a Given Period
The access log turns out to be in a very convenient format for finding out how many document requests have been processed in most common time periods. For example, if you want to find out how many documents were transferred between 3:00 and 4:00 p.m. on October 25, 1994, use:
FIND /C "25/Oct/1994:15" ACCESS.LOG
Using this technique, you can look at total accesses in a given hour, day, month, or year. By piping the output of one FIND or grep command into another, you can obtain even more detailed information. For example, to find all accesses from red.widgets.com in the month of October, use:
FIND "red.widgets.com" ACCESS.LOG | FIND "/Oct/"
The first FIND command finds all occurrences of red.widgets.com, while the second FIND looks only in that data for occurrences of /Oct/. (Of course, if you haven't cleaned up your log files for a while, you end up with data from this and all previous Octobers since you last purged or archived your file.) In UNIX, the same thing can be accomplished using regular expressions in a single grep command. For example:
grep 'red.widgets.com.*/Oct/' access.log
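Rather than running one search per hour of interest, a short script can tally every hour in a single pass. This Python sketch assumes timestamps in the bracketed form shown in Listing 16.1 and English month abbreviations (Python's default locale); the function name `hits_per_hour` is made up for the example.

```python
from collections import Counter
from datetime import datetime

def hits_per_hour(lines):
    """Tally entries by the hour given in their bracketed timestamp."""
    counts = Counter()
    for line in lines:
        start = line.find("[")
        end = line.find("]", start)
        if start == -1 or end == -1:
            continue  # no timestamp on this line
        stamp = line[start + 1:end]
        try:
            when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
        except ValueError:
            continue  # timestamp not in the expected layout
        counts[when.strftime("%Y-%m-%d %H:00")] += 1
    return counts

entries = [
    'sf-110.sfo.com - - [01/Mar/1996:01:42:47 -0800] "GET / HTTP/1.0" 200 3920',
    'as1s12.erols.com - - [01/Mar/1996:01:42:52 -0800] "GET /images/apache_pb.gif HTTP/1.0" 200 2326',
    'narfi.ifi.uio.no - - [01/Mar/1996:01:43:01 -0800] "GET /docs/directives.html HTTP/1.0" 200 3907',
]
for hour, total in sorted(hits_per_hour(entries).items()):
    print(hour, total)
```

The hours are reported in the time zone recorded in the log, so they line up with what a text search for a string like 25/Oct/1994:15 would find.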
Usage Utilities
Now for the really neat stuff. The techniques described above give you a lot of answers about your site and its usage, but they require separate actions and still give you raw output. There are numerous products - some free, some commercial - that take all the grunt work out of collecting and totaling usage statistics. These programs range from freeware that still requires some programming effort on your part to commercial packages that provide easy-to-use graphical user interfaces for setup and customization. They all take the raw data in your log files and create reports and graphs customized to your specifications.
Among the freeware offerings, one of the best is wwwstat (available from http://www.ics.uci.edu/WebSoft/wwwstat/). wwwstat is good because it produces thorough and nicely formatted output and can be used with gwstat, which turns the output of wwwstat into attractive usage graphs (in gif format, of course). gwstat is available from ftp://dis.cs.umass.edu/pub/gwstat.tar.gz, and both wwwstat and gwstat are available on the WebmasterCD.
wwwstat
wwwstat is a PERL script that reads the standard access-log file format and produces usage summaries in several categories. wwwstat produces summary information for each calendar month and can be run for past months as well as the current month. Summary categories include:
Figure 16.1 shows an example of Daily Transmission Statistics generated by wwwstat.
Fig. 16.1 - wwwstat generated these Daily Transmission Statistics.
Figure 16.2 shows wwwstat's summary of statistics by client domain, which brings home the truly global nature of the Internet. Part of the wwwstat distribution is a file containing all the country codes in use on the Internet.
Fig. 16.2 - wwwstat's output of country codes and names.
Because wwwstat is a PERL program, and a simple one at that, it can run under most implementations of the PERL interpreter on other platforms.
gwstat
gwstat takes the output of wwwstat and turns it into illustrative graphs. gwstat produces two sizes of gif files - thumbnail sketches and full-size graphs like the hourly-usage graph (see figs. 16.3 and 16.4). You can specify the sizes of both the thumbnail and full-size graphs.
Fig. 16.3 - gwstat's thumbnail sketches.
Fig. 16.4 - Hourly-usage graph produced by gwstat.
gwstat requires three other programs to run, so installation can be time-consuming if you don't already have the other programs. However, the results are well worth it. gwstat is designed to work exclusively under the X Window System on UNIX workstations, so unfortunately, there is no way to port it to other platforms. Besides wwwstat and PERL, gwstat needs Xmgr, ImageMagick, and GhostScript. Information on these programs and hyperlinks to them is available at http://dis.cs.umass.edu/stats/gwstat.html. At that location, there are also instructions for creating a composite imagemap (refer to fig. 16.3).
statbot
Another popular WWW log analyzer is statbot. It works by "snooping" on the log files generated by most WWW servers and creating a database that contains information about the server. This database is then used to create a statistics page and gif charts that can be "linked to" by other WWW resources. Because statbot "snoops" on the server log files, it does not require the use of the server's CGI capability. It simply runs from the user's own directory, automatically updating statistics. statbot uses a text-based configuration file for setup, so it is very easy to install and operate, even for people who have no programming experience. You can find statbot at http://www.xmission.com/~dtubbs/club/cs.html.
AccessWatch
A third freeware product is AccessWatch, a PERL script from Bucknell University. It converts the analyzed data into an HTML file. Figure 16.5 shows an example of AccessWatch output, generated for a subdirectory of HTML files about creating an online newspaper, called CReAte.
Fig. 16.5 - An example of AccessWatch output.
It then adds detailed data in HTML tabular form. You can view the full page at: http://www.eg.bucknell.edu/
AccessWatch is available from http://www.eg.bucknell.edu/~dmaher/accesswatch/getAccessWatch.html. You can find a long list of other analysis tools in the Yahoo directory http://www.yahoo.com/Computers_and_Internet/Internet/
Commercial Products
Commercial products will be proliferating soon. Two early offerings are WebTrends and net.Analysis. WebTrends is a mid-range product that functions more in a batch processing mode. net.Analysis is a high-end product complete with an Informix database and real-time capability. Both offer great flexibility in customizing reports.
Reports generated by WebTrends include statistical information as well as colorful graphs that show trends, usage, market share, and much more. Reports are generated as HTML files that can be viewed by a browser on your local system or remotely from anywhere on the Internet. WebTrends claims it can read the log files of all available servers. You can download an evaluation copy from http://www.webtrends.com/ and try it out with your server. It is highly recommended that you try out any software for an evaluation period before you purchase it.
The following figures are some examples of WebTrends output available from its Web site. These are representative of the kinds of output possible from all of the packages.
Fig. 16.6 - This graph illustrates what Internet domains connected and the number of user sessions over a sample day.
Fig. 16.7 - This table includes additional information such as total and average hits per day.
Fig. 16.8 - This graph illustrates the hits to the pages over a set period of days.
Fig. 16.9 - This table includes additional information such as total number of hits and user sessions.
Fig. 16.10 - This graph illustrates the activity as percentage of total visits.
Fig. 16.11 - This table includes additional information such as the number of user sessions per state.
Fig. 16.12 - This graph illustrates the activity over a twenty-four-hour period as percentage of total visits.
Fig. 16.13 - This table includes additional information that contrasts the weekdays and weekends as well as indicates the busiest and slowest times.
net.Analysis
net.Analysis is a product designed for complex real-time log analysis. It places the log into an Informix database and runs a host of customizable queries to present as complete an analysis as possible. Here are two examples of the results generated by net.Analysis.
net.Analysis is available from: http://www.netgen.com/
A list of other programs to analyze log files is available from http://union.ncsa.uiuc.edu/HyperNews/get/www/log-analyzers.html. A similar list is kept in the Yahoo directory.
Checking HTML
As your server grows, it becomes more and more difficult to find broken hyperlinks, both to documents on your own server and to documents on other servers. This is especially true if many people are responsible for creating and editing documents on your server. Fortunately, there are tools to help you analyze the structure of your HTML database and find problems. Some of these tools are freely available on the Internet.
HTML Analyzer
HTML Analyzer is a C program that both finds broken links and attempts to ensure that the HTML database is well-organized and makes sense to users. It is available in various forms from:
http://wsk.eit.com/wsk/dist/doc/admin/webtest/verify_links.html
ftp://ftp.cc.gatech.edu/pub/gvu/www/pitkow/html_analyzer
ftp://ftp.ncsa.uiuc.edu/Web/Mosaic/Contrib/
The file name will be something like html_analyzer-0.30.tar.gz. The documentation for HTML Analyzer is contained in the program's distribution.
The basic philosophy of HTML Analyzer is that the text of any given hyperlink should always point to the same place and that no other text should point to that same place. This is necessary in order for users to get a clear picture of the organization of the HTML database. HTML Analyzer performs three checks on a database of HTML files - validity, completeness, and consistency.
Checking for Validity
The first check performed by HTML Analyzer is for link validity. This ensures that all hyperlinks point to valid locations (that is, no server errors are returned). Empty hyperlinks (such as HREF=""), local links (such as HREF="#intro"), and links to interactive services (Telnet and rlogin) are not checked. Even without running the other two checks, validity checking helps to ensure that users of your site won't be frustrated by broken links.
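The rule for which links a validity check skips can be written down in a few lines. The Python sketch below is only an illustration of that rule, not HTML Analyzer's own code (which is written in C); the function name `should_check` and the `SKIPPED_SCHEMES` set are inventions of this example.

```python
from urllib.parse import urlsplit

# Interactive services that a validity check leaves alone.
SKIPPED_SCHEMES = {"telnet", "rlogin"}

def should_check(href):
    """Return True if a link target is worth fetching to test validity."""
    href = href.strip()
    # Empty links and links within the same page cannot be fetched.
    if not href or href.startswith("#"):
        return False
    return urlsplit(href).scheme.lower() not in SKIPPED_SCHEMES

for target in ["", "#intro", "telnet://host.example", "docs/install.html"]:
    print(repr(target), should_check(target))
```

Relative links such as docs/install.html have no scheme at all, so they pass the test and would be resolved against the page's own address before being fetched.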
Checking for Completeness
The completeness check ensures that each anchor's contents always occur as a hyperlink. If a hyperlink contained the text Beginner's Guide, for example, and the same text occurred as regular text (not a hyperlink) elsewhere, this is reported. The intent of the completeness check is to improve user convenience by providing a hyperlink everywhere there can be one, and also to prevent the confusion that arises when the same text is a hyperlink in some places but not in others.
Checking for Consistency
The final check ensures that every occurrence of a hyperlink anchor points to the same address and that every occurrence of that address is pointed to by the same hyperlink anchor. In other words, HTML Analyzer checks to see that there is a one-to-one correspondence between hyperlink anchors and their respective addresses.
Listing 16.2 shows an example of the results of HTML Analyzer. In this example, there is no file /u/CIMS/Demo_Description.html located on the server named nsidc1.colorado.edu, an httpd server listening on port 1729. The first series of tests discovered this and notified the user. It also discovered an incomplete link and an inconsistent link.
Listing 16.2 - Sample Results of HTML Analyzer
+++++++++++++++++++++++++++++++++++++++++++++++
VERIFYING LINKS...
WWW Alert: HTTP server at nsidc1.colorado.edu:1729 replies:
HTTP/1.0 500 Unable to access document.
WWW Alert: Unable to access document.
WARNING: Failed in checking:
http://nsidc1.colorado.edu:1729/u/CIMS/Demo_Description.html
With content of: Description of this demo
In local file: ./temp/example.html
VERIFYING COMPLETENESS...
WARNING: These filenames contain the content:
Description of this demo
Without a link to:
http://nsidc1.colorado.edu:1729/u/CIMS/Demo_Description.html
example.html
VERIFYING CONSISTENCY OF LINKS...
WARNING: Link used inconsistently.
HREF: http://nsidc1.colorado.edu:1729/u/CIMS/More_info.html
occurs 1 time with content:
Free Text Frame
as in file: ./temp/example.html, but also
occurs 1 time with content:
More Info Frame
as in file: ./temp/example.html
VERIFYING CONSISTENCY OF CONTENTS...
WARNING: Content used inconsistently.
CONTENT:
Free Text Frame
occurs 1 time with href: http://nsidc1.colorado.edu:1729/u/CIMS/Even_more_info.html
as in file: ./temp/example.html, but also
occurs 1 time with href: http://nsidc1.colorado.edu:1729/u/CIMS/More_info.html
as in file: ./temp/example.html
++++++++++++++++++++++++++++++++++++++++++++++++++++
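The one-to-one correspondence behind the consistency warnings in Listing 16.2 can be sketched with Python's standard HTML parser. This is an illustration of the idea only, not how HTML Analyzer itself is implemented; the class `AnchorCollector`, the function `find_inconsistencies`, and the sample page are all made up for the example.

```python
from collections import defaultdict
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect (anchor text, href) pairs from <A HREF=...> elements."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace inside the anchor text.
            text = " ".join("".join(self._text).split())
            self.pairs.append((text, self._href))
            self._href = None

def find_inconsistencies(html):
    """Return (texts with several hrefs, hrefs with several texts)."""
    collector = AnchorCollector()
    collector.feed(html)
    hrefs_by_text = defaultdict(set)
    texts_by_href = defaultdict(set)
    for text, href in collector.pairs:
        hrefs_by_text[text].add(href)
        texts_by_href[href].add(text)
    by_text = {t: sorted(h) for t, h in hrefs_by_text.items() if len(h) > 1}
    by_href = {h: sorted(t) for h, t in texts_by_href.items() if len(t) > 1}
    return by_text, by_href

page = ('<A HREF="More_info.html">Free Text Frame</A> '
        '<A HREF="More_info.html">More Info Frame</A> '
        '<A HREF="Even_more_info.html">Free Text Frame</A>')
by_text, by_href = find_inconsistencies(page)
print(by_text)
print(by_href)
```

The first result corresponds to the "Content used inconsistently" warning and the second to "Link used inconsistently"; a checker for a whole site would feed every file through the same collector before comparing.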
MOMspider
MOMspider is a PERL program originally written as a class project in distributed information systems at the University of California. MOMspider stands for Multi-Owner Maintenance Spider and is similar to other spiders and robots that traverse the World Wide Web looking for information. MOMspider is available from http://www.ics.uci.edu/WebSoft/MOMspider/ and requires libwww-perl, a library of PERL code for the World Wide Web available from the same site.
Because MOMspider is designed to follow hyperlinks anywhere on the Web, it has many features for controlling the depth of searches and is respectful of other sites' wishes not to be visited by automated robots like MOMspider. MOMspider also has an interesting feature that can build a diagram of the structure of the documents it finds. In addition, MOMspider can avoid sites that are known to cause problems for Web-roaming robots. Examples of these kinds of sites are those that use scripts to generate all output rather than static HTML documents.
Finding What's New
When your Web site is being maintained by many people independently, such as an internal server in a large organization, it becomes impractical, if not impossible, to require that HTML authors tell you every time they create or modify a page on your server. However, it is highly desirable that server administrators be able to quickly and easily find out what new items have been added each day in order to spot potential problems before they spread too far. In addition to administrative concerns, information about new or modified documents on the server is helpful for users who can look on the What's New page and see that the server is continually being updated with valuable information. In UNIX, it's possible to find all new or modified files in an entire directory tree with a single command:
find directory_name -mtime 1
The find command looks recursively down the directory tree specified by directory_name to find all files that meet the specified requirements. The -mtime option matches files by age, counted in whole 24-hour periods - in this case 1, meaning files modified between one and two days ago. To match everything modified within the last n days instead, put a minus sign in front of the number (-mtime -n). You can narrow the search to include only files (not directories) using the -type f option. You can also look for files of a certain extension using the -name 'search_pattern' option. For example, to find only .html files modified in the last week, enter:
find directory_name -mtime -7 -type f -name '*.html'
By including a find command like these examples in a shell or PERL script, you can easily generate a What's New page, as in the following PERL example. The script is available as "whatsnew.pl" on WebmasterCD.
Listing 16.3 - PERL Script That Generates a What's New Page
#!/usr/bin/perl
# whatsnew.pl--David M. Chandler--January 13, 1995
# This program finds all files underneath the search directory which
# were created or modified yesterday. The output is an
# HTML What's New page with hyperlinks to the new pages.
# Invoke the script and redirect the output to your What's New page
# whatsnew.pl >whatsnew.html
#Put your server's document root here
$SEARCHDIR="/httpd/htdocs";
#Create header for What's New document
print "<TITLE>What's New</TITLE>\n";
print "<H1>What's New!</H1>\n";
print "The following documents were created or modified yesterday:<P>\n";
print "<DL>\n";
#Find all HTML files created or modified yesterday
foreach $file (`find $SEARCHDIR -type f -mtime 1 -name '*.html'`)
{
    #Remove the trailing newline left by the find output
    chop($file);
    #Construct the URL from the filename by removing the
    # directory path
    if ($file =~ m%$SEARCHDIR/(.*)%) {
        $url = $1; }
    #Find the document title; fall back to the URL if there is none
    $anchor = $url;
    $title = `grep -i '<TITLE>' $file`;
    if ($title =~ m%<TITLE>(.*)</TITLE>%i) {
        $anchor = $1; }
    #Create the What's New listing
    print "<DD><A HREF=\"$url\">$anchor</A>\n";
}
print "</DL>\n";
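On systems without the UNIX find command, the same search can be done by walking the directory tree and comparing modification times. The Python sketch below is a rough equivalent, not a translation of the script above: it matches files modified within the last `days` days rather than exactly one day ago, and the function name `recently_modified` and its parameters are made up for the illustration.

```python
import os
import time

def recently_modified(root, days=1, suffix=".html", now=None):
    """Return paths under root, relative to root, modified in the last `days` days."""
    now = time.time() if now is None else now
    cutoff = now - days * 86400  # 86400 seconds in a day
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(suffix):
                continue
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) >= cutoff:
                found.append(os.path.relpath(path, root))
    return sorted(found)

if __name__ == "__main__":
    # "." stands in for the server's document root.
    for page in recently_modified(".", days=7):
        print(page)
```

Because the paths come back relative to the starting directory, they can be used directly as the link targets on a What's New page.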
Windows for Workgroups users can accomplish this task easily in File Manager by using the Date Sort tool, which lists all files in chronological order. Likewise, many Windows-based shells, such as Norton Desktop or PC Tools for Windows, have similar features in their file management utilities. DOS users aren't fortunate enough to have the -mtime option available to list only those files modified recently; however, it is possible to see a directory listing sorted by date so that a quick scan reveals any new or modified files. To list a directory with the most recently created or modified files last, use:
DIR /OD directory_name
To list a directory with the most recently created or modified files listed first, use:
DIR /O-D directory_name
Summary
This chapter will set you well on the way to managing the usage of your Web site. You will be able to furnish the content managers with detailed and organized data on the accesses to the site and its pages. You will also be able to check the HTML pages that get placed on the server to see if they are linked properly. This will help to make your site more professional and productive.
Copyright © 1996, Que Corporation