Copyright ©1996, Que Corporation. All rights reserved. No part of this book may be used or reproduced in any form or by any means, or stored in a database or retrieval system without prior written permission of the publisher except in the case of brief quotations embodied in critical articles and reviews. Making copies of any part of this book for any purpose other than your own personal use is a violation of United States copyright laws. For information, address Que Corporation, 201 West 103rd Street, Indianapolis, IN 46290 or at support@mcp.com.
Chapter 08 - Basic HTML: Understanding Hypertext
The whole point of setting up a Web site is so that users can access the information you place on the site. Publishing documents on the Web requires them to be prepared in HyperText Markup Language (HTML), a page description language with provisions for linking related documents together. It is a simple, text-based language that you can view in a variety of fonts on any platform. You can use it with text-only clients, such as Lynx, on a VT220 terminal or with fully graphical clients, such as Mosaic, on advanced graphical workstations.
The present version of HTML, otherwise known as HTML 2.0, is the most commonly used version. Most clients support HTML 2.0, but a few, such as Netscape Navigator and Microsoft Internet Explorer, support additional features, such as blinking text or background sounds, that are not in any HTML specification. Most of this chapter covers standard HTML. Future standards for HTML will include features such as style sheets, tables, support for embedding objects, and possibly a framework for implementing experimental features. The future of HTML promises to make strides toward a universal document format that is both compact and rich in formatting. In this chapter, you will learn the basics of HTML, including:
HTML Fundamentals
Before charging right into the HTML tutorial, it is helpful to review some introductory remarks on HTML to give you a sense of what it is, where it came from, and where it is heading.
History of HTML
HTML is an application of the Standard Generalized Markup Language (SGML). SGML arose out of the international standards community to meet the need for "structured" content, which could be validated algorithmically. In other words, SGML is an open, standards-based (ISO 8879) language for describing document languages, and describes what their structure is (what "tags" can go in other "tags"). This definition occurs in a Document Type Definition, otherwise known as a DTD. When Tim Berners-Lee first started using HTML, it was not defined using a DTD, but thanks to the work of Dan Connolly and many others, its syntax and format were regularized and a DTD was created. Oftentimes, keeping HTML pure to its SGML background can be a challenge, but the benefits of this to the publishing community are too large to ignore.
The first version of HTML, HTML 0, was developed at CERN in 1990 and is largely out of use today. HTML 1.0 incorporated inline images and text styles (highlighting) and was the version of HTML used by most of the initial Web browsers. HTML 2.0 is the current standard. The future of HTML is being decided by vendor-sponsored groups like the World Wide Web Consortium (http://www.w3.org/) and by volunteer standards groups like the IETF (http://www.ietf.org/).
HTML Tags
An HTML document is simply the informational text of the document with structural tags embedded in the text. These tags are character sequences that begin with a less-than sign (<) and end with a greater-than sign (>). Tags can be used to, among other things, apply a style to text, insert a line break, or place an image in the document. To the "purist," a tag signifies a structure - you're not just saying "make this phrase really big by putting an <H1> around it," you're saying "this is a first-level heading in my document." The idea is similar to older word processors and page layout systems that require insertion of formatting tags to specify bold, underlined, or italicized type. Newer word processors use the same premise, but usually hide these tags from the user. Some word processors, however, allow you to display the formatting tags - WordPerfect, for example, provides you with the Reveal Codes menu option.
For a look at some HTML, first consult figure 8.1, which shows the World Wide Web (W3) Consortium's home page (http://www.w3.org/). Choose the Document Source option from Netscape Navigator's View menu to activate a window with the HTML source loaded. The HTML source corresponding to figure 8.1 is shown in figure 8.2.
Fig. 8.1 - The W3C home page as displayed by the Netscape Navigator.
Fig. 8.2 - Netscape allows you to view the HTML source of the document in the browser window.
Viewing the source code of a document is a great way to learn HTML, but you should be aware that not all browsers have this feature. In addition to differences in features, you should also know that different browsers often display the same page in different ways. Figure 8.3 shows the W3C home page in Lynx, a text-only browser. Notice how the elements pointed out in figures 8.1 and 8.2 are rendered differently in Lynx.
You should also be extremely cautious about simply learning by example. Someone can sometimes get an interesting effect with a particular tag combination even though that combination is illegal under the specifications. Even though it might look all right in the browser you are using, other browsers may not be able to handle it at all, even if they completely conform to the spec. When in doubt, consult the specs.
Fig. 8.3 - The W3C home page as displayed by Lynx.
The differences in browser rendering are not a significant problem with the basic HTML formatting tags, but they can be an issue when your documents contain more advanced HTML, particularly those tags that are extensions to HTML supported by only a few browsers. This points to an important challenge in creating Web documents: how to incorporate the advanced features while not breaking browsers that can't render those features. As you read this chapter and the next, note the suggestions for writing browser-friendly HTML. Following these suggestions will make your documents accessible to the largest audience possible.
Platform Independent
Most of HTML's formatting features specify logical rather than physical styles. For example, the heading tags, which normally indicate larger font sizes, do not specify which size to use. Instead, a browser chooses a size for the heading that is larger than its default text size. This allows Macs to view files written on PCs and served by UNIX boxes. This also allows clients like Lynx to render the important text in all caps, if it can't handle changing the font size or color. Even though you can't control the exact font and size with logical structures, it's best to leave the logical-to-presentational formatting up to the client, since the client best understands its own rendering limitations.
Three Basic Rules
In spite of the differences between them, Web browsers do consistently follow three rules when parsing HTML. These are:
White Space Ignored
The fact that browsers ignore white space is often a source of frustration for the beginning HTML author. Consider the following HTML:
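A minimal sketch of such a fragment, assuming the publisher's address typed on three separate lines with two spaces before the ZIP code, might look like this:

```html
Que Corporation
201 West 103rd Street
Indianapolis, IN  46290-1097
```

A browser collapses each carriage return and each run of spaces into a single space, so all three lines run together when displayed.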
Fig. 8.4 - Carriage returns in the HTML source code don't translate to carriage returns on the browser screen.
Mosaic tries to display the address all on one line! The carriage returns in the file, which make the address look fine in an editor or on a printout, are ignored by the browser. The same is true of other white space characters like tabs and extra spaces. In the HTML above, there are two spaces between IN and 46290-1097, but only one space between them in the browser window. The second space character is ignored.
Formatting Tags Are Not Case-Sensitive
You can write all HTML formatting tags in upper-, lower-, or mixed case. For example, browsers interpret <TITLE>, <title>, and <Title> the same way.
Most Formatting Tags Occur in Pairs
With only a few exceptions, HTML formatting tags occur in pairs in which the beginning tag activates an effect and the ending tag turns off the effect. Tag pairs are often called container tags, since the effects they turn on and off are applied to the text they contain. For example, to specify that a line of text appears in bold, you write:
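A one-line sketch of such a container pair, with placeholder text chosen for illustration:

```html
<B>This line of text appears in bold.</B>
```

The <B> tag turns bold on, and the </B> tag, with its leading slash, turns it off again.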
Uniform Resource Locators (URLs)
While not directly related to HTML, Uniform Resource Locators (URLs) are an important part of HTML documents used in many different tags. For this reason, a quick primer on URLs is in order.
A URL is basically the address of a document on the World Wide Web. The URL is a way of compactly identifying any document on any type of Web-compatible server anywhere in the world. The URL consists of four parts: an "access scheme," Internet address, port, and object. With the exception of the "news" and "mailto" access schemes, the general format for a URL is as follows:
access-scheme://internet_address:port/object
In addition, you can optionally specify search or query information after the object when sending data to a search or script. This is covered in Chapter 12, "HTML Forms."
Access Scheme
The access scheme indicates what type of Internet application is requested. Usually an "access scheme" maps directly to an Internet protocol, as is the case with "http," but not always. NNTP, for example, has both "nntp" and "news" schemes. In order to use a given protocol, both the client (browser) and Internet server must be able to speak that protocol. The most common protocol in Web documents is "http" (HyperText Transfer Protocol), which is spoken by all Web servers and clients. In addition, almost all browsers support FTP, Gopher, Telnet, and News. Some also support WAIS. Some fictional examples of URLs using these protocols follow:
ftp://ftp.fedworld.gov/pub/irs-pdf/form1040.pdf
gopher://gopher.government.gov/reports/census.txt
telnet://loc.gov
news:sci.psychology.clinical
mailto:info@netscape.com
To read Internet news through a Web browser, you have to be able to connect to a news server, which continually receives messages over the Internet and stores them locally for a short time (usually about two weeks). Newsfeeds cost money, and for this reason, no news servers are publicly available on the Internet. If your site wishes to take full advantage of Internet news, you must obtain a newsfeed from your Internet service provider or obtain authorization to connect to your provider's news server. The mailto: URL allows you to send electronic mail to the specified address directly from your browser. The mailto: URL is supported by Netscape, Lynx, and others, but it isn't supported by all browsers.
Address
The address portion of a URL is simply the hostname or IP (Internet Protocol) number of an Internet server. This address can be either the familiar named dot notation (like ftp.ncsa.uiuc.edu) or a number sequence (like 127.0.0.1).
Port
The port is an optional URL element. If the port is omitted, the default port for the specified protocol is assumed. In the case of HTTP, this is 80.
File Name
The document path, or file name, is the same as that used by DOS and UNIX systems alike, although the slash is forward (/) rather than backward (\) for DOS users. Each slash goes down to the next subdirectory having the specified name, and the path ends in a file name with an extension (such as TXT or HTML). It is also possible to specify a path to an entire directory simply by ending with the directory name and a trailing slash (/). For example, to see the contents of the fruits directory on an FTP server, you can use:
ftp://ftp.healthy.com/fruits/
A URL that specifies a protocol, Internet address, and file name is said to be an absolute URL. In some cases, it is also possible to specify one URL relative to another, resulting in a relative URL. For example, suppose your base URL is http://www.healthy.com/fruits/citrus/tarty_fruits.html and you need to specify the URL of the file intro.html located in the fruits directory (one directory level up from citrus). You can do this with the absolute URL http://www.healthy.com/fruits/intro.html, but it can also be appropriate to give the URL relative to the base URL. In this case, the relative URL would be "../intro.html." The two dots followed by a forward slash (../) are an indicator to move up one directory level. If you need to specify the URL of the file "lemonade.html" in the lemons directory (a subdirectory of the citrus directory), you can use the relative URL "lemons/lemonade.html."
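The two relative URLs above can be sketched in a document as follows. This is only an illustration: it borrows the <BASE HREF> and <A HREF> tags covered later in this chapter, and the title and link text are made up.

```html
<HTML>
<HEAD>
<TITLE>Tart Fruits</TITLE>
<!-- Base URL taken from the example in the text -->
<BASE HREF="http://www.healthy.com/fruits/citrus/tarty_fruits.html">
</HEAD>
<BODY>
<!-- Resolves to http://www.healthy.com/fruits/intro.html -->
<A HREF="../intro.html">Introduction to fruits</A>
<BR>
<!-- Resolves to http://www.healthy.com/fruits/citrus/lemons/lemonade.html -->
<A HREF="lemons/lemonade.html">Lemonade</A>
</BODY>
</HTML>
```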
General HTML Style
While you are generally free to write HTML any way you want, there are a few issues of style to keep in mind. If you're just starting out, take these style issues to heart and develop good authoring habits from the outset. If you've been writing HTML for a while and have perhaps "forgotten" about some of the aspects of good style, this is a great time to remind yourself of them and work them back into your documents.
Uppercase Tags
While it is true that HTML tags are not case-sensitive, it is a good idea to always make them all uppercase. Remember that tags are embedded in other text and this can make them difficult to read when writing or editing HTML. Tags that are all uppercase stand out much better in a sea of text.
Remember, though, that URLs are case-sensitive.
Document Structure
It used to be that a discussion of HTML document structure would be right at the beginning of an HTML tutorial. However, since most browsers can still parse an HTML file without the structure-defining tags, many authors have fallen out of the habit of including these tags in their documents and their inclusion becomes an issue of style. Good HTML style suggests that you always include tags to define the major parts of your documents. The three major parts are:
The HTML Declaration
The HTML declaration is simply accomplished by making the <HTML> tag the first thing in your file and making the </HTML> tag the last thing in your file. These container tags say "Everything between us is HTML code."
The Document Head
The document head should immediately follow the <HTML> tag and is contained in the <HEAD> ... </HEAD> tag pair. The document head contains information about the document that is typically transparent to the user. While many informational items can be specified in the document head, the two that you should always include are the title and the base URL of the document.
The document's title is designated with the <TITLE> ... </TITLE> tag pair. You should make your titles descriptive, while still keeping them fairly short. A forty-character title is a good rule of thumb. Document titles typically appear at the top of the browser window (refer to fig. 8.1). They are also used in bookmark files.
The base URL of the document is given in the <BASE HREF="base_url"> tag. You really only need to set this if you anticipate someone arriving at your page through a URL other than the one on which you wish to base relative URL links.
The Document Body
The document body immediately follows the head and is enclosed in the <BODY> and </BODY> tags. The body contains all of the information that will be presented to the user and the tags used to format that information.
Putting these three parts of the document together, you get a basic template for an HTML document (see listing 8.1).
Listing 8.1 HTML Document Template
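A minimal sketch of such a template follows. The title text and the base URL are placeholders chosen for illustration, not values from the book.

```html
<HTML>
<HEAD>
<TITLE>A Short, Descriptive Title</TITLE>
<BASE HREF="http://www.example.com/index.html">
</HEAD>
<BODY>
The information presented to the user, and the tags that format it, go here.
</BODY>
</HTML>
```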
Getting Started
To start writing HTML, all you really need is an editor that allows you to save files in ASCII format and a browser to test your documents. If you plan to include images in your documents, you'll need a graphics program as well.
Editor
On UNIX, many people will claim that the best editors are the same editors people on UNIX have been using for a long time, namely, Emacs and vi. vi is a very simple text editor. Crafted for an era of low memory requirements and small feature sets, vi is relatively easy to use but not incredibly full-featured. Emacs, on the other hand, is a very full-featured application. It has a built-in LISP interpreter; one particularly relevant Emacs-LISP module that has been created is the "HTML-Mode" module. Not only will it automatically give you all the default elements of an HTML document when you edit a new file whose name ends in ".html," it also colors different tags and structural elements, making it very easy to see the difference between an <H1> tag section and an <A> section. More information on these will be provided later.
Browser
You only really need one browser to test your documents, but it's a good idea to look at your HTML files in two or three browsers to make sure your code is as browser-friendly as possible. It's easy to get a copy of the popular browsers. NCSA Mosaic 2.0 and Netscape Navigator 2.0 are available for public download from the Mosaic (ftp://ftp.ncsa.uiuc.edu/Mosaic/) and Netscape (ftp://ftp.netscape.com/) FTP sites. A browser that actually implements more of the future HTML features is Arena (http://www.w3.org/pub/WWW/Arena/), an experimental browser developed and maintained as a reference software piece by the W3C. It should be noted that UNIX only accounts for about 15 percent of the browser market as of this writing, so to really test your pages, it would be wise to check them out on Windows and Mac browsers as well.
HTML Tutorial
With the preliminaries covered, you're now ready to learn the basic HTML tags. All of the tags discussed in this section are found in the document body (between the <BODY> and </BODY> tags) and fall into several categories:
Paragraphs and Line Breaks
The <P> tag is used to indicate the start of a new paragraph. Paragraphs are separated by a blank line. To start a new paragraph without the extra line of separation or to just move to the next line, use the <BR> tag (line break). Line breaks were needed back in figure 8.4 to render an address properly. Figure 8.5 shows the difference between paragraphs and line breaks. Listing 8.2 shows the corresponding HTML.
Listing 8.2 HTML for Figure 8.5
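A short sketch of the difference between the two tags, using made-up body text together with the publisher's address from earlier in the chapter:

```html
<P>
This is the first paragraph.
<P>
This is the second paragraph. A blank line separates it from the first.
<P>
Que Corporation<BR>
201 West 103rd Street<BR>
Indianapolis, IN 46290-1097
```

Each <BR> moves to the next line without adding a blank line, which is what an address needs.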
Heading Styles
HTML supports six heading styles, which are used to make text stand out by varying degrees. These are numbered one through six, with one being the largest. To format text in a heading style, enclose it in the <Hn> and </Hn> tags, where n is the number of the heading style you want to apply. Figure 8.6 shows how the six heading styles are rendered in Microsoft Internet Explorer by default. The corresponding HTML is shown in listing 8.3.
Listing 8.3 HTML for Figure 8.6
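A sketch of all six heading styles, with placeholder heading text:

```html
<H1>Heading Style 1</H1>
<H2>Heading Style 2</H2>
<H3>Heading Style 3</H3>
<H4>Heading Style 4</H4>
<H5>Heading Style 5</H5>
<H6>Heading Style 6</H6>
```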
Physical Styles
Physical styles are actual attributes of a font, such as bold or italic. HTML supports the four physical styles shown in table 8.1. To apply a physical style, simply place the text to be formatted between the appropriate tag pair shown in the table.
Fig. 8.7 - Physical styles are used to render text in boldface, italics, or a fixed width. The underline style is frequently not supported.
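As a sketch, the four physical styles named in the caption above correspond to the <B>, <I>, <TT>, and <U> tag pairs. The sample text is made up for illustration.

```html
<B>This text is bold.</B><BR>
<I>This text is italicized.</I><BR>
<TT>This text is in a fixed-width font.</TT><BR>
<U>This text is underlined, if the browser supports it.</U>
```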
Logical Styles
Logical styles indicate the meaning of the text they mark in the context of the document. Since they are not related to font attributes, logical styles can be rendered differently on different browsers. Table 8.2 lists the common logical styles and their meanings and typical renderings. Closing tags are required for all logical styles, but have been omitted in the table to save space. To create a closing tag, just add a slash before the tag name, like </ADDRESS>.
Figure 8.8 shows how Netscape renders many of the logical styles. Listing 8.4 shows the corresponding HTML.
Listing 8.4 HTML for Figure 8.8
As you look at the typical renderings in table 8.2, you probably noticed that you can accomplish almost all of them by using the physical styles. If you did notice, you're likely asking "Why should I use the logical styles?" An "official" answer is: to give a contextual meaning to the text that you're marking up. Formatting doesn't really matter with the logical styles; it's the meaning they impart that is important. Such an official answer would come from a person who subscribes to the school of thought that HTML is a page-description language only. Authors who use HTML as a design tool are likely to cast aside such official responses and just use the physical styles to get the same effect. After all, it is easier to type <I>info@abc_corp.com</I> than it is to type <ADDRESS>info@abc_corp.com</ADDRESS>. The decision to use physical styles, logical styles, or both ultimately rests with each author, based on his or her take on whether HTML is for page description or page design.
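A sketch of several logical styles defined in HTML 2.0. The sample text is invented for illustration, and the actual renderings vary from browser to browser.

```html
<EM>Emphasized text</EM><BR>
<STRONG>Strongly emphasized text</STRONG><BR>
<CITE>The Title of a Cited Work</CITE><BR>
<CODE>a fragment of program code</CODE><BR>
<KBD>text the user should type</KBD><BR>
<SAMP>sample output from a program</SAMP><BR>
<VAR>a_variable_name</VAR>
<ADDRESS>info@abc_corp.com</ADDRESS>
```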
Preformatted Text
Text tagged with the <PRE> and </PRE> tags is treated as preformatted text and rendered in a fixed-width font. Since each character in a fixed-width font has the same width, it is easy to line up text into columns and produce a table. Listing 8.5 produces the table you see in figure 8.9.
Listing 8.5 HTML for Figure 8.9
Lists
HTML lists provide an easy and attractive way to present information in your documents. All lists require a pair of tags for the type of list and for each list item. Table 8.3 lists three types of formatted lists.
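A sketch of a small preformatted table. The rows are invented for illustration and are not the data shown in figure 8.9.

```html
<PRE>
Fruit        Color     Taste
-----------  --------  -----
Lemon        Yellow    Sour
Orange       Orange    Sweet
</PRE>
```

Inside <PRE>, the spaces and carriage returns are kept as typed, so the columns line up on screen just as they do in the editor.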
Items in an ordered list are automatically numbered by the browser, starting with the number one. The automatic numbering is convenient, because it spares you from having to do it if you rearrange list items. Unordered list items are bulleted rather than numbered. Description lists allow you to present a term, followed by a description below and indented under the term.
List items in all three list types are indented from the left margin, making it easy to distinguish them from the rest of the body text. Figure 8.10 shows examples of unordered, ordered, and description lists as produced by listing 8.6.
Listing 8.6 HTML for Figure 8.10
Fig. 8.10 - Unordered, ordered, and description lists provide an easy way to break out information.
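A sketch of the three list types, assuming the standard <UL>, <OL>, and <DL> containers with <LI>, <DT>, and <DD> items. The item text is made up.

```html
<UL>
<LI>Lemons
<LI>Limes
</UL>

<OL>
<LI>Wash the fruit
<LI>Slice it in half
</OL>

<DL>
<DT>Citrus
<DD>A family of fruits that includes lemons and limes
</DL>
```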
You can nest lists inside of other lists, as shown in figure 8.11. Listing 8.7 shows the HTML to produce this figure.
Listing 8.7 HTML for Figure 8.11
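A sketch of nesting, with an ordered list and an unordered list placed inside the items of an outer unordered list. The item text is illustrative only.

```html
<UL>
<LI>Citrus fruits
  <OL>
  <LI>Lemons
  <LI>Limes
  </OL>
<LI>Berries
  <UL>
  <LI>Strawberries
  </UL>
</UL>
```

The indentation in the source is only for the author's benefit, since browsers ignore it.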
Special Characters
Because many characters have special meanings in HTML, it is necessary to use special character sequences when you want special characters to show up as themselves. You can also use special character sequences to produce foreign language characters and symbols. These are referred to as SGML entities.
Reserved Characters
Because the less than (<), greater than (>), and quotation mark (") characters are used in HTML formatting tags, the characters themselves must be represented by special character sequences. The ampersand (&) is used in these special sequences, so it also must be represented differently. Table 8.4 lists all the special character sequences in HTML. The semicolon (;) is necessary to indicate where the character description ends and normal text resumes.
If you're writing HTML code to produce HTML code on a browser screen, you will use the sequences in table 8.4 frequently. For example, to produce a list of the physical style tags, you would need to use the HTML shown in listing 8.8.
Listing 8.8 HTML for Producing a List of Physical Style Tags
Fig. 8.12 - Writing HTML to produce on-screen HTML requires the use of special character sequences.
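A sketch of such a list, using the standard sequences &lt; for the less-than sign and &gt; for the greater-than sign (the ampersand itself is written &amp; and the quotation mark &quot;). The descriptions after each tag pair are made up.

```html
<UL>
<LI>&lt;B&gt; ... &lt;/B&gt; - bold
<LI>&lt;I&gt; ... &lt;/I&gt; - italics
<LI>&lt;TT&gt; ... &lt;/TT&gt; - fixed width
<LI>&lt;U&gt; ... &lt;/U&gt; - underline
</UL>
```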
Foreign Language Characters
HTML uses the ISO 8859-1 (Latin-1) character set, which includes the accented characters used by most Western European languages. Since these characters are not on most keyboards, you need to use special character sequences to place them in your documents. Like the other special character sequences in HTML, these sequences begin with an ampersand (&) followed by a written-out description of the character and a semicolon (;). Table 8.5 lists all the foreign-language sequences available.
Characters by ASCII Number
You can reference any character in the ISO 8859-1 set, including all the ASCII characters, in an HTML document by including the ampersand (&) and pound sign (#) followed by the character number in decimal and a semicolon (;). For example, to include the copyright symbol (©) in an HTML document, you write:
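The copyright symbol is character number 169 in ISO 8859-1, so a sketch of the reference, followed by sample text borrowed from this book's copyright notice, is:

```html
&#169; 1996, Que Corporation
```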
You can find more information on SGML entities, character sets, and more at http://www.bbsinc.com/iso8859.html.
Comments
It is possible to include comment lines in HTML that do not show up in browsers. You should consider placing comments in documents that you and others will be working on together. Many stand-alone HTML editors provide templates that include a comment area for information like the author's name and the date the document was last changed. The format for a comment is as follows:
Non-breaking Space
You can prevent a browser from breaking a line between two words by inserting a non-breaking space between the words. Non-breaking spaces are represented by the special character sequence &nbsp;.
Horizontal Lines
Horizontal lines are a great way to break up sections of text-intensive documents. Placing a horizontal line is easy: just put an <HR> ("horizontal rule") tag in where you want the line to go. No closing tag is required.
Images
Without the visual appeal of inline images, it is doubtful that the World Wide Web would have become popular so rapidly. Graphical Web browsers such as Netscape Navigator, Mosaic, and Microsoft Internet Explorer can automatically display images in both the GIF and JPEG formats inside documents.
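A comment begins with <!-- and ends with -->. In this sketch the author and date fields are placeholders to fill in:

```html
<!-- Author: (author's name) -->
<!-- Last changed: (date) -->
```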
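As a small sketch, assuming the browser supports the &nbsp; sequence, the following keeps a two-word product name together on one line:

```html
This page looks best in Netscape&nbsp;Navigator.
```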
Graphics Formats: GIF and JPEG
GIF (Graphics Interchange Format) was originally developed for users of CompuServe as a standard for storing image files. Graphics stored in the GIF format are limited to 256 colors.
GIF supports two desirable Web page effects. The first is interlacing, in which non-adjacent parts of the image are stored together. As a browser reads in an interlaced GIF, the image appears to "fade in" over several passes. The other effect supported by the GIF format is transparency. In a transparent GIF, one of the colors is designated as transparent, allowing the background of the document to show through.
A frequently asked question on the World Wide Web newsgroups is: "How can I create transparent GIFs?" Both UNIX and Windows users can use a program called giftrans to create transparent GIFs from existing images. Another useful tool for this purpose is "giftool." Pointers to both are available from http://melmac.harris-atd.com/transparent_images.html.
JPEG (Joint Photographic Experts Group) refers to a set of formats that supports full color images and stores them in a compressed form. Most popular graphical browsers currently display JPEG images, though previously these images had to be viewed in a separate program. The progressive JPEG format, which has recently emerged, gives the effect of an image fading in just as an interlaced GIF would. Transparency is not possible with JPEG images because the compression tends to make small changes to the image data. If a pixel originally colored with the transparent color is given another color, or if a non-transparent pixel is assigned the transparency color, the on-screen results would be dreadful.
The <IMG ...> Tag
You must save images as separate files even though they are referenced and displayed inside an HTML document. To place an inline image on a page, you use the <IMG ...> tag.
Syntax: <IMG SRC="URL">
Inline images are always aligned flush left, although future versions of HTML may allow centering and flush right alignment. For example, to place the World Wide Web Consortium's logo next to its name on its home page (refer to fig. 8.2), the HTML looked like:
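A sketch of the general form, assuming a hypothetical image file named w3c_logo.gif rather than the actual file used on the W3C page:

```html
<IMG SRC="w3c_logo.gif"> The World Wide Web Consortium
```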
The ALIGN attribute controls the location of text that follows the image. By default, text appears at the bottom of an inline image. Figure 8.13 shows how you can use the ALIGN attribute to change the text to be aligned with the middle or top of the image. Specifically, ALIGN=MIDDLE aligns the baseline of the text with the middle of the image and ALIGN=TOP aligns the top of the text with the top of the image. Listing 8.9 shows the HTML for this figure.
Listing 8.9 HTML for Figure 8.13
The ALT attribute specifies alternate text to be shown in place of an image in text-only browsers. Including the ALT attribute is a courtesy to dial-up and dumb terminal users; don't overlook this courtesy. Also, graphical browsers sometimes fail to load an image, in which case they use the text specified by ALT instead. For example, to include text-only support in the previous example, the line would look like this:
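A sketch of the three alignments, assuming a hypothetical image file named logo.gif and made-up caption text:

```html
<IMG SRC="logo.gif" ALIGN=TOP> This text lines up with the top of the image.
<P>
<IMG SRC="logo.gif" ALIGN=MIDDLE> This text lines up with the middle of the image.
<P>
<IMG SRC="logo.gif" ALIGN=BOTTOM> This text lines up with the bottom of the image.
```

ALIGN=BOTTOM matches the default behavior, so leaving the attribute out gives the same result as the last line.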
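A sketch of an image with alternate text, again assuming the hypothetical file name w3c_logo.gif and a made-up ALT string:

```html
<IMG SRC="w3c_logo.gif" ALT="[W3C logo]"> The World Wide Web Consortium
```

A text-only browser such as Lynx would show the bracketed text where a graphical browser shows the image.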
Hypertext and Hypergraphics
Now to the other half of the HyperText Markup Language - the hypertext part. A hypertext reference is very simple. It consists of only two parts: an anchor and a URL. The anchor is the text or graphic that the user clicks to go somewhere. The URL points to the document that the browser will load when the user clicks on the anchor.
In HTML, an anchor can be either text or a graphic. Text anchors usually appear underlined and in a different color than normal text on graphical browsers and in bold on text-only browsers such as Lynx. Graphic anchors (hypergraphics) usually have a colored border around them to distinguish them from plain graphics.
Creating Hypertext Anchors
Any text can be a hypertext anchor in HTML, regardless of size or formatting. An anchor can consist of a few letters, words, or even lines of text. The format for an anchor-address pair is simple:
Creating Hypergraphics
You can use hypergraphics to create button-like effects and provide a nice alternative to clicking plain text. The format for a graphic anchor is the same as a text anchor. However, instead of putting text between the <A HREF> and </A> tags, you reference an inline image. Figure 8.14 shows a hypergraphic.
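The URL goes in the HREF attribute of the <A> tag and the anchor text sits between <A ...> and </A>. In this sketch the first line shows the general pattern and the second fills it in with the W3C address used earlier in the chapter.

```html
<A HREF="URL">anchor text</A>
<A HREF="http://www.w3.org/">The World Wide Web Consortium</A>
```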
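A sketch of a graphic anchor that links the W3C logo to the W3C home page. The image file name w3c_logo.gif and the ALT text are hypothetical.

```html
<A HREF="http://www.w3.org/"><IMG SRC="w3c_logo.gif" ALT="W3C home page"></A>
```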
In this example, when the user clicks the W3C logo, the browser jumps to the W3C home page.
Linking to a Named Anchor
When you link to another document, the browser shows information starting from the top of the linked document. This is fine, unless the document is long and the information you really want displayed isn't near the top. In this case, users have to scroll through the document to find the information you want them to see. An alternative to inflicting this on your users is to set up named anchors in longer documents and then have your hyperlink references point directly to the named anchors.
As an example, suppose you have a ten-part document stored in a single file longdoc.html and that each section has its own heading. You can set up named anchors on each of the headings using the <A NAME="anchor_name"> and </A> tags as follows:
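A sketch of the idea, with hypothetical anchor names (part1, part2) and heading text. The last two lines show how a link points at a named anchor by adding a pound sign (#) and the anchor name to the URL.

```html
<!-- Inside longdoc.html: name each section heading -->
<H2><A NAME="part1">Part 1</A></H2>
<H2><A NAME="part2">Part 2</A></H2>

<!-- From another document: jump straight to Part 2 -->
<A HREF="longdoc.html#part2">Go to Part 2</A>

<!-- From within longdoc.html itself, the file name can be omitted -->
<A HREF="#part1">Back to Part 1</A>
```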
Fig. 8.15 - Linking to named anchors takes users right to the information you want them to see.