LINFO

Markup Language Definition



A markup language is a set of tags and/or a set of rules for creating tags that can be embedded in digital text to provide additional information about the text in order to facilitate automated processing of it, including editing and formatting for display or printing.

Markup languages are fundamental to displaying documents in web browsers, and they are also employed, in one form or another, by word processing programs and by many other programs that display formatted text. However, such languages and their tags are typically hidden from the user.

By far the most familiar markup language to most people is, of course, HTML (hypertext markup language), which is used to allow documents to be displayed in web browsers. A newer and much more flexible (but also more difficult to learn and use) approach is to use languages based on XML (extensible markup language), which is a standard for creating languages that describe the content of documents rather than how they should be displayed.

Both HTML and XML are descendants of SGML (standard generalized markup language), which was published as a standard by the International Organization for Standardization (ISO) in 1986 to facilitate the sharing of machine-readable documents in large projects in government agencies, in the aerospace industry and in the legal field. SGML has also been used extensively in the printing and publishing industries. However, its complexity has prevented its widespread use for small-scale and general-purpose applications.

A tag is a special string (i.e., sequence of characters). In HTML, XML and related languages, every tag begins with a left angle bracket (<), contains one or more characters that name the tag, and ends with a right angle bracket (>). These brackets indicate to the browser or other program that renders the document (i.e., converts it to its final form to be viewed by users) that they, along with the enclosed characters, are instructions for the computer rather than ordinary text and that they are not to be visible in the rendered document.

Tags are used to indicate many aspects of the display of a document, including its layout (such as headings, paragraphs and margins), characteristics of the characters in the text (such as typeface, size, style, and whether they are subscripts or superscripts), the positioning of images, and the locations of (and other information about) hyperlinks (i.e., links to other documents or to other locations in the same document).

Most tags are designed to be used in pairs (consisting of a start tag and an end tag) and to enclose text within the pair. An example is the pair of HTML tags that is used to indicate bold text: <b> and </b>. Thus, for example, the phrase bold text in the previous sentence is tagged in the source code for this page as <b>bold text</b>. Other examples of commonly used tag pairs are <p> and </p> to indicate the start and end of a paragraph, and <i> and </i> to indicate that the enclosed text should be rendered in italics. The content of an HTML document as a whole is conventionally enclosed between the tags <html> and </html>. It can be seen that the end tag in every pair differs from its start tag by the inclusion of a forward slash immediately after the left angle bracket.
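
The way a program separates paired tags from the text they enclose can be sketched with the HTML parser in Python's standard library. This is only an illustration, not how any particular browser works; the class name TagLogger and the function tag_events are made up for the example.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Records start tags, end tags and text in the order they are met."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("text", data))


def tag_events(source):
    """Return the list of (kind, value) events found in an HTML fragment."""
    parser = TagLogger()
    parser.feed(source)
    parser.close()
    return parser.events


events = tag_events("<p>This is <b>bold text</b>.</p>")
for kind, value in events:
    print(kind, repr(value))
```

The tags appear only as start and end events, while the words between them arrive as text, which is the part a reader would actually see.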

Some tags are designed to be used individually because they do not enclose any content. An example is <br />, which stands for break and is used to indicate the start of a new line of text. Another HTML example is <hr />, which stands for horizontal rule and is used to create a horizontal line. In XHTML and other XML-based markup languages, even tags that are used singly must be closed, and this is accomplished by placing a forward slash after all other characters within the brackets; the space before the slash is a convention rather than a requirement [1].
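
The XML rule that every tag must be closed can be checked with an XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the helper name is_well_formed is an assumption made for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment):
    """Return True if the fragment parses as well-formed XML."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


# The self-closed form is accepted by an XML parser.
closed = is_well_formed("<p>line one<br />line two</p>")

# The unclosed form is rejected, because <br> is never matched by </br>.
unclosed = is_well_formed("<p>line one<br>line two</p>")

print(closed, unclosed)
```

An HTML parser, by contrast, knows that br never has content and accepts the unclosed form.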

The tag pair <h2> and </h2> is used in HTML to tell the browser that the enclosed text is a second-level heading, which browsers by default render in the second-largest heading type size. In contrast to this relatively rigid information about how to display the heading, an XML language could employ a descriptive tag pair such as <section-head> and </section-head> (XML tag names cannot contain spaces) so that the details of the type size, font and style can be easily modified by another program according to the particular application and desires of the author or user.

A major advantage of using markup languages that can describe content is that it becomes practical to manipulate the content automatically. For example, a tag pair such as <price> and </price> could be created so that every instance of a price in a document could easily be reformatted (e.g., set in a special typeface or in plain, bold or italic style), converted to another currency (e.g., from dollars to yen), increased or decreased by a certain percentage, or have sales tax added (for all prices or only for prices over a specified minimum).
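
As a rough sketch of such automatic manipulation, the following Python code multiplies every price in a small XML document by a given factor. The document, its element names other than price, the function name convert_prices and the rates are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog; only the <price> tag comes from the text above.
CATALOG = """<catalog>
  <item><name>Notebook</name><price>4.00</price></item>
  <item><name>Pen</name><price>1.50</price></item>
</catalog>"""


def convert_prices(xml_text, rate, minimum=0.0):
    """Multiply every <price> at or above `minimum` by `rate`."""
    root = ET.fromstring(xml_text)
    for price in root.iter("price"):
        amount = float(price.text)
        if amount >= minimum:
            price.text = f"{amount * rate:.2f}"
    return root


# Convert all prices using a made-up exchange rate of 150.
converted = convert_prices(CATALOG, rate=150.0)
prices = [p.text for p in converted.iter("price")]

# Add a made-up 10% tax only to prices of 2.00 or more.
taxed = convert_prices(CATALOG, rate=1.1, minimum=2.0)
taxed_prices = [p.text for p in taxed.iter("price")]

print(prices, taxed_prices)
```

Because the prices are identified by their tags, the program does not have to guess which numbers in the text are prices.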

Markup is also used to indicate special characters to display or print, including those that are not available on a standard keyboard. An example is the copyright symbol, whose markup is &#169; (a numeric character reference giving the character's code number). This shows that rendering instructions can also be provided by techniques other than tags enclosed in angle brackets.
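
The conversion that a rendering program performs on such a reference can be illustrated with the unescape function in Python's standard html module. The variable names are chosen for this example only.

```python
import html

# Decimal reference, hexadecimal reference and named reference
# for the same character.
from_decimal = html.unescape("&#169;")
from_hex = html.unescape("&#xA9;")
from_name = html.unescape("&copy;")

# 169 is the code number of the copyright symbol.
code_number = ord(from_decimal)

print(from_decimal, from_hex, from_name, code_number)
```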

The use of markup languages is becoming increasingly common, and numerous XML markup languages have been developed for specific types of applications. Although most describe text, the great versatility of XML also allows a much wider range of applications. For example, SVG (scalable vector graphics) allows complex two-dimensional images to be described completely by text. This makes it very easy to manipulate them, including increasing their size without loss of quality (in sharp contrast to conventional bitmap images).
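
To show what describing an image by text means in practice, the following sketch generates the SVG markup for a simple circle at two display sizes. The function name make_circle_svg and the particular shape and color are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def make_circle_svg(size):
    """Return SVG text for a circle drawn in a fixed 100x100 coordinate system."""
    svg = ET.Element("svg", {
        "xmlns": SVG_NS,
        "width": str(size),
        "height": str(size),
        "viewBox": "0 0 100 100",
    })
    # The shape itself is described in viewBox units, not in pixels.
    ET.SubElement(svg, "circle", {"cx": "50", "cy": "50", "r": "40", "fill": "navy"})
    return ET.tostring(svg, encoding="unicode")


small = make_circle_svg(100)
large = make_circle_svg(1000)

print(small)
print(large)
```

Only the width and height attributes differ between the two results. The description of the circle is unchanged, so a renderer can draw it sharply at either size.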

XHTML (extensible HTML) is a reformulation of HTML as an XML language. It was developed as the successor to HTML and as a transitional step towards making XML languages the standard for web pages in order to simplify browser design and improve the ability to find and manipulate data. However, it has not caught on as fast as had been hoped, and as of this writing there is talk of developing new versions of HTML.

The term markup is derived from the traditional publishing practice of marking up manuscripts by writing instructions for typefaces, fonts, sizes, styles, etc. for each section in the margins for the typesetters to use when manually setting the lead type. The specialists who did the markup were known as markup men.


________
[1] The closing forward slash is required in XHTML and other XML languages; the space before it was a convention for compatibility with older browsers. In HTML itself these tags can be written as <br> and <hr>. These unclosed forms are valid HTML rather than deprecated, and they are accepted by modern browsers.






Created December 13, 2006.
Copyright © 2006 The Linux Information Project. All Rights Reserved.