
ASCII: A Brief Introduction



ASCII, an acronym for American Standard Code for Information Interchange and pronounced ask-ee, is the de facto standard for the character encoding used by computers and communications equipment to represent text, and it (or some compatible extension of it) is used on most computers, including almost all personal computers and workstations. ASCII is one of the most successful software standards ever developed.

Character Encoding

Character encoding is a system that associates a set of natural language characters, typically an alphabet, with a set of something else, usually numbers or electrical pulses, in order to permit computers and other electronic equipment to efficiently process, store and communicate character-oriented information. Other examples of character encoding include Morse code (which encodes the letters of the alphabet and other characters as series of short and long depressions of a telegraph key), EBCDIC (which is used by IBM mainframe computers) and Unicode (which attempts to provide a unique encoding for every character used by the world's languages).
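
The basic idea can be illustrated with a few lines of Python, whose built-in ord() and chr() functions expose the character-to-number mapping directly (this is offered only as a sketch of the concept, not as part of any standard):

    # Each character is associated with a number that machines can store,
    # process and transmit. ord() returns a character's code; chr() reverses it.
    for character in "Hi!":
        print(character, "->", ord(character))    # H -> 72, i -> 105, ! -> 33
    print(chr(72) + chr(105) + chr(33))            # the reverse mapping: Hi!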

ASCII is based on the characters used to write the English language. A character is any letter, symbol or mark employed in writing or printing a written language (i.e., a language used by humans for which a writing system has been developed). The characters used to write the English language are the 26 upper case (i.e., capital) and the 26 lower case (i.e., small) letters of the alphabet, the Arabic numerals, punctuation marks and a variety of other symbols (e.g., the ampersand, the equals sign, the at symbol and the dollar sign).

An alphabet is the complete, ordered, standardized set of letters that is used to write a written language. A letter is a character in an alphabet that represents one or more phonemes (i.e., the fundamental sounds of a spoken language) and/or is used in combination with other letters to represent a phoneme. English and most other West European languages are written with some variation of the Roman alphabet, also referred to as the Latin alphabet, which was originally used more than two thousand years ago by the ancient Romans.

The commonly used term plain text usually refers to text that is written in only ASCII characters. (It should not be confused with the term plaintext, which is used by cryptographers to refer to text that is not encrypted or text that has been decrypted.) All of the markup used in HTML (hypertext markup language) documents consists of ASCII characters, although the web pages that such markup generates in browsers can, and frequently do, contain non-ASCII characters (as well as images). Likewise, the protocols traditionally used to transmit e-mail were designed around ASCII; images, word processor documents, spreadsheets and other attachments to e-mail messages are converted to an ASCII representation before transmission.

As is the case with other character encodings, ASCII is only a means of representing individual characters, and it does not provide any method for including information about the structure or appearance of text. That is accomplished using other standards, such as those specifying markup languages (e.g., HTML and XML) and style sheets (e.g., CSS).

Bytes and Bits

ASCII uses a single byte to represent each character. A byte, which is generally the smallest addressable unit of data on a computer, is a contiguous sequence of eight bits (i.e., zeros or ones). This means that one byte can represent any of 256 values (because eight bits allow 256 combinations of zeros and ones), ranging in binary notation from 0000 0000 to 1111 1111 (the spaces are added here to simplify reading).

Binary notation is a system for representing numbers that uses only two digits, i.e., zero and one. It is also referred to as base 2 (because there are only two numerals), in contrast to the decimal system which is called base 10 (because it contains ten numerals, i.e., zero through nine). Conventional computers use the binary system for their fundamental operations because their logic units and memory cells are capable of exhibiting only two states: i.e., off and on.
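
For readers who wish to verify the arithmetic, the following short Python fragment (provided here purely as an illustration) prints the number of values a byte can hold and the binary form of a few of them:

    # Eight bits allow 2**8 = 256 distinct combinations of zeros and ones.
    print(2 ** 8)                      # 256
    print(format(0, "08b"))            # 00000000, the smallest byte value
    print(format(255, "08b"))          # 11111111, the largest byte value
    print(format(ord("A"), "08b"))     # 01000001, the code that ASCII assigns to A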

The Codes

There are 128 standard character encodings in US-ASCII, the original and most basic version of ASCII. Each of these is a seven-bit binary number, ranging from 000 0000 to 111 1111 (or from 0000 0000 to 0111 1111 when written as a full byte). The eighth (i.e., left-most) bit was originally reserved for use as a parity bit, i.e., a bit that is used to check the accuracy of the other bits in the byte.

The first 32 ASCII codes (zero through 31 in decimal notation, or 0000 0000 through 0001 1111 in binary notation) are reserved for control characters. These are not actually printable characters; rather, they are codes that were originally intended to control devices (e.g., printers) that make use of ASCII. For example, Code 0 (0000 0000) represents the null character (which is used to terminate segments of text), Code 1 (0000 0001) represents the start of a heading, Code 2 (0000 0010) represents the start of text, Code 3 (0000 0011) represents the end of text, Code 4 (0000 0100) represents the end of transmission, Code 9 (0000 1001) represents the horizontal tab (produced by the tab key) and Code 27 (0001 1011) represents the escape character (produced by the escape key, which is located in the top left corner of most keyboards).
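
The control codes mentioned above can be listed with a short Python sketch (the abbreviations in the dictionary are the standard names for these codes; the program itself is only an illustration):

    # A few of the 32 control characters (codes 0 through 31) and their codes.
    controls = {
        0: "NUL (null)",
        1: "SOH (start of heading)",
        2: "STX (start of text)",
        3: "ETX (end of text)",
        4: "EOT (end of transmission)",
        9: "HT (horizontal tab)",
        27: "ESC (escape)",
    }
    for code, name in controls.items():
        print(f"{code:3d}  {code:08b}  {name}")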

Code 32 (0010 0000) represents the single space that is produced by the space bar (located in the center of the bottom row of a keyboard). Codes 33 through 126 represent the printable characters, that is, the 52 letters (i.e., 26 upper case and 26 lower case) of American English as well as the numerals, punctuation marks and several frequently used symbols.
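
The printable range is easy to reproduce from the codes themselves, as this small Python sketch shows:

    # Code 32 is the space; codes 33 through 126 are the visible characters.
    printable = "".join(chr(code) for code in range(32, 127))
    print(repr(printable))
    print(len(printable), "characters")    # 95: the space plus 94 visible characters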

The numerical value of the code for each lower case letter is 32 greater than for its upper case counterpart. Thus, for example, upper case B is 66 and lower case b is 98. In terms of binary numbers, the codes for upper and lower case letters differ in that the sixth place (from the right) is always zero for upper case letters and one for lower case letters (e.g., upper case B is 0100 0010 and lower case b is 0110 0010).
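
This relationship is simple to confirm; the following Python sketch prints the codes for B and b and then flips the relevant bit directly:

    # The difference of 32 between upper and lower case is a single bit
    # (value 32, the sixth bit from the right).
    print(ord("B"), format(ord("B"), "08b"))    # 66  01000010
    print(ord("b"), format(ord("b"), "08b"))    # 98  01100010
    print(chr(ord("B") ^ 0b00100000))           # b: toggling the sixth bit changes the case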

Code 127 (0111 1111) is another special character known as delete. Its original function was to erase a section of paper tape (a popular storage medium until the 1980s) by punching all possible holes at a particular character position.

Origins

Before the introduction of ASCII, it was difficult for computers to communicate with one another. Nearly every manufacturer had its own system for representing alphanumeric characters and control codes. In fact, there were more than 60 different ways to represent characters in computers, and IBM's equipment alone used nine different character sets.

Bob Bemer, a computer scientist at IBM during the late 1950s and early 1960s, played a major role in the development of ASCII, and he is widely regarded as the father of ASCII. Bemer foresaw the eventual integration of computers and communications, and he headed a group of programmers who were attempting to develop techniques by which computers could exchange data with each other via telephone lines, radio broadcasts and microwave transmissions.

Among Bemer's contributions were several characters and concepts that had not previously been used by computers, including the backslash, curly brackets and the escape sequence. According to Bemer, the escape sequence was "the most important piece of the ASCII puzzle." It was obvious from the start that 128 characters would not suffice for a global communications system capable of accommodating the multitude of character sets in use around the world; however, the seven-bit limit of the time made a larger character set impractical. The escape sequence represented a major breakthrough because it enabled computers to easily switch from one character set to another, and it has thus allowed more than 150 ASCII-based encoding sets to be defined and put into use.

ASCII was originally developed from telegraphic codes, and its initial commercial application was in teletypewriters (i.e., electromechanical typewriters that were widely used as communications terminals). It was first published as a standard in 1963 by the American Standards Association (ASA), which later became ANSI (American National Standards Institute). This version lacked lower case letters, but they were subsequently added and the standard was finalized in 1968 as ANSI X3.4. This seven-bit encoding is also referred to as US-ASCII, as it was designed solely for American English.

Although the ASCII standard was published in 1963, it did not gain general acceptance until 18 years later when IBM began using it in its first personal computers. Until that time, the only computer utilizing it had been the Univac 1050, which was introduced in 1964 (although Teletype Corporation immediately incorporated ASCII into all of its new teletype machines).

Extensions of Basic ASCII

The fact that US-ASCII had been designed exclusively with American English in mind presented a major problem. The only other languages that can comfortably be written with it are Hawaiian, Latin and Swahili. Moreover, even these four languages can be printed only without most of the typographic frills that give text a more professional appearance and make it easier to read, such as curly quotes, fractions, the copyright symbol and the trademark symbol.

As computer technology spread around the world, a large and growing demand emerged for versions of ASCII that were suitable for representing additional languages, both those with writing systems based on the Roman alphabet and those which employ other types of character sets. Most written languages based on the Roman alphabet require additional letters and symbols beyond the 95 printable characters in US-ASCII, including characters with diacritical marks (e.g., accents) and currency symbols. For example, the character å is required for Swedish and other Nordic languages and the character ß is needed for German. US-ASCII is not even sufficient for British English, which requires the pound symbol (£).
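
The limitation is easy to demonstrate in Python (shown here only as an illustration): attempting to encode such a character as US-ASCII fails, whereas an eight-bit extension such as ISO 8859-1 (discussed below) succeeds:

    # The pound symbol has no US-ASCII code, so encoding it as ASCII raises an error.
    try:
        "£".encode("ascii")
    except UnicodeEncodeError as error:
        print("ASCII cannot represent the pound symbol:", error)
    print("£".encode("iso8859-1"))    # b'\xa3' -- a single byte in the Latin 1 extension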

Consequently, a number of national variants were developed, and in 1972 US-ASCII, together with these variants, was adopted as international standard ISO 646 by the International Organization for Standardization (ISO). However, this created compatibility problems because the standard was still a seven-bit character set (with its 128 character limit), and thus some of the codes had to be reassigned for the language-specific variants. It was difficult for computers and communications equipment to know which character was represented by a code without knowing which variant was in use; consequently, text processing systems were generally able to cope with only a single variant at a time.

Eventually, improved technology eliminated the need to employ the eighth bit of each byte as a parity check, thus permitting all eight bits to be used to represent codes. This made it possible to keep the first 128 codes intact regardless of the language variant and add another 128 character codes, thereby allowing the development and use of extended character sets.

Numerous standards bodies and businesses took advantage of this added capacity to develop extended character sets not only for natural language characters but also for a variety of other symbols, such as those used in mathematics and line patterns for creating forms in documents. However, some of these extensions are not highly standardized, and they can lead to confusion, particularly if several are used on the same system.

ISO 8859

ISO 8859, formally referred to as ISO/IEC 8859, is a joint ISO and IEC (International Electrotechnical Commission) standard for eight-bit character encodings that came into widespread favor during the 1990s because of its ability to accommodate a large number of languages together with its backward compatibility with US-ASCII. The character sets were designed in the mid-1980s by the European Computer Manufacturers Association (ECMA) and were endorsed by the ISO.

ISO 8859 is divided into numbered, separately published parts beginning with ISO 8859-1. The first 128 characters in each of these parts are identical to those of US-ASCII, while the remaining 128 codes are assigned to characters used in other alphabetic languages.
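
A consequence is that a given byte in the upper range can stand for different characters depending on which part is in use, as the following Python sketch illustrates:

    # Codes 0-127 agree across all parts of ISO 8859; codes 128-255 do not.
    raw = bytes([65, 0xA3])            # 65 is ASCII "A"; 0xA3 is in the upper range
    print(raw.decode("iso8859-1"))     # A£  (Latin 1: the pound symbol)
    print(raw.decode("iso8859-2"))     # AŁ  (Latin 2: an L with a stroke)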

The ISO 8859-1 (also referred to as Latin 1) character set contains almost all of the characters necessary to write all major Western European languages. This is the default character set for the Internet, for Linux and for the X Window System (i.e., the underlying system for managing GUIs on Unix-like operating systems). The 256 characters support Afrikaans, Basque, Catalan, Danish, Dutch, English, Faeroese, Finnish, French, Galician, German, Icelandic, Irish, Italian, Norwegian, Portuguese, Spanish and Swedish.

Additional ISO 8859 character sets include 8859-2 (Latin 2, for Eastern Europe), 8859-3 (Latin 3, for Southeast Europe and miscellaneous languages including Maltese and Esperanto), 8859-4 (Latin 4, for Scandinavian and Baltic languages), 8859-5 (Cyrillic), 8859-6 (Arabic), 8859-7 (Greek), 8859-8 (Hebrew), 8859-9 (Latin 5, identical to 8859-1 except for including Turkish instead of Icelandic), 8859-10 (Latin 6, for Lappish, Nordic and Eskimo languages), 8859-11 (Thai), 8859-12 (undecided), 8859-13 (Latin 7, for Baltic languages), 8859-14 (Latin 8, for Celtic languages), 8859-15 (Latin 9, a revision of 8859-1) and 8859-16 (Latin 10, for South Eastern European languages).

ISO 8859-15, one of the newest members of the ISO 8859 family, was published by the ISO in March 1999 as an intended replacement for ISO 8859-1. It replaces eight less frequently used characters (including free-standing diacritical marks and fraction symbols) with the symbol for the euro (the new European currency that was introduced in the same year) and seven other letters that complete the coverage of French and Finnish.
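
The euro symbol illustrates the difference between the two character sets, as this small Python sketch shows:

    # The euro symbol has a code (0xA4) in ISO 8859-15 but no code at all in ISO 8859-1.
    print("€".encode("iso8859-15"))    # b'\xa4'
    try:
        "€".encode("iso8859-1")
    except UnicodeEncodeError as error:
        print("No euro symbol in Latin 1:", error)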

The ISO 8859 standard is designed for reliable information exchange, but not for typography. It omits symbols needed for high-quality printing such as optional ligatures (i.e., the printing of two or more letters as a single unit), curly quotation marks, dashes and many special symbols. As a result, high-quality typesetting systems often employ proprietary or idiosyncratic extensions together with the ASCII and ISO 8859 standards, or they use Unicode.

Major software developers addressed the need for additional characters beyond US-ASCII before the ISO 8859 standardization took place. As a result, although the first 128 characters on all computer operating systems are (almost) always the same (i.e., conforming to US-ASCII), the remaining 128 characters often differ, according to the developer or vendor.

For example, earlier versions of Microsoft Windows use a superset of Latin 1 that adds a number of characters, such as single and double curly quotes, the trademark symbol and an elongated dash (the so-called em dash). These are all useful characters, but they frequently cause compatibility problems with systems that expect standard Latin 1, such as HTML documents (which can result in incorrect rendering in web browsers). There are also compatibility problems with other operating systems.
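
One common symptom of these problems can be reproduced with a short Python sketch: the byte that the Windows superset (known as windows-1252) uses for a curly apostrophe corresponds to an unprintable control code in standard Latin 1:

    # Byte 0x92 is a curly right single quote in windows-1252 but a control code in Latin 1.
    byte = b"\x92"
    print(byte.decode("windows-1252"))      # the curly right single quote
    print(repr(byte.decode("iso8859-1")))   # '\x92' -- an unprintable control code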

Likewise, Mac OS, the original Macintosh operating system, used a proprietary ASCII extension called MacRoman. However, its successor, the Unix-based Mac OS X operating system, which was launched in 2001, employs Unicode rather than ASCII.

The Future

As the cost (in terms of computing resources) of using more than one byte to represent each character has dropped, programming languages (e.g., Java) and operating systems (e.g., Linux and Mac OS X) have added native (i.e., built-in) support for Unicode. As a result, ISO 8859 and other legacy encoding systems have been losing some of their popularity.

Yet, despite the development of many extensions of ASCII and the subsequent development of Unicode, the original seven-bit ASCII and ISO 8859-1 remain the most common character encodings in use today. And ASCII will undoubtedly continue to be an important standard for a long time to come because of this widespread use and because it remains firmly entrenched in operating systems, programming languages, data storage systems, display hardware, networking protocols and application programs.

Moreover, because Unicode achieves its ability to represent a vastly larger number of characters through the use of multiple bytes to encode each character, it typically requires additional software and hardware. Thus it has considerable potential to waste memory and disk space for applications that only need to accommodate a limited range of characters, i.e., most applications for the majority of countries that have alphabetic writing systems.
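
The storage cost depends on which Unicode encoding form is used; the following Python comparison (offered only as a rough illustration) shows the size in bytes of the same ten-character string of plain English text:

    # ASCII-only text: one byte per character in ASCII and in the variable-width UTF-8
    # form of Unicode, but two bytes per character in the fixed-width UTF-16 form.
    text = "plain text"
    print(len(text.encode("ascii")))     # 10 bytes
    print(len(text.encode("utf-8")))     # 10 bytes
    print(len(text.encode("utf-16")))    # 22 bytes (two per character plus a two-byte marker)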

Finally, regardless of the extent to which Unicode eventually comes to dominate character encoding, ASCII will survive in the sense that it is embedded in Unicode: that is, Unicode's first 128 code points map to the same characters as US-ASCII, and its first 256 code points map to the same characters as ISO 8859-1.
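
This embedding can be verified character by character with one final Python sketch:

    # Unicode code points 0-127 coincide with US-ASCII, and 0-255 with ISO 8859-1.
    print(all(chr(i).encode("ascii")[0] == i for i in range(128)))        # True
    print(all(chr(i).encode("iso8859-1")[0] == i for i in range(256)))    # True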






Created November 8, 2004. Updated May 26, 2006.
Copyright © 2004 - 2006 The Linux Information Project. All Rights Reserved.