LINFO

Data Compression Definition



Data compression, or just compression, is the process of encoding information using fewer bits. Data decompression, or just decompression, is the process of restoring compressed data back into a form in which it is again useful.

A bit is the most basic unit of information in computing and communications, and every bit has a value of either zero or one. Compression is possible because most uncompressed data is partially redundant; that is, the same information can be stored using fewer bits.

For example, a text file could easily be compressed by using abbreviations for frequently used phrases and long words, such as replacing the word information by an abbreviation such as info. Or better yet, each repeated word or phrase could be replaced by a pair of bytes that does not represent any letters but which refers to a dictionary that the compression program creates of compressed words and phrases. Two consecutive bytes (i.e., 16-bits) can store any of 65,536 (i.e., 216) possible values, and thus they would be sufficient to allow the encoding of roughly that number of different words and phrases.

In the case of image files, there are frequently long sequences of pixels (i.e., the tiny dots that are used to store and display images) with the same color. Thus, for example, a sequence of 500 white pixels could be represented by a bit encoding that stands for white pixel times 500 or by a further simplified form such as white x 500, which could be expressed using about 36 bytes in the case that 32-bit color (i.e., using 32 bits to represent each color) was used. This represents a dramatic reduction from the 16,000 bytes that would be required without compression, that is, if each of the white pixels were represented by 32 bits. This is referred to as run-length encoding. The longer the sequence, the greater the amount of compression that is possible.

Compression is important because it can reduce the storage space (e.g., space on hard disk drives) required for storing data and the amount of bandwidth required for transmitting it over networks. Although there is a cost to compression and decompression in terms of requiring increased processing power, this cost has been greatly reduced by the rapid improvements in the speed of processors (i.e., central processing units or CPUs) in recent years.

Numerous data compression/decompression algorithms have been developed, some of which are general purpose and others of which are for specialized applications, most commonly data, images and sound. (An algorithm is a set of precise, unambiguous rules that specify how to solve some problem or perform some task.) Some, such as run-length encoding, are lossless, that is, there is no data lost during compression and thus the original data can be reconstructed exactly as it was prior to compression.

Others are lossy, that is, they accept some loss of data in order to achieve higher compression ratios than are possible with lossless compression. Lossy compression is particularly useful for sound and image files because such files typically contain extra data whose absence would not be noticed by most people. The JPEG (joint photographic experts group) image format, which is commonly used for photographs, employs a lossy compression algorithm, whereas the also widely used GIF (graphics interchange format) image format uses a lossless algorithm.

In some situations it is most efficient to compress only the differences between two versions of a set of data. Referred to as delta compression, this is particularly useful for reducing the byte size of video frames. It works because any frame can be fully described by only the previous frame and the differences from it.

One of the best known data compression algorithms is LZW, a lossless method created by Abraham Lempel, Jacob Ziv and Terry Welch and published in 1984. LZW provided a better compression ratio in most applications than any well-known method available up to that time; for example, it could typically compress large English language text files to roughly half of their original sizes. It thus became the first widely used data compression method on computers and was incorporated into the GIF image format. However, LZW subsequently fell out of favor both because of claims that parts of it were covered by patents, accompanied by attempts to enforce them1, and because better compression techniques were subsequently developed.


________
1The patent controversy surrounding the LZW algorithm led to the development of the PNG (portable network graphics) image format as a substitute for GIFs and to the development of alternative compression programs for Linux and other Unix-like operating systems such as gzip and bzip2.






Created April 3, 2006.
Copyright © 2006 The Linux Information Project. All Rights Reserved.