Understanding Unicode

Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number to each one. Before Unicode was invented, there were hundreds of different encoding systems for assigning these numbers. No single encoding could contain enough characters: the European Union alone, for example, required several different encodings to cover all its languages. Even for a single language like English, no single encoding was adequate for all the letters, punctuation, and technical symbols in common use.
These encoding systems also conflict with one another. That is, two encodings can use the same number for two different characters, or use different numbers for the same character. Any given computer, especially a server, needs to support many different encodings; yet whenever data is passed between different encodings or platforms, that data always runs the risk of corruption.
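
The conflict is easy to demonstrate. The short Python 3 sketch below (illustrative only) decodes the same single byte, 0xE9, under two legacy encodings and gets two different characters:

    # One byte, two legacy encodings, two different characters.
    data = bytes([0xE9])

    print(data.decode("latin-1"))  # 'é'  (Latin-1 / ISO 8859-1, Western European)
    print(data.decode("koi8-r"))   # 'И'  (KOI8-R, a Russian encoding)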

What is Unicode?

Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.

Difference between ASCII & Unicode

Unicode is a universal character encoding standard. It defines the way individual characters are represented in text files, web pages, and other types of documents.
Unlike ASCII, which was designed to represent only basic English characters, Unicode was designed to support characters from all languages around the world. The standard ASCII character set supports only 128 characters, while Unicode defines space for 1,114,112 code points (roughly 1.1 million possible characters). ASCII uses a single byte (in fact only 7 bits) for each character, whereas Unicode encodings such as UTF-8 use up to 4 bytes per character.
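
As a quick illustration, the Python 3 sketch below (a minimal example, not part of either standard) shows that the ASCII range ends at code point 127, while Unicode code points extend up to 0x10FFFF:

    # ASCII covers code points 0-127; Unicode covers 0 through 0x10FFFF.
    print(ord("A"))    # 65      -- fits in ASCII
    print(ord("€"))    # 8364    -- beyond ASCII, a valid Unicode code point
    print(ord("😀"))   # 128512  -- far beyond the ASCII range

    # Forcing a non-ASCII character into ASCII fails outright:
    try:
        "€".encode("ascii")
    except UnicodeEncodeError as err:
        print(err)  # 'ascii' codec can't encode character '\u20ac' ...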
There are several different Unicode encodings, of which UTF-8 and UTF-16 are the most common. UTF-8 has become the standard character encoding used on the Web and is also the default encoding in many software programs. Although UTF-8 supports up to four bytes per character, it would be inefficient to spend four bytes on frequently used characters. UTF-8 therefore uses only one byte for common English (ASCII) characters. Accented European (Latin) characters, as well as Hebrew and Arabic, are represented with two bytes, while three bytes are used for Chinese, Japanese, Korean, and other Asian characters. Additional Unicode characters, such as emoji, require four bytes.
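
These byte counts can be checked directly; the Python 3 sketch below (the character choices are purely illustrative) prints the UTF-8 byte length of one character from each group:

    # UTF-8 byte lengths for characters from different scripts.
    samples = [
        ("A", "basic English letter (ASCII)"),
        ("é", "accented Latin letter"),
        ("א", "Hebrew letter"),
        ("中", "Chinese character"),
        ("😀", "emoji, outside the Basic Multilingual Plane"),
    ]

    for ch, description in samples:
        encoded = ch.encode("utf-8")
        # bytes.hex(sep) requires Python 3.8+
        print(f"{ch}  {description}: {len(encoded)} byte(s) -> {encoded.hex(' ')}")

Running it prints 1, 2, 2, 3, and 4 bytes respectively, matching the groups described above.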