Unicode is a universal character encoding standard that provides a comprehensive character set. It was created by the Unicode Consortium (a group of multilingual software manufacturers). Unicode simplifies software localization and improves multilingual text processing. It overcomes the main difficulty inherent in ASCII and extended ASCII, which can represent only 128 and 256 characters respectively.
Unicode standardizes script behavior, which allows any combination of characters, drawn from any combination of scripts and languages, to coexist in a single document. Unicode defines multiple encodings of its single character set: UTF-7, UTF-8, UTF-16, and UTF-32. Conversion of data among these encodings is lossless.
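The lossless conversion can be checked with a short round trip. This is a minimal sketch, assuming Python's built-in codecs; the helper name `round_trips` and the sample string are illustrative choices, not part of the standard.

```python
def round_trips(text, codecs=("utf-7", "utf-8", "utf-16", "utf-32")):
    """Encode and decode `text` with each codec; report whether it survives unchanged."""
    results = {}
    for codec in codecs:
        encoded = text.encode(codec)      # str -> bytes in this encoding
        decoded = encoded.decode(codec)   # bytes -> str again
        results[codec] = (decoded == text)
    return results

# Hypothetical sample: Latin A, e with acute, Devanagari KA, an Ancient Greek numeral.
sample = "A\u00e9\u0915\U00010140"
for codec, ok in round_trips(sample).items():
    print(f"{codec:7} round trip ok: {ok}")
```

Each encoding produces a different byte sequence, but decoding it returns the same sequence of characters.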
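Unicode was originally a 2-byte (16-bit) character set. Its code space was later extended beyond 16 bits (the first characters outside the 16-bit range were assigned in Unicode 3.1), so a code point is now commonly handled as a 4-byte value. Unicode is fully compatible with ASCII: its first 128 code points are the ASCII characters, and its first 256 code points match ISO 8859-1 (Latin-1), a common extended ASCII set.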
UTF-8, UTF-16, and UTF-32 all support encoding the same set of characters:
- UTF-8 uses anywhere from 1 to 4 bytes per character, depending on the character: ASCII characters take only 1 byte, while the most unusual characters take 4.
- UTF-16 uses 2 bytes for most characters, while very unusual characters (those outside the basic multilingual plane) take 4.
- UTF-32 uses 4 bytes per character, so the number of characters in a UTF-32 string can be calculated from the byte count alone by dividing it by 4.
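The sizes listed above can be observed directly. The following is a rough sketch using Python's built-in codecs; the function names `encoded_sizes` and `count_code_points_utf32` are hypothetical helpers introduced for illustration. The little-endian codec variants are used so that no byte order mark is added to the counts.

```python
def encoded_sizes(ch):
    """Return the byte length of one character in UTF-8, UTF-16 and UTF-32."""
    # The "-le" codecs fix the byte order and therefore emit no byte order mark.
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }

def count_code_points_utf32(data):
    """Count the characters in BOM-less UTF-32 data from its byte length alone."""
    if len(data) % 4 != 0:
        raise ValueError("UTF-32 data must be a multiple of 4 bytes long")
    return len(data) // 4

for ch in ("A", "\u00e9", "\u0915", "\U00010140"):
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))

sample = "A\u0915\U00010140"
print(count_code_points_utf32(sample.encode("utf-32-le")))  # 3
```

The same shortcut does not work for UTF-8 or UTF-16, because their characters vary in length and the data has to be scanned.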
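The notation uses hexadecimal digits in the format U-XXXXXXXX, where each X is a hexadecimal digit.
In this 32-bit notation, the numbering goes from U-00000000 to U-FFFFFFFF. Unicode divides the available code space into planes. A plane is a contiguous group of 65,536 code points. The most significant 16 bits define the plane and the least significant 16 bits define the position within it, so 16 bits could in principle address 65,536 planes, each holding up to 65,536 characters or symbols. The Unicode standard itself uses only 17 planes, 0000 to 0010, so valid code points run from U+0000 to U+10FFFF.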
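The split of a code point into a plane and a position inside that plane can be sketched with two bit operations. This is an illustrative sketch only; `split_code_point` and `format_32bit` are hypothetical names, and the code simply treats the code point as a 32-bit number as described above.

```python
def split_code_point(cp):
    """Split a 32-bit code point into (plane, offset within the plane)."""
    if not 0 <= cp <= 0xFFFFFFFF:
        raise ValueError("code point must fit in 32 bits")
    plane = cp >> 16        # most significant 16 bits
    offset = cp & 0xFFFF    # least significant 16 bits
    return plane, offset

def format_32bit(cp):
    """Write a code point in the 8-digit hexadecimal form U-XXXXXXXX."""
    return f"U-{cp:08X}"

for cp in (0x0915, 0x10140, 0x2A6DF, 0xE0001):
    plane, offset = split_code_point(cp)
    print(format_32bit(cp), f"plane {plane:04X}, offset {offset:04X}")
```

For example, the code point 10140 (hexadecimal) falls in plane 0001 at offset 0140.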
Types of planes –
- Basic multilingual plane (BMP) – Plane 0000, the basic multilingual plane, is designed to be compatible with the previous 16-bit Unicode. The most significant 16 bits in this plane are all zeroes. It mostly defines the character sets of different languages, along with some control and special characters. A code point in this plane is written as U+XXXX, where XXXX are the least significant 16 bits, e.g., U+0900 to U+097F is reserved for Devanagari, U+0980 to U+09FF for Bengali, and U+2200 to U+22FF for mathematical operators.
- Supplementary multilingual plane (SMP) – Plane 0001, the supplementary multilingual plane, is designed to provide more codes for those multilingual characters that are not included in the BMP, e.g., U+10140 to U+1018F is reserved for Ancient Greek Numbers.
- Supplementary ideographic plane (SIP) – Plane 0002, the supplementary ideographic plane, is designed to provide codes for ideographic symbols, that is, symbols that convey an idea rather than a sound, e.g., U+20000 to U+2A6DF is reserved for CJK Unified Ideographs Extension B.
- Supplementary special plane (SSP) – Plane 000E, the supplementary special plane, is used for special characters, e.g., U+E0000 to U+E007F is reserved for tags.
- Private use planes (PUPs) – Planes 000F and 0010, the private use planes, are reserved for private use, so the meaning of their code points is left to private agreement. They can be used, for example, by fonts internally to refer to auxiliary glyphs.
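The plane list above can be turned into a small lookup. This is a minimal sketch, not an official API: the table `PLANE_ABBREVIATIONS` and the function `plane_of` are hypothetical names, and planes that the list does not mention are simply reported as unlisted.

```python
# Plane number -> abbreviation, taken from the list above.
PLANE_ABBREVIATIONS = {
    0x00: "BMP",
    0x01: "SMP",
    0x02: "SIP",
    0x0E: "SSP",
    0x0F: "PUP",
    0x10: "PUP",
}

def plane_of(ch):
    """Return the abbreviation of the plane that a single character belongs to."""
    plane = ord(ch) >> 16  # the plane number is the code point without its low 16 bits
    return PLANE_ABBREVIATIONS.get(plane, "unlisted")

for ch in ("\u0915", "\U00010140", "\U00020000", "\U000E0001", "\U000F0000"):
    print(f"U+{ord(ch):04X} -> {plane_of(ch)}")
```

A dictionary keeps the mapping readable and makes it easy to see which planes the list leaves out.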