
Unicode is a universal character encoding standard that Java uses to represent text. Because every Java runtime represents characters the same way, programs can handle text from many languages and scripts consistently across platforms, which makes Java well suited to global applications.


What is Unicode?

Unicode assigns a unique number, called a code point, to every character it covers, including letters, digits, symbols, and special characters. As of Unicode 13.0 it defined over 143,000 characters across more than 150 modern and historic scripts, and later versions continue to add more.

  • Code Point Range: Unicode code points range from U+0000 to U+10FFFF.
  • Examples:
    • 'A': U+0041
    • 'अ': U+0905
    • '你': U+4F60
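
The code points listed above can be checked from Java itself. The following is a minimal sketch, not part of the original tutorial: the class name CodePointLookup and the helper toUPlus are illustrative, and the sample characters are written as escapes so the file compiles regardless of source encoding.

```java
public class CodePointLookup {
    // Formats a code point in the conventional U+XXXX notation.
    public static String toUPlus(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        // LATIN CAPITAL LETTER A, DEVANAGARI LETTER A, and the CJK ideograph U+4F60
        String[] samples = {"A", "\u0905", "\u4F60"};
        for (String s : samples) {
            // codePointAt(0) returns the code point that starts at index 0.
            System.out.println(toUPlus(s.codePointAt(0)));
        }
        // prints U+0041, U+0905, U+4F60 on separate lines
    }
}
```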

Why Unicode in Java?

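Before Unicode, different systems used different encoding standards, such as ASCII or ISO-8859. These encodings could represent only a small set of characters (128 for ASCII, at most 256 for each ISO-8859 variant), so text in other scripts needed separate, mutually incompatible encodings. Unicode solves this problem by providing a single, consistent character set for all scripts.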


Characteristics of Unicode in Java

  1. Default Character Encoding:

    • Java uses Unicode to represent the char data type and String objects.
    • Each char is 16 bits (2 bytes) and holds one UTF-16 code unit.
  2. Wide Character Support:

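    • Java can handle characters from many languages, as well as symbols and emojis. Characters outside the Basic Multilingual Plane, including most emojis, take two char values.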
  3. Compatibility:

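    • Because char and String use the same Unicode representation on every platform, in-memory text behaves consistently everywhere. Reading and writing text still depends on the charset used for input and output.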
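
The 16-bit size of char can be confirmed with constants from java.lang.Character. This is a small sketch, assuming Java 8 or later; the class name CharWidthDemo and its helper methods are made up for illustration.

```java
public class CharWidthDemo {
    // Number of bits in a Java char.
    public static int bitsPerChar() {
        return Character.SIZE;
    }

    // True when the code point lies in the Basic Multilingual Plane,
    // meaning a single char can hold it.
    public static boolean fitsInOneChar(int codePoint) {
        return Character.isBmpCodePoint(codePoint);
    }

    public static void main(String[] args) {
        System.out.println("Bits per char: " + bitsPerChar());               // 16
        System.out.println("Largest char value: " + (int) Character.MAX_VALUE);  // 65535
        System.out.println("Largest code point: " + Character.MAX_CODE_POINT);   // 1114111
        System.out.println("U+0905 fits in one char: " + fitsInOneChar(0x0905));   // true
        System.out.println("U+1F600 fits in one char: " + fitsInOneChar(0x1F600)); // false
    }
}
```

The gap between the largest char value and the largest code point is why some characters need two char values.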

How Java Implements Unicode

  1. Using the char Data Type:

    • The char type in Java is a 16-bit UTF-16 code unit. A single char can hold any character in the Basic Multilingual Plane (U+0000 to U+FFFF).
    • Example:

      public class UnicodeExample {
          public static void main(String[] args) {
              char letter = 'A';     // Unicode: U+0041
              char hindiChar = 'अ';  // Unicode: U+0905
              System.out.println("Letter: " + letter);
              System.out.println("Hindi Character: " + hindiChar);
          }
      }
  2. Using Unicode Escapes:

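    • Unicode characters can also be written as escape sequences of the form \uXXXX, where XXXX is the four-digit hexadecimal value of a UTF-16 code unit. For characters in the Basic Multilingual Plane, this value is the same as the code point.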
    • Example:

      public class UnicodeEscapeExample {
          public static void main(String[] args) {
              char letter = '\u0041';  // Unicode for 'A'
              char smiley = '\u263A';  // Unicode for ☺
              System.out.println("Letter: " + letter);
              System.out.println("Smiley: " + smiley);
          }
      }

Unicode Encoding Schemes

Java primarily uses the UTF-16 encoding scheme, which:

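  1. Encodes characters in the Basic Multilingual Plane (U+0000 to U+FFFF), which covers most commonly used characters, in a single 16-bit code unit.
  2. Encodes supplementary characters (U+10000 to U+10FFFF) in 4 bytes, as a surrogate pair of two code units.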
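
To make the surrogate pair arithmetic concrete, here is a hedged sketch that splits a supplementary code point by hand and compares the result with the standard Character.toChars method. The class name SurrogatePairDemo and the helper toSurrogates are illustrative only; in real code Character.toChars is the usual choice.

```java
public class SurrogatePairDemo {
    // Manually splits a supplementary code point into its UTF-16 surrogate pair.
    public static char[] toSurrogates(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;                // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));    // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));   // bottom 10 bits
        return new char[] {high, low};
    }

    public static void main(String[] args) {
        int grinningFace = 0x1F600;
        char[] manual = toSurrogates(grinningFace);
        char[] library = Character.toChars(grinningFace);
        System.out.printf("manual:  %04X %04X%n", (int) manual[0], (int) manual[1]);   // D83D DE00
        System.out.printf("library: %04X %04X%n", (int) library[0], (int) library[1]); // D83D DE00
        // A BMP code point needs one char, a supplementary one needs two.
        System.out.println(Character.charCount(0x0041));   // 1
        System.out.println(Character.charCount(0x1F600));  // 2
    }
}
```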

Other Unicode encoding schemes include:

  • UTF-8: Variable-length encoding (1–4 bytes) and backward compatible with ASCII.
  • UTF-32: Fixed-length encoding (4 bytes for all characters).
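
The size differences between these schemes can be measured by encoding the same text in each one. The sketch below is an illustration under a few assumptions: EncodingSizeDemo and its methods are hypothetical names, UTF-16BE is used so that no byte order mark is counted, and the UTF-32 size is computed as 4 bytes per code point because StandardCharsets does not define a UTF-32 constant.

```java
import java.nio.charset.StandardCharsets;

public class EncodingSizeDemo {
    public static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Big-endian UTF-16 without a byte order mark.
    public static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    // UTF-32 always uses 4 bytes per code point, so the size is computed directly.
    public static int utf32Length(String s) {
        return 4 * s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+0041, U+0905, U+4F60, and U+1F600 (written as a surrogate pair)
        String[] samples = {"A", "\u0905", "\u4F60", "\uD83D\uDE00"};
        for (String s : samples) {
            System.out.printf("U+%04X  UTF-8: %d  UTF-16: %d  UTF-32: %d%n",
                    s.codePointAt(0), utf8Length(s), utf16Length(s), utf32Length(s));
        }
        // UTF-8 sizes: 1, 3, 3, 4; UTF-16 sizes: 2, 2, 2, 4; UTF-32 sizes: 4, 4, 4, 4
    }
}
```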

Advantages of Unicode in Java

  1. Global Language Support:

    • Enables applications to support multiple languages in a single program.
    • Example: A Java application can process text in English, Hindi, Chinese, and Arabic simultaneously.
  2. Platform Independence:

    • Java’s Unicode support ensures consistent character representation across different platforms.
  3. Ease of Use:

    • Built-in support for Unicode in char and String simplifies handling characters.

Examples of Unicode Usage

Example 1: Printing Unicode Characters


public class UnicodePrintExample {
    public static void main(String[] args) {
        System.out.println("English: Hello");
        System.out.println("Hindi: \u0928\u092E\u0938\u094D\u0924\u0947");
        System.out.println("Chinese: \u4F60\u597D");
        System.out.println("Smiley: \u263A");
    }
}

Output:


English: Hello
Hindi: नमस्ते
Chinese: 你好
Smiley: ☺
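
Whether this output displays correctly depends on the encoding that System.out and the terminal use. If the characters appear as question marks, one possible workaround is to wrap System.out in a PrintStream with an explicit UTF-8 encoding. This is a sketch under the assumption that the terminal itself is set to UTF-8; the class name Utf8ConsoleDemo and the helper utf8Stream are illustrative.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8ConsoleDemo {
    // Wraps a stream in a PrintStream that encodes text as UTF-8 and auto-flushes.
    public static PrintStream utf8Stream(OutputStream target) {
        try {
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PrintStream out = utf8Stream(System.out);
        // Display assumes the terminal is configured for UTF-8.
        out.println("Chinese: \u4F60\u597D");
        out.println("Smiley: \u263A");
    }
}
```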

Example 2: Supplementary Characters


public class SupplementaryExample {
    public static void main(String[] args) {
        // U+1F600 (😀) lies outside the BMP, so it is written as a surrogate pair
        String emoji = "\uD83D\uDE00";
        System.out.println("Emoji: " + emoji);
    }
}
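
Because a supplementary character occupies two char values, String.length() counts code units rather than characters. The following sketch shows the difference and iterates by code point instead; it assumes Java 8 or later for codePoints(), and the class name CodePointIteration and helper countCodePoints are illustrative.

```java
public class CodePointIteration {
    // Counts Unicode code points rather than UTF-16 code units.
    public static int countCodePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // "Hi " followed by U+1F600, written as a surrogate pair
        String text = "Hi \uD83D\uDE00";
        System.out.println("length(): " + text.length());               // 5
        System.out.println("code points: " + countCodePoints(text));    // 4
        // codePoints() yields one int per character, joining surrogate pairs.
        text.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
        // prints U+0048, U+0069, U+0020, U+1F600 on separate lines
    }
}
```

Code that indexes with charAt can land in the middle of a surrogate pair, so iterating by code point is the safer default when text may contain emojis or other supplementary characters.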

Key Points

  1. UTF-16 Encoding: Java’s char and String are based on UTF-16.
  2. Unicode Escapes: Use the \uXXXX format for specifying characters in code.
  3. Global Compatibility: Unicode allows Java applications to handle text in any language.
  4. Memory Efficiency: UTF-16 is a variable-length encoding. Characters in the Basic Multilingual Plane take 2 bytes, and supplementary characters take 4 bytes.
