The Unicode System is a universal character encoding standard that Java uses to represent characters. It lets Java programs handle characters from many languages and scripts in the same way on every platform, which makes them suitable for global applications.
What is Unicode?
Unicode assigns a unique number (called a code point) to every character it covers, including letters, digits, symbols, and special characters. As of Unicode 13.0, the standard defines over 143,000 characters across more than 150 modern and historic scripts, and later versions add more.
- Code Point Range: Unicode code points range from U+0000 to U+10FFFF.
- Examples:
- 'A': U+0041
- 'अ': U+0905
- '你': U+4F60
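The code points listed above can be checked from Java itself. The following is a minimal sketch; the class name CodePointDemo and the helper toUPlus are made up for illustration, while codePointAt, Character.MIN_CODE_POINT, and Character.MAX_CODE_POINT are standard library members.

```java
class CodePointDemo {
    // Hypothetical helper: formats a code point in the conventional U+XXXX notation.
    static String toUPlus(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        // The non-ASCII samples are written as escapes so the file compiles
        // regardless of the source file's encoding.
        String[] samples = {"A", "\u0905", "\u4F60"};
        for (String s : samples) {
            // codePointAt(0) returns the code point that starts at index 0.
            System.out.println(toUPlus(s.codePointAt(0)));
        }
        // The lower and upper bounds of the Unicode code point range.
        System.out.println(toUPlus(Character.MIN_CODE_POINT)); // U+0000
        System.out.println(toUPlus(Character.MAX_CODE_POINT)); // U+10FFFF
    }
}
```

The loop prints U+0041, U+0905, and U+4F60, matching the examples in the list.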
Why Unicode in Java?
Before Unicode, different systems used different encoding standards, such as ASCII or ISO-8859. These systems had limitations, particularly when representing non-English characters. Unicode solves these problems by providing a consistent encoding system.
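To make the limitation concrete, the sketch below encodes a Hindi word with two legacy charsets and with UTF-8, then decodes it again. The class name LegacyEncodingDemo and the helper roundTrip are illustrative only. String.getBytes(Charset) substitutes the charset's replacement byte (a question mark for these charsets) for any character it cannot represent, so the original text is lost.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class LegacyEncodingDemo {
    // Hypothetical helper: encodes text with the given charset, then decodes it again.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // The Hindi word "namaste", written with escapes to avoid source-encoding issues.
        String hindi = "\u0928\u092E\u0938\u094D\u0924\u0947";

        // ASCII and ISO-8859-1 cannot represent Devanagari, so every char becomes '?'.
        System.out.println(roundTrip(hindi, StandardCharsets.US_ASCII));   // ??????
        System.out.println(roundTrip(hindi, StandardCharsets.ISO_8859_1)); // ??????

        // A Unicode encoding such as UTF-8 preserves the text exactly.
        System.out.println(roundTrip(hindi, StandardCharsets.UTF_8).equals(hindi)); // true
    }
}
```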
Characteristics of Unicode in Java
- Default Character Encoding:
- Java uses Unicode to represent the char data type and String objects.
- Each char is 16 bits (2 bytes) in Java, based on the UTF-16 encoding scheme.
- Wide Character Support:
- Java can handle characters from multiple languages, symbols, and emojis.
- Compatibility:
- Because char and String have the same Unicode-based representation on every platform, a Java program handles text consistently wherever it runs.
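The size of char described above can be confirmed with constants from java.lang.Character. This is a small sketch; the class name CharSizeDemo and the helper nextChar are invented for the example.

```java
class CharSizeDemo {
    // Hypothetical helper: returns the char whose numeric value is one higher.
    static char nextChar(char c) {
        return (char) (c + 1);
    }

    public static void main(String[] args) {
        System.out.println(Character.SIZE);            // 16 (bits per char)
        System.out.println(Character.BYTES);           // 2 (bytes per char)
        System.out.println((int) Character.MAX_VALUE); // 65535, the largest char value

        // A char is an unsigned 16-bit number, so it converts to int without a cast.
        int numeric = 'A';
        System.out.println(numeric);       // 65
        System.out.println(nextChar('A')); // B
    }
}
```

Because a char holds only 16 bits, a single char cannot store code points above U+FFFF; those need two char values.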
How Java Implements Unicode
- Using the char Data Type:
- The char type in Java is a 16-bit value that holds one UTF-16 code unit.
- Example:
public class UnicodeExample
{
    public static void main(String[] args)
    {
        char letter = 'A'; // Unicode: U+0041
        char hindiChar = 'अ'; // Unicode: U+0905
        System.out.println("Letter: " + letter);
        System.out.println("Hindi Character: " + hindiChar);
    }
}
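- Using Unicode Escapes:
- Unicode characters can also be written as escape sequences of the form \uXXXX, where XXXX is four hexadecimal digits giving a UTF-16 code unit. For characters in the Basic Multilingual Plane, this value is the code point itself.
- Example: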
public class UnicodeEscapeExample
{
    public static void main(String[] args)
    {
        char letter = '\u0041'; // Unicode for 'A'
        char smiley = '\u263A'; // Unicode for ☺
        System.out.println("Letter: " + letter);
        System.out.println("Smiley: " + smiley);
    }
}
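Going the other way, a program can produce the escape form for any string. The sketch below is one possible approach; the class name EscapeDemo and the helper toEscapes are assumptions made for illustration. It emits one escape per UTF-16 code unit, which mirrors how the escapes are written in source code.

```java
class EscapeDemo {
    // Hypothetical helper: builds the backslash-u escape for each UTF-16 code unit.
    static String toEscapes(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            // The cast to int is required because %X does not accept a char argument.
            sb.append(String.format("\\u%04X", (int) text.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints the escapes for U+0041 followed by U+263A.
        System.out.println(toEscapes("A\u263A"));
    }
}
```

One caveat: the compiler translates Unicode escapes before it does anything else, including inside comments, so a malformed escape in a comment is still a compile error.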
Unicode Encoding Schemes
Java primarily uses the UTF-16 encoding scheme, which:
- Encodes most common characters (Basic Multilingual Plane) in 16 bits.
- Encodes supplementary characters (code points above U+FFFF) using 4 bytes: two 16-bit code units known as a surrogate pair.
Other Unicode encoding schemes include:
- UTF-8: Variable-length encoding (1–4 bytes) and backward compatible with ASCII.
- UTF-32: Fixed-length encoding (4 bytes for all characters).
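The size differences between the schemes can be measured directly. In this sketch the class name EncodingSizeDemo and its helper methods are invented for the example. UTF-16BE is used instead of UTF-16 so that no byte order mark is added to the count, and the UTF-32 size is computed as 4 bytes per code point rather than by actually encoding, since a UTF-32 charset is not among the constants in StandardCharsets.

```java
import java.nio.charset.StandardCharsets;

class EncodingSizeDemo {
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int utf16Bytes(String s) {
        // UTF_16BE encodes without a byte order mark, so only the text is counted.
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    static int utf32Bytes(String s) {
        // Computed, not encoded: UTF-32 always uses 4 bytes per code point.
        return 4 * s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+0041 (ASCII), U+0905 (BMP), U+1F600 (supplementary, written as a surrogate pair)
        String[] samples = {"A", "\u0905", "\uD83D\uDE00"};
        for (String s : samples) {
            System.out.println(utf8Bytes(s) + " " + utf16Bytes(s) + " " + utf32Bytes(s));
        }
        // Prints:
        // 1 2 4
        // 3 2 4
        // 4 4 4
    }
}
```

The middle row shows why UTF-8 is not always the smallest choice: a Devanagari letter takes 3 bytes in UTF-8 but only 2 in UTF-16.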
Advantages of Unicode in Java
- Global Language Support:
- Enables applications to support multiple languages in a single program.
- Example: A Java application can process text in English, Hindi, Chinese, and Arabic simultaneously.
- Platform Independence:
- Java’s Unicode support ensures consistent character representation across different platforms.
- Ease of Use:
- Built-in support for Unicode in char and String simplifies handling characters.
Examples of Unicode Usage
Example 1: Printing Unicode Characters
public class UnicodePrintExample
{
    public static void main(String[] args)
    {
        System.out.println("English: Hello");
        System.out.println("Hindi: \u0928\u092E\u0938\u094D\u0924\u0947");
        System.out.println("Chinese: \u4F60\u597D");
        System.out.println("Smiley: \u263A");
    }
}
Output:
English: Hello
Hindi: नमस्ते
Chinese: 你好
Smiley: ☺
Example 2: Supplementary Characters
public class SupplementaryExample
{
    public static void main(String[] args)
    {
        // U+1F600 lies outside the BMP, so it is written as a surrogate pair.
        String emoji = "\uD83D\uDE00"; // Unicode for 😀
        System.out.println("Emoji: " + emoji);
    }
}
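Because a supplementary character occupies two char values, String.length() counts code units rather than characters. The sketch below shows the difference; the class name SurrogateDemo and the helper codePointLength are illustrative, while codePointCount, Character.toChars, and codePoints() are standard library methods (codePoints() requires Java 8 or later).

```java
class SurrogateDemo {
    // Hypothetical helper: counts code points instead of UTF-16 code units.
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // Character.toChars builds the surrogate pair for a supplementary code point.
        String emoji = new String(Character.toChars(0x1F600));

        System.out.println(emoji.length());          // 2 (code units)
        System.out.println(codePointLength(emoji));  // 1 (code point)
        System.out.println(Character.isHighSurrogate(emoji.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(emoji.charAt(1)));  // true

        // codePoints() iterates whole code points, never half of a surrogate pair.
        emoji.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp)); // U+1F600
    }
}
```

When code needs to process text character by character, iterating with codePoints() avoids splitting a surrogate pair, which a plain charAt loop would do.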
Key Points
- UTF-16 Encoding: Java’s char and String are based on UTF-16.
- Unicode Escapes: Use the \uXXXX format for specifying characters in code.
- Global Compatibility: Unicode allows Java applications to handle text in any language.
- Memory Efficiency: UTF-16 is a variable-length encoding. Characters in the Basic Multilingual Plane take one 16-bit code unit, and only supplementary characters need two.