Sep 29, 2019 · 9 min read

Fun with Unicode in Java
Normally we don't pay much attention to character encoding in Java. Nonetheless, when we mix byte and char streams, things can get confusing unless we know the charset basics. Many tutorials and posts about character encoding are heavy on theory with few real examples. In this post, we try to demystify Unicode with easy-to-grasp examples.
Encode and Decode
Before diving into Unicode, let's first understand the terms encode and decode. Suppose we capture a video in MPEG format: the encoder in the camera encodes the pixels into bytes, and when the video is played back, the decoder converts the bytes back to pixels. A similar process plays out when we create a text file. For example, when the letter H is typed in a text editor, the OS encodes the keystroke as byte 0x48 and passes it to the editor. The editor holds the bytes in its buffer and passes them on to the windowing system, which decodes and displays the byte 0x48 as H. When the file is saved, 0x48 goes into the file.
In short, an encoder converts items such as pixels, audio streams or characters into binary bytes, and a decoder converts the bytes back to their original form.
Encode and Decode a Java String
Let's go ahead and encode some strings in Java.
import java.nio.charset.StandardCharsets;

public class StringEncoder {

    public static void main(String[] args) {
        String str = "Hello";
        byte[] bytes = str.getBytes(StandardCharsets.US_ASCII);
        printBytes(bytes);   // x48 x65 x6c x6c x6f
    }

    public static void printBytes(byte[] a) {
        StringBuilder sb = new StringBuilder();
        for (byte b : a) {
            sb.append(String.format("x%02x ", b));
        }
        System.out.println(sb);
    }
}
The String.getBytes() method encodes the string as bytes (binary) using the US_ASCII charset, and the printBytes() method outputs the bytes in hex format. The hex output x48 x65 x6c x6c x6f is the binary form of the string Hello in ASCII.
Next, let's see how to decode the bytes back into a string.
import java.nio.charset.StandardCharsets;

public class StringDecoder {

    public static void main(String[] args) {
        byte[] bytes = { 0x48, 0x65, 0x6c, 0x6c, 0x6f };
        String str = new String(bytes, StandardCharsets.US_ASCII);
        System.out.println(str);   // Hello
    }
}
Here we decode a byte array filled with 0x48 0x65 0x6c 0x6c 0x6f as a new string. The String class decodes the bytes with the US_ASCII charset, and the result is displayed as Hello.
We can omit the StandardCharsets.US_ASCII argument in new String(bytes) and str.getBytes(). The results will be the same when the default charset of Java is UTF-8 (guaranteed from Java 18 onwards; platform-dependent before that), since UTF-8 uses the same byte values as US_ASCII for English letters.
The ASCII encoding scheme is quite simple: each character is mapped to a single byte. For example, H is encoded as 0x48, e as 0x65, and so on. It can handle the English character set, numbers and control characters such as backspace and carriage return, but not most western or Asian language characters.
Say Hello in Mandarin
Hello in Mandarin is nĭ hăo. It is written using two characters, 你 (nĭ) and 好 (hăo). Let's encode and decode the single character 你 (nĭ).
// encode
String str = "你";
byte[] bytes = str.getBytes(StandardCharsets.UTF_8);
printBytes(bytes);   // xe4 xbd xa0 (3 bytes)

// decode
String decodedStr = new String(bytes, StandardCharsets.UTF_8);
System.out.println(decodedStr);   // 你
Encoding the character 你 with the UTF-8 character set returns an array of three bytes, xe4 xbd xa0, which on decode returns 你.
Let's do the same with another standard character set, UTF_16.
// encode
String str = "你";
byte[] bytes = str.getBytes(StandardCharsets.UTF_16);
printBytes(bytes);   // xfe xff x4f x60 (4 bytes)

// decode
String decodedStr = new String(bytes, StandardCharsets.UTF_16);
System.out.println(decodedStr);   // 你
The UTF_16 character set encodes 你 into 4 bytes - xfe xff x4f x60 - while UTF_8 manages it with 3 bytes. The leading xfe xff is the byte order mark (BOM) that UTF_16 prepends on encode.
Just for the heck of it, try to encode 你 with US_ASCII: it returns a single byte, x3f, which decodes to the ? character. This is because ASCII is a single-byte encoding scheme that can't handle characters other than English alphabets.
Introducing Unicode
Unicode is a coded character set (or simply character set) capable of representing most of the world's writing systems. The recent version of Unicode contains around 138,000 characters covering 150 modern and historic languages and scripts, as well as symbol sets and emoji. The table below shows how some characters from different languages are represented in Unicode.
Character | Code Point | UTF_8 | UTF_16 | Language |
---|---|---|---|---|
a | U+0061 | 61 | 00 61 | English |
Z | U+005A | 5a | 00 5a | English |
â | U+00E2 | c3 a2 | 00 e2 | Latin |
Δ | U+0394 | ce 94 | 03 94 | Greek |
ع | U+0639 | d8 b9 | 06 39 | Arabic |
你 | U+4F60 | e4 bd a0 | 4f 60 | Chinese |
好 | U+597D | e5 a5 bd | 59 7d | Chinese |
ಡ | U+0CA1 | e0 b2 a1 | 0c a1 | Kannada |
ತ | U+0CA4 | e0 b2 a4 | 0c a4 | Kannada |
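The table entries can be verified in code. Note that the UTF_16 column lists the raw big-endian bytes without a BOM, so the sketch below uses StandardCharsets.UTF_16BE (the class name is our own):

```java
import java.nio.charset.StandardCharsets;

public class TableCheck {
    public static void main(String[] args) {
        // 你 is U+4F60: e4 bd a0 in UTF-8, 4f 60 in UTF-16BE
        print("你".getBytes(StandardCharsets.UTF_8));     // xe4 xbd xa0
        print("你".getBytes(StandardCharsets.UTF_16BE));  // x4f x60

        // Δ is U+0394: ce 94 in UTF-8, 03 94 in UTF-16BE
        print("Δ".getBytes(StandardCharsets.UTF_8));      // xce x94
        print("Δ".getBytes(StandardCharsets.UTF_16BE));   // x03 x94
    }

    static void print(byte[] a) {
        StringBuilder sb = new StringBuilder();
        for (byte b : a) sb.append(String.format("x%02x ", b));
        System.out.println(sb.toString().trim());
    }
}
```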
Each character or symbol is represented by a unique code point. Unicode has 1,112,064 code points, out of which around 138,000 are presently defined. A Unicode code point is written as U+XXXX, where U signifies Unicode. The String.codePointAt(int index) method returns the code point for a character.
String str = "你";
int codePoint = str.codePointAt(0);
System.out.format("U+%04X", codePoint);   // outputs the code point - U+4F60
A charset can have one or more encoding schemes, and Unicode has multiple encoding schemes - such as UTF_8, UTF_16, UTF_16LE and UTF_16BE - that map code points to bytes.
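The same code point therefore encodes to different byte sequences under each scheme. A minimal sketch (class name ours):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingSchemes {
    public static void main(String[] args) {
        String str = "你";  // single code point U+4F60
        Charset[] schemes = {
            StandardCharsets.UTF_8,     // 3 bytes
            StandardCharsets.UTF_16,    // 4 bytes (2 + BOM)
            StandardCharsets.UTF_16BE,  // 2 bytes
            StandardCharsets.UTF_16LE   // 2 bytes
        };
        for (Charset cs : schemes) {
            System.out.println(cs + " -> " + str.getBytes(cs).length + " bytes");
        }
    }
}
```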
UTF-8
UTF-8 (8-bit Unicode Transformation Format) is a variable-width character encoding capable of encoding all valid Unicode code points using one to four 8-bit bytes. In the above table, we can see that the length of the encoded bytes varies from one to three bytes for UTF-8. The majority of web pages use UTF-8.
The first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single byte with the same binary value as ASCII. Valid ASCII text is valid UTF-8-encoded Unicode as well.
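This ASCII compatibility can be demonstrated directly (a minimal sketch; the class name is our own):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AsciiSubset {
    public static void main(String[] args) {
        String str = "Hello, World!";
        byte[] ascii = str.getBytes(StandardCharsets.US_ASCII);
        byte[] utf8 = str.getBytes(StandardCharsets.UTF_8);

        // for the first 128 code points the byte sequences are identical
        System.out.println(Arrays.equals(ascii, utf8));  // true

        // so decoding ASCII bytes as UTF-8 works too
        System.out.println(new String(ascii, StandardCharsets.UTF_8));  // Hello, World!
    }
}
```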
UTF-16
UTF-16 (16-bit Unicode Transformation Format) is another encoding scheme capable of handling all characters of the Unicode character set. The encoding is variable-length, as code points are encoded with one or two 16-bit code units (i.e. minimum 2 bytes and maximum 4 bytes).
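Characters outside the Basic Multilingual Plane need two code units, a surrogate pair. A sketch with an emoji (our own example, not from the original post):

```java
import java.nio.charset.StandardCharsets;

public class SurrogatePairs {
    public static void main(String[] args) {
        String smiley = "\uD83D\uDE00";  // 😀 (U+1F600), outside the BMP

        // one code point, but two 16-bit code units (a surrogate pair)
        System.out.println(smiley.codePointCount(0, smiley.length()));  // 1
        System.out.println(smiley.length());                            // 2

        // UTF-16BE therefore needs 4 bytes for this single character
        System.out.println(smiley.getBytes(StandardCharsets.UTF_16BE).length);  // 4
    }
}
```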
Many systems such as Windows, Java and JavaScript internally use UTF-16. It is also often used for plain text and for word-processing data files on Windows, but rarely for files on Unix/Linux or macOS.
Java internally uses UTF-16. From Java 9 onwards, to reduce the memory taken by String objects, it uses either ISO-8859-1/Latin-1 (one byte per character) or UTF-16 (two bytes per character) based on the contents of the string. See JEP 254 (Compact Strings).
However, don't confuse the internal charset with the Java default charset, which is UTF-8. For example, strings live in heap memory as UTF-16, yet the method String.getBytes() returns bytes encoded as UTF-8, the default charset.
You can use CharInfo.java to display character details of a string.
To summarize:
- Character set is a collection of characters. Numbers, alphabets and Chinese characters are examples of character sets.
- Coded character set is a character set in which each character has an assigned int value. Unicode, US-ASCII and ISO-8859-1 are examples of coded character sets.
- Code point is an integer assigned to a character in a coded character set.
- Character encoding maps between the code points of a coded character set and sequences of bytes. One coded character set may have one or more character encodings. For example, ASCII has one encoding scheme while Unicode has multiple encoding schemes - UTF-8, UTF-16, UTF_16BE, UTF_16LE etc.
Java IO
Use the char stream IO classes Reader and Writer when dealing with text and text files. As already explained, the default charset of the Java platform is UTF-8: text written using the Writer class is encoded in UTF-8, and the Reader class reads text as UTF-8.
Using the java.io package, we can write and read a text file in the default charset as below.
String str = "a Z â Δ 你 好 ಡ ತ ع";
File file = new File("x-utf8.txt");

// write file in default charset (UTF-8)
try (BufferedWriter out = new BufferedWriter(new FileWriter(file))) {
    out.write(str);
}

// read file in default charset (UTF-8)
try (BufferedReader in = new BufferedReader(new FileReader(file))) {
    String line;
    while ((line = in.readLine()) != null) {
        System.out.println(line);
    }
}
The above example uses the char stream classes Writer and Reader directly, which use the default character set (UTF-8).
To encode or decode in a non-default charset, use the byte-oriented classes together with a bridge class that converts them to char-oriented ones. For instance, to read a file as raw bytes use FileInputStream and wrap it with InputStreamReader, a bridge that decodes the bytes to chars in the specified charset. Similarly, for output use OutputStreamWriter (bridge) around FileOutputStream (byte output).
// read in UTF_16LE
InputStream byteInStream = new FileInputStream(file);
Reader encodedCharStream = new InputStreamReader(byteInStream, StandardCharsets.UTF_16LE);

// write in UTF_16LE
OutputStream byteOutStream = new FileOutputStream(file);
Writer decodedCharStream = new OutputStreamWriter(byteOutStream, StandardCharsets.UTF_16LE);
The following example writes a file in the UTF_16BE charset and reads it back.
String str = "a Z â Δ 你 好 ಡ ತ ع";
Charset charset = StandardCharsets.UTF_16BE;
File file = new File("x-utf16be.txt");

// write file in non-default charset
try (BufferedWriter out = new BufferedWriter(
        new OutputStreamWriter(new FileOutputStream(file), charset))) {
    out.write(str);
}

// read file in non-default charset
try (BufferedReader in = new BufferedReader(
        new InputStreamReader(new FileInputStream(file), charset))) {
    String line;
    while ((line = in.readLine()) != null) {
        System.out.println(line);
    }
}
Transcoding
Transcoding is the direct digital-to-digital conversion from one encoding to another, such as UTF-8 to UTF-16. We regularly see transcoding in video, audio and image files, but rarely with text files.
Imagine we receive a stream of bytes over the network encoded in CP-1252 (Windows-1252) or ISO 8859-1 and want to save it to a text file in UTF-8.
There are a couple of options to transcode from one charset to another. The easiest is to use the String class.
String str = new String(bytes, StandardCharsets.ISO_8859_1);
byte[] toBytes = str.getBytes(StandardCharsets.UTF_8);
While this is quite fast, it suffers when we deal with a large set of bytes, as heap memory gets allocated to multiple large strings. A better option is to use java.io classes as shown below:

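The original code for this step is missing here; the following is a minimal stream-based sketch under the stated assumptions (the file names and buffer size are illustrative):

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class StreamTranscode {
    public static void main(String[] args) throws IOException {
        // decode bytes as ISO-8859-1 and re-encode as UTF-8, one buffer
        // at a time, so the whole input never lives in memory as one String
        try (Reader in = new InputStreamReader(
                     new FileInputStream("input-8859-1.txt"), StandardCharsets.ISO_8859_1);
             Writer out = new OutputStreamWriter(
                     new FileOutputStream("output-utf8.txt"), StandardCharsets.UTF_8)) {
            char[] buffer = new char[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
    }
}
```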
See Transcode.java for a transcoding example and Char Server for a rough take on encoding between server and socket.
Play with Unicode in the Linux terminal
We can work with text encoding in the Linux terminal using some simple commands. Note that the Linux terminal can display ASCII and UTF-8 files but not UTF-16.
# create a text file from encoded bytes
$ echo -n -e '\xce\x94' > delta-8.txt
$ echo -n -e '\xfe\xff\x03\x94' > delta-16.txt
$ echo -n -e '\xe4\xbd\xa0\x20\xe5\xa5\xbd' > nihou-8.txt
$ echo -n -e '\xfe\xff\x4f\x60\xfe\xff\x20\x59\x7d' > nihou-16.txt

# debug a text file as hex
$ hd nihou-16.txt
00000000  fe ff 4f 60 fe ff 20 59  7d    |..O`.. Y}|
00000009

# know the file encoding scheme
$ file -i *.txt
delta-16.txt: text/plain; charset=utf-16be
delta-8.txt:  text/plain; charset=utf-8
nihou-16.txt: text/plain; charset=utf-16be
nihou-8.txt:  text/plain; charset=utf-8

# transcode from UTF-8 to UTF-16LE
$ iconv -f UTF8 -t UTF16LE < nihou-8.txt > nihou-16le.txt

# list encoding schemes
$ iconv -l

# read a file in the default charset
$ gedit nihou-8.txt

# non-default
$ gedit --encoding UTF-16LE nihou-16le.txt
Further Reading
Some good posts about Unicode usage in Java.
The Absolute Minimum Developers Should Know About Unicode and Character Sets
Java: a rough guide to character encoding
Don't form strings containing partial characters from variable-width encodings
Source: https://www.codetab.org/post/java-unicode-basics/