How to Read a Binary File in Java and Convert It to ASCII

Sep 29, 2019 . 9 min read

Fun with Unicode in Java

Normally we don't pay much attention to character encoding in Java. Nonetheless, when we crisscross byte and char streams, things can get confusing unless we know the charset basics. Many tutorials and posts about character encoding are heavy on theory with few real examples. In this post, we try to demystify Unicode with easy-to-grasp examples.

Encode and Decode

Before diving into Unicode, let's first understand the terms encode and decode. Suppose we capture a video in MPEG format: the encoder in the camera encodes the pixels into bytes and, when the video is played back, the decoder converts the bytes back to pixels. A similar process plays out when we create a text file. For example, when the letter H is typed in a text editor, the OS encodes the keystroke as byte 0x48 and passes it to the editor. The editor holds the bytes in its buffer and passes them on to the windowing system, which decodes and displays the byte 0x48 as H. When the file is saved, 0x48 gets into the file.

In short, an encoder converts items such as pixels, audio streams or characters into binary bytes, and a decoder converts the bytes back to their original form.

Encode and Decode Java String

Let's go ahead and encode some strings in Java.

    import java.nio.charset.StandardCharsets;

    public class StringEncoder {

        public static void main(String[] args) {
            String str = "Hello";
            // encode the string as ASCII bytes
            byte[] bytes = str.getBytes(StandardCharsets.US_ASCII);
            printBytes(bytes);
        }

        // print each byte as two hex digits prefixed with x
        public static void printBytes(byte[] a) {
            StringBuilder sb = new StringBuilder();
            for (byte b : a) {
                sb.append(String.format("x%02x ", b));
            }
            System.out.println(sb);
        }
    }

The String.getBytes() method encodes the string as bytes (binary) using the US_ASCII charset, and the printBytes() method outputs the bytes in hex format. The output x48 x65 x6c x6c x6f (the bytes 0x48 0x65 0x6c 0x6c 0x6f) is the binary form of the string Hello in ASCII.

Next, let's see how to decode the bytes back into a string.

    import java.nio.charset.StandardCharsets;

    public class StringDecoder {

        public static void main(String[] args) {
            byte[] bytes = {0x48, 0x65, 0x6c, 0x6c, 0x6f};
            // decode the ASCII bytes into a string
            String str = new String(bytes, StandardCharsets.US_ASCII);
            System.out.println(str);
        }
    }

Here we decode a byte array filled with 0x48 0x65 0x6c 0x6c 0x6f as a new string. The String class decodes the bytes with the US_ASCII charset, and the result is displayed as Hello.

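We can omit the StandardCharsets.US_ASCII argument in new String(bytes) and str.getBytes(). The results will be the same when the default charset of Java is UTF-8, which uses the same hex values for English letters as US_ASCII. UTF-8 is the default on every platform from Java 18 onwards (JEP 400); on earlier versions the default depends on the OS locale settings.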
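A quick way to check this on a given machine is sketched below. The class and method names (DefaultCharsetCheck, sameInAsciiAndUtf8) are made up for illustration, and the printed default charset depends on the JVM version and platform settings, so no output is shown for that line.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class DefaultCharsetCheck {

    // True when the text encodes to identical bytes in US_ASCII and UTF_8.
    static boolean sameInAsciiAndUtf8(String text) {
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(ascii, utf8);
    }

    public static void main(String[] args) {
        // The value depends on the JVM version and platform settings.
        System.out.println("Default charset: " + Charset.defaultCharset());
        System.out.println(sameInAsciiAndUtf8("Hello")); // true
        System.out.println(sameInAsciiAndUtf8("你"));     // false
    }
}
```

When the encoding matters, passing the charset explicitly, as the earlier examples do, avoids depending on the default at all.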

The ASCII encoding scheme is quite simple: each character is mapped to a single byte, for example, H is encoded as 0x48, e as 0x65 and so on. It can handle the English character set, numbers and control characters such as backspace and carriage return, but not the characters of many other western or Asian languages.

Say Hello in Mandarin

Hello in Mandarin is nĭ hăo. It is written using two characters, 你 (nĭ) and 好 (hăo). Let's encode and decode the single character 你 (nĭ).

    // encode
    String str = "你";
    byte[] bytes = str.getBytes(StandardCharsets.UTF_8);
    printBytes(bytes); // xe4 xbd xa0  (3 bytes)

    // decode
    String decodedStr = new String(bytes, StandardCharsets.UTF_8);
    System.out.println(decodedStr); // 你

Encoding the character 你 with the UTF-8 character set returns an array of 3 bytes, xe4 xbd xa0, which on decoding returns 你.

Let's do the same with another standard character set, UTF_16.

    // encode
    String str = "你";
    byte[] bytes = str.getBytes(StandardCharsets.UTF_16);
    printBytes(bytes); // xfe xff x4f x60  (4 bytes)

    // decode
    String decodedStr = new String(bytes, StandardCharsets.UTF_16);
    System.out.println(decodedStr); // 你

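Character set UTF_16 encodes 你 into 4 bytes - xfe xff x4f x60 - while UTF_8 manages it with 3 bytes. The first two bytes, xfe xff, are a byte order mark (BOM) that Java's UTF_16 encoder prepends; the character itself takes the two bytes x4f x60.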
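To see how those four bytes are made up, the sketch below encodes the same character with the three UTF-16 charsets in StandardCharsets. Utf16Variants and hex() are illustrative names, with hex() mimicking the printBytes() format used earlier.

```java
import java.nio.charset.StandardCharsets;

class Utf16Variants {

    // Formats bytes the same way as printBytes() in this post, e.g. "xfe xff".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("x%02x ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String str = "你";
        // UTF_16 writes a big-endian byte order mark before the character.
        System.out.println(hex(str.getBytes(StandardCharsets.UTF_16)));   // xfe xff x4f x60
        // UTF_16BE and UTF_16LE fix the byte order and write no mark.
        System.out.println(hex(str.getBytes(StandardCharsets.UTF_16BE))); // x4f x60
        System.out.println(hex(str.getBytes(StandardCharsets.UTF_16LE))); // x60 x4f
    }
}
```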

Just for the heck of it, try to encode 你 with US_ASCII: it returns a single byte x3f, which decodes to the ? character. This is because ASCII is a single-byte encoding scheme which can't handle characters outside its own 128-character set.
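The experiment can be tried with a sketch like the following. The names AsciiReplacement, roundTripAscii and fitsInAscii are hypothetical; the second helper uses CharsetEncoder.canEncode() as one possible way to detect the problem before any data is lost.

```java
import java.nio.charset.StandardCharsets;

class AsciiReplacement {

    // Encodes to US_ASCII and decodes again; unmappable characters come back as '?'.
    static String roundTripAscii(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    // Asks the charset's encoder whether it can represent the text at all.
    static boolean fitsInAscii(String text) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        System.out.println(roundTripAscii("你"));  // ?
        System.out.println(fitsInAscii("Hello")); // true
        System.out.println(fitsInAscii("你"));     // false
    }
}
```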

Introducing Unicode

Unicode is a coded character set (or simply character set) capable of representing most of the writing systems. The recent version of Unicode contains around 138,000 characters covering 150 modern and historic languages and scripts, as well as symbol sets and emoji. The table below shows how some characters from different languages are represented in Unicode.

Character  Code Point  UTF_8     UTF_16  Language
a          U+0061      61        00 61   English
Z          U+005A      5a        00 5a   English
â          U+00E2      c3 a2     00 e2   Latin
Δ          U+0394      ce 94     03 94   Greek
ع          U+0639      d8 b9     06 39   Arabic
你         U+4F60      e4 bd a0  4f 60   Chinese
好         U+597D      e5 a5 bd  59 7d   Chinese
ಡ          U+0CA1      e0 b2 a1  0c a1   Kannada
ತ          U+0CA4      e0 b2 a4  0c a4   Kannada

Each character or symbol is represented by a unique code point. Unicode has 1,112,064 code points, out of which around 138,000 are presently defined. A Unicode code point is written as U+xxxx, where U signifies Unicode. The String.codePointAt(int index) method returns the code point of the character at the given index.

    String str = "你";
    int codePoint = str.codePointAt(0);
    System.out.format("U+%04X", codePoint); // outputs the code point - U+4F60
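Building on codePointAt(), the rows of the table above can be reproduced with a small sketch like this one. CodePointTable, hex() and row() are illustrative names, and the sketch assumes the table's UTF_16 column shows big-endian bytes without a byte order mark, which is why it uses UTF_16BE.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CodePointTable {

    // Space-separated lowercase hex of the text encoded in the given charset.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    // One row: character, code point, UTF-8 bytes, UTF-16 bytes (no BOM).
    static String row(String ch) {
        return String.format("%s U+%04X %s | %s", ch, ch.codePointAt(0),
                hex(ch, StandardCharsets.UTF_8), hex(ch, StandardCharsets.UTF_16BE));
    }

    public static void main(String[] args) {
        for (String ch : new String[] {"a", "Z", "â", "Δ", "ع", "你", "好", "ಡ", "ತ"}) {
            System.out.println(row(ch));
        }
    }
}
```

For example, row("你") yields 你 U+4F60 e4 bd a0 | 4f 60.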

A charset can have one or more encoding schemes, and Unicode has multiple encoding schemes such as UTF_8, UTF_16, UTF_16LE and UTF_16BE that map code points to bytes.

UTF-8

UTF-8 (8-bit Unicode Transformation Format) is a variable-width character encoding capable of encoding all valid Unicode code points using one to four 8-bit bytes. In the above table, we can see that the length of the encoded bytes varies from 1 to 3 bytes for UTF-8. The majority of web pages use UTF-8.
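The variable width is easy to observe by counting encoded bytes. In the sketch below, Utf8Width and utf8Length() are illustrative names, and the emoji 😀 (U+1F600) is an extra character, not taken from the table, added to show the four-byte case.

```java
import java.nio.charset.StandardCharsets;

class Utf8Width {

    // Number of bytes UTF-8 needs for the given text.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("a"));  // 1
        System.out.println(utf8Length("â"));  // 2
        System.out.println(utf8Length("你")); // 3
        System.out.println(utf8Length("😀")); // 4
    }
}
```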

The first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single byte with the same binary value as ASCII. Valid ASCII text is therefore valid UTF-8-encoded Unicode as well.

UTF-16

UTF-16 (16-bit Unicode Transformation Format) is another encoding scheme capable of handling all characters of the Unicode character set. The encoding is variable-length, as code points are encoded with one or two 16-bit code units (i.e. minimum 2 bytes and maximum 4 bytes).
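The difference between code units and code points shows up directly in the String API. The following is a minimal sketch with made-up names (SurrogatePairs, codeUnits, codePoints); the emoji 😀 (U+1F600) is an assumed example of a character that needs two code units.

```java
import java.nio.charset.StandardCharsets;

class SurrogatePairs {

    // String.length() counts 16-bit code units, not characters.
    static int codeUnits(String text) {
        return text.length();
    }

    // codePointCount() counts Unicode code points.
    static int codePoints(String text) {
        return text.codePointCount(0, text.length());
    }

    public static void main(String[] args) {
        String bmp = "你";    // fits in one 16-bit code unit
        String emoji = "😀"; // needs a surrogate pair
        System.out.println(codeUnits(bmp) + " " + codePoints(bmp));     // 1 1
        System.out.println(codeUnits(emoji) + " " + codePoints(emoji)); // 2 1
        System.out.println(Character.isHighSurrogate(emoji.charAt(0))); // true
        System.out.println(emoji.getBytes(StandardCharsets.UTF_16BE).length); // 4
    }
}
```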

Many systems such as Windows, Java and JavaScript use UTF-16 internally. It is also often used for plain text and for word-processing data files on Windows, but rarely used for files on Unix/Linux or macOS.

Java internally uses UTF-16. From Java 9 onwards, to reduce the memory taken by String objects, it uses either ISO-8859-1/Latin-1 (one byte per character) or UTF-16 (two bytes per character) based upon the contents of the string. See JEP 254.

However, don't confuse the internal charset with the Java default charset, which is UTF-8 on Java 18 and later and platform-dependent before that. For example, Strings live in heap memory as UTF-16, yet the method String.getBytes() returns bytes encoded in the default charset, typically UTF-8.
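The two views of the same string can be put side by side. This is a rough sketch with illustrative names (InternalVsEncoded, codeUnits, encoded); the last line prints whatever the local default charset produces, so no output is given for it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class InternalVsEncoded {

    // The char view of a string: each char is one UTF-16 code unit.
    static String codeUnits(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            sb.append(String.format("%04x ", (int) c));
        }
        return sb.toString().trim();
    }

    // The encoded view: the bytes produced for a given charset.
    static String encoded(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            sb.append(String.format("%02x ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String str = "你";
        System.out.println(codeUnits(str));                       // 4f60
        System.out.println(encoded(str, StandardCharsets.UTF_8)); // e4 bd a0
        // The no-argument getBytes() uses whatever Charset.defaultCharset() reports.
        System.out.println(encoded(str, Charset.defaultCharset()));
    }
}
```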

You can use CharInfo.java to display the character details of a string.

To summarize:

  • Character set is a collection of characters. Numbers, alphabets and Chinese characters are examples of character sets.
  • Coded character set is a character set in which each character has an assigned int value. Unicode, US-ASCII and ISO-8859-1 are examples of coded character sets.
  • Code point is an integer assigned to a character in a coded character set.
  • Character encoding maps between code points of a coded character set and sequences of bytes. One coded character set may have one or more character encodings. For example, ASCII has one encoding scheme while Unicode has multiple encoding schemes - UTF-8, UTF-16, UTF_16BE, UTF_16LE etc.

Java IO

Use the char stream IO classes Reader and Writer when dealing with text and text files. As already explained, the default charset of the Java platform is typically UTF-8, so text written using the Writer class is encoded in UTF-8 and the Reader class decodes the text as UTF-8.

Using the java.io package, we can write and read a text file in the default charset as below.

    String str = "a Z â Δ 你 好 ಡ ತ ع";
    File file = new File("x-utf8.txt");

    // write file in default charset (UTF-8)
    try (BufferedWriter out = new BufferedWriter(new FileWriter(file))) {
        out.write(str);
    }

    // read file in default charset (UTF-8)
    try (BufferedReader in = new BufferedReader(new FileReader(file))) {
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);
        }
    }

The above example uses the char stream classes Writer and Reader directly, which use the default character set (UTF-8).

To encode/decode in a non-default charset, use the byte-oriented classes together with a bridge class that converts them to char-oriented classes. For instance, to read a file as raw bytes use FileInputStream and wrap it with InputStreamReader, a bridge that decodes the bytes to chars in the specified charset. Similarly for output, use OutputStreamWriter (bridge) and FileOutputStream (byte output).

    // read in UTF_16LE
    InputStream byteInStream = new FileInputStream(file);
    Reader encodedCharStream = new InputStreamReader(byteInStream, StandardCharsets.UTF_16LE);

    // write in UTF_16LE
    OutputStream byteOutStream = new FileOutputStream(file);
    Writer decodedCharStream = new OutputStreamWriter(byteOutStream, StandardCharsets.UTF_16LE);

The following example writes a file in the UTF_16BE charset and reads it back.

    String str = "a Z â Δ 你 好 ಡ ತ ع";
    Charset charset = StandardCharsets.UTF_16BE;
    File file = new File("x-utf16be.txt");

    // write file in non-default charset
    try (BufferedWriter out = new BufferedWriter(
            new OutputStreamWriter(new FileOutputStream(file), charset))) {
        out.write(str);
    }

    // read file in non-default charset
    try (BufferedReader in = new BufferedReader(
            new InputStreamReader(new FileInputStream(file), charset))) {
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);
        }
    }
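As an alternative that this post does not otherwise cover, the java.nio.file.Files helpers accept a charset directly and build the bridge internally. The sketch below is one possible shape, with NioCharsetIo and writeAndReadBack() as made-up names and a temporary file standing in for x-utf16be.txt.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class NioCharsetIo {

    // Writes the text in the given charset, then reads the first line back.
    static String writeAndReadBack(Path path, String text, Charset charset) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(path, charset)) {
            out.write(text);
        }
        try (BufferedReader in = Files.newBufferedReader(path, charset)) {
            return in.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        // Temporary file used only for this illustration.
        Path path = Files.createTempFile("x-utf16be", ".txt");
        String str = "a Z â Δ 你 好 ಡ ತ ع";
        System.out.println(writeAndReadBack(path, str, StandardCharsets.UTF_16BE));
        Files.delete(path);
    }
}
```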

Transcoding

Transcoding is the direct digital-to-digital conversion from one encoding to another, such as UTF-8 to UTF-16. We regularly see transcoding in video, audio and image files but rarely with text files.

Imagine we receive a stream of bytes over the network encoded in CP-1252 (Windows-1252) or ISO 8859-1 and want to save it to a text file in UTF-8.

There are a couple of options to transcode from one charset to another. The easiest way to transcode is to use the String class.

    String str = new String(bytes, StandardCharsets.ISO_8859_1);
    byte[] toBytes = str.getBytes(StandardCharsets.UTF_8);
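A concrete run of this two-step conversion might look like the sketch below. StringTranscode and latin1ToUtf8() are illustrative names; the input is the single byte 0xe2, which is â in ISO 8859-1 and becomes c3 a2 in UTF-8, matching the table earlier in the post.

```java
import java.nio.charset.StandardCharsets;

class StringTranscode {

    // Decodes ISO 8859-1 bytes into a String, then re-encodes it as UTF-8.
    static byte[] latin1ToUtf8(byte[] latin1Bytes) {
        String str = new String(latin1Bytes, StandardCharsets.ISO_8859_1);
        return str.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] latin1 = {(byte) 0xe2}; // â in ISO 8859-1
        StringBuilder sb = new StringBuilder();
        for (byte b : latin1ToUtf8(latin1)) {
            sb.append(String.format("%02x ", b));
        }
        System.out.println(sb.toString().trim()); // c3 a2
    }
}
```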

While this is quite fast, it suffers when we deal with a large set of bytes, as heap memory gets allocated to multiple large strings. A better option is to use java.io classes as shown below:

(Figure: Java Unicode charset conversion)
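The java.io approach can be sketched roughly as follows: an InputStreamReader decodes from the source charset, an OutputStreamWriter encodes to the target charset, and a fixed-size char buffer moves the text between them so that memory use stays bounded. The class name StreamTranscoder, the transcode() signature and the 8192-char buffer size are assumptions for this illustration, not taken from the original Transcode.java.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class StreamTranscoder {

    // Copies text from in to out, decoding with 'from' and encoding with 'to'.
    // The caller remains responsible for closing both streams.
    static void transcode(InputStream in, Charset from, OutputStream out, Charset to)
            throws IOException {
        Reader reader = new InputStreamReader(in, from);
        Writer writer = new OutputStreamWriter(out, to);
        char[] buffer = new char[8192]; // assumed buffer size
        int n;
        while ((n = reader.read(buffer)) != -1) {
            writer.write(buffer, 0, n);
        }
        writer.flush(); // push any buffered chars through the encoder
    }

    public static void main(String[] args) throws IOException {
        byte[] latin1 = {(byte) 0xe2, 0x20, 0x5a}; // "â Z" in ISO 8859-1
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transcode(new ByteArrayInputStream(latin1), StandardCharsets.ISO_8859_1,
                out, StandardCharsets.UTF_8);
        System.out.println(out.size()); // 4
    }
}
```

With file or socket streams in place of the in-memory ones, the same loop handles inputs far larger than the heap.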

See Transcode.java for a transcoding example and Char Server for a crude take on encoding between server and socket.

Play with Unicode in Linux terminal

We can work with text encoding in the Linux terminal with some simple commands. Note that the Linux terminal can display ASCII and UTF-8 files but not UTF-16.

    # create a text file from encoded bytes
    $ echo -n -e '\xce\x94' > delta-8.txt
    $ echo -n -e '\xfe\xff\x03\x94' > delta-16.txt
    $ echo -n -e '\xe4\xbd\xa0\x20\xe5\xa5\xbd' > nihou-8.txt
    $ echo -n -e '\xfe\xff\x4f\x60\x00\x20\x59\x7d' > nihou-16.txt

    # debug a text file as hex
    $ hd nihou-16.txt
    00000000  fe ff 4f 60 00 20 59 7d                           |..O`. Y}|
    00000008

    # know file encoding scheme
    $ file -i *.txt
    delta-16.txt: text/plain; charset=utf-16be
    delta-8.txt:  text/plain; charset=utf-8
    nihou-16.txt: text/plain; charset=utf-16be
    nihou-8.txt:  text/plain; charset=utf-8

    # transcode from UTF 8 to UTF 16LE
    $ iconv -f UTF8 -t UTF16LE < nihou-8.txt > nihou-16le.txt

    # list encoding schemes
    $ iconv -l

    # read a file in default charset
    $ gedit nihou-8.txt

    # non-default
    $ gedit --encoding UTF-16LE nihou-16le.txt

Further Reading

Some good posts about Unicode usage in Java.

The Absolute Minimum Developers should know about Unicode and Character Sets

Java: a rough guide to character encoding

Don't form strings containing partial characters from variable-width encodings

Source: https://www.codetab.org/post/java-unicode-basics/
