Internationalization | Andrew C. Young

Huffman Coding, Unicode, and CJKV Data

Today I wrote a little utility in Java that compresses a file using Huffman coding. Normally Huffman coding works on 8-bit bytes. However, because of my experience dealing with Chinese, Japanese, Korean, and other non-English text I wondered how well the coding method would work on double byte character sets. Specifically, I was curious about compressing UTF-8 text. UTF-8 is a variable length encoding for Unicode data that stores characters using between one and four bytes per character.