Today I wrote a little utility in Java that compresses a file using Huffman coding. Normally Huffman coding works on 8-bit bytes, but because of my experience dealing with Chinese, Japanese, Korean, and other non-English text, I wondered how well it would handle multi-byte character sets. Specifically, I was curious about compressing UTF-8 text.
UTF-8 is a variable-length encoding for Unicode data that stores each character using between one and four bytes.
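To make the byte-versus-character distinction concrete, here's a minimal Java sketch. It isn't the utility itself (the class name and sample string are just for illustration); it only shows the byte-level alphabet that a byte-oriented Huffman coder would build its frequency table from when given mixed ASCII and Japanese text:

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class Utf8FrequencyDemo {
    public static void main(String[] args) {
        // Mixed ASCII and Japanese text: the ASCII characters encode as one
        // byte each in UTF-8, while the katakana and kanji take three bytes each.
        String text = "Huffman coding: ハフマン符号";
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);

        System.out.println("Characters (code points): " + text.codePointCount(0, text.length()));
        System.out.println("UTF-8 bytes:              " + utf8.length);

        // Byte-level frequency table -- the input a byte-oriented Huffman
        // coder builds its tree from.
        Map<Byte, Integer> freq = new HashMap<>();
        for (byte b : utf8) {
            freq.merge(b, 1, Integer::sum);
        }
        System.out.println("Distinct byte symbols:    " + freq.size());
    }
}
```

Every ASCII character contributes one byte while each kana or kanji contributes three, so the byte alphabet the Huffman tree is built over ends up looking quite different from the character alphabet of the original text.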
At the end of last semester I finished the first version of Japanese Dependency Vectors (jpdv). I had to give up on using Clojure at the last minute because it was taking me too long to make progress, and I needed some sort of working system to turn in for my NLP final project, so I rewrote jpdv in Java. It took me about 18
I've been working on a new project I call “Japanese Dependency Vectors” or “jpdv” for short. It's a program that generates dependency-based semantic vector spaces for Japanese text. (There's already an excellent tool for doing this with English, which was written by Sebastian Pado.)
However, jpdv still has a way to go before it works as promised. So far the tool can parse CaboCha-formatted XML and produce both a word co-occurrence vector space and a slightly modified XML representation that better reflects the dependency relationships between the words in the text. A rough sketch of the co-occurrence half of that is below.
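This is not jpdv's actual code: the class name, hard-coded token list, and window size are invented for illustration, and the real tool pulls its tokens out of CaboCha's XML output rather than a Java list. It just shows the basic window-based counting that produces a co-occurrence vector space:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of window-based co-occurrence counting over one
// tokenized Japanese sentence ("the cat eats a fish").
public class CooccurrenceSketch {
    public static void main(String[] args) {
        // Pretend these tokens came from a CaboCha-parsed sentence.
        List<String> tokens = List.of("猫", "が", "魚", "を", "食べる");
        int window = 2; // count neighbors within two tokens on either side

        // target word -> (context word -> count)
        Map<String, Map<String, Integer>> space = new HashMap<>();
        for (int i = 0; i < tokens.size(); i++) {
            String target = tokens.get(i);
            Map<String, Integer> vector =
                    space.computeIfAbsent(target, k -> new HashMap<>());
            int start = Math.max(0, i - window);
            int end = Math.min(tokens.size() - 1, i + window);
            for (int j = start; j <= end; j++) {
                if (j == i) continue; // skip the target itself
                vector.merge(tokens.get(j), 1, Integer::sum);
            }
        }
        System.out.println(space);
    }
}
```

The dependency-based version replaces the flat window with the head/dependent links CaboCha provides, which is the part that still needs work.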
This might be proof that I'm crazy.
I'm working on a project for my NLP class that involves generating a semantic vector space for Japanese text, and I decided that this might be a good time to learn one of the LISP dialects. I've been looking at Clojure for a while now, but I hadn't taken the time to learn it before. I must say, I'm quite impressed so far. The fact that reading a Japanese XML document into a data structure “just works” without any tweaking is pretty nice.