First Release of Japanese Dependency Vectors
At the end of last semester I finished the first version of Japanese Dependency Vectors (jpdv). I had to give up on Clojure at the last minute: progress was too slow, and I needed some sort of working system to turn in for my NLP final project.
To accomplish this I rewrote jpdv in Java. It took me about 18 hours of solid coding, minus time for food of course.
The software can now generate both context-based and dependency-based vector spaces for Japanese text that has been pre-parsed with CaboCha. It can also generate a similarity matrix for a given vector space using cosine similarity. Two pieces are still missing: a path selection function that throws out paths that are too long, and a basis element selection function that decides which N basis elements to keep out of all those discovered. I will add both in the next release. I'm thinking of writing them as Groovy scripts so that they can be supplied at run time, which would make it easier to customize the system for a given task.
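To make the two planned functions more concrete, here is a rough Java sketch of what they might look like. This is not code from jpdv: the class name `SelectionSketch`, both method names, the representation of a path as a list of strings, the frequency-based ranking, and the basis element labels are all assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

public class SelectionSketch {

    // Hypothetical path selection: accept a path only if it has
    // at most maxLength steps. A path is modelled as a list of strings.
    public static Predicate<List<String>> maxLengthPathSelector(int maxLength) {
        return path -> path.size() <= maxLength;
    }

    // Hypothetical basis element selection: keep the n basis elements
    // with the highest counts. Ties are broken alphabetically so the
    // result is deterministic.
    public static List<String> topNBasisElements(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            kept.add(entries.get(i).getKey());
        }
        return kept;
    }

    public static void main(String[] args) {
        Predicate<List<String>> selector = maxLengthPathSelector(2);
        System.out.println(selector.test(List.of("a", "b")));      // path of length 2
        System.out.println(selector.test(List.of("a", "b", "c"))); // path of length 3

        // Made-up basis element labels and counts.
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("ga:taberu", 5);
        counts.put("wo:taberu", 3);
        counts.put("ni:iku", 5);
        counts.put("de:iku", 1);
        System.out.println(topNBasisElements(counts, 2));
    }
}
```

Supplying these as Groovy scripts would amount to swapping the bodies of the two methods at run time while keeping the same inputs and outputs.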
More information can be found on the GitHub page.
Here is an example similarity matrix generated by the current version of jpdv:
WORD | コンピュータ | 兄弟 | 緑 | 赤い | 電話 | 青い | 黒い |
---|---|---|---|---|---|---|---|
コンピュータ | 1.00000 | 0.06506 | 0.07563 | 0.00000 | 0.07760 | 0.00000 | 0.00000 |
兄弟 | 0.06506 | 1.00000 | 0.19929 | 0.00000 | 0.14947 | 0.00000 | 0.00000 |
緑 | 0.07563 | 0.19929 | 0.99999 | 0.00000 | 0.19833 | 0.00000 | 0.00000 |
赤い | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.01352 |
電話 | 0.07760 | 0.14947 | 0.19833 | 0.00000 | 1.00000 | 0.00000 | 0.00000 |
青い | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 |
黒い | 0.00000 | 0.00000 | 0.00000 | 0.01352 | 0.00000 | 0.00000 | 1.00000 |
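
For readers unfamiliar with the measure, a matrix like the one above can be computed roughly as follows. This is a minimal sketch rather than the actual jpdv code: the class name `CosineSketch`, the method names, and the choice of a sparse `Map<String, Double>` as the vector representation are assumptions. Each cell is the dot product of two word vectors divided by the product of their lengths, so words that share no basis elements score 0.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CosineSketch {

    // Euclidean length of a sparse vector (basis element -> weight).
    static double norm(Map<String, Double> v) {
        double sumOfSquares = 0.0;
        for (double weight : v.values()) {
            sumOfSquares += weight * weight;
        }
        return Math.sqrt(sumOfSquares);
    }

    // Cosine similarity: dot(a, b) / (|a| * |b|).
    // Returns 0 if either vector is empty or all zeros.
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (normA * normB);
    }

    // Pairwise similarity matrix; symmetric, so each pair is computed once.
    public static double[][] similarityMatrix(List<Map<String, Double>> vectors) {
        int n = vectors.size();
        double[][] matrix = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i; j < n; j++) {
                double sim = cosine(vectors.get(i), vectors.get(j));
                matrix[i][j] = sim;
                matrix[j][i] = sim;
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        // Toy vectors with made-up basis elements and weights.
        Map<String, Double> first = new HashMap<>();
        first.put("x", 3.0);
        first.put("y", 4.0);
        Map<String, Double> second = new HashMap<>();
        second.put("x", 1.0);
        Map<String, Double> third = new HashMap<>();
        third.put("z", 2.0);

        List<Map<String, Double>> vectors = new ArrayList<>();
        vectors.add(first);
        vectors.add(second);
        vectors.add(third);

        for (double[] row : similarityMatrix(vectors)) {
            StringBuilder line = new StringBuilder();
            for (double cell : row) {
                line.append(String.format("%.5f ", cell));
            }
            System.out.println(line.toString().trim());
        }
    }
}
```

Because the result depends on floating-point arithmetic, a word's similarity with itself can come out a hair under 1, which may be why the 緑 diagonal entry reads 0.99999.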