CaboCha is a dependency parser for Japanese used by (among other things) the Japanese FrameNet project. Getting it installed and working on my Mac turned out to be more work than I had anticipated, so I thought I would post instructions for anyone else who wants to install CaboCha.
- Install Xcode and the Developer Tools from your Mac OS X install CD if you haven’t already.
- Install MacPorts if you haven’t already.
- Download this tar file: cabocha-macports.tar.gz
- Extract the tar file.
- Run install.sh from inside the directory that the tar file created.
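If you would rather script the extract-and-run steps above, here is a minimal Python sketch. It assumes you have already downloaded cabocha-macports.tar.gz yourself. The function names are my own for illustration, and because I don't assume the name of the directory the tar file creates, the code searches for install.sh instead.

```python
import subprocess
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive and return the path of the install.sh inside it."""
    dest = Path(dest_dir)
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(dest)
    # The extracted directory name is not assumed; look for the script instead.
    scripts = sorted(dest.rglob("install.sh"))
    if not scripts:
        raise FileNotFoundError("install.sh not found in " + str(archive_path))
    return scripts[0]


def run_installer(script_path):
    """Run install.sh from inside its own directory, as the steps describe."""
    script = Path(script_path)
    subprocess.run(["sh", script.name], cwd=script.parent, check=True)
```

Typical use would be `run_installer(extract_archive("cabocha-macports.tar.gz", "."))`. Passing `check=True` makes Python raise an error if the script exits with a failure, so a broken install does not go unnoticed.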
The install script first installs the following two programs via MacPorts:
- ChaSen (a Japanese part of speech and morphological analyzer)
- MeCab (another Japanese part of speech and morphological analyzer that uses Conditional Random Fields)
Then the install script will download and install the following programs, patching them as necessary:
- TinySVM (A Support Vector Machines library)
- YamCha (A text chunker)
- CaboCha (The Japanese dependency parser we’re interested in)
All of the applications will be installed into the MacPorts directory structure in /opt/local, so they should work out of the box. When you run CaboCha you can choose whether to use ChaSen or MeCab for part of speech and morphological analysis. The default is normally ChaSen, but it seems that ChaSen is no longer being actively maintained. Also, MeCab uses Conditional Random Fields (CRF), which generally perform better than the Hidden Markov Models (HMM) used by ChaSen. Because of this, my install script configures CaboCha to use MeCab as its default part of speech and morphological analyzer. If you want to use ChaSen instead, you can pass CaboCha the “-a chasen” flag or change the install.sh script to use ChaSen instead of MeCab.
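To make the analyzer choice concrete, here is a rough Python sketch of calling CaboCha with and without the “-a chasen” flag. It rests on a few assumptions of mine: that the executable ends up at /opt/local/bin/cabocha, that CaboCha reads sentences from standard input, and that input and output are EUC-JP encoded. The function names are made up for this example.

```python
import subprocess

# Assumption: MacPorts places executables in /opt/local/bin and the
# binary is called "cabocha". Adjust the path if yours differs.
CABOCHA = "/opt/local/bin/cabocha"


def cabocha_command(analyzer=None, executable=CABOCHA):
    """Build the argument list; analyzer="chasen" adds the "-a chasen" flag."""
    command = [executable]
    if analyzer is not None:
        command += ["-a", analyzer]
    return command


def parse(sentence, analyzer=None):
    """Send one sentence to CaboCha on stdin and return its output as text."""
    result = subprocess.run(
        cabocha_command(analyzer),
        input=(sentence + "\n").encode("euc_jp"),  # assumed EUC-JP input
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode("euc_jp")
```

Calling `parse(sentence)` uses whatever default the install configured (MeCab with my script), while `parse(sentence, analyzer="chasen")` asks for ChaSen on that one call.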
Note: This script works for me on Snow Leopard, but I can’t guarantee that it will work for you. Also, be aware that all of these utilities work with EUC-JP encoded text, not UTF-8 encoded text, so you will need to configure your terminal to use EUC-JP. If you’re using the standard Mac OS X Terminal app, the easiest way to do this is:
- Open the Terminal app.
- Click on the ‘Terminal’ menu and then click on ‘Preferences’.
- Click on ‘Settings’.
- Highlight your favorite terminal profile (I use ‘Pro’ myself).
- Click on the button with the gear on it and select ‘Duplicate Settings’.
- Name the new profile something useful, like ‘EUC-JP’.
- Click on the new profile and then click on the ‘Advanced’ tab.
- Change the ‘Character encoding’ drop down to ‘Japanese (EUC)’.
- Close the preferences window and quit the terminal app to make sure it picks up the changes.
To use your new profile, relaunch the terminal app, click on the ‘Shell’ menu, select ‘New Window’, and then select ‘EUC-JP’ (or whatever you called your new profile). This will open a new terminal window that is using EUC-JP for its character encoding. You might also want to mess with the fonts for your new profile to make the Japanese characters easier to read.
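The terminal setting only affects what you type and see. Text files saved as UTF-8 still need to be converted before these tools can read them. A minimal Python sketch of that conversion follows; the function name and file paths are placeholders of my own.

```python
from pathlib import Path


def convert_file(source_path, target_path):
    """Read a UTF-8 text file and write the same text encoded as EUC-JP."""
    text = Path(source_path).read_text(encoding="utf-8")
    # Raises UnicodeEncodeError if a character has no EUC-JP equivalent.
    Path(target_path).write_bytes(text.encode("euc_jp"))
```

For example, `convert_file("input-utf8.txt", "input-eucjp.txt")` writes a copy that can be fed to the parser. To read the results back in Python, decode the output bytes with `"euc_jp"`.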
The following websites were especially helpful in showing me what files to modify in order to get TinySVM installed properly: