<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Andrew Young &#187; Japanese</title>
	<atom:link href="http://vaelen.org/category/japan/japanese/feed/" rel="self" type="application/rss+xml" />
	<link>http://vaelen.org</link>
	<description>Computational Linguist, Software Engineer</description>
	<lastBuildDate>Sat, 05 Jun 2010 06:35:44 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>Huffman Coding, Unicode, and CJKV Data</title>
		<link>http://vaelen.org/2010/04/16/huffman-coding-and-unicode/</link>
		<comments>http://vaelen.org/2010/04/16/huffman-coding-and-unicode/#comments</comments>
		<pubDate>Fri, 16 Apr 2010 05:43:53 +0000</pubDate>
		<dc:creator>Andrew</dc:creator>
				<category><![CDATA[Japanese]]></category>
		<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://vaelen.org/?p=193</guid>
		<description><![CDATA[Today I wrote a little utility in Java that compresses a file using Huffman coding. Normally Huffman coding works on 8-bit bytes. However, because of my experience dealing with Chinese, Japanese, Korean, and other non-English text I wondered how well the coding method would work on double byte character sets. Specifically, I was curious about [...]]]></description>
			<content:encoded><![CDATA[<p>Today I wrote a little utility in Java that compresses a file using <a href="http://en.wikipedia.org/wiki/Huffman_coding">Huffman coding</a>.  Normally Huffman coding works on 8-bit bytes.  However, because of my experience dealing with Chinese, Japanese, Korean, and other non-English text I wondered how well the coding method would work on double byte character sets.  Specifically, I was curious about compressing <a href="http://en.wikipedia.org/wiki/UTF-8">UTF-8</a> text.<br />
<span id="more-193"></span><br />
UTF-8 is a variable-length encoding for Unicode data that stores each character in one to four bytes.  This works great when the text is written mostly (or completely) in the Latin alphabet, but other scripts need two or more bytes per character (three for most Chinese, Japanese, and Korean characters), which causes the file to become bloated very quickly.  (This is one of the reasons that Asian countries have been slow to adopt Unicode.)  UTF-8 is also designed so that a program can find character boundaries from any position in a file, even though characters take up different numbers of bytes.  This is a traditional problem with older multibyte encodings that use shift in/out characters: in order to know how to interpret a pair of bytes, the program must also know what the last shift character was.  UTF-8 solves this problem with the high bits of each byte.  A byte with the high bit clear is a single-byte character, a byte starting with the bits 11 begins a multibyte sequence, and a byte starting with the bits 10 continues one.  The tradeoff is that each continuation byte carries only 6 bits of data (and the lead byte even fewer), which is why UTF-8 needs three bytes for most CJK characters and four bytes in the worst case instead of only two.  (Characters outside the Basic Multilingual Plane are also stored as surrogate pairs in UTF-16, which is what Java strings use internally, but I&#8217;m not going to bother with that in this post.)</p>
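<p>To make the byte counts concrete, here is a minimal Java sketch (the class and method names are made up for illustration and are not part of the utility described in this post).  It asks the standard library how many bytes UTF-8 needs for a few code points, and it uses the fact that UTF-8 continuation bytes always start with the bits 10 to step back to the first byte of a character from an arbitrary position.</p>

```java
import java.nio.charset.StandardCharsets;

class Utf8Demo {
    /** Number of bytes UTF-8 needs to store one Unicode code point. */
    static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** True if b is a continuation byte, i.e. its top two bits are 10. */
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    /** Index of the first byte of the character that contains bytes[pos]. */
    static int characterStart(byte[] bytes, int pos) {
        while (pos > 0 && isContinuation(bytes[pos])) {
            pos--;
        }
        return pos;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length(0x41));     // 'A' (Latin letter): 1 byte
        System.out.println(utf8Length(0xE9));     // e with acute accent: 2 bytes
        System.out.println(utf8Length(0x4E2D));   // Chinese character U+4E2D: 3 bytes
        System.out.println(utf8Length(0x1F600));  // emoji outside the BMP: 4 bytes

        // "a", then U+4E2D (3 bytes), then "b": 5 bytes in total.
        byte[] bytes = "a\u4e2db".getBytes(StandardCharsets.UTF_8);
        // Byte 3 is the last byte of U+4E2D, whose first byte is at index 1.
        System.out.println(characterStart(bytes, 3));
    }
}
```

<p>No shift state has to be tracked: looking at the top two bits of each byte is enough to tell where a character begins.</p>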
<p>Huffman coding is also a variable-length encoding.  It uses the distribution of bytes in a given document to assign shorter bit sequences to more common bytes and longer bit sequences to less common bytes.  This means that the most common bytes only take up a few bits (as few as 1 or 2), whereas very uncommon bytes can take up more than 8 bits (and thus take more space to store than they did in the original document).  In the worst case, when every byte value is about equally common, every code comes out about 8 bits long and the encoded data is the same size as the original (slightly larger once the code table is stored alongside it).  In practice the final product is usually a file that is much smaller than the original.</p>
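<p>The idea can be sketched in a few lines of Java.  This is a textbook construction under my own naming (<code>HuffmanSketch</code>, <code>codeLengths</code>, and so on are hypothetical), not the utility described in this post: it counts byte frequencies, repeatedly merges the two lightest subtrees, and reads each symbol&#8217;s code length off its depth in the finished tree.</p>

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

class HuffmanSketch {
    /** A tree node: a leaf holds one symbol, an internal node holds two subtrees. */
    static final class Node {
        final int symbol;   // meaningful only for leaves
        final long weight;  // total frequency of all symbols below this node
        final Node left;
        final Node right;

        Node(int symbol, long weight, Node left, Node right) {
            this.symbol = symbol;
            this.weight = weight;
            this.left = left;
            this.right = right;
        }
    }

    /** Counts how often each byte value (0-255) occurs. */
    static Map<Integer, Long> countBytes(byte[] data) {
        Map<Integer, Long> freq = new HashMap<>();
        for (byte b : data) {
            freq.merge(b & 0xFF, 1L, Long::sum);
        }
        return freq;
    }

    /** Builds the Huffman tree and returns the code length in bits for each symbol. */
    static Map<Integer, Integer> codeLengths(Map<Integer, Long> freq) {
        PriorityQueue<Node> queue = new PriorityQueue<Node>(
                (Node a, Node b) -> Long.compare(a.weight, b.weight));
        for (Map.Entry<Integer, Long> e : freq.entrySet()) {
            queue.add(new Node(e.getKey(), e.getValue(), null, null));
        }
        // Repeatedly merge the two lightest subtrees until one tree remains.
        while (queue.size() > 1) {
            Node a = queue.poll();
            Node b = queue.poll();
            queue.add(new Node(-1, a.weight + b.weight, a, b));
        }
        Map<Integer, Integer> lengths = new HashMap<>();
        if (!queue.isEmpty()) {
            assignLengths(queue.poll(), 0, lengths);
        }
        return lengths;
    }

    private static void assignLengths(Node node, int depth, Map<Integer, Integer> lengths) {
        if (node.left == null) {
            // A leaf. An input with a single distinct symbol still needs 1 bit per symbol.
            lengths.put(node.symbol, Math.max(depth, 1));
            return;
        }
        assignLengths(node.left, depth + 1, lengths);
        assignLengths(node.right, depth + 1, lengths);
    }

    /** Size of the encoded data in bits, not counting the code table. */
    static long encodedBits(Map<Integer, Long> freq, Map<Integer, Integer> lengths) {
        long bits = 0;
        for (Map.Entry<Integer, Long> e : freq.entrySet()) {
            bits += e.getValue() * lengths.get(e.getKey());
        }
        return bits;
    }

    public static void main(String[] args) {
        // 8 x 'a', 3 x 'b', 1 x 'c': 12 bytes, or 96 bits, before coding.
        byte[] data = "aaaaaaaabbbc".getBytes(StandardCharsets.US_ASCII);
        Map<Integer, Long> freq = countBytes(data);
        Map<Integer, Integer> lengths = codeLengths(freq);
        System.out.println("bits for 'a': " + lengths.get((int) 'a'));  // 1
        System.out.println("bits for 'c': " + lengths.get((int) 'c'));  // 2
        System.out.println("total bits: " + encodedBits(freq, lengths)); // 16
    }
}
```

<p>A real compressor also has to write the code table (or the frequencies) into the output so that the file can be decoded, which this sketch leaves out.</p>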
<p>Because the Huffman coding algorithm doesn&#8217;t know about multibyte sequences in UTF-8 documents, it doesn&#8217;t take into account the fact that certain bytes (those with the high bit set) only occur together with other bytes and therefore their distribution is not independent.  This means that the <a href="http://en.wikipedia.org/wiki/Entropy_(information_theory)">entropy</a> of the file is actually smaller than the algorithm thinks it is and therefore the algorithm needs fewer bits to represent the file than it realizes.  In order to make use of multibyte sequences &#8211; not only in UTF-8, but in other multibyte encodings as well &#8211; I modified the algorithm to think in terms of Unicode characters instead of bytes.  By converting the file from its native encoding into a stream of characters before performing the Huffman coding calculations, the entropy of the file is reduced and the file can be stored more compactly.</p>
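<p>One rough way to see the effect without building any trees is to compare the Shannon information of the same text under two views: as a sequence of UTF-8 bytes and as a sequence of Unicode code points.  The sketch below is only an estimate under a simple model in which each symbol is drawn independently; the class and method names are hypothetical.  A Huffman code needs at least this many bits and less than one extra bit per symbol.</p>

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

class SymbolEntropy {
    /**
     * Total Shannon information of a symbol sequence, in bits, assuming each
     * symbol is drawn independently according to its observed frequency.
     */
    static double totalBits(int[] symbols) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int s : symbols) {
            counts.merge(s, 1, Integer::sum);
        }
        double bits = 0.0;
        for (int count : counts.values()) {
            double p = (double) count / symbols.length;
            bits -= count * (Math.log(p) / Math.log(2));
        }
        return bits;
    }

    /** The text as a sequence of unsigned UTF-8 byte values. */
    static int[] byteSymbols(String text) {
        byte[] raw = text.getBytes(StandardCharsets.UTF_8);
        int[] out = new int[raw.length];
        for (int i = 0; i < raw.length; i++) {
            out[i] = raw[i] & 0xFF;
        }
        return out;
    }

    /** The text as a sequence of Unicode code points. */
    static int[] codePointSymbols(String text) {
        return text.codePoints().toArray();
    }

    public static void main(String[] args) {
        // U+4E2D U+6587 repeated three times: 6 characters, 18 UTF-8 bytes.
        String text = "\u4e2d\u6587\u4e2d\u6587\u4e2d\u6587";
        // Six distinct byte values, three of each: 18 * log2(6), about 46.5 bits.
        System.out.println("as bytes:       " + totalBits(byteSymbols(text)));
        // Two distinct characters, three of each: 6 * log2(2) = 6 bits.
        System.out.println("as code points: " + totalBits(codePointSymbols(text)));
    }
}
```

<p>The toy input exaggerates the gap, and a character-based code table is larger than a byte-based one because there are many more possible symbols, so the saving on a real file will be smaller than this suggests.</p>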
<p>Since a file written in an Asian language will consist mostly of Unicode characters that translate to two or more bytes in UTF-8, it stands to reason that the standard Huffman coding algorithm won&#8217;t produce as compact a file as a Huffman coding algorithm that works in terms of characters instead of in terms of bytes.  To test this I compressed a Chinese text file with both the standard and improved algorithms and compared the results.  Although my encoding method didn&#8217;t save as much space as other common compression utilities such as GZip, I still saw an improvement when encoding characters instead of raw bytes.</p>
<p>Compressing the raw UTF-8 bytes of a 69KB Chinese file created a 48KB file.  Compressing the underlying characters created a 32KB file, which is a significant improvement.  Likewise, starting with a 46KB <a href="http://en.wikipedia.org/wiki/GBK">GBK</a> encoded version of the same file created a 34KB file when compressing the raw bytes and a 32KB file when compressing characters instead.  (As a matter of fact, compressing characters creates the same file in both cases because the input characters are converted to Unicode internally.)</p>
<p>Here is the raw data from the experiment:</p>
<table>
<tbody>
<tr>
<th>File Description</th>
<th>File Size</th>
</tr>
<tr>
<td>Original File &#8211; UTF-8 Encoded</td>
<td>69 KB</td>
</tr>
<tr>
<td>Compressed File &#8211; Huffman Bytes (UTF-8 Source)</td>
<td>48 KB</td>
</tr>
<tr>
<td>Original File &#8211; <a href="http://en.wikipedia.org/wiki/GBK#Encoding">GBK</a> Encoded</td>
<td>46 KB</td>
</tr>
<tr>
<td>Compressed File &#8211; Huffman Bytes (GBK Source)</td>
<td>34 KB</td>
</tr>
<tr>
<td>Compressed File &#8211; Huffman Characters (GBK or UTF-8 Source)</td>
<td>32 KB</td>
</tr>
<tr>
<td>Compressed File &#8211; GZipped UTF-8</td>
<td>27 KB</td>
</tr>
<tr>
<td>Compressed File &#8211; GZipped GBK</td>
<td>24 KB</td>
</tr>
</tbody>
</table>
<p><span style="font-size: 75%">Note: The &#8220;Huffman Bytes&#8221; files were created using the same algorithm as the &#8220;Huffman Characters&#8221; files, except the input encoding was set to ISO-8859-1 so that the algorithm would ignore multibyte sequences in the input files and work purely with 8-bit bytes.  This produces the same result that the original Huffman coding would produce.</span></p>
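<p>The ISO-8859-1 trick in the note works because that encoding maps every byte value from 0 to 255 to the Unicode character with the same number, so decoding with it yields exactly one symbol per input byte.  Here is a small Java sketch of the idea; the class and method names are hypothetical, and only the charset handling comes from the standard library.</p>

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeAsSymbols {
    /** Decodes raw bytes with the given encoding and returns one symbol per code point. */
    static int[] symbols(byte[] raw, Charset inputEncoding) {
        return new String(raw, inputEncoding).codePoints().toArray();
    }

    public static void main(String[] args) {
        // Two Chinese characters (U+4E2D, U+6587), stored as 6 bytes of UTF-8.
        byte[] raw = "\u4e2d\u6587".getBytes(StandardCharsets.UTF_8);

        // Decoded as UTF-8 the coder sees 2 symbols, one per character.
        System.out.println(symbols(raw, StandardCharsets.UTF_8).length);

        // Decoded as ISO-8859-1 the coder sees 6 symbols, one per byte.
        int[] perByte = symbols(raw, StandardCharsets.ISO_8859_1);
        System.out.println(perByte.length);

        // Each symbol has the same numeric value as the byte it came from.
        System.out.println(perByte[0] == (raw[0] & 0xFF));
    }
}
```

<p>The same symbol stream can then be fed to one Huffman coder, so the byte-based and character-based runs differ only in the input encoding they are given.</p>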
]]></content:encoded>
			<wfw:commentRss>http://vaelen.org/2010/04/16/huffman-coding-and-unicode/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>First Release of Japanese Dependency Vectors</title>
		<link>http://vaelen.org/2009/12/28/first-release-of-japanese-dependency-vectors/</link>
		<comments>http://vaelen.org/2009/12/28/first-release-of-japanese-dependency-vectors/#comments</comments>
		<pubDate>Mon, 28 Dec 2009 10:10:26 +0000</pubDate>
		<dc:creator>Andrew</dc:creator>
				<category><![CDATA[JPDV]]></category>
		<category><![CDATA[Japanese]]></category>
		<category><![CDATA[Linguistics]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Software]]></category>

		<guid isPermaLink="false">http://vaelen.org/?p=152</guid>
		<description><![CDATA[At the end of last semester I finished the first version of Japanese Dependency Vectors (jpdv).  I had to give up on using Clojure at the last minute because it was taking me too long to make progress and I needed to have some sort of a working system to turn in for my NLP [...]]]></description>
			<content:encoded><![CDATA[<p>At the end of last semester I finished the first version of Japanese Dependency Vectors (jpdv).  I had to give up on using Clojure at the last minute because it was taking me too long to make progress and I needed to have some sort of a working system to turn in for my NLP final project.</p>
<p>To accomplish this I rewrote jpdv in Java.  It took me about 18 hours of solid coding, minus time for food of course. <img height="20" width="20" class="emoji" src="/wp-includes/images/emoji/e057.png"/></p>
<p>The software can now generate both context-based and dependency-based vector spaces for Japanese text that has been pre-parsed with CaboCha.  It can also generate a similarity matrix for a given vector space using cosine similarity.  I still need to add a path selection function to throw out paths that are too long and a basis element selection function that determines which N basis elements to keep out of all those discovered, but I will add those in the next release.  I&#8217;m thinking of writing the path selection and basis element selection functions as <a href="http://groovy.codehaus.org/" target="_blank">Groovy</a> scripts so that they can be supplied at run time.  This would make it easier to customize the system for a given task.</p>
<p>More information can be found <a href="http://vaelen.org/software/jpdv/">here</a> and on the <a href="http://github.com/vaelen/jpdv" target="_blank">GitHub page</a>.</p>
<p>Here is an example similarity matrix generated by the current version of jpdv:</p>
<table>
<tr>
<th>WORD</th>
<th>コンピュータ</th>
<th>兄弟</th>
<th>緑</th>
<th>赤い</th>
<th>電話</th>
<th>青い</th>
<th>黒い</th>
</tr>
<tr>
<th>コンピュータ</th>
<td>1.00000</td>
<td>0.06506</td>
<td>0.07563</td>
<td>0.00000</td>
<td>0.07760</td>
<td>0.00000</td>
<td>0.00000</td>
</tr>
<tr>
<th>兄弟</th>
<td>0.06506</td>
<td>1.00000</td>
<td>0.19929</td>
<td>0.00000</td>
<td>0.14947</td>
<td>0.00000</td>
<td>0.00000</td>
</tr>
<tr>
<th>緑</th>
<td>0.07563</td>
<td>0.19929</td>
<td>0.99999</td>
<td>0.00000</td>
<td>0.19833</td>
<td>0.00000</td>
<td>0.00000</td>
</tr>
<tr>
<th>赤い</th>
<td>0.00000</td>
<td>0.00000</td>
<td>0.00000</td>
<td>1.00000</td>
<td>0.00000</td>
<td>0.00000</td>
<td>0.01352</td>
</tr>
<tr>
<th>電話</th>
<td>0.07760</td>
<td>0.14947</td>
<td>0.19833</td>
<td>0.00000</td>
<td>1.00000</td>
<td>0.00000</td>
<td>0.00000</td>
</tr>
<tr>
<th>青い</th>
<td>0.00000</td>
<td>0.00000</td>
<td>0.00000</td>
<td>0.00000</td>
<td>0.00000</td>
<td>1.00000</td>
<td>0.00000</td>
</tr>
<tr>
<th>黒い</th>
<td>0.00000</td>
<td>0.00000</td>
<td>0.00000</td>
<td>0.01352</td>
<td>0.00000</td>
<td>0.00000</td>
<td>1.00000</td>
</tr>
</table>
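<p>For readers who want to see how a matrix like the one above is computed, here is a minimal Java sketch of cosine similarity.  It is not taken from jpdv: the class name, the method names, and the three example vectors are all made up for illustration.  Cosine similarity divides the dot product of two vectors by the product of their lengths, so the result is symmetric and every nonzero vector scores 1 against itself (a diagonal entry such as 0.99999 is consistent with floating-point rounding).</p>

```java
class CosineSketch {
    /** Cosine of the angle between two vectors of equal length. */
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;  // convention used here: an all-zero vector is similar to nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Pairwise similarities; each pair is computed once and mirrored. */
    static double[][] similarityMatrix(double[][] vectors) {
        int n = vectors.length;
        double[][] sim = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i; j < n; j++) {
                double s = cosine(vectors[i], vectors[j]);
                sim[i][j] = s;
                sim[j][i] = s;
            }
        }
        return sim;
    }

    public static void main(String[] args) {
        // Hypothetical co-occurrence counts for three words over four basis elements.
        double[][] vectors = {
            {2, 0, 1, 0},
            {1, 0, 3, 0},
            {0, 4, 0, 1},
        };
        for (double[] row : similarityMatrix(vectors)) {
            StringBuilder line = new StringBuilder();
            for (double value : row) {
                line.append(String.format("%.5f ", value));
            }
            System.out.println(line.toString().trim());
        }
    }
}
```

<p>Words whose vectors share no basis elements, like the first and third rows here, get a similarity of exactly zero, which is why sparse vector spaces produce so many 0.00000 entries.</p>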
]]></content:encoded>
			<wfw:commentRss>http://vaelen.org/2009/12/28/first-release-of-japanese-dependency-vectors/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Japanese Dependency Vectors</title>
		<link>http://vaelen.org/2009/12/03/japanese-dependency-vectors/</link>
		<comments>http://vaelen.org/2009/12/03/japanese-dependency-vectors/#comments</comments>
		<pubDate>Thu, 03 Dec 2009 20:32:35 +0000</pubDate>
		<dc:creator>Andrew</dc:creator>
				<category><![CDATA[JPDV]]></category>
		<category><![CDATA[Japanese]]></category>
		<category><![CDATA[Linguistics]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[Software]]></category>

		<guid isPermaLink="false">http://vaelen.org/?p=127</guid>
		<description><![CDATA[I&#8217;ve been working on a new project I call &#8220;Japanese Dependency Vectors&#8221; or &#8220;jpdv&#8221; for short.  It&#8217;s a program that generates  dependency based semantic vector spaces for Japanese text.  (There&#8217;s already an excellent tool for doing this with English, which was written by Sebastian Pado.) However, jpdv still has a way to go before it [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://vaelen.org/wp-content/uploads/2009/12/jpdv-tree11.png"><img class="alignleft size-medium wp-image-136" title="Japanese Dependency Tree" src="http://vaelen.org/wp-content/uploads/2009/12/jpdv-tree11-277x300.png" alt="" width="277" height="300" /></a>I&#8217;ve been working on a new project I call &#8220;Japanese Dependency Vectors&#8221; or &#8220;<a href="http://github.com/vaelen/jpdv" target="_blank">jpdv</a>&#8221; for short.  It&#8217;s a program that generates  <a href="http://en.wikipedia.org/wiki/Dependency_grammar" target="_blank">dependency</a> based semantic <a href="http://en.wikipedia.org/wiki/Vector_space_model" target="_blank">vector spaces</a> for Japanese text.  (There&#8217;s already an excellent <a href="http://www.nlpado.de/~sebastian/dv.html" target="_blank">tool</a> for doing this with English, which was written by <a href="http://www.nlpado.de/~sebastian/software.html" target="_blank">Sebastian Pado</a>.)</p>
<p>However, jpdv still has a way to go before it works as promised.  So far the tool can parse <a href="http://chasen.org/~taku/software/cabocha/" target="_blank">CaboCha</a> formatted XML and produce both a word co-occurrence based vector space and a slightly modified XML representation that better demonstrates the dependency relationships of the words in the text.  The next step is to use the dependency information to produce the vector space that I need.  Unfortunately, I only have until the end of next week to finish it, because I&#8217;m working on this as the final project in my NLP class this semester.  I also plan to use the vector spaces created by the tool to do word sense disambiguation for the <a href="http://semeval2.fbk.eu/" target="_blank">SEMEVAL-2</a> shared task on <a href="http://lr-www.pi.titech.ac.jp/wsd.html" target="_blank">Japanese WSD</a>.</p>
<p>(The image included here was generated by jpdv as a LaTeX file from one of the sentences I&#8217;m using for testing.)</p>
]]></content:encoded>
			<wfw:commentRss>http://vaelen.org/2009/12/03/japanese-dependency-vectors/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Emacs, Clojure, and Japanese</title>
		<link>http://vaelen.org/2009/11/28/emacs-clojure-and-japanese/</link>
		<comments>http://vaelen.org/2009/11/28/emacs-clojure-and-japanese/#comments</comments>
		<pubDate>Sat, 28 Nov 2009 07:26:04 +0000</pubDate>
		<dc:creator>Andrew</dc:creator>
				<category><![CDATA[Clojure]]></category>
		<category><![CDATA[JPDV]]></category>
		<category><![CDATA[Japanese]]></category>
		<category><![CDATA[Linguistics]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[Software]]></category>

		<guid isPermaLink="false">http://vaelen.org/?p=120</guid>
		<description><![CDATA[This might be proof that I&#8217;m crazy: I&#8217;m working on a project for my NLP class that involves generating a semantic vector space for Japanese text, and I decided that this might be a good time to learn one of the LISP dialects.  I&#8217;ve been looking at Clojure for a while now, but I hadn&#8217;t [...]]]></description>
			<content:encoded><![CDATA[<p>This might be proof that I&#8217;m crazy:</p>
<p><a href="http://vaelen.org/wp-content/uploads/2009/11/emacs-jp-clojure.png"><img class="alignnone size-full wp-image-121" title="Emacs, Clojure, and Japanese" src="http://vaelen.org/wp-content/uploads/2009/11/emacs-jp-clojure.png" alt="" width="568" height="356" /></a></p>
<p><span id="more-120"></span></p>
<p>I&#8217;m working on a <a href="http://github.com/vaelen/jpdv">project</a> for my NLP class that involves generating a <a href="http://en.wikipedia.org/wiki/Vector_space_model" target="_blank">semantic vector space</a> for Japanese text, and I decided that this might be a good time to learn one of the LISP dialects.  I&#8217;ve been looking at <a href="http://clojure.org/" target="_blank">Clojure</a> for a while now, but I hadn&#8217;t taken the time to learn it before.  I must say, I&#8217;m quite impressed so far.  The fact that reading a Japanese XML document into a data structure &#8220;just works&#8221; without any tweaking is pretty nice.  I&#8217;m still coming to grips with functional programming, but I&#8217;m liking it so far.  And the best thing as far as I&#8217;m concerned is that Clojure code can easily be made <a href="http://clojure.org/concurrent_programming" target="_blank">massively parallel</a> thanks to its core data structures being immutable (&#224; la Erlang).  That will help me out a lot when I need to run my code on multi-core / multi-processor machines (which I often have to do because of the sheer amount of data being crunched).</p>
]]></content:encoded>
			<wfw:commentRss>http://vaelen.org/2009/11/28/emacs-clojure-and-japanese/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Installing CaboCha in Mac OSX with MacPorts</title>
		<link>http://vaelen.org/2009/09/24/installing-cabocha-in-mac-osx-with-macports/</link>
		<comments>http://vaelen.org/2009/09/24/installing-cabocha-in-mac-osx-with-macports/#comments</comments>
		<pubDate>Fri, 25 Sep 2009 03:46:22 +0000</pubDate>
		<dc:creator>Andrew</dc:creator>
				<category><![CDATA[Japanese]]></category>
		<category><![CDATA[Linguistics]]></category>

		<guid isPermaLink="false">http://www.vaelen.org/?p=64</guid>
		<description><![CDATA[CaboCha is a dependency parser for Japanese used by (among other things) the Japanese FrameNet project. Getting it installed and working on my Mac turned out to be more work than I had anticipated, so I thought I would post instructions for anyone who might also want to install CaboCha. Installation Steps: Install Xcode and [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://chasen.org/~taku/software/cabocha/">CaboCha</a> is a dependency parser for Japanese used by (among other things) the <a href="http://jfn.st.hc.keio.ac.jp/">Japanese FrameNet</a> project.  Getting it installed and working on my Mac turned out to be more work than I had anticipated, so I thought I would post instructions for anyone who might also want to install CaboCha.<br />
<span id="more-64"></span></p>
<p>Installation Steps:</p>
<ol>
<li>Install Xcode and the Developer Tools from your Mac <span class="caps">OSX </span>install CD if you haven’t already.</li>
<li>Install <a href="http://www.macports.org/">MacPorts</a> if you haven’t already.</li>
<li>Download this tar file: <a href="http://files.vaelen.org/cl/cabocha-macports.tar.gz">cabocha-macports.tar.gz</a></li>
<li>Extract the tar file.</li>
<li>Run install.sh from inside the directory that the tar file created.</li>
</ol>
<p>The install script first installs the following two programs via <a href="http://www.macports.org/">MacPorts</a>:</p>
<ul>
<li><a href="http://chasen-legacy.sourceforge.jp/">ChaSen</a> (a Japanese part of speech and morphological analyzer)</li>
<li><a href="http://mecab.sourceforge.net/">MeCab</a> (another Japanese part of speech and morphological analyzer that uses Conditional Random Fields)</li>
</ul>
<p>Then the install script will download and install the following programs, patching them as necessary:</p>
<ul>
<li><a href="http://chasen.org/~taku/software/TinySVM/">TinySVM</a> (A Support Vector Machines library)</li>
<li><a href="http://chasen.org/~taku/software/yamcha/">YamCha</a> (A text chunker)</li>
<li><a href="http://chasen.org/~taku/software/cabocha/">CaboCha</a> (The Japanese dependency parser we’re interested in)</li>
</ul>
<p>All of the applications will be installed into the MacPorts directory structure in /opt/local and they should therefore work out of the box without a problem.  When you run CaboCha you can choose whether to use ChaSen or MeCab for part of speech and morphological analysis.  The default is normally ChaSen, but it seems that ChaSen is no longer being actively maintained.  Also, MeCab uses Conditional Random Fields (CRF), which generally perform better than the Hidden Markov Models (HMM) used by ChaSen.  Because of this, my install script configures CaboCha to use MeCab as its default part of speech and morphological analyzer.  If you want to use ChaSen instead, you can pass CaboCha the “-a chasen” flag or change the install.sh script to use ChaSen instead of MeCab.</p>
<p><strong>Note:</strong> This script works for me in Snow Leopard, but I give no guarantees for how well it will work for you.  Also, be aware that all of these utilities work with <span class="caps">EUC</span>-JP encoded text, not <span class="caps">UTF</span>-8 encoded text, so you will need to configure your terminal to use <span class="caps">EUC</span>-JP.  If you’re using the standard Mac <span class="caps">OSX</span> Terminal app, the easiest way to do this is:</p>
<ol>
<li>Open the Terminal app.</li>
<li>Click on the ‘Terminal’ menu and then click on ‘Preferences’.</li>
<li>Click on ‘Settings’.</li>
<li>Highlight your favorite terminal profile (I use ‘Pro’ myself).</li>
<li>Click on the button with the gear on it and select ‘Duplicate Settings’.</li>
<li>Name the new profile something useful, like ‘EUC-JP’.</li>
<li>Click on the new profile and then click on the ‘Advanced’ tab.</li>
<li>Change the ‘Character encoding’ drop down to ‘Japanese (EUC)’.</li>
<li>Close the preferences window and quit the terminal app to make sure it picks up the changes.</li>
</ol>
<p>To use your new profile, relaunch the terminal app, click on the ‘Shell’ menu, select ‘New Window’, and then select ‘EUC-JP’ (or whatever you called your new profile).  This will open a new terminal window that is using <span class="caps">EUC</span>-JP for its character encoding.  You might also want to mess with the fonts for your new profile to make the Japanese characters easier to read.</p>
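<p>If the text you want to parse starts out as <span class="caps">UTF</span>-8, it also has to be re-encoded as <span class="caps">EUC</span>-JP before these tools can read it.  Here is a minimal sketch of one way to do that in Python, separate from the install script.  The function name utf8_to_eucjp and the sample file names are hypothetical and only there for illustration.</p>

```python
import tempfile
from pathlib import Path


def utf8_to_eucjp(src, dst):
    """Re-encode a UTF-8 text file as EUC-JP and return the number of bytes written."""
    text = Path(src).read_text(encoding="utf-8")
    # The default error mode is strict, so a character that EUC-JP cannot
    # represent raises UnicodeEncodeError instead of being silently dropped.
    data = text.encode("euc_jp")
    Path(dst).write_bytes(data)
    return len(data)


if __name__ == "__main__":
    # Hypothetical sample files, created in a temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "sample-utf8.txt"
        dst = Path(tmp) / "sample-eucjp.txt"
        src.write_text("太郎は花子に本を渡した。\n", encoding="utf-8")
        print(utf8_to_eucjp(src, dst), "bytes written")
```

<p>Failing loudly on characters that do not exist in <span class="caps">EUC</span>-JP seemed safer to me than replacing them, since a silently altered sentence would presumably give a misleading parse.</p>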
<p>The following websites were especially helpful in showing me what files to modify in order to get TinySVM installed properly:</p>
<ul>
<li><a href="http://www.ai.cs.scitec.kobe-u.ac.jp/~kitamura/moriwiki/index.php?Research%2FCabocha%2BYamcha%2BTinySVM%2BMeCab%E3%81%AE%E5%B0%8E%E5%85%A5">Research/Cabocha+Yamcha+TinySVM+MeCabの導入</a></li>
<li><a href="http://cl.cs.okayama-u.ac.jp/study/installmemo.html">Mac OS X Pantherに言語処理ツール</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://vaelen.org/2009/09/24/installing-cabocha-in-mac-osx-with-macports/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The 3 Year Plan</title>
		<link>http://vaelen.org/2009/07/27/the-3-year-plan/</link>
		<comments>http://vaelen.org/2009/07/27/the-3-year-plan/#comments</comments>
		<pubDate>Mon, 27 Jul 2009 19:24:45 +0000</pubDate>
		<dc:creator>Andrew</dc:creator>
				<category><![CDATA[Japan]]></category>
		<category><![CDATA[Japanese]]></category>
		<category><![CDATA[Keio University]]></category>
		<category><![CDATA[Linguistics]]></category>

		<guid isPermaLink="false">http://blog.vaelen.org/?p=6</guid>
		<description><![CDATA[Next month I graduate from The University of Texas at Austin with a bachelor’s degree in linguistics with departmental honors. In September I start my graduate studies in the same department at UT, where I’ll be working on my master’s degree with a specialization in computational linguistics. The original plan had been to apply for [...]]]></description>
			<content:encoded><![CDATA[<p>Next month I graduate from <a href="http://www.utexas.edu/">The University of Texas at Austin</a> with a bachelor’s degree in <a href="http://www.utexas.edu/cola/depts/linguistics/">linguistics</a> with departmental honors.  In September I start my graduate studies in the same department at <span class="caps">UT</span>, where I’ll be working on my master’s degree with a specialization in <a href="http://comp.ling.utexas.edu/people/students">computational linguistics</a>.  The original plan had been to apply for the PhD program at UT after completing my master’s degree, but now my plans have changed.  The new plan: <a href="http://www.keio.ac.jp/index-en.html">Keio University</a>.</p>
<p><span id="more-6"></span></p>
<p>When I went back to school in the summer of 2004, I declared myself to be double majoring in <a href="http://www.cs.utexas.edu/">computer science</a> and <a href="http://www.laits.utexas.edu/japanese/">Japanese</a>.  However, I discovered two years later – after finishing four semesters of Japanese and a handful of the undergraduate CS courses – that I wasn’t devoted enough to either of those majors to be able to complete a degree while working full time and raising a family.  Luckily though, I took an introductory linguistics class in 2006 and discovered that I really enjoyed linguistics, so I switched my major at first to Japanese and linguistics, and then to English and linguistics, and then finally to just linguistics.  But that didn’t mean that I wasn’t still interested in CS and Japanese.  In fact, I found that my linguistic interests gravitated towards computational linguistics and Japanese linguistics.</p>
<p>I’ve had a long-standing interest in studying abroad in Japan, which recently led me to start looking for linguistics programs there.  Unfortunately, I don’t speak nearly enough Japanese to attend a Japanese university where all the courses are taught in Japanese.  However, a handful of universities in Japan now offer “international” degree programs in which the language of study is English.  The Japanese government is providing grants and scholarships to promote these programs – especially at the graduate level – because they want to bring more international students into Japanese research universities.</p>
<p>While digging through lists of international graduate programs, I came across the program offered by Keio University’s <a href="http://www.st.keio.ac.jp/english/admissions/index.html">Graduate School of Science and Technology</a>.  The admissions information mentioned the need to seek out an advisor before applying to the university and it provided a convenient link to look through the list of professors.  While looking through this list, something interesting caught my eye:</p>
<blockquote><p>Ohara, Kyoko Hirose Associate Professor</p>
<p>Research Area:</p>
<p style="padding-left: 1em;">Cognitive Linguistics / Corpus Linguistics / Lexical Semantics / Contrastive analysis of Japanese and English / Natural Language Processing</p>
</blockquote>
<p><a href="http://vaelen.org/wp-content/uploads/2009/10/300px-keio_university.jpg?w=210"><img class="size-medium wp-image-84  alignleft" title="keio_university" src="http://vaelen.org/wp-content/uploads/2009/10/300px-keio_university.jpg?w=210" alt="Keio University" width="300" height="428" /></a></p>
<p>While I had seen plenty of professors with “Natural Language Processing” listed before, they were usually researching machine translation, speech recognition, or some sort of human-machine interface, usually involving robotics.  However, Ohara-sensei’s research is exactly the sort of thing that I am currently working on.  In fact, she is in the process of producing a <a href="http://jfn.st.hc.keio.ac.jp/index.html">Japanese</a> version of <a href="http://framenet.icsi.berkeley.edu/">FrameNet</a>, which is the annotated corpus I am currently using for my honors thesis research.  Also, I discovered that Keio is Japan’s <a href="http://en.wikipedia.org/wiki/Keio_University">top private university</a> and was the first university in Japan, opening a few years before the <a href="http://en.wikipedia.org/wiki/Meiji_Restoration">Meiji Restoration</a> occurred (it is named for the era that came before the <a href="http://en.wikipedia.org/wiki/Meiji_period">Meiji</a> era, which was called the <a href="http://en.wikipedia.org/wiki/Keiō">Keio</a> era).</p>
<p>I emailed Ohara-sensei and spoke with my current advisor about the possibility of studying at Keio after I finish my master’s degree.  So far it looks very promising.  My main deficiency is my lack of a solid understanding of Japanese.  Although the program is in English, and my dissertation would be in English, I would still be studying the linguistics of Japanese, and therefore I need a much better grasp of the Japanese language.  However, it will be at least 3 years before I finish my master’s degree, which will give me plenty of time to improve my Japanese to a sufficient level.  I also plan to do my master’s thesis on a topic involving Japanese linguistics that will allow me to make use of the Japanese FrameNet project, which should hopefully help my chances of being admitted to Keio.</p>
<p>After confirming that this would indeed be something that I could actually do, I spoke with my wife and daughter about it and they are both on board with the idea, as long as it is actually feasible.  I’ve discovered that the Japanese government has a scholarship available to international graduate students (the same one I mentioned above) that will pay for 3 years of tuition (which is the normal length of time that a PhD takes at Keio) and provide the student with a stipend of about $1,000 a month.  The scholarship is competitive, but I will try to get it and hope for the best.  My wife was also won over by the fact that the <a href="http://www.keio-unicorns.com/">school mascot</a> is a <a href="http://www.geocities.jp/unicorns_hp/index.html">unicorn</a>!</p>
<div id="attachment_86" class="wp-caption alignright" style="width: 310px"><a href="http://vaelen.org/wp-content/uploads/2009/10/keio-unicorns.jpg?w=300"><img class="size-medium wp-image-86  " title="keio-unicorns" src="http://vaelen.org/wp-content/uploads/2009/10/keio-unicorns.jpg?w=300" alt="The Keio Mascot" width="300" height="261" /></a><p class="wp-caption-text">The Keio Mascot</p></div>
<p>The campus I would be attending is in Yokohama, so housing should be less expensive than in Tokyo proper.  Also, we hope to teach my daughter as much Japanese as possible before then and then put her in the public school system so that she can experience the culture more closely and hopefully learn to speak fluent Japanese.</p>
<p>Of course, this is all very preliminary at the moment, and nothing is certain.  However, I’m quite serious about it and I really think it is going to happen.  I’m really looking forward to spending a few years in Japan.  It would be a once-in-a-lifetime experience!</p>
]]></content:encoded>
			<wfw:commentRss>http://vaelen.org/2009/07/27/the-3-year-plan/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
