<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Andrew Young &#187; Andrew</title>
	<atom:link href="http://vaelen.org/author/admin/feed/" rel="self" type="application/rss+xml" />
	<link>http://vaelen.org</link>
	<description>Computational Linguist, Software Engineer</description>
	<lastBuildDate>Sat, 05 Jun 2010 06:35:44 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>A Simple Esperanto Keyboard for iPhone</title>
		<link>http://vaelen.org/2010/06/05/a-simple-esperanto-keyboard-for-iphone/</link>
		<comments>http://vaelen.org/2010/06/05/a-simple-esperanto-keyboard-for-iphone/#comments</comments>
		<pubDate>Sat, 05 Jun 2010 06:35:44 +0000</pubDate>
		<dc:creator>Andrew</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://vaelen.org/?p=232</guid>
		<description><![CDATA[Today I wrote a simple Esperanto keyboard for the iPhone. It&#8217;s really just a little HTML page with some JavaScript and CSS to make it look like an iPhone app when you view it from an iPhone. It looks pretty bad if you view it outside of an iPhone though, because the buttons don&#8217;t line [...]]]></description>
			<content:encoded><![CDATA[<p><div id="attachment_233" class="wp-caption alignright" style="width: 210px"><a href="http://vaelen.org/wp-content/uploads/2010/06/klavaro.jpg"><img src="http://vaelen.org/wp-content/uploads/2010/06/klavaro-200x300.jpg" alt="" title="Klavaro" width="200" height="300" class="size-medium wp-image-233" /></a><p class="wp-caption-text">Klavaro de Esperanto</p></div>Today I wrote <a href="http://vaelen.org/klavaro/" target="_blank">a simple Esperanto keyboard for the iPhone</a>.  It&#8217;s really just a little HTML page with some JavaScript and CSS to make it look like an iPhone app when you view it from an iPhone.  It looks pretty bad if you view it outside of an iPhone though, because the buttons don&#8217;t line up right.  Maybe I&#8217;ll make it more fully featured on non-iPhone browsers, but there are better tools out there for that already.</p>
]]></content:encoded>
			<wfw:commentRss>http://vaelen.org/2010/06/05/a-simple-esperanto-keyboard-for-iphone/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>XML Generation in RPG</title>
		<link>http://vaelen.org/2010/05/19/xml-generation-in-rpg/</link>
		<comments>http://vaelen.org/2010/05/19/xml-generation-in-rpg/#comments</comments>
		<pubDate>Thu, 20 May 2010 01:02:33 +0000</pubDate>
		<dc:creator>Andrew</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[RPG]]></category>
		<category><![CDATA[Software]]></category>

		<guid isPermaLink="false">http://vaelen.org/?p=224</guid>
		<description><![CDATA[The company I work for makes heavy use of their IBM Power i midrange servers (previously known as AS/400 or iSeries servers). A lot of their software is written in the RPG programming language, which IBM originally developed back in the 1960s. The language was originally written to generate reports and lacked many &#8220;modern&#8221; programming [...]]]></description>
			<content:encoded><![CDATA[<p>The company I work for makes heavy use of their IBM Power i midrange servers (previously known as AS/400 or iSeries servers).  A lot of their software is written in the <a href="http://en.wikipedia.org/wiki/IBM_RPG">RPG programming language</a>, which IBM originally developed back in the 1960s.  The language was originally written to generate reports and lacked many &#8220;modern&#8221; programming features, such as IF statements and subroutines, which were added in RPG III.  </p>
<p>Since starting at my current company, I&#8217;ve been trying to learn the current version of RPG, which is RPG IV (aka RPGLE or ILE/RPG).  Most of the running code that I see in RPG is actually written using RPG III syntax even though RPG IV has been out since 1994, mostly because much of it was either generated programmatically or written before 1994.  My goal in learning RPG isn&#8217;t to become proficient enough to program in RPG for a living, but to become proficient enough to help our organization transition its existing systems to more modern technologies as needed.  However, my &#8220;outsider&#8221; view of RPG (coming from a Java/Perl/Ruby/etc. background) has helped me do some things with it that longtime RPG programmers might not think of trying.  This is an example of that.<br />
<span id="more-224"></span><br />
Like most large enterprise development shops these days, we use a Service Oriented Architecture (SOA) for most of our newer Java and .NET services.  The problem is integrating our legacy RPG systems into these newer SOA systems.  The cleanest approach is to use data queues (aka JMS queues).  RPG can use these queues natively, without the outside libraries it would need in order to make socket calls.  The remaining question is how to structure the data that is placed on the queue.  The best option would be to place a SOAP message &#8211; a type of XML document &#8211; onto the queue.  That way the SOA systems could pick the message up off the queue and process it just like any other SOAP message they receive.  However, this turns out to be problematic because it is difficult to produce properly formatted XML from RPG without the use of external libraries.</p>
<p>The good news is that there are products available that can help solve this problem.  (For example, <a href="http://www.rpg-xml.com/about.aspx">The RPG XML Suite</a> and <a href="http://www.scottklement.com/presentations/#FREEXML">Scott Klement&#8217;s presentation</a> on using free software to produce XML from RPG.)  However, for various reasons we are unable to use these solutions, and since we have a large pool of RPG programmers who don&#8217;t yet know Java or .NET, there will still be a need to develop applications in RPG and have them interact via XML with our SOA applications.</p>
<p>To help solve this problem, and to help me learn RPG, I wrote a program that lets the programmer build a sort of &#8220;mini DOM&#8221; for an XML document, which can then produce a properly formatted and escaped XML string.  Below is the code for the program.  However, please note that I haven&#8217;t tested it thoroughly, and since I am a novice RPG programmer it is bound to have bugs in it.  I also used lots of &#8220;newer&#8221; RPG programming concepts, such as pointers, dynamic memory allocation, variable length null terminated strings, and subprocedures (which are like subroutines except that they have their own local variables and call stack, which allow them to be run recursively).</p>
<p><code>
<pre>
      // This program implements a simple XML DOM Builder.
      // Written by Andrew Young &lt;andrew@vaelen.org&gt;

     D MAX_NODES       C                   65535

     D NODE_DEF        DS
     D   nodeId                       5U 0 INZ(1)
     D   nodeParent                   5U 0 INZ(0)
     D   nodeType                     1A   INZ('E')
     D   nodeName@                     *   INZ(*NULL)
     D   nodeNameLen                  5U 0 INZ(0)
     D   nodeValue@                    *   INZ(*NULL)
     D   nodeValLen                   5U 0 INZ(0)

     D NODE_LIST       DS
     D nodes                           *   DIM(MAX_NODES)
     D nodeEnd                        5U 0 INZ(1)

     D buildXML        PR         65535A   VARYING
     D                                 *   VALUE

     D createNode      PR              *
     D                                5U 0 VALUE

     D setNodeName     PR             3U 0
     D                                 *   VALUE
     D                            65535A   VALUE VARYING

     D getNodeName     PR         65535A   VARYING
     D                                 *   VALUE

     D setNodeValue    PR             3U 0
     D                                 *   VALUE
     D                            65535A   VALUE VARYING

     D getNodeValue    PR         65535A   VARYING
     D                                 *   VALUE

     D getNode         PR              *
     D                                5U 0 VALUE

     D getParentNode   PR              *
     D                                 *   VALUE

     D nodeTest        PR             3U 0

     D printNode       PR             3U 0
     D                                 *   VALUE

     D displayString   PR             3U 0
     D                            65535A   VARYING VALUE

     D escapeXML       PR         65535A   VARYING
     D                            65535A   VARYING VALUE

     D cleanup         PR

      /FREE
       nodeTest();
       cleanup();
       EXSR doPause;
       *INLR = *ON;
       RETURN;
      /END-FREE

      * This method pauses the output
     C     doPause       BEGSR
     C                   DSPLY                   PAUSE             1
     C                   ENDSR

     P cleanup         B
     D i               S             10I 0 INZ(1)
     D ret             S              3U 0
     D node@           S               *
     D node            DS                  LIKEDS(NODE_DEF) BASED(node@)
      /FREE
       DOW i < nodeEnd;
         node@ = getNode(i);
         IF node@ <> *NULL;
           IF node.nodeName@ <> *NULL;
             DEALLOC node.nodeName@;
             node.nodeName@ = *NULL;
           ENDIF;
           IF node.nodeValue@ <> *NULL;
             DEALLOC node.nodeValue@;
             node.nodeValue@ = *NULL;
           ENDIF;
           DEALLOC node@;
           node@ = *NULL;
         ENDIF;
         i = i + 1;
       ENDDO;
      /END-FREE
     P cleanup         E

     P nodeTest        B
     D                 PI             3U 0

     D node@           S               *
     D node            DS                  LIKEDS(NODE_DEF) BASED(node@)
     D nodeIndex       S              5U 0
     D ret             S              3U 0

      /FREE
       node@ = createNode(0);
       ret = setNodeName(node@:'documents');

       node@ = createNode(node.nodeId);
       ret = setNodeName(node@:'document');

       node@ = createNode(node.nodeId);
       ret = setNodeName(node@:'name');
       node.nodeType = 'A';
       ret = setNodeValue(node@:'My First Document');

       node@ = getParentNode(node@);

       node@ = createNode(node.nodeId);
       node.nodeType = 'T';
       ret = setNodeValue(node@:'Some text.');

       node@ = getParentNode(node@);
       node@ = getParentNode(node@);

       node@ = createNode(node.nodeId);
       ret = setNodeName(node@:'document');

       node@ = createNode(node.nodeId);
       node.nodeType = 'A';
       ret = setNodeName(node@:'name');
       ret = setNodeValue(node@:'A Second Document <>&#038;"&#038;"><');

       node@ = getParentNode(node@);

       node@ = createNode(node.nodeId);
       ret = setNodeName(node@:'field');

       node@ = createNode(node.nodeId);
       node.nodeType = 'A';
       ret = setNodeName(node@:'name');
       ret = setNodeValue(node@:'url');

       node@ = getParentNode(node@);

       node@ = createNode(node.nodeId);
       node.nodeType = 'T';
       ret = setNodeValue(node@:'http://www.google.com/');

       node@ = getParentNode(node@);
       node@ = getParentNode(node@);

       node@ = createNode(node.nodeId);
       ret = setNodeName(node@:'description');

       node@ = createNode(node.nodeId);
       node.nodeType = 'T';
       ret = setNodeValue(node@:'Here is a link: ');

       node@ = getParentNode(node@);

       node@ = createNode(node.nodeId);
       setNodeName(node@:'a');

       node@ = createNode(node.nodeId);
       node.nodeType = 'A';
       setNodeName(node@:'href');
       setNodeValue(node@:'http://www.google.com/');

       node@ = getParentNode(node@);

       node@ = createNode(node.nodeId);
       node.nodeType = 'T';
       setNodeValue(node@:'Google');

       node@ = getParentNode(node@);

       node@ = getParentNode(node@);

       node@ = createNode(node.nodeId);
       node.nodeType = 'T';
       setNodeValue(node@:', and here is some text that needs to be escaped: '+
           '"<>"&#038;><&#038;');

       nodeIndex = 1;
       DOW nodeIndex < nodeEnd;
         node@ = getNode(nodeIndex);
         ret = printNode(node@);
         nodeIndex = nodeIndex + 1;
       ENDDO;

       ret = displayString(buildXML(getNode(1)));

       RETURN 0;
      /END-FREE
     P nodeTest        E

       // This routine prints out the value of the current node
     P printNode       B
     D                 PI             3U 0
     D node@                           *   VALUE

     D node            DS                  LIKEDS(NODE_DEF) BASED(node@)
     D i               S              3U 0
      /FREE
       i = displayString('Node: ' + %char(node.nodeId));
       i = displayString('  Parent: ' + %char(node.nodeParent));
       i = displayString('    Type: ' + node.nodeType);
       i = displayString('    Name: ' + getNodeName(node@));
       i = displayString('   Value: ' + getNodeValue(node@));
       RETURN 0;
      /END-FREE
     P printNode       E

       // This routine writes the output to standard out using DSPLY.
     P displayString   B
     D                 PI             3U 0
     D output                     65535A   VARYING VALUE

     D message         S             52A   VARYING
     D i               S             10I 0 INZ(0)
     D j               S             10I 0 INZ(0)
      /FREE
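       // Display the output in chunks of up to 52 characters, which is
       // the length of the message field passed to DSPLY.
       i = 1;
       // Ignore trailing blanks only, so that i always indexes into output.
       j = %len(%trimr(output));
       // Use <= so that a final chunk of a single character is not skipped.
       DOW i <= j;
         // Assigning to the 52-character field keeps only the next chunk.
         message = %subst(output:i);
         DSPLY message;
         i = i + 52;
       ENDDO;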
       RETURN 0;
      /END-FREE
     P displayString   E

     P getNode         B
     D                 PI              *
     D nodeIndex                      5U 0 VALUE
      /FREE
       RETURN nodes(nodeIndex);
      /END-FREE
     P getNode         E

     P getParentNode   B
     D                 PI              *
     D node@                           *   VALUE

     D node            DS                  LIKEDS(NODE_DEF) BASED(node@)
     D parentNode@     S               *
      /FREE
       IF node.nodeParent > 0;
         parentNode@ = getNode(node.nodeParent);
       ENDIF;
       RETURN parentNode@;
      /END-FREE
     P getParentNode   E

     P createNode      B
     D                 PI              *
     D parentIndex                    5U 0 VALUE

     D nodeIndex       S              5U 0
     D node@           S               *
     D node            DS                  LIKEDS(NODE_DEF) BASED(node@)
     D i               S              3U 0
      /FREE
       node@ = %alloc(%size(NODE_DEF));
       node.nodeId = nodeEnd;
       node.nodeType = 'E';
       node.nodeParent = parentIndex;
       nodes(nodeEnd) = node@;
       nodeEnd = nodeEnd + 1;
       RETURN node@;
      /END-FREE
     P createNode      E

     P setNodeName     B
     D                 PI             3U 0
     D node@                           *   VALUE
     D value                      65535A   VALUE VARYING

     D node            DS                  LIKEDS(NODE_DEF) BASED(node@)
     D i               S              3U 0
      /FREE
       IF node.nodeName@ <> *NULL;
         // Deallocate currently used space
         DEALLOC node.nodeName@;
         node.nodeName@ = *NULL;
       ENDIF;
       // Allocate new space
       node.nodeNameLen = %len(value);
       node.nodeName@ = %alloc(node.nodeNameLen+1);
       %str(node.nodeName@:node.nodeNameLen+1) =
           %subst(value:1:node.nodeNameLen);
       RETURN 0;
      /END-FREE
     P setNodeName     E

     P getNodeName     B
     D                 PI         65535A   VARYING
     D node@                           *   VALUE

     D node            DS                  LIKEDS(NODE_DEF) BASED(node@)
     D value           S          65535A   VARYING
     D i               S              3U 0
      /FREE
       IF node.nodeName@ <> *NULL;
         value = %str(node.nodeName@:node.nodeNameLen+1);
       ENDIF;
       RETURN value;
      /END-FREE
     P getNodeName     E

     P setNodeValue    B
     D                 PI             3U 0
     D node@                           *   VALUE
     D value                      65535A   VALUE VARYING

     D node            DS                  LIKEDS(NODE_DEF) BASED(node@)
     D i               S              3U 0
      /FREE
       IF node.nodeValue@ <> *NULL;
         // Deallocate currently used space
         DEALLOC node.nodeValue@;
         node.nodeValue@ = *NULL;
       ENDIF;
       // Allocate new space
       node.nodeValLen = %len(value);
       node.nodeValue@ = %alloc(node.nodeValLen+1);
       %str(node.nodeValue@:node.nodeValLen+1) =
           %subst(value:1:node.nodeValLen);
       RETURN 0;
      /END-FREE
     P setNodeValue    E

     P getNodeValue    B
     D                 PI         65535A   VARYING
     D node@                           *   VALUE

     D node            DS                  LIKEDS(NODE_DEF) BASED(node@)
     D value           S          65535A   VARYING
      /FREE
       IF node.nodeValue@ <> *NULL;
         value = %str(node.nodeValue@:node.nodeValLen+1);
       ENDIF;
       RETURN value;
      /END-FREE
     P getNodeValue    E

     P buildXML        B
     D                 PI         65535A   VARYING
     D node@                           *   VALUE

     D childNode@      S               *
     D outp            S          65535A   VARYING
     D i               S             10I 0
     D node            DS                  LIKEDS(NODE_DEF) BASED(node@)
     D childNode       DS                  LIKEDS(NODE_DEF) BASED(childNode@)

      /FREE
       // This function writes out the XML structure for the current node and
       //     its children.
       IF node.nodeType = 'E';
         // Write start of element
         outp = outp + '<' + %trim(getNodeName(node@));
         // Look for child attributes
         i = 1;
         DOW i < nodeEnd;
           childNode@ = getNode(i);
           IF childNode.nodeType = 'A' AND childNode.nodeParent = node.nodeId;
             outp = outp + ' ' + %trim(getNodeName(childNode@)) + '="' +
                 escapeXML(getNodeValue(childNode@)) + '"';
           ENDIF;
           i = i + 1;
         ENDDO;
         outp = outp + '>';
         // Look for child elements and text nodes
         i = 1;
         DOW i < nodeEnd;
           childNode@ = getNode(i);
           IF childNode.nodeParent = node.nodeId;
             IF childNode.nodeType = 'T';
               outp = outp + escapeXML(getNodeValue(childNode@));
             ELSEIF childNode.nodeType = 'E';
               outp = outp + buildXML(childNode@);
             ENDIF;
           ENDIF;
           i = i + 1;
         ENDDO;
         outp = outp + '</' + %trim(getNodeName(node@)) + '>';
       ENDIF;  // node.nodeType = 'E'

       RETURN outp;

      /END-FREE

     P buildXML        E

     P escapeXML       B
     D                 PI         65535A   VARYING
     D input                      65535A   VARYING VALUE

     D inputSize       S              5U 0 INZ(0)
     D i               S              5U 0 INZ(1)
     D currentChar     S              1A   INZ(*BLANKS)
     D outputChar      S              6A   VARYING
     D output          S          65535A   VARYING
      /FREE
       inputSize = %len(input);
       DOW i <= inputSize;
         currentChar = %subst(input:i:1);
         outputChar = currentChar;
         IF currentChar = '&#038;';
           outputChar = '&amp;amp;';
         ELSEIF currentChar = '<';
           outputChar = '&amp;lt;';
         ELSEIF currentChar = '>';
           outputChar = '&amp;gt;';
         ELSEIF currentChar = '"';
           outputChar = '&amp;quot;';
         ENDIF;
         output = output + outputChar;
         i = i + 1;
       ENDDO;
       RETURN output;
      /END-FREE
     P escapeXML       E
</pre>
<p></code></p>
]]></content:encoded>
			<wfw:commentRss>http://vaelen.org/2010/05/19/xml-generation-in-rpg/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Huffman Coding, Unicode, and CJKV Data</title>
		<link>http://vaelen.org/2010/04/16/huffman-coding-and-unicode/</link>
		<comments>http://vaelen.org/2010/04/16/huffman-coding-and-unicode/#comments</comments>
		<pubDate>Fri, 16 Apr 2010 05:43:53 +0000</pubDate>
		<dc:creator>Andrew</dc:creator>
				<category><![CDATA[Japanese]]></category>
		<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://vaelen.org/?p=193</guid>
		<description><![CDATA[Today I wrote a little utility in Java that compresses a file using Huffman coding. Normally Huffman coding works on 8-bit bytes. However, because of my experience dealing with Chinese, Japanese, Korean, and other non-English text I wondered how well the coding method would work on double byte character sets. Specifically, I was curious about [...]]]></description>
			<content:encoded><![CDATA[<p>Today I wrote a little utility in Java that compresses a file using <a href="http://en.wikipedia.org/wiki/Huffman_coding">Huffman coding</a>.  Normally Huffman coding works on 8-bit bytes.  However, because of my experience dealing with Chinese, Japanese, Korean, and other non-English text, I wondered how well the coding method would work on multibyte character encodings.  Specifically, I was curious about compressing <a href="http://en.wikipedia.org/wiki/UTF-8">UTF-8</a> text.<br />
<span id="more-193"></span><br />
UTF-8 is a variable-length encoding for Unicode data that stores each character in one to four bytes.  This works great when the text is written mostly (or completely) with the Latin alphabet, but other scripts require two or more bytes per character (three for most Chinese, Japanese, and Korean characters), which causes the file to become bloated very quickly.  (This is one of the reasons that Asian countries have been slow to adopt Unicode.)  Also, UTF-8 is designed so that a program can find character boundaries starting from any position in a file, despite each character taking up a different number of bytes.  This is a traditional problem with double-byte encodings that use shift in/out characters.  With these older encodings, in order to know if a pair of bytes is a single character, the program must also know what the last shift character was.  UTF-8 solves this problem by setting the high bit of every byte in a multibyte sequence, with the bits that follow distinguishing a lead byte from a continuation byte.  The tradeoff is that the bytes of a multibyte sequence carry at most 6 bits of character data each (lead bytes carry fewer), which is why UTF-8 needs up to four bytes per character instead of only two.  (UTF-16 has a similar complication in the form of surrogate pairs, but I&#8217;m not going to bother with that in this post.)</p>
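<p>The byte layout described above can be checked with a few lines of Java.  This is only a sketch: the class <code>Utf8Bytes</code> and its methods are names invented for illustration and are not part of the utility described in this post.</p>

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for inspecting how UTF-8 lays out characters as bytes.
final class Utf8Bytes {
    /** Number of bytes UTF-8 needs for a single Unicode code point. */
    static int encodedLength(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** True for continuation bytes, whose top two bits are 10. */
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    /** Counts characters by counting every byte that is not a continuation byte. */
    static int countCharacters(byte[] utf8) {
        int count = 0;
        for (byte b : utf8) {
            if (!isContinuation(b)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(encodedLength('A'));     // a Latin letter
        System.out.println(encodedLength(0x4E2D));  // the Chinese character U+4E2D
        byte[] mixed = "a\u4e2db".getBytes(StandardCharsets.UTF_8);
        System.out.println(mixed.length + " bytes, " + countCharacters(mixed) + " characters");
    }
}
```

<p>Because a continuation byte can be recognized on its own, <code>countCharacters</code> never needs to remember any state from earlier in the file, which is exactly what the shift in/out encodings cannot offer.</p>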
<p>Huffman coding is also a variable-length encoding.  It uses the distribution of bytes in a given document to assign shorter bit sequences to more common bytes and longer bit sequences to less common bytes.  This means that the most common bytes only take up a few bits (as few as 1 or 2), whereas very uncommon bytes can take up more than 8 bits (and thus take more space to store than they did in the original document).  In the worst case, when all 256 byte values are equally common, every byte gets an 8-bit code and the encoded data is the same size as the original.  In practice the final product is usually a file that is much smaller than the original.</p>
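<p>To make the idea concrete, here is a minimal Java sketch of the classic byte-oriented algorithm.  It only computes the code length for each byte value and the total size of the encoded data in bits; it does not write a compressed file, and it is not the code of the utility described in this post.  The names <code>HuffmanSketch</code>, <code>byteFrequencies</code>, <code>codeLengths</code>, and <code>encodedBits</code> are made up for the example.</p>

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical sketch of Huffman code construction over 8-bit bytes.
final class HuffmanSketch {
    private static final class Node {
        final int symbol;    // byte value for leaves, -1 for internal nodes
        final long weight;   // number of occurrences below this node
        final Node left;
        final Node right;

        Node(int symbol, long weight, Node left, Node right) {
            this.symbol = symbol;
            this.weight = weight;
            this.left = left;
            this.right = right;
        }
    }

    /** Counts how often each byte value (0..255) occurs. */
    static Map<Integer, Long> byteFrequencies(byte[] data) {
        Map<Integer, Long> freq = new HashMap<>();
        for (byte b : data) {
            freq.merge(b & 0xFF, 1L, Long::sum);
        }
        return freq;
    }

    /** Builds the Huffman tree and returns the code length in bits for each symbol. */
    static Map<Integer, Integer> codeLengths(Map<Integer, Long> freq) {
        PriorityQueue<Node> queue =
            new PriorityQueue<>((a, b) -> Long.compare(a.weight, b.weight));
        for (Map.Entry<Integer, Long> e : freq.entrySet()) {
            queue.add(new Node(e.getKey(), e.getValue(), null, null));
        }
        Map<Integer, Integer> lengths = new HashMap<>();
        if (queue.isEmpty()) {
            return lengths;
        }
        // Repeatedly merge the two least common subtrees into one.
        while (queue.size() > 1) {
            Node a = queue.poll();
            Node b = queue.poll();
            queue.add(new Node(-1, a.weight + b.weight, a, b));
        }
        assign(queue.poll(), 0, lengths);
        return lengths;
    }

    private static void assign(Node node, int depth, Map<Integer, Integer> lengths) {
        if (node.left == null) {
            // A file with a single distinct symbol still needs one bit per symbol.
            lengths.put(node.symbol, Math.max(depth, 1));
            return;
        }
        assign(node.left, depth + 1, lengths);
        assign(node.right, depth + 1, lengths);
    }

    /** Total size of the encoded data in bits, not counting the code table. */
    static long encodedBits(Map<Integer, Long> freq) {
        Map<Integer, Integer> lengths = codeLengths(freq);
        long bits = 0;
        for (Map.Entry<Integer, Long> e : freq.entrySet()) {
            bits += e.getValue() * lengths.get(e.getKey());
        }
        return bits;
    }
}
```

<p>For the seven-byte input <code>aaaabbc</code>, the most common byte gets a 1-bit code and the other two get 2-bit codes, so the data shrinks from 56 bits to 10 bits before the code table is added.</p>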
<p>Because the Huffman coding algorithm doesn&#8217;t know about multibyte sequences in UTF-8 documents, it doesn&#8217;t take into account the fact that certain bytes (those with the high bit set) only occur together with other bytes and therefore their distribution is not independent.  This means that the <a href="http://en.wikipedia.org/wiki/Entropy_(information_theory)">entropy</a> of the file is actually smaller than the algorithm thinks it is and therefore the algorithm needs fewer bits to represent the file than it realizes.  In order to make use of multibyte sequences &#8211; not only in UTF-8, but in other multibyte encodings as well &#8211; I modified the algorithm to think in terms of Unicode characters instead of bytes.  By converting the file from its native encoding into a stream of characters before performing the Huffman coding calculations, the entropy of the file is reduced and the file can be stored more compactly.</p>
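<p>The effect of switching from bytes to characters can be estimated without writing a full compressor.  The Java sketch below computes the zeroth-order Shannon entropy of a symbol stream, which is a lower bound on the number of bits a Huffman code needs for the symbols themselves (it ignores the cost of storing the code table).  Decoding the same bytes as ISO-8859-1 yields one symbol per byte, while decoding them as UTF-8 yields one symbol per character.  The class and method names are assumptions made for this example.</p>

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: compare symbol statistics for bytes versus characters.
final class SymbolEntropy {
    /** Decodes the data with the given charset and returns one int per code point. */
    static int[] symbols(byte[] data, Charset charset) {
        return new String(data, charset).codePoints().toArray();
    }

    /** Total zeroth-order entropy of the stream in bits: sum of -log2(p) over all symbols. */
    static double entropyBits(int[] symbols) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int s : symbols) {
            counts.merge(s, 1, Integer::sum);
        }
        double bits = 0.0;
        for (int count : counts.values()) {
            double p = (double) count / symbols.length;
            bits += count * (-Math.log(p) / Math.log(2));
        }
        return bits;
    }

    public static void main(String[] args) {
        // Two Chinese characters (U+4E2D, U+6587), each repeated twice.
        byte[] utf8 = "\u4e2d\u6587\u4e2d\u6587".getBytes(StandardCharsets.UTF_8);
        double asBytes = entropyBits(symbols(utf8, StandardCharsets.ISO_8859_1));
        double asChars = entropyBits(symbols(utf8, StandardCharsets.UTF_8));
        System.out.println("bytes: " + asBytes + " bits, characters: " + asChars + " bits");
    }
}
```

<p>In this tiny example the byte view sees twelve symbols drawn from six distinct values, while the character view sees four symbols drawn from two, so the character view needs far fewer bits.  A real file has a much larger character alphabet, so the size of the code table matters more there than it does here.</p>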
<p>Since a file written in an Asian language will consist mostly of Unicode characters that translate to two or more bytes in UTF-8, it stands to reason that the standard Huffman coding algorithm won&#8217;t produce as compact a file as a Huffman coding algorithm that works in terms of characters instead of in terms of bytes.  To test this I compressed a Chinese text file with both the standard and improved algorithms and compared the results.  Although my encoding method didn&#8217;t save as much space as other common compression utilities such as GZip, I still saw an improvement when encoding characters instead of raw bytes.</p>
<p>Compressing the raw UTF-8 bytes of a 69KB Chinese file created a 48KB file.  Compressing the underlying characters created a 32KB file, which is a significant improvement.  Likewise, starting with a 46KB <a href="http://en.wikipedia.org/wiki/GBK">GBK</a> encoded version of the same file created a 34KB file when compressing the raw bytes and a 32KB file when compressing characters instead.  (As a matter of fact, compressing characters creates the same file in both cases because the input characters are converted to Unicode internally.)</p>
<p>Here is the raw data from the experiment:</p>
<table>
<tbody>
<tr>
<th>File Description</th>
<th>File Size</th>
</tr>
<tr>
<td>Original File &#8211; UTF-8 Encoded</td>
<td>69 KB</td>
</tr>
<tr>
<td>Compressed File &#8211; Huffman Bytes (UTF-8 Source)</td>
<td>48 KB</td>
</tr>
<tr>
<td>Original File &#8211; <a href="http://en.wikipedia.org/wiki/GBK#Encoding">GBK</a> Encoded</td>
<td>46 KB</td>
</tr>
<tr>
<td>Compressed File &#8211; Huffman Bytes (GBK Source)</td>
<td>34 KB</td>
</tr>
<tr>
<td>Compressed File &#8211; Huffman Characters (GBK or UTF-8 Source)</td>
<td>32 KB</td>
</tr>
<tr>
<td>Compressed File &#8211; GZipped UTF-8</td>
<td>27 KB</td>
</tr>
<tr>
<td>Compressed File &#8211; GZipped GBK</td>
<td>24 KB</td>
</tr>
</tbody>
</table>
<p><span style="font-size: 75%">Note: The &#8220;Huffman Bytes&#8221; files were created using the same algorithm as the &#8220;Huffman Characters&#8221; files, except the input encoding was set to ISO-8859-1 so that the algorithm would ignore multibyte sequences in the input files and work purely with 8-bit bytes.  This produces the same result that the original Huffman coding would produce.</span></p>
]]></content:encoded>
			<wfw:commentRss>http://vaelen.org/2010/04/16/huffman-coding-and-unicode/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>First Release of Japanese Dependency Vectors</title>
		<link>http://vaelen.org/2009/12/28/first-release-of-japanese-dependency-vectors/</link>
		<comments>http://vaelen.org/2009/12/28/first-release-of-japanese-dependency-vectors/#comments</comments>
		<pubDate>Mon, 28 Dec 2009 10:10:26 +0000</pubDate>
		<dc:creator>Andrew</dc:creator>
				<category><![CDATA[JPDV]]></category>
		<category><![CDATA[Japanese]]></category>
		<category><![CDATA[Linguistics]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Software]]></category>

		<guid isPermaLink="false">http://vaelen.org/?p=152</guid>
		<description><![CDATA[At the end of last semester I finished the first version of Japanese Dependency Vectors (jpdv).  I had to give up on using Clojure at the last minute because it was taking me too long to make progress and I needed to have some sort of a working system to turn in for my NLP [...]]]></description>
			<content:encoded><![CDATA[<p>At the end of last semester I finished the first version of Japanese Dependency Vectors (jpdv).  I had to give up on using Clojure at the last minute because it was taking me too long to make progress and I needed to have some sort of a working system to turn in for my NLP final project.</p>
<p>To accomplish this I rewrote jpdv in Java.  It took me about 18 hours of solid coding, minus time for food of course. <img height="20" width="20" class="emoji" src="/wp-includes/images/emoji/e057.png"/></p>
<p>The software can now generate both context-based and dependency-based vector spaces for Japanese text that has been pre-parsed with CaboCha.  It can also generate a similarity matrix for a given vector space using the cosine similarity measurement.  I still need to add a path selection function to throw out paths that are too long and a basis element selection function that determines which N basis elements to keep out of all those discovered, but I will add those to the next release.  I&#8217;m thinking of writing the path selection and basis element selection functions as <a href="http://groovy.codehaus.org/" target="_blank">Groovy</a> scripts so that they can be supplied at run time.  This would allow for better customization of the system at run time for a given task.</p>
<p>More information can be found <a href="http://vaelen.org/software/jpdv/">here</a> and on the <a href="http://github.com/vaelen/jpdv" target="_blank">GitHub page</a>.</p>
<p>Here is an example similarity matrix generated by the current version of jpdv:</p>
<table>
<tr>
<th>WORD</th>
<th>コンピュータ</th>
<th>兄弟</th>
<th>緑</th>
<th>赤い</th>
<th>電話</th>
<th>青い</th>
<th>黒い</th>
</tr>
<tr>
<th>コンピュータ</th>
<td>1.00000</td>
<td>0.06506</td>
<td>0.07563</td>
<td>0.00000</td>
<td>0.07760</td>
<td>0.00000</td>
<td>0.00000</td>
</tr>
<tr>
<th>兄弟</th>
<td>0.06506</td>
<td>1.00000</td>
<td>0.19929</td>
<td>0.00000</td>
<td>0.14947</td>
<td>0.00000</td>
<td>0.00000</td>
</tr>
<tr>
<th>緑</th>
<td>0.07563</td>
<td>0.19929</td>
<td>0.99999</td>
<td>0.00000</td>
<td>0.19833</td>
<td>0.00000</td>
<td>0.00000</td>
</tr>
<tr>
<th>赤い</th>
<td>0.00000</td>
<td>0.00000</td>
<td>0.00000</td>
<td>1.00000</td>
<td>0.00000</td>
<td>0.00000</td>
<td>0.01352</td>
</tr>
<tr>
<th>電話</th>
<td>0.07760</td>
<td>0.14947</td>
<td>0.19833</td>
<td>0.00000</td>
<td>1.00000</td>
<td>0.00000</td>
<td>0.00000</td>
</tr>
<tr>
<th>青い</th>
<td>0.00000</td>
<td>0.00000</td>
<td>0.00000</td>
<td>0.00000</td>
<td>0.00000</td>
<td>1.00000</td>
<td>0.00000</td>
</tr>
<tr>
<th>黒い</th>
<td>0.00000</td>
<td>0.00000</td>
<td>0.00000</td>
<td>0.01352</td>
<td>0.00000</td>
<td>0.00000</td>
<td>1.00000</td>
</tr>
</table>
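<p>Each cell in a matrix like the one above is the cosine similarity of two word vectors.  The following Java sketch shows the measurement for sparse vectors stored as maps from basis element to weight.  It is an illustration only and is not taken from the jpdv source; the class name <code>CosineSimilarity</code> and the use of string keys are assumptions.</p>

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of cosine similarity between two sparse vectors.
final class CosineSimilarity {
    /** Returns dot(a, b) / (|a| * |b|), or 0.0 if either vector is all zeros. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            // Only basis elements present in both vectors contribute to the dot product.
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Double> a = new HashMap<>();
        a.put("x", 1.0);
        Map<String, Double> b = new HashMap<>();
        b.put("x", 1.0);
        b.put("y", 1.0);
        System.out.println(cosine(a, b));
    }
}
```

<p>Two words that share no basis elements get a similarity of exactly zero, which matches the many 0.00000 cells in the matrix.  A vector compared with itself should give 1, but floating-point rounding can leave the result slightly below that, which may be why one diagonal cell reads 0.99999.</p>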
]]></content:encoded>
			<wfw:commentRss>http://vaelen.org/2009/12/28/first-release-of-japanese-dependency-vectors/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Japanese Dependency Vectors</title>
		<link>http://vaelen.org/2009/12/03/japanese-dependency-vectors/</link>
		<comments>http://vaelen.org/2009/12/03/japanese-dependency-vectors/#comments</comments>
		<pubDate>Thu, 03 Dec 2009 20:32:35 +0000</pubDate>
		<dc:creator>Andrew</dc:creator>
				<category><![CDATA[JPDV]]></category>
		<category><![CDATA[Japanese]]></category>
		<category><![CDATA[Linguistics]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[Software]]></category>

		<guid isPermaLink="false">http://vaelen.org/?p=127</guid>
		<description><![CDATA[I&#8217;ve been working on a new project I call &#8220;Japanese Dependency Vectors&#8221; or &#8220;jpdv&#8221; for short.  It&#8217;s a program that generates  dependency based semantic vector spaces for Japanese text.  (There&#8217;s already an excellent tool for doing this with English, which was written by Sebastian Pado.) However, jpdv still has a way to go before it [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://vaelen.org/wp-content/uploads/2009/12/jpdv-tree11.png"><img class="alignleft size-medium wp-image-136" title="Japanese Dependency Tree" src="http://vaelen.org/wp-content/uploads/2009/12/jpdv-tree11-277x300.png" alt="" width="277" height="300" /></a>I&#8217;ve been working on a new project I call &#8220;Japanese Dependency Vectors&#8221; or &#8220;<a href="http://github.com/vaelen/jpdv" target="_blank">jpdv</a>&#8221; for short.  It&#8217;s a program that generates  <a href="http://en.wikipedia.org/wiki/Dependency_grammar" target="_blank">dependency</a> based semantic <a href="http://en.wikipedia.org/wiki/Vector_space_model" target="_blank">vector spaces</a> for Japanese text.  (There&#8217;s already an excellent <a href="http://www.nlpado.de/~sebastian/dv.html" target="_blank">tool</a> for doing this with English, which was written by <a href="http://www.nlpado.de/~sebastian/software.html" target="_blank">Sebastian Pado</a>.)</p>
<p>However, jpdv still has a way to go before it works as promised.  So far the tool can parse <a href="http://chasen.org/~taku/software/cabocha/" target="_blank">CaboCha</a> formatted XML and produce both a word co-occurrence based vector space and a slightly modified XML representation that better demonstrates the dependency relationships of the words in the text.  The next step is to use the dependency information to produce the vector space that I need.  Unfortunately, I only have until the end of next week to finish it, because I&#8217;m working on this as the final project in my NLP class this semester.  I also plan to use the vector spaces created by the tool to do word sense disambiguation for the <a href="http://semeval2.fbk.eu/" target="_blank">SEMEVAL-2</a> shared task on <a href="http://lr-www.pi.titech.ac.jp/wsd.html" target="_blank">Japanese WSD</a>.</p>
<p>(The image included here was generated by jpdv as a LaTeX file from one of the sentences I&#8217;m using for testing.)</p>
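<p>To give a rough idea of what a word co-occurrence based vector space looks like, here is a small sketch in Python (not the Clojure code that jpdv actually uses). It assumes the text has already been split into tokens, which in practice would come from the parser output. The sentences, the window size, and the function name <code>cooccurrence_vectors</code> are all made up for illustration.</p>

```python
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` tokens of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # a word does not co-occur with itself at the same position
                    vectors[word][tokens[j]] += 1
    return vectors


# Hypothetical pre-tokenized sentences.
sentences = [
    ["赤い", "電話", "が", "鳴る"],
    ["緑", "の", "電話", "が", "鳴る"],
]

vectors = cooccurrence_vectors(sentences, window=2)
for word in sorted(vectors):
    print(word, dict(vectors[word]))
```

<p>A dependency based space would replace the fixed window with the words reachable through dependency relations, which is the part that still needs to be written.</p>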
]]></content:encoded>
			<wfw:commentRss>http://vaelen.org/2009/12/03/japanese-dependency-vectors/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Emacs, Clojure, and Japanese</title>
		<link>http://vaelen.org/2009/11/28/emacs-clojure-and-japanese/</link>
		<comments>http://vaelen.org/2009/11/28/emacs-clojure-and-japanese/#comments</comments>
		<pubDate>Sat, 28 Nov 2009 07:26:04 +0000</pubDate>
		<dc:creator>Andrew</dc:creator>
				<category><![CDATA[Clojure]]></category>
		<category><![CDATA[JPDV]]></category>
		<category><![CDATA[Japanese]]></category>
		<category><![CDATA[Linguistics]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[Software]]></category>

		<guid isPermaLink="false">http://vaelen.org/?p=120</guid>
		<description><![CDATA[This might be proof that I&#8217;m crazy: I&#8217;m working on a project for my NLP class that involves generating a semantic vector space for Japanese text, and I decided that this might be a good time to learn one of the LISP dialects.  I&#8217;ve been looking at Clojure for a while now, but I hadn&#8217;t [...]]]></description>
			<content:encoded><![CDATA[<p>This might be proof that I&#8217;m crazy:</p>
<p><a href="http://vaelen.org/wp-content/uploads/2009/11/emacs-jp-clojure.png"><img class="alignnone size-full wp-image-121" title="Emacs, Clojure, and Japanese" src="http://vaelen.org/wp-content/uploads/2009/11/emacs-jp-clojure.png" alt="" width="568" height="356" /></a></p>
<p><span id="more-120"></span></p>
<p>I&#8217;m working on a <a href="http://github.com/vaelen/jpdv">project</a> for my NLP class that involves generating a <a href="http://en.wikipedia.org/wiki/Vector_space_model" target="_blank">semantic vector space</a> for Japanese text, and I decided that this might be a good time to learn one of the Lisp dialects.  I&#8217;ve been looking at <a href="http://clojure.org/" target="_blank">Clojure</a> for a while now, but I hadn&#8217;t taken the time to learn it before.  I must say, I&#8217;m quite impressed so far.  The fact that reading a Japanese XML document into a data structure &#8220;just works&#8221; without any tweaking is pretty nice.  I&#8217;m still coming to grips with functional programming, but I&#8217;m liking it so far.  And the best thing as far as I&#8217;m concerned is that Clojure code can easily be made <a href="http://clojure.org/concurrent_programming" target="_blank">massively parallel</a> thanks to its core data structures being immutable (à la Erlang).  That will help me out a lot when I need to run my code on multi-core / multi-processor machines (which I often have to do because of the sheer amount of data being crunched).</p>
]]></content:encoded>
			<wfw:commentRss>http://vaelen.org/2009/11/28/emacs-clojure-and-japanese/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Installing CaboCha in Mac OSX with MacPorts</title>
		<link>http://vaelen.org/2009/09/24/installing-cabocha-in-mac-osx-with-macports/</link>
		<comments>http://vaelen.org/2009/09/24/installing-cabocha-in-mac-osx-with-macports/#comments</comments>
		<pubDate>Fri, 25 Sep 2009 03:46:22 +0000</pubDate>
		<dc:creator>Andrew</dc:creator>
				<category><![CDATA[Japanese]]></category>
		<category><![CDATA[Linguistics]]></category>

		<guid isPermaLink="false">http://www.vaelen.org/?p=64</guid>
		<description><![CDATA[CaboCha is a dependency parser for Japanese used by (among other things) the Japanese FrameNet project. Getting it installed and working on my Mac turned out to be more work than I had anticipated, so I thought I would post instructions for anyone who might also want to install CaboCha. Installation Steps: Install Xcode and [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://chasen.org/~taku/software/cabocha/">CaboCha</a> is a dependency parser for Japanese used by (among other things) the <a href="http://jfn.st.hc.keio.ac.jp/">Japanese FrameNet</a> project.  Getting it installed and working on my Mac turned out to be more work than I had anticipated, so I thought I would post instructions for anyone who might also want to install CaboCha.<br />
<span id="more-64"></span></p>
<p>Installation Steps:</p>
<ol>
<li>Install Xcode and the Developer tools from your Mac <span class="caps">OSX</span> install CD if you haven’t already.</li>
<li>Install <a href="http://www.macports.org/">MacPorts</a> if you haven’t already.</li>
<li>Download this tar file: <a href="http://files.vaelen.org/cl/cabocha-macports.tar.gz">cabocha-macports.tar.gz</a></li>
<li>Extract the tar file.</li>
<li>Run install.sh from inside the directory that the tar file created.</li>
</ol>
<p>The install script first installs the following two programs via <a href="http://www.macports.org/">MacPorts</a>:</p>
<ul>
<li><a href="http://chasen-legacy.sourceforge.jp/">ChaSen</a> (a Japanese part of speech and morphological analyzer)</li>
<li><a href="http://mecab.sourceforge.net/">MeCab</a> (another Japanese part of speech and morphological analyzer)</li>
</ul>
<p>Then the install script will download and install the following programs, patching them as necessary:</p>
<ul>
<li><a href="http://chasen.org/~taku/software/TinySVM/">TinySVM</a> (A Support Vector Machines library)</li>
<li><a href="http://chasen.org/~taku/software/yamcha/">YamCha</a> (A text chunker)</li>
<li><a href="http://chasen.org/~taku/software/cabocha/">CaboCha</a> (The Japanese dependency parser we’re interested in)</li>
</ul>
<p>All of the applications will be installed into the MacPorts directory structure in /opt/local and they should therefore work out of the box without a problem.  When you run CaboCha you can choose whether to use ChaSen or MeCab for part of speech and morphological analysis.  The default is normally ChaSen, but it seems that ChaSen is no longer being actively maintained.  Also, MeCab uses Conditional Random Fields (CRF) which generally perform better than the Hidden Markov Models (HMM) used by ChaSen.  Because of this, my install script configures CaboCha to use MeCab as its default part of speech and morphological analyzer.  If you want to use ChaSen instead, you can pass CaboCha the “-a chasen” flag or change the install.sh script to use ChaSen instead of MeCab.</p>
<p><strong>Note:</strong> This script works for me in Snow Leopard, but I give no guarantees for how well it will work for you.  Also, be aware that all of these utilities work with <span class="caps">EUC</span>-JP encoded text, not <span class="caps">UTF</span>-8 encoded text, so you will need to configure your terminal to use <span class="caps">EUC</span>-JP.  If you’re using the standard Mac <span class="caps">OSX</span> Terminal app, the easiest way to do this is:</p>
<ol>
<li>Open the Terminal app.</li>
<li>Click on the ‘Terminal’ menu and then click on ‘Preferences’.</li>
<li>Click on ‘Settings’.</li>
<li>Highlight your favorite terminal profile (I use ‘Pro’ myself).</li>
<li>Click on the button with the gear on it and select ‘Duplicate Settings’.</li>
<li>Name the new profile something useful, like ‘EUC-JP’.</li>
<li>Click on the new profile and then click on the ‘Advanced’ tab.</li>
<li>Change the ‘Character encoding’ drop down to ‘Japanese (EUC)’.</li>
<li>Close the preferences window and quit the terminal app to make sure it picks up the changes.</li>
</ol>
<p>To use your new profile, relaunch the terminal app, click on the ‘Shell’ menu, select ‘New Window’, and then select ‘EUC-JP’ (or whatever you called your new profile).  This will open a new terminal window that is using <span class="caps">EUC</span>-JP for its character encoding.  You might also want to mess with the fonts for your new profile to make the Japanese characters easier to read.</p>
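<p>If your own text files are in UTF-8, they will need converting before these tools can read them, and the output will need converting back. Here is a minimal sketch using Python’s built-in <code>euc_jp</code> codec. The helper names <code>utf8_to_eucjp</code> and <code>eucjp_to_utf8</code> and the sample sentence are just for illustration.</p>

```python
def utf8_to_eucjp(data: bytes) -> bytes:
    """Re-encode UTF-8 bytes as EUC-JP (raises UnicodeEncodeError for unmappable characters)."""
    return data.decode("utf-8").encode("euc_jp")


def eucjp_to_utf8(data: bytes) -> bytes:
    """Re-encode EUC-JP bytes as UTF-8."""
    return data.decode("euc_jp").encode("utf-8")


# Hypothetical sample sentence.
text = "赤い電話が鳴る"
utf8_bytes = text.encode("utf-8")
eucjp_bytes = utf8_to_eucjp(utf8_bytes)

# The byte lengths differ because the two encodings use different widths for kana and kanji.
print(len(utf8_bytes), len(eucjp_bytes))

# Converting back should give the original text.
print(eucjp_to_utf8(eucjp_bytes).decode("utf-8") == text)
```

<p>Characters that have no EUC-JP equivalent will raise an error during encoding, so it is worth checking your input before handing it to the parser.</p>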
<p>The following websites were especially helpful in showing me what files to modify in order to get TinySVM installed properly:</p>
<ul>
<li><a href="http://www.ai.cs.scitec.kobe-u.ac.jp/~kitamura/moriwiki/index.php?Research%2FCabocha%2BYamcha%2BTinySVM%2BMeCab%E3%81%AE%E5%B0%8E%E5%85%A5">Research/Cabocha+Yamcha+TinySVM+MeCabの導入</a></li>
<li><a href="http://cl.cs.okayama-u.ac.jp/study/installmemo.html">Mac OS X Pantherに言語処理ツール</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://vaelen.org/2009/09/24/installing-cabocha-in-mac-osx-with-macports/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>New Cubicle</title>
		<link>http://vaelen.org/2009/09/15/new-cubicle/</link>
		<comments>http://vaelen.org/2009/09/15/new-cubicle/#comments</comments>
		<pubDate>Tue, 15 Sep 2009 12:39:34 +0000</pubDate>
		<dc:creator>Andrew</dc:creator>
				<category><![CDATA[Student Life]]></category>

		<guid isPermaLink="false">http://www.vaelen.org/2009/09/15/new-cubicle/</guid>
		<description><![CDATA[I got my new cubicle assignment yesterday. The room is only around 5′×5′ and I share it with another graduate student, but it has a door that locks and a bookshelf where I can keep some of my books.]]></description>
			<content:encoded><![CDATA[<p>I got my new cubicle assignment yesterday. The room is only around 5′×5′ and I share it with another graduate student, but it has a door that locks and a bookshelf where I can keep some of my books.</p>
<p><a href="http://vaelen.org/wp-content/uploads/2009/10/p_1600_1200_533515ad-a8fd-4f0a-a93d-b2696daf10bd.jpeg"><img class="alignnone size-full wp-image-364" src="http://vaelen.org/wp-content/uploads/2009/10/p_1600_1200_533515ad-a8fd-4f0a-a93d-b2696daf10bd.jpeg" alt="" width="225" height="300" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://vaelen.org/2009/09/15/new-cubicle/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Effect of Selectional Preferences on Semantic Role Labeling</title>
		<link>http://vaelen.org/2009/08/19/the-effect-of-selectional-preferences-on-semantic-role-labeling/</link>
		<comments>http://vaelen.org/2009/08/19/the-effect-of-selectional-preferences-on-semantic-role-labeling/#comments</comments>
		<pubDate>Wed, 19 Aug 2009 19:23:04 +0000</pubDate>
		<dc:creator>Andrew</dc:creator>
				<category><![CDATA[Linguistics]]></category>

		<guid isPermaLink="false">http://blog.vaelen.org/?p=3</guid>
		<description><![CDATA[My undergraduate honors thesis has been approved by my advisor and is now available online: Andrew Young. The Effect of Selectional Preferences on Semantic Role Labeling. Undergraduate Honors Thesis, The University of Texas at Austin. It ended up being almost 60 pages and around 6000 words (according to a LaTeX word count tool I found.)]]></description>
			<content:encoded><![CDATA[<p>My undergraduate honors thesis has been approved by my advisor and is now available online:</p>
<ul>
<li>Andrew Young. <a href="/honors_thesis.pdf" target="_blank">The Effect of Selectional Preferences on Semantic Role Labeling.</a> Undergraduate Honors Thesis, The University of Texas at Austin.</li>
</ul>
<p>It ended up being almost 60 pages and around 6000 words (according to a <a href="http://lwc.sourceforge.net/">LaTeX word count tool</a> I found.)</p>
]]></content:encoded>
			<wfw:commentRss>http://vaelen.org/2009/08/19/the-effect-of-selectional-preferences-on-semantic-role-labeling/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The 3 Year Plan</title>
		<link>http://vaelen.org/2009/07/27/the-3-year-plan/</link>
		<comments>http://vaelen.org/2009/07/27/the-3-year-plan/#comments</comments>
		<pubDate>Mon, 27 Jul 2009 19:24:45 +0000</pubDate>
		<dc:creator>Andrew</dc:creator>
				<category><![CDATA[Japan]]></category>
		<category><![CDATA[Japanese]]></category>
		<category><![CDATA[Keio University]]></category>
		<category><![CDATA[Linguistics]]></category>

		<guid isPermaLink="false">http://blog.vaelen.org/?p=6</guid>
		<description><![CDATA[Next month I graduate from The University of Texas at Austin with a bachelor’s degree in linguistics with departmental honors. In September I start my graduate studies in the same department at UT, where I’ll be working on my master’s degree with a specialization in computational linguistics. The original plan had been to apply for [...]]]></description>
			<content:encoded><![CDATA[<p>Next month I graduate from <a href="http://www.utexas.edu/">The University of Texas at Austin</a> with a bachelor’s degree in <a href="http://www.utexas.edu/cola/depts/linguistics/">linguistics</a> with departmental honors.  In September I start my graduate studies in the same department at <span class="caps">UT</span>, where I’ll be working on my master’s degree with a specialization in <a href="http://comp.ling.utexas.edu/people/students">computational linguistics</a>.  The original plan had been to apply for the PhD program at UT after completing my master’s degree, but now my plans have changed.  The new plan: <a href="http://www.keio.ac.jp/index-en.html">Keio University</a>.</p>
<p><span id="more-6"></span></p>
<p>When I went back to school in the summer of 2004, I declared myself to be double majoring in <a href="http://www.cs.utexas.edu/">computer science</a> and <a href="http://www.laits.utexas.edu/japanese/">Japanese</a>.  However, I discovered two years later – after finishing four semesters of Japanese and a handful of the undergraduate CS courses – that I wasn’t devoted enough to either of those majors to be able to complete a degree while working full time and raising a family.  Luckily though, I took an introductory linguistics class in 2006 and discovered that I really enjoyed linguistics, so I switched my major at first to Japanese and linguistics, and then to English and linguistics, and then finally to just linguistics.  But that didn’t mean that I wasn’t still interested in CS and Japanese.  In fact, I found that my linguistic interests gravitated towards computational linguistics and Japanese linguistics.</p>
<p>I’ve had a longtime interest in studying abroad in Japan, which recently led me to start looking for linguistics programs there.  Unfortunately, I don’t speak nearly enough Japanese to attend a Japanese university where all the courses are taught in Japanese.  However, a handful of universities in Japan now offer “international” degree programs where the language of study is English.  The Japanese government is providing grants and scholarships to promote these programs – especially at the graduate level – because it wants to bring more international students into Japanese research universities.</p>
<p>While digging through lists of international graduate programs, I came across the program offered by Keio University’s <a href="http://www.st.keio.ac.jp/english/admissions/index.html">Graduate School of Science and Technology</a>.  The admissions information mentioned the need to seek out an advisor before applying to the university and it provided a convenient link to look through the list of professors.  While looking through this list, something interesting caught my eye:</p>
<blockquote><p>Ohara, Kyoko Hirose Associate Professor</p>
<p>Research Area:</p>
<p style="padding-left: 1em;">Cognitive Linguistics / Corpus Linguistics / Lexical Semantics / Contrastive analysis of Japanese and English / Natural Language Processing</p>
</blockquote>
<p><a href="http://vaelen.org/wp-content/uploads/2009/10/300px-keio_university.jpg?w=210"><img class="size-medium wp-image-84  alignleft" title="keio_university" src="http://vaelen.org/wp-content/uploads/2009/10/300px-keio_university.jpg?w=210" alt="Keio University" width="300" height="428" /></a></p>
<p>While I had seen plenty of professors with “Natural Language Processing” listed before, they were typically researching machine translation, speech recognition, or some sort of human-machine interface, often involving robotics.  However, Ohara-sensei’s research is exactly the sort of thing that I am currently working on.  In fact, she is in the process of producing a <a href="http://jfn.st.hc.keio.ac.jp/index.html">Japanese</a> version of <a href="http://framenet.icsi.berkeley.edu/">FrameNet</a>, which is the annotated corpus I am currently using for my honors thesis research.  Also, I discovered that Keio is Japan’s <a href="http://en.wikipedia.org/wiki/Keio_University">top private university</a> and was the first university in Japan, opening a few years before the <a href="http://en.wikipedia.org/wiki/Meiji_Restoration">Meiji Restoration</a> occurred (it is named for the era that came before the <a href="http://en.wikipedia.org/wiki/Meiji_period">Meiji</a> era, which was called the <a href="http://en.wikipedia.org/wiki/Keiō">Keio</a> era).</p>
<p>I emailed Ohara-sensei and spoke with my current advisor about the possibility of studying at Keio after I finish my master’s degree.  So far it looks very promising.  My main deficiency is my lack of a solid understanding of Japanese.  Although the program is in English, and my dissertation would be in English, I would still be studying the linguistics of Japanese, and therefore I need a much better grasp of the Japanese language.  However, it will be at least 3 years before I finish my master’s degree, which will give me plenty of time to improve my Japanese to a sufficient level.  I also plan to do my master’s thesis on a topic involving Japanese linguistics that will allow me to make use of the Japanese FrameNet project, which should hopefully help my chances of being admitted to Keio.</p>
<p>After confirming that this would indeed be something that I could actually do, I spoke with my wife and daughter about it and they are both on board with the idea, as long as it is actually feasible.  I’ve discovered that the Japanese government has a scholarship available to international graduate students (the same one I mentioned above) that will pay for 3 years of tuition (which is the normal length of time that a PhD takes at Keio) and provide the student with a stipend of about $1,000 a month.  The scholarship is competitive, but I will try to get it and hope for the best.  My wife was also won over by the fact that the <a href="http://www.keio-unicorns.com/">school mascot</a> is a <a href="http://www.geocities.jp/unicorns_hp/index.html">unicorn</a>!</p>
<div id="attachment_86" class="wp-caption alignright" style="width: 310px"><a href="http://vaelen.org/wp-content/uploads/2009/10/keio-unicorns.jpg?w=300"><img class="size-medium wp-image-86  " title="keio-unicorns" src="http://vaelen.org/wp-content/uploads/2009/10/keio-unicorns.jpg?w=300" alt="The Keio Mascot" width="300" height="261" /></a><p class="wp-caption-text">The Keio Mascot</p></div>
<p>The campus I would be attending is in Yokohama, so housing should be less expensive than in Tokyo proper.  Also, we hope to teach my daughter as much Japanese as possible before we go and then enroll her in the public school system so that she can experience the culture more closely and hopefully learn to speak fluent Japanese.</p>
<p>Of course, this is all very preliminary at the moment, and nothing is certain.  However, I’m quite serious about it and I really think it is going to happen.  I’m really looking forward to spending a few years in Japan.  It would be a once-in-a-lifetime experience!</p>
]]></content:encoded>
			<wfw:commentRss>http://vaelen.org/2009/07/27/the-3-year-plan/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
