I’ve made a lot of progress over the last three days on my Hadoop configuration and on optimising my Pig scripts. I can now run full English indexing with this script, which extracts tuples with the schema:
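(The field names below are illustrative rather than my exact identifiers; the shape is one bag of per-token counts for each URI.)

index: {targetUri: chararray, tokenCounts: {(token: chararray, count: long)}}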
This index will be useful for disambiguation methods that don’t use a bag-of-words vector space approach, although further processing of the contexts will still be required to obtain features of interest for the particular method.
I also successfully ran the SF -> URI extraction script on the full English wikidump, generating the set of URIs that each surface form can point to. One issue with this script is that some of the URIs are redirects, which need to be resolved to their targets. I’m planning to patch this tomorrow.
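One way to patch it is a left outer join against a (source, target) redirects relation, keeping the redirect target whenever one exists. A rough sketch, with illustrative relation names and paths:

sfUris = LOAD 'sf_uris' AS (surfaceForm:chararray, uri:chararray);
redirects = LOAD 'redirects' AS (source:chararray, target:chararray);
joined = JOIN sfUris BY uri LEFT OUTER, redirects BY source;
resolved = FOREACH joined GENERATE
    sfUris::surfaceForm AS surfaceForm,
    -- keep the original URI unless a redirect target exists
    (redirects::target IS NULL ? sfUris::uri : redirects::target) AS uri;

Note that this resolves only one level of redirects; chains would need a second pass.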
I also completed a rough-draft version of a Hadoop/Pig indexing tutorial. This still needs quite a bit more work, but it outlines the major steps.
I realized that the main issue hurting the performance of my cluster is that the nodes are not balanced in terms of space allocation: two nodes have much more disk space than the others. For small jobs this isn’t a problem, but on big jobs the small nodes run out of space and performance slows to a crawl. I’ve enabled compression of intermediate results (map output) and experimented with different block sizes via dfs.block.size, as well as with many different pig.properties configurations and numbers of maps/reduces, but this script just can’t seem to finish. I know that one issue is this line:
contexts = FOREACH paragraphs GENERATE
    targetUri, FLATTEN(tokens(paragraph)) AS word;
Here each URI is stored in the map output once for each of its tokens, as one (targetUri, word) pair per token.
Because the space requirements of this map task are so large, some of the mappers run out of space at this step, start getting blacklisted, and everything fails from there.
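For reference, these are the kinds of settings I’ve been tuning; they can also be set per script from inside Pig (the values here are illustrative, not my actual configuration):

SET mapred.compress.map.output true;  -- compress intermediate map output
SET mapred.map.output.compression.codec 'org.apache.hadoop.io.compress.GzipCodec';
SET dfs.block.size 134217728;  -- 128 MB blocks
SET default_parallel 40;  -- number of reduce tasks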
Since no configuration change solved this, I wrote another script that aggregates all contexts with their URIs first, then tokenizes and counts inside a single UDF. This doesn’t exploit the full parallelism of Pig/Hadoop, but for now it’s the only way I can run indexing on my cluster. The script works well, and I’m testing it on the full Wikidump tonight. If that run is successful, I’ll try the German dump tomorrow morning, using the Lucene German analyzer.
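In outline, the new script does something like this (the UDF and paths are placeholders; mine takes a bag of paragraphs and emits (token, count) pairs):

DEFINE tokenCounts mypackage.TokenCounts();  -- placeholder name for my counting UDF
paragraphs = LOAD 'paragraphs' AS (targetUri:chararray, paragraph:chararray);
grouped = GROUP paragraphs BY targetUri;  -- each URI crosses the shuffle once, not once per token
index = FOREACH grouped GENERATE
    group AS targetUri,
    tokenCounts(paragraphs.paragraph) AS tokenCounts;  -- tokenize and count inside the UDF
STORE index INTO 'token_index';

The trade-off is that tokenizing and counting happen serially per URI inside the UDF, but the map output shrinks from one record per token to one record per URI.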