dataflood

Documenting my project with DBpedia-Spotlight during GSoC 2012.

Days 70-74

Some quick updates on my current work:
I’m struggling with the lift-json API – I can’t seem to read the token values correctly. Going to keep at it until this is fixed.
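
For context, here is a minimal sketch of the kind of lift-json parsing I’m attempting. The field names (token, count) and the TokenCount case class are placeholders, not the actual schema my Pig scripts emit:

import net.liftweb.json._

// Hypothetical record: one token and its count for a resource
case class TokenCount(token: String, count: Int)

object LiftJsonSketch {
  // lift-json needs an implicit Formats in scope for extract[]
  implicit val formats: Formats = DefaultFormats

  def main(args: Array[String]) {
    val line = """{"token": "plant", "count": 3}"""
    val json = parse(line)

    // Extract into a case class...
    val tc = json.extract[TokenCount]
    // ...or navigate the AST directly
    val token = (json \ "token").extract[String]

    println(tc.token + " -> " + tc.count + " (via AST: " + token + ")")
  }
}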

I also wrote a new Pig script to calculate tfidf vectors for all resources.

Working through the ESA implementation:

  • Once I have lift-json working correctly, the InvertedIndex will be ready.
  • The RD interfaces are finished.
  • I chose to index tf-idf directly using my Pig script, instead of calculating it inside Spotlight.
  • This may require a new Token class that can hold weights instead of counts (Int) – see the sketch after this list.
  • It would technically be possible to use both tf-icf (for the query) and tf-idf (for the doc index) within one system.
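
On the Token point above: a minimal sketch of what a weighted token type could look like. The names are hypothetical, and this is not the existing Spotlight Token class:

case class WeightedToken(tokenType: String, weight: Double)

case class WeightedVector(tokens: Seq[WeightedToken]) {
  // L2 norm, handy later for cosine similarity
  def norm: Double = math.sqrt(tokens.map(t => t.weight * t.weight).sum)
}

object WeightedTokenExample extends App {
  // Weights here are tf-idf values produced by the Pig script, not raw counts
  val doc = WeightedVector(Seq(
    WeightedToken("plant", 2.37),
    WeightedToken("leaf", 1.02)))
  println(doc.norm)
}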

Updated spotlight-db:

  • The current in-memory indexing system fills up the memory on my machine and the system starts paging, which slows the process down to a crawl.
  • I’m in the midst of a coding sprint, so expect results soon.


Days 64-68

I’ve been working hard this week, and I’ve made good progress. I’m finally able to run an example of LSA disambiguation using the Milne-Witten corpus for testing. There are still lots of issues with this system, mostly related to my custom indexes for the Candidates, Resources, and the Corpus Data:

[1] LsaIndex

[2] CandidateIndex

[3] CorpusData

This example uses only the surface forms contained in the first doc of the Milne-Witten corpus to keep things at a manageable size. This data was extracted using the URI–>context extractor pig script – the parsing occurs when the data is loaded into the index. The accuracy is low (~35%), but there are several things to keep in mind:

  • There are only 100 dimensions in each doc vector in this example – this is pretty cool when you consider that there were more than 80,000 terms in the vocabulary before LSA.
  • Some of the resources included in the Milne-Witten corpus weren’t found in the wikidump by the Pig script. This may be because the script requires that the resource occurs with the surface form at least 10 times, or because the page names in Wikipedia have changed since the test data was created (unlikely).
  • Although there are >80,000 terms, the input to the model is only 33MB total, so the full power of LSA – capturing latent associations between terms – is certainly not realized with such a small dataset.

The easiest way to run this example is to clone this repo, load it into IntelliJ, and run the class org.dbpedia.spotlight.evaluation.EvaluateLsaDisambiguation from inside the IDE. You’ll get some debugging info on your console; the file names are hard-coded until I determine the best way to create a configuration for this system.

The challenge now is integrating this into the existing DBpedia-Spotlight Evaluation framework. Because we’re getting close to the end of GSoC, I’m going to switch to ESA disambiguation, and return to this once I’ve made some progress.

One final note – I’ve added a class in pignlproc that allows for compressed JSON output. This can be used by adding something like:

DEFINE JsonCompressed pignlproc.storage.JsonCompressedStorage();

to Pig scripts that have the newest version of the Pignlproc jar registered.

Days 55-57

I’m right in the middle of scaling the LSA system up and implementing the models properly in DBpedia-Spotlight. My system now returns similarity scores for query vectors, ranking the candidate resources for a surface form given its context.
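
In outline, the scoring step looks something like the sketch below. This is plain Scala with hypothetical names; the real code reads the vectors from my custom indexes rather than an in-memory map:

object RankBySimilarity {
  // Cosine similarity between two dense vectors of equal length
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
  }

  // Rank candidate resources for a surface form by the similarity of their
  // reduced document vectors to the folded-in query (context) vector
  def rank(query: Array[Double],
           candidates: Map[String, Array[Double]]): Seq[(String, Double)] =
    candidates.toSeq
      .map { case (uri, vec) => (uri, cosine(query, vec)) }
      .sortBy(-_._2)
}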

A lot of my implementation is still very hackish, and doesn’t fit nicely into the DBpedia-Spotlight abstractions. I’m going to try to move to Scala (mostly) this week, because I think I’m familiar enough now, and because I want my code to fit with the code that others are producing.

This shell script prepares a corpus and runs SVD, producing the output needed to build the LSI index.

I’m working on implementing the ParagraphDisambiguator trait, and I hope to finish a working prototype tomorrow.


Days 49-52

I’ve been working on moving my code to my fork of DBpedia-Spotlight, and creating a simple indexing system that I can use to test disambiguation. Some issues I’ve been dealing with:

  • As mentioned, I had to patch the Mahout source code, but this has the potential to create some nasty dependency problems in the future. I don’t see any obvious way around this, unless I want to build the index from Hadoop sequence files every time (Mahout vectors implement Writable but not Serializable, and JDBM requires persisted objects to be Serializable).
  • I’ve translated the Mahout vectors into ArrayList<Double> objects and stored them in a JDBM persistent TreeMap, using the resource name as the key. The next step is to compare the vectors using cosine similarity. Using the Mahout utilities will require translating each vector back into a Mahout vector for comparison, which is my initial plan (see the sketch after this list). Later I may look for a better implementation.
  • My goal is to implement the Disambiguator interface in org.dbpedia.spotlight.disambiguate, so I’m looking at how I can create the methods I need to implement.
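
Here is a rough sketch of the conversion in the second point above. The names are mine, a plain java.util.TreeMap stands in for the JDBM persistent map, and the Mahout pieces used are DenseVector and CosineDistanceMeasure:

import java.util.{ArrayList => JArrayList, TreeMap => JTreeMap}
import org.apache.mahout.common.distance.CosineDistanceMeasure
import org.apache.mahout.math.{DenseVector, Vector => MahoutVector}

object VectorStoreSketch {
  // Mahout Vector -> ArrayList[Double], the Serializable form to persist
  def toList(v: MahoutVector): JArrayList[java.lang.Double] = {
    val list = new JArrayList[java.lang.Double](v.size())
    var i = 0
    while (i < v.size()) { list.add(java.lang.Double.valueOf(v.get(i))); i += 1 }
    list
  }

  // ArrayList[Double] -> Mahout DenseVector, for similarity computations
  def toVector(list: JArrayList[java.lang.Double]): MahoutVector = {
    val arr = new Array[Double](list.size())
    var i = 0
    while (i < arr.length) { arr(i) = list.get(i).doubleValue(); i += 1 }
    new DenseVector(arr)
  }

  def main(args: Array[String]) {
    // Plain TreeMap standing in for the JDBM persistent TreeMap,
    // keyed by resource name
    val store = new JTreeMap[String, JArrayList[java.lang.Double]]()
    store.put("Plant", toList(new DenseVector(Array(0.2, 0.9, 0.1))))
    store.put("Factory", toList(new DenseVector(Array(0.8, 0.1, 0.4))))

    // CosineDistanceMeasure returns a distance, i.e. 1 - cosine similarity
    val measure = new CosineDistanceMeasure()
    val query = new DenseVector(Array(0.3, 0.8, 0.2))
    val dist = measure.distance(query, toVector(store.get("Plant")))
    println("cosine similarity to Plant: " + (1.0 - dist))
  }
}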

Days 44-47

I’ve made a lot of progress with Mahout in the last few days, and I’ve modified the Mahout source code to allow for some new functionality.

The issue was that mahout seq2sparse calculates TF-IDF and fills vector cells based upon term frequencies read from the output of a previous job. This makes it difficult to add a new document vector whose TF-IDF weighting matches the df counts from the corpus and whose dimensionality matches the existing matrix of documents (note that this is _before_ any dimensionality reduction has been applied).

To get around this, I added two options to the utility:

  • --useDict (-ud): the path to the dictionary file to use when creating vectors
  • --useFreqFile (-uf): the path to the directory containing the word counts

There are several benefits to this approach:

  • Everything still works in Hadoop (Map-Reduce dataflow isn’t affected)
  • We can use almost the same pipeline (with a few different options) to generate and fold query vectors as we used to generate the LSI matrix
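
To make the vector-building step concrete, here is a small sketch of what folding a new document into the existing term space amounts to: build its vector against the corpus dictionary (term to column index) and the corpus df counts. This is plain Scala with hypothetical names, and the exact tf-idf formula Mahout applies may differ from the simple tf * log(N/df) used here:

object TfIdfAgainstCorpus {
  // Build a tf-idf vector for a new document using an existing dictionary
  // and existing document-frequency counts, so the vector has the same
  // dimensionality and weighting basis as the corpus matrix.
  def vectorize(tokens: Seq[String],
                dictionary: Map[String, Int],   // term -> column index
                docFreq: Map[String, Long],     // corpus df counts
                numDocs: Long): Array[Double] = {
    val vec = new Array[Double](dictionary.size)
    val tf = tokens.groupBy(identity).mapValues(_.size)
    for ((term, count) <- tf; idx <- dictionary.get(term)) {
      val df = docFreq.getOrElse(term, 1L)
      vec(idx) = count * math.log(numDocs.toDouble / df)
    }
    vec
  }
}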

After I finish testing with one vector, I’m moving on to storing vectors in JDBM, which will get me one step closer to having a real disambiguation system.

I also created a wiki page to track issues and remaining TODOs for indexing with pignlproc.

To keep things organized while trying not to break anything too badly, I forked the Mahout trunk, and I’ll be mostly committing there until the projects are ready to be pulled together.

Days 41-43

I’ve gotten the Mahout pipeline working, and successfully tested it using the resources that are pointed to by the surface form ‘plant’.

Here is the mahout workflow:

  • mahout seqdirectory – turns a directory of text files into a SequenceFile
  • mahout seq2sparse – turns that SequenceFile into vectors. This tool generates the vectors (both tf and tf-idf), the dictionary, and the raw term counts, so it’s really powerful and useful.
  • mahout rowid – generates the matrix from the seq2sparse output. I used the tf-idf vectors for testing.
  • mahout ssvd – takes the matrix output of rowid and outputs the SVD components U, SIGMA, and V.

One nice thing about working with Mahout is that the results of any of these jobs can be viewed using ‘mahout seqdumper’. This makes it easy to see what is happening step-by-step.

I ran mahout rowsimilarity (using cosine similarity) on the U*SIGMA^(0.5) matrix [1] output by ssvd with dimensionality reduced to k=50 to quickly test if things were working. The most similar docs to each of the resources in the ‘plant’ dataset seem to be pretty accurate, but of course this needs real testing to be confirmed. For example, the top 9 most similar docs in the ‘plant’ dataset for the resource ‘Plant’ (http://en.wikipedia.org/wiki/Plant) are:

  • Flowering_plant, Agriculture, Annual_plant, Flora, Botany, Turtle, Toxin, Plant_sexuality, Herbalism

whereas the top 9 for ‘Factory’ (http://en.wikipedia.org/wiki/Factory) in the ‘plant’  dataset are:

  • Agriculture, Power_station, Botany, Nuclear_power_plant, Flora, Turtle, Gasworks, Desalination, Plant

There are obviously some weird similarities in here, but the document vectors have been reduced to only 50 dimensions (from more than 50,000 terms originally), and the set of resources to choose from only contains 70 docs, so I think this is close enough to confirm that some of the structure of the matrix is being preserved.

I also have the folding process working via the mahout matrix/vector multiplication utils, so now I’m working on building vectors for arbitrary queries using the dictionary and term index output by seq2sparse. The next step is to test disambiguation on a vector built from an arbitrary context (I’ll choose one which clearly uses a certain meaning of ‘plant’ and check the output).

[1] My matrix is doc X term (docs are rows and terms are columns)

Days 38-40

I’ve been working hard to catch up on my proposal, and studying/testing Mahout in depth over the last three days. The lack of comprehensive documentation and the frequent changes in versions make this a real challenge, but I finally have an end-to-end plan laid out for LSA indexing using Mahout, HBase, and Hadoop.

Several Mahout utilities will be used to build TF-IDF vectors, and create a distributed matrix which will then become the input to Mahout’s ssvd tool. This will output the reduced-dimension matrices needed for LSA. The matrix with docs (resources) as columns (rows if transposed) supplies the document vectors, which will be put into HBase so that we can retrieve the vectors we need to disambiguate a spotted surface form.

The context vector of the spotted surface form will be folded into the reduced-dimensional space by multiplying by the decomposed matrices that are output by ssvd. I consulted the Mahout mailing list regarding this approach, and they seem to think that it will work. Diagram of the process and more mathematical background coming soon.
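
To spell out the fold-in step, here is a sketch in plain Scala, assuming the original matrix is doc x term (the orientation I use in the Days 41-43 post). The Sigma scaling is an assumption that has to match how the stored document vectors are formed; for example, if the doc vectors are rows of U*SIGMA^(0.5), the query should be scaled by SIGMA^(-0.5) instead:

object FoldInSketch {
  // Fold a new term-space vector q into the k-dimensional LSA space of
  // A ~= U * Sigma * V^T, where A is doc x term. Standard fold-in:
  //   qHat = q * V * Sigma^(-1)   (a length-k vector, comparable to rows of U)
  def foldIn(q: Array[Double],         // length = number of terms
             v: Array[Array[Double]],  // V: terms x k
             sigma: Array[Double]      // singular values, length k
            ): Array[Double] = {
    val k = sigma.length
    val qHat = new Array[Double](k)
    var j = 0
    while (j < k) {
      var sum = 0.0
      var t = 0
      while (t < q.length) { sum += q(t) * v(t)(j); t += 1 }
      qHat(j) = sum / sigma(j)
      j += 1
    }
    qHat
  }
}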

Days 34-37

I’m nearing the end (hopefully) of the first iteration of the Hadoop indexing code. indexer_small_cluster.pig is running well on my cluster, and seems to be  efficient. I’ve made a lot of small optimizations and fixes over the last few days.

  • updated indexer_lucene.pig and indexer_small_cluster.pig with a good set of user-specifiable parameters
  • updated the UDFs for these scripts based upon advice from my mentors – this has made both scripts run more efficiently, as evidenced by a considerable reduction in average completion time (there are other factors, such as network load, that make it difficult to determine exactly how much improvement is due to specific optimizations).

I’ll finish the second iteration of the how-to on the DBpedia-Spotlight wiki tomorrow, then I’ll be focusing on the implementation of the disambiguation methods full-time. I’ve been working on determining the best way to ‘fold’ a query into the vector space created after SVD is complete, as well as several other issues regarding ad-hoc querying of the LSA matrix.

Days 31-33

Things are finally starting to go more smoothly: I’m making optimisations to the now-functional indexing system, and making progress on a system for running LSA for disambiguation (using Mahout’s implementation of the Lanczos algorithm for SVD).

Over the last three days:

  • Upgraded Lucene in Pignlproc POM to 3.6.0, and tested with EnglishAnalyzer, GermanAnalyzer, and SpanishAnalyzer. Upgrading the version did not break any of the tests.
  • Also added Google’s Guava 10.0.1 for file utilities, but have since removed the code that uses it, because the Hadoop Distributed Cache works perfectly for my purposes. Leaving Guava in the POM temporarily until a stable version of the indexing UDFs is complete.
  • ran full indexing of the English Wikidump in Hadoop using the EnglishAnalyzer (the Snowball analyzer is now deprecated) and the DBpedia-Spotlight stopword list. This analyzer tokenizes, removes stopwords (using a list loaded via the Hadoop Distributed Cache), and stems the remaining tokens using the Porter stemmer (see the sketch after this list). The bzipped index is 8.9 GB (the index produced by the previous iteration of Get_Counts_Lucene is 11 GB).
  • Set up Mahout and ran sample SVD tests using this page as a guide
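
Outside of the Hadoop plumbing, the analysis step amounts to roughly the following (Lucene 3.6). The field name and the way the stopword set is passed in are placeholders; in the real UDF the list is loaded from the Distributed Cache:

import java.io.StringReader
import org.apache.lucene.analysis.en.EnglishAnalyzer
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute
import org.apache.lucene.util.Version
import scala.collection.mutable.ArrayBuffer

object AnalyzerSketch {
  // Tokenize, lowercase, remove the supplied stopwords, and Porter-stem,
  // which is what EnglishAnalyzer does out of the box in Lucene 3.6
  def analyze(text: String, stopwords: java.util.Set[_]): Seq[String] = {
    val analyzer = new EnglishAnalyzer(Version.LUCENE_36, stopwords)
    val stream = analyzer.tokenStream("context", new StringReader(text))
    val term = stream.addAttribute(classOf[CharTermAttribute])
    val out = new ArrayBuffer[String]
    stream.reset()
    while (stream.incrementToken()) out += term.toString
    stream.end()
    stream.close()
    out
  }
}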

Changed Get_Counts_Lucene.java to work with the Hadoop Distributed Cache instead of trying to package the stopword list in the jar that is sent to each node. This requires a bit more configuration, because the user must now put the stopwords file into HDFS and specify its location for indexing to work. As we move into indexing lots of languages, this could get complicated, so I need to think about the most straightforward way to specify parameters for an indexing job, and to make sure that users have all the necessary files in place. I’ll update the indexing how-to on the DBpedia-Spotlight wiki early tomorrow.

Days 27-30

I’ve made a lot of progress over the last three days with my Hadoop configurations and optimising my Pig scripts. I can now run full English indexing with this script, which extracts tuples with the schema:
(URI_1, context_1)
(URI_1, context_2)

This index will be useful for disambiguation methods that don’t use a bag-of-words vector space approach, although further processing of the contexts will still be required to obtain features of interest for the particular method.

I also successfully ran the SF –> URI extraction script on the full English wikidump, generating the set of URIs that each surface form can point to. One issue with this script is that some of the URIs are redirects, which needs to be fixed. I’m planning to patch this tomorrow.

I also completed a rough-draft version of a Hadoop/Pig indexing tutorial. This still needs quite a bit more work, but it outlines the major steps.

I realized that the main issue hurting the performance of my cluster is that the nodes are not balanced in terms of space allocation, so two nodes have much more space than the others. For small jobs this isn’t a problem, but the small nodes run out of space on big jobs, and performance slows to a crawl. I’ve enabled compression of intermediate results (map output) and experimented with different block sizes using dfs.block.size, as well as with many different configurations of pig.properties and numbers of maps/reduces, but this script just can’t seem to finish. I know that one issue is this line:

contexts = FOREACH paragraphs GENERATE
          targetUri, FLATTEN(tokens(paragraph)) AS word;

Where each URI is stored as map output once for each of its tokens in the format:

URI_1, token_1
URI_1, token_2

Because the space requirements of this map task are large, some of the mappers run out of space at this step and start getting blacklisted, and everything fails from there.

So I wrote another script that aggregates all contexts with their URIs first, then tokenizes and counts inside a single UDF. This doesn’t maximize the power of Pig/Hadoop, but for now it’s the only way I can run indexing on my cluster. This script works well, and I’m testing it on the full Wikidump tonight. If it is successful, I’ll try tomorrow morning with the German dump using the Lucene German analyzer.
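
For illustration, here is a stripped-down sketch of that kind of UDF, written in Scala against the Pig EvalFunc API. The class name is made up, and whitespace tokenization stands in for the Lucene analysis the real UDF performs:

import org.apache.pig.EvalFunc
import org.apache.pig.data.{BagFactory, DataBag, Tuple, TupleFactory}
import scala.collection.mutable.HashMap

// Takes the aggregated context text for one URI and emits a bag of
// (token, count) tuples, so tokenizing and counting happen in one UDF call
// instead of emitting one map-output record per token.
class CountTokensSketch extends EvalFunc[DataBag] {
  private val tupleFactory = TupleFactory.getInstance()
  private val bagFactory = BagFactory.getInstance()

  override def exec(input: Tuple): DataBag = {
    if (input == null || input.size() == 0) return null
    val text = input.get(0).asInstanceOf[String]

    // Whitespace tokenization as a stand-in for the Lucene analyzer
    val counts = new HashMap[String, Int]()
    for (tok <- text.toLowerCase.split("\\s+") if tok.nonEmpty)
      counts(tok) = counts.getOrElse(tok, 0) + 1

    val bag = bagFactory.newDefaultBag()
    for ((tok, count) <- counts) {
      val t = tupleFactory.newTuple(2)
      t.set(0, tok)
      t.set(1, java.lang.Integer.valueOf(count))
      bag.add(t)
    }
    bag
  }
}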