"A meek endeavor to the triumph" by Sampath Jayarathna

Friday, September 21, 2012

Reading and displaying the tf vector content of an Apache Mahout SequenceFile

This Java code segment reads an Apache Mahout SequenceFile (generated by the Mahout seq2sparse tool) and displays the tokens and their tf values. The code uses the dictionary file to map each token index back to its token.

You can read the Mahout Wiki on creating vectors from text documents. I created tf vectors from a directory of documents by following the process explained in the wiki. You basically need to run the Mahout seqdirectory tool to create an intermediate SequenceFile, and then the seq2sparse tool to create the tf vectors and the dictionary file. You can then use my code to examine the content of the SequenceFile, identify which terms in a document have a higher term frequency, and do further research.

In the first few lines, the code reads the dictionary file created by the Mahout seq2sparse tool and populates a HashMap with it. Next, the reader reads the tf vectors from the SequenceFile, and each token index in a tf vector is mapped back to its token through the dictionary map. You may need some of the following imports:

import java.io.IOException;
import java.util.HashMap;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.SequentialAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;
import org.apache.mahout.math.Vector.Element;

public static void readMahoutSequenceFile()
{
 Configuration conf = new Configuration();
 try {
  FileSystem fs = FileSystem.get(conf);

  // Load the dictionary written by seq2sparse: key = token (Text), value = token index (IntWritable).
  HashMap<Integer, String> dictionaryMap = new HashMap<Integer, String>();
  SequenceFile.Reader read = new SequenceFile.Reader(fs, new Path("/Sparsedir/dictionary.file-0"), conf);
  try {
   Text text = new Text();
   IntWritable dicKey = new IntWritable();
   while (read.next(text, dicKey)) {
    dictionaryMap.put(dicKey.get(), text.toString());
   }
  } finally {
   read.close();
  }

  // Read the tf vectors: key = document id (Text), value = term-frequency vector.
  read = new SequenceFile.Reader(fs, new Path("/Sparsedir/tf-vectors/part-r-00000"), conf);
  try {
   Text key = new Text();
   VectorWritable value = new VectorWritable();
   while (read.next(key, value)) {
    // These casts assume named, sequential-access sparse vectors, as in my seq2sparse output.
    NamedVector namedVector = (NamedVector) value.get();
    SequentialAccessSparseVector vect = (SequentialAccessSparseVector) namedVector.getDelegate();
    for (Element e : vect) {
     // Map each term index back to its token and print its term frequency.
     System.out.println("Token: " + dictionaryMap.get(e.index()) + ", TF: " + e.get());
    }
   }
  } finally {
   read.close();
  }
 } catch (IOException e) {
  e.printStackTrace();
 }
}
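
If you want to go one step further and pick out the terms with the highest term frequency in a document, something like the following might be a starting point. It is only a minimal sketch using plain Java collections (Java 8 or later): the class TopTerms, the method topTerms and the sample tokens are hypothetical names of mine, and an ordinary Map<Integer, Double> stands in for the (index, value) pairs of one tf vector, so nothing here is part of the Mahout API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TopTerms {

    // Returns up to n "token=tf" strings, highest term frequency first.
    // termFrequencies is a stand-in for the (index, value) pairs of one tf vector.
    static List<String> topTerms(Map<Integer, String> dictionary,
                                 Map<Integer, Double> termFrequencies,
                                 int n) {
        List<Map.Entry<Integer, Double>> entries =
                new ArrayList<>(termFrequencies.entrySet());
        // Sort by tf descending; break ties by index so the order is deterministic.
        entries.sort((a, b) -> {
            int byTf = Double.compare(b.getValue(), a.getValue());
            return byTf != 0 ? byTf : Integer.compare(a.getKey(), b.getKey());
        });
        int limit = Math.max(0, Math.min(n, entries.size()));
        List<String> result = new ArrayList<>();
        for (Map.Entry<Integer, Double> e : entries.subList(0, limit)) {
            // Indexes missing from the dictionary are shown as "?".
            String token = dictionary.get(e.getKey());
            result.add((token != null ? token : "?") + "=" + e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up dictionary and tf values, for illustration only.
        Map<Integer, String> dictionary = new HashMap<>();
        dictionary.put(0, "apache");
        dictionary.put(1, "mahout");
        dictionary.put(2, "vector");

        Map<Integer, Double> tf = new HashMap<>();
        tf.put(0, 1.0);
        tf.put(1, 4.0);
        tf.put(2, 2.0);

        System.out.println(topTerms(dictionary, tf, 2)); // prints [mahout=4.0, vector=2.0]
    }
}
```

To use it with the reader above, you could fill the tf map inside the loop over the vector elements (index from e.index(), value from e.get()) and call topTerms once per document.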

2 comments:

SIDDHARTHA said...

Sampath, Your blog is great. I just have a doubt regarding reading the output of dictionary file. Should I create a class and put it in some directory and re-run mvn install to rebuild the mahout dist? How should I run it?

Sampath Jayarathna said...

You can create your class inside the Mahout directory structure. What I did was load the Mahout source into Eclipse and then create my class inside the Mahout core class structure. This way I don't need any jar files to get the necessary library classes. To read only the dictionary, you can go up to the first read.close(), which is where I finish reading my dictionary file.