Open Maze: Java

Friday, September 21, 2012

Reading and displaying tf vector content of Apache Mahout SequenceFile

This java code segment reads through the Apache Mahout SequenceFile (generated by the Mahout seq2sparse tool), to display the tokens and their tf values. The code uses the dictionary file to map the token index to token in the dictionary.

You can read the Mahout Wiki on creating vectors from Text Documents. I have created tf vectors from a directory of documents and followed the process explained in the wiki. You basically need to run Mahout seqdirectory tool to create intermediate SequenceFile and then seq2sparse tool to create the tf vectors and dictionary file. Then you can use my code to examine the content of the SequenceFile and identify which terms in a document get higher term frequency and do further research.

In first few lines, the code reads and populate a HashMap of the dictionary file created by the Mahout seq2sparse tool and next the reader reads the tf-vectors of the SequenceFile and each of the token of the tf vectors are mapped from the dictionary map. You may need to import some of the followings,

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.SequentialAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;
import org.apache.mahout.math.Vector.Element;

public static void readMahoutSequenceFile()
{
 Configuration conf = new Configuration();
 FileSystem fs;
 SequenceFile.Reader read;
 try {
  fs = FileSystem.get(conf);
  read = new SequenceFile.Reader(fs, new Path("/Sparsedir/dictionary.file-0"), conf);
  IntWritable dicKey = new IntWritable();
  Text text = new Text();
  HashMap dictionaryMap = new HashMap();
  try {
      while (read.next(text, dicKey)) {
         dictionaryMap.put(Integer.parseInt(dicKey.toString()), text.toString());
      }
   } catch (NumberFormatException e) {
       e.printStackTrace();
   } catch (IOException e) {
       e.printStackTrace();
   }
   read.close();
         
   read = new SequenceFile.Reader(fs, new Path("/Sparsedir/tf-vectors/part-r-00000"), conf);
   Text key = new Text();
   VectorWritable value = new VectorWritable();
   SequentialAccessSparseVector vect;
   while (read.next(key, value)) {
        NamedVector namedVector = (NamedVector)value.get();
        vect= (SequentialAccessSparseVector)namedVector.getDelegate();
        for( Element  e : vect ){
           System.out.println("Token: "+dictionaryMap.get(e.index())+", TF-IDF weight: "+e.get()) ;
          }
         }
         read.close();        
  } catch (IOException e) {
   // TODO Auto-generated catch block
  e.printStackTrace();
 }
}