"A meek endeavor to the triumph" by Sampath Jayarathna
Showing posts with label Programming. Show all posts

Wednesday, December 30, 2015

[Weka] Attribute Selection/Ranking using Relief Algorithm

The following code snippet shows how to rank the attributes (features) of a data set before using them in classification applications. I will be using the standard Weka 3.7.13 and the sample data file "weather.numeric.arff" from Weka's data folder. I assume you already know how to set up the weka.jar files in your development environment.

In Weka, "attribute" means the same thing as "feature".

This is the content of the sample data file:
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no
To perform attribute selection, three elements are required: a search method, an evaluation method, and the data. The search and evaluation methods need to be initiated and registered in a container class, AttributeSelection. So the general framework for setting up attribute selection looks like this:

 public static void main(String[] args) throws Exception {
         // load data
         String workingDirectory = System.getProperty("user.dir");
         String fs = System.getProperty("file.separator");
         String wekadatafile = workingDirectory + fs + "data" + fs + "weather.numeric.arff";
         BufferedReader datafile = new BufferedReader(new FileReader(wekadatafile));
         Instances data = new Instances(datafile);

         // the class attribute is the last attribute unless set otherwise
         if (data.classIndex() == -1)
                 data.setClassIndex(data.numAttributes() - 1);
         useLowLevel(data, wekadatafile);
  }

 /**
   * uses the low level approach
   */
  protected static void useLowLevel(Instances data, String datafile) throws Exception {
         System.out.println("\n3. Low-level");
         AttributeSelection attsel = new AttributeSelection();
         Ranker search = new Ranker();
         ReliefFAttributeEval evals = new ReliefFAttributeEval();
         attsel.setRanking(true);
         attsel.setEvaluator(evals);
         attsel.setSearch(search);
         attsel.SelectAttributes(data);
         // un-comment here to display the results from the ranking
         //System.out.println(attsel.toResultsString());

         // expand the ranked attributes so you can find the index, name and weight of the features
         double[][] ranked = attsel.rankedAttributes();
         System.out.println("ranked attributes!!!\n");
         for (int i = 0; i < ranked.length; i++) {
          int index = (int) ranked[i][0]; // first column is the attribute index, second is the weight
          System.out.println(" Feature:" + data.attribute(index).name() + " weight:" + ranked[i][1]);
         }
  }

Output
3. Low-level
ranked attributes!!!
Feature:outlook weight:0.0548
Feature:humidity weight:0.0113
Feature:windy weight:-0.0024
Feature:temperature weight:-0.0314
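For intuition on where these weights come from: Relief rewards a feature when its value differs between an instance and its nearest neighbor of the opposite class (the "nearest miss"), and penalizes it when it differs from the nearest neighbor of the same class (the "nearest hit"). Below is a toy, single-neighbor Relief sketch in plain Java on made-up numbers; it is not Weka's ReliefF implementation, which averages over k neighbors and samples instances.

```java
public class ReliefToy {
    // Toy data: rows of {temperature, humidity} with yes/no labels (made-up values).
    static double[][] X = {{85, 85}, {80, 90}, {83, 86}, {70, 96}};
    static String[] y = {"no", "no", "yes", "yes"};

    // Normalized feature difference, as in the classic Relief definition.
    static double diff(int f, double[] a, double[] b, double range) {
        return Math.abs(a[f] - b[f]) / range;
    }

    public static void main(String[] args) {
        int nFeat = X[0].length;
        double[] w = new double[nFeat];
        double[] range = new double[nFeat];
        for (int f = 0; f < nFeat; f++) {
            double min = Double.MAX_VALUE, max = -Double.MAX_VALUE;
            for (double[] row : X) { min = Math.min(min, row[f]); max = Math.max(max, row[f]); }
            range[f] = max - min;
        }
        for (int i = 0; i < X.length; i++) {
            // find the nearest hit (same class) and nearest miss (other class)
            int hit = -1, miss = -1;
            double hitD = Double.MAX_VALUE, missD = Double.MAX_VALUE;
            for (int j = 0; j < X.length; j++) {
                if (j == i) continue;
                double d = 0;
                for (int f = 0; f < nFeat; f++) d += Math.pow(diff(f, X[i], X[j], range[f]), 2);
                if (y[i].equals(y[j]) && d < hitD) { hitD = d; hit = j; }
                if (!y[i].equals(y[j]) && d < missD) { missD = d; miss = j; }
            }
            // reward separation from the miss, penalize separation from the hit
            for (int f = 0; f < nFeat; f++)
                w[f] += (diff(f, X[i], X[miss], range[f]) - diff(f, X[i], X[hit], range[f])) / X.length;
        }
        System.out.printf("temperature=%.3f humidity=%.3f%n", w[0], w[1]);
    }
}
```

Features with positive weights separate the classes better than features near zero or below, which is exactly how the ranked list above should be read.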

The overall setup for attribute selection is clear and intuitive. What's not so obvious is that search methods include ranking and subset methods and, correspondingly, evaluation methods include individual and subset evaluators. A ranking search can't be used together with a subset evaluator, and vice versa.

If you are using a subset evaluation method like CfsSubsetEval, then you need to use a subset search method like GreedyStepwise:
    //CfsSubsetEval eval = new CfsSubsetEval();
    //GreedyStepwise greedySearch = new GreedyStepwise();
    //greedySearch.setSearchBackwards(true);
    //attsel.setEvaluator(eval);
    //attsel.setSearch(greedySearch);

Subset Search Methods:
1. BestFirst
2. GreedyStepwise
3. FCBFSearch (ASU)

Subset Evaluation Methods:
1. CfsSubsetEval
2. SymmetricalUncertAttributeSetEval (ASU)

Individual Search Methods:
1. Ranker

Individual Evaluation Methods:
1. CorrelationAttributeEval
2. GainRatioAttributeEval
3. InfoGainAttributeEval
4. OneRAttributeEval
5. PrincipalComponents (used with a Ranker search to perform PCA and data transformation)
6. ReliefFAttributeEval
7. SymmetricalUncertAttributeEval
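The pairing rule behind these lists can be made explicit with a small lookup table. The sketch below is hypothetical helper code (the class names are the Weka ones listed above, but the "kind" tags and `compatible` method are my own illustration, not Weka API):

```java
import java.util.Map;

public class PairingCheck {
    // Subset evaluators pair with subset searches; individual evaluators pair with Ranker.
    static final Map<String, String> EVAL_KIND = Map.of(
        "CfsSubsetEval", "subset",
        "SymmetricalUncertAttributeSetEval", "subset",
        "ReliefFAttributeEval", "individual",
        "InfoGainAttributeEval", "individual");
    static final Map<String, String> SEARCH_KIND = Map.of(
        "BestFirst", "subset",
        "GreedyStepwise", "subset",
        "Ranker", "individual");

    static boolean compatible(String eval, String search) {
        return EVAL_KIND.get(eval).equals(SEARCH_KIND.get(search));
    }

    public static void main(String[] args) {
        System.out.println(compatible("ReliefFAttributeEval", "Ranker"));  // valid pairing
        System.out.println(compatible("CfsSubsetEval", "Ranker"));         // invalid pairing
    }
}
```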

Wednesday, August 26, 2015

[Acrobat SDK Plug-in Development] How to extract text from Acrobat Text Highlight Tool?

This is an update to my previous post on Acrobat SDK Plug-in Development: How to create Text Highlight. In this new code, I'll explain a far easier way to use the Adobe Acrobat Text Highlight tool to get the text from the highlighted area and then use it in your plugin.

This code snippet explains how to extract text highlighted with the Text Highlight tool using the Acrobat SDK. I assume that you already know how to implement basic plug-in functionality using the Acrobat SDK. The version of the SDK used in this code example is the Acrobat XI SDK.

Step 1: If you start with the BasicPlugin.cpp in the Acrobat SDK then you should have the following function when you click on your plugin from the menu bar,

ACCB1 void ACCB2 MyPluginCommand(void *clientData)
{
 // get this plugin's name for display
 ASAtom NameAtom = ASExtensionGetRegisteredName(gExtensionID);
 const char *name = ASAtomGetString(NameAtom);
 char str[256];
 sprintf(str, "This menu item is added by plugin %s.\n", name);

 // try to get the front PDF document
 AVDoc avDoc = AVAppGetActiveDoc();
 if (avDoc == NULL) {
  // if no doc is loaded, show a message and stop before touching the document
  strcat(str, "There is no PDF document loaded in Acrobat.");
  AVAlertNote(str);
  return;
 }

 PDDoc currentPDDoc = AVDocGetPDDoc(avDoc);
 AVPageView currentPageView = AVDocGetPageView(avDoc);
 ASInt32 pageNum = AVPageViewGetPageNum(currentPageView);

 // Create a PDWordFinderConfigRec object and set its attributes
 PDWordFinderConfigRec pConfig;
 memset(&pConfig, 0, sizeof(PDWordFinderConfigRec));
 pConfig.recSize = sizeof(PDWordFinderConfigRec);
 pConfig.ignoreCharGaps = true;
 pConfig.ignoreLineGaps = true;
 pConfig.noAnnots = true;
 pConfig.noEncodingGuess = true;

 // Create a PDWordFinder object
 PDWordFinder pdWordFinder = PDDocCreateWordFinderEx(currentPDDoc, WF_LATEST_VERSION, false, &pConfig);
 // Create a callback function
 PDWordProc wordProc = ASCallbackCreateProto(PDWordProc, &getHighlightedText);

 // Extract and display the highlighted words
 PDWordFinderEnumWords(pdWordFinder, pageNum, wordProc, NULL);
 PDWordFinderDestroy(pdWordFinder);
 string strs = pdfCorpus.str();
 const char *ps = strs.c_str();
 AVAlertNote(ps);

 // report the number of pages of the active document
 int numPages = PDDocGetNumPages(currentPDDoc);
 sprintf(str + strlen(str), "The active PDF document has %d pages.", numPages);
}

Step 2: Now use the getHighlightedText callback to go through all the annotations and get a PDTextSelect object for each highlight.

ACCB1 ASBool ACCB2 getHighlightedText(PDWordFinder wObj, PDWord wInfo, ASInt32 pgNum, void *clientData)
{
 AVDoc avDoc = AVAppGetActiveDoc();
 PDDoc currentPDDoc = AVDocGetPDDoc(avDoc);
 PDPage pdpage = PDDocAcquirePage(currentPDDoc, pgNum);
 ASInt32 numAnnots = PDPageGetNumAnnots(pdpage);
 ASFixedRect boundingRect; // bounding rectangle of the annotation
 PDAnnot annot;
 char *annBuf = NULL;
 for (ASInt32 i = 0; i < numAnnots; i++) {
  annot = PDPageGetAnnot(pdpage, i);
  if (ASAtomFromString("Highlight") == PDAnnotGetSubtype(annot))
  {
   // Get the annotation's rect
   PDAnnotGetRect(annot, &boundingRect);
   // Get the text selection covered by the annotation's rect
   PDTextSelect textSelect = PDDocCreateTextSelect(currentPDDoc, pgNum, &boundingRect);
   // create a callback to get the text from the highlighted bounding box
   PDTextSelectEnumText(textSelect, ASCallbackCreateProto(PDTextSelectEnumTextProc, &pdTextSelectEnumTextProc), &annBuf);
  }
 }
 PDPageRelease(pdpage);
 return 0;
}

Step 3: Create a callback function to extract the text from the PDTextSelect object. Here, pdfCorpus is a stringstream, so the collected text can be used in another part of the code.

ACCB1 ASBool ACCB2 pdTextSelectEnumTextProc(void *procObj, PDFont font, ASFixed size, PDColorValue color, char *text, ASInt32 textLen)
{
 // copy at most textLen characters; the buffer is not guaranteed to be NUL-terminated
 char stringBuffer[200];
 ASInt32 len = (textLen < 199) ? textLen : 199;
 strncpy(stringBuffer, text, len);
 stringBuffer[len] = '\0';
 pdfCorpus << stringBuffer;
 return true;
}

 

Monday, July 28, 2014

[Acrobat SDK Plug-in Development] How to extract all the terms from the PDF document and create an index (COS Dictionary)

This code snippet explains how to extract all the terms from a PDF document and then create a COS dictionary (similar to a Java HashMap). You can use this method to collect all the terms and their offset values, and then use the offset of each term for other purposes such as highlighting terms.

I assume that you already know how to implement basic plug-in functionality using the Acrobat SDK. The version of the SDK used in this code example is the Acrobat XI SDK.
ACCB1 void ACCB2 termExtractor()
{
 // try to get the front PDF document
 AVDoc avDoc = AVAppGetActiveDoc();

 PDDoc currentPDDoc = AVDocGetPDDoc(avDoc);
 AVPageView currentPageView = AVDocGetPageView(avDoc);
 ASInt32 pageNum = AVPageViewGetPageNum(currentPageView);

 // Create a PDWordFinderConfigRec object and set its attributes
 PDWordFinderConfigRec pConfig;
 memset(&pConfig, 0, sizeof(PDWordFinderConfigRec));
 pConfig.recSize = sizeof(PDWordFinderConfigRec);
 pConfig.ignoreCharGaps = true;
 pConfig.ignoreLineGaps = true;
 pConfig.noAnnots = true;
 pConfig.noEncodingGuess = true;

 // Create a PDWordFinder object
 PDWordFinder pdWordFinder = PDDocCreateWordFinderEx(currentPDDoc, WF_LATEST_VERSION, false, &pConfig);

 // Acquire all the terms inside the PDF page
 ASInt32 numWords;
 PDWord wordInfo;
 PDWord *pXYSortTable;
 PDWordFinderAcquireWordList(pdWordFinder, pageNum, &wordInfo, &pXYSortTable, NULL, &numWords);

 // Create a COS dictionary to keep track of all the words and their offsets
 CosDoc cd = PDDocGetCosDoc(currentPDDoc);
 CosObj Dict = CosNewDict(cd, false, numWords);
 for (int nWordCounter = 0; nWordCounter < numWords; nWordCounter++)
 {
  // Get the word as a string
  PDWord pdNWord = PDWordFinderGetNthWord(pdWordFinder, nWordCounter);
  char stringBuffer[125];
  PDWordGetString(pdNWord, stringBuffer, sizeof(stringBuffer));
  pdfCorpus << stringBuffer;

  // Add each term into the COS dictionary to use later with the highlighting method.
  // Offset is the location of each term in the document: the first term's offset is 0, the next is 1, etc.
  bool keyExist = CosDictKnown(Dict, ASAtomFromString(stringBuffer));
  if (keyExist == true)
  {
   // To-do: handle duplicate terms
  }
  else // new term
  {
   CosDictPut(Dict, ASAtomFromString(stringBuffer), CosNewInteger(cd, false, nWordCounter));
  }
 }
 PDWordFinderReleaseWordList(pdWordFinder, pageNum);
}
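The term-to-offset bookkeeping above (check CosDictKnown, then CosDictPut) follows the same pattern as a Java HashMap keyed by term, which is why I describe the COS dictionary as HashMap-like. A minimal self-contained Java sketch of the same indexing logic:

```java
import java.util.HashMap;
import java.util.Map;

public class TermIndex {
    // Build a term -> first-offset map, mirroring the COS dictionary logic above.
    static Map<String, Integer> index(String[] words) {
        Map<String, Integer> dict = new HashMap<>();
        for (int i = 0; i < words.length; i++) {
            dict.putIfAbsent(words[i], i); // a duplicate term keeps its first offset
        }
        return dict;
    }

    public static void main(String[] args) {
        String[] words = {"the", "explosive", "device", "the", "explosive"};
        System.out.println(index(words));
    }
}
```

As in the plug-in, duplicate terms are the open question: only the first occurrence's offset is stored, so highlighting later occurrences needs extra handling.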

[Acrobat SDK Plug-in Development] How to create Text Highlight

Instead of the method below, if you want to know how to extract text directly using the Acrobat Text Highlight tool, please read my new blog entry: How to extract text from Acrobat Text Highlight Tool?

This code snippet explains how to create a text-highlight plug-in using the Acrobat SDK. I assume that you already know how to implement basic plug-in functionality using the Acrobat SDK. The version of the SDK used in this code example is the Acrobat XI SDK. I also assume the following:
  • Read PDF 32000-1:2008 12.5.6.10, “Text Markup Annotations”, for further information
    wwwimages.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
  • There already exists a COS dictionary with all the terms in the PDF document. A COS dictionary is similar to a Java HashMap: it stores <Key, Value> pairs, in this case <term, offset> values.
  • The offset of a term is its index inside the PDF document. For example, the first term of the PDF document has offset 0, the next 1, and so on.
  • You can use PDWordFinderAcquireWordList() method to get the total word list and then use a loop to create the CosObj Dic. Read my previous post about how to extract terms from PDF and create COS Dictionary
void highlightText()
{  
 // Look up the term "explosive" in the COS dictionary Dict and get its offset
 CosObj offset = CosDictGet(Dict, ASAtomFromString("explosive"));
 // Create a HiliteEntry object to keep track of the offset (start index)
 // and the length, which is how many terms to highlight.
 HiliteEntry hilite;
 hilite.offset = CosIntegerValue(offset); 
 hilite.length = 1;

 AVDoc currentAVDoc = AVAppGetActiveDoc();
 PDDoc currentPDDoc = AVDocGetPDDoc(currentAVDoc);
 AVPageView currentPageView = AVDocGetPageView(currentAVDoc); 
 ASInt32 pageNum = AVPageViewGetPageNum(currentPageView);
  
 PDEElement pdeElement;
 ASFixedRect boundingRect; // bounding rectangle of the term
 PDPage pdPage = PDDocAcquirePage (currentPDDoc, pageNum);
 PDAnnot pdAnnot;
 
 // Set the color you want to highlight your text
 PDColorValueRec red;
 red.space = PDDeviceRGB;
 red.value[0] = ASInt32ToFixed(1); 
 red.value[1] = 0; 
 red.value[2] = 0; 

 // highlight 
 AVPageViewSetColor(currentPageView, &red); 
 PDTextSelect textSelection = PDTextSelectCreateWordHilite(pdPage,&hilite, 1);

 AVDocSetSelection(currentAVDoc, ASAtomFromString("Text"),(void *)textSelection, true);
 AVPageViewDrawNow(currentPageView);
 AVDocShowSelection (currentAVDoc);
 
 // make text selection and get the bounding rectangle of the selection. 
 PDTextSelect selectedText = static_cast<PDTextSelect>(AVDocGetSelection(currentAVDoc));
 PDTextSelectGetBoundingRect(selectedText,&boundingRect);
  
 // use the bounding rectangle to create a highlight annotation QuadPoints
 // and bounding rectangle Cos objects. We need these 2 to create highlight type. 
 CosObj ArrayObj, RecObj;
 CosDoc cd = PDDocGetCosDoc(currentPDDoc);
 CosObj cosPage = PDPageGetCosObj(pdPage); 
 
 ArrayObj = CosNewArray(cd,false,8);
 CosArrayPut(ArrayObj,0,CosNewFixed(cd,false, boundingRect.right));
 CosArrayPut(ArrayObj,1,CosNewFixed(cd,false, boundingRect.bottom));
 CosArrayPut(ArrayObj,2,CosNewFixed(cd,false, boundingRect.left));
 CosArrayPut(ArrayObj,3,CosNewFixed(cd,false, boundingRect.bottom));
 CosArrayPut(ArrayObj,4,CosNewFixed(cd,false, boundingRect.right));
 CosArrayPut(ArrayObj,5,CosNewFixed(cd,false, boundingRect.top));
 CosArrayPut(ArrayObj,6,CosNewFixed(cd,false, boundingRect.left));
 CosArrayPut(ArrayObj,7,CosNewFixed(cd,false, boundingRect.top));
 
 // Now create bounding rectangle points
 RecObj = CosNewArray(cd,false,4);
 CosArrayPut(RecObj,0,CosNewFixed(cd,false, boundingRect.left));
 CosArrayPut(RecObj,1,CosNewFixed(cd,false, boundingRect.right));
 CosArrayPut(RecObj,2,CosNewFixed(cd,false, boundingRect.bottom));
 CosArrayPut(RecObj,3,CosNewFixed(cd,false, boundingRect.top));



// These are the properties to set in order to create a text highlight
// PDF 32000-1:2008 12.5.6.10, “Text Markup Annotations”, Table 164 and Table 179
 CosObj cosDict = CosNewDict(cd, true, 4);
 CosDictPutKeyString(cosDict, "Subtype", CosNewNameFromString(cd, false, "Highlight"));
 CosDictPutKeyString(cosDict, "QuadPoints",ArrayObj);
 CosDictPutKeyString(cosDict, "Rect", RecObj);
 
 pdAnnot = PDAnnotFromCosObj(cosDict);
 PDPageAddAnnot(pdPage,-2,pdAnnot);
 PDPageNotifyContentsDidChange(pdPage);
 PDAnnotSetColor(pdAnnot, &red);
 AVPageViewDrawNow (currentPageView);
 PDPageRelease (pdPage); 
} 
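The eight QuadPoints values built above are just the four corners of the bounding rectangle listed pairwise, in the order (right, bottom), (left, bottom), (right, top), (left, top). A plain-Java sketch of the same packing (a hypothetical helper for illustration, not an SDK call):

```java
import java.util.Arrays;

public class QuadPointsSketch {
    // Pack a bounding rectangle into the 8-number QuadPoints layout used above:
    // (right,bottom) (left,bottom) (right,top) (left,top)
    static double[] quadPoints(double left, double bottom, double right, double top) {
        return new double[]{right, bottom, left, bottom, right, top, left, top};
    }

    public static void main(String[] args) {
        double[] q = quadPoints(10, 20, 110, 40);
        System.out.println(Arrays.toString(q));
    }
}
```

Keeping the corner order consistent matters: viewers interpret QuadPoints positionally, so a shuffled order produces a skewed or invisible highlight.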

Sunday, March 17, 2013

Configure Bitbucket with Eclipse

Create Bitbucket account and repository
1. Head on over to https://bitbucket.org and click on the giant yellow “Sign up free” button.
2. Fill out the sign up form.
3. At the top of the Bitbucket page, find the Repositories tab and mouse over it.
4. Select “Create new repository.”
5. Create your repository. (Note: Don’t forget to select “private” if you need a private repository.)
6. Select “Repositories” again from the top menu and click on the name of your newly created repository.
7. This will load a details page and you will see the url that you need to copy for interaction with this repository.

Configure eclipse
8. Fire up eclipse.
9. Find the “Help” menu on the far right of the menu bar.
10. Select “Install New Software…”
11. A software install dialog will pop up.
12. On the next screen you need to enter a URL for where the plugin can be found.
      The mercurial plugin for eclipse is located here:
      http://cbes.javaforge.com/update

Import Repository to local 
13. Now I’m going to go ahead and assume that someone has already created a repository and you are importing a project. If not, then you can skip a few steps and just commit and push your repository.
14. In Eclipse select “File” > “Import”.
15. Select the “Mercurial” folder and then “Clone Existing Mercurial Repository”.
16. Enter the repository URL, your username and password.
17. Go ahead and code away. Make sure to (right click and) commit any new files and folders that you create.

Content from the original article, click here.

Friday, September 21, 2012

Reading and displaying tf vector content of Apache Mahout SequenceFile

This Java code segment reads through an Apache Mahout SequenceFile (generated by the Mahout seq2sparse tool) to display the tokens and their tf values. The code uses the dictionary file to map each token index to its token.

You can read the Mahout wiki on creating vectors from text documents. I have created tf vectors from a directory of documents and followed the process explained in the wiki. You basically need to run the Mahout seqdirectory tool to create an intermediate SequenceFile, and then the seq2sparse tool to create the tf vectors and the dictionary file. Then you can use my code to examine the content of the SequenceFile and identify which terms in a document get higher term frequencies for further research.

In the first few lines, the code reads the dictionary file created by the Mahout seq2sparse tool and populates a HashMap; next, the reader walks the tf vectors in the SequenceFile, and each token index in a tf vector is mapped back to its token through the dictionary map. You may need to import the following:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.SequentialAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;
import org.apache.mahout.math.Vector.Element;

public static void readMahoutSequenceFile()
{
 Configuration conf = new Configuration();
 FileSystem fs;
 SequenceFile.Reader read;
 try {
  fs = FileSystem.get(conf);

  // read the dictionary file into a map of token index -> token
  read = new SequenceFile.Reader(fs, new Path("/Sparsedir/dictionary.file-0"), conf);
  IntWritable dicKey = new IntWritable();
  Text text = new Text();
  HashMap<Integer, String> dictionaryMap = new HashMap<Integer, String>();
  while (read.next(text, dicKey)) {
   dictionaryMap.put(dicKey.get(), text.toString());
  }
  read.close();

  // read the tf vectors and map each token index back to its token
  read = new SequenceFile.Reader(fs, new Path("/Sparsedir/tf-vectors/part-r-00000"), conf);
  Text key = new Text();
  VectorWritable value = new VectorWritable();
  SequentialAccessSparseVector vect;
  while (read.next(key, value)) {
   NamedVector namedVector = (NamedVector) value.get();
   vect = (SequentialAccessSparseVector) namedVector.getDelegate();
   for (Element e : vect) {
    System.out.println("Token: " + dictionaryMap.get(e.index()) + ", TF weight: " + e.get());
   }
  }
  read.close();
 } catch (IOException e) {
  e.printStackTrace();
 }
}
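The index-to-token mapping that the reader performs can be seen in isolation with plain collections. Here is a toy sketch with made-up dictionary entries and tf values (real ones come from the seq2sparse output files):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TfDictionaryToy {
    // Map each (token index, tf) pair back to its token, as the SequenceFile reader above does.
    static List<String> describe(Map<Integer, String> dictionary, Map<Integer, Double> tfVector) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<Integer, Double> e : tfVector.entrySet()) {
            lines.add("Token: " + dictionary.get(e.getKey()) + ", TF weight: " + e.getValue());
        }
        return lines;
    }

    public static void main(String[] args) {
        // Toy stand-ins for dictionary.file-0 and one sparse tf vector
        Map<Integer, String> dictionary = Map.of(0, "mahout", 1, "vector", 2, "hadoop");
        Map<Integer, Double> tfVector = new TreeMap<>(Map.of(0, 3.0, 2, 1.0));
        describe(dictionary, tfVector).forEach(System.out::println);
    }
}
```

Note that the tf vector is sparse: index 1 ("vector") never appears in it, so only the tokens that actually occur in the document are printed.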