"A meek endeavor to the triumph" by Sampath Jayarathna

Monday, July 28, 2014

[Acrobat SDK Plug-in Development] How to extract all the terms from the PDF document and create an index (COS Dictionary)

This code snippet shows how to extract all the terms from a PDF document and store them in a COS dictionary (similar to a Java HashMap). You can use this method to collect every term together with its offset, and then use those offsets for other purposes such as highlighting terms.

I assume that you already know how to implement basic plug-in functionality using the Acrobat SDK. The version of the SDK used in this code example is the Acrobat XI SDK.
ACCB1 void ACCB2 termExtractor()
{
 // try to get front PDF document 
 AVDoc avDoc = AVAppGetActiveDoc();

 // Get the PDDoc and the page number shown in the active page view.
 PDDoc currentPDDoc = AVDocGetPDDoc(avDoc);
 AVPageView currentPageView = AVDocGetPageView (avDoc);
 ASInt32 pageNum = AVPageViewGetPageNum(currentPageView);

 //Create a PDWordFinderConfigRec object;
 PDWordFinderConfigRec pConfig;
 
 //Set the PDWordFinderConfigRec object's attributes
 memset(&pConfig, 0, sizeof(PDWordFinderConfigRec));
 pConfig.recSize = sizeof(PDWordFinderConfigRec);
 pConfig.ignoreCharGaps = true;
 pConfig.ignoreLineGaps = true;
 pConfig.noAnnots = true;
 pConfig.noEncodingGuess = true;

 //Create a PDWordFinder object
 PDWordFinder pdWordFinder = PDDocCreateWordFinderEx(currentPDDoc, WF_LATEST_VERSION, false, &pConfig);

 // Acquire all the terms inside the PDF page. 
 ASInt32 nWords = 0; // word count, filled in by PDWordFinderAcquireWordList
 PDWord wordInfo;
 PDWord *pXYSortTable;
 PDWordFinderAcquireWordList(pdWordFinder, pageNum,&wordInfo, &pXYSortTable, NULL, &nWords);

 // Create COS Dictionary to keep track of all the words and their offset. 
 CosDoc cd;
 CosObj Dict;
 cd = PDDocGetCosDoc(currentPDDoc);
 Dict = CosNewDict(cd,false,nWords); 
 for (int nWordCounter = 0; nWordCounter < nWords; nWordCounter++)
 {
  // Fetch the word at this index; this has to happen inside the loop
  PDWord pdNWord = PDWordFinderGetNthWord(pdWordFinder, nWordCounter);
  // Get the word as a string
  char stringBuffer[125];
  PDWordGetString (pdNWord, stringBuffer, sizeof(stringBuffer));
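  // pdfCorpus is not declared in this snippet; it is assumed to be an
  // output stream (for example a std::ofstream) defined elsewhere in the plug-in.
  pdfCorpus << stringBuffer;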
  
  // Add each term into COS Dictionary to use it later with highlighting method
  // Offset is the location of each term in the document. First term offset is 0 and next term is 1 etc. 
  bool keyExist = CosDictKnown(Dict,ASAtomFromString(stringBuffer));
  if (keyExist) // To-do: duplicate term
  {
   // To-do: handle duplicates
  }
  else // new term
  {
   CosDictPut(Dict,ASAtomFromString(stringBuffer), CosNewInteger(cd,false,nWordCounter)); 
  }
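 }

 // Release the word list and the word finder once the dictionary is built
 PDWordFinderReleaseWordList(pdWordFinder, pageNum);
 PDWordFinderDestroy(pdWordFinder);
}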
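
The duplicate-term branch in the loop is left as a to-do. One possible way to handle it, sketched here in plain standard C++ rather than with Acrobat SDK calls, is to map each term to the list of every offset where it occurs. The names TermIndex, buildTermIndex and firstOffset are hypothetical, and the word list is passed in as a std::vector<std::string> standing in for the PDWordFinder output.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the COS dictionary: term -> all offsets of that term.
using TermIndex = std::unordered_map<std::string, std::vector<int>>;

// Build the index from words in reading order. The offset of a word is its
// position in the list: the first word is 0, the next is 1, and so on.
TermIndex buildTermIndex(const std::vector<std::string>& words)
{
    TermIndex index;
    for (std::size_t i = 0; i < words.size(); ++i)
    {
        // operator[] creates an empty offset list the first time a term is
        // seen, so new and duplicate terms go through the same code path.
        index[words[i]].push_back(static_cast<int>(i));
    }
    return index;
}

// Return the first offset of a term, or -1 if the term is not in the index.
int firstOffset(const TermIndex& index, const std::string& term)
{
    auto it = index.find(term);
    if (it == index.end())
        return -1;
    assert(!it->second.empty()); // every stored term has at least one offset
    return it->second.front();
}
```

In the plug-in itself, the same idea could probably be carried over by storing a COS array of offsets under each key instead of a single integer, but I have not tested that variant.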

[Acrobat SDK Plug-in Development] How to create Text Highlight

If, instead of the method below, you want to know how to extract text directly using the Acrobat Text Highlight tool, please read my new blog entry: How to extract text from Acrobat Text Highlight Tool?

This code snippet shows how to create a text highlight plug-in using the Acrobat SDK. I assume that you already know how to implement basic plug-in functionality using the Acrobat SDK. The version of the SDK used in this code example is the Acrobat XI SDK. I also assume the following requirements:
  • Read PDF 32000-1:2008 12.5.6.10, “Text Markup Annotations”, for further information
    wwwimages.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
  • A COS dictionary with all the terms in the PDF document already exists. A COS dictionary is similar to a Java HashMap. It stores <Key,Value> pairs, in this case <term,offset> pairs.
  • The offset of a term is its index inside the PDF document. For example, the first term of the PDF document has offset 0, the next has offset 1, and so on.
  • You can use the PDWordFinderAcquireWordList() method to get the full word list and then use a loop to create the CosObj Dict. Read my previous post about how to extract terms from a PDF and create a COS dictionary.
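
The offset and length convention from the list above can be sketched in plain standard C++, without any Acrobat SDK calls. This is only an illustration under my own assumptions: HiliteSpan is a hypothetical struct that mirrors the offset and length fields of HiliteEntry, and findPhrase is a hypothetical helper that works on a plain word list instead of the COS dictionary.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical mirror of the two HiliteEntry fields used in this post.
struct HiliteSpan
{
    int offset; // index of the first term to highlight (0-based)
    int length; // number of consecutive terms to highlight
};

// Find the first place where all words of `phrase` appear consecutively in
// `words`. Returns {-1, 0} when the phrase does not occur.
HiliteSpan findPhrase(const std::vector<std::string>& words,
                      const std::vector<std::string>& phrase)
{
    assert(!phrase.empty()); // an empty phrase has nothing to highlight
    if (phrase.size() > words.size())
        return HiliteSpan{-1, 0};

    for (std::size_t start = 0; start + phrase.size() <= words.size(); ++start)
    {
        bool match = true;
        for (std::size_t k = 0; k < phrase.size(); ++k)
        {
            if (words[start + k] != phrase[k])
            {
                match = false;
                break;
            }
        }
        if (match)
            return HiliteSpan{static_cast<int>(start),
                              static_cast<int>(phrase.size())};
    }
    return HiliteSpan{-1, 0};
}
```

A single term gives a span of length 1, which matches hilite.length = 1 in the code that follows; a two-word phrase gives length 2 starting at the offset of its first word.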
void highlightText()
{  
 // The Cos dictionary Dict contains the term "explosive"; look up its offset
 CosObj offset = CosDictGet(Dict,ASAtomFromString("explosive"));
 // Create a HiliteEntry to hold the offset (start index)
 // and the length, i.e. how many terms to highlight.
 HiliteEntry hilite;
 hilite.offset = CosIntegerValue(offset); 
 hilite.length = 1;

 AVDoc currentAVDoc = AVAppGetActiveDoc();
 PDDoc currentPDDoc = AVDocGetPDDoc(currentAVDoc);
 AVPageView currentPageView = AVDocGetPageView(currentAVDoc); 
 ASInt32 pageNum = AVPageViewGetPageNum(currentPageView);
  
 PDEElement pdeElement;
 ASFixedRect boundingRect; // bounding rectangle of the term
 PDPage pdPage = PDDocAcquirePage (currentPDDoc, pageNum);
 PDAnnot pdAnnot;
 
 // Set the color you want to highlight your text
 PDColorValueRec red;
 red.space = PDDeviceRGB;
 red.value[0] = ASInt32ToFixed(1); 
 red.value[1] = 0; 
 red.value[2] = 0; 

 // highlight 
 AVPageViewSetColor(currentPageView, &red); 
 PDTextSelect textSelection = PDTextSelectCreateWordHilite(pdPage,&hilite, 1);

 AVDocSetSelection(currentAVDoc, ASAtomFromString("Text"),(void *)textSelection, true);
 AVPageViewDrawNow(currentPageView);
 AVDocShowSelection (currentAVDoc);
 
 // make text selection and get the bounding rectangle of the selection. 
 PDTextSelect selectedText = static_cast<PDTextSelect>(AVDocGetSelection(currentAVDoc));
 PDTextSelectGetBoundingRect(selectedText,&boundingRect);
  
 // use the bounding rectangle to create a highlight annotation QuadPoints
 // and bounding rectangle Cos objects. We need these 2 to create highlight type. 
 CosObj ArrayObj, RecObj;
 CosDoc cd = PDDocGetCosDoc(currentPDDoc);
 CosObj cosPage = PDPageGetCosObj(pdPage); 
 
 ArrayObj = CosNewArray(cd,false,8);
 CosArrayPut(ArrayObj,0,CosNewFixed(cd,false, boundingRect.right));
 CosArrayPut(ArrayObj,1,CosNewFixed(cd,false, boundingRect.bottom));
 CosArrayPut(ArrayObj,2,CosNewFixed(cd,false, boundingRect.left));
 CosArrayPut(ArrayObj,3,CosNewFixed(cd,false, boundingRect.bottom));
 CosArrayPut(ArrayObj,4,CosNewFixed(cd,false, boundingRect.right));
 CosArrayPut(ArrayObj,5,CosNewFixed(cd,false, boundingRect.top));
 CosArrayPut(ArrayObj,6,CosNewFixed(cd,false, boundingRect.left));
 CosArrayPut(ArrayObj,7,CosNewFixed(cd,false, boundingRect.top));
 
 // Now create the annotation rectangle. PDF rectangles are ordered
 // [left, bottom, right, top]: lower-left corner first, then upper-right corner.
 RecObj = CosNewArray(cd,false,4);
 CosArrayPut(RecObj,0,CosNewFixed(cd,false, boundingRect.left));
 CosArrayPut(RecObj,1,CosNewFixed(cd,false, boundingRect.bottom));
 CosArrayPut(RecObj,2,CosNewFixed(cd,false, boundingRect.right));
 CosArrayPut(RecObj,3,CosNewFixed(cd,false, boundingRect.top));



 // These are the properties to set in order to create a text highlight
 // PDF 32000-1:2008 12.5.6.10, “Text Markup Annotations”, Table 164 and Table 179
 CosObj cosDict = CosNewDict(cd, true, 4);
 CosDictPutKeyString(cosDict, "Subtype", CosNewNameFromString(cd, false, "Highlight"));
 CosDictPutKeyString(cosDict, "QuadPoints",ArrayObj);
 CosDictPutKeyString(cosDict, "Rect", RecObj);
 
 pdAnnot = PDAnnotFromCosObj(cosDict);
 PDPageAddAnnot(pdPage,-2,pdAnnot);
 PDPageNotifyContentsDidChange(pdPage);
 PDAnnotSetColor(pdAnnot, &red);
 AVPageViewDrawNow (currentPageView);
 PDPageRelease (pdPage); 
}
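
The geometry step is easy to get wrong, so here is a minimal sketch of it in plain standard C++, without the Cos layer. It is an illustration only: Rect, quadPointsFor and annotRectFor are hypothetical names, and plain doubles replace the fixed-point fields of ASFixedRect. The QuadPoints order copies the corner order used in the code above, and the Rect order follows the PDF convention of lower-left corner followed by upper-right corner.

```cpp
#include <array>
#include <cassert>

// Hypothetical plain-double rectangle; the SDK's ASFixedRect uses fixed-point fields.
struct Rect
{
    double left, top, right, bottom;
};

// The eight QuadPoints values in the same corner order as the code above:
// bottom-right, bottom-left, top-right, top-left.
std::array<double, 8> quadPointsFor(const Rect& r)
{
    assert(r.left <= r.right && r.bottom <= r.top); // expects a normalized rectangle
    std::array<double, 8> q = {{r.right, r.bottom, r.left, r.bottom,
                                r.right, r.top,    r.left, r.top}};
    return q;
}

// The Rect entry in PDF order: lower-left x, lower-left y, upper-right x, upper-right y.
std::array<double, 4> annotRectFor(const Rect& r)
{
    assert(r.left <= r.right && r.bottom <= r.top); // expects a normalized rectangle
    std::array<double, 4> a = {{r.left, r.bottom, r.right, r.top}};
    return a;
}
```

Keeping both arrays derived from one rectangle makes it harder for the Rect and QuadPoints entries to drift apart when the highlight later covers more than one term.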