Unable to extract all text elements from PDF file

The PDF Creator .NET Library enables you to create secure PDF documents on the fly and to view, print and process existing PDF documents. If you have any questions about PDF Creator .NET, please post them here.
m_kc
Posts: 1
Joined: Wed Sep 01 2010

Post by m_kc » Wed Sep 01 2010

We are currently working on an application that needs to extract text from PDF files, but we have encountered cases where not all of the text in a PDF file is extractable. We are currently calling GetObjectsInRectangle(l, t, r, b, acGetRectObjectsOptimize). For example, if we create a PDF file (e.g. using PDFCreator) from a Word or Notepad document containing the following text...

123 Hello World
Hello World 456

..., the IacObject.AttributeByName("Text").Value of the extracted objects contains only the numbers, i.e.

123
456

...none of the "Hello World" strings are extracted, or at least not as readable characters: "squares" and other special characters (/, %, $) are returned for some text objects.
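
A minimal sketch of how such unreadable results could be flagged after extraction, assuming the extracted strings are already available as plain Python strings. The helper name `looks_garbled` and the 0.5 threshold are hypothetical and not part of any Amyuni API; the heuristic simply measures how much of a string is neither a letter, a digit, nor ordinary whitespace.

```python
def looks_garbled(text: str, max_noise: float = 0.5) -> bool:
    """Return True when most characters are not letters, digits or plain whitespace.

    Hypothetical heuristic for illustration only: control characters ("squares")
    and stray symbols such as /, %, $ all count as noise.
    """
    if not text:
        return False
    noise = sum(
        1 for ch in text
        if not (ch.isalnum() or ch in " \t\r\n")
    )
    return noise / len(text) > max_noise


if __name__ == "__main__":
    # Strings similar to the ones described above.
    for sample in ["123", "Hello World 456", "\x03\x07/%$"]:
        print(repr(sample), looks_garbled(sample))
```

A check like this only reports the symptom; it does not recover the missing text.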

We have tried to extract the text using both our own code and the Amyuni sample application, without luck. Note that the PDF files are displayed correctly in both the Amyuni PDF viewer control and Adobe Reader (from which the text can also be copied to the clipboard correctly).

After some investigation we have found that the problem only affects PDF documents that contain specific OpenType/TrueType fonts (e.g. Lucida Console, Notepad's default font in Windows XP and 2000) for which a subset is embedded in the PDF file. Is Amyuni able to extract text set in Unicode fonts, or only in ASCII-based fonts (i.e. not Lucida Console and the like)?
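
One possible reason subset-embedded fonts behave differently, offered as an assumption and not something confirmed in this thread: in such PDFs the text strings may hold font-specific codes instead of character codes, and a reader then needs the font's ToUnicode CMap to turn those codes back into text. The sketch below illustrates that idea only. The CMap excerpt and the codes are made up, only `bfchar` entries are handled (no `bfrange`), and none of the names come from the Amyuni API.

```python
import re

# Hypothetical excerpt of a ToUnicode CMap: <font code> <UTF-16BE code point>.
TOUNICODE_EXCERPT = """
4 beginbfchar
<01> <0048>
<02> <0065>
<03> <006C>
<04> <006F>
endbfchar
"""


def parse_bfchar(cmap_text: str) -> dict[int, str]:
    """Build a code -> Unicode string table from bfchar pairs."""
    mapping = {}
    pairs = re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", cmap_text)
    for code_hex, uni_hex in pairs:
        mapping[int(code_hex, 16)] = bytes.fromhex(uni_hex).decode("utf-16-be")
    return mapping


def decode(codes: bytes, mapping: dict[int, str]) -> str:
    """Map each one-byte code to text; unmapped codes become U+FFFD."""
    return "".join(mapping.get(code, "\ufffd") for code in codes)


if __name__ == "__main__":
    raw = b"\x01\x02\x03\x03\x04"           # made-up string as stored in the page
    print(repr(raw.decode("latin-1")))       # read as character codes: control characters
    print(decode(raw, parse_bfchar(TOUNICODE_EXCERPT)))  # read through the map: readable text
```

An extractor that skips this mapping step would return the raw codes, which could show up as "squares" or unrelated symbols.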

Best Regards,

Morten Klitgaard
Lyngsoe Systems

Jose
Amyuni Team
Posts: 549
Joined: Tue Oct 01 2002

Re: Unable to extract all text elements from PDF file

Post by Jose » Thu Sep 16 2010

Hello,

This issue was resolved in an updated version of PDF Creator .NET.

Thanks
Jose
