Unable to extract all text elements from PDF file

Post by **m_kc** » Wed Sep 01 2010

We are currently working on an application that needs to extract text from PDF files, but we have encountered cases where not all of the text in a PDF file is extractable. We are currently using GetObjectsInRectangle(l, t, r, b, acGetRectObjectsOptimize). For example, if we create a PDF file (e.g. using PDFCreator) from a Word or Notepad document containing the following text...

123 Hello World
Hello World 456

..., the IacObject.AttributeByName("Text").Value of the extracted objects contains only the numbers, i.e.

123
456

...none of the "Hello World" text strings are extracted, or at least not as readable characters: "squares" and other special characters (/, %, $) are extracted for some text objects.
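To make the symptom easier to reproduce and report, here is a minimal sketch of how the extracted strings could be screened for unreadable characters. It is not Amyuni code: `looks_garbled` and the sample `extracted` list are hypothetical, and the list simply stands in for whatever Text values the extraction call returns.

```python
import unicodedata

def looks_garbled(text: str) -> bool:
    """Return True if the string contains characters unlikely to be readable text.

    Hypothetical helper: flags control characters, private-use and unassigned
    code points, and the Unicode replacement character (often shown as a square).
    """
    for ch in text:
        if ch in "\t\n\r":
            continue  # ordinary whitespace is fine
        category = unicodedata.category(ch)
        if ch == "\ufffd" or category in ("Cc", "Co", "Cn"):
            return True
    return False

# Assumed sample data standing in for the extracted Text attribute values.
extracted = ["123", "\x03\x0b\x0f", "456"]
flagged = [s for s in extracted if looks_garbled(s)]
```

This only catches non-printable output. Text that comes back as ordinary but wrong characters (such as the /, %, $ mentioned above) would pass this check and still has to be compared against the expected content.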

We have tried to extract the text using both our own code and the sample Amyuni application, without luck. Note that the PDF files are displayed correctly in both the Amyuni PDF viewer control and Adobe Reader (from which the text can also be copied to the clipboard correctly).

After some investigation we have found that the problem only applies to PDF documents containing specific (OpenType / TrueType) fonts (e.g. Lucida Console, Notepad's default font in Windows XP and 2000) for which a subset is embedded in the PDF file. Is Amyuni able to handle extraction of text based on Unicode fonts, or only ASCII-based fonts (i.e. not Lucida Console and the like)?

Best Regards,

Morten Klitgaard
Lyngsoe Systems

Post by **Jose** » Thu Sep 16 2010

Hello,

This issue was resolved in an updated version of PDF Creator.NET.

Thanks
Jose

Amyuni Technologies
