Some text objects fail to read correctly

The PDF Creator .NET Library enables you to create secure PDF documents on the fly and to view, print and process existing PDF documents. If you have any questions about PDF Creator .Net, please post them here.
Wesdpl
Posts: 1
Joined: Tue Nov 06 2012

Post by Wesdpl »

I am using version 4.5.2.9 and have adapted the VB.NET sample code to extract all the text elements from a PDF file, but some of the text that comes back is garbage, e.g. "W\DQGÀQDQFLDODI".

Here is the excerpt of my code that reads all the text objects from page 1:

For Each obj As Amyuni.PDFCreator.IacObject In arList
    ' You can access all properties of each object
    Dim attr As IacAttribute = obj.Attribute("ObjectType")
    Dim oPage As Integer
    oPage = obj.PageNumber
    If oPage = i Then
        Dim oType As Integer
        oType = CInt(attr.Value)
        Dim oTypeMR As String = ""
        Select Case oType
            Case 5 : oTypeMR = "Text"
                Console.WriteLine(obj.Attribute("Text").Value)
                If Pass = 1 Then
                    Dim oTextText As String
                    oTextText = obj.Attribute("Text").Value
                    Dim oTextColor As String
                    oTextColor = obj.Attribute("TextColor").Value
                    Dim oTextFont As String
                    oTextFont = obj.Attribute("TextFont").Value

All of the fonts in the PDF are embedded subsets, e.g. "AZVOLY+Arial Black,20.0000,400,0,0,0,0".

It appears that, depending on the subset, text in some fonts is read correctly and text in others is not.

Any help would be appreciated.

Regards
Wes
Jose
Amyuni Team
Posts: 553
Joined: Tue Oct 01 2002

Re: Some text objects fail to read correctly

Post by Jose »

Hi,

Without seeing the PDF document, it is difficult to diagnose the issue.

However, I suggest that you look at the DelimitedText method. The DelimitedText() function retrieves only the text within a PDF object, not the string formatting (bounding box) of the object.

The link below points to our online help, where DelimitedText is explained further.
http://www.amyuni.com/WebHelp/Amyuni_PD ... ethod_.htm

Thanks
ThomasUttendorfer
Posts: 5
Joined: Fri Dec 02 2016

Re: Some text objects fail to read correctly

Post by ThomasUttendorfer »

Hi,
It may help to call OptimizeDocument(1) before you read the text attributes.

This function seems to have the side effect of resolving the Identity-H font encoding, which helps when retrieving the text attributes.
Identity-H encoding seems to be used when (Unicode) fonts are partially embedded.

Kind regards
Thomas
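
To illustrate why text in such fonts can come out scrambled: with Identity-H, the codes stored in the page content are typically glyph IDs rather than character codes, so a reader that does not apply the font's own mapping returns glyph numbers as if they were characters. The sketch below assumes (nothing in this thread confirms it) that the subset font keeps the ordering common in TrueType fonts such as Arial, where glyph ID = ASCII code - 29. Under that assumption, shifting the garbled sample from the first post by 29 gives readable letters. The names GlyphShiftDemo, DecodeGlyphIds and GlyphOffset are hypothetical, and no Amyuni API is used.

```vb
Imports System
Imports System.Text

Module GlyphShiftDemo
    ' Assumption: glyph ID = ASCII code - 29 (a common TrueType glyph order).
    ' This is a diagnostic guess, not something the PDF guarantees.
    Const GlyphOffset As Integer = 29

    ' Treat each character of the garbled string as a glyph ID and shift it
    ' back into the printable ASCII range; anything else becomes "?".
    Function DecodeGlyphIds(garbled As String) As String
        Dim sb As New StringBuilder()
        For Each ch As Char In garbled
            Dim code As Integer = AscW(ch) + GlyphOffset
            If code >= 32 AndAlso code <= 126 Then
                sb.Append(ChrW(code))
            Else
                sb.Append("?"c)
            End If
        Next
        Return sb.ToString()
    End Function

    Sub Main()
        ' Sample string taken from the first post in this thread.
        Console.WriteLine(DecodeGlyphIds("W\DQGÀQDQFLDODI"))
    End Sub
End Module
```

Run as a console program, this prints tyand?nancialaf. That is consistent with source text along the lines of "...ty and financial af...", presumably with the spaces dropped and one glyph (possibly an "fi" ligature) that has no ASCII equivalent. If the shifted output is readable, the problem is most likely a missing glyph-to-character mapping rather than a corrupted file.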