Some text objects fail to read correctly

Post by **Wesdpl** » Tue Nov 06 2012

I am using version 4.5.2.9 and have adapted the VB.NET sample code to extract all the text elements from a PDF file, but some of the text comes back as garbage, e.g. "W\DQGÀQDQFLDODI".

Here is an extract of my code, at the point where it tries to read all the text objects from page 1:

    For Each obj As Amyuni.PDFCreator.IacObject In arList
        'you can access all properties of each object
        Dim attr As IacAttribute = obj.Attribute("ObjectType")
        Dim oPage As Integer
        oPage = obj.PageNumber
        If oPage = i Then
            Dim oType As Integer
            oType = CInt(attr.Value)
            Dim oTypeMR As String = ""
            Select Case oType
                Case 5 : oTypeMR = "Text"
                    Console.WriteLine(obj.Attribute("Text").Value)
                    If Pass = 1 Then
                        Dim oTextText As String
                        oTextText = obj.Attribute("Text").Value
                        Dim oTextColor As String
                        oTextColor = obj.Attribute("TextColor").Value
                        Dim oTextFont As String
                        oTextFont = obj.Attribute("TextFont").Value

All of the fonts in the PDF are font subsets, e.g. "AZVOLY+Arial Black,20.0000,400,0,0,0,0".

It appears that, depending on the subset, text in some fonts is read correctly and text in others is not.

Any help would be appreciated.

Regards
Wes

Post by **Jose** » Mon Jan 14 2013

Hi,

Without seeing the PDF document, it is difficult to pinpoint the issue.

However, I suggest that you look at the DelimitedText method. The DelimitedText() function retrieves only the text within a PDF object and not the string formatting (bounding box) of the object.

The link below points to our online help, where DelimitedText is explained further.
http://www.amyuni.com/WebHelp/Amyuni_PD ... ethod_.htm

Thanks

Post by **ThomasUttendorfer** » Wed Sep 12 2018

Hi,
It may help to call OptimizeDocument(1) before you read the text attributes.

This function seems to have the side effect of resolving the Identity-H font encoding,
which makes it possible to retrieve the text attributes correctly.
Identity-H encoding seems to be used when (Unicode) fonts are partially embedded.

Kind regards
Thomas
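
A quick way to check whether Identity-H is the likely cause is to treat the garbled characters as glyph indices rather than character codes. The sketch below is a diagnostic only, not a fix and not part of the Amyuni API. It assumes that the font is a TrueType subset whose basic Latin glyph indices equal the character code minus 29, which is common for Arial-family fonts but not guaranteed. `ShiftGlyphIndices` and `GlyphOffset` are hypothetical names introduced for illustration.

```vbnet
Imports System
Imports System.Text

Module GlyphIndexCheck
    ' Assumption: basic Latin glyph index = character code - 29.
    ' This holds for many Arial-family TrueType fonts, but not for every font.
    Private Const GlyphOffset As Integer = 29

    ' Hypothetical helper: reinterprets each character of the garbled string
    ' as a glyph index and shifts it back into the printable ASCII range.
    Function ShiftGlyphIndices(raw As String) As String
        Dim sb As New StringBuilder()
        For Each ch As Char In raw
            Dim code As Integer = AscW(ch) + GlyphOffset
            If code >= 32 AndAlso code <= 126 Then
                sb.Append(ChrW(code))
            Else
                ' Ligatures and other glyphs fall outside the simple offset range.
                sb.Append("?"c)
            End If
        Next
        Return sb.ToString()
    End Function

    Sub Main()
        ' The sample string from the first post.
        Console.WriteLine(ShiftGlyphIndices("W\DQGÀQDQFLDODI"))
        ' prints "tyand?nancialaf"
    End Sub
End Module
```

The result looks like a fragment of "...ty and financial af...", with the "?" standing in for a glyph that is probably the "fi" ligature and the spaces lost because their glyph index maps to a non-printable character. If your own garbled strings decode in the same way, the text objects are most likely returning raw glyph indices from an Identity-H encoded subset font, which is consistent with the explanation above.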
