I am using version 4.5.2.9 and have adapted the VB.NET sample code to extract all the text elements from a PDF file, but some of the text comes back as garbage, e.g. "W\DQGÀQDQFLDODI".
Here is the extract of my code that tries to read all the text objects from page 1:
For Each obj As Amyuni.PDFCreator.IacObject In arList
    ' You can access all properties of each object
    Dim attr As IacAttribute = obj.Attribute("ObjectType")
    Dim oPage As Integer = obj.PageNumber
    If oPage = i Then
        Dim oType As Integer = CInt(attr.Value)
        Dim oTypeMR As String = ""
        Select Case oType
            Case 5
                oTypeMR = "Text"
                Console.WriteLine(obj.Attribute("Text").Value)
                If Pass = 1 Then
                    ' Read the text, colour and font of the text object
                    Dim oTextText As String = CStr(obj.Attribute("Text").Value)
                    Dim oTextColor As String = CStr(obj.Attribute("TextColor").Value)
                    Dim oTextFont As String = CStr(obj.Attribute("TextFont").Value)
                    ' ... (rest of the extract omitted)
                End If
        End Select
    End If
Next
All of the fonts in the PDF are embedded as subsets, e.g. "AZVOLY+Arial Black,20.0000,400,0,0,0,0".
It appears that, depending on the subset, some fonts are read correctly and others are not.
Any help would be appreciated.
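To see which subsets are affected, it may help to log the subset tag next to each extracted string. The sketch below splits a TextFont value like the one quoted above into its subset tag and base font name. It assumes only that the font name is the first comma-separated field and that a subset is marked by six uppercase letters followed by "+" (the usual PDF convention). GetSubsetTag and GetBaseFontName are hypothetical helper names, not part of the Amyuni API.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module FontSubsetInfo
    ' Hypothetical helper: returns the subset tag (e.g. "AZVOLY"),
    ' or "" when the font name carries no subset prefix.
    Function GetSubsetTag(ByVal textFont As String) As String
        ' Assumption: the font name is the first comma-separated field.
        Dim fontName As String = textFont.Split(","c)(0)
        Dim m As Match = Regex.Match(fontName, "^([A-Z]{6})\+")
        If m.Success Then Return m.Groups(1).Value
        Return ""
    End Function

    ' Hypothetical helper: returns the font name without the subset prefix.
    Function GetBaseFontName(ByVal textFont As String) As String
        Dim fontName As String = textFont.Split(","c)(0)
        Return Regex.Replace(fontName, "^[A-Z]{6}\+", "")
    End Function

    Sub Main()
        Dim sample As String = "AZVOLY+Arial Black,20.0000,400,0,0,0,0"
        Console.WriteLine(GetSubsetTag(sample))     ' prints AZVOLY
        Console.WriteLine(GetBaseFontName(sample))  ' prints Arial Black
    End Sub
End Module
```

Calling GetSubsetTag(oTextFont) inside the loop and printing it with oTextText would show whether the garbled strings all come from the same subset fonts.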
Regards
Wes
Some text objects fail to read correctly
Re: Some text objects fail to read correctly
Hi,
Without seeing the PDF document, it is difficult to diagnose the issue.
However, I suggest that you look at the DelimitedText method. The DelimitedText() function retrieves only the text within a PDF object and not the string formatting (bounding box) of the object.
The link below points to our online help, where DelimitedText is explained further.
http://www.amyuni.com/WebHelp/Amyuni_PD ... ethod_.htm
Thanks
Re: Some text objects fail to read correctly
Hi,
It may help to call OptimizeDocument(1) before you read the text attributes.
This function seems to have the side effect of resolving Identity-H font encoding,
which helps when retrieving the text attributes.
Identity-H encoding seems to be used when (Unicode) fonts are partially embedded.
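With Identity-H, the strings stored in the PDF are glyph indices rather than character codes, so an unresolved string can look like shifted text. A quick check, sketched below, is to add a fixed offset to each character of the garbled string from the first post. The offset of 29 is an assumption: in many TrueType fonts the glyph index of a Basic Latin character is its character code minus 29, but this is not guaranteed for every font. ShiftChars is a hypothetical diagnostic helper, not a library function.

```vbnet
Imports System
Imports System.Text

Module GlyphShiftCheck
    ' Hypothetical diagnostic helper (not part of any PDF library):
    ' adds a fixed offset to the code of every character in s.
    Function ShiftChars(ByVal s As String, ByVal offset As Integer) As String
        Dim sb As New StringBuilder(s.Length)
        For Each c As Char In s
            sb.Append(ChrW(AscW(c) + offset))
        Next
        Return sb.ToString()
    End Function

    Sub Main()
        ' ASCII parts of the garbled sample "W\DQGÀQDQFLDODI".
        ' Assumption: glyph index = character code - 29.
        Console.WriteLine(ShiftChars("W\DQG", 29))      ' prints tyand
        Console.WriteLine(ShiftChars("QDQFLDODI", 29))  ' prints nancialaf
    End Sub
End Module
```

If the shifted output reads like fragments of the original text (here something like "...ty and financial af...", with the spaces lost and the À presumably standing for a ligature glyph), the garbage is most likely unresolved glyph indices, which fits the Identity-H explanation. This is only a diagnostic; it is not a reliable way to decode the text.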
Kind regards
Thomas