The OCR module enables developers to:
Convert non-searchable PDF files into searchable PDFs
Create searchable PDF documents out of various image formats such as multi-page TIFF, JPEG or PNG while applying text recognition on the images
Compress image-based PDF documents using high-compression JBIG2 or the more standard CCITT, JPEG and PNG compression formats
The OCR module can be licensed either independently of our other PDF components or as an add-on to existing licenses.
The Amyuni OCR module is based on the Tesseract open-source project, with the Amyuni PDF technology being used to process and create the PDF documents. The Tesseract library provides high reliability at a low cost and spares developers the annoyances of licensing commercial OCR tools, which are often licensed on a per-page basis or at a very high cost to the developer.
Open multi-page TIFF files directly into PDF Creator for OCR (Optical Character Recognition) and save the documents in PDF format.
Convert image-based or non-searchable PDF documents into searchable PDFs.
Apply JBIG2 compression, which greatly reduces the size of scanned documents. Other standard compression formats such as CCITT, JPEG or PNG can also be used.
Support for multiple languages such as English, French, Italian, German, Portuguese, Spanish, Dutch and Vietnamese.
Obtain up to 98% accuracy on English language documents.
Extracted text can be either visible or hidden inside the PDF document. In both cases, the text is positioned as close as possible to the original text.
Extracted text can be saved to a regular text file rather than to a PDF file.
Rasterize any PDF document to convert it into an image-based searchable or non-searchable PDF.
Benefit from a robust PDF library that can create highly optimized and well-structured PDF documents that can be emailed and viewed in any PDF-compatible viewer.
Use Tesseract version 4.1.
32-bit: Windows Vista, XP, 7, 8, 8.1, 10, Windows Server 2000, 2003, 2008 R2, 2012 R2, 2016 and 2019.
64-bit: Windows Vista, XP, 7, 8, 8.1, 10, Windows Server 2000, 2003, 2008 R2, 2012 R2, 2016 and 2019.
PDFCreactiveX.dll is the main ActiveX control that hosts the Amyuni PDF Library and the interface to the OCR engine.
acPDFCreatorLib.Net.Dll is the .NET class library that is equivalent to the PDFCreactiveX.dll ActiveX control. Developers can use either the ActiveX control or the .NET library and do not need to include both.
The Tesseract DLL contains the Tesseract OCR engine. This DLL and the Tessdata folder described below should be located in the same folder as PDFCreactiveX.dll.
The Tessdata folder contains all the dictionaries used by the OCR engine. Each language is supported by 8 dictionary files prefixed with the language name, e.g. deu for German. If not all languages are needed, only the files for the required languages need to be distributed, e.g. only the eng- and fra-prefixed files to support English and French alone.
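As a rough illustration of distributing only the required languages, the Python sketch below copies just the dictionary files for the chosen language prefixes into a deployment folder. It is not part of the Amyuni product: the function names are hypothetical, and the sketch assumes that each dictionary file name starts with the language code followed by a dot (for example eng.), which should be checked against the actual contents of the Tessdata folder.

```python
import shutil
from pathlib import Path


def select_language_files(tessdata_dir, languages):
    """Return the files in tessdata_dir whose names start with '<lang>.'.

    Assumption: dictionary files are named '<language code>.<something>'.
    """
    prefixes = tuple(f"{lang}." for lang in languages)
    return sorted(
        p for p in Path(tessdata_dir).iterdir()
        if p.is_file() and p.name.startswith(prefixes)
    )


def copy_language_files(tessdata_dir, target_dir, languages):
    """Copy only the selected languages' files into target_dir.

    Returns the names of the copied files, in sorted order.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in select_language_files(tessdata_dir, languages):
        shutil.copy2(src, target / src.name)
        copied.append(src.name)
    return copied
```

For English and French only, a call such as copy_language_files("Tessdata", "deploy/Tessdata", ["eng", "fra"]) would leave the files for all other languages out of the deployment folder (both paths here are placeholders).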
All the samples provided in this documentation assume that the developer is using the ActiveX version (PDFCreactiveX.dll). When using the .NET version (acPDFCreatorLib.Net.Dll), the functions are very similar, although the code is slightly different. Rather than duplicating all the documentation and sample code, we have chosen to provide a complete .NET sample at the end of this documentation.