FullText OCR not recognizing text in PDFs

Posted Tue, 23 Jan 2018 19:27:48 GMT by Joe Kaufman Bell Laboratories Inc No longer there

Hey all,

I am starting to develop a file cabinet for our packaging group, sort of like what the Rawlings company did and showed off in that DocuWare marketing video (http://info.docuware.com/free-webinar-rawlings).

The first set of files I am uploading are PDF files that are considered the "print version" or "final print" images. These images are not stored as JPG, TIFF, or PNG, but as PDFs. The packaging supervisor says these PDFs have the text "stripped out of them" to keep the files pure for print usage. I am not sure about the details on that, but it sounded like standard practice which our print vendors dictate (or at least appreciate).

When I uploaded these PDFs to the new cabinet I created (which has full text turned on), nothing was recognized as a full-text keyword. I changed the OCR file list to a "black" list, keeping out only audio and video files. I did notice, though, that PDF files were not specifically in the list of file types to whitelist or blacklist.

When I convert the PDF files to PNG and upload those, full text finds everything inside and I can search the documents as if they had been scanned in. Works beautifully.

But we need that to work with PDFs. Can DocuWare not fulltext OCR these PDF files that are "image only"? This will be pretty sad if we cannot, as it was one of the biggest reasons I was pushing for this for people (and it was a big selling point in the Rawlings business case scenario).

Let me know what can be done. Greatly appreciated!

Thanks,

Joe Kaufman

Posted Wed, 24 Jan 2018 14:57:22 GMT by Phil Robson DocuWare Corporation Senior Director Professional Services, Americas

Joe,
Generally, we cannot OCR an image-based PDF that has been re-printed to a PDF. That said, can you email me a couple of sample documents and indicate what the document type was before being printed or converted to PDF?

Phil Robson
Senior Director Support - Americas

Posted Wed, 24 Jan 2018 15:17:37 GMT by Joe Kaufman Bell Laboratories Inc No longer there

Phil,

I do not know what the PDF was before, as it is exported from Adobe Illustrator. Not sure what other processing is done to it or what format is native to AI.

I will send you one of these PDFs via email. I am not holding out much hope, as I just uploaded a bunch to Google Drive, and when converting the PDF to a Google Doc, only the stamp annotation text came through. So Google apparently cannot glean any text from the embedded image either.

Thanks,

Joe Kaufman

Posted Wed, 24 Jan 2018 15:22:47 GMT by Phil Robson DocuWare Corporation Senior Director Professional Services, Americas

Without knowing more, if the file was exported from Illustrator as a PDF and then re-printed to a PDF printer, then that is the root problem and cannot be resolved. The PDF should be imported to DocuWare without further processing after exporting from the source.

Phil

Posted Wed, 24 Jan 2018 15:39:13 GMT by Joe Kaufman Bell Laboratories Inc No longer there

I am told the font information is stripped out so the rendering is exactly as the printer needs without there being any font rendering issues.

My contact says they could simply start exporting a PDF WITH the font information first, for archival purposes, and then send the stripped PDF on to the final print destination. That would give us full-text back.

To put it another way (so that I can understand it myself), the PDF is uber-flattened in such a way that it is basically just an image inside the PDF, and such PDFs do not OCR well unless you convert them to TIFF or PNG, etc.
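For anyone who wants to check a file before uploading it, here is a rough Python sketch of a test for whether a PDF carries any font information at all. It is only a heuristic and not anything DocuWare does internally: the function name `pdf_mentions_fonts` is made up for this example, and it simply looks for the `/Font` name in the raw bytes and inside Flate-compressed streams. A result of False suggests a flattened, image-only (or outlined-text) PDF like the ones described above. A result of True does not guarantee the artwork text is searchable, since something like a stamp annotation can bring its own font.

```python
import re
import zlib


def pdf_mentions_fonts(data: bytes) -> bool:
    """Heuristic: True if the PDF bytes reference a font anywhere.

    Checks the uncompressed parts of the file first, then tries to
    inflate each stream, because newer PDFs can keep their dictionaries
    inside compressed object streams.
    """
    if b"/Font" in data:
        return True
    for match in re.finditer(rb"stream\r?\n(.*?)endstream", data, re.DOTALL):
        try:
            # decompressobj tolerates trailing bytes such as the newline
            # that sits before "endstream".
            inflated = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            # Not a Flate stream (e.g. a JPEG image); skip it.
            continue
        if b"/Font" in inflated:
            return True
    return False
```

To use it on a real file, read the file in binary mode and pass the bytes in, for example `pdf_mentions_fonts(open("label.pdf", "rb").read())`, where `label.pdf` is a placeholder name.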

Apparently Rawlings uses a different format for their final storage, which is why full-text works well for them.

Thanks,

Joe Kaufman

Posted Wed, 24 Jan 2018 15:51:34 GMT by Phil Robson DocuWare Corporation Senior Director Professional Services, Americas

That starts to make sense. You now have two options: the source change, or my suggestion to print it to PDF again. One thing I did not mention is that, in the print options in Phantoms PDF creator, I did not elect to reprint the PDF as an image, so I really am not sure how these PDF engines work.

Phil

Posted Wed, 24 Jan 2018 16:02:33 GMT by Joe Kaufman Bell Laboratories Inc No longer there

You and me both. *smile* We will figure it out, though. Just another reminder to test various configurations first before implementing a whole solution. Now that we caught this we should be fine moving forward with just a few business workflow tweaks.

Thanks for your help!

Thanks,

Joe Kaufman
