Dropping Unwanted Characters from DW Import OCR Strings

Posted Thu, 25 Jul 2019 05:30:27 GMT by Patrick Keough, American Electronic Imaging Company, Inc., CEO

- I have a customer who wants to store about 12,000 documents, and he wants to use DW Import templates to do it. There are two types of documents: checks and letters. The checks are related to the letters by a matching account number and dollar amount (required for the Stapler job).
- He wants to extract the account number and the dollar amount contained in both.
- I have two questions...
- FIRST, the 17-digit account number is printed on both document types with hyphens for ease of reading, but the hyphens are not used in the DW file cabinet. The ACCOUNT NUMBER field in DW is a VARCHAR 20 field with no mask assigned (I guess they just manually type the 17 digits when storing a new document to the file cabinet). What is the best way to OCR the account number and store it without the hyphens? I cannot find a DW Import template equivalent of the way you could configure the DW Windows Desktop Client REC2 module to ignore specified characters.
- I thought perhaps the answer was to create a mask for the field, but nobody seems to understand REGTXT well enough to suggest the correct filter syntax. I'm not sure this would work, but until I have the correct REGTXT I can't test it.
- SECOND, the letters contain a dollar amount as part of the text. The dollar amounts vary from $.nn up through $nnn.nn. The dollar amount is immediately followed by a comma to set the amount off in the sentence to which it belongs. Again, in the old DW Windows Desktop REC2 module we could set an OCR zone and have it grab everything between the "$" and the comma, but there is no similar function in Web Client DW Import.
- I can't use the character count filters because the character counts vary based on the dollar amount.
- I see that there is a substitution function in DW Import template, but I can't figure out how to make it work.
- Any suggestions?

Posted Thu, 25 Jul 2019 08:49:05 GMT by Phil Robson, DocuWare Corporation, Senior Director Professional Services, Americas

Patrick,
You won't be able to remove the hyphens. There simply is not sufficient flexibility to drop those characters wherever they appear. Masking won't help: DocuWare does not alter incoming data to fit a mask. The data has to be formatted as per the mask; otherwise it will fail.
As long as this is not a Cloud installation, you may need to consider a very carefully written database trigger to remove the hyphens.
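
For illustration only, here is a rough sketch of what such a trigger could look like. It uses Python's built-in sqlite3 module as a stand-in database so it can be run anywhere; the table name `documents`, the column `account_number`, and the helper `store_account_number` are all hypothetical and are not DocuWare's actual schema. On an engine such as MySQL the same idea would more likely be written as a BEFORE INSERT trigger that assigns to NEW.account_number, which SQLite does not allow.

```python
import sqlite3

# Hypothetical table and column names; a real file cabinet table will differ.
SCHEMA = """
CREATE TABLE documents (
    id INTEGER PRIMARY KEY,
    account_number VARCHAR(20)
);

-- SQLite cannot rewrite NEW.* in a BEFORE trigger, so this stand-in
-- cleans the row right after it has been inserted.
CREATE TRIGGER strip_hyphens_after_insert
AFTER INSERT ON documents
FOR EACH ROW
WHEN NEW.account_number LIKE '%-%'
BEGIN
    UPDATE documents
    SET account_number = REPLACE(NEW.account_number, '-', '')
    WHERE id = NEW.id;
END;
"""


def store_account_number(conn, raw_value):
    """Insert an OCR'd value and return what the table holds afterwards."""
    cur = conn.execute(
        "INSERT INTO documents (account_number) VALUES (?)", (raw_value,)
    )
    row = conn.execute(
        "SELECT account_number FROM documents WHERE id = ?", (cur.lastrowid,)
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
stored = store_account_number(conn, "12345-678901-234567")
print(stored)  # prints 12345678901234567
```

The WHEN clause keeps the trigger from touching rows that are already clean. Anything like this should be tried on a copy of the database first, since it changes data behind the application's back.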

As for the $ Amount, obviously you can only capture that if the value is in the same place on every document, but the OCR does have the ability to extract text between known characters. Maybe we need to see one of these documents to advise further.
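
To make "text between known characters" concrete, here is a small Python sketch of the rule being asked for: take whatever sits between the "$" and the comma that follows the amount. This is not how DW Import is configured; it only illustrates the matching logic. The function name `extract_amount` and the sample sentences are made up, and the pattern assumes the amounts stay within the $.nn to $nnn.nn range described in the question.

```python
import re
from typing import Optional

# "$", up to three digits, a decimal point, two digits, then a comma.
# The range ($.nn through $nnn.nn) comes from the original question.
AMOUNT_BEFORE_COMMA = re.compile(r"\$(\d{0,3}\.\d{2}),")


def extract_amount(ocr_text: str) -> Optional[str]:
    """Return the first dollar amount followed by a comma, without the "$"."""
    match = AMOUNT_BEFORE_COMMA.search(ocr_text)
    return match.group(1) if match else None


# Made-up sample sentences, not taken from the customer's letters.
print(extract_amount("We received your payment of $125.40, which was applied."))  # 125.40
print(extract_amount("A fee of $.75, as noted, remains due."))  # .75
```

Because the match is anchored on the "$" and the comma rather than on a character count, it copes with amounts of different lengths.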

Phil Robson
Senior Director Client Services, Americas

Posted Fri, 26 Jul 2019 15:16:59 GMT by Patrick Keough, American Electronic Imaging Company, Inc., CEO

The answer was to remove the hyphens in post capture using a MySQL script.

The script I used was:

UPDATE table_name SET column_name = REPLACE(column_name, '-', '');

The first set of quotes inside the parentheses identifies what I wanted to remove and the second set of quotes identifies what I wanted it replaced by, so in this case, remove the hyphen and replace it with nothing.
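
As a runnable illustration of that statement, here is a minimal Python sketch. It uses the built-in sqlite3 module as a stand-in for MySQL (REPLACE behaves the same way for this case), and the table `documents`, the column `account_number`, and the helper `strip_hyphens` are hypothetical names, not the real file cabinet schema. The added WHERE clause is an assumption of mine, so that only rows that actually contain a hyphen are updated.

```python
import sqlite3


def strip_hyphens(conn: sqlite3.Connection) -> int:
    """Remove hyphens from every account number that has one; return rows changed."""
    cur = conn.execute(
        "UPDATE documents "
        "SET account_number = REPLACE(account_number, '-', '') "
        "WHERE account_number LIKE '%-%'"
    )
    conn.commit()
    return cur.rowcount


# Throwaway in-memory table with made-up account numbers.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (id INTEGER PRIMARY KEY, account_number VARCHAR(20))"
)
conn.executemany(
    "INSERT INTO documents (account_number) VALUES (?)",
    [("12345-678901-234567",), ("76543210987654321",)],
)

changed = strip_hyphens(conn)
rows = [r[0] for r in conn.execute("SELECT account_number FROM documents ORDER BY id")]
print(changed, rows)  # prints 1 ['12345678901234567', '76543210987654321']
```

Running it a second time changes nothing, so the cleanup can be repeated after each import batch.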

I still have to figure out the correct way to snip OCR text using the recognized characters before and after it; I'm not having any luck with it so far.

Posted Wed, 04 Sep 2019 07:20:57 GMT by Christian Prantl, Manager Product

Hello Patrick

There is one trick that might help you in doing this:

In DocuWare Import you can define a "separator" within a text string and use only a part of this string. For example, if you have the number 123-456-789, you define "-" as a separator and then use one of the parts of the string as an index term. If you do this for every segment, you can store the number without hyphens.
Note: this only works correctly if the number of separators is the same for all documents.
Please see the attached screenshots for a better understanding of this function. In the screenshot I connected a text zone three times to an index field, and each connection (click on the edit icon within the Text content of the index field) is defined so that just one part between the separators is used. (Click on Select text -> Between separator and define the separator and position.)
In this example the text entry in the file cabinet will not be 201/113/40209 but 20111340209.
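
The effect of connecting the zone once per segment can be sketched in a few lines of Python. This only mimics the logic; it is not DocuWare code, and the function name `parts_between_separators` is made up for the example.

```python
from typing import List


def parts_between_separators(text: str, separator: str, positions: List[int]) -> str:
    """Concatenate the chosen 1-based segments of text split on separator."""
    segments = text.split(separator)
    if max(positions) > len(segments):
        # Mirrors the note above: the separator count must match what was configured.
        raise ValueError(
            f"expected at least {max(positions)} segments, found {len(segments)}"
        )
    return "".join(segments[p - 1] for p in positions)


print(parts_between_separators("201/113/40209", "/", [1, 2, 3]))  # prints 20111340209
```

If a document has fewer separators than configured, the sketch raises an error; if it has more, the extra segments are silently left out, which is why the separator count needs to be the same on every document.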

I know this function is not really straightforward for your use case, and you also asked about SQL scripts. But I thought this information might help you in some way to set up your system as desired.

Best regards

Christian Prantl
Senior Manager Product

Separator_text_content.PNG (110,1 KB)
