Posted Fri, 27 Apr 2018 16:44:11 GMT by cody m3as

We are looking to implement FullText (OCR) to search our images for matching text. We are using a .NET application. Where might I find API / technical documents to start me off on researching?

 

Posted Fri, 27 Apr 2018 17:00:42 GMT by Joe Kaufman Bell Laboratories Inc No longer there

Cody,

I am not sure I follow you... If you turn on full-text indexing for a file cabinet then you can search for keywords from the web client and via URL Integration. Why are you needing to do your own customized full-text searching?

If you are interested in OCR from .NET, you can probably start with one of the companies DocuWare itself apparently uses, Nuance (at least according to the "About" page of listed licensing in play). Looks like Nuance makes something called OmniPage:

https://www.nuance.com/print-capture-and-pdf-solutions/training/developer-portal.html

Alternately, if you just do a web search on "OCR" you will find lots of different tools including some freeware ones like FreeOCR.

Is that the kind of thing you are looking for? DocuWare's fulltext indexing is pretty darn good, and it is relatively simple to run searches against the fulltext field from the Platform SDK, I think.

 

Thanks,

Joe Kaufman

Posted Mon, 30 Apr 2018 11:59:32 GMT by cody m3as

Thanks for the detailed reply!

We have a Windows Forms application displaying the images from one of our remote DocuWare databases via the Windows Forms WebBrowser component. It builds up a URI string for the browser to navigate to, e.g.:

http://Docuware123/Docuware/Platform/WebClient/1/Integration?lc=AAA&p=V&fc=KEY&did=<documentId>

I presume we would just append the FullText parameter we retrieve from user input to the URI? Is there more info on how I would do this?

 

Thanks again!

Posted Mon, 30 Apr 2018 12:25:35 GMT by Joe Kaufman Bell Laboratories Inc No longer there

Cody,

Do you mean you would like to see the actual textshot behind the fulltext indexing of the document, that you want to search via fulltext keywords, or something else?

If you want to do a fulltext search via URL integration, you could do something like:

http://<servername>/DocuWare/Platform/WebClient/Client/Result?fc=<file cabinet GUID>&q=DocuWareFulltext=<keyword>&displayOneDoc=True&orgId=1&_auth=<authorization>

This is a very rough URL -- you would need to study up on using URL integration. I can no longer find a link to the document I found about it, but here is a thread where I discussed overcoming a login issue:

https://www.docuware.com/forum/english-forums/docuware-help-technical-problems/solved-url-integration-powertools-dll-not-working

You can download the URL Creator and play around with that to see how to do various queries. The bottom line is that you need to encode the URL so that characters like the equals sign don't interfere, and the auth credentials also need to be encoded in a certain way (and you can go even further with encrypting the URL if you wish). URL integration can then show a result list or display the document if a single one is returned.
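
As a rough illustration of the encoding step, the C# sketch below turns a query expression into a URL-safe Base64 token: UTF-8 bytes, `+` and `/` swapped for `-` and `_`, and the `=` padding replaced by a digit count. Whether DocuWare expects exactly this form is an assumption and is not confirmed anywhere in this thread, so compare the output with what URL Creator generates before relying on it. The class and method names are made up for the example.

```csharp
using System;
using System.Text;

// Hypothetical helper: both the name and the exact encoding scheme are assumptions.
public static class UrlTokenSketch
{
    // Encodes text as URL-safe Base64 so that characters such as '='
    // cannot be mistaken for URL syntax.
    public static string Encode(string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return string.Empty;
        }

        string base64 = Convert.ToBase64String(Encoding.UTF8.GetBytes(text));

        // Strip the '=' padding and remember how many characters were removed.
        string trimmed = base64.TrimEnd('=');
        int padding = base64.Length - trimmed.Length;

        // Swap the two characters that are not URL-safe, then append the padding count.
        return trimmed.Replace('+', '-').Replace('/', '_') + padding;
    }
}
```

The encoded value would then go where the rough URL above has the `q` parameter, for example `UrlTokenSketch.Encode("DocuWareFulltext=invoice")`.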

You can also use the Platform SDK to do searches and get back Document objects that could then be manipulated further (e.g. downloaded). Using the SDK is a whole additional layer.

If you are wanting access to the actual textshot generated when the server OCRs documents, I do not know how to do that (haven't looked into it).

Thanks,
Joe Kaufman

Posted Mon, 30 Apr 2018 12:48:30 GMT by cody m3as

I am not sure if we need text shot. We need to supply a fulltext parameter and pull back the subset of images/ documents that contain the search text parameter, presumably using OCR.

For more context, this is for an image search/ search results feature we are building.

Apologies if this isn't clear, thanks again for the prompt reply!

Posted Mon, 30 Apr 2018 12:54:46 GMT by Joe Kaufman Bell Laboratories Inc No longer there

Cody,

Then yeah, URL integration would work. Are you building the forms in .NET? Then I highly recommend learning the Platform SDK and utilizing the DocuWare NuGet packages -- it works quite well. You can do searches on the fulltext fields from there as well and get back Document objects to work with further. You are then in control of how to present and view the documents as opposed to utilizing URL integration which leverages the web client (result list and viewer).

 

Thanks,

Joe Kaufman

Posted Mon, 30 Apr 2018 17:50:00 GMT by cody m3as

Joe-

After revisiting the design, I have another question that may not pertain to URL integration anymore. The Docuware documents we store are linked to a business GUID we assign.

We have another database that stores the business representation of the records that we filter on for non-Docuware related metadata, which produces a list of GUID results. I believe I'd like to filter, again, on that list of GUIDs in the Docuware database, but this time for any documents linked to that GUID that contain the fulltext search parameter.

At a high level, the algorithm would:

  1. Search against our business data records to produce a list of GUID results matching the business search criteria.
  2. Send the GUIDs over to the Docuware database that presumably already maps those GUIDs to their related documents.
  3. Refine the list of GUIDs once more, but this time applying the fulltext search parameter.

Does this make sense? I think this links into what you mentioned earlier about the TextShot. I am a little lost on that concept since I am new to Docuware.

I am logged into one of our testing Docuware databases and I see a table named "FCAB_<foo>_TMW_TMW_PAGE" that has a column for TEXTSHOT. Would I utilize this in my logic somehow?

To clarify, I am not asking for your solution to the algorithm, but for information on how I could utilize TEXTSHOT/ Docuware specific technologies to achieve this goal. Thanks!

 

Posted Mon, 30 Apr 2018 18:11:26 GMT by Joe Kaufman Bell Laboratories Inc No longer there

Cody,

I am not sure I completely understand the GUID part of the question, but it sounds like you are basically saying you want to do a search that pulls down documents by one of the indexed fields as well as a keyword in the fulltext (OCR'd) portion of the document.

That can be done by simply doing a compound query -- you search both fields at the same time.

Here is an example of querying via the Platform SDK:

http://help.docuware.com/sdk/platform/html/b8b6ddf4-fee5-4b5c-84cc-046895c8aa5a.htm

Alternately, you could use URL integration and search on multiple fields in one query. Query expressions can involve multiple fields ANDed together. The name of the full text field in any cabinet where full-text is enabled is "DocuWareFulltext". I just used URL Creator to search a file cabinet where fulltext contained a certain item code and another field equaled the word "Customer" -- came back with the exact document I needed. My "free query" in the URL creator looked like:

[DocuWareFulltext] = "22112" AND [PROMO_TYPE] = "Customer"

Play around with the URL Creator and do free queries involving that GUID field you mention and the fulltext field at the same time. Should be doable. I do not know how to do a fulltext query on documents already pulled down. As I said, I have never needed to pull the textshot down along with the document and manipulate it further. I would just do the full query all at once and get down exactly what I need. Let the DocuWare server do all the work and just get down relevant hits.
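
For anyone assembling such free queries in code rather than typing them into URL Creator, here is a minimal C# sketch that joins field/value pairs into the same shape as the example above. The helper name is hypothetical, and values containing double quotes are not escaped because the escaping rules are not covered in this thread.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: builds a free-query string shaped like the example above.
public static class FreeQuerySketch
{
    // Each pair becomes [FIELD] = "value"; pairs are joined with the given operator.
    // No escaping is applied to values (an open assumption).
    public static string Build(
        IEnumerable<KeyValuePair<string, string>> conditions,
        string op = "AND")
    {
        var parts = conditions.Select(c => $"[{c.Key}] = \"{c.Value}\"");
        return string.Join($" {op} ", parts);
    }
}
```

Passing the pairs `DocuWareFulltext`/`22112` and `PROMO_TYPE`/`Customer` reproduces the query string shown above.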

Thanks,
Joe Kaufman

Posted Mon, 30 Apr 2018 18:59:00 GMT by cody m3as

1. Where may I find the URL Creator software? I could not find it online.

2. What configuration would I have to enable at the administration level to use the full text feature this way?

 

Thank you for your patience and help!

Posted Mon, 30 Apr 2018 19:20:30 GMT by Joe Kaufman Bell Laboratories Inc No longer there

Cody,

I believe URL Creator is part of the DocuWare Desktop tools. The same way you would install scanning or import jobs, that's how you install URL Creator.

But if you are wanting to do native SQL Server queries, that's a whole other can of worms, and I don't think URL Integration will get you where you want to be. .NET would definitely be ideal, and the "proof of concept" would be cleaner from a long-term standpoint.

I do not know how to decode the textshots stored in the database. People have asked about it here:

https://www.docuware.com/forum/english-forums/docuware-questions-about-u...

but you may want to start another thread specifically asking how to decode the textshot directly from the database (and it may not be possible -- that is always the risk you run when trying to query directly to the backend database).

As far as Question 2, to enable full text? You just check the checkbox when you create the file cabinet. In the web client configuration it is a checkbox that says, "Fulltext search" on the "General" tab when configuring a cabinet. Turn that on, and the field "Fulltext" becomes available in search dialogs, and that system field is available to query via URL Integration and the Platform SDK API. Note: the server will start chugging when you check that box, as it will start generating textshots for all the documents already in the cabinet, and all documents added from then on.

Thanks,
Joe Kaufman

Posted Wed, 02 May 2018 14:12:00 GMT by cody m3as

Joe-

Thanks to your help I have deliberated with our technical team and it sounds like we will take the recommended SDK route! We do have some concerns about the expressiveness of the query API and performance, however. Going back to your suggestion:

I am not sure I completely understand the GUID part of the question, but it sounds like you are basically saying you want to do a search that pulls down documents by one of the indexed fields as well as a keyword in the fulltext (OCR'd) portion of the document.

That can be done by simply doing a compound query -- you search both fields at the same time.

You are pretty much correct in your understanding, but ideally we would like to construct a function signature for the DocuWare SDK/ Server to handle such as:

(listOfGuids, fullText) -> listOfGuids

 

However, browsing the SDK docs, it looks like this may be the best we can do:

public IEnumerable<Guid> RunQuery(Dialog dialog)
{
    // NOTE: Test values.
    var businessGuids = new List<Guid>() { Guid.NewGuid(), Guid.NewGuid(), /* ... */ Guid.NewGuid() };
    var fullText = "test";

    var firstQuery = new DialogExpression()
    {
        Condition = new List<DialogExpressionCondition>()
        {
            // QUESTION: What field name should we reference to tell DocuWare to
            // perform the full text search on the documents in the Create method?
            DialogExpressionCondition.Create("TEXTSHOT?", fullText)
        }
    };

    var filteredBusinessGuids = dialog.GetDocumentsResult(firstQuery)
                                      .Items
                                      // QUESTION: How would we get this column value in DWDATA?
                                      // Our business Guid column lives in our FCAB base table as noted
                                      // in the query earlier.
                                      .Where(i => businessGuids.Contains((Guid)i["BUSINESS_GUID"].Item))
                                      .Select(i => (Guid)i["BUSINESS_GUID"].Item);

    return filteredBusinessGuids;
}

Does this make sense? Please address the questions in the comments in the code if possible.

Additional questions:

1. Will the SDK features we need support DocuWare version: 6.7.0.6960?

2. We have performance concerns because we are running a fulltext search on potentially thousands of documents and bringing them into memory, where we filter again based on our business-filtered Guids. Is there a better way to do this, e.g., passing the list of Guids along with the full text search parameter for the DocuWare server to process? (See the function signature as noted earlier)

Posted Wed, 02 May 2018 16:13:15 GMT by Joe Kaufman Bell Laboratories Inc No longer there

Cody,

The name of the fulltext field is the same one I listed in the URL integration: "DocuWareFulltext"

So, to call the RunQuery() method with more than one argument, I could do something like the following code that sets up variables and then makes the call:

    Dialog dialog = <generate a Dialog object for the search dialog for the cabinet>;
    List<DialogExpressionCondition> exprConditions = new List<DialogExpressionCondition>();
    // *** NOTE *** Wild cards are allowed in expression conditions, emulating a SQL "LIKE".
    exprConditions.Add(DialogExpressionCondition.Create("DocuWareFulltext", "the"));
    exprConditions.Add(DialogExpressionCondition.Create("<OTHER_FIELD_NAME>", "<value to search for>"));
    List<SortedField> sortFields = new List<SortedField>();
    DocumentsQueryResult result = DWFuncs.RunQuery(dialog, exprConditions, sortFields, DialogExpressionOperation.And);

That would do an AND query on the two conditions, that fulltext contains the word "the" and that <OTHER_FIELD_NAME> equals "<value to search for>".

Now, you have a bit of a tricky situation. What you want to search for looks like:

((Guid = val1) OR (Guid = val2) OR (Guid = val3)...) AND (DocuWareFulltext = "<search term>")

As far as I know, this cannot be done. You cannot parenthesize a search in that way and mix your ANDs and ORs. If there is a way, I have not seen it. Even in some parts of DocuWare that allow free queries, using parentheses with regard to the Boolean conditions does not seem doable.

But there is good news. Remember, the DocumentsQueryResult (which is simply a collection of Document objects) is not a bulging mass of all the documents pulled down over the wire. It is just the header information for the documents returned. For example, in the query above when I did a full-text search on the word "the", I got back 1800 Document objects in about six seconds. It isn't as fast as doing straight SQL Server queries on regular indexes, but it is still quick. And each Document object can be checked quickly via a loop:

    for (int i = 0; i < result.Items.Count; i++)
    {
        // Index by the loop variable so each document is visited, not just the first one.
        Document doc = result.Items[i];
        string val = doc.Fields[3].Item.ToString();
    }

Now, this means you have to know which field is which, ordinally (the object does not support named indexing). But if the Guid you are after is in, say, field 6 (zero-indexed), you could scan through all documents and find out which one has a field 6 that matches your list of Guids you are looking for.

In other words, lead with the fulltext search and then filter down from there. Because you can filter down further on the Guid -- you can't filter down further on full-text unless you find a way to pull down and decode the full-text on the query result. If you ever DID find a way to get the text back and do your own search, then you could do an OR query by guids at the start and then analyze those documents.
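
To make the "filter down on the Guid afterwards" step concrete, here is a minimal C# sketch. It assumes each returned document has already been reduced to a list of field values as strings (a stand-in for reading the fields off the Document objects, which is not shown), and that the position of the Guid field is known. All names are hypothetical. A HashSet keeps each lookup cheap even with thousands of hits.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: keeps only the business Guids that appear in the query hits.
public static class GuidFilterSketch
{
    // rows: one list of field values per document (stand-in for the header data
    // returned by the fulltext query).
    // guidFieldIndex: zero-based position of the Guid field in each row.
    public static List<Guid> MatchBusinessGuids(
        IEnumerable<IReadOnlyList<string>> rows,
        int guidFieldIndex,
        IEnumerable<Guid> businessGuids)
    {
        var wanted = new HashSet<Guid>(businessGuids);
        var matches = new List<Guid>();

        foreach (var row in rows)
        {
            // Skip rows that are too short to contain the Guid field.
            if (guidFieldIndex < 0 || guidFieldIndex >= row.Count)
            {
                continue;
            }

            // Skip values that are empty or not valid Guids.
            if (Guid.TryParse(row[guidFieldIndex], out Guid value) && wanted.Contains(value))
            {
                matches.Add(value);
            }
        }

        return matches;
    }
}
```

If several documents share the same business Guid, that Guid appears once per document in the result; wrap the result in a HashSet if only distinct values are needed.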

Like I said, though, DocuWare is pretty dang fast. So if you can lead with the full-text and let DocuWare run that query, that part of the filtering is done. If you don't have any full-text to search for, use an OR-based query with all the business guids joined together and just get all the documents back at once.

I do not know if any of this works in version 6.7 (we are on 6.11). But 6.7 (aka "platform Eagle") is still listed in the API help here:

http://help.docuware.com/sdk/platform-eagle/html/66b2ed1e-2aef-452a-97cd-5014bbf0242b.htm

so that is a good sign...

For completeness, here is my RunQuery() function that I have in namespace "DWFuncs":

==================================

       public static DocumentsQueryResult RunQuery(Dialog dialog, List<DialogExpressionCondition> exprConditions, List<SortedField> sortFields = null, DialogExpressionOperation operation = DialogExpressionOperation.And)
        {
            // *** NOTE *** Wild cards are allowed in expression conditions, emulating a SQL "LIKE".
            LastError = "";
            if (dialog == null)
            {
                LastError = "Parameter 'dialog' cannot be null.";
                return null;
            }
            if ((exprConditions == null) || (exprConditions.Count == 0))
            {
                LastError = "Parameter 'exprConditions' cannot be null or empty.";
                return null;
            }
            DocumentsQueryResult queryResult = null;
            try
            {
                DialogExpression queryExpr = new DialogExpression();
                queryExpr.Count = int.MaxValue;
                queryExpr.Operation = operation;
                queryExpr.Condition = exprConditions;
                if ((sortFields != null) && (sortFields.Count > 0))
                {
                    queryExpr.SortOrder = sortFields;
                }
                else
                {
                    // Sort field list is null or has no elements. So, use a default sort order for results.
                    queryExpr.SortOrder = new List<SortedField>();
                    queryExpr.SortOrder.Add(SortedField.Create("DWSTOREDATETIME"));
                }
                queryResult = dialog.Query.PostToDialogExpressionRelationForDocumentsQueryResult(queryExpr);
            }
            catch (Exception ex)
            {
                LastError = ex.Message;
                queryResult = null;
            }
            return queryResult;
        }

==================================

 

Hope this helps,

Joe Kaufman

 

Posted Tue, 08 May 2018 16:26:40 GMT by cody m3as

Thanks again Joe. I have successfully integrated the SDK client into our app!

Some follow up questions:

1. Do you have any recommendations on when to call ServiceConnect.Disconnect()?

2. Are there any limits on how many service connections the servers can have open? We have ~5k clients that could potentially be using this feature.

Posted Tue, 08 May 2018 18:31:00 GMT by Joe Kaufman Bell Laboratories Inc No longer there

Cody,

When I use the Platform SDK from non-.NET tools, I don't even bother to call any disconnects or Logoff resources. Licenses get released after a half hour or so anyway (or something like that)...  Here is a thread about timeouts that sort of discusses things:

https://www.docuware.com/forum/english-forums/docuware-help-technical-problems/license-problem-when-using-platform-sdk

But if you are in a single Platform SDK application, everything can run through a single ServiceConnect object and through one license.

If, however, you are going to have 5,000 executables accessing the Platform SDK, then you are going to burn 5,000 licenses. It is not a matter of what the server can handle, it is a matter of what the license structure can handle. After all, if 5000 clients are accessing DocuWare via the web client, they are all establishing a connection and burning a license, too.

It depends on the architecture of the integration you wrote. Did you write it as something that runs on a web server and uses a single username against the DocuWare server? Then you will only be using one service connection (though that seems a little dicey with regard to licensing -- not sure how the legality of that all works).

If you do an explicit disconnect in a .NET application, the license will still be held open for 120 more seconds. Even if you called disconnects in everyone's app and the process was completely closed, the license is still open. That's why you have to be careful with spawning a lot of operations on separate connections. If you keep using the same ServiceConnect, then you are always only using one license no matter how many times you disconnect and reconnect.
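
One way to make sure a process only ever creates one connection is to hide it behind a lazily initialized holder. The C# sketch below is generic and does not touch the DocuWare SDK at all: the type parameter stands in for whatever object your existing connect code returns, and the class name is hypothetical.

```csharp
using System;
using System.Threading;

// Hypothetical holder: creates the connection once, on first use, and then
// hands the same instance to every caller in the process.
public sealed class SharedConnection<T> where T : class
{
    private readonly Lazy<T> _connection;

    // connect: the code that currently creates your connection object.
    public SharedConnection(Func<T> connect)
    {
        if (connect == null)
        {
            throw new ArgumentNullException(nameof(connect));
        }

        // ExecutionAndPublication guarantees the factory runs at most once,
        // even when several threads ask for the value at the same time.
        _connection = new Lazy<T>(connect, LazyThreadSafetyMode.ExecutionAndPublication);
    }

    public T Value => _connection.Value;
}
```

Keep a single instance of the holder (for example in a static field) and read `Value` wherever a connection is needed. Note that with this thread-safety mode an exception thrown by the factory is cached and rethrown on later reads, so a failed first login would need a fresh holder.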

The best way to see how things behave is to play around with your custom integration and watch the connections overview in the DW Admin tool. Watch the "License in use until" field and see how it allots 30 minutes then reduces to 2 minutes when a process disconnects.

You may want to post your question on a new thread, otherwise folks are not going to take a look since connection and licensing issues don't really have anything to do with your OP at this point...

 

Good luck!

Joe Kaufman
