The Autodoc Storage modules help you store documents based on their content and naming. The Stirling services are used to read the content of PDF files, but you can also use other mechanisms to extract the document text.
In all nodes, msg.payload contains the name of the file being processed. The message object is enriched with the text content of the document and with properties that depend on the content of the file.
A workflow for a PDF document could be:
- Monitor a directory for a dropped/new document.
- Lock the document, so that no further triggers can fire.
- Set some fixed properties, like own department name or storage base paths.
- Extract the text of the document, to be used for further analysis. (If no text is found, enrich the document with text by using the "ocr scan" node and repeat this step.)
- Set the auto properties, such as the language and the date/time of the file and of the document content.
- Search properties by using regular expressions to find information like "invoice" or company names.
- Store the document by using the found properties.
- Unlock the file, and delete or move the document if it is no longer needed.
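The "search properties" and "store" steps above can be sketched in plain JavaScript. This is a minimal illustration only, not the implementation of the "search props" or "store File" nodes: the helper names `searchProps` and `buildTargetPath`, the rule format, the property names `DocType`, `Company` and `Year`, and the paths are all hypothetical, and `textProperties` is assumed to be a plain object.

```javascript
// Hypothetical helpers: NOT the code of the "search props" or "store File" nodes.

// Apply regular-expression rules to msg.textContent and record what was found.
function searchProps(msg, rules) {
  const text = msg.textContent || "";
  msg.textProperties = msg.textProperties || {};
  for (const rule of rules) {
    const match = rule.pattern.exec(text);
    if (match) {
      // Prefer the first capture group, then the rule's fixed value, then the whole match.
      msg.textProperties[rule.name] = match[1] ?? rule.value ?? match[0];
      msg.lastMatchContent = match[0];
    }
  }
  return msg;
}

// Build a target path from a base path, selected property values and the source file name.
function buildTargetPath(basePath, props, folderKeys, sourceFile) {
  // Skip properties that were not found; replace characters that are unsafe in folder names.
  const folders = folderKeys
    .filter((key) => props[key] !== undefined && props[key] !== "")
    .map((key) => String(props[key]).replace(/[\\\/:*?"<>|]/g, "_"));
  const fileName = sourceFile.split(/[\\\/]/).pop();
  return [basePath.replace(/\/+$/, ""), ...folders, fileName].join("/");
}

const found = searchProps(
  {
    payload: "/inbox/scan-001.pdf",
    textContent: "ACME Corp\nInvoice No. 4711\nTotal: 99.00 EUR",
  },
  [
    { name: "DocType", pattern: /invoice/i, value: "Invoice" },
    { name: "Company", pattern: /(ACME Corp|Globex)/ },
  ]
);

const target = buildTargetPath(
  "/archive",
  found.textProperties,
  ["Company", "DocType", "Year"],
  found.payload
);
console.log(target); // /archive/ACME Corp/Invoice/scan-001.pdf
```

In this sketch, properties that were not found (here `Year`) are simply left out of the path; a real flow might prefer to route such documents to a review folder instead.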
The following elements are used in the message object during processing.
element | type | content |
---|---|---|
payload | string | holds the name of the source file to be processed. |
textContent | string | Parsed document content as a string (available after "read PDF"). |
textProperties | map[name,value] | Properties found while processing the flow. |
lastMatchContent | string | Partial text of textContent (last search of a regular expression). |
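As an illustration of the table above, a message might look like this after the text has been read and some properties have been set. Only the four element names come from the table; all values, the `DocType` property and the `hasText` helper are made up for this sketch, and `textProperties` is assumed to be a plain object.

```javascript
// Illustrative message after "read PDF" and the property nodes have run.
// All values are invented; only the four top-level element names are from the table.
const exampleMsg = {
  payload: "/inbox/scan-001.pdf",
  textContent: "ACME Corp\nInvoice No. 4711\nTotal: 99.00 EUR",
  textProperties: { Input: "CommonArea", DocType: "Invoice" },
  lastMatchContent: "Invoice",
};

// Hypothetical guard: decide whether text was found or an OCR scan is still needed.
function hasText(msg) {
  return typeof msg.textContent === "string" && msg.textContent.trim().length > 0;
}

console.log(hasText(exampleMsg)); // true
console.log(hasText({ payload: "/inbox/photo.pdf" })); // false
```

A check like `hasText` could be used in a function or switch node to route documents without text to the OCR step before continuing.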
node | short | objective |
---|---|---|
"lock file" | Locks and unlocks the source file | Avoids unexpected removal and suppresses additional flow triggers. A file can be removed only when it has no lock (default). |
"set props" | Setting props | Sets fixed properties, depending on the flow (e.g. "Input=CommonArea"). |
"search props" | Searching props | Finds context-specific properties, such as parts of the filename or the file content. Can search for existing and non-existing entries. |
"auto props" | Automatic props | Sets default properties. The language (ENU, DEU, ISO) and the timestamp of the document and file can be detected. |
"read PDF" | Read PDF text | Extracts the text of a PDF document. (Needs the Stirling service in place.) |
"ocr scan" | Read/scan PDF text | Initiates an OCR scan on a PDF document and extracts the text. Can replace the input file with an enriched version. Use this node when "read PDF" cannot detect valid text. (Needs the Stirling service in place.) |
"build PDF" | Build a PDF file | Builds a new PDF file from one or more pictures (jpg, png). The PDF does NOT contain valid text, so you should run "ocr scan" on it after creating it. (Needs the Stirling service in place.) |
"store File" | Store the file | Stores the input file in a location by building the target path and name from the properties found. |
The "examples" folder contains a sample workflow with some test cases for the nodes, so you can get familiar with them.
See: "Test cases"