Document Capture Project Tips
In real estate, it’s ‘location, location, location’. With document capture software, it’s ‘preparation, preparation, preparation’, as you’ll soon see.
There’s a lot to think about when planning a document capture implementation if you want to get it right the first time. There is, of course, the document preparation, the software and the hardware; we’ll leave that last one to the end.
We’ll assume for now that you’ve already selected your document capture software brand (another blog will go into how to select the brand that works best for you). Here are some of the steps to get going:
Where are the documents coming from? That’s a key question. Often they come from multiple sources: paper, of course, but also faxes, email attachments and files that already exist in a folder somewhere on the network. Ideally, your capture system will be capable of pulling documents from all of these sources. We rely on docAlpha for this complete input solution, whereas TeleForm, our other solution, can’t pull content from existing emails. Once the document sources have been identified, move on to the next step: building the capture logic that teaches the capture system how to classify each document and how to locate and identify the fielded information to be captured.
Workflow Design & Rules
This step is usually completed in the Designer application. You would identify the field names and the field types: hand print, machine print, bar codes, bubbles or boxes (OMR), image zones and others. Each field type will have property settings that define the length of the field and whether it’s numeric, alpha or alphanumeric with specific character placements. For validation, programmatic logic and database look-ups can often be applied to help ensure the system produces the correct values and to reduce the likelihood of false positives. Of course, you will also have defined the page size and orientation (portrait or landscape).
Most capture systems have internal logic that will uniquely identify each document by the combination and location of fields and other printed elements on the page, like titles, headings, and possibly graphical anchors or bar codes. In this way, the capture system can automatically classify (identify) not only the document type and the logic to be applied to it, but also the orientation, and it can work out the relative positioning of all the data elements as X and Y coordinates without you having to get technical and do it yourself. Sometimes you may have to add some unique elements to a page when two or more documents look very similar. For instance, if one form is a French equivalent of an existing English form, the similarity might confuse the system. Each capture system has its own way to eliminate that confusion.
For many forms and documents, invoices for instance, the system can usually calculate fields, like columns and rows of numbers, to assure the operator that the printed number is equal to the calculated number.
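To make the field properties, validations and calculated fields described above a little more concrete, here is a minimal Python sketch. It is not how docAlpha, TeleForm or any other product implements these features; the `FieldSpec` class, the `validate` and `totals_agree` functions and the postal code pattern are all hypothetical names and values chosen for illustration.

```python
import re
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class FieldSpec:
    """Hypothetical field definition, loosely mirroring what a designer tool records."""
    name: str
    kind: str        # e.g. "hand_print", "machine_print", "barcode", "omr"
    max_length: int  # maximum number of characters allowed in the field
    pattern: str     # regular expression the whole value must match


def validate(spec: FieldSpec, value: str) -> list[str]:
    """Return a list of problems with a recognized value (an empty list means it passed)."""
    problems = []
    if len(value) > spec.max_length:
        problems.append(f"{spec.name}: longer than {spec.max_length} characters")
    if not re.fullmatch(spec.pattern, value):
        problems.append(f"{spec.name}: does not match the expected format")
    return problems


def totals_agree(line_amounts: list[str], printed_total: str) -> bool:
    """Check that the sum of the line items equals the total printed on the document."""
    # Decimal avoids the rounding surprises of binary floats with money values.
    return sum(Decimal(amount) for amount in line_amounts) == Decimal(printed_total)


# Assumed example: a Canadian-style postal code with fixed character placements.
postal_code = FieldSpec("postal_code", "hand_print", 7, r"[A-Z]\d[A-Z] ?\d[A-Z]\d")
print(validate(postal_code, "K1A 0B1"))   # well-formed value
print(validate(postal_code, "K1A 0BI"))   # letter I misread in a digit position
print(totals_agree(["19.99", "5.01"], "25.00"))
```

A value that fails either check would typically be flagged for an operator to review rather than rejected outright, which is where the verification step later in this post comes in.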
For those of you looking for a system that will accurately read handwriting, that is, cursive writing, look no further: no technology can reliably do it. The one exception is reading handwriting on cheques, but that’s a highly constrained document type with only a few possible combinations, which makes it a workable project. The average 8.5” x 11” document with handwriting is a lost cause; don’t even go there.
The target database or databases that you want to populate are the final destination for your data. Usually only metadata, that is, the captured fielded information, goes into the database. Yes, the file images can be inserted into modern databases, but they’re often such large files that inserting them as BLOBs (Binary Large Objects) quickly bloats typical databases and can slow them down. Both docAlpha and TeleForm can export data into multiple databases and database types at the same time. This can be useful if, for business reasons, part of the data has to go to one place and the rest somewhere else. Now that you’ve defined your workflow, it’s time to test everything.
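As a rough illustration of the metadata-versus-image point above, the following Python sketch stores the captured field values in a table and keeps only a path to the image file instead of inserting the image itself as a BLOB. The table name, the column names and the `export_record` function are assumptions made for this example, and SQLite simply stands in for whatever target database you actually use.

```python
import sqlite3


def export_record(conn: sqlite3.Connection, fields: dict[str, str], image_path: str) -> int:
    """Insert captured field values plus a pointer to the image file; return the new row id."""
    # Hypothetical schema: two captured fields and the location of the scanned image.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS captured_forms (
               id INTEGER PRIMARY KEY,
               first_name TEXT,
               last_name TEXT,
               image_path TEXT
           )"""
    )
    cur = conn.execute(
        "INSERT INTO captured_forms (first_name, last_name, image_path) VALUES (?, ?, ?)",
        (fields["first_name"], fields["last_name"], image_path),
    )
    conn.commit()
    return cur.lastrowid


conn = sqlite3.connect(":memory:")  # in-memory database, for demonstration only
row_id = export_record(conn, {"first_name": "Ada", "last_name": "Lovelace"}, "scans/0001.tif")
print(row_id)
```

The image stays on the file system or in a document repository, and the database row only records where to find it.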
Scanning & Testing
Ideally, your scanning should be done on the same scanner you’re going to use in production. From your Designer application, print several hard copies of the form or document to be captured and fill them out with sample dummy data for testing purposes. Use a black pen; pencil is not advised, as it’s often too light and may not be accurately picked up by the scanner.
From here, your capture recognition system will have to be launched to apply the designed rules and interpret your hand-printed dummy data. This is usually an automatic process, although sometimes an Admin person may be required to release the data to the next step, Verification.
Verify Your Data
Once the data has been recognized, you’ll be able to see in the Verifier module how well the recognition system read the data points you just entered. You may find that your hand printing needs to adapt to the field size, or that slanted text isn’t read well. This is where you see what the system saw, and you’ll make adjustments in the Design Workflow to improve how it reads the data. Typically, many changes are needed. Once data has been accepted, it is exported. You’ll have to look at that as well, to make sure the data array is presented the way you want it: in the order and in the format you’re expecting. Keep testing and changing the Design Workflow until you get the best results possible. Once you’ve finished your test samples, produce more blank samples and ask your associates to complete the forms, so that you can test against different hand printing styles. The more you test, the more reliable your results will be in a production run.
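One simple way to keep score between test rounds is to compare what the system recognized against the values you know were written on each test form. The Python sketch below does this per field. The `field_accuracy` function and the sample data are hypothetical, and it assumes you can get the recognized values out of your capture system as one dictionary per form, with every form using the same field names.

```python
def field_accuracy(
    expected: list[dict[str, str]], recognized: list[dict[str, str]]
) -> dict[str, float]:
    """Per field, the share of test forms where the recognized value equals the known value."""
    if len(expected) != len(recognized):
        raise ValueError("expected and recognized must contain the same number of forms")
    hits: dict[str, int] = {}
    for truth, result in zip(expected, recognized):
        for name, value in truth.items():
            hits.setdefault(name, 0)
            # A field missing from the recognized data counts as a miss.
            if result.get(name, "").strip() == value.strip():
                hits[name] += 1
    return {name: count / len(expected) for name, count in hits.items()}


# Assumed test data: what was written on two forms versus what the system read.
written = [
    {"name": "ANNA", "dob": "1980-01-02"},
    {"name": "OTTO", "dob": "1975-12-31"},
]
read_back = [
    {"name": "ANNA", "dob": "1980-01-02"},
    {"name": "0TTO", "dob": "1975-12-31"},  # letter O misread as zero
]
print(field_accuracy(written, read_back))
```

Running the same comparison after each change to the Design Workflow shows whether an adjustment helped a given field or made it worse.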
In the end, there are many dry runs you’ve got to go through to get it right. A single mistake in how you’ve configured the system to interpret the data can result in thousands of unnecessary keystrokes as you process the thousands or millions of returned production documents. Don’t be a paper-head. This is a very doable process, but you’ve got to get it right to prove that document capture is in fact viable for your project.