OCR and full text search in EisenVault’s DMS

EisenVault provides an end-to-end document management solution, which includes scanning your documents. Professional scanning usually produces images in TIFF format. The advantage of TIFF is that it retains image quality, because TIFF files are typically stored with “lossless” compression. This makes it easier for Optical Character Recognition (OCR) software to “read” the text in the scanned images accurately. EisenVault is one of the first DMS solutions to offer Hindi OCR.

OCR software usually follows a three-step process:

  1. Text is extracted from the TIFF images and a text file is created (often in a variant of HTML so that formatting and placement are retained).

  2. The TIFF file is converted into a PDF file, which usually takes up less disk space.

  3. The text file created in step 1 is embedded in the PDF file, with each word placed at the correct spot. The text now becomes selectable in the PDF and can be read by machines.
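
The three steps above can be sketched in Python. This is only an illustration: it assumes the open-source Tesseract engine, called through the pytesseract wrapper, which is not necessarily the engine EisenVault uses. The helper names and file paths are hypothetical.

```python
from pathlib import Path

# Tesseract language codes for the three languages mentioned in this article.
LANG_CODES = {"English": "eng", "Hindi": "hin", "Bengali": "ben"}


def output_path_for(tiff_path, suffix=".pdf"):
    """Return the path of an output file written next to the scanned TIFF."""
    return Path(tiff_path).with_suffix(suffix)


def ocr_tiff(tiff_path, language="English"):
    """Run OCR on one TIFF and write an hOCR file and a searchable PDF.

    Assumption: Tesseract, its language data and pytesseract are installed.
    """
    # Imported here so the helpers above work without Tesseract installed.
    import pytesseract

    lang = LANG_CODES[language]

    # Step 1: extract the text as hOCR, an HTML variant that keeps word positions.
    hocr = pytesseract.image_to_pdf_or_hocr(str(tiff_path), lang=lang, extension="hocr")
    output_path_for(tiff_path, ".hocr").write_bytes(hocr)

    # Steps 2 and 3: convert the image to a PDF with the recognised text embedded.
    pdf = pytesseract.image_to_pdf_or_hocr(str(tiff_path), lang=lang, extension="pdf")
    pdf_path = output_path_for(tiff_path)
    pdf_path.write_bytes(pdf)
    return pdf_path
```

Under these assumptions, calling `ocr_tiff("scans/invoice.tiff", "Hindi")` would write `scans/invoice.hocr` and `scans/invoice.pdf` next to the original scan.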

EisenVault’s DMS reads the text from the PDF files generated in step 3 above, and its search engine indexes that text. This makes full-text keyword search possible against OCRed images.
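
To show how indexing makes keyword search possible, here is a minimal sketch of an inverted index in Python. It is not EisenVault's search engine; the function names, the sample documents and the very simple tokenizer are assumptions made for illustration.

```python
import string
from collections import defaultdict


def tokenize(text):
    """Split text on whitespace, strip ASCII punctuation and lowercase each word."""
    words = (w.strip(string.punctuation).lower() for w in text.split())
    return [w for w in words if w]


def build_index(docs):
    """Map every word to the set of document names that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the documents that contain every word of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


# Hypothetical text, as it might be extracted from two OCRed PDFs.
docs = {
    "invoice-001.pdf": "Invoice for scanning services, payable within 30 days.",
    "contract-007.pdf": "Service contract covering document scanning and storage.",
}
index = build_index(docs)
```

With this sample data, `search(index, "scanning")` returns both documents, while `search(index, "scanning invoice")` returns only `invoice-001.pdf`. A production search engine adds much more, such as language-aware tokenization, stemming and relevance ranking.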

EisenVault comes with built-in OCR support. At the moment we support three languages: English, Hindi and Bengali. We plan to include support for all major Indian languages in our next release. Rules can be set up to run OCR in bulk across a large number of uploaded files, or individual files can be uploaded and OCRed in real time.

