Rodney M. Goodman B.Sc., Ph.D., C.Eng., SMIEEE, FIEE.

Keyword Spotting for Cursive Document Retrieval

Title: Keyword Spotting For Cursive Document Retrieval
Authors: Trish Keaton, Rodney Goodman


Abstract: We present one of the first attempts at automatic document retrieval in the noisy environment of unconstrained, multiple-author, handwritten forms. The documents were written in cursive script, for which conventional OCR and text retrieval engines are not adequate. We focus on a visual word-spotting indexing scheme for scanned documents housed in the Archives of the Indies in Seville, Spain. The framework presented uses pattern recognition, learning, and information fusion methods, and is motivated by human word-spotting studies.
 
Motivation & Aims
The goal of this research is to develop a visual word-spotting and indexing scheme for the archival and retrieval of scanned historical documents housed in the Archives of the Indies in Seville, Spain. These documents were written in cursive script by multiple authors and are hundreds of years old, many dating back to Columbus's era. Scholars have a pressing need to search and explore the contents of such archives, but conventional OCR and text retrieval engines are inadequate for the task. Existing OCR systems often rely upon the ability to cleanly segment the words prior to recognition. The documents in our database exhibit many problems which would certainly cause such systems to fail. We must contend with noise introduced by the photocopying and scanning processes, as well as stray marks, underlines, and overlapping words. Under these conditions perfect segmentation would be impossible. We have developed an alternative strategy for the indexing and retrieval of such documents, based on learning a set of keyword signatures for particular words of interest.

Our approach applies many standard image processing techniques to preprocess the documents and extract the spatial characteristics of the words. In addition, we attempt to characterize words via signatures motivated by human word-spotting experiments. The recognition strategy is based upon probabilistic signature matching, in which we view the entire word globally rather than segmenting and recognizing its individual letters. We investigate the ability to use such signatures, together with advanced encoding schemes and learning, to facilitate the spotting of keywords in handwritten cursive documents.

Approach
Focus-of-attention: We avoid page segmentation problems by incorporating a focus-of-attention module that identifies candidate locations before word-level matching is performed. This step computes the normalized cross-correlation of the document image with a set of keyword prototypes (templates) extracted from a training set of documents. A set of candidate locations is extracted and ranked by correlation strength. The locations of the top correlation peaks are then passed along to the preprocessing stage.
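
The focus-of-attention step can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the zero-mean form of normalized cross-correlation, uses a single prototype rather than a set, and ranks raw scores without suppressing neighbouring offsets around each peak. The names `ncc_map` and `top_candidates` and the toy page are hypothetical, and plain Python lists stand in for real scanned images, which would call for an optimized image library.

```python
import math

def ncc_map(image, template):
    """Zero-mean normalized cross-correlation at every valid offset.

    image, template: 2-D lists of pixel intensities (lists of rows).
    Returns a 2-D list of scores in [-1, 1]; flat windows score 0.0.
    """
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    n = th * tw
    # Subtract the template mean once; it does not depend on the offset.
    t_mean = sum(map(sum, template)) / n
    t_zero = [[v - t_mean for v in row] for row in template]
    t_norm = math.sqrt(sum(v * v for row in t_zero for v in row))
    scores = []
    for y in range(ih - th + 1):
        row_scores = []
        for x in range(iw - tw + 1):
            window = [image[y + dy][x:x + tw] for dy in range(th)]
            w_mean = sum(map(sum, window)) / n
            num = 0.0
            w_sq = 0.0
            for dy in range(th):
                for dx in range(tw):
                    w = window[dy][dx] - w_mean
                    num += w * t_zero[dy][dx]
                    w_sq += w * w
            denom = math.sqrt(w_sq) * t_norm
            # A constant window or template has no structure to correlate.
            row_scores.append(num / denom if denom > 0 else 0.0)
        scores.append(row_scores)
    return scores

def top_candidates(image, template, k=3):
    """Return the k highest-scoring (score, row, col) offsets, best first."""
    scores = ncc_map(image, template)
    flat = [(s, y, x) for y, row in enumerate(scores) for x, s in enumerate(row)]
    flat.sort(key=lambda c: c[0], reverse=True)
    return flat[:k]

# Toy "page" with two planted copies of the prototype (hypothetical data).
page = [
    [0, 0, 0, 0, 0, 0],
    [0, 9, 1, 0, 0, 0],
    [0, 1, 9, 0, 0, 0],
    [0, 0, 0, 0, 9, 1],
    [0, 0, 0, 0, 1, 9],
]
prototype = [[9, 1], [1, 9]]
best = top_candidates(page, prototype, k=2)
for score, row, col in best:
    print(f"candidate at row={row}, col={col}, score={score:.3f}")
```

In this toy example the two top-ranked candidates are the planted copies at offsets (1, 1) and (3, 4). Because the score is normalized by the local window energy, a faint or dark copy of a keyword can still rank highly, which suits pages with uneven photocopy contrast.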
