Pattern recognition is a commonly encountered problem when computers are required to get information from the physical world around them. It may be easy enough to get a digital picture from a camera or a scanner into a computer file, but how does the computer know what the data means? Recent advances in commercially available optical character recognition software have provided some affordable solutions, particularly when fonts are similar and the material is relatively clean. Blind people can even purchase a scanner and software which will read aloud to them. However, there are still real limits to what most commercial software can recognize. Most packages have difficulty when the print is sloppy, small, or varies considerably. None offers the ability to recognize arbitrary shapes, symbols, or graphics.
Recent studies in pattern recognition with neural networks have been sponsored by the US Post Office to read ZIP codes.(1) Although that work is primarily concerned with hand-written digits, the techniques developed are general. Feature extraction from bitmaps is the biggest problem. One approach to feature extraction uses Fourier descriptors of the items to be recognized.(2) One such application, described here, reads a chemical drawing (composed of characters and graphics) and translates it into a chemical structure database.
Compounds are described in two ways: as a chemical drawing of connected atoms, or as a list of atoms and their connections in a connection table. A connection table can be easily stored on a computer, but most printed sources such as books, journals, and papers use the more easily recognized drawings. The connection tables uniquely define compounds and can be used to index information in a database. When chemical compound descriptions are placed in a database with other information, they can be used for patent searches, environmental studies, toxicology studies, and precursor searching, for example.
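To make the idea of a connection table concrete, here is a minimal sketch in Python. The class name, its fields, and the ethanol example are assumptions made for illustration; they are not the format used by Fein-Marquart or by any particular database, and no attempt is made at the canonical atom numbering a real system would need to make tables unique.

```python
from dataclasses import dataclass, field


@dataclass
class ConnectionTable:
    """Hypothetical minimal connection table: atoms plus the bonds joining them."""
    atoms: list = field(default_factory=list)  # element symbols, e.g. "C"
    bonds: list = field(default_factory=list)  # (atom_index_a, atom_index_b, bond_order)

    def add_atom(self, symbol):
        """Append an atom and return its index in the table."""
        self.atoms.append(symbol)
        return len(self.atoms) - 1

    def add_bond(self, a, b, order=1):
        """Record a bond between two distinct, already-added atoms."""
        count = len(self.atoms)
        if a == b or not (0 <= a < count and 0 <= b < count):
            raise ValueError("bond must join two distinct existing atoms")
        self.bonds.append((min(a, b), max(a, b), order))

    def neighbors(self, index):
        """Return the indices of all atoms bonded to the given atom."""
        result = []
        for a, b, _order in self.bonds:
            if a == index:
                result.append(b)
            elif b == index:
                result.append(a)
        return result


# Heavy-atom skeleton of ethanol, C-C-O (hydrogens left implicit).
ethanol = ConnectionTable()
c1 = ethanol.add_atom("C")
c2 = ethanol.add_atom("C")
o = ethanol.add_atom("O")
ethanol.add_bond(c1, c2)
ethanol.add_bond(c2, o)
```

Storing bonds as index pairs keeps the table compact and easy to write to a file or database row, which is the property that makes connection tables convenient for indexing.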
Fein-Marquart Associates, Inc. has developed a program which automatically reads printed chemical drawings and translates them into connection tables in a database. The old approach required manual computation of the connection table. Commercially available optical character recognition programs were not able to read the chemical drawings because many drawings use very small print (6 and 8 point) and contain graphic elements as well as standard English characters.
The system uses a neural network trained with BrainMaker Professional to recognize the printed characters and graphics. The system has a 98% recognition success rate, according to Joe McDaniel, Senior Staff Member at Fein-Marquart. The chemical drawings are read into a PC from a scanner. Mathematical processing then produces Fourier descriptors, which are fed into a neural network for recognition and translation into bonds and atomic symbols. The output of the neural network is formatted into a connection table and transmitted to a host computer database.
Fourier descriptors are computed by tracing the outline of a character to create a concave hull. This data is stored as a list of x and y coordinates. If one views the x portion of the data as the real part and the y portion as the imaginary part of a complex number, and then performs a Fourier transform on the list, the result is a list of complex values representing frequency. Straight lines or big curves can be interpreted from low-frequency data, and corners, serifs, and line ends from high-frequency data. Characters and graphics have frequency magnitude and phase "signatures" which can be recognized by the neural network.
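The transform step described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name, the 1/n scaling, and the four-point square are assumptions, and a naive discrete Fourier transform is used for clarity where a production system would more likely use a fast Fourier transform on a densely sampled outline.

```python
import cmath


def fourier_descriptors(outline):
    """Naive discrete Fourier transform of an outline.

    `outline` is a list of (x, y) points; each point is treated as the
    complex number x + iy, with x as the real part and y as the imaginary part.
    """
    n = len(outline)
    points = [complex(x, y) for x, y in outline]
    coeffs = []
    for k in range(n):
        total = 0j
        for t, z in enumerate(points):
            total += z * cmath.exp(-2j * cmath.pi * k * t / n)
        coeffs.append(total / n)  # scaling by 1/n is a convention chosen here
    return coeffs


# Outline of a square, sampled only at its four corners (counter-clockwise).
square = [(0, 0), (2, 0), (2, 2), (0, 2)]
descriptors = fourier_descriptors(square)
```

With this scaling, the first coefficient is the centroid of the sampled points (here 1 + 1i), and the remaining coefficients describe the shape at increasing frequencies.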
The neural network is given this frequency information as input and is trained to translate it into bonds and atomic symbols.
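One plausible way to present magnitude and phase information to a network is to flatten the leading complex coefficients into a fixed-length list of numbers. The sketch below assumes that encoding; the function name, the number of coefficients kept, and the sample values are hypothetical, and the source does not say how the actual system prepares its inputs.

```python
import cmath


def magnitude_phase_features(coeffs, count):
    """Flatten the first `count` complex coefficients into [mag, phase, mag, phase, ...]."""
    features = []
    for c in coeffs[:count]:
        magnitude, phase = cmath.polar(c)  # phase is in radians, in [-pi, pi]
        features.extend([magnitude, phase])
    return features


# Hypothetical coefficients, standing in for the Fourier transform of an outline.
example = [1 + 1j, -2 + 0j, 0 - 3j]
features = magnitude_phase_features(example, 2)
```

Keeping only the first few coefficients gives every character the same input length regardless of how many outline points were traced, which a fixed-size network input layer requires.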
(1) Y. LeCun et al., "Backpropagation Applied to Handwritten Zip Code Recognition," Neural Computation, vol. 1, pp. 541-551, 1989.
(2) G. H. Granlund, "Fourier Preprocessing for Hand Print Character Recognition," IEEE Transactions on Computers, vol. C-21, no. 2, pp. 195-201, February 1972.
This work is funded by the National Cancer Institute and is being undertaken by Joe McDaniel and Jason Balmuth of Fein-Marquart Associates, Inc., Baltimore, MD.