Arabic

Arabic and Computers

  • Test of two Arabic OCR Programs
    December 1994

    by Joseph Norment Bell, University of Bergen, and Petr Zemanek, Charles University, Prague

    From the fourteenth to the eighteenth of December we met in Bergen to experiment with two of the OCR programs for Arabic that were available on the software market as of November 1994. One of these was TextPert 3.7 Arabic, produced by CTA, Inc., which runs on the Macintosh Arabic system. (System 7.1 was used in the test.) The other was al-Qari' al-Ali (Arabic "Automatic Reader") 1.0, upgraded to 1.1, a version of the program known as MULTREC. It is produced by al-Alamiah Software Co. and runs on al-Nawafidh al-Arabiya, the Arabization program for Windows from the same company. Taking part with us were administrative assistant and librarian Awni Taki Musa and undergraduate student Navid Saminasab.

    The limited time and means at our disposal did not allow us to try out a third program, ICRA 4.0, which is an application for Windows (with Arabic Support) produced by Arab Scientific Software & Engineering Technologies (cf. the communications by Jan Hoogland, Discussion Forum on Personal Computers Arabization, Dec. 21, 1994; Itisalat, Jan. 5, 1995). Another program which has been discussed recently, one using neural-net based software from Mitek Systems in San Diego, was as of late November not yet available, and the company could provide no comparison results.

    Both of the programs we tested were able to recognize certain computer-printed texts of good quality with a reasonable degree of accuracy, considering the difficulties of the Arabic script. Both were many times slower than comparably priced programs for Latin OCR, even when reading Latin.

    TextPert

    TextPert is extremely easy to use, but in its normal version it offers no means of influencing character recognition other than adjusting resolution, brightness, and contrast on the scanner. Thus it was not possible to choose, or to train for, the fonts we were scanning. On very good and simple texts the results approached acceptable standards, but on more complicated fonts the program recognized virtually nothing. Moreover, on the computers we used (a PowerBook 180 with 14MB of memory and an LC III with 8MB of memory), the program was not always able to follow the paths between the automatically established zones on the document to be read. When it could not do this, the Macintosh would crash. There is a much faster version of Arabic TextPert, three or four times more expensive, which uses a RISC board. We have been told by the company that it does not perform essentially differently from the cheaper version except for speed, but that they may allow access to the engine for certain purposes the user may require. For Macintosh users who only want to scan certain kinds of computer-produced documents, TextPert may offer something approaching an acceptable solution, but it is to be hoped that future versions will take into account the need to train for different fonts.

    Al-Qari' Al-Ali

    This program is based on a very powerful algorithm which seems to combine vector and bit-map analysis. In its first upgraded version it offers a number of means, although still not quite enough, of controlling recognition performance. Thus it is possible to select the desired level of accuracy and to train for the majority of fonts in Arabic and in most other scripts. The results of an OCR operation can be checked with a spelling checker that, while far from what one might hope for, is surprisingly good, particularly at catching words that have run together. To facilitate comparison between the original scanned image and the text document, the spelling checker highlights problem areas simultaneously in both.

    The texts on which we tried al-Qari' al-Ali were for the most part photocopies from works printed in the late nineteenth century in relatively complex fonts (for example Shaykh'zadah's Hashiyah on al-Baydawi printed in Constantinople in 1306/1888-89). There were quite a few breaks between letters, and spaces as often as not occurred in the middle of words, rather than between them. The results were none the less impressive, although anyone interested in scanning texts of this type must be prepared to invest a great deal of time both before and after scanning.

    The text documents we produced using al-Qari' al-Ali were later converted for the Macintosh using a conversion table we made in Paradigma 2.0, a program designed by Espen Aarseth at the University of Bergen. The PC Arabic system handles ligatures and initial and final forms differently from the Macintosh, and word boundaries in the text document produced by al-Qari' al-Ali were often clear on the PC even when there was no actual space between the words. These boundaries disappeared when the text was converted for the Macintosh. Since at this stage in the program's development the adding and subtracting of spaces has to be done manually, it is probably better to carry out this part of the correction process on a PC, even if one intends to continue working on a Macintosh later on. We understand that a Macintosh version of the program is under development, but we have no information about how this particular problem will be handled.
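    The loss of word boundaries described above can be illustrated with a small sketch. It is not the Paradigma table we used, and it rests on an assumption: that the source text stores a separate code for each positional form of a letter, while the target stores only base letters and leaves the shaping to the system. Unicode presentation forms stand in here for the PC codes, and the function name convert_keeping_boundaries is hypothetical. A one-to-one table drops the boundary, whereas a conversion that notices a final or isolated form of a two-way joining letter directly followed by another letter can put a space there.

```python
import unicodedata

POSITIONAL_TAGS = ("<isolated>", "<final>", "<initial>", "<medial>")


def tag_and_base(ch):
    """Return (positional tag, base letters) for a presentation form,
    or (None, ch) for any other character."""
    parts = unicodedata.decomposition(ch).split()
    if len(parts) >= 2 and parts[0] in POSITIONAL_TAGS:
        return parts[0], "".join(chr(int(p, 16)) for p in parts[1:])
    return None, ch


# Letters that own a medial form can join on both sides. Collected from
# the Arabic Presentation Forms-B letter range (U+FE80..U+FEFC).
DUAL_JOINING = {
    base
    for cp in range(0xFE80, 0xFEFD)
    for tag, base in [tag_and_base(chr(cp))]
    if tag == "<medial>"
}


def convert_keeping_boundaries(shaped):
    """Map positional forms to base letters (hypothetical helper).

    A space is inserted where a two-way joining letter stands in its
    final or isolated form and is followed directly by another letter,
    since that is where the target system would otherwise join them.
    """
    out = []
    prev_tag, prev_base = None, ""
    for ch in shaped:
        tag, base = tag_and_base(ch)
        boundary = (
            tag is not None
            and prev_tag in ("<final>", "<isolated>")
            and prev_base[-1:] in DUAL_JOINING
        )
        if boundary:
            out.append(" ")
        out.append(base)
        prev_tag, prev_base = tag, base
    return "".join(out)


# Two short words written with no space between them:
# beh-initial, teh-final, beh-initial, teh-final.
two_words = "\uFE91\uFE96\uFE91\uFE96"
naive = unicodedata.normalize("NFKC", two_words)   # plain one-to-one mapping
careful = convert_keeping_boundaries(two_words)
```

    In this example the one-to-one mapping yields four base letters with no boundary, while the second conversion yields the same letters with a space after the second one. Letters such as alif, which never join to the following letter, give no such signal, so a rule of this kind could only reduce the manual correction, not replace it.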

    The very considerable amount of time it takes to train for new fonts, especially hand-set fonts with many ligatures, is one of the main problems with al-Qari' al-Ali. Even when we were teaching it Latin fonts, the process was slower and the operations more cumbersome than, for example, in the bit-map program ProLector, which, however, is considerably more expensive. Quicker routines for training fonts would be a great improvement. A feature that al-Qari' al-Ali has which is not in ProLector is the possibility of editing bit-map models within the program and inserting them into a set of previously trained models. (Anyone not concerned with speed and thinking of substituting al-Qari' al-Ali for ProLector will have to learn how to read the Arabic menus of the program and the operating system.)

    Because the program, although slow, seems so powerful and so promising, we would like to note some problems which we hope the developers will take into account in future upgrades.

  • Manual. Although the manual may look nice, it contains only very superficial information and needs to be entirely redone. An English version would also be helpful.

  • Image Rotation. We did not find, within the program, a tool for gradual rotation of the images to be scanned or read. Such a feature would make it easier to maintain a constant alignment of scanned images so that the program always sees the characters it is to learn or read from the same angle.

  • Recognition Blocks. Al-Qari' al-Ali places groups of connected letters into a green frame and what it thinks are individual letters within the group between horizontally adjustable red lines inside the green frame. Neither the width nor the height of the green frame can be manually adjusted, which means that characteristic elements of a block are on occasion excluded or extraneous information included. Within the green frame, the program lets one know what it is taking as characteristic of a letter by outlining it in blue. It would be helpful, if it is possible, to have a means, in addition to the red lines, of activating or deactivating the blue outline where the program has made a mistake. The program will have certain difficulties with complex fonts until these problems are remedied. For the moment the best guideline seems to be not to override the program's choice when training any more than necessary, since it is not unlikely that it will make the same choice again anyway. When the program has seen a medial letter or ligature as one in isolation because of breaks in the word, for example, the "in isolation" choice at times has to be accepted. Otherwise the program may fail to read the letter or ligature, or read it as something else.

  • Fonts. The program comes with few pre-trained fonts, and those it does provide are computer fonts with few ligatures. Given the amount of time needed to train fonts, the library of pre-trained fonts, particularly non-computer fonts, needs to be greatly expanded. Further, the program lacks an efficient means of visually comparing fonts to be read with the pre-trained fonts, since the font display window in the "create/emend" font library dialogue box gives an inadequate image of small fonts. Lastly, there is in the present version no means of scaling previously trained fonts up or down, which means that every font in every size has to be trained separately. However, we have been told by the company that in the next version it will be possible to reproduce fonts in other sizes (up to 2 points larger or smaller).

  • Confusing Messages. One problem we experienced with al-Qari' al-Ali was that when all the places allowed for the variants of a character in a given position had been used up, the warning that appeared was not always the same. A character may have eleven variants in each position (initial, medial, final, or in isolation). When we tried to teach a twelfth variant, the message occasionally stated that we had exceeded some other limit. The problem may have been insufficient memory in the computer we were using, or it may be in the program. In any event, when using the current version of al-Qari' al-Ali one should be aware of the possibility of inappropriate messages appearing.

  • Ligature Dialogue Boxes. The dialogue boxes for certain combinations of letters, such as "fii," offer only the normal position option, in this case "in isolation" or "final," when in fact in some fonts other positions occur. An "other" button is needed here to allow for the less common options.

  • Ligature List. The window listing optional ligatures gives them in order of creation rather than alphabetically, which in most cases makes it more difficult to find the ligatures one is after. The current method, however, makes it easier to correct mistakes one has just made. If possible, the ligature window should include an optional alphabetical sorting button.

  • Space Markers. Because of the problem with spaces between and within words alluded to above, the al-Muharrir word processor that comes with al-Qari' al-Ali should have an option for marking spaces between groups of letters.

  • Stability. The stability of the program, especially when communicating with the scanner, seems to need improvement. The problem may have been in Arabic Windows or in our hardware. We were using a modest Olivetti 486SX/25MHz with 8MB RAM and a Hewlett-Packard ScanJet IIcx.

    Incidentally, we also tested some of al-Alamiah's other software, in particular the word processor Al Ostaz, the Koran database for Arabic Windows, and the hadith databases for Arabic DOS. All of these were impressive products which should receive a warm welcome in any milieu, academic or religious, with a special interest in the Arabic and Islamic heritage.
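    The variant limit mentioned under "Confusing Messages" can be made concrete with a short sketch. It is only an illustration of the behaviour we would have expected, not a description of how al-Qari' al-Ali is built: the names FontModel, add_variant, and VariantLimitError are hypothetical, and the only figures taken from the program are the four positions and the eleven variants per position. The point is that a full slot always produces the same message, naming the character and the position concerned.

```python
POSITIONS = ("initial", "medial", "final", "isolated")
MAX_VARIANTS = 11  # variants allowed per character and position


class VariantLimitError(Exception):
    """Raised when a character already has all its variants in a position."""


class FontModel:
    """Hypothetical store of trained bit-map models for one font."""

    def __init__(self):
        # Maps (character, position) to the list of trained models.
        self._variants = {}

    def add_variant(self, char, position, bitmap):
        """Store one more model and return how many the slot now holds."""
        if position not in POSITIONS:
            raise ValueError(f"unknown position: {position!r}")
        slot = self._variants.setdefault((char, position), [])
        if len(slot) >= MAX_VARIANTS:
            # One fixed message for this condition, whatever else is going on.
            raise VariantLimitError(
                f"{char!r} already has {MAX_VARIANTS} variants "
                f"in {position} position"
            )
        slot.append(bitmap)
        return len(slot)


model = FontModel()
for n in range(MAX_VARIANTS):
    model.add_variant("\u0628", "final", f"model-{n}")

try:
    model.add_variant("\u0628", "final", "model-11")
    message = None
except VariantLimitError as err:
    message = str(err)
```

    A twelfth variant in the same position is refused with the one fixed message, while the same character can still be taught in the other three positions.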

    Joseph N. Bell
    Petr Zemanek


    This review is preserved in electronic form and in hard copy in the archive of electronic publications of the Section for Middle Eastern Languages and Cultures, University of Bergen.

    Archived 14.4.95


    Responsible for this Web page is Knut S. Vikør.
    Last updated Thursday, 04-Nov-2010 09:55:20 CET