Project

A Practical Solution to OCR in Persian and Arabic Print

The advent of computers has freed people from complicated and time-consuming calculations. Further developments in programming languages have paved the way for more capable machines, and the emergence of artificial intelligence has equipped computers to support people in communicating worldwide with confidence. A noteworthy example is Computer-Assisted Translation (CAT) tools and Machine Translation (MT), which offer the fastest way of translating between languages.

The present research establishes an essential prerequisite for a multi-dimensional goal, one that not only underpins the translation industry but also contributes to the modern study of language as a whole.

Through the expanding process of digitization, one of the tasks of the Digital Humanities, a new horizon has opened for academia: access to a variety of sources as digital editions that substitute for traditional research methods. Using Optical Character Recognition (OCR), digitally scanned images are turned into editable text, which opens the path for further computational analysis. OCR is thus the link between the large body of data currently available only in image form and the text input required by CAT and MT. Extensive and practical OCR work exists for Latin-script languages, but for Persian and Arabic, despite their rich influence on world literature, there has been no comparable effort. Even for Latin-script languages, the OCR systems integrated into major CAT software are not accurate enough to rely on for a flawless translation.

The current research proposes a new model that takes Persian and Arabic print material in digital image form as input and extracts its editable text as output, which in turn is the required input for any kind of digital textual analysis. Persian and Arabic are written in a connected (cursive) script, and, given the long history of these languages, letters take various forms to express different emotions and themes. Such variation appears in the length of each letter (short or elongated), the shape of its curves, and so on. The variation in pattern and structure can be so great that in some cases even a human reader cannot identify a letter correctly. Because of this range and complexity within each character class, an OCR system for these languages must be designed independently and must take these different dimensions into account.

To address this problem, image classification is performed with the Adaptive Boosting (AdaBoost) machine learning algorithm. Instead of using the raw pixels of character images, Histogram of Oriented Gradients (HOG) and Shape Context features are first extracted from the target image. These features are combined into a single, unified feature vector, which is used in both the training and classification phases. All upcoming experiments are to be carried out in the latest version of the Python environment.
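The feature-then-boost pipeline described above can be illustrated with a minimal, standard-library-only Python sketch. It is not the project's implementation: the function names (`hog_features`, `train_adaboost`, `stump`, `predict`, `bar`) and the toy bar-shaped "glyphs" are assumptions made for illustration, the HOG descriptor is simplified (one global L2 normalisation instead of block normalisation), the weak learners are assumed to be decision stumps, and the Shape Context features are omitted; in the described design they would be concatenated with the HOG vector before training.

```python
import math

def hog_features(img, cell=4, bins=8):
    """Simplified HOG: one gradient-orientation histogram per cell."""
    h, w = len(img), len(img[0])
    feats = []
    for cy in range(0, h, cell):
        for cx in range(0, w, cell):
            hist = [0.0] * bins
            for y in range(cy, min(cy + cell, h)):
                for x in range(cx, min(cx + cell, w)):
                    # Central differences, clamped at the image border.
                    gx = img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]
                    gy = img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]
                    ang = math.atan2(gy, gx) % math.pi  # unsigned orientation
                    b = min(int(ang / math.pi * bins), bins - 1)
                    hist[b] += math.hypot(gx, gy)  # vote weighted by magnitude
            feats.extend(hist)
    # Assumption: a single global L2 normalisation (real HOG uses blocks).
    norm = math.sqrt(sum(v * v for v in feats)) or 1.0
    return [v / norm for v in feats]

def stump(row, j, thr, pol):
    """Weak learner: threshold test on one feature, output +1 or -1."""
    return pol if row[j] > thr else -pol

def train_adaboost(X, y, rounds=5):
    """Discrete AdaBoost over decision stumps; labels must be +1 / -1."""
    n = len(X)
    weights = [1.0 / n] * n
    model = []  # entries: (alpha, feature index, threshold, polarity)
    for _ in range(rounds):
        best = None
        for j in range(len(X[0])):
            for thr in sorted({row[j] for row in X}):
                for pol in (1, -1):
                    err = sum(wt for wt, row, lab in zip(weights, X, y)
                              if stump(row, j, thr, pol) != lab)
                    if best is None or err < best[0]:
                        best = (err, j, thr, pol)
        err, j, thr, pol = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, j, thr, pol))
        # Up-weight misclassified samples, then renormalise.
        weights = [wt * math.exp(-alpha * lab * stump(row, j, thr, pol))
                   for wt, row, lab in zip(weights, X, y)]
        total = sum(weights)
        weights = [wt / total for wt in weights]
    return model

def predict(model, row):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(a * stump(row, j, thr, pol) for a, j, thr, pol in model)
    return 1 if score >= 0 else -1

def bar(pos, vertical, size=8):
    """Hypothetical toy glyph: a one-pixel bar on a blank image."""
    return [[1 if (x if vertical else y) == pos else 0 for x in range(size)]
            for y in range(size)]

# Toy two-class problem: vertical bars (+1) versus horizontal bars (-1).
train_imgs = ([bar(p, True) for p in range(1, 7)]
              + [bar(p, False) for p in range(1, 7)])
labels = [1] * 6 + [-1] * 6
X = [hog_features(img, cell=8) for img in train_imgs]  # one cell per image
model = train_adaboost(X, labels, rounds=5)
print(predict(model, hog_features(bar(0, True), cell=8)))   # unseen vertical bar
print(predict(model, hog_features(bar(7, False), cell=8)))  # unseen horizontal bar
```

Because the descriptor encodes edge orientation rather than pixel position, the two held-out bars at unseen positions should be assigned to the same classes as their training counterparts. A real character set would need a multi-class scheme (for example one-vs-rest) and, most likely, established library implementations of both HOG and AdaBoost rather than this hand-rolled version.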

In short, the final output of this research will be an OCR solution for Persian and Arabic that supports the most commonly used fonts, meeting the needs of scholars of the two languages for digital textual analysis in general and for integration into MT and CAT tools in particular.

Date: 2 May 2018 → 2 May 2022
Keywords: OCR, NLP, CAT tools
Disciplines: Multimedia processing, Biological system engineering, Signal processing, Instructional sciences, Literary studies, Theory and methodology of language studies
Project type: PhD project