Arab Name Transcription Engine
Powered by DAN 3.0 (demo)
This Arab Name Transcription Engine demo is powered by the following Arabic lexical resources developed by The CJK Dictionary Institute, Inc.:
Our Database of Arab Names (DAN) is based on authoritative resources and has undergone extensive proofreading and expansion by a team of native Arabic speakers. It draws on about 25 million names derived from a large variety of sources, including websites, corpora, books, dictionaries, phone books, and encyclopedias, and currently covers over seven million names and name variants.
DAN covers Arab personal names in both the Roman and Arabic scripts and includes numerous orthographic variants and other attributes such as web frequency, name type codes, and normalized forms. Based on authoritative linguistic resources, DAN continues to grow substantially. For more information, see this white paper (pdf).
Our Database of Arab Names in Arabic (DANA) is a one-of-a-kind resource that covers several hundred thousand Arabic script variants and common spelling mistakes. A key feature of DANA is that every Arabic name is normalized and vocalized to produce a database of error-free, fully sanitized Arabic canonical forms. The vocalization is performed by a team of editors with the aid of tools and interfaces designed to achieve maximum efficiency. The canonical forms serve as the basis both for creating accurate romanized variants for DAN and for creating Arabic orthographic variants for DANA.
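To illustrate how Arabic script variants of one name can be tied together, here is a minimal sketch of a generic grouping technique: strip the vowel marks and the tatweel, and fold the hamza-bearing forms of alef into a bare alef. This is an assumption for illustration only and is not CJKI's normalization procedure, which relies on editor-vocalized canonical forms. The names `variant_key` and `group_variants` are hypothetical.

```python
import re

# Arabic short-vowel and related marks (fathatan U+064B through sukun U+0652).
DIACRITICS = re.compile("[\u064B-\u0652]")
TATWEEL = "\u0640"  # elongation character, carries no linguistic content

# Fold alef with madda / hamza above / hamza below into bare alef (U+0627).
ALEF_VARIANTS = str.maketrans({
    "\u0622": "\u0627",
    "\u0623": "\u0627",
    "\u0625": "\u0627",
})


def variant_key(name):
    """Return a crude grouping key for an Arabic-script name (illustrative only)."""
    name = DIACRITICS.sub("", name)
    name = name.replace(TATWEEL, "")
    return name.translate(ALEF_VARIANTS)


def group_variants(names):
    """Group names that share the same key; input order is preserved per group."""
    groups = {}
    for name in names:
        groups.setdefault(variant_key(name), []).append(name)
    return groups


# Three spellings of the name Ahmad: with hamza, without hamza, fully vocalized.
with_hamza = "\u0623\u062D\u0645\u062F"
bare_alef = "\u0627\u062D\u0645\u062F"
vocalized = "\u0623\u064E\u062D\u0652\u0645\u064E\u062F"

groups = group_variants([with_hamza, bare_alef, vocalized])
```

In this sketch all three spellings fall into one group. A key of this kind only clusters candidates; choosing the correct vocalized canonical form for each cluster still requires the kind of editorial work described above.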
The process of automatically converting unvocalized Arabic to a Roman script representation, called romanization, is a challenging task to which there is no definitive solution.
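The difficulty can be seen in a small sketch. The toy romanizer below maps a handful of letters and vowel marks to Latin characters. The mapping table and the function name `romanize` are assumptions made for this illustration and do not reflect CJKI's system or any particular romanization standard.

```python
# Toy mapping for three consonants only (hypothetical, for illustration).
CONSONANTS = {
    "\u0645": "m",  # meem
    "\u062D": "h",  # hah
    "\u062F": "d",  # dal
}
SHORT_VOWELS = {
    "\u064E": "a",  # fatha
    "\u064F": "u",  # damma
    "\u0650": "i",  # kasra
}
SHADDA = "\u0651"  # doubles the consonant it is attached to
SUKUN = "\u0652"   # marks the absence of a vowel


def romanize(text):
    """Romanize text made only of the letters and marks defined above."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch not in CONSONANTS:
            raise ValueError(f"unsupported character: {ch!r}")
        i += 1
        # Collect the marks attached to this consonant, in any order.
        marks = []
        while i < len(text) and (text[i] in SHORT_VOWELS or text[i] in (SHADDA, SUKUN)):
            marks.append(text[i])
            i += 1
        latin = CONSONANTS[ch]
        if SHADDA in marks:
            latin *= 2
        vowel = "".join(SHORT_VOWELS[m] for m in marks if m in SHORT_VOWELS)
        out.append(latin + vowel)
    return "".join(out)


# The name Muhammad as normally written (no vowel marks) and fully vocalized.
unvocalized = "\u0645\u062D\u0645\u062F"
vocalized = "\u0645\u064F\u062D\u064E\u0645\u0651\u064E\u062F"

print(romanize(unvocalized))  # mhmd
print(romanize(vocalized))    # muhammad
```

The unvocalized spelling yields only the consonant skeleton, since the short vowels and the doubled consonant are not written. Recovering them requires knowledge beyond the characters themselves, which is why a character-by-character mapping is not sufficient for unvocalized input.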
CJKI's team of experts on Arabic orthography and phonology has developed a versatile system that performs a wide range of computational linguistic tasks such as phonetic and phonemic transcription, transliteration, name variant generation, vocalization, code conversion and language identification. Though the focus is on processing Arabic names, it can for the most part be applied to processing Arabic texts in general.
A more detailed discussion of the issues involved in Arabic romanization and arabization can be found here.
You may also be interested to read the following papers authored by CJKI's CEO, Jack Halpern:
We specialize in the compilation of large-scale Chinese, Japanese, Korean, and Arabic lexical resources. Our lexical databases currently cover about 13 million entries for general vocabulary, proper nouns, and technical terms. Companies including Yahoo, IBM, Baidu, Fujitsu, and Microsoft can attest that we put great emphasis on quality in compiling our databases and that we fine-tune our data to meet the specific needs of our customers, studying those needs in depth using our expertise in linguistics and lexicography.