Arab Name Transcription Engine

This Arab Name Transcription Engine demo is powered by the following Arabic lexical resources developed by The CJK Dictionary Institute, Inc.:

Database of Arab Names (DAN)

Our Database of Arab Names (DAN) is derived from about 25 million names collected from a wide variety of sources, including websites, corpora, books, dictionaries, phone books, and encyclopedias. It is based on authoritative resources and has undergone extensive proofreading and expansion by a team of native Arabic speakers. It currently covers over seven million names and name variants.

DAN covers Arab personal names in both the Roman and Arabic scripts and includes numerous orthographic variants and other attributes such as web frequency, name type codes and normalized forms. DAN continues to grow substantially. For more information, see this white paper (pdf).

Database of Arab Names in Arabic (DANA)

Our Database of Arab Names in Arabic (DANA) is a one-of-a-kind resource which covers several hundred thousand Arabic script variants and common spelling mistakes. A key feature of DANA is that every Arabic name is normalized and vocalized to produce a database of error-free, fully sanitized Arabic canonical forms. The vocalization is performed by a team of editors with the aid of tools and interfaces designed to achieve maximum efficiency. The canonical forms serve as the basis both for creating accurate romanized variants for DAN and for creating Arabic orthographic variants for DANA.
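To make the idea of mapping script variants to a single canonical form more concrete, here is a minimal Python sketch. It is not CJKI's implementation: the function names, the folding rules (stripping diacritics and tatweel, folding hamza-bearing alef forms to bare alef) and the one-entry table are assumptions chosen only for illustration. A real resource such as DANA relies on human-verified entries rather than a handful of rules.

```python
import re
import unicodedata

# Short vowels and related marks (fathatan .. sukun) plus superscript alef.
_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
_TATWEEL = "\u0640"  # elongation character, carries no linguistic value
_ALEF_VARIANTS = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda       -> bare alef
})

def lookup_key(name: str) -> str:
    """Fold common orthographic variation into a plain lookup key (illustrative rules)."""
    text = unicodedata.normalize("NFC", name)
    text = _DIACRITICS.sub("", text)
    text = text.replace(_TATWEEL, "")
    return text.translate(_ALEF_VARIANTS)

# Hypothetical one-entry table: lookup key -> vocalized canonical form (Ahmad).
CANONICAL = {
    "\u0627\u062D\u0645\u062F": "\u0623\u064E\u062D\u0652\u0645\u064E\u062F",
}

def canonical_form(name: str):
    """Return the vocalized canonical form for a variant, or None if unknown."""
    return CANONICAL.get(lookup_key(name))

if __name__ == "__main__":
    variants = [
        "\u0623\u062D\u0645\u062F",        # with hamza on the alef
        "\u0627\u062D\u0645\u062F",        # hamza omitted (common misspelling)
        "\u0627\u062D\u0640\u0645\u062F",  # with a decorative tatweel
    ]
    for v in variants:
        print(v, "->", canonical_form(v))
```

All three variants resolve to the same vocalized entry, which is the property a canonical-form database is meant to guarantee. The folding rules here are deliberately coarse; which distinctions are safe to fold depends on the application.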

Methodology

The process of automatically converting unvocalized Arabic to a Roman script representation, called romanization, is a challenging task to which there is no definitive solution.
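A small Python sketch can show why the missing vowels make this hard. The letter table below covers only the characters needed for one name and is an assumption for illustration, not the mapping used by the engine. A purely letter-by-letter conversion of the unvocalized spelling yields only a consonant skeleton, while the vocalized spelling of the same name yields a readable form.

```python
# Toy mapping tables (illustrative only; a real system needs the full alphabet
# plus context-sensitive rules).
CONSONANTS = {"\u0645": "m", "\u062D": "h", "\u062F": "d"}  # meem, hah, dal
VOWELS = {"\u064E": "a", "\u064F": "u", "\u0650": "i"}      # fatha, damma, kasra
SHADDA = "\u0651"  # gemination mark: doubles the preceding consonant
SUKUN = "\u0652"   # marks the absence of a vowel

def romanize(text: str) -> str:
    """Naive character-by-character romanization.

    Assumes a shadda immediately follows the consonant it doubles.
    """
    out = []
    for ch in text:
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
        elif ch in VOWELS:
            out.append(VOWELS[ch])
        elif ch == SHADDA and out:
            out.append(out[-1])
        elif ch == SUKUN:
            continue
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    return "".join(out)

if __name__ == "__main__":
    unvocalized = "\u0645\u062D\u0645\u062F"
    vocalized = "\u0645\u064F\u062D\u064E\u0645\u0651\u064E\u062F"
    print(romanize(unvocalized))  # mhmd
    print(romanize(vocalized))    # muhammad
```

The skeleton "mhmd" is compatible with more than one reading, so a rule-only converter cannot pick the intended name without extra knowledge such as a lexicon of vocalized forms.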

CJKI's team of experts on Arabic orthography and phonology has developed a versatile system that performs a wide range of computational linguistic tasks such as phonetic and phonemic transcription, transliteration, name variant generation, vocalization, code conversion and language identification. Though the focus is on processing Arabic names, it can for the most part be applied to processing Arabic texts in general.

A more detailed discussion of the issues involved in Arabic romanization and arabization can be found here (pdf file).

You may also be interested to read the following, authored by CJKI's CEO, Jack Halpern:

Lexicon-Driven Approach to the Recognition of Arabic Named Entities (pdf file)

About The CJK Dictionary Institute, Inc.

We specialize in the compilation of large-scale Chinese, Japanese, Korean, and Arabic lexical resources, which currently cover over 50 million entries for general vocabulary, proper nouns and technical terms. We contribute to AI and natural language processing (NLP) technology, including machine translation, speech technology and named entity recognition, by providing high-quality lexical resources to many of the world’s leading IT companies, including Amazon and Google.