Towards the Development of an Automatic Diacritizer for the Persian Orthography based on the Xerox Finite State Transducer

Nojoumian, Peyman

dc.contributor.author	Nojoumian, Peyman
dc.contributor.supervisor	Hirschbühler, Paul
dc.contributor.supervisor	Inkpen, Diana
dc.date.accessioned	2011-08-12T20:14:14Z
dc.date.available	2011-08-12T20:14:14Z
dc.date.created	2011
dc.date.issued	2011
dc.degree.discipline	arts
dc.degree.level	doctorate
dc.degree.name	phd
dc.description.abstract	Because Persian orthography omits short vowels (diacritics), many Natural Language Processing applications for the language, including information retrieval, machine translation, text-to-speech, and automatic speech recognition systems, must disambiguate their input before any further processing. In machine translation, for example, the whole text must first be correctly diacritized so that the correct words, parts of speech, and meanings are matched and retrieved from the lexicon. The core engine of any Persian language processor should therefore include a diacritizer and a lexical disambiguator. This dissertation describes the design and implementation of an automatic diacritizer for Persian based on the Finite State Transducer technology developed at Xerox and described by Beesley & Karttunen (2003). It presents the results of morphological analysis and generation on a test corpus, including the insertion of diacritics. It also examines the phonological and semantic ambiguities that arise because short vowels are absent from the Persian writing system. For the disambiguation of heterophonic homographs in Persian, it proposes a hybrid model (rule-based and inductive) that uses frequency and collocation information and is inspired by psycholinguistic experiments on the human mental lexicon. A syntactic parser could be built on the proposed model to detect Ezafe (the linking short vowel /e/ within a noun phrase) or to disambiguate homographs; its implementation is left for future work.
dc.embargo.terms	immediate
dc.faculty.department	Linguistique / Linguistics
dc.identifier.uri	http://hdl.handle.net/10393/20158
dc.identifier.uri	http://dx.doi.org/10.20381/ruor-4725
dc.language.iso	en
dc.publisher	Université d'Ottawa / University of Ottawa
dc.subject	Persian
dc.subject	Persian computational linguistics
dc.subject	diacritizer
dc.subject	morphological analyzer
dc.subject	heterophonic homograph
dc.subject	disambiguation
dc.title	Towards the Development of an Automatic Diacritizer for the Persian Orthography based on the Xerox Finite State Transducer
dc.type	Thesis
thesis.degree.discipline	arts
thesis.degree.level	Doctoral
thesis.degree.name	phd
uottawa.department	Linguistique / Linguistics

Files

Original bundle

Name: Nojoumian_Peyman_2011_thesis.pdf
Size: 2.86 MB
Format: Adobe Portable Document Format
Description: PhD Thesis

License bundle

Name: license.txt
Size: 4.21 KB
Format: Item-specific license agreed upon to submission

Collections

- Thèses, 2011 - // Theses, 2011 -