2 Replies - 1491 Views - Last Post: 03 March 2013 - 09:50 PM

#1 Yomna Salah

Getting a root of an arabic word

Posted 01 March 2013 - 04:50 PM

I have Python code that takes an Arabic word, removes its diacritics, and returns its root, but the output is wrong for some inputs. For example, when the input is "العربيه" the output is "عرب", but when the input is "كاتب" the output is "ب", and when the input is "يخاف" the output is " خف".

This is my code:

# -*- coding=utf-8 -*-

import re
from arabic_const import *
import Tashaphyne
from Tashaphyne import *
import enum
from enum import Enum
search_type=Enum('unvoc_word','voc_word','root_word')

HARAKAT_pat = re.compile(ur"[" + u"".join([FATHATAN, DAMMATAN, KASRATAN, FATHA, DAMMA, KASRA, SUKUN, SHADDA]) + u"]")
HAMZAT_pat = re.compile(ur"[" + u"".join([WAW_HAMZA, YEH_HAMZA]) + u"]")
ALEFAT_pat = re.compile(ur"[" + u"".join([ALEF_MADDA, ALEF_HAMZA_ABOVE, ALEF_HAMZA_BELOW, HAMZA_ABOVE, HAMZA_BELOW]) + u"]")
LAMALEFAT_pat = re.compile(ur"[" + u"".join([LAM_ALEF, LAM_ALEF_HAMZA_ABOVE, LAM_ALEF_HAMZA_BELOW, LAM_ALEF_MADDA_ABOVE]) + u"]")
#--------------------------------------
def strip_tashkeel(w):
        "strip vowel from a word and return a result word"
        return HARAKAT_pat.sub('', w)

#strip tatweel from a word and return a result word
#--------------------------------------
def strip_tatweel(w):
        "strip tatweel from a word and return a result word"
        return re.sub(ur'[%s]' % TATWEEL,       '', w)


#--------------------------------------
def normalize_hamza(w):
        "replace alef variants with alef and hamza carriers with hamza, and return a result word"
        w = ALEFAT_pat.sub(ALEF, w)
        return HAMZAT_pat.sub(HAMZA, w)

#--------------------------------------
def normalize_lamalef(w):
        "replace lam-alef ligatures with lam followed by alef, and return a result word"
        return LAMALEFAT_pat.sub(u'%s%s' % (LAM, ALEF), w)

#--------------------------------------
def normalize_spellerrors(w):
        "replace teh marbuta with heh and alef maksura with yeh, and return a result word"
        w = re.sub(ur'[%s]' % TEH_MARBUTA,      HEH, w)
        return re.sub(ur'[%s]' % ALEF_MAKSURA,  YEH, w)


def normalize_text(word,searchtype):
        word = strip_tashkeel(word)
        word = strip_tatweel(word)
        word = normalize_lamalef(word)
        word = normalize_hamza(word)
        word = normalize_spellerrors(word)
        if searchtype==search_type.root_word.index:
           # stem the normalized word, then ask the stemmer for its root
           ArListem = ArabicLightStemmer()
           stem = ArListem.lightStm(word)
           word = ArListem.get_root()
        print word
        return word
#---------------------------------------------


And this is the test code:

from task import normalize_text
normalize_text(u'كاتب',2)


And the output is: ب
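
One way to narrow this down might be to run the normalization steps on their own, with no stemmer involved, and see whether the word is still intact at that point. The following is only a sketch under assumptions: the code points are written out by hand and are assumed to match the names in arabic_const, the combining HAMZA_ABOVE and HAMZA_BELOW marks are left out for brevity, and normalize_only is a hypothetical helper that is not part of Tashaphyne.

```python
# -*- coding: utf-8 -*-
import re

# Code points written out by hand; assumed to match the names in arabic_const.
HARAKAT = u"\u064b\u064c\u064d\u064e\u064f\u0650\u0651\u0652"  # tanween, short vowels, shadda, sukun
TATWEEL = u"\u0640"
LAM_ALEF_LIGATURES = u"\ufefb\ufef7\ufef9\ufef5"  # plain, hamza above, hamza below, madda above
ALEF_VARIANTS = u"\u0622\u0623\u0625"             # alef with madda, hamza above, hamza below
HAMZA_CARRIERS = u"\u0624\u0626"                  # waw with hamza, yeh with hamza
LAM, ALEF, HAMZA = u"\u0644", u"\u0627", u"\u0621"
TEH_MARBUTA, HEH = u"\u0629", u"\u0647"
ALEF_MAKSURA, YEH = u"\u0649", u"\u064a"

STRIP_PAT = re.compile(u"[" + HARAKAT + TATWEEL + u"]")
LAM_ALEF_PAT = re.compile(u"[" + LAM_ALEF_LIGATURES + u"]")
ALEF_PAT = re.compile(u"[" + ALEF_VARIANTS + u"]")
HAMZA_PAT = re.compile(u"[" + HAMZA_CARRIERS + u"]")


def normalize_only(word):
    """Hypothetical helper: the normalization steps with no stemming."""
    word = STRIP_PAT.sub(u"", word)            # drop diacritics and tatweel
    word = LAM_ALEF_PAT.sub(LAM + ALEF, word)  # expand lam-alef ligatures
    word = ALEF_PAT.sub(ALEF, word)            # unify alef variants
    word = HAMZA_PAT.sub(HAMZA, word)          # unify hamza carriers
    word = word.replace(TEH_MARBUTA, HEH)
    return word.replace(ALEF_MAKSURA, YEH)


katib = u"\u0643\u0627\u062a\u0628"  # the word passed in the test code above
result = normalize_only(katib)       # still the same four letters
```

If the word comes out of this step unchanged, as it does here, the letters are presumably being lost in the stemming step rather than in the regular expressions.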


Replies To: Getting a root of an arabic word

#2 tlhIn`toq

Re: Getting a root of an arabic word

Posted 03 March 2013 - 08:13 AM

You do realize this is an English-speaking site, right?
You would probably have better luck on an Arabic-speaking site, because those readers would understand what the problem is with what you've shown us. To us it's just squiggles and gibberish. Your description of the problem for input and output means nothing to us, sorry.

#3 atraub

Re: Getting a root of an arabic word

Posted 03 March 2013 - 09:50 PM

tlhIn`toq, on 03 March 2013 - 10:13 AM, said:

You do realize this is an English-speaking site, right?
You would probably have better luck on an Arabic-speaking site, because those readers would understand what the problem is with what you've shown us. To us it's just squiggles and gibberish. Your description of the problem for input and output means nothing to us, sorry.


Well said. If Arabic's system is remotely similar to English, then getting root words is going to be tricky.

This post has been edited by atraub: 03 March 2013 - 09:55 PM
