سنڌي ٻوليءَلاءِڪمپيوٽرائزڊ لسانياتي اوزار

Computational Linguistic Tools for Sindhi Language

Sindhi is one of the oldest languages in the world. Its alphabet has fifty-two letters, and the language has room to adopt lexicons from several other languages. It is written, read and spoken all over the world. Sindhi is grammatically complex and morphologically rich. Its grammar is not the same as the grammar of English and other languages, and the meaning and sense of Sindhi lexicons differ as well. The diacritics used in Sindhi text change the meaning, number and gender of Sindhi lexicons.

Sindhi NLP (Natural Language Processing) deals with the Sindhi language and addresses its computational linguistics and NLP problems. Most research studies concentrate on English text: tokenization, tagging, syntactic parsing, sentiment analysis and so on. A variety of reliable resources is available for English, including text and tools for syntactic parsing and sentiment analysis. Online resources for languages other than English, such as Sindhi, are limited even in this digital era. The style, grammatical structure and domain of Sindhi differ from those of other languages. Working on Sindhi text for word tokenization, part-of-speech tagging, syntactic parsing, Sindhi WordNet, text corpora and sentiment analysis is therefore not the same as working on English text.

The main focus of this research is on understanding the challenges and issues in developing a Sindhi text parser, text corpus analysis, morphological analysis, statistical analysis, Sindhi and universal part-of-speech tagging, and word tokenization. Only a small number of lexicons are currently available in Sindhi WordNet; therefore, not all types of words and sentences can be analyzed and described.

Online Sindhi Text Parser: A language parser is a program that describes the structure of text in a given language. It is a natural language processing tool that segments a sentence into several tokens according to its grammatical structure. This segmentation is called word tokenization. The Sindhi online syntactic parser is a program that presents Sindhi text with proper segmentation, grammatical tagging, syntactic parsing, and statistical and morphological analysis. It uses UPOS (Universal Part of Speech) and SPOS (Sindhi Part of Speech) tags to tag and syntactically parse Sindhi text. Four algorithms were designed to develop this tool. The tokenization algorithm splits the Sindhi text into independent tokens and assigns them sequence numbers. The POS tagging algorithm assigns UPOS and SPOS tags to the Sindhi text after proper segmentation. The syntactic parsing algorithm identifies Sindhi words and assigns phrase labels and UPOS tags to them. The statistical algorithm measures the execution time, the number of tokens, and the frequencies of phrases, UPOS tags and morphological forms of Sindhi words.
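The tokenization, tagging and frequency-counting steps described above might look roughly like the following Python sketch. It is not the parser's actual code: the punctuation set, the three-word toy lexicon, the tag values and all function names are assumptions made for illustration only.

```python
import re
from collections import Counter

# Assumed punctuation set: Arabic comma, semicolon, question mark,
# Arabic full stop, plus a few ASCII marks.
PUNCT = "\u060C\u061B\u061F\u06D4.!:"
TOKEN_RE = re.compile(rf"[{re.escape(PUNCT)}]|[^\s{re.escape(PUNCT)}]+")

# Hypothetical toy lexicon: word -> (UPOS tag, SPOS tag).
LEXICON = {
    "هي": ("PRON", "ضمير"),
    "ڪتاب": ("NOUN", "اسم"),
    "آهي": ("AUX", "فعل"),
}


def tokenize(text):
    """Split text into (sequence_number, token) pairs, numbering from 1."""
    return list(enumerate(TOKEN_RE.findall(text), start=1))


def tag(tokens):
    """Attach UPOS and SPOS tags by lexicon lookup; unknown words get 'X'."""
    tagged = []
    for seq, tok in tokens:
        if tok in PUNCT:
            upos, spos = "PUNCT", "PUNCT"
        else:
            upos, spos = LEXICON.get(tok, ("X", "X"))
        tagged.append((seq, tok, upos, spos))
    return tagged


def upos_frequencies(tagged):
    """Count how often each UPOS tag occurs in the tagged tokens."""
    return Counter(upos for _, _, upos, _ in tagged)


if __name__ == "__main__":
    sentence = "هي ڪتاب آهي\u06D4"  # illustrative sentence built from the toy lexicon
    tagged = tag(tokenize(sentence))
    for row in tagged:
        print(row)
    print(upos_frequencies(tagged))
```

A real tagger would need context to resolve ambiguous words, which a plain dictionary lookup cannot do; the lookup is used here only to keep the sketch short.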

Sindhi WordNet: Sindhi WordNet is a lexical database of Sindhi nouns, adjectives, adverbs and verbs. It groups Sindhi words into sets of synonyms called synsets, provides short definitions and examples, and records a number of relations among these synsets or their members. Stemming, hyponyms and lemmatization of Sindhi words are also available in Sindhi WordNet. WordNet superficially resembles a thesaurus in that it groups words together based on their meanings. However, there are some important distinctions.

First, WordNet interlinks not just word forms (strings of letters) but specific senses of words. As a result, words found in close proximity to one another in the network are semantically disambiguated.

Second, WordNet labels the semantic relations among words, whereas the groupings of words in a thesaurus do not follow any explicit pattern other than similarity of meaning.

The main relation among words in WordNet is synonymy, for example the relation between love and affection, or in Sindhi between پيار and محبت. Synonyms denote the same concept and are interchangeable in many contexts; they are grouped into unordered sets called synsets. Each synset is linked to other synsets by means of a small number of “conceptual relations.” A hypernym (super-ordinate) is a word with a broad meaning that constitutes a category into which words with more specific meanings fall. For example, colour is a hypernym of red, blue and others; conversely, red and blue are hyponyms of colour. A meronym expresses a part-whole relation: an arm is part of the human body. Parts are inherited from super-ordinates: if a body has arms, then a more specific kind of body, such as the human body, has arms as well. Parts are not inherited “upward”, as they may be characteristic only of specific kinds of things rather than of the class as a whole.
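The relations described here (synonymy within a synset, hypernym links, and parts inherited downward but not upward) can be sketched as a small Python data structure. This is not the data model of Sindhi WordNet; the class, the function names and the "human body is a kind of body" link are assumptions introduced for the example.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity-based equality, so synsets can be put in sets
class Synset:
    name: str
    lemmas: frozenset
    hypernyms: list = field(default_factory=list)  # more general synsets
    parts: list = field(default_factory=list)      # meronyms of this synset


def are_synonyms(word_a, word_b, synsets):
    """Two words are synonyms if some synset contains both of them."""
    return any(word_a in s.lemmas and word_b in s.lemmas for s in synsets)


def hypernym_closure(synset):
    """Collect every super-ordinate reachable by following hypernym links."""
    seen, stack = set(), list(synset.hypernyms)
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(current.hypernyms)
    return seen


def inherited_parts(synset):
    """Own parts plus the parts of all super-ordinates (never of hyponyms)."""
    parts = list(synset.parts)
    for ancestor in hypernym_closure(synset):
        parts.extend(p for p in ancestor.parts if p not in parts)
    return parts


# Examples taken from the text; "human body" as a kind of "body" is assumed.
love_sd = Synset("love (Sindhi)", frozenset({"پيار", "محبت"}))
love_en = Synset("love", frozenset({"love", "affection"}))
colour = Synset("colour", frozenset({"colour"}))
red = Synset("red", frozenset({"red"}), hypernyms=[colour])
arm = Synset("arm", frozenset({"arm"}))
body = Synset("body", frozenset({"body"}), parts=[arm])
human_body = Synset("human body", frozenset({"human body"}), hypernyms=[body])

if __name__ == "__main__":
    print(are_synonyms("love", "affection", [love_en, love_sd]))
    print([s.name for s in hypernym_closure(red)])
    print([s.name for s in inherited_parts(human_body)])
```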

Sindhi Lemma and Stemming: Computational linguistics defines lemmatisation as the process of determining the lemma of a word based on its intended meaning and its part of speech. A lemma is the basic, meaningful form of a word. Stemming differs from lemmatisation because a stemmer reduces a word without any knowledge of its context. In many languages, words appear in several inflected forms. For example, in English, the verb 'to laugh' may appear as 'laugh', 'laughed', 'laughs' or 'laughing'. The base form 'laugh', the one a reader might look up in a dictionary, is called the lemma of the word. The association of the base form with a part of speech is often called a lexeme of the word. Lemmatisation of Sindhi words removes their inflection and shows the basic form of the word, as with "هيءُ", "هيءَ" and "هي". The base form is "هي", which belongs to the Sindhi part of speech "ضمير" (Zameer), that is, determiner / pronoun. Likewise, the verb "کلڻ" may appear as "کليو", "کلي" or "کل". The base form is "کل", which is the lemma. This lexeme is associated with the Sindhi part of speech "فعل" (Fael), that is, verb.
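As a minimal sketch, lemmatisation for the two examples above can be modelled as a table lookup that returns both the lemma and its Sindhi part of speech. The table holds only the forms quoted in the text, and the names `LEMMA_GROUPS`, `FORM_TO_LEMMA` and `lemmatize` are hypothetical; a real lemmatizer would need a full lexicon and context to pick the right part of speech.

```python
# Lemma -> (Sindhi POS label, inflected forms); entries come from the text above.
LEMMA_GROUPS = {
    "هي": ("ضمير", ["هيءُ", "هيءَ", "هي"]),
    "کل": ("فعل", ["کلڻ", "کليو", "کلي", "کل"]),
}

# Invert the groups into a form -> (lemma, POS) lookup table.
FORM_TO_LEMMA = {
    form: (lemma, spos)
    for lemma, (spos, forms) in LEMMA_GROUPS.items()
    for form in forms
}


def lemmatize(word):
    """Return (lemma, Sindhi POS); unknown words come back unchanged with None."""
    return FORM_TO_LEMMA.get(word, (word, None))


if __name__ == "__main__":
    for form in FORM_TO_LEMMA:
        print(form, "->", lemmatize(form))
```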

Stemming is a method of reducing a word to its root form. It removes the inflections and affixes, such as prefixes and suffixes, that are added to a root to form other meaningful words. For example, the Sindhi word مل (meet) is the root or stem of the associated Sindhi words ملي، ملڻ، مليو، ملندا، ملندي، ملنديون، مليون، مليا and ملندوَ. Similarly, the Sindhi word آھ is the root of آهي، آهن، آهين and آهيون.
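A simple way to illustrate this is a longest-match suffix stripper. The sketch below is an assumption-laden illustration, not the stemmer used in the tools described here: the suffix list is derived only from the مل examples above, diacritics are removed before matching, and the minimum stem length of two letters is an arbitrary guard against over-stripping.

```python
import re

# Arabic-script short-vowel and related marks (U+064B to U+0652), removed before matching.
DIACRITICS = re.compile("[\u064B-\u0652]")

# Illustrative suffixes taken from the example word forms, longest first.
SUFFIXES = sorted(
    ["نديون", "ندا", "ندي", "ندو", "يون", "يو", "يا", "ي", "ڻ"],
    key=len,
    reverse=True,
)


def stem(word, min_stem_len=2):
    """Strip diacritics, then remove the longest matching suffix, if any."""
    bare = DIACRITICS.sub("", word)
    for suffix in SUFFIXES:
        if bare.endswith(suffix) and len(bare) - len(suffix) >= min_stem_len:
            return bare[: -len(suffix)]
    return bare


if __name__ == "__main__":
    root = "مل"
    for suffix in SUFFIXES:
        form = root + suffix
        print(form, "->", stem(form))
```

Because the stemmer has no lexicon or context, it will also strip these endings from unrelated words that happen to share them, which is exactly the difference from lemmatisation noted earlier.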

Mazhar Ali Dootio

Computational linguist / NLP Researcher

Mazhar Ali Dootio is pursuing his doctoral research in computer science at the Shaheed Zulfikar Ali Bhutto Institute of Science and Technology (SZABIST), Karachi campus, Sindh, Pakistan. He works as an Assistant Professor in the Department of Computer Science and Information Technology, Benazir Bhutto Shaheed University, Karachi, Sindh, Pakistan. His field of doctoral research is Natural Language Processing (NLP). He has developed several NLP / computational linguistics tools for the Sindhi language because tools built for English do not support Sindhi.
The NLP / computational linguistics tools presented here are a basic requirement for his research study on 'Sentiment analysis for Sindhi text'. They may benefit anyone who wants to conduct research on the Sindhi language, and may help in understanding Sindhi grammar, lexicon structures, lemmas and stems. The sentiment analysis tool is significant for extracting opinions and sentiments.
His research areas are computational linguistics / NLP, data mining, machine learning, deep learning and artificial intelligence.

Dr. Asim Imdad Wagan

Research Supervisor

Engr. Dr. Asim Imdad Wagan is Associate Professor of Computer Science and Dean Academics at Mohammad Ali Jinnah University, Karachi, Sindh, Pakistan. He is a member of IEEE and of the Pakistan Engineering Council. He earned both his PhD and MS in Computer Science at the National Institute of Applied Sciences of Lyon (INSA de Lyon, France). He is an expert in various areas of computer science. His research interests include machine learning, data science, computer vision, image processing, soft computing and deep learning. His specialties are image, video and 3D processing and analysis, self-driving cars, image and video classification, and document processing.
Dr. Wagan is the supervisor of my doctoral research on computational linguistics issues of the Sindhi language at SZABIST, Karachi campus, Sindh, Pakistan. He provided me with very good supervision, proper guidance, constant support, encouragement and a good environment for working on this research study.

Copyright © 2017-2019, All Rights Reserved.