سنڌي ٻوليءَ لاءِ ڪمپيوٽرائزڊ لسانياتي اوزار

Computational Linguistic Tools for Sindhi Language

Sindhi is one of the world's oldest languages. Its alphabet has fifty-two letters, and its lexicon has room to absorb words from several other languages. It is written, read, and spoken around the world. Sindhi is grammatically complex and morphologically rich. Its grammar differs from that of English and other languages, and Sindhi lexical items also differ in meaning and sense. Diacritics in Sindhi text can change the meaning, number, and gender of a word.

Sindhi NLP (Natural Language Processing) addresses computational linguistics and NLP problems for the Sindhi language. Most research concentrates on English text: tokenization, tagging, syntactic parsing, sentiment analysis, and so on. As a result, a variety of reliable resources, including corpora and tools for syntactic parsing and sentiment analysis, are available for English. Online resources for other languages, Sindhi among them, remain limited even in this digital era. The style, grammatical structure, and domain of Sindhi differ from those of other languages, so word tokenization, part-of-speech tagging, syntactic parsing, WordNet construction, corpus building, and sentiment analysis for Sindhi text are not the same as for English text. This research focuses on understanding the challenges and issues in developing a Sindhi text parser, a tagging system, text corpus analysis, morphological analysis, statistical analysis, and word tokenization.

Online Sindhi Text Parser: A language parser is a program that describes the structure of text in a language. The online Sindhi syntactic parser presents Sindhi text with proper segmentation, grammatical tagging, syntactic parsing, and statistical and morphological analysis. It tags Sindhi text with UPOS (Universal Part of Speech) and SPOS (Sindhi Part of Speech) labels. The tool is built on four algorithms. The tokenization algorithm splits Sindhi text into independent tokens and assigns them sequence numbers. The POS tagging algorithm assigns UPOS and SPOS tags to the segmented text. The syntactic parsing algorithm identifies Sindhi words and assigns phrases and UPOS tags to them. The statistical algorithm measures the execution time, the number of tokens, and the frequencies of phrases, UPOS tags, and morphological forms of Sindhi words.
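The tokenization and statistical steps described above can be illustrated with a minimal Python sketch. It is not the parser's actual implementation: the function names, the punctuation set, and the whitespace-based splitting rule are all assumptions made for illustration.

```python
import re
import time
from collections import Counter

# Assumed punctuation set (not taken from the tool itself): Arabic comma,
# Arabic semicolon, Arabic question mark, Arabic full stop, and ASCII marks.
PUNCT = "\u060C\u061B\u061F\u06D4.,!?:;"
_P = re.escape(PUNCT)

# A token is either a run of non-space, non-punctuation characters
# or a single punctuation mark.
TOKEN_RE = re.compile("[^\\s" + _P + "]+|[" + _P + "]")


def tokenize(text):
    """Split text into tokens and pair each with a 1-based sequence number."""
    return list(enumerate(TOKEN_RE.findall(text), start=1))


def token_statistics(text):
    """Report token count, token frequencies, and tokenization time."""
    start = time.perf_counter()
    tokens = tokenize(text)
    elapsed = time.perf_counter() - start
    frequencies = Counter(token for _, token in tokens)
    return {
        "token_count": len(tokens),
        "frequencies": frequencies,
        "seconds": elapsed,
    }


if __name__ == "__main__":
    sample = "مل\u060C ملي مل"
    for number, token in tokenize(sample):
        print(number, token)
    print(token_statistics(sample)["token_count"])
```

Splitting on whitespace and punctuation is only a starting point. Sindhi text in which word boundaries are not marked by spaces would need a segmentation step that this sketch does not attempt.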

Sindhi WordNet: Sindhi WordNet is a lexical database of Sindhi nouns, adjectives, adverbs, and verbs. It groups Sindhi words into sets of synonyms called synsets, provides short definitions and examples, and records several relations among these synsets or their members. Stems, hyponyms, and lemmas of Sindhi words are also available in Sindhi WordNet. WordNet superficially resembles a thesaurus, in that it groups words based on their meanings. However, there are some important distinctions.

First, WordNet links not just word forms (strings of letters) but specific senses of words. As a result, words that are found close to one another in the network are semantically disambiguated.

Second, WordNet labels the semantic relations among words, whereas the groupings of words in a thesaurus do not follow any explicit pattern other than meaning similarity. The main relation among words in WordNet is synonymy, for example the relation between the words love and affection, or in Sindhi پيار and محبت. Synonyms denote the same concept, are interchangeable in many contexts, and are grouped into unordered sets called synsets. Each synset of WordNet is linked to other synsets by a small number of “conceptual relations.” A hypernym (superordinate) is a word with a broad meaning that constitutes a category into which words with more specific meanings (hyponyms) fall. For example, color is a hypernym of red, blue, and others. A meronym expresses a part-whole relation: the arm is part of the human body. Parts are inherited from superordinates: if a body has arms, then a human body, being a kind of body, has arms as well. Parts are not inherited “upward,” as they may be characteristic only of specific kinds of things rather than the class as a whole.
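As a rough illustration of how synsets and their relations could be represented, the following Python sketch models a synset with synonym members, hypernym links, and part-whole links. The class and function names are hypothetical and do not describe the actual Sindhi WordNet storage format. The entries reuse only the examples given above.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    """A set of synonymous words plus links to related synsets (illustrative)."""
    name: str
    lemmas: set
    definition: str = ""
    hypernyms: list = field(default_factory=list)  # broader synsets
    part_of: list = field(default_factory=list)    # wholes this synset is part of


def are_synonyms(word_a, word_b, synsets):
    """Two words are synonyms if some synset contains both of them."""
    return any(word_a in s.lemmas and word_b in s.lemmas for s in synsets)


def hypernym_chain(synset):
    """Follow the first hypernym link upward; assumes the links form no cycle."""
    chain = []
    current = synset
    while current.hypernyms:
        current = current.hypernyms[0]
        chain.append(current.name)
    return chain


# Entries built from the examples in the text above.
love = Synset("love", {"پيار", "محبت"})
color = Synset("color", {"color"})
red = Synset("red", {"red"}, hypernyms=[color])
blue = Synset("blue", {"blue"}, hypernyms=[color])
body = Synset("human body", {"human body"})
arm = Synset("arm", {"arm"}, part_of=[body])

if __name__ == "__main__":
    print(are_synonyms("پيار", "محبت", [love, color, red]))
    print(hypernym_chain(red))
    print([whole.name for whole in arm.part_of])
```

Keeping the links on the synset rather than on the word string reflects the point that WordNet relates specific senses, not bare word forms.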

Sindhi Lemma and Stemming: Computational linguistics defines lemmatisation as the process of determining the lemma of a word based on its intended meaning and its part of speech. A lemma is the base form of a word, one that is meaningful on its own. Stemming differs from lemmatisation because a stemmer reduces a word without knowledge of its context. In many languages, words appear in several inflected forms. For example, in English, the verb 'to laugh' may appear as 'Laugh', 'Laughed', 'Laughs', or 'Laughing'. The base form 'Laugh', which one might look up in a dictionary, is called the lemma of the word. The association of the base form with a part of speech is often called a lexeme of the word. Lemmatisation of Sindhi words removes inflection and yields the base form. For example, "هيءُ", "هيءَ", and "هي" share the base form "هي", whose Sindhi part of speech is "ضمير" (Zameer), a determiner/pronoun. Likewise, the verb "کلڻ" may appear as "کليو", "کلي", or "کل". The base form "کل" is the lemma, and this lexeme is associated with the Sindhi part of speech "فعل" (Fael), that is, verb.
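A dictionary lookup is one simple way to picture lemmatisation. The Python sketch below maps each inflected form from the examples above to its lemma and Sindhi part of speech. The table and the function name are illustrative assumptions. A real lemmatiser would need a much larger lexicon and would use context to choose the part of speech.

```python
# Lookup table built only from the examples in the text:
# inflected form -> (lemma, Sindhi part of speech).
LEMMA_TABLE = {
    "هيءُ": ("هي", "ضمير"),
    "هيءَ": ("هي", "ضمير"),
    "هي": ("هي", "ضمير"),
    "کلڻ": ("کل", "فعل"),
    "کليو": ("کل", "فعل"),
    "کلي": ("کل", "فعل"),
    "کل": ("کل", "فعل"),
}


def lemmatize(word):
    """Return (lemma, part of speech); unknown words come back unchanged."""
    return LEMMA_TABLE.get(word, (word, None))


if __name__ == "__main__":
    for form in ["کليو", "کلي", "هي"]:
        lemma, pos = lemmatize(form)
        print(form, lemma, pos)
```

Returning the part of speech together with the lemma mirrors the idea of a lexeme as a base form paired with its part of speech.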

Stemming reduces a word to its root form. It removes the inflections and affixes (prefixes and suffixes) that were added to the root to form the word. For example, the Sindhi word مل (meet) is the root or stem of the related words ملي، ملڻ، مليو، ملندا، ملندي، ملنديون، مليون، مليا and ملندوَ. Similarly, آھ is the root of آهي، آهن، آهين and آهيون.
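The idea of stripping suffixes without regard to context can be sketched in a few lines of Python. The suffix list below is an assumption inferred solely from the مل examples above. It is not the rule set of the actual Sindhi stemmer and would over-strip or under-strip many other words.

```python
import unicodedata

# Hypothetical suffix list inferred from the example forms of "مل".
# Longer suffixes are tried first so that "نديون" wins over "يون".
SUFFIXES = sorted(
    ["نديون", "ندا", "ندي", "ندو", "يون", "يا", "يو", "ي", "ڻ"],
    key=len,
    reverse=True,
)


def strip_diacritics(word):
    """Drop combining marks such as the short-vowel diacritics."""
    return "".join(ch for ch in word if not unicodedata.combining(ch))


def stem(word, min_len=2):
    """Remove the longest matching suffix, keeping at least min_len letters."""
    bare = strip_diacritics(word)
    for suffix in SUFFIXES:
        if bare.endswith(suffix) and len(bare) - len(suffix) >= min_len:
            return bare[: -len(suffix)]
    return bare


if __name__ == "__main__":
    forms = ["ملي", "ملڻ", "مليو", "ملندا", "ملندي",
             "ملنديون", "مليون", "مليا", "ملندوَ"]
    for form in forms:
        print(form, stem(form))
```

Unlike the lemma lookup, this function never consults a lexicon or a part of speech, which is exactly the difference between stemming and lemmatisation described above.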

Mazhar Ali Dootio

Computational linguist / NLP Researcher

Mazhar Ali Dootio completed his doctoral research in computer science at Shaheed Zulfikar Ali Bhutto Institute of Science and Technology (SZABIST), Karachi Campus, Sindh, Pakistan, under the supervision of Prof. Dr. Asim Imdad Wagan. Engr. Dr. Wagan is Professor of Computer Science and Dean of Academics at Mohammad Ali Jinnah University, Karachi, Sindh, Pakistan. He earned both his PhD and MS in Computer Science at the National Institute of Applied Sciences of Lyon (INSA de Lyon, France).

Dr. Mazhar works as an Assistant Professor in the Faculty of Computer Science and Information Technology, Benazir Bhutto Shaheed University, Karachi, Sindh, Pakistan. His doctoral research field is Natural Language Processing (NLP), a subfield of Artificial Intelligence. He has developed several NLP / computational linguistics models and resources for the Sindhi language because English-based tools do not support Sindhi.

The NLP / computational linguistics tools presented here were developed for his research study on 'Sentiment analysis for Sindhi text'. These resources may be used by researchers, linguists, or anyone who wants to conduct research on the Sindhi language in the field of computational linguistics / natural language processing. They may be helpful in understanding and analyzing Sindhi grammar, lexicon structures, lemmas, stems, and other morphological features of the language. The sentiment analysis tools are important for extracting opinions and sentiments from opinionated Sindhi text.
The research areas of Dr. Mazhar are Artificial Intelligence, including Computational Linguistics / NLP, Machine Learning, Deep Learning, and Data Mining.

Copyright © 2017-2020, All Rights Reserved.