Maha is a text processing library specially developed to deal with Arabic text.

Overview



CI Documentation Status codecov Discord Downloads License PyPI version Code style: black Checked with mypy PyPI - Python Version

An Arabic text processing library intended for use in NLP applications


Maha is a text processing library specially developed to deal with Arabic text. The beta version can be used to clean and parse text, files, and folders with or without streaming capability.

If you need help or want to discuss topics related to Maha, feel free to reach out to our Discord server. If you would like to submit a bug report or feature request, please open an issue.

Installation

Simply run the following to install Maha:

pip install mahad # pronounced maha d

For source installation, check the documentation.

Overview

Check out the overview section in the documentation to get started with Maha.

Documentation

Documentation are hosted at ReadTheDocs.

Contributing

Maha welcomes and encourages everyone to contribute. Contributions are always appreciated. Feel free to take a look at our contribution guidelines in the documentation.

License

Maha is BSD-licensed.

Comments
  • Time: Add the ability to parse Hijri dates

    Time: Add the ability to parse Hijri dates

    What does this pull request change?

    Closes #27.

    Status (please check what you already did):

    • [x] added some tests for the functionality
    • [ ] updated the documentation
    • [x] tox passes
    new feature highlight 
    opened by TRoboto 6
  • Added distance to dimension parsing

    Added distance to dimension parsing

    What does this pull request change?

    Resolves #15.

    Status (please check what you already did):

    • [x] added some tests for the functionality
    • [x] updated the documentation
    • [x] tox passes
    parsing highlight 
    opened by TRoboto 5
  • Introduce :mod:`~.datasets` module and the first dataset, `names`, with over 40,000 unique names

    Introduce :mod:`~.datasets` module and the first dataset, `names`, with over 40,000 unique names

    What does this pull request change?

    This PR introduces a new datasets module that offers an interface for all upcoming datasets. A new dataset, names, is released along with the module. It comprises 44,161 unique names with descriptions and name origin included for most names.

    Link to updated docs: https://maha--40.org.readthedocs.build/en/40/overview.html#datasets

    Status (please check what you already did):

    • [x] added some tests for the functionality
    • [x] updated the documentation
    • [x] tox passes
    new feature highlight 
    opened by TRoboto 4
  • Add pyupgrade to pre-commit and upgrade to future-style type annotations

    Add pyupgrade to pre-commit and upgrade to future-style type annotations

    What does this pull request change?

    Upgrades to new type annotations style.

    Status (please check what you already did):

    • [ ] added some tests for the functionality
    • [ ] updated the documentation
    • [x] tox passes
    maintenance 
    opened by TRoboto 3
  • Deprecate and remove `datasets` module and host datasets on Hugging Face instead

    Deprecate and remove `datasets` module and host datasets on Hugging Face instead

    What does this pull request change?

    • Removes datasets module.
    • Datasets are now hosted here

    Status (please check what you already did):

    • [ ] added some tests for the functionality
    • [ ] updated the documentation
    • [x] tox passes
    breaking changes deprecation 
    opened by TRoboto 3
  • Add the ability to parse names from text

    Add the ability to parse names from text

    What does this pull request change?

    Adds #24. Depends on #40

    Status (please check what you already did):

    • [x] added some tests for the functionality
    • [x] updated the documentation
    • [x] tox passes
    new feature highlight 
    opened by TRoboto 3
  • Add a deprecation system

    Add a deprecation system

    What does this pull request change?

    • Closes #23
    • Adds 3 deprecation decorators; for functions, for parameters, for default parameters.

    Status (please check what you already did):

    • [x] added some tests for the functionality
    • [ ] updated the documentation
    • [x] tox passes
    development 
    opened by saedx1 3
  • Prepare for the next release of Maha (v0.3.0)

    Prepare for the next release of Maha (v0.3.0)

    This is an auto-generated PR to prepare for the next release of Maha. The following changes were automatically made:

    • Generated changelogs for release v0.3.0.
    • Bumped pypi version to v0.3.0.
    • Updated the citation information.
    opened by github-actions[bot] 2
  • Ordinal: Add support to `بعد` in ordinal parsing

    Ordinal: Add support to `بعد` in ordinal parsing

    What does this pull request change?

    Closes #48.

    Status (please check what you already did):

    • [x] added some tests for the functionality
    • [ ] updated the documentation
    • [x] tox passes
    new feature 
    opened by TRoboto 2
  • Numeral: Add support for hierarchical parsing

    Numeral: Add support for hierarchical parsing

    What does this pull request change?

    Closes #25

    Status (please check what you already did):

    • [x] added some tests for the functionality
    • [ ] updated the documentation
    • [x] tox passes
    new feature 
    opened by TRoboto 2
  • Prepare for the next release of Maha (v0.2.0)

    Prepare for the next release of Maha (v0.2.0)

    This is an auto-generated PR to prepare for the next release of Maha. The following changes were automatically made:

    • Generated changelogs for release v0.2.0.
    • Bumped pypi version to v0.2.0.
    • Updated the citation information.
    opened by github-actions[bot] 2
  • Update ci.yml

    Update ci.yml

    Check the support for python 3,10

    What does this pull request change? It checks if the library is supporting python 3.10.

    • ...

    Status (please check what you already did):

    • [ ] added some tests for the functionality
    • [ ] updated the documentation
    • [ ] tox passes
    opened by PAIN-BARHAM 1
  • Add the option to ignore Harakat when removing or replacing

    Add the option to ignore Harakat when removing or replacing

    What problem are you trying to solve?

    Currently, the cleaner functions do not consider two strings similar if they have different Harakat/diacritics, which is the correct behavior. However, it would be great if the user had the option to ignore Harakat when comparing strings.

    Examples (if relevant)

    Current:

    >> from maha.cleaners.functions import remove
    >> output = remove("يُدَرِّسُ اللُّغَةَ العَرَبِيَّةَ الفُصْحَى", custom_expressions=r"اللغة")
    >> output
    يُدَرِّسُ اللُّغَةَ العَرَبِيَّةَ الفُصْحَى
    

    Suggested:

    >> from maha.cleaners.functions import remove
    >> remove("يُدَرِّسُ اللُّغَةَ العَرَبِيَّةَ الفُصْحَى", custom_expressions=r"اللغة", ignore_harakat=True)
    >> output
    يُدَرِّسُ العَرَبِيَّةَ الفُصْحَى
    

    Definition of Done

    • It must adhere to the coding style used in the defined cleaner functions.
    • The implementation should cover most use cases.
    • Adding tests
    feature request 
    opened by xaleel 1
  • Wrong parsed name using name dimension

    Wrong parsed name using name dimension

    What happened?

    The name parser extracted wrong name likes : بي, شكرا.

    Example: text: أريد البحث في سجل الإنفاق الخاص بي [Dimension(body=بي, value=بي, start=32, end=34, dimension_type=DimensionType.NAME)]

    I expect to extract the names on the name dataset only.

    Python version

    3.8

    What operating system are you using?

    Linux

    Code to reproduce the issue

    >>> from maha.parsers.functions import parse_dimension
    >>> text = `أريد البحث في سجل الإنفاق الخاص بي`
    >>> extracted = parse_dimension(text, names=True)
    [Dimension(body=بي, value=بي, start=32, end=34, dimension_type=DimensionType.NAME)]
    

    Relevant log output

    No response

    bug parsing 
    opened by PAIN-BARHAM 0
  • Add feature to parse duration period

    Add feature to parse duration period

    What problem are you trying to solve?

    Parsing the duration from the text that has the difference between the two dates.

    Examples (if relevant)

    >>> from maha.parsers.functions import parse_dimension
    >>> output = parse_dimension('عن ربع نمو سكان العالم القديم والتحضر بين 1700 و 1900 ميلادي', duration=True)[0].value
    >>> output
    DurationValue(values=[ValueUnit(value=200, unit=<DurationUnit.YEARS: 7>)], normalized_unit=<DurationUnit.SECONDS: 1>)
    
    

    Definition of Done

    • It must adhere to the coding style used in the defined dimensions, duration dimension.
    • The implementation should cover most use cases.
    • Adding tests
    feature request 
    opened by PAIN-BARHAM 1
  • Adding the parser functionality to Processors

    Adding the parser functionality to Processors

    What problem are you trying to solve?

    Adding the parser functionality to Processors to parse different dimensions.

    Examples (if relevant)

    >>> from pathlib import Path
    >>> import maha
    >>> resource_path = Path(maha.__file__).parents[1] / "sample_data/tweets.txt"
    >>> data = resource_path.read_text()
    >>> print(data)
    
    الساعة الآن 12:00 في اسبانيا 🇪🇸, انتهى بشكل رسمي عقد الأسطورة ليو ميسي مع برشلونة . .
    طبعا بكونو حاطين المكيف ع٣ مئوية وخود تقلبات وبرد وحر وCNS وزعيق المراقب وألف نيلة وقر فتحت اشوف درجة الحرارة هتبقي كام يو الامتحان لقيتها ٤٢ والامتحان الساعه ١ فعايز انورماليز اننا ننزل بالفالنه الحمالات Hot fac
    يسعدلي مساكم ❤🌹 شرح كلمة zwa هالمنشور رح تلاقو (zwar) سهل و لذيذ (aber) ناقصو شوية ملح وكزبر #منقو
    مـعلش استحملوني ب الاصفر هالفتره 💛 #ريشـه هههههههه
    لما حد يسالني بتختفي كتير لية =..
    زيِّنوا ليلة الجمع بالصلاة على النَّبِيِّ ﷺ" ❤
    #Windows11 is on the horizon. What feature are you looking forward to
    Get vaccinate #savethesaviour
    Today I am beginning project on 10 days duratio #30daysofcod #DEVCommunit
    
    >>> from maha.processors import FileProcessor
    >>> proc = FileProcessor(resource_path)
    >>> parsed = proc.parse_dimension(time=True)
    [Dimension(body=الساعة الآن 12:00, value=TimeValue(years=0, months=0, days=0, hours=0, minutes=0, seconds=0, hour=12, minute=0, second=0, microsecond=0), start=0, end=17, dimension_type=DimensionType.TIME),
     Dimension(body=الساعه ١, value=TimeValue(hour=1, minute=0, second=0, microsecond=0), start=238, end=246, dimension_type=DimensionType.TIME),
     Dimension(body=ليلة, value=TimeValue(am_pm='PM'), start=491, end=495, dimension_type=DimensionType.TIME)]
    
    

    Definition of Done

    • It must adhere to the coding style.
    • The implementation should cover most use cases.
    • Adding tests.
    good first issue feature request parsing 
    opened by PAIN-BARHAM 0
Releases(v0.3.0)
Owner
Mohammad Al-Fetyani
Machine Learning Engineer
Mohammad Al-Fetyani
Script and models for clustering LAION-400m CLIP embeddings.

clustering-laion400m Script and models for clustering LAION-400m CLIP embeddings. Models were fit on the first million or so image embeddings. A subje

Peter Baylies 22 Oct 04, 2022
ASCEND Chinese-English code-switching dataset

ASCEND (A Spontaneous Chinese-English Dataset) introduces a high-quality resource of spontaneous multi-turn conversational dialogue Chinese-English code-switching corpus collected in Hong Kong.

CAiRE 11 Dec 09, 2022
Codes for processing meeting summarization datasets AMI and ICSI.

Meeting Summarization Dataset Meeting plays an essential part in our daily life, which allows us to share information and collaborate with others. Wit

xcfeng 39 Dec 14, 2022
Collection of useful (to me) python scripts for interacting with napari

Napari scripts A collection of napari related tools in various state of disrepair/functionality. Browse_LIF_widget.py This module can be imported, for

5 Aug 15, 2022
This is a MD5 password/passphrase brute force tool

CROWES-PASS-CRACK-TOOl This is a MD5 password/passphrase brute force tool How to install: Do 'git clone https://github.com/CROW31/CROWES-PASS-CRACK-TO

9 Mar 02, 2022
Mlcode - Continuous ML API Integrations

mlcode Basic APIs for ML applications. Django REST Application Contains REST API

Sujith S 1 Jan 01, 2022
Telegram AI chat bot written in Python using Pyrogram

Aurora_Al Just another Telegram AI chat bot written in Python using Pyrogram. A public running instance can be found on telegram as @AuroraAl. Require

♗CσNϙUҽRσR_MҽSƙEƚҽҽR 1 Oct 31, 2021
Finding Label and Model Errors in Perception Data With Learned Observation Assertions

Finding Label and Model Errors in Perception Data With Learned Observation Assertions This is the project page for Finding Label and Model Errors in P

Stanford Future Data Systems 17 Oct 14, 2022
Pre-training with Extracted Gap-sentences for Abstractive SUmmarization Sequence-to-sequence models

PEGASUS library Pre-training with Extracted Gap-sentences for Abstractive SUmmarization Sequence-to-sequence models, or PEGASUS, uses self-supervised

Google Research 1.4k Dec 22, 2022
Pretrain CPM - 大规模预训练语言模型的预训练代码

CPM-Pretrain 版本更新记录 为了促进中文自然语言处理研究的发展,本项目提供了大规模预训练语言模型的预训练代码。项目主要基于DeepSpeed、Megatron实现,可以支持数据并行、模型加速、流水并行的代码。 安装 1、首先安装pytorch等基础依赖,再安装APEX以支持fp16。 p

Tsinghua AI 37 Dec 06, 2022
Deduplication is the task to combine different representations of the same real world entity.

Deduplication is the task to combine different representations of the same real world entity. This package implements deduplication using active learning. Active learning allows for rapid training wi

63 Nov 17, 2022
A PyTorch implementation of paper "Learning Shared Semantic Space for Speech-to-Text Translation", ACL (Findings) 2021

Chimera: Learning Shared Semantic Space for Speech-to-Text Translation This is a Pytorch implementation for the "Chimera" paper Learning Shared Semant

Chi Han 43 Dec 28, 2022
Textpipe: clean and extract metadata from text

textpipe: clean and extract metadata from text textpipe is a Python package for converting raw text in to clean, readable text and extracting metadata

Textpipe 298 Nov 21, 2022
**NSFW** A chatbot based on GPT2-chitchat

DangBot -- 好怪哦,再来一句 卡群怪话bot,powered by GPT2 for Chinese chitchat Training Example: python train.py --lr 5e-2 --epochs 30 --max_len 300 --batch_size 8

Tommy Yang 11 Jul 21, 2022
मराठी भाषा वाचविण्याचा एक प्रयास. इंग्रजी ते मराठीचा शब्दकोश. An attempt to preserve the Marathi language. A lightweight and ad free English to Marathi thesaurus.

For English, scroll down मराठी शब्द मराठी भाषा वाचवण्यासाठी मी हा ओपन सोर्स प्रोजेक्ट सुरू केला आहे. माझ्या मते, आपली भाषा हळूहळू आणि कोणाचाही लक्षात

मुक्त स्त्रोत 20 Oct 11, 2022
MMDA - multimodal document analysis

MMDA - multimodal document analysis

AI2 75 Jan 04, 2023
Speech to text streamlit app

Speech to text Streamlit-app! 👄 This speech to text recognition is powered by t

Charly Wargnier 9 Jan 01, 2023
ProtFeat is protein feature extraction tool that utilizes POSSUM and iFeature.

Description: ProtFeat is designed to extract the protein features by employing POSSUM and iFeature python-based tools. ProtFeat includes a total of 39

GOKHAN OZSARI 5 Dec 16, 2022
Prompt tuning toolkit for GPT-2 and GPT-Neo

mkultra mkultra is a prompt tuning toolkit for GPT-2 and GPT-Neo. Prompt tuning injects a string of 20-100 special tokens into the context in order to

61 Jan 01, 2023
Transformer-based Text Auto-encoder (T-TA) using TensorFlow 2.

T-TA (Transformer-based Text Auto-encoder) This repository contains codes for Transformer-based Text Auto-encoder (T-TA, paper: Fast and Accurate Deep

Jeong Ukjae 13 Dec 13, 2022