Get list of common stop words in various languages in Python

Overview

Python Stop Words

Available languages

  • Arabic
  • Bulgarian
  • Catalan
  • Czech
  • Danish
  • Dutch
  • English
  • Finnish
  • French
  • German
  • Hungarian
  • Indonesian
  • Italian
  • Norwegian
  • Polish
  • Portuguese
  • Romanian
  • Russian
  • Spanish
  • Swedish
  • Turkish
  • Ukrainian

Installation

stop-words is available on PyPI:

http://pypi.python.org/pypi/stop-words

Install it with pip:

$ pip install stop-words

Alternatively, clone the stop-words git repository (--recursive also fetches the stop word lists, which live in a git submodule):

$ git clone --recursive git://github.com/Alir3z4/python-stop-words.git

Then install it by running:

$ python setup.py install

Basic usage

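from stop_words import get_stop_words

# A language can be given by its two-letter code or by its English name.
stop_words = get_stop_words('en')
stop_words = get_stop_words('english')

from stop_words import safe_get_stop_words

# safe_get_stop_words returns an empty list for an unsupported language
# instead of raising an error.
stop_words = safe_get_stop_words('unsupported language')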
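
As a rough sketch of how a list returned by get_stop_words might be used to filter a text: the helper remove_stop_words and the small sample_stop_words list are hypothetical names made up for this illustration and are not part of the package, and the regex tokenization is only one simple assumption about how to split words.

```python
import re


def remove_stop_words(text, stop_words):
    """Return the tokens of `text` that are not stop words (case-insensitive)."""
    stop_set = {word.lower() for word in stop_words}
    # Treat runs of word characters and apostrophes as tokens.
    tokens = re.findall(r"[\w']+", text)
    return [token for token in tokens if token.lower() not in stop_set]


# Hypothetical stand-in so the sketch runs on its own; in real use this
# would be the list returned by get_stop_words('en').
sample_stop_words = ['the', 'of', 'is', 'in', 'it', 'a', 'one']

text = ("The University of Waterloo Stratford Campus is located "
        "in Stratford Ontario Canada.")
filtered = remove_stop_words(text, sample_stop_words)
print(' '.join(filtered))
```

Converting the list to a set keeps each lookup cheap, and lowercasing both sides means "The" is filtered as well as "the".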

Python compatibility

Python Stop Words is compatible with:

  • Python 2.7
  • Python 3.4
  • Python 3.5
  • Python 3.6
  • Python 3.7
Comments
  • Enforces packaging of eggs into folders.

    We had an error in our CI pipeline where a package build would fail since the .egg of stop-words is downloaded as a zip.

    This leads to the following error where the initializer tries to open a directory when it is actually a zip archive.

    Not a directory: '/opt/project/.eggs/stop_words-2015.2.23.1-py3.6.egg/stop_words/stop-words/languages.json'

    opened by hfjn 10
  • add indonesian stop word list

    Add stop word list for indonesian language, added mapping to JSON file. Source: https://www.illc.uva.nl/Research/Publications/Reports/MoL-2003-02.text.pdf

    opened by frankdevans 4
  • can you handle a text?

    Hello, there is no description of how to use this on a text. For example, I have this text:

    The University of Waterloo Stratford Campus is located in Stratford Ontario Canada. It is one of the three satellite campuses of the University of Waterloo a member of the U15 Group of Canadian Research Universities. Established in June 2009 the University of Waterloo Stratford Campus is part of the Faculty of Arts at the University of Waterloo.

    How can I use python-stop-words to filter out the stop words and get a text without them?

    thank you very much!!

    question 
    opened by PapaMadeleine2022 2
  • Python 3 support

    List of improvements:

    • Tests
    • Python 3 support
    • Dev installation via zc.buildout
    • Continuous integration via Travis

    Can you make a new release once the branch is merged?

    Regards

    enhancement 
    opened by Fantomas42 2
  • languages.json is missing, if you don't git clone with `--recursive`

    languages.json is still missing, if you don't clone with --recursive

    $ git clone git://github.com/Alir3z4/python-stop-words.git
    $ cd python-stop-words
    $ python3 setup.py install
    Traceback (most recent call last):
      File "setup.py", line 5, in <module>
        version=__import__("stop_words").get_version(),
      File "./stop_words/__init__.py", line 9, in <module>
        with open(os.path.join(STOP_WORDS_DIR, 'languages.json'), 'rb') as map_file:
    FileNotFoundError: [Errno 2] No such file or directory: './stop_words/stop-words/languages.json'

    opened by marcindulak 1
  • Update submodule to the latest

    Include the stops for newly added languages

    https://github.com/Alir3z4/stop-words/pull/4 https://github.com/Alir3z4/stop-words/pull/5 https://github.com/Alir3z4/stop-words/pull/6 https://github.com/Alir3z4/stop-words/pull/7

    enhancement 
    opened by norkans7 1
  • Decode error AND Add catalan language to LANGUAGE_MAPPING

    1. Add Catalan to LANGUAGE_MAPPING. I previously added the file with the stop words to the "stop-words" project.

    2. Decode error

    stop_words = [line.strip().decode('utf-8')
                 for line in language_file.readlines()]
    

    strip() returns a copy of the string with leading and trailing whitespace characters removed. But if the string contains non-ASCII characters, calling strip() before decoding causes a UnicodeDecodeError (e.g. UnicodeDecodeError: 'utf8' codec can not decode byte 0xc3 in position 34: unexpected end of data).

    The workaround is to reorder the call:

    stop_words = [line.decode('utf-8').strip()
                 for line in language_file.readlines()]
    
    opened by dmiro 1
  • Defining custom stop words in NLTK

    Hi, I want to know how to define our own custom stop words. I'm currently developing a sentiment analysis project in my local language, using a Naive Bayes classifier to classify the text. I'm quite new to this type of NLP project, so sorry if there's a method that I missed.

    Hope you can help me, thanks.

    opened by AllikDaniel 0
  • Example not work on python 3.7.0

    It returns an empty list: []

    from stop_words import get_stop_words
    
    stop_words = get_stop_words('en')
    stop_words = get_stop_words('english')
    
    from stop_words import safe_get_stop_words
    
    stop_words = safe_get_stop_words('unsupported language')
    print(stop_words)
    
    opened by nadavvin 2
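
The decode-order fix reported in the comments above can be illustrated with a small sketch. It assumes, as one plausible cause that the report itself does not confirm, a byte-level strip that treats 0xA0 as whitespace; the names raw_line, broken, and fixed are made up for this illustration.

```python
# One UTF-8 encoded line, as it would be read from a file opened in binary mode.
raw_line = 'à\n'.encode('utf-8')  # b'\xc3\xa0\n'

# Assumption: emulate a byte-level strip that counts 0xA0 as whitespace.
# It removes the second byte of the two-byte sequence for 'à'.
broken = raw_line.strip(b' \t\r\n\xa0')

try:
    broken.decode('utf-8')
    decode_failed = False
except UnicodeDecodeError:
    # The lone lead byte 0xC3 is not valid UTF-8 on its own.
    decode_failed = True

# Decoding first and stripping afterwards operates on whole characters.
fixed = raw_line.decode('utf-8').strip()

print(decode_failed, repr(fixed))
```

Stripping after decoding works on characters rather than bytes, so a multi-byte sequence cannot be split.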
Releases(2018.7.23)
  • 2018.7.23(Jul 23, 2018)

    • Fixed #14: languages.json is missing, if you don't git clone with --recursive.
    • Feature: Support latest version of Python (3.7+).
    • Feature #22: Enforces packaging of eggs into folders.
    • Update the stop-words repository to get the latest languages.
    • Fixed Travis and test failures caused by bootstrap.

    PyPI: https://pypi.org/project/stop-words/2018.7.23/

    To install:

    $ pip install stop-words==2018.7.23
    
  • 2015.2.23.1(Feb 23, 2015)

  • 2015.2.23(Feb 23, 2015)

    • Feature: Using the cache is optional
    • Feature: Filtering stopwords

    Special thanks to Taras Labiak @kissarat

    PyPi: https://pypi.python.org/pypi/stop-words/2015.2.21

  • 2015.2.21(Feb 21, 2015)

    • Feature: LANGUAGE_MAPPING is loaded from stop-words/languages.json
    • Fix: Made paths OS-independent

    PyPi: https://pypi.python.org/pypi/stop-words/2015.2.21

    Special thanks to Taras Labiak @kissarat

  • 2015.1.31(Feb 1, 2015)

  • 2015.1.22(Jan 22, 2015)

    • Feature: Tests
    • Feature: Python 3 support
    • Feature: Dev installation via zc.buildout
    • Feature: Continuous integration via Travis

    pypi: https://pypi.python.org/pypi/stop-words/2015.1.22

  • 2015.1.19(Jan 19, 2015)

Owner
Alireza Savand
I am Alireza Savand, a Software Architect.