utoken

utoken is a multilingual tokenizer that divides text into words, punctuation and special tokens such as numbers, URLs, XML tags, email addresses and hashtags. The tokenizer comes with a companion detokenizer. Initial public release of beta version 0.1.0 on Oct. 1, 2021.

Example

Input

Capt. O'Connor's car can't've cost $100,000.

Output

Capt. O'Connor 's car can n't 've cost $ 100,000 .

Optional annotation output

The output below is in the more human-friendly annotation format. The default format is the more computer-friendly JSON.

::line 1 ::s Capt. O'Connor's car can't've cost$100,000.
::span 0-5 ::type ABBREV ::sem-class military-rank ::surf Capt.
::span 6-14 ::type LEXICAL ::sem-class person-last-name ::surf O'Connor
::span 14-16 ::type DECONTRACTION ::surf 's
::span 17-20 ::type WORD-B ::surf car
::span 21-23 ::type DECONTRACTION ::surf can
::span 23-26 ::type DECONTRACTION ::surf n't
::span 26-29 ::type DECONTRACTION-R ::surf 've
::span 30-34 ::type WORD-B ::surf cost
::span 34-35 ::type PUNCT ::sem-class currency-unit ::surf $
::span 35-42 ::type NUMBER ::surf 100,000
::span 42-43 ::type PUNCT-E ::surf .
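
To produce annotation output like the above from the command line, the -a and --annotation_format options documented in the Usage section below can be combined as follows (the file names here are placeholders):

python -m utoken.utokenize --lc eng -i input.txt -o output.txt -a annotation.txt --annotation_format double-colon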

Usage   (click below for details)

utokenize (command line interface to tokenize a file)
python -m utoken.utokenize [-h] [-i INPUT-FILENAME] [-o OUTPUT-FILENAME] [-a ANNOTATION-FILENAME] 
                           [--annotation_format ANNOTATION_FORMAT] [-p PROFILE-FILENAME] 
                           [--profile_scope PROFILE_SCOPE] [-d DATA_DIRECTORY] [--lc LANGUAGE-CODE] 
                           [-f] [-v] [-c] [--simple] [--version]
  
optional arguments:
  -h, --help            show this help message and exit
  -i INPUT-FILENAME, --input INPUT-FILENAME
                        (default: STDIN)
  -o OUTPUT-FILENAME, --output OUTPUT-FILENAME
                        (default: STDOUT)
  -a ANNOTATION-FILENAME, --annotation_file ANNOTATION-FILENAME
                        (optional output)
  --annotation_format ANNOTATION_FORMAT
                        (default: 'json'; alternative: 'double-colon')
  -p PROFILE-FILENAME, --profile PROFILE-FILENAME
                        (optional output for performance analysis)
  --profile_scope PROFILE_SCOPE
                        (optional scope for performance analysis)
  -d DATA_DIRECTORY, --data_directory DATA_DIRECTORY
                        (default: standard data directory)
  --lc LANGUAGE-CODE    ISO 639-3, e.g. 'fas' for Persian
  -f, --first_token_is_line_id
                        First token is line ID (and will be exempt from any tokenization)
  -v, --verbose         write change log etc. to STDERR
  -c, --chart           build annotation chart, even without annotation output
  --simple              prevent MT-style output (e.g. @-@). Note: can degrade any detokenization
  --version             show program's version number and exit

Note: Please make sure that your $PYTHONPATH includes the directory in which this README file resides.
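
Since input defaults to STDIN and output to STDOUT, a quick test of the installation (using the sentence from the Python example further below) might look like this:

echo "Dont worry!" | python -m utoken.utokenize --lc eng
Do n't worry !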

detokenize (command line interface to detokenize a file)
python -m utoken.detokenize [-h] [-i INPUT-FILENAME] [-o OUTPUT-FILENAME] [-d DATA_DIRECTORY] 
                            [--lc LANGUAGE-CODE] [-f] [-v] [--version]
optional arguments:
  -h, --help            show this help message and exit
  -i INPUT-FILENAME, --input INPUT-FILENAME
                        (default: STDIN)
  -o OUTPUT-FILENAME, --output OUTPUT-FILENAME
                        (default: STDOUT)
  -d DATA_DIRECTORY, --data_directory DATA_DIRECTORY
                        (default: standard data directory)
  --lc LANGUAGE-CODE    ISO 639-3, e.g. 'fas' for Persian
  -f, --first_token_is_line_id
                        First token is line ID (and will be exempt from any tokenization)
  -v, --verbose         write change log etc. to STDERR
  --version             show program's version number and exit

Note: Please make sure that your $PYTHONPATH includes the directory in which this README file resides.
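
The detokenizer can be chained directly after the tokenizer; based on the Python examples further below, a command-line round trip might look like this:

echo "Dont worry!" | python -m utoken.utokenize --lc eng | python -m utoken.detokenize --lc eng
Don't worry!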

utokenize_string (Python function call to tokenize a string)
from utoken import utokenize
  
tok = utokenize.Tokenizer(lang_code='eng')  # Initialize tokenizer, load resources
print(tok.utokenize_string("Dont worry!"))
print(tok.utokenize_string("Sold,for $9,999.99 on ebay.com."))

Output:

Do n't worry !
Sold , for $ 9,999.99 on ebay.com .

Note: Please make sure that your $PYTHONPATH includes the directory in which this README file resides.

detokenize_string (Python function call to detokenize a string)
from utoken import detokenize

detok = detokenize.Detokenizer(lang_code='eng')  # Initialize detokenizer, load resources
print(detok.detokenize_string("Do n't worry !"))
print(detok.detokenize_string("Sold , for $ 9,999.99 on ebay.com ."))

Output:

Don't worry!
Sold, for $9,999.99 on ebay.com.

Note: Please make sure that your $PYTHONPATH includes the directory in which this README file resides.

Installation
pip install utoken

or

git clone https://github.com/uhermjakob/utoken.git

Design

  • A universal tokenizer/word segmenter, i.e. designed to work with a wide variety of scripts and languages.
  • Preserves special tokens such as URLs, XML tags, email addresses, hashtags, handles, filenames and more.
  • Modular, expandable architecture, with language-independent and language-specific rules and lists.
  • Written in Python, with both command line interface (to tokenize a file) and Python function call (to tokenize a string).
  • Maintains a chart data structure with detailed additional information that can also serve as a basis for further processing.
  • First public release on Oct. 1, 2021: beta version 0.1.0
  • Written by Ulf Hermjakob, USC Information Sciences Institute, 2021

Limitations

  • Currently excluded: no-space scripts like Chinese and Japanese
  • Large set of resource entries (data file) currently for English only; limited resource entries for 60+ other languages
  • Languages tested so far: Amharic, Arabic, Armenian, Assamese, Bengali, Bulgarian, Catalan, Czech, Dutch, English, Farsi, Finnish, French, Georgian, German, Greek (Ancient/Koine/Modern), Gujarati, Hebrew (Ancient/Modern), Hindi, Hungarian, Indonesian, Italian, Kannada, Kazakh, Korean, Lao, Lithuanian, Malayalam, Marathi, Norwegian, Odia, Pashto, Polish, Portuguese, Quechua, Romanian, Russian, Somali, Spanish, Swahili, Swedish, Tagalog, Tamil, Telugu, Turkish, Urdu, Uyghur, Vietnamese, Welsh, Xhosa, Yoruba, Zulu
    • For languages in bold: large-scale testing of thousands to hundreds of thousands of sentences per language.
    • For other modern languages: a few hundred sentences from 100 Wikipedia articles per language.
    • For Ancient Hebrew and Koine Greek: a few hundred verses each from the Bible's Old and New Testament respectively.
    • For Ancient Greek: a few hundred sentences from Homer's Odyssey and Plato's Republic.

Requirements

More topics   (click below for details)

What gets split and what not

What gets split

  • Contractions: John's → John 's; we've → we 've; can't → can n't; won't → will n't
  • Quantities into number and unit: 5,000km² → 5,000 km²
  • Ordinal numbers into number and ordinal particle: 350th → 350 th
  • Non-lexical hyphenated expressions: peace-loving → peace @-@ loving
  • Name initials: J.S.Bach → J. S. Bach

What stays together

Mark-up of certain punctuation (e.g. @-@) and option --simple

For many applications such as machine translation, tokenization is important, but it should be reversed when producing the final output. In some cases this is relatively straightforward: . and , typically attach to the word on the left, and ( attaches to the word on the right. In other cases it can be very hard to decide how to detokenize, so we add a special tag such as @ during tokenization in order to guide later detokenization. A @ on one or both sides of punctuation indicates that in the original text the punctuation and the neighboring word were written together. Put another way, the tokenizer upgrades the non-directional " to an open "@ or close @" delimiter.

Example: ("Hello,world!")   Tokenized: ( "@ Hello , world ! @" )   Detokenized: ("Hello, world!")

If later detokenization is not important and you want to suppress any markup with @, call utokenize with the option --simple

Example: ("Hello,world!")   Tokenized (simple): ( " Hello , world ! " )   Detokenized: (" Hello, world! ")
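
The same round trip can be reproduced with the Python interfaces shown in the Usage section above; this is a minimal sketch, and the exact spacing of the tokenized string may differ slightly from the illustration.

from utoken import detokenize, utokenize

tok = utokenize.Tokenizer(lang_code='eng')        # load tokenization resources
detok = detokenize.Detokenizer(lang_code='eng')   # load detokenization resources

original = '("Hello,world!")'
tokenized = tok.utokenize_string(original)        # expected: ( "@ Hello , world ! @" )
restored = detok.detokenize_string(tokenized)     # the @ markup guides reattachment

print(tokenized)
print(restored)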

Option --first_token_is_line_id

In some applications, the text to be tokenized is preceded by a sentence ID at the beginning of each line and tokenization should not be applied to those sentence IDs.
Option --first_token_is_line_id, or -f for short, suppresses tokenization of those sentence IDs.

  • Example input: GEN:1:1 In the beginning, God created the heavens and the earth.
  • utokenize tokenization: GEN @:@ 1 @:@ 1 In the beginning , God created the heavens and the earth .
  • utokenize -f tokenization: GEN:1:1 In the beginning , God created the heavens and the earth .
Why is tokenization hard?

Tokenization is more than just splitting a sentence along spaces, as a lot of punctuation such as commas and periods is attached to adjacent words. But we can't just blindly split off commas and periods, as this would break numbers such as 12,345.60, abbreviations such as Mr. or URLs such as www.usc.edu.
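
As a small illustration, the naive_split helper below (purely illustrative, not part of utoken) blindly splits off commas and periods and mangles exactly these cases, while the tokenizer is expected to keep them intact (apart from splitting off the sentence-final period):

import re

from utoken import utokenize

def naive_split(text):
    # blindly split off every comma and period
    return re.sub(r'([.,])', r' \1 ', text).split()

tok = utokenize.Tokenizer(lang_code='eng')
sentence = "Mr. Smith paid 12,345.60 dollars via www.usc.edu."
print(naive_split(sentence))            # 12,345.60, Mr. and www.usc.edu are torn apart
print(tok.utokenize_string(sentence))   # expected to keep them as single tokens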

Tokenization data files

utokenize includes a number of data files to support its operation:

  • tok-resource.txt includes language-independent tokenization resource entries, especially for punctuation, abbreviations (e.g. km²) and names (especially those with hyphens, spaces and other non-alpha characters)
  • tok-resource-eng-global.txt contains tokenization resource entries for English that are also loaded for other languages. This is helpful as foreign texts often code-switch to English.
  • tok-resource-eng.txt contains tokenization resource entries for English that are not shared, including those that would not work in other languages. For example, in English, dont is a non-standard spelling of don't and is tokenized into do n't, but in French, dont (of which) is a regular word that should be left alone.
  • detok-resource.txt includes resources for detokenization. The file is also used by the tokenizer to mark up certain punctuation with attachment tags such as @-@.
  • There are numerous other tok-resource-xxx.txt files for other languages, some larger than others. Some languages such as Farsi just don't use contractions and abbreviations with periods that much, so there are few entries. Other files might benefit from additional contributions.
  • top-level-domain-codes.txt contains a list of suffixes such as .com, .org, .uk, .tv to support the tokenization of URLs and email addresses.

Examples of resource entries:

::punct-split ! ::side end ::group True ::comment multiple !!! remain grouped as a single token
::contraction can't ::target can n't ::lcode eng
::repair wo n't ::target will n't ::lcode eng ::problem previous tokenizer
::abbrev No. ::exp number ::lcode eng ::sem-class corpus-component ::case-sensitive True ::right-context \s*\d
::lexical T-shirt ::lcode eng ::plural +s
::misspelling accomodate ::target accommodate ::lcode eng ::suffix-variations e/ed;es;ing;ion;ions

::markup-attach - ::group True ::comment hyphen-minus ::example the hyphen in _peace-loving_ will be marked up as ```@-@```
::auto-attach th ::side left ::left-context \d ::lcode eng ::example 20th
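
The ::attribute value syntax of these entries is regular enough to sketch a simplified reader for it. The function below is illustrative only and is not the parser utoken itself uses.

import re

def parse_resource_entry(line):
    """Parse a double-colon entry such as '::abbrev No. ::exp number ::lcode eng' into a dict (simplified sketch)."""
    entry = {}
    for m in re.finditer(r'::(\S+)\s+((?:(?!::).)*)', line):
        entry[m.group(1)] = m.group(2).strip()
    return entry

print(parse_resource_entry('::abbrev No. ::exp number ::lcode eng ::sem-class corpus-component'))
# {'abbrev': 'No.', 'exp': 'number', 'lcode': 'eng', 'sem-class': 'corpus-component'}
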
Speed

210,000 characters per second (real time) on a 39k-sentence English AMR corpus on a 2021 MacBook Pro using a single CPU. Parallelization is trivial, as sentences are tokenized independently of each other.
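
Because sentences are tokenized independently of each other, line-level parallelization only needs one Tokenizer instance per worker process. A minimal sketch using the standard multiprocessing module (not part of utoken) could look like this:

import multiprocessing as mp

from utoken import utokenize

_tok = None

def _init_worker():
    global _tok
    _tok = utokenize.Tokenizer(lang_code='eng')   # one Tokenizer per worker process

def _tokenize_line(line):
    return _tok.utokenize_string(line)

if __name__ == '__main__':
    lines = ["Dont worry!", "Sold,for $9,999.99 on ebay.com."]
    with mp.Pool(processes=2, initializer=_init_worker) as pool:
        for tokenized_line in pool.map(_tokenize_line, lines):
            print(tokenized_line)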

Testing

utoken has been tested on 66 corpora in 55 languages and 18 scripts (as of Oct. 16, 2021). Tests include

  • Manual review of large amounts of tokenized output
  • Comparison to other tokenizers: Sacremoses and ulf-tokenizer
  • Tokenization analysis scripts:
    • wildebeest (text normalization and cleaning; analysis of types of characters used, encoding issues)
    • aux/tok-analysis.py (looks for a number of potential problems such as tokens with mixed letters/digits, mixed letters/punctuation, or potential abbreviations separated from their period); a much simplified check of this kind is sketched after this list
  • Comparisons to previous versions of all test corpora before release.
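
A much simplified version of such a mixed-character check (not the project's aux/tok-analysis.py) could look like this:

def flag_suspicious_tokens(tokenized_line):
    """Flag tokens that mix letters with digits or punctuation (simplified sketch)."""
    suspicious = []
    for token in tokenized_line.split():
        has_alpha = any(c.isalpha() for c in token)
        has_other = any(c.isdigit() or not c.isalnum() for c in token)
        if has_alpha and has_other:
            suspicious.append(token)
    return suspicious

print(flag_suspicious_tokens("Sold , for $ 9,999.99 on ebay.com ."))
# ['ebay.com'] -- URLs are legitimately mixed; a real check would whitelist known token types
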
Related software

Future work — Feedback and contributions welcome

Plans include

  • Building resources, testing and fine-tuning of additional languages such as Hausa and Serbian.
  • Adding new special entity types such as IPA pronunciations, geographic coordinates, complex IDs such as 403(k).
  • Semi-supervised learning of lexical and abbreviation resources from large corpora.
Comments
  • RecursionError: maximum recursion depth exceeded while calling a Python object

    ....[truncated].....
     File "/python3.9/site-packages/utoken/utokenize.py", line 775, in next_tok
        s = next_tokenization_function(s, chart, ht, lang_code, line_id, offset)
      File "/python3.9/site-packages/utoken/utokenize.py", line 1531, in tokenize_lexical_according_to_resource_entries
        return self.next_tok(this_function, s, chart, ht, lang_code, line_id, offset)
      File "/python3.9/site-packages/utoken/utokenize.py", line 775, in next_tok
        s = next_tokenization_function(s, chart, ht, lang_code, line_id, offset)
      File "/python3.9/site-packages/utoken/utokenize.py", line 1038, in tokenize_complex_names
        return self.next_tok(this_function, s, chart, ht, lang_code, line_id, offset)
      File "/python3.9/site-packages/utoken/utokenize.py", line 775, in next_tok
        s = next_tokenization_function(s, chart, ht, lang_code, line_id, offset)
      File "/python3.9/site-packages/utoken/utokenize.py", line 1647, in tokenize_mt_punctuation
        return self.next_tok(this_function, s, chart, ht, lang_code, line_id, offset)
      File "/python3.9/site-packages/utoken/utokenize.py", line 775, in next_tok
        s = next_tokenization_function(s, chart, ht, lang_code, line_id, offset)
      File "/python3.9/site-packages/utoken/utokenize.py", line 1580, in tokenize_punctuation_according_to_resource_entries
        return self.next_tok(this_function, s, chart, ht, lang_code, line_id, offset)
      File "/python3.9/site-packages/utoken/utokenize.py", line 775, in next_tok
        s = next_tokenization_function(s, chart, ht, lang_code, line_id, offset)
      File "/python3.9/site-packages/utoken/utokenize.py", line 1657, in tokenize_post_punct
        return self.next_tok(this_function, s, chart, ht, lang_code, line_id, offset)
      File "/python3.9/site-packages/utoken/utokenize.py", line 775, in next_tok
        s = next_tokenization_function(s, chart, ht, lang_code, line_id, offset)
      File "/python3.9/site-packages/utoken/utokenize.py", line 1707, in tokenize_main
    RecursionError: maximum recursion depth exceeded while calling a Python object
    

    I ran utoken on a bunch of files, and some of them are very large, so I don't know for sure which exact data/lines made your code go into infinite recursion.

    @uhermjakob is it possible to debug from the stack trace without having the sample data? I will try to locate the lines which caused this problem (it's a needle in a haystack!)

    opened by thammegowda 4
  • Preserving char-offsets

    Hi @uhermjakob , thanks a lot for making the tokenizer public.

    We are using utoken in one of our projects where we have the requirement that each token is associated with the offset in the original text. Currently, we have it working in the following manner:

    from utoken import utokenize
    
    text = 'Hello world!' 
    
    tokenizer = utokenize.Tokenizer()
    chart = utokenize.Chart(s=text, snt_id='id-0')
    tokenizer.next_tok(None, text, chart, {}, 'eng', None)
    tokens, offsets = [], []
    for tok in chart.tokens:
        s, e = tok.span.spans[0].hard_from, tok.span.spans[0].hard_to
        tokens.append(text[s:e])
        offsets.append((s, e))
    
    print(tokens, offsets)
    

    This works fine and we get the correct output:

    ['Hello', 'world', '!'] [(0, 5), (6, 11), (11, 12)]
    

    However, when we change the text to include repeated punctuation, we run into an error. To reproduce, I am just changing the text from Hello world! to ; repeated 200 times:

    from utoken import utokenize
    
    text = ';' * 200  # this text causes the error. 
    
    tokenizer = utokenize.Tokenizer()
    chart = utokenize.Chart(s=text, snt_id='id-0')
    tokenizer.next_tok(None, text, chart, {}, 'eng', None)
    tokens, offsets = [], []
    for tok in chart.tokens:
        s, e = tok.span.spans[0].hard_from, tok.span.spans[0].hard_to
        tokens.append(text[s:e])
        offsets.append((s, e))
    
    print(tokens, offsets)
    

    The first and last few lines of the call stack are:

    Traceback (most recent call last):
      File "/Users/shantanu/PycharmProjects/isi-better/t-phrase/tests/test_ulf_token.py", line 7, in <module>
        tokenizer.next_tok(None, text, chart, {}, 'eng', None)
      File "/Users/shantanu/anaconda3/envs/auto_tt/lib/python3.8/site-packages/utoken/utokenize.py", line 820, in next_tok
        s = next_tokenization_function(s, chart, ht, lang_code, line_id, offset)
      File "/Users/shantanu/anaconda3/envs/auto_tt/lib/python3.8/site-packages/utoken/utokenize.py", line 962, in normalize_characters
        return self.next_tok(this_function, s, chart, ht, lang_code, line_id, offset)
    ...
    ...
    File "/Users/shantanu/anaconda3/envs/auto_tt/lib/python3.8/site-packages/utoken/utokenize.py", line 734, in rec_tok
        tokenizations.append(calling_function(pre, chart, ht, lang_code, line_id, offset1))
      File "/Users/shantanu/anaconda3/envs/auto_tt/lib/python3.8/site-packages/utoken/utokenize.py", line 1652, in tokenize_punctuation_according_to_resource_entries
        return self.rec_tok([token], [start_position], s, offset, 'PUNCT-E',
      File "/Users/shantanu/anaconda3/envs/auto_tt/lib/python3.8/site-packages/utoken/utokenize.py", line 734, in rec_tok
        tokenizations.append(calling_function(pre, chart, ht, lang_code, line_id, offset1))
      File "/Users/shantanu/anaconda3/envs/auto_tt/lib/python3.8/site-packages/utoken/utokenize.py", line 1652, in tokenize_punctuation_according_to_resource_entries
        return self.rec_tok([token], [start_position], s, offset, 'PUNCT-E',
      File "/Users/shantanu/anaconda3/envs/auto_tt/lib/python3.8/site-packages/utoken/utokenize.py", line 714, in rec_tok
        n_chars = len(self.current_orig_s)
    TypeError: object of type 'NoneType' has no len()
    

    Is our current approach to keeping track of char-offsets incorrect, and is that why we are running into this issue? Is there a different way to tokenize and keep track of char-offsets within utoken?

    Thanks.

    opened by spookyQubit 2
  • Freezing on `@@`

    utoken freezes on @@ in input

    echo "hello there @-@  ?" | utokenize
    hello there @-@ ?
    
    echo "hello there @@  ?" | utokenize
    

    --[FROZEN]--

    bug 
    opened by thammegowda 1
  • Add progress bar

    The progress bar is disabled by default, but it can be enabled with -pb or --progress-bar.

    If the input is STDIN, it doesn't know the total number of bytes, but if the input is a file, then it correctly gets the total.

    opened by thammegowda 0
Releases(v0.1.8)
  • v0.1.8(Oct 20, 2021)

    Added protection against monster sentences that might exceed Python's recursion depth limit (even without an actual infinite loop). An example monster sentence had 11,051 characters, 3,368 words and 511 symbols. The new "circuit breaker" solution monitors recursion depth, with utoken stopping full tokenization of a sentence once a limit is reached. This affects only one sentence at a time; the sentence is output in full, but might be tokenized only partially. There are sub-circuit breakers for (1) symbol and (2) number tokenization. An alert (to STDERR) is a preliminary warning (without action); a warning (to STDERR) indicates actual cessation of full tokenization. The alerts serve to collect monster sentences that can later be used to develop sentence chunking that avoids tokenization recursion problems in the first place.

  • v0.1.7(Oct 19, 2021)

    • Fixed regex bug by adding missing re.escape for cases such as ಡಾ|| (pre-name-title with vertical bars instead of regular double danda).
    • Added four more languages: Estonian, Latvian, Slovenian, Danish
  • v0.1.6(Oct 18, 2021)

    • Tested and improved for 10 more languages: Marathi, Polish, Urdu, Romanian, Hungarian, Finnish, Catalan, Gujarati, Armenian, Slovak
    • Updated Detokenizer.__init__ arg lang_code_s to lang_code
    • Incremental improvements

  • v0.1.3(Oct 13, 2021)

    • Tested and improved for 10 more languages: Czech, Italian, Quechua, Telugu, Vietnamese, Welsh, Xhosa, Yoruba, Odia, Norwegian
    • Several incremental improvements.
  • v0.1.2(Oct 5, 2021)

  • v0.1.1(Oct 4, 2021)

    • Now available: pip install utoken
    • Added: Indonesian, Tamil
    • Added ::lcode-not functionality to tok-resource files (to exclude one or more languages from a specific rule)
  • v0.1.0(Oct 1, 2021)
