NLP - monocleaner
2022-06-27 13:54:00 【Yizhi code】
About monocleaner
monocleaner is a tool for scoring the fluency of monolingual sentences.
It is best used on Linux: FastSpell, one of monocleaner's dependencies, failed to install on my Mac (if you manage to install it, please tell me how), so I do not recommend using it on macOS.
- A training tool, monocleaner-train, is available; you can also use the prebuilt language packs directly.
- You can use the monocleaner-download tool to download the latest data, or download it from https://github.com/bitextor/monocleaner-data/releases/latest.
Installation
python3.7 -m pip install monocleaner
Dependencies
- Most dependencies are downloaded automatically when monocleaner is installed.
- KenLM must be installed beforehand. See: https://blog.csdn.net/lovechris00/article/details/125424808
- monocleaner also depends on FastSpell. I could not install this library on macOS, so in practice monocleaner can only be used on Linux.
FastSpell: https://github.com/mbanon/fastspell
- FastSpell depends on python-dev and libhunspell-dev (install: sudo apt install python-dev libhunspell-dev).
- If you need to support languages that FastSpell lists as similar to one another, install the corresponding Hunspell dictionary, e.g. hunspell-es (sudo apt-get install hunspell-es), or download dictionaries from an external source such as https://github.com/wooorm/dictionaries/tree/main/dictionaries
You can also configure the path to the Hunspell dictionary folder:
- If you installed with pip, set it in venv/lib/python3.7/site-packages/fastspell/config/hunspell.yaml
- If you installed with setup.py, configure it in /config/hunspell.yaml
- If you run the code directly, the default path is /usr/share/hunspell.
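To check which Hunspell dictionaries are present before running FastSpell, something like the following sketch may help. It assumes the default folder /usr/share/hunspell mentioned above and relies on the fact that a Hunspell dictionary consists of a .dic file plus a matching .aff file; the function name is made up for illustration.

```python
from pathlib import Path


def installed_hunspell_dictionaries(dict_dir="/usr/share/hunspell"):
    """Return dictionary names that have both a .dic and an .aff file.

    dict_dir defaults to the path the article gives as the default;
    pass another folder if you configured a different one.
    """
    root = Path(dict_dir)
    if not root.is_dir():
        return []
    dic_names = {p.stem for p in root.glob("*.dic")}
    aff_names = {p.stem for p in root.glob("*.aff")}
    # Only names with both files are usable dictionaries.
    return sorted(dic_names & aff_names)


print(installed_hunspell_dictionaries())
```

If the list is empty or the language you need is missing, install the dictionary package (for example hunspell-es) or point the configuration at another folder.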
After a successful installation, three executables are generated: monocleaner, monocleaner-train, and monocleaner-download. They are placed under python/installation/prefix/bin.
For example:
On Mac, my files are in /Library/Frameworks/Python.framework/Versions/3.7/bin/
On Linux, I use the Python that ships with Anaconda, so the executables are in /home/newtranx/anaconda3/bin
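If you are not sure where the executables ended up, a short check like this one can find them. It is only a sketch: it looks the three tool names up on the current PATH with the standard library, so a tool installed in a bin folder that is not on PATH will be reported as missing.

```python
import shutil

# The three executables the article says are created by the installation.
TOOLS = ("monocleaner", "monocleaner-train", "monocleaner-download")


def locate_tools(tools=TOOLS):
    """Map each tool name to its absolute path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in tools}


for name, path in locate_tools().items():
    print(f"{name}: {path or 'not found on PATH'}")
```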
View version information and help
$ monocleaner -v
monocleaner Version 1.1.0 # 2021-03-07 # Add lang ident column # Jaume Zaragoza
$ monocleaner -h
usage: monocleaner [-h] [--scol SCOL] [--disable_lang_ident] [--disable_hardrules] [--disable_minimal_length] [--score_only]
[--add_lang_ident] [--annotated_output] [--debug] [-q] [-v]
model_dir [input] [output]
positional arguments:
model_dir Model directory to store LM file and metadata.
input Input file. If omitted, read from 'stdin'.
output Output tab-separated text file adding monocleaner score. When omitted output will be written to stdout.
optional arguments:
-h, --help show this help message and exit
--scol SCOL Sentence column (starting in 1)
--disable_lang_ident Disables language identification in hardrules
--disable_hardrules Disables the hardrules filtering (only monocleaner fluency scoring is applied)
--disable_minimal_length Don't apply minimal length (3 words) rule
--score_only Only print the score for each sentence, omit all fields
--add_lang_ident Add another column with the identified language if it's not disabled.
--annotated_output Add hardrules annotation for each sentence
--debug
-q, --quiet
-v, --version show version of this script and exit
Scoring
- monocleaner is mainly used to score the fluency of monolingual sentences.
- Each sentence receives a fluency score in the interval 0–1. The higher the score, the more fluent the sentence.
- In addition to the continuous score, a set of hard-coded rules (hardrules) assigns a score of 0 to sentences that are obviously problematic.
- The input file must contain one sentence per line.
- The output file has the same number of lines as the input file, with one extra column holding the score.
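Because the output is the input plus one score column, filtering by fluency is straightforward. The sketch below assumes the score is the last tab-separated field on each line (which matches the sample output in this article) and uses a threshold of 0.5 that is purely illustrative, not a value recommended by monocleaner.

```python
def parse_scored_line(line):
    """Split one output line into (sentence_fields, score).

    Assumes the score is the last tab-separated column.
    """
    *fields, score = line.rstrip("\n").split("\t")
    return fields, float(score)


def keep_fluent(lines, threshold=0.5):
    """Yield (text, score) for lines whose score is at least threshold.

    threshold is a hypothetical cut-off chosen for this example.
    """
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        fields, score = parse_scored_line(line)
        if score >= threshold:
            yield "\t".join(fields), score


sample = [
    "hello, this my name is\t0.676\n",
    "hello, this is my name\t0.706\n",
]
for text, score in keep_fluent(sample, threshold=0.7):
    print(f"{score:.3f}\t{text}")
```

In practice you would pass an open file object (the output file written by monocleaner) instead of the sample list.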
The command syntax is as follows:
monocleaner [-h]
[--disable_minimal_length]
[--disable_hardrules]
[--score_only]
[--annotated_output]
[--add_lang_ident]
[--debug]
[-q]
model_dir [input] [output]
Parameter description
- Positional arguments:
  - model_dir: Directory where the model is stored.
  - input: Path of the input file. If omitted, input is read from stdin (interactively from the terminal).
  - output: The output file, tab-separated. If omitted, output is written to stdout.
- Optional parameters:
  - --score_only: Output only the score. (Default: False)
  - --add_lang_ident: Add another column with the identified language, unless language identification is disabled.
  - --disable_hardrules: Disable the hardrules filtering, so that only the fluency score is applied. (Default: False)
  - --disable_minimal_length: Do not apply the minimal length rule. (Default: False)
- Logging:
  - -q, --quiet: Quiet logging mode. (Default: False)
  - --debug: Debug logging mode. (Default: False)
  - -v, --version: Show version information.
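When calling monocleaner from a script, it can be convenient to assemble the argument list programmatically. The helper below is a hypothetical wrapper, not part of monocleaner; it only uses flags that appear in the help output shown earlier and keeps the positional order model_dir, input, output.

```python
def build_command(model_dir, input_path=None, output_path=None, *,
                  score_only=False, add_lang_ident=False,
                  disable_hardrules=False, disable_minimal_length=False,
                  quiet=False):
    """Build the monocleaner argument list (hypothetical helper)."""
    if output_path is not None and input_path is None:
        # output is the third positional argument, so input must come first
        raise ValueError("output_path requires input_path")
    cmd = ["monocleaner"]
    if score_only:
        cmd.append("--score_only")
    if add_lang_ident:
        cmd.append("--add_lang_ident")
    if disable_hardrules:
        cmd.append("--disable_hardrules")
    if disable_minimal_length:
        cmd.append("--disable_minimal_length")
    if quiet:
        cmd.append("--quiet")
    cmd.append(model_dir)
    if input_path is not None:
        cmd.append(input_path)
    if output_path is not None:
        cmd.append(output_path)
    return cmd


print(build_command("xx/monocleaner/models/en", score_only=True))
```

The resulting list can be passed to subprocess.run(cmd, check=True) once monocleaner is installed.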
Usage examples:
Run the command, then type sentences in the terminal:
$ monocleaner xx/monocleaner/models/en
2022-06-25 13:17:35,372 - WARNING - Downloading FastText model...
Warning : `load_model` does not return WordVectorModel or SupervisedModel any more, but a `FastText` object which is very similar.
2022-06-25 13:18:01,280 - INFO - Start scoring text
hello, this my name is
hello, this my name is 0.676
hello, this is my name
hello, this is my name 0.706
Show only the scores
$ monocleaner --score_only xx/monocleaner/models/en
Warning : `load_model` does not return WordVectorModel or SupervisedModel any more, but a `FastText` object which is very similar.
2022-06-25 13:23:13,298 - INFO - Start scoring text
hi, I wanna fly to the sky!
0.603
you're beautiful in white
0.800
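With --score_only the sentences are no longer in the output, so you have to match scores back to the input yourself. Since the output has as many lines as the input, pairing by position works; the following is a minimal sketch with a made-up function name.

```python
def pair_scores(sentences, score_lines):
    """Pair each input sentence with the score on the same line number."""
    sentences = [s.rstrip("\n") for s in sentences]
    scores = [float(s) for s in score_lines if s.strip()]
    if len(sentences) != len(scores):
        raise ValueError(
            f"{len(sentences)} sentences but {len(scores)} scores"
        )
    return list(zip(sentences, scores))


pairs = pair_scores(
    ["hi, I wanna fly to the sky!\n", "you're beautiful in white\n"],
    ["0.603\n", "0.800\n"],
)
for sentence, score in pairs:
    print(f"{score:.3f}\t{sentence}")
```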
Downloading data with monocleaner-download
monocleaner-download does not appear to support a version check; passing --version just prints the usage instructions:
$ monocleaner-download --version
Wrong number of arguments: --version
Script to download Bicleaner language packs.
Usage: monocleaner-download <lang> <download_path>
<lang> Language code.
<download_path> Path where downloaded language pack should be placed.
With that, we can download the data for whichever language we need:
$ monocleaner-download es xx/monocleaner/models/
PS: I don't currently see any zh data. You can use monocleaner-train to train a model yourself.
You can also download the data from https://github.com/bitextor/monocleaner-data/releases/latest, or check there which languages are currently supported.
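To avoid downloading a language pack twice, you can check the models folder first. This sketch assumes that a pack for language <lang> ends up in a subfolder <download_path>/<lang>; that layout is inferred from the commands in this article (download into xx/monocleaner/models/, then score with xx/monocleaner/models/en) and is not confirmed by the tool's documentation. Both function names are hypothetical.

```python
from pathlib import Path


def download_command(lang, download_path):
    """Argument list following: monocleaner-download <lang> <download_path>."""
    return ["monocleaner-download", lang, str(download_path)]


def missing_languages(langs, download_path):
    """Return the languages with no subfolder under download_path.

    Assumes (not verified) that each pack lives in <download_path>/<lang>.
    """
    root = Path(download_path)
    return [lang for lang in langs if not (root / lang).is_dir()]


for lang in missing_languages(["en", "es"], "xx/monocleaner/models/"):
    print(" ".join(download_command(lang, "xx/monocleaner/models/")))
```

Each printed command can be run in a shell, or the list can be passed to subprocess.run.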
Training a model with monocleaner-train
$ monocleaner-train -h
usage: monocleaner-train [-h] -l LANGUAGE [--dev_size DEV_SIZE]
[--lm_type {PLACEHOLDER,CHARACTER}]
[--tokenizer_command TOKENIZER_COMMAND] [--debug]
[-q]
train model_dir
- positional arguments:
  - train: Training dataset file; monolingual data, one sentence per line.
  - model_dir: Model directory to store the LM file and metadata.
- optional arguments:
  - -h, --help: show this help message and exit
  - -l LANGUAGE, --language LANGUAGE: Language code of the model.
  - --dev_size DEV_SIZE: Number of sentences used to estimate mean and stddev perplexity on noisy and clean text. Extracted from the training corpus.
  - --lm_type {PLACEHOLDER,CHARACTER}
  - --tokenizer_command TOKENIZER_COMMAND: Tokenizer command to replace Moses tokenizer when using PLACEHOLDER LMType.
  - --debug
  - -q, --quiet
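Since the training file must hold one sentence per line, it is worth normalising the corpus before training. The sketch below is my own illustration, not something I have run against monocleaner-train: it collapses internal whitespace and line breaks, drops empty entries, and builds the command line from the usage shown above. The function names are hypothetical.

```python
from pathlib import Path


def write_training_file(sentences, path):
    """Write one sentence per line; return the number of lines written."""
    count = 0
    with Path(path).open("w", encoding="utf-8") as out:
        for sentence in sentences:
            # Collapse newlines and repeated spaces so each sentence is one line.
            cleaned = " ".join(sentence.split())
            if cleaned:
                out.write(cleaned + "\n")
                count += 1
    return count


def train_command(lang, train_path, model_dir):
    """Argument list following: monocleaner-train -l LANGUAGE train model_dir."""
    return ["monocleaner-train", "-l", lang, str(train_path), str(model_dir)]


print(" ".join(train_command("zh", "train.zh.txt", "xx/monocleaner/models/zh")))
```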
I have not done any training myself, so I won't cover training results or the problems you may run into here. I'll add that when I get the chance.
Yizhi, 2022-06-25 (Saturday)