NLP - monocleaner
2022-06-27 13:43:00 【伊织code】
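About monocleaner
monocleaner is a tool for measuring the fluency of monolingual sentences.
It is best used on Linux: FastSpell, one of monocleaner's dependencies, failed to install on my Mac (if you manage to install it, please let me know how), so I do not recommend using it on a Mac.
- A training tool, monocleaner-train, is provided; you can also use the ready-made language packs directly.
- You can download the latest data with the monocleaner-download tool, or get it from https://github.com/bitextor/monocleaner-data/releases/latest.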
Installation
python3.7 -m pip install monocleaner
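Dependencies
- Most dependencies are downloaded automatically when monocleaner is installed.
- KenLM must be installed beforehand. See: https://blog.csdn.net/lovechris00/article/details/125424808
- monocleaner also depends on FastSpell. I could not get this library to install on macOS, so in practice monocleaner can only be used on Linux.
FastSpell: https://github.com/mbanon/fastspell
- FastSpell depends on python-dev and libhunspell-dev (install: sudo apt install python-dev libhunspell-dev).
- If you need support for similar languages (those listed under similar in FastSpell), install the matching Hunspell dictionaries, for example hunspell-es (sudo apt-get install hunspell-es), or download external resources such as https://github.com/wooorm/dictionaries/tree/main/dictionaries
- You can also configure the path to the Hunspell dictionary folder:
  - If you installed with pip, the setting is in venv/lib/python3.7/site-packages/fastspell/config/hunspell.yaml
  - If you installed with setup.py, the setting is in /config/hunspell.yaml
  - If you run directly from the source code, the default location is /usr/share/hunspell.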
After a successful installation, three executables (monocleaner, monocleaner-train, and monocleaner-download) are created under python/installation/prefix/bin.
For example:
On the Mac, mine are under /Library/Frameworks/Python.framework/Versions/3.7/bin/
On Linux, I use the Python that ships with Anaconda, so the executables are under /home/newtranx/anaconda3/bin
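To check where these executables ended up on a particular machine, a small sketch like the one below may help. It uses only the standard library (`shutil.which` and `sysconfig.get_path`); the helper name `locate_tools` is made up for illustration, and the result depends on which Python environment is active.

```python
import shutil
import sysconfig

# The three command-line tools that the installation is expected to create.
TOOLS = ("monocleaner", "monocleaner-train", "monocleaner-download")


def locate_tools(tools=TOOLS):
    """Map each tool name to its full path on PATH, or None if it is not found."""
    return {name: shutil.which(name) for name in tools}


if __name__ == "__main__":
    # Directory where the active interpreter installs console scripts.
    print("scripts directory:", sysconfig.get_path("scripts"))
    for name, path in locate_tools().items():
        print(f"{name}: {path or 'not found on PATH'}")
```

If a tool is reported as not found even though pip succeeded, the scripts directory printed first is probably missing from PATH.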
Checking the version and help
$ monocleaner -v
monocleaner Version 1.1.0 # 2021-03-07 # Add lang ident column # Jaume Zaragoza
$ monocleaner -h
usage: monocleaner [-h] [--scol SCOL] [--disable_lang_ident] [--disable_hardrules] [--disable_minimal_length] [--score_only]
[--add_lang_ident] [--annotated_output] [--debug] [-q] [-v]
model_dir [input] [output]
positional arguments:
model_dir Model directory to store LM file and metadata.
input Input file. If omitted, read from 'stdin'.
output Output tab-separated text file adding monocleaner score. When omitted output will be written to stdout.
optional arguments:
-h, --help show this help message and exit
--scol SCOL Sentence column (starting in 1)
--disable_lang_ident Disables language identification in hardrules
--disable_hardrules Disables the hardrules filtering (only monocleaner fluency scoring is applied)
--disable_minimal_length Don't apply minimal length (3 words) rule
--score_only Only print the score for each sentence, omit all fields
--add_lang_ident Add another column with the identified language if it's not disabled.
--annotated_output Add hardrules annotation for each sentence
--debug
-q, --quiet
-v, --version show version of this script and exit
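Scoring
- monocleaner is mainly used to measure the fluency of monolingual sentences.
- Each sentence gets a fluency score between 0 and 1; the higher the score, the more fluent the sentence.
- In addition to the continuous score, a set of hard-coded rules (hardrules) assigns a score of 0 to obviously flawed sentences.
- The input file must contain one sentence per line.
- The output has the same number of lines as the input, with one extra column holding the score.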
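Because the score is appended as an extra tab-separated column, filtering a scored file by fluency takes only a few lines of Python. The sketch below assumes the score is the last column of each line (that is, the tool was run without --add_lang_ident or --annotated_output); the function names, the sample lines, and the 0.5 threshold are illustrative choices, not part of monocleaner.

```python
from typing import Iterable, Iterator, Tuple


def parse_scored_line(line: str) -> Tuple[str, float]:
    """Split one output line into (original fields, score).

    Assumes the score is the last tab-separated column.
    """
    text, _, score = line.rstrip("\n").rpartition("\t")
    return text, float(score)


def keep_fluent(lines: Iterable[str], threshold: float = 0.5) -> Iterator[str]:
    """Yield the original text of every line whose score is at least `threshold`."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        text, score = parse_scored_line(line)
        if score >= threshold:
            yield text


if __name__ == "__main__":
    # Made-up sample lines in the "sentence<TAB>score" layout.
    sample = [
        "a fluent sentence\t0.81\n",
        "rejected by a hard rule\t0.000\n",
    ]
    for sentence in keep_fluent(sample):
        print(sentence)
```

Splitting from the right with `rpartition` keeps any earlier tab-separated columns of the input intact, so the same helper works when the sentence is not the only field on the line.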
The tool is invoked with the following syntax:
monocleaner [-h]
[--disable_minimal_length]
[--disable_hardrules]
[--score_only]
[--annotated_output]
[--add_lang_ident]
[--debug]
[-q]
model_dir [input] [output]
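Arguments
- Positional arguments:
  - model_dir: directory where the model is stored.
  - input: path of the input file. If omitted, input is read from stdin.
  - output: output file, tab-separated. If omitted, output is written to stdout.
- Optional arguments:
  - --score_only: print only the score (default: False).
  - --add_lang_ident: add another column with the identified language, unless language identification is disabled.
  - --disable_hardrules: disable the hardrules filtering so that only fluency scoring is applied (default: False).
  - --disable_minimal_length: do not apply the minimal length rule (3 words) (default: False).
- Logging:
  - -q, --quiet: quiet logging mode (default: False).
  - --debug: debug logging mode (default: False).
  - -v, --version: show version information.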
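When monocleaner is called from a script, it can be convenient to assemble the argument list programmatically. The following is a minimal sketch, assuming only the flags shown in the help output above; `build_command` and its parameter names are hypothetical, and the paths in the example are placeholders.

```python
from typing import List, Optional


def build_command(
    model_dir: str,
    input_path: Optional[str] = None,
    output_path: Optional[str] = None,
    score_only: bool = False,
    disable_hardrules: bool = False,
    disable_minimal_length: bool = False,
    add_lang_ident: bool = False,
    quiet: bool = False,
) -> List[str]:
    """Assemble a monocleaner argument list: flags first, then positionals."""
    cmd = ["monocleaner"]
    if score_only:
        cmd.append("--score_only")
    if disable_hardrules:
        cmd.append("--disable_hardrules")
    if disable_minimal_length:
        cmd.append("--disable_minimal_length")
    if add_lang_ident:
        cmd.append("--add_lang_ident")
    if quiet:
        cmd.append("-q")
    cmd.append(model_dir)
    if input_path is not None:
        cmd.append(input_path)
        if output_path is not None:
            cmd.append(output_path)
    elif output_path is not None:
        # Positionals are ordered, so a lone output path would be read as the input.
        raise ValueError("an output path requires an input path")
    return cmd


if __name__ == "__main__":
    # Placeholder paths; nothing is executed here, the command is only printed.
    print(" ".join(build_command("models/en", "input.txt", "scored.tsv", quiet=True)))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where monocleaner is installed; passing a list rather than a single string avoids shell quoting problems with paths that contain spaces.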
Usage example:
Type command +
$ monocleaner xx/monocleaner/models/en
2022-06-25 13:17:35,372 - WARNING - Downloading FastText model...
Warning : `load_model` does not return WordVectorModel or SupervisedModel any more, but a `FastText` object which is very similar.
2022-06-25 13:18:01,280 - INFO - Start scoring text
hello, this my name is
hello, this my name is 0.676
hello, this is my name
hello, this is my name 0.706
Showing only the score
$ monocleaner --score_only xx/monocleaner/models/en
Warning : `load_model` does not return WordVectorModel or SupervisedModel any more, but a `FastText` object which is very similar.
2022-06-25 13:23:13,298 - INFO - Start scoring text
hi, I wanna fly to the sky!
0.603
you're beautiful in white
0.800
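Downloading data with monocleaner-download
monocleaner-download does not appear to have a version option; running it with an unrecognized argument prints the usage message instead.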
$ monocleaner-download --version
Wrong number of arguments: --version
Script to download Bicleaner language packs.
Usage: monocleaner-download <lang> <download_path>
<lang> Language code.
<download_path> Path where downloaded language pack should be placed.
Now we can download data freely:
$ monocleaner-download es xx/monocleaner/models/
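PS: I have not found zh data so far. You can train a model yourself with monocleaner-train.
You can also go to https://github.com/bitextor/monocleaner-data/releases/latest to download the data or to see which languages are already supported.
Training with monocleaner-train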
$ monocleaner-train -h
usage: monocleaner-train [-h] -l LANGUAGE [--dev_size DEV_SIZE]
[--lm_type {PLACEHOLDER,CHARACTER}]
[--tokenizer_command TOKENIZER_COMMAND] [--debug]
[-q]
train model_dir
- positional arguments:
  - train: training dataset file, one monolingual sentence per line.
  - model_dir: model directory used to store the LM file and metadata.
- optional arguments:
  - -h, --help: show this help message and exit.
  - -l LANGUAGE, --language LANGUAGE: language code of the model.
  - --dev_size DEV_SIZE: number of sentences used to estimate the mean and standard deviation of perplexity on noisy and clean text; they are extracted from the training corpus.
  - --lm_type {PLACEHOLDER,CHARACTER}
  - --tokenizer_command TOKENIZER_COMMAND: tokenizer command that replaces the Moses tokenizer when the PLACEHOLDER LM type is used.
  - --debug
  - -q, --quiet
I have not run any training myself, so I will not cover training results or pitfalls here. I will add them when I get the chance.
伊织 2022-06-25 (Sat)