Multi-class classification of imbalanced text using AWS SageMaker BlazingText
2020-11-06 01:22:00 【InfoQ】
Background
Text classification is a natural language processing task in which a computer maps a piece of text to one or more predefined categories. In practice, the categories in a dataset are often imbalanced (class imbalance), and this can seriously degrade the final classification results. Class imbalance means that some categories in a dataset have many samples while others have few, so that the ratio between the largest and the smallest categories is large.
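As a rough illustration of the definition above, the imbalance of a label column can be summarized as the size of the largest category divided by the size of the smallest. The sketch below is not from the original article; the function name and the toy labels are hypothetical.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return (per-category counts, largest category size / smallest category size)."""
    counts = Counter(labels)
    return counts, max(counts.values()) / min(counts.values())


# Hypothetical toy labels, not taken from the article's dataset.
labels = ["Company"] * 900 + ["Artist"] * 90 + ["Film"] * 10
counts, ratio = imbalance_ratio(labels)
print(counts.most_common(), ratio)
```

A ratio close to 1 indicates a balanced dataset; the larger the ratio, the more the minority categories are likely to need oversampling.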
BlazingText is a built-in AWS SageMaker algorithm that provides highly optimized implementations of Word2vec and text classification. This article uses SageMaker BlazingText for multi-class text classification. To address class imbalance, two methods, back translation and EDA (Easy Data Augmentation), are used to oversample the minority categories. Back translation calls the AWS Translate service to translate the text into another language and then back again, while EDA processes the text mainly through synonym replacement, random insertion, random swap, and random deletion. The article also uses AWS SageMaker automatic hyperparameter tuning to find the optimal hyperparameters for the BlazingText-based text classifier.
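The two oversampling ideas can be sketched roughly as follows. This is a minimal illustration, not the article's implementation: it covers only the random swap and random deletion operations of EDA (synonym replacement and random insertion need a thesaurus and are omitted), and the helper names, the pivot language "zh", and the default parameters are assumptions. The `aws_translate` wrapper uses the boto3 `translate_text` call and requires AWS credentials; `back_translate` accepts any translation callable so it can be tried without AWS access.

```python
import random


def random_swap(words, n_swaps, rng):
    """EDA random swap: exchange two randomly chosen positions, n_swaps times."""
    words = list(words)
    if len(words) < 2:
        return words
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words


def random_deletion(words, p, rng):
    """EDA random deletion: drop each word with probability p, keeping at least one."""
    if len(words) <= 1:
        return list(words)
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]


def back_translate(text, translate, source="en", pivot="zh"):
    """Translate text into a pivot language and back (pivot choice is an assumption)."""
    return translate(translate(text, source, pivot), pivot, source)


def aws_translate(text, source, target):
    """Thin wrapper over Amazon Translate; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the EDA helpers run without AWS access

    client = boto3.client("translate")
    response = client.translate_text(
        Text=text, SourceLanguageCode=source, TargetLanguageCode=target
    )
    return response["TranslatedText"]


rng = random.Random(0)
words = "the company was founded in seattle".split()
print(" ".join(random_swap(words, 1, rng)))
print(" ".join(random_deletion(words, 0.2, rng)))
```

With credentials configured, `back_translate(sentence, aws_translate)` would return a paraphrase of `sentence` that can be added to a minority category as an extra sample.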
This article builds an imbalanced text dataset with 14 categories from the public DBpedia dataset, then runs a baseline experiment without any imbalance handling, followed by oversampling experiments that use the back translation and EDA methods.
Link to the original text: https://www.infoq.cn/article/xbSAYuJcQrm048GHl5dJ. Reproduction without the author's permission is prohibited.
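For either experiment, the samples have to be written in the input format that BlazingText expects in supervised (text classification) mode: one sample per line, with the label prefixed by `__label__` and followed by space-separated tokens. The sketch below shows that formatting step only; the function name is hypothetical, and lowercasing with whitespace tokenization is an assumed preprocessing choice rather than something the article specifies.

```python
def to_blazingtext_line(label, text):
    """Format one sample as '__label__<label> <space-separated lowercase tokens>'."""
    tokens = text.lower().split()
    return "__label__" + label + " " + " ".join(tokens)


# Hypothetical sample, not taken from the article's dataset.
print(to_blazingtext_line("Company", "Amazon  Web Services"))
```

The baseline and the oversampled training sets can then share the same formatting code and differ only in which samples are passed in.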
Copyright notice
This article was created by [InfoQ]. Please include a link to the original when reposting. Thank you.