Multi-class classification of imbalanced text using AWS SageMaker BlazingText
2020-11-06 01:22:00 【InfoQ】
Background
Text classification is a natural language processing task in which a computer maps a piece of text to one or more categories from a predefined set. In practice, the categories in a dataset are often imbalanced (class imbalance), which can seriously degrade the final classification results. Class imbalance means that some categories in a dataset have many samples while others have few, so that the ratio between the sample counts of the majority and minority categories is large.
BlazingText is a built-in algorithm of AWS SageMaker that provides highly optimized implementations of Word2vec and text classification algorithms. This article uses SageMaker BlazingText to perform multi-class text classification. To address class imbalance, it oversamples the minority categories with two methods: back translation and EDA (Easy Data Augmentation). Back translation calls the AWS Translate service to translate each text into another language and then back again, while EDA augments text through synonym replacement, random insertion, random swap, and random deletion. The article also uses AWS SageMaker automatic hyperparameter tuning to find the optimal hyperparameters for the BlazingText text classification algorithm.
The experiments use an imbalanced text dataset of 14 categories, generated by processing the public DBpedia dataset. They comprise a baseline experiment with no handling of the imbalance, plus oversampling experiments using each of the two methods, back translation and EDA.
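To make the definition concrete, one simple way to quantify imbalance is the ratio between the largest and smallest class sizes. The sketch below is only an illustration: the label names and counts are made up, and the function name `imbalance_ratio` is hypothetical, not something from the original article.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return (per-class counts, largest class size / smallest class size)."""
    counts = Counter(labels)
    largest = max(counts.values())
    smallest = min(counts.values())
    return counts, largest / smallest


# Hypothetical toy labels, not the article's data.
labels = ["Company"] * 90 + ["Artist"] * 9 + ["Film"] * 1
counts, ratio = imbalance_ratio(labels)
print(counts.most_common())
print(f"imbalance ratio: {ratio:.0f}:1")
```

A ratio close to 1 indicates a balanced dataset; the larger the ratio, the more the minority categories are likely to need oversampling.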
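The two oversampling methods can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the article's implementation: the `SYNONYMS` table is a toy stand-in for a real thesaurus, the function names are hypothetical, and the pivot language `"zh"` is an assumption because the summary does not say which language was used. `back_translate` expects a client exposing the boto3 Translate `translate_text` call, for example one created with `boto3.client("translate")`.

```python
import random

# Hypothetical toy synonym table; a real run would use a proper thesaurus.
SYNONYMS = {"film": ["movie"], "company": ["firm", "business"]}


def synonym_replacement(words, n, rng):
    """Replace up to n words that have an entry in SYNONYMS."""
    out = list(words)
    candidates = [i for i, w in enumerate(out) if w in SYNONYMS]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        out[i] = rng.choice(SYNONYMS[out[i]])
    return out


def random_insertion(words, n, rng):
    """Insert a synonym of a randomly chosen word at a random position, n times."""
    out = list(words)
    for _ in range(n):
        candidates = [w for w in out if w in SYNONYMS]
        if not candidates:
            break
        synonym = rng.choice(SYNONYMS[rng.choice(candidates)])
        out.insert(rng.randrange(len(out) + 1), synonym)
    return out


def random_swap(words, n, rng):
    """Swap two randomly chosen positions, n times."""
    out = list(words)
    if len(out) < 2:
        return out
    for _ in range(n):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(words, p, rng):
    """Drop each word with probability p, always keeping at least one word."""
    if not words:
        return []
    kept = [w for w in words if rng.random() > p]
    return kept if kept else [rng.choice(words)]


def back_translate(text, client, source="en", pivot="zh"):
    """Translate text to the pivot language and back via a Translate client."""
    forward = client.translate_text(
        Text=text, SourceLanguageCode=source, TargetLanguageCode=pivot
    )["TranslatedText"]
    return client.translate_text(
        Text=forward, SourceLanguageCode=pivot, TargetLanguageCode=source
    )["TranslatedText"]


def eda_augment(sentence, rng, n=1, p=0.1):
    """Return one augmented variant of the sentence per EDA operation."""
    words = sentence.split()
    variants = [
        synonym_replacement(words, n, rng),
        random_insertion(words, n, rng),
        random_swap(words, n, rng),
        random_deletion(words, p, rng),
    ]
    return [" ".join(v) for v in variants]


rng = random.Random(0)
for variant in eda_augment("the film company opened a new studio", rng):
    print(variant)
```

Each augmented sentence keeps the label of the minority-class sample it came from, so adding the variants to the training set raises that category's share.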
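For both the baseline and the oversampled data, BlazingText's supervised (text classification) mode reads a plain text file with one example per line, where the label is a word prefixed with `__label__` followed by space-separated tokens. The sketch below shows one way to produce such a file. The helper names, the file path, and the simple lowercase-and-split tokenization are assumptions for illustration; the article's actual preprocessing may differ.

```python
def to_blazingtext_line(label, text):
    """Format one example as '__label__<label> <space-separated tokens>'."""
    # Labels must be a single word, so spaces are replaced with underscores.
    safe_label = label.replace(" ", "_")
    # Simplistic tokenization (an assumption): lowercase and split on whitespace.
    tokens = text.lower().split()
    return f"__label__{safe_label} " + " ".join(tokens)


def write_training_file(examples, path):
    """Write (label, text) pairs to path, one BlazingText line per example."""
    with open(path, "w", encoding="utf-8") as handle:
        for label, text in examples:
            handle.write(to_blazingtext_line(label, text) + "\n")


# Hypothetical examples, not taken from the DBpedia-derived dataset.
examples = [
    ("Company", "Acme Corp  makes anvils"),
    ("Natural Place", "A lake in the northern hills"),
]
for label, text in examples:
    print(to_blazingtext_line(label, text))
```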
Link to the original article: https://www.infoq.cn/article/xbSAYuJcQrm048GHl5dJ. Reproduction without the author's permission is prohibited.
Copyright notice
This article was created by [InfoQ]. Please include a link to the original when reposting. Thank you.