Multi-class classification of imbalanced text using AWS SageMaker BlazingText
2020-11-06 01:22:00 【InfoQ】
Background
Text classification is a task in natural language processing in which a computer maps a piece of text to one or more predefined categories. In practice, however, datasets are often class-imbalanced, and this can seriously degrade the final classification results. Class imbalance means that some categories in a dataset have many samples while others have few, so that the ratio between the sample counts of the majority and minority categories is large.
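One simple way to quantify the imbalance described above is the ratio between the largest and the smallest class. The sketch below is only an illustration: the function name `imbalance_ratio` and the toy labels are assumptions, not taken from the original article.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return (counts, ratio), where ratio = largest class size / smallest class size."""
    counts = Counter(labels)
    largest = max(counts.values())
    smallest = min(counts.values())
    return counts, largest / smallest


# Toy labels, invented for illustration only.
labels = ["Company"] * 900 + ["Film"] * 90 + ["Plant"] * 10
counts, ratio = imbalance_ratio(labels)
print(counts.most_common())
print(f"imbalance ratio: {ratio:.1f}")
```

A ratio close to 1 indicates a balanced dataset; the larger the ratio, the more the minority classes are outnumbered.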
BlazingText is a built-in algorithm of AWS SageMaker that provides highly optimized implementations of Word2vec and text classification algorithms. This article uses SageMaker BlazingText to perform multi-class text classification. To address class imbalance, two methods, back translation and EDA (Easy Data Augmentation), are used to oversample the minority classes. The back translation method calls the AWS Translate service to translate each text into another language and then back again, while EDA augments the text data mainly through synonym replacement, random insertion, random swap, and random deletion. This article also uses AWS SageMaker automatic hyperparameter tuning to find the optimal hyperparameters for the BlazingText text classification algorithm.
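The four EDA operations named above can be sketched in a few lines of Python. This is a minimal illustration, not the article's implementation: the `SYNONYMS` table is a hypothetical toy dictionary standing in for a real thesaurus, and all function names and parameters are assumptions.

```python
import random

# Hypothetical toy synonym table; a real pipeline would use a proper thesaurus.
SYNONYMS = {"quick": ["fast", "rapid"], "film": ["movie"], "big": ["large"]}


def synonym_replacement(words, n, rng):
    """Replace up to n words that have an entry in SYNONYMS."""
    out = list(words)
    candidates = [i for i, w in enumerate(out) if w in SYNONYMS]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        out[i] = rng.choice(SYNONYMS[out[i]])
    return out


def random_insertion(words, n, rng):
    """Insert a synonym of a randomly chosen word at a random position, n times."""
    out = list(words)
    for _ in range(n):
        candidates = [w for w in out if w in SYNONYMS]
        if not candidates:
            break
        synonym = rng.choice(SYNONYMS[rng.choice(candidates)])
        out.insert(rng.randint(0, len(out)), synonym)
    return out


def random_swap(words, n, rng):
    """Swap two randomly chosen positions, n times."""
    out = list(words)
    if len(out) < 2:
        return out
    for _ in range(n):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(words, p, rng):
    """Drop each word with probability p, always keeping at least one word."""
    if len(words) <= 1:
        return list(words)
    kept = [w for w in words if rng.random() > p]
    return kept if kept else [rng.choice(words)]


rng = random.Random(0)  # fixed seed so repeated runs give the same augmentations
words = "the quick fox watched a big film".split()
print(" ".join(synonym_replacement(words, 2, rng)))
print(" ".join(random_insertion(words, 2, rng)))
print(" ".join(random_swap(words, 1, rng)))
print(" ".join(random_deletion(words, 0.2, rng)))
```

Each function returns a new list and leaves its input untouched, so several augmented variants can be generated from the same minority-class sentence.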
This article builds an imbalanced text dataset with 14 categories by processing the public DBpedia dataset, then runs a baseline experiment without any imbalance handling and oversampling experiments using the back translation and EDA methods.
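The oversampling step can be pictured as topping up each minority class with augmented copies and then writing the result in the input format of BlazingText's supervised mode, where each line starts with the label prefixed by `__label__`, followed by the space-separated tokens. The sketch below is an assumption-laden illustration: `oversample`, `to_blazingtext_line`, and `swap_two_words` are hypothetical names, the toy samples are invented, and the word-swap augmenter is only a stand-in for back translation or EDA.

```python
import random
from collections import defaultdict


def oversample(samples, augment, target, rng):
    """Add augmented copies until every class has at least `target` samples.

    samples: list of (label, text) pairs; augment: callable mapping text -> new text.
    """
    by_label = defaultdict(list)
    for label, text in samples:
        by_label[label].append(text)
    out = list(samples)
    for label, texts in by_label.items():
        for _ in range(max(0, target - len(texts))):
            out.append((label, augment(rng.choice(texts))))
    return out


def to_blazingtext_line(label, text):
    """Format one sample as '__label__<label> <tokens>' (label must contain no spaces)."""
    return f"__label__{label} {' '.join(text.split())}"


rng = random.Random(0)


def swap_two_words(text):
    """Stand-in augmenter; back translation or EDA would be plugged in here."""
    words = text.split()
    if len(words) >= 2:
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return " ".join(words)


# Toy data, invented for illustration only.
samples = [
    ("Company", "acme builds rockets"),
    ("Company", "globex sells software"),
    ("Company", "initech makes printers"),
    ("Plant", "the oak is a tree"),
]
balanced = oversample(samples, swap_two_words, target=3, rng=rng)
for label, text in balanced:
    print(to_blazingtext_line(label, text))
```

Passing the augmenter as a callable keeps the oversampling loop independent of the augmentation method, so a baseline (no augmentation), a back translation run, and an EDA run can share the same code path.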
Link to the original text: https://www.infoq.cn/article/xbSAYuJcQrm048GHl5dJ. Reproduction without the author's permission is prohibited.