Multi-class classification of imbalanced text using AWS SageMaker BlazingText
2020-11-06 01:22:00 【InfoQ】
Background
Text classification is a natural language processing task in which a computer maps a piece of text to one or more predefined categories or topics. In practice, the classes in a dataset are often imbalanced, which can seriously degrade the final classification results. Class imbalance means that some categories in a dataset have many samples while others have few, so that the ratio between the sizes of the majority and minority classes is large.
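As a rough illustration of the definition above, the following sketch measures imbalance as the ratio between the largest and smallest class. The class names and counts are made up for the example and are not the article's data.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return (counts, ratio), where ratio = largest class size / smallest class size."""
    counts = Counter(labels)
    ratio = max(counts.values()) / min(counts.values())
    return counts, ratio


# Hypothetical label list: one dominant class and two much smaller ones.
labels = ["Company"] * 900 + ["Artist"] * 90 + ["Film"] * 10
counts, ratio = imbalance_ratio(labels)
print(counts.most_common())
print(f"imbalance ratio: {ratio:.0f}:1")
```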
BlazingText is a built-in AWS SageMaker algorithm that provides highly optimized implementations of Word2vec and text classification. This article uses SageMaker BlazingText for multi-class text classification. To address class imbalance, it oversamples the minority classes with two methods: back translation and EDA (Easy Data Augmentation). Back translation calls the AWS Translate service to translate each text into another language and then back again, while EDA transforms the text through synonym replacement, random insertion, random swap, and random deletion. The article also uses SageMaker automatic hyperparameter tuning to find the best hyperparameters for the BlazingText text classification algorithm.
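The four EDA operations can be sketched in a few lines of Python. This is a minimal illustration, not the article's code: the `SYNONYMS` table is a hypothetical toy thesaurus, and the function names and parameters are chosen for the example only.

```python
import random

# Hypothetical toy thesaurus; a real pipeline would use a proper synonym resource.
SYNONYMS = {"quick": ["fast", "rapid"], "film": ["movie"]}


def synonym_replace(words, n, rng):
    """Replace up to n words that have an entry in SYNONYMS."""
    out = list(words)
    candidates = [i for i, w in enumerate(out) if w in SYNONYMS]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        out[i] = rng.choice(SYNONYMS[out[i]])
    return out


def random_insert(words, n, rng):
    """Insert a synonym of a randomly chosen word at a random position, n times."""
    out = list(words)
    for _ in range(n):
        candidates = [w for w in out if w in SYNONYMS]
        if not candidates:
            break
        synonym = rng.choice(SYNONYMS[rng.choice(candidates)])
        out.insert(rng.randint(0, len(out)), synonym)
    return out


def random_swap(words, n, rng):
    """Swap the words at two distinct random positions, n times."""
    out = list(words)
    if len(out) < 2:
        return out
    for _ in range(n):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_delete(words, p, rng):
    """Drop each word with probability p, always keeping at least one word."""
    if len(words) <= 1:
        return list(words)
    kept = [w for w in words if rng.random() > p]
    return kept or [rng.choice(words)]


rng = random.Random(0)  # fixed seed so runs are repeatable
words = "the quick film review".split()
print(synonym_replace(words, 1, rng))
print(random_insert(words, 1, rng))
print(random_swap(words, 1, rng))
print(random_delete(words, 0.2, rng))
```

Each function returns a new token list, so several augmented variants can be generated from one minority-class sample without modifying the original.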
The experiments use an imbalanced text dataset with 14 categories, derived from the public DBpedia dataset. They comprise a baseline experiment with no imbalance handling and oversampling experiments using the back translation and EDA methods.
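The oversampling step could look roughly like the sketch below, under several assumptions: every class is grown to the size of the largest one (the article does not state its target size), the pivot language `zh` is arbitrary, and `translate` is an injected callable so the example runs without AWS credentials. The `__label__` prefix is the line format BlazingText expects in supervised mode.

```python
import random
from collections import defaultdict


def back_translate(text, translate, pivot="zh"):
    """Translate to a pivot language and back; translate(text, src, dst) is injected."""
    return translate(translate(text, "en", pivot), pivot, "en")

# With AWS Translate, `translate` could wrap boto3 (assumes configured credentials):
#   client = boto3.client("translate")
#   translate = lambda t, s, d: client.translate_text(
#       Text=t, SourceLanguageCode=s, TargetLanguageCode=d)["TranslatedText"]


def oversample(samples, augment, rng):
    """Grow every class to the size of the largest one using augmented copies.

    samples: list of (label, text) pairs; augment: callable mapping text -> new text.
    """
    by_label = defaultdict(list)
    for label, text in samples:
        by_label[label].append(text)
    target = max(len(texts) for texts in by_label.values())
    balanced = list(samples)
    for label, texts in by_label.items():
        for _ in range(target - len(texts)):
            balanced.append((label, augment(rng.choice(texts))))
    return balanced


def to_blazingtext_line(label, text):
    """Format one sample as '__label__<label> <space-separated tokens>'."""
    return f"__label__{label} {' '.join(text.split())}"


# Stand-in augmenter for the demo; replace with back translation or an EDA operation.
def demo_augment(text):
    return text + " (augmented)"


data = [("Company", "acme builds rockets")] * 3 + [("Film", "a quiet drama")]
for label, text in oversample(data, demo_augment, random.Random(0)):
    print(to_blazingtext_line(label, text))
```

Augmentation should be applied to the training split only, so that the validation data still reflects the original class distribution.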
Original article: https://www.infoq.cn/article/xbSAYuJcQrm048GHl5dJ. Reproduction without the author's permission is prohibited.