Multi-class classification of imbalanced text using AWS SageMaker BlazingText
2020-11-06 01:22:00 【InfoQ】
Text classification is a natural language processing task in which a computer maps a piece of text to one or more predefined categories. In practice, the categories in a dataset are often imbalanced (class imbalance), which can seriously degrade the final classification results. A dataset is imbalanced when some categories contain many samples and others contain few, so that the ratio between the largest and the smallest categories is large.
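As a rough illustration of the definition above, the degree of imbalance can be summarized as the ratio between the largest and the smallest class. The sketch below is only an assumption about how one might measure it; the function name `imbalance_ratio` and the toy labels are hypothetical and do not come from the article's dataset.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return (per-class counts, size of largest class / size of smallest class)."""
    counts = Counter(labels)
    return counts, max(counts.values()) / min(counts.values())


# Hypothetical toy labels, used only to show the calculation.
labels = ["Company"] * 90 + ["Artist"] * 9 + ["Plant"] * 1
counts, ratio = imbalance_ratio(labels)
print(counts.most_common())
print(ratio)
```

A ratio close to 1 indicates a roughly balanced dataset; the larger the ratio, the more the minority classes are likely to need oversampling.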
BlazingText is a built-in AWS SageMaker algorithm that provides highly optimized implementations of Word2vec and text classification. This article uses SageMaker BlazingText for multi-class text classification. To address class imbalance, it oversamples the minority classes with two methods: back translation and EDA. Back translation calls the AWS Translate service to translate each text into another language and then back again, while EDA mainly applies synonym replacement, random insertion, random swap, and random deletion to the text. The article also uses AWS SageMaker automatic hyperparameter tuning to find the best hyperparameters for the BlazingText text classification algorithm.
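The four EDA operations named above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the article's implementation: the `SYNONYMS` table is a tiny hypothetical stand-in for a real thesaurus, the function names are invented for this example, and random insertion is assumed to insert synonyms of words already in the sentence.

```python
import random

# Tiny hypothetical synonym table; a real run would use a proper thesaurus.
SYNONYMS = {"quick": ["fast", "rapid"], "happy": ["glad", "joyful"]}


def synonym_replace(words, n, rng):
    """Replace up to n words that have an entry in SYNONYMS."""
    out = list(words)
    candidates = [i for i, w in enumerate(out) if w in SYNONYMS]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        out[i] = rng.choice(SYNONYMS[out[i]])
    return out


def random_insert(words, n, rng):
    """Insert n synonyms of words in the sentence at random positions."""
    out = list(words)
    pool = [s for w in words for s in SYNONYMS.get(w, [])]
    for _ in range(n if pool else 0):
        out.insert(rng.randint(0, len(out)), rng.choice(pool))
    return out


def random_swap(words, n, rng):
    """Swap two randomly chosen positions, n times."""
    out = list(words)
    if len(out) < 2:
        return out
    for _ in range(n):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_delete(words, p, rng):
    """Drop each word with probability p, keeping at least one word."""
    if not words:
        return []
    kept = [w for w in words if rng.random() > p]
    return kept or [rng.choice(words)]


rng = random.Random(0)  # fixed seed so the augmentation is repeatable
words = "the quick dog is happy".split()
print(" ".join(synonym_replace(words, 1, rng)))
print(" ".join(random_insert(words, 2, rng)))
print(" ".join(random_swap(words, 1, rng)))
print(" ".join(random_delete(words, 0.2, rng)))
```

Each function returns a new list and leaves its input unchanged, so several augmented variants can be generated from the same minority-class sample.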
This article uses an imbalanced text dataset with 14 categories, generated by processing the public DBpedia dataset. It runs a baseline experiment with no imbalance handling, as well as oversampling experiments using both back translation and EDA.
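The oversampling experiments described above could be prepared along the following lines. This is a sketch under assumptions: `oversample` and `to_blazingtext_line` are hypothetical helper names, balancing every class up to the size of the largest one is an illustrative choice rather than the article's stated setup, and lowercasing the text is an assumed preprocessing step. What is not an assumption is the input format: BlazingText in supervised mode expects one sample per line, with the label prefixed by `__label__` followed by space-separated tokens.

```python
import random
from collections import Counter, defaultdict


def oversample(rows, augment, rng):
    """rows: list of (label, text) pairs.

    Append augmented copies of randomly chosen samples until every class
    is as large as the biggest class. `augment` is any text -> text function,
    for example an EDA operation or a back-translation call.
    """
    by_label = defaultdict(list)
    for label, text in rows:
        by_label[label].append(text)
    target = max(len(texts) for texts in by_label.values())
    out = list(rows)
    for label, texts in by_label.items():
        for _ in range(target - len(texts)):
            out.append((label, augment(rng.choice(texts))))
    return out


def to_blazingtext_line(label, text):
    """Format one sample as '__label__<label> <tokens>'.

    Assumes the label contains no whitespace; lowercasing is an assumption.
    """
    return f"__label__{label} {' '.join(text.lower().split())}"


# Hypothetical toy rows; the identity function stands in for a real augmenter.
rows = [("Company", "Acme builds rockets")] * 3 + [("Plant", "A rare orchid")]
balanced = oversample(rows, lambda text: text, random.Random(0))
print(Counter(label for label, _ in balanced))
for label, text in balanced:
    print(to_blazingtext_line(label, text))
```

The baseline experiment would skip `oversample` and format the original rows directly.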
Link to the original text: https://www.infoq.cn/article/xbSAYuJcQrm048GHl5dJ. Reproduction is prohibited without the author's permission.