当前位置:网站首页>Multi classification of unbalanced text using AWS sagemaker blazingtext
Multi classification of unbalanced text using AWS sagemaker blazingtext
2020-11-06 01:22:00 【InfoQ】
background
Text classification (Text Classification) It belongs to the field of natural language processing , It refers to the process that the computer maps a text containing information to a given category or several categories of topics in advance . But in reality , We often encounter imbalances in the categories of data samples (class imbalance) The phenomenon , It seriously affects the final result of text classification . The so-called sample imbalance refers to a given data set, some categories of data more , Some data categories are few , And the data category samples with more data proportion and data category samples with small proportion reach a large proportion .
BlazingText yes AWS SageMaker A built-in algorithm for , Provides Word2vec And text classification algorithm highly optimized implementation . This article uses Sagemaker BlazingText It realizes the text multi classification . On the problem of sample imbalance , Back translation and EDA Two methods are used to over sample a small number of samples , The back translation method calls AWS Translate The service was translated and retranslated , and EDA Methods mainly use synonyms to replace 、 Insert randomly 、 Random exchange 、 Random deletion deals with text data . This article also uses AWS SageMaker Automatic parametric optimization for BlazingText The text classification algorithm based on the algorithm finds the optimal hyperparameter .
This article is based on DBpedia The public dataset generated by processing contains 14 Unbalanced text data of categories , And did not do any sample imbalance processing Baseline Experiment and include back translation and EDA Oversampling experiments of two methods .
Link to the original text :【https://www.infoq.cn/article/xbSAYuJcQrm048GHl5dJ】. Without the permission of the author , Prohibited reproduced .
版权声明
本文为[InfoQ]所创,转载请带上原文链接,感谢
边栏推荐
- Serilog原始碼解析——使用方法
- Analysis of react high order components
- Don't go! Here is a note: picture and text to explain AQS, let's have a look at the source code of AQS (long text)
- 速看!互联网、电商离线大数据分析最佳实践!(附网盘链接)
- Filecoin的经济模型与未来价值是如何支撑FIL币价格破千的
- Python download module to accelerate the implementation of recording
- Skywalking series blog 5-apm-customize-enhance-plugin
- JVM memory area and garbage collection
- 比特币一度突破14000美元,即将面临美国大选考验
- 快快使用ModelArts,零基础小白也能玩转AI!
猜你喜欢

一篇文章带你了解CSS 渐变知识

ES6学习笔记(五):轻松了解ES6的内置扩展对象

Filecoin最新动态 完成重大升级 已实现四大项目进展!

中国提出的AI方法影响越来越大,天大等从大量文献中挖掘AI发展规律

Thoughts on interview of Ali CCO project team

一篇文章带你了解SVG 渐变知识

使用 Iceberg on Kubernetes 打造新一代云原生数据湖

Summary of common string algorithms

Windows 10 tensorflow (2) regression analysis of principles, deep learning framework (gradient descent method to solve regression parameters)

采购供应商系统是什么?采购供应商管理平台解决方案
随机推荐
Natural language processing - BM25 commonly used in search
EOS创始人BM: UE,UBI,URI有什么区别?
Word segmentation, naming subject recognition, part of speech and grammatical analysis in natural language processing
前端都应懂的入门基础-github基础
6.1.2 handlermapping mapping processor (2) (in-depth analysis of SSM and project practice)
I'm afraid that the spread sequence calculation of arbitrage strategy is not as simple as you think
基於MVC的RESTFul風格API實戰
Brief introduction of TF flags
Let the front-end siege division develop independently from the back-end: Mock.js
Python + appium automatic operation wechat is enough
ES6学习笔记(四):教你轻松搞懂ES6的新增语法
html
Filecoin主网上线以来Filecoin矿机扇区密封到底是什么意思
加速「全民直播」洪流,如何攻克延时、卡顿、高并发难题?
Group count - word length
Leetcode's ransom letter
多机器人行情共享解决方案
Using Es5 to realize the class of ES6
hadoop 命令总结
Tool class under JUC package, its name is locksupport! Did you make it?