15 Years of Software Architect Experience: Five Pitfalls Beginners Step Into in ML
2022-08-03 20:41:00 [Software Testnet]
Recently, software architect, data scientist, and Kaggle master Agnis Liukis published an article discussing solutions to some of the most common beginner mistakes in machine learning, so that beginners can understand and avoid them.
Agnis Liukis has more than 15 years of software architecture and development experience and is proficient in Java, JavaScript, Spring Boot, React.JS, Python, and related technologies. He is also interested in data science and machine learning, has competed on Kaggle many times with good results, and has reached Kaggle competition master level.
The following is the content of the article:
In the field of machine learning, have you stepped into these five pitfalls?
1. Not using data normalization where it is required
It seems easy: take the features, feed them into the model, and let it make predictions. But in some cases the results of this simple approach can be disappointing, because it is missing a very important step: data normalization.
Some types of models require data normalization, such as linear regression and classical neural networks. These models multiply feature values by trained weights. Without normalization, the range of possible values of one feature can differ drastically from that of another.
Suppose one feature's values lie in the range [0, 0.001] and another's in [100000, 200000]. For a model that treats the two features as equally important, the weight of the first feature will be 100 million times larger than the weight of the second. Huge weights can cause serious problems for the model, for example when outliers are present. Moreover, estimating the importance of features becomes difficult, because a large weight may mean the feature is important, or it may simply mean that its values are small.
After normalization, all features have values in the same range, typically [0, 1] or [-1, 1]. The weights will then be in a similar range and will correspond much more closely to the real importance of each feature.
Overall, using data normalization where it is needed yields better results and more accurate predictions.
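A minimal min-max normalization sketch in NumPy, using the two feature ranges from the example above (in practice, scikit-learn's MinMaxScaler or StandardScaler does the same job; the sample values here are made up for illustration):

```python
import numpy as np

# Two features with wildly different scales, as in the example above:
# one in [0, 0.001], the other in [100000, 200000].
X = np.array([
    [0.0002, 120000.0],
    [0.0007, 180000.0],
    [0.0010, 100000.0],
    [0.0000, 200000.0],
])

def min_max_scale(X):
    """Rescale each column to the [0, 1] range."""
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    return (X - col_min) / (col_max - col_min)

X_scaled = min_max_scale(X)
print(X_scaled.min(axis=0))  # both columns now start at 0
print(X_scaled.max(axis=0))  # and end at 1
```

After scaling, a weight of similar magnitude on either column contributes comparably to the prediction.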
2. Thinking that more features are always better
One might think it a good idea to include all available features, expecting the model to automatically select and use the best ones. In practice, that idea is hard to realize.
The more features a model has, the greater the risk of overfitting. Even in completely random data, the model can find some apparent features (signals), sometimes weaker, sometimes stronger. Of course, there is no real signal in random noise. But given enough noisy columns, the model may end up using some of them based on these falsely detected signals. When that happens, the quality of the model's predictions degrades, because they are partly based on random noise.
There are many techniques to help with feature selection. But remember: you should be able to explain every feature you have, and why it will help your model.
3. Using tree-based models when extrapolation is required
Tree-based models are easy to use and powerful, which is why they are popular. However, in some cases a tree-based model is the wrong choice.
Tree-based models cannot extrapolate. Their predictions will never be larger than the maximum target value seen in the training data, and never smaller than the minimum.
In some tasks the ability to extrapolate is very important. For example, if the model predicts stock prices, the price could in the future be higher than it has ever been. A tree-based model will not work out of the box there, because its predictions will be capped near the highest historical price.
There are several solutions to this problem. One is to predict changes or differences instead of predicting the value directly. Another is to use a different type of model for such tasks: linear regression or neural networks can extrapolate.
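A toy illustration of the capping behavior. The "tree-like" predictor below is not a real decision tree, but it reproduces the relevant property: a leaf outputs a constant taken from the training targets, so predictions can never leave the training range, while a fitted line extrapolates freely:

```python
import numpy as np

# Training data: a perfectly linear trend y = 2 * x.
x_train = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_train = 2.0 * x_train

def tree_like_predict(x):
    """Mimics a tree leaf: output a constant learned from training targets
    (here, the target of the nearest training point), so the prediction is
    always within [y_train.min(), y_train.max()]."""
    idx = np.abs(x_train - x).argmin()
    return y_train[idx]

def linear_predict(x):
    """Least-squares line fit: follows the trend beyond the training range."""
    slope, intercept = np.polyfit(x_train, y_train, deg=1)
    return slope * x + intercept

x_future = 10.0                     # far outside the training range
print(tree_like_predict(x_future))  # 10.0, capped at the training maximum
print(linear_predict(x_future))     # ~20.0, continues the trend
```

This is why predicting a *difference* (which usually stays inside the historical range) is a common workaround when you want to keep using tree-based models.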
4. Using data normalization where it is not needed
The previous section discussed why data normalization is needed, but that is not always the case: tree-based models do not require it. Neural networks may also not need explicit normalization, because some networks already contain normalization layers inside, such as the BatchNormalization layer in the Keras library.
In some cases even linear regression may not need data normalization: when all features are already in a similar value range and have the same meaning. For example, if the model is fit to time-series data and all features are historical values of the same parameter.
5. Information leakage between the training set and the validation/test set
Creating a data leak is easier than one might think. Consider the following code snippet:
Example of features with data leakage
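The original snippet did not survive extraction, so here is a hypothetical reconstruction consistent with the surrounding description (the feature names sum_feature and diff_feature come from the text below; the data values and everything else are assumptions). The leak comes from building features out of statistics computed over the whole dataset before the split:

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# LEAKY: both features use statistics computed over ALL rows,
# including rows that will later end up in the test set.
sum_feature  = values / values.sum()    # each row's share of the full-dataset total
diff_feature = values - values.mean()   # distance from the full-dataset mean

# If we later split, e.g. train = rows 0..3 and test = rows 4..5,
# the training rows' feature values already depend on the test rows.
train_features = np.column_stack([sum_feature, diff_feature])[:4]
print(train_features)
```

Any change to a test row would change the training features, which is exactly the leak described next.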
In fact, both of these features (sum_feature and diff_feature) are incorrect. They leak information: after splitting into training and test sets, the training part will contain some information from the test rows. This inflates validation scores, but when the model is applied to real data its performance will be worse.
The correct approach is to perform the train/test split first, and only then apply the feature-generation functions. In general, processing the training and test sets separately is a good feature-engineering pattern.
In some cases some information does have to be passed between the two sets. For example, we may want to use the same StandardScaler on both the training and the test set.
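A leak-free sketch of that pattern in plain NumPy (it mirrors what scikit-learn's StandardScaler does when you call fit on the training set and then transform both sets; the sample values are made up):

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0, 100.0, 200.0])
train, test = data[:4], data[4:]

# Fit the scaler's statistics on the TRAINING part only...
mu, sigma = train.mean(), train.std()

# ...then apply the same transformation to both parts.
train_scaled = (train - mu) / sigma
test_scaled  = (test - mu) / sigma

print(train_scaled.mean())  # ~0: centered, as expected on the data it was fit on
print(test_scaled)          # large values are fine: no test information leaked
```

The key point: the test rows influence neither mu nor sigma, so no information flows from test to train.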
All in all, learning from mistakes is a good thing, and I hope the error examples above will help you.