15 Years of Software Architect Experience: Five Pitfalls Beginners Step Into in ML
2022-08-03 20:41:00 [Software Testnet]
Recently, software architect, data scientist, and Kaggle master Agnis Liukis published an article discussing solutions to some of the most common beginner mistakes in machine learning, so that beginners can understand and avoid them.
Agnis Liukis has more than 15 years of software architecture and development experience and is proficient in Java, JavaScript, Spring Boot, React.JS, Python, and related technologies. He is also interested in data science and machine learning, has competed on Kaggle many times with good results, and has reached Kaggle competition master level.
The following is the content of the article:
In the field of machine learning, have you stepped into these five pitfalls?
1. Not using data normalization where it is required
It seems easy: take the features, feed them into the model, and let it make predictions. But in some cases the results of this simple approach can be disappointing, because it is missing a very important step: data normalization.
Some types of models require data normalization, such as linear regression and classical neural networks. These models multiply feature values by trained weights. Without normalization, the range of possible values of one feature can differ drastically from that of another.
Suppose one feature's values lie in the range [0, 0.001] and another's in [100000, 200000]. For a model that treats the two features as equally important, the weight of the first feature will be 100 million times larger than the weight of the second. Huge weights can cause serious problems for the model, for example when outliers are present. Moreover, estimating the importance of features becomes difficult, because a large weight may mean the feature is important, or it may simply mean that its values are small.
After normalization, all features have values in the same range, typically [0, 1] or [-1, 1]. The weights will then be in a similar range and will correspond much more closely to the real importance of each feature.
Overall, using data normalization where it is needed yields better results and more accurate predictions.
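A minimal min-max normalization sketch in NumPy, using the two feature ranges from the example above (in practice, scikit-learn's MinMaxScaler or StandardScaler does the same job; the sample values here are made up for illustration):

```python
import numpy as np

# Two features with wildly different scales, as in the example above:
# one in [0, 0.001], the other in [100000, 200000].
X = np.array([
    [0.0002, 120000.0],
    [0.0007, 180000.0],
    [0.0010, 100000.0],
    [0.0000, 200000.0],
])

def min_max_scale(X):
    """Rescale each column to the [0, 1] range."""
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    return (X - col_min) / (col_max - col_min)

X_scaled = min_max_scale(X)
print(X_scaled.min(axis=0))  # both columns now start at 0
print(X_scaled.max(axis=0))  # and end at 1
```

After scaling, a weight of similar magnitude on either column contributes comparably to the prediction.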
2. Thinking that more features are always better
One might think it a good idea to include all available features, expecting the model to automatically select and use the best ones. In practice, that idea is hard to realize.
The more features a model has, the greater the risk of overfitting. Even in completely random data, the model can find some apparent features (signals), sometimes weaker, sometimes stronger. Of course, there is no real signal in random noise. But given enough noisy columns, the model may end up using some of them based on these falsely detected signals. When that happens, the quality of the model's predictions degrades, because they are partly based on random noise.
There are many techniques to help with feature selection. But remember: you should be able to explain every feature you have, and why it will help your model.
3. Using tree-based models when extrapolation is required
Tree-based models are easy to use and powerful, which is why they are popular. However, in some cases a tree-based model is the wrong choice.
Tree-based models cannot extrapolate. Their predictions will never be larger than the maximum target value seen in the training data, and never smaller than the minimum.
In some tasks the ability to extrapolate is very important. For example, if the model predicts stock prices, the price could in the future be higher than it has ever been. A tree-based model will not work out of the box there, because its predictions will be capped near the highest historical price.
There are several solutions to this problem. One is to predict changes or differences instead of predicting the value directly. Another is to use a different type of model for such tasks: linear regression or neural networks can extrapolate.
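A toy illustration of the capping behavior. The "tree-like" predictor below is not a real decision tree, but it reproduces the relevant property: a leaf outputs a constant taken from the training targets, so predictions can never leave the training range, while a fitted line extrapolates freely:

```python
import numpy as np

# Training data: a perfectly linear trend y = 2 * x.
x_train = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_train = 2.0 * x_train

def tree_like_predict(x):
    """Mimics a tree leaf: output a constant learned from training targets
    (here, the target of the nearest training point), so the prediction is
    always within [y_train.min(), y_train.max()]."""
    idx = np.abs(x_train - x).argmin()
    return y_train[idx]

def linear_predict(x):
    """Least-squares line fit: follows the trend beyond the training range."""
    slope, intercept = np.polyfit(x_train, y_train, deg=1)
    return slope * x + intercept

x_future = 10.0                     # far outside the training range
print(tree_like_predict(x_future))  # 10.0, capped at the training maximum
print(linear_predict(x_future))     # ~20.0, continues the trend
```

This is why predicting a *difference* (which usually stays inside the historical range) is a common workaround when you want to keep using tree-based models.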
4. Using data normalization where it is not needed
The previous section discussed why data normalization is needed, but that is not always the case: tree-based models do not require it. Neural networks may also not need explicit normalization, because some networks already contain normalization layers inside, such as the BatchNormalization layer in the Keras library.
In some cases even linear regression may not need data normalization: when all features are already in a similar value range and have the same meaning. For example, if the model is fit to time-series data and all features are historical values of the same parameter.
5. Information leakage between the training set and the validation/test set
Creating a data leak is easier than one might think. Consider the following code snippet:
Example of features with data leakage
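The original snippet did not survive extraction, so here is a hypothetical reconstruction consistent with the surrounding description (the feature names sum_feature and diff_feature come from the text below; the data values and everything else are assumptions). The leak comes from building features out of statistics computed over the whole dataset before the split:

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# LEAKY: both features use statistics computed over ALL rows,
# including rows that will later end up in the test set.
sum_feature  = values / values.sum()    # each row's share of the full-dataset total
diff_feature = values - values.mean()   # distance from the full-dataset mean

# If we later split, e.g. train = rows 0..3 and test = rows 4..5,
# the training rows' feature values already depend on the test rows.
train_features = np.column_stack([sum_feature, diff_feature])[:4]
print(train_features)
```

Any change to a test row would change the training features, which is exactly the leak described next.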
In fact, both of these features (sum_feature and diff_feature) are incorrect. They leak information: after splitting into training and test sets, the training part will contain some information from the test rows. This inflates validation scores, but when the model is applied to real data its performance will be worse.
The correct approach is to perform the train/test split first, and only then apply the feature-generation functions. In general, processing the training and test sets separately is a good feature-engineering pattern.
In some cases some information does have to be passed between the two sets. For example, we may want to use the same StandardScaler on both the training and the test set.
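A leak-free sketch of that pattern in plain NumPy (it mirrors what scikit-learn's StandardScaler does when you call fit on the training set and then transform both sets; the sample values are made up):

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0, 100.0, 200.0])
train, test = data[:4], data[4:]

# Fit the scaler's statistics on the TRAINING part only...
mu, sigma = train.mean(), train.std()

# ...then apply the same transformation to both parts.
train_scaled = (train - mu) / sigma
test_scaled  = (test - mu) / sigma

print(train_scaled.mean())  # ~0: centered, as expected on the data it was fit on
print(test_scaled)          # large values are fine: no test information leaked
```

The key point: the test rows influence neither mu nor sigma, so no information flows from test to train.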
All in all, learning from mistakes is a good thing, and I hope the error examples above will help you.