Summary of log feature selection (based on Tianchi competition)
2022-07-08 01:47:00 【Mark_ Aussie】
The data comes from the third Alibaba Cloud PanJiu AIOps algorithm competition.
Official page: Problem statement and data of the third Alibaba Cloud PanJiu AIOps algorithm competition, Tianchi competition, Alibaba Cloud Tianchi (aliyun.com)
Runner-up solution on GitHub: AI-Competition/3rd_PanJiu_AIOps_Competition at main · yz-intelligence/AI-Competition · GitHub
The competition provides fault work orders and log data. Analysis of the msg structure shows that each msg can be split on the | character. In the actual business scenario, logs generated within 5, 10, 15, 30 minutes or longer before and after a fault may be related to that fault.
sn is the server serial number; the fault work orders contain more than 13,700 distinct sn values.
Server model (server_model) and server serial number (sn) have a one-to-many relationship: one server_model covers many sn values.
After TF-IDF encoding, msg is fed into a linear model, and eli5 is used to show, for each class, how much each word in msg contributes. The higher a word's weight, the more it helps distinguish that class.
Classes 0 and 1 are CPU-related faults; processor has the highest weight in both, so the two classes are not well separated by keywords.
Class 2 is memory-related faults; the highest-weighted words are memory, mem, and ecc.
Class 3 is other fault types; the highest-weighted words are hdd, fpga, and bus, so these may be hardware-related failures.
Main steps: data preprocessing, feature engineering, feature selection, model training, model fusion.
Data preprocessing: split the logs into time windows according to their distance from the fault time; normalize msg by its special symbols.
Feature engineering: mainly keyword features, time-difference features, TF-IDF word-frequency features, W2V features, statistical features, and New Data features.
Feature selection: adversarial validation is used to select features, keeping the training and test sets consistent and improving the model's generalization on the test set.
Model training: CatBoost and LightGBM are trained with the pseudo-label technique.
Model fusion: the CatBoost and LightGBM predictions are combined by weighted fusion at a ratio of 8:2 to give the final prediction.
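The 8:2 weighted fusion can be sketched as a weighted average of the two models' class probabilities. The probability arrays below are made up, and giving the 0.8 weight to CatBoost is an assumption based on the order in which the two models are named.

```python
import numpy as np


def blend(proba_a, proba_b, weight_a=0.8):
    """Weighted average of two (n_samples, n_classes) probability arrays."""
    proba_a = np.asarray(proba_a, dtype=float)
    proba_b = np.asarray(proba_b, dtype=float)
    return weight_a * proba_a + (1.0 - weight_a) * proba_b


# Hypothetical predicted probabilities for two samples and four classes.
catboost_proba = np.array([[0.7, 0.1, 0.1, 0.1], [0.2, 0.2, 0.5, 0.1]])
lightgbm_proba = np.array([[0.4, 0.4, 0.1, 0.1], [0.1, 0.1, 0.2, 0.6]])

fused = blend(catboost_proba, lightgbm_proba)
predicted = fused.argmax(axis=1)  # final class per sample
```

Because both inputs are probability distributions and the weights sum to 1, each fused row still sums to 1.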
In the actual business scenario, alarm logs may appear before a fault occurs and a log storm may follow it. For each fault work order, new log data is built from different time windows, and statistical features are computed after aggregating the logs.
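A minimal pandas sketch of this windowing step follows. The column names time, fault_time, and label, and the symmetric window around the fault, are assumptions made for illustration; the data is invented.

```python
import pandas as pd

# Toy log and fault tables; all values are invented.
logs = pd.DataFrame({
    "sn": ["A", "A", "A", "B"],
    "time": pd.to_datetime([
        "2022-01-01 09:50", "2022-01-01 10:02",
        "2022-01-01 12:00", "2022-01-01 10:00",
    ]),
    "msg": ["memory | ecc", "memory | ecc", "fan | ok", "cpu | caterr"],
})
faults = pd.DataFrame({
    "sn": ["A"],
    "fault_time": pd.to_datetime(["2022-01-01 10:00"]),
    "label": [2],
})


def logs_in_window(logs, faults, minutes):
    """Keep log rows of the same sn within +/- `minutes` of each fault."""
    merged = faults.merge(logs, on="sn", how="left")
    delta = (merged["time"] - merged["fault_time"]).abs()
    return merged[delta <= pd.Timedelta(minutes=minutes)]


# Number of log rows captured by each window size.
window_counts = {m: len(logs_in_window(logs, faults, m)) for m in (5, 10, 15, 30)}
```

Each window size yields its own log subset, which can then be aggregated into separate feature groups.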
Feature engineering:
Time-difference features
These reflect the interval between fault logs and normal logs. Construction method: compute the time difference between each log's time and the fault time, then derive group features by sn and server_model.
Statistical features of the time difference: [max, min, median, std, var, skw, sum, mode]
Quantile features of the time difference: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
Time difference: [max, min, median, std, var, skw, sum]
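The statistical and quantile features of the time difference can be sketched with a pandas groupby. The data and the td_ column prefix are invented for illustration, and skw is read here as skewness.

```python
import pandas as pd

# Toy log rows; all values are invented for illustration.
logs = pd.DataFrame({
    "sn": ["A", "A", "A", "B", "B"],
    "server_model": ["m1", "m1", "m1", "m2", "m2"],
    "fault_time": pd.to_datetime(
        ["2022-01-01 10:00"] * 3 + ["2022-01-01 11:00"] * 2
    ),
    "time": pd.to_datetime([
        "2022-01-01 09:50", "2022-01-01 09:58", "2022-01-01 10:01",
        "2022-01-01 10:30", "2022-01-01 10:59",
    ]),
})

# Seconds from each log line to the fault (positive: the log precedes the fault).
logs["time_diff"] = (logs["fault_time"] - logs["time"]).dt.total_seconds()

grouped = logs.groupby(["sn", "server_model"])["time_diff"]

# Statistical features; groups with fewer than 3 rows get NaN skewness.
stats = grouped.agg(
    td_max="max", td_min="min", td_median="median", td_std="std",
    td_var="var", td_skw="skew", td_sum="sum",
    td_mode=lambda s: s.mode().iloc[0],
)

# Quantile features, one column per quantile level.
levels = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
quantiles = grouped.quantile(levels).unstack()
quantiles.columns = [f"td_q{round(q * 100)}" for q in quantiles.columns]

features = stats.join(quantiles)
```

The resulting table has one row per (sn, server_model) pair and can be merged back onto the fault work orders.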
Keyword features
These capture the influence of keywords on each class. Building keyword features first requires finding the keywords, which is done in two ways:
Scheme 1: TF-IDF-encode msg, feed it into a linear model, compute the keyword weights for each class, and take the top 20 per class.
Scheme 2: tokenize msg on ' | ', count the word frequency within each class, and take the top 20 per class.
The union of the keywords from scheme 1 and scheme 2 gives the final keyword set.
Each keyword becomes a feature indicating whether it appears in msg.
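Scheme 2 and the final indicator features can be sketched as follows. The data is invented, k is set to 1 to keep the toy output small, and model_keywords is a hypothetical placeholder for the output of scheme 1.

```python
from collections import Counter

import pandas as pd

# Toy labeled messages, invented for illustration.
df = pd.DataFrame({
    "msg": [
        "processor cpu0 | configuration error",
        "memory dimm | correctable ecc",
        "memory dimm | uncorrectable ecc",
        "drive slot | drive fault",
    ],
    "label": [0, 2, 2, 3],
})


def top_tokens_per_class(df, k=20):
    """Scheme 2: split msg on '|' and keep the k most frequent tokens per class."""
    keywords = set()
    for _, group in df.groupby("label"):
        counts = Counter(
            tok.strip().lower()
            for msg in group["msg"]
            for tok in msg.split("|")
            if tok.strip()
        )
        keywords.update(tok for tok, _ in counts.most_common(k))
    return keywords


# Hypothetical output of scheme 1 (linear-model weights), hard-coded here.
model_keywords = {"ecc"}

# Final keyword set: union of both schemes.
keywords = sorted(top_tokens_per_class(df, k=1) | model_keywords)

# One binary column per keyword: 1 if the keyword occurs in msg, else 0.
lowered = df["msg"].str.lower()
for kw in keywords:
    col = "kw_" + kw.replace(" ", "_")
    df[col] = lowered.str.contains(kw, regex=False).astype(int)
```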
Statistical features: built by grouping on categorical features, so that the information hidden in the categorical features is fully exposed.
Grouped by sn, statistical features of server_model: [count, nunique, freq, rank]
Grouped by sn, statistical features of the log fields msg, msg_0, msg_1, msg_2: [count, nunique, freq, rank]
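A rough sketch of these grouped statistics is shown below. The text does not define freq or rank, so the code assumes freq is the share of the most common value within the group and rank is the dense rank of the group's row count; both readings are guesses, and the data is invented.

```python
import pandas as pd

# Toy log rows, invented for illustration.
logs = pd.DataFrame({
    "sn": ["A", "A", "A", "B"],
    "msg": ["x", "x", "y", "z"],
})


def group_stats(df, key, col):
    """count / nunique / freq / rank of `col` within each `key` group."""
    grouped = df.groupby(key)[col]
    out = pd.DataFrame({
        f"{col}_count": grouped.count(),
        f"{col}_nunique": grouped.nunique(),
        # Assumed meaning of freq: share of the most common value in the group.
        f"{col}_freq": grouped.agg(
            lambda s: s.value_counts(normalize=True).iloc[0]
        ),
    })
    # Assumed meaning of rank: dense rank of the row count, largest first.
    out[f"{col}_rank"] = out[f"{col}_count"].rank(method="dense", ascending=False)
    return out


msg_stats = group_stats(logs, "sn", "msg")
```

The same helper could be applied to msg_0, msg_1, msg_2, and server_model in turn.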
W2V features: these capture the semantic information of msg.
Group by sn and sort msg by time; for each sn, treat the sorted msg values as a sequence and extract embedding features.
TF-IDF features
Group by fault_id (sn + fault_time), concatenate the msg values of each fault_id into a sequence, and extract TF-IDF features.
Feature selection
Feature selection mainly uses adversarial validation. The original labels of the training and test sets are removed and the samples are relabeled: 1 for the training set, 0 for the test set. The two sets are merged, a model is trained, and its AUC is computed. If the AUC exceeds the set threshold, the feature with the highest importance is deleted and the model is retrained; this repeats until the AUC falls below the threshold.
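The adversarial validation loop can be sketched as below. This is not the competition code: a scikit-learn RandomForestClassifier stands in for the gradient-boosting models, the 0.7 threshold is an arbitrary choice, and the data is synthetic with one deliberately shifted feature.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict


def adversarial_select(train, test, threshold=0.7, seed=0):
    """Drop the most train/test-separating feature until AUC <= threshold."""
    features = list(train.columns)
    data = pd.concat([train, test], ignore_index=True)
    # Relabel: 1 for training rows, 0 for test rows.
    is_train = np.r_[np.ones(len(train), dtype=int), np.zeros(len(test), dtype=int)]
    dropped = []
    while len(features) > 1:
        model = RandomForestClassifier(n_estimators=50, random_state=seed)
        # Out-of-fold probabilities give an honest AUC estimate.
        proba = cross_val_predict(
            model, data[features], is_train, cv=3, method="predict_proba"
        )[:, 1]
        if roc_auc_score(is_train, proba) <= threshold:
            break
        # Refit on all rows to rank features, then remove the top one.
        model.fit(data[features], is_train)
        worst = features[int(np.argmax(model.feature_importances_))]
        features.remove(worst)
        dropped.append(worst)
    return features, dropped


# Synthetic data: "drifted" has a different distribution in the test set.
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "stable": rng.normal(0, 1, 200),
    "noise": rng.normal(0, 1, 200),
    "drifted": rng.normal(0, 1, 200),
})
test = pd.DataFrame({
    "stable": rng.normal(0, 1, 200),
    "noise": rng.normal(0, 1, 200),
    "drifted": rng.normal(3, 1, 200),
})

kept, dropped = adversarial_select(train, test)
```

An AUC near 0.5 means the classifier cannot tell training rows from test rows, which is the consistency the selection aims for.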
During model training, pseudo-labeling is used: predictions are made on test sets A and B, samples with confidence > 0.85 are selected as trusted samples, and these are added to the training set to increase the sample size.
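A minimal sketch of pseudo-labeling with the 0.85 confidence cut-off follows. LogisticRegression is only a stand-in for CatBoost and LightGBM, and the one-dimensional data is invented.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def add_pseudo_labels(model, X_train, y_train, X_test, min_confidence=0.85):
    """Fit on labeled data, then append confidently predicted test rows."""
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)
    # A test row is trusted when its top class probability exceeds the cut-off.
    confident = proba.max(axis=1) > min_confidence
    pseudo_y = model.classes_[proba.argmax(axis=1)][confident]
    X_aug = np.vstack([X_train, X_test[confident]])
    y_aug = np.concatenate([y_train, pseudo_y])
    return X_aug, y_aug, confident


# Two well-separated toy clusters and three unlabeled test points.
X_train = np.array([[-3.0], [-2.5], [-3.5], [3.0], [2.5], [3.5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_test = np.array([[-4.0], [0.0], [4.0]])

X_aug, y_aug, confident = add_pseudo_labels(
    LogisticRegression(), X_train, y_train, X_test
)
```

The augmented arrays are then used to retrain the model; the ambiguous point at 0.0 stays out of the training set.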
Reference:
Tianchi algorithm competition: runner-up solution for fault diagnosis based on large-scale logs!