
Summary of log feature selection (based on Tianchi competition)

2022-07-08 01:47:00 Mark_Aussie

The data comes from the 3rd Alibaba Cloud PanJiu Zhiwei (intelligent maintenance) algorithm competition.

Official page: Questions and data of the 3rd Alibaba Cloud PanJiu Zhiwei algorithm competition - Tianchi competition - Alibaba Cloud Tianchi (aliyun.com)

Runner-up solution on GitHub: AI-Competition/3rd_PanJiu_AIOps_Competition at main · yz-intelligence/AI-Competition · GitHub

The competition provides fault work orders and log data. Analysis of the msg structure shows it can be split on '|'. Based on the actual business scenario, log entries generated within 5/10/15/30 minutes (or longer) before and after a failure may be related to that failure.

sn is the server serial number; the fault work orders involve 13,700+ distinct sn values.

The server model (server_model) and the server serial number (sn) have a one-to-many relationship.

Encode msg with TF-IDF and feed it into a linear model, then use eli5 to inspect the per-class weight of each msg token; the higher the weight, the more that token contributes to distinguishing the class.

  • Classes 0 and 1 represent CPU-related faults; processor carries the highest weight, though its discriminative power is not high.

  • Class 2 represents memory-related faults; the higher-weighted tokens are memory, mem, and ecc.

  • Class 3 represents other fault types; the higher-weighted tokens are hdd, fpga, and bus, suggesting hardware-related failures.

Main steps: data preprocessing, feature engineering, feature selection, model training, and model fusion.

  1. Data preprocessing: split the logs into different time windows according to their offset from the fault time; normalize msg by its special symbols.

  2. Feature engineering: mainly keyword features, time-difference features, TF-IDF word-frequency features, W2V features, statistical features, and new-data features.

  3. Feature selection: use adversarial validation to keep only features whose distributions are consistent between the training and test sets, improving the model's generalization on the test set.

  4. Model training: CatBoost and LightGBM models trained with pseudo-labeling.

  5. Model fusion: the CatBoost and LightGBM predictions are blended with an 8:2 weighting to produce the final prediction.
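Step 5's 8:2 blend can be sketched in a few lines; the probability vectors below are made-up placeholders and `blend` is a hypothetical helper, not code from the runner-up repo:

```python
# Weighted fusion of two models' class-probability predictions (8:2).
# cat_probs / lgb_probs are hypothetical per-sample probability vectors.

def blend(cat_probs, lgb_probs, w_cat=0.8, w_lgb=0.2):
    """Blend per-class probabilities and return the predicted class ids."""
    preds = []
    for p_cat, p_lgb in zip(cat_probs, lgb_probs):
        mixed = [w_cat * a + w_lgb * b for a, b in zip(p_cat, p_lgb)]
        preds.append(max(range(len(mixed)), key=mixed.__getitem__))
    return preds

cat_probs = [[0.7, 0.1, 0.1, 0.1], [0.2, 0.2, 0.5, 0.1]]
lgb_probs = [[0.1, 0.6, 0.2, 0.1], [0.1, 0.1, 0.7, 0.1]]
print(blend(cat_probs, lgb_probs))  # highest blended score per sample → [0, 2]
```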

Based on the actual business scenario, alarm logs may appear before a fault occurs, and a log storm may follow it. For each fault work order, construct new log subsets under different time windows, then aggregate the logs and build statistical features on each window.
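The time-window construction described above might look like this minimal stdlib sketch (the log records and window sizes are illustrative, not the competition data):

```python
from datetime import datetime, timedelta

# Hypothetical log records: (sn, log_time, msg). Slice logs into windows
# around the fault time, as the preprocessing step describes.
WINDOWS_MIN = [5, 10, 15, 30]

def logs_in_window(logs, fault_time, minutes):
    """Keep logs within ±minutes of the fault time."""
    delta = timedelta(minutes=minutes)
    return [r for r in logs if abs(r[1] - fault_time) <= delta]

fault_time = datetime(2022, 1, 1, 12, 0)
logs = [
    ("sn1", datetime(2022, 1, 1, 11, 57), "cpu | caterr"),
    ("sn1", datetime(2022, 1, 1, 11, 40), "memory | ecc"),
    ("sn1", datetime(2022, 1, 1, 12, 4), "hdd | error"),
]
# Number of logs captured by each window size.
print([len(logs_in_window(logs, fault_time, m)) for m in WINDOWS_MIN])
```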

Feature Engineering

Time-difference features

These reflect the interval between failure logs and normal logs. Construction:

  • Compute the time difference between each log time and the fault time, then derive features grouped by sn and server_model.

  • Statistics of the time difference: [max, min, median, std, var, skew, sum, mode]

  • Quantiles of the time difference: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

  • Time difference: [max, min, median, std, var, skew, sum]
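The statistical and quantile features can be computed with the stdlib `statistics` module; `diffs` is a made-up group of time differences, and `skew` is hand-rolled since the module lacks it:

```python
import statistics

# Hypothetical time differences (seconds) between each log and the fault
# time for one (sn, server_model) group.
diffs = [30.0, 60.0, 60.0, 120.0, 300.0]

def skew(xs):
    """Population skewness (third standardized moment)."""
    m = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return sum((x - m) ** 3 for x in xs) / (len(xs) * sd ** 3) if sd else 0.0

feats = {
    "max": max(diffs), "min": min(diffs),
    "median": statistics.median(diffs),
    "std": statistics.pstdev(diffs), "var": statistics.pvariance(diffs),
    "skew": skew(diffs), "sum": sum(diffs),
    "mode": statistics.mode(diffs),
}
# Quantile features at [0.1, ..., 0.9]: nine decile cut points.
quantiles = statistics.quantiles(diffs, n=10)
print(feats["max"], feats["mode"], len(quantiles))
```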

Keyword features

Keywords carry class-specific signal. To build keyword features you must first find the keywords; there are two ways to determine them:

  • Scheme 1: TF-IDF-encode msg, feed it into a linear model, compute each word's weight per class, and take the TOP 20 per class.

  • Scheme 2: tokenize msg on '|', count word frequencies per class, and take the TOP 20 per class.

  • Take the union of schemes 1 and 2 as the final keyword set.

  • For each keyword, record whether it appears in msg.
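A minimal sketch of the presence features, assuming a hypothetical final keyword set:

```python
# Hypothetical keyword set (the union of schemes 1 and 2 described above)
# and 0/1 presence flags per msg.
KEYWORDS = ["processor", "memory", "ecc", "hdd", "fpga", "bus"]

def keyword_flags(msg):
    """One binary feature per keyword: does it appear in msg?"""
    text = " ".join(t.strip().lower() for t in msg.split("|"))
    return {f"kw_{k}": int(k in text) for k in KEYWORDS}

print(keyword_flags("Memory | ECC | correctable error"))
```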

Statistical features: group by the categorical features and build statistics so that the information hidden in those features is fully exposed.

  • Grouped by sn, statistics of server_model: [count, nunique, freq, rank]

  • Grouped by sn, statistics of the logs msg, msg_0, msg_1, msg_2: [count, nunique, freq, rank]
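A stdlib sketch of the grouped statistics (count, nunique, and freq shown; rank is omitted for brevity); the rows and column indices are illustrative:

```python
from collections import Counter

# Hypothetical log rows: (sn, server_model, msg).
rows = [
    ("sn1", "M1", "memory | ecc"),
    ("sn1", "M1", "memory | ecc"),
    ("sn1", "M1", "hdd | error"),
    ("sn2", "M2", "processor | caterr"),
]

def group_stats(rows, sn, col):
    """count / nunique / freq (count of the most common value) for one sn."""
    values = [r[col] for r in rows if r[0] == sn]
    counts = Counter(values)
    return {
        "count": len(values),
        "nunique": len(counts),
        "freq": counts.most_common(1)[0][1],
    }

print(group_stats(rows, "sn1", 2))  # statistics of msg for sn1
```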

W2V features: reflect the semantic information of msg

  • Group by sn and sort msg by time; for each sn, treat the sorted msg values as one sequence and extract embedding features from it.
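A sketch of building the per-sn, time-ordered msg sequences; the embedding step itself (e.g. training gensim's Word2Vec on these sequences) is not shown here:

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical rows: (sn, log_time, msg). Each sn's msgs, sorted by time,
# form one sequence; a Word2Vec model would then be trained on these
# sequences to obtain embedding features.
rows = [
    ("sn1", datetime(2022, 1, 1, 12, 5), "hdd | error"),
    ("sn1", datetime(2022, 1, 1, 11, 57), "memory | ecc"),
    ("sn2", datetime(2022, 1, 1, 10, 0), "processor | caterr"),
]

def msg_sequences(rows):
    """Map each sn to its msgs in chronological order."""
    by_sn = defaultdict(list)
    for sn, t, msg in rows:
        by_sn[sn].append((t, msg))
    return {sn: [m for _, m in sorted(v)] for sn, v in by_sn.items()}

print(msg_sequences(rows)["sn1"])  # msgs in time order
```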

TF-IDF features

Group by fault_id (sn + fault_time), concatenate the msg values of each fault_id into one sequence, and extract TF-IDF features from it.
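A plain TF-IDF computation over per-fault_id documents; the real solution presumably uses a library vectorizer (e.g. scikit-learn's, whose smoothed IDF differs slightly from this textbook form), and the documents below are made up:

```python
import math
from collections import Counter

# Hypothetical documents: msg text concatenated per fault_id (sn + fault_time).
docs = {
    "sn1_t1": "memory ecc memory error",
    "sn2_t1": "processor caterr",
    "sn3_t1": "memory ecc",
}

def tfidf(docs):
    """Textbook TF-IDF: tf = term count / doc length, idf = log(N / df)."""
    n = len(docs)
    df = Counter()
    for text in docs.values():
        df.update(set(text.split()))
    out = {}
    for fid, text in docs.items():
        tokens = text.split()
        tf = Counter(tokens)
        out[fid] = {t: (c / len(tokens)) * math.log(n / df[t])
                    for t, c in tf.items()}
    return out

weights = tfidf(docs)
print(round(weights["sn2_t1"]["processor"], 3))
```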

Feature selection

Feature selection mainly uses adversarial validation: discard the original labels and relabel the training set as 1 and the test set as 0, merge the two sets, train a model, and compute the AUC. If the AUC exceeds a set threshold, delete the most important feature and retrain, repeating until the AUC falls below the threshold.
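The adversarial-validation loop can be sketched as follows. The rank-based `auc` is standard, but the single-feature "model" and its importance measure are deliberate stand-ins for the CatBoost/LightGBM classifier, and the threshold is illustrative:

```python
# Adversarial validation: labels are 1 = training set, 0 = test set.
# Features whose values separate the two sets well are dropped.

def auc(labels, scores):
    """ROC AUC via the rank-sum (Mann-Whitney U) formulation, tie-aware."""
    pairs = sorted(zip(scores, labels))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # average of 1-based ranks i+1..j
        for k in range(i, j):
            if pairs[k][1] == 1:
                rank_sum_pos += avg_rank
        i = j
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def adversarial_select(features, labels, threshold=0.6):
    """Drop the most train/test-separating feature until AUC <= threshold."""
    feats = dict(features)
    while feats:
        sep = {}
        for name, vals in feats.items():
            a = auc(labels, vals)
            sep[name] = max(a, 1 - a)  # direction-agnostic separability
        top = max(sep, key=sep.get)
        if sep[top] <= threshold:
            break
        del feats[top]  # in the real procedure the model is retrained here
    return sorted(feats)

labels = [1, 1, 1, 0, 0, 0]
features = {
    "leaky": [5, 6, 7, 1, 2, 3],   # perfectly separates train vs test
    "stable": [1, 2, 3, 3, 2, 1],  # similar distribution in both sets
}
print(adversarial_select(features, labels))  # → ['stable']
```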

During model training, pseudo-labeling is used: predictions on the A/B test sets with confidence > 0.85 are treated as trusted samples and added to the training set to enlarge it.
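A sketch of the pseudo-label selection rule, with made-up predictions (`add_pseudo_labels` is a hypothetical helper):

```python
# Pseudo-labeling: test-set predictions whose max class probability exceeds
# the confidence threshold are appended to the training set with their
# predicted label.
CONF_THRESHOLD = 0.85

def add_pseudo_labels(train_X, train_y, test_X, test_probs):
    new_X, new_y = list(train_X), list(train_y)
    for x, probs in zip(test_X, test_probs):
        conf = max(probs)
        if conf > CONF_THRESHOLD:
            new_X.append(x)
            new_y.append(probs.index(conf))  # predicted class as the label
    return new_X, new_y

train_X, train_y = [[0.1], [0.9]], [0, 1]
test_X = [[0.2], [0.8], [0.5]]
test_probs = [[0.95, 0.05], [0.10, 0.90], [0.55, 0.45]]  # hypothetical
X2, y2 = add_pseudo_labels(train_X, train_y, test_X, test_probs)
print(len(X2), y2)  # third test sample is below the threshold → not added
```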

Reference:

Tianchi algorithm competition: runner-up solution for fault diagnosis based on large-scale logs!
