
Summary of log feature selection (based on Tianchi competition)

2022-07-08 01:47:00 Mark_Aussie

The data comes from the 3rd Alibaba Cloud PanJiu Zhiwei (intelligent maintenance) algorithm competition.

Official page: Questions and data of the 3rd Alibaba Cloud PanJiu Zhiwei algorithm competition - Tianchi competition - Alibaba Cloud Tianchi (aliyun.com)

Runner-up solution on GitHub: AI-Competition/3rd_PanJiu_AIOps_Competition at main · yz-intelligence/AI-Competition · GitHub

The competition provides fault work orders and log data. Analysis of the msg structure shows it can be split on '|'. Based on the actual business scenario, log entries generated within 5/10/15/30 minutes (or longer) before and after a failure may be related to that failure.

sn is the server serial number; the fault work orders involve 13,700+ distinct sn values.

The server model (server_model) and the server serial number (sn) have a one-to-many relationship.

Encode msg with TF-IDF and feed it into a linear model, then use eli5 to inspect the per-class weight of each msg token; the higher the weight, the more that token contributes to distinguishing the class.

  • Classes 0 and 1 represent CPU-related faults; processor carries the highest weight, though its discriminative power is not high.

  • Class 2 represents memory-related faults; the higher-weighted tokens are memory, mem, and ecc.

  • Class 3 represents other fault types; the higher-weighted tokens are hdd, fpga, and bus, suggesting hardware-related failures.

Main steps: data preprocessing, feature engineering, feature selection, model training, and model fusion.

  1. Data preprocessing: split the logs into different time windows according to their offset from the fault time; normalize msg by its special symbols.

  2. Feature engineering: mainly keyword features, time-difference features, TF-IDF word-frequency features, W2V features, statistical features, and new-data features.

  3. Feature selection: use adversarial validation to keep only features whose distributions are consistent between the training and test sets, improving the model's generalization on the test set.

  4. Model training: CatBoost and LightGBM models trained with pseudo-labeling.

  5. Model fusion: the CatBoost and LightGBM predictions are blended with an 8:2 weighting to produce the final prediction.
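Step 5's 8:2 blend can be sketched in a few lines; the probability vectors below are made-up placeholders and `blend` is a hypothetical helper, not code from the runner-up repo:

```python
# Weighted fusion of two models' class-probability predictions (8:2).
# cat_probs / lgb_probs are hypothetical per-sample probability vectors.

def blend(cat_probs, lgb_probs, w_cat=0.8, w_lgb=0.2):
    """Blend per-class probabilities and return the predicted class ids."""
    preds = []
    for p_cat, p_lgb in zip(cat_probs, lgb_probs):
        mixed = [w_cat * a + w_lgb * b for a, b in zip(p_cat, p_lgb)]
        preds.append(max(range(len(mixed)), key=mixed.__getitem__))
    return preds

cat_probs = [[0.7, 0.1, 0.1, 0.1], [0.2, 0.2, 0.5, 0.1]]
lgb_probs = [[0.1, 0.6, 0.2, 0.1], [0.1, 0.1, 0.7, 0.1]]
print(blend(cat_probs, lgb_probs))  # highest blended score per sample → [0, 2]
```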

Based on the actual business scenario, alarm logs may appear before a fault occurs, and a log storm may follow it. For each fault work order, construct new log subsets under different time windows, then aggregate the logs and build statistical features on each window.
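The time-window construction described above might look like this minimal stdlib sketch (the log records and window sizes are illustrative, not the competition data):

```python
from datetime import datetime, timedelta

# Hypothetical log records: (sn, log_time, msg). Slice logs into windows
# around the fault time, as the preprocessing step describes.
WINDOWS_MIN = [5, 10, 15, 30]

def logs_in_window(logs, fault_time, minutes):
    """Keep logs within ±minutes of the fault time."""
    delta = timedelta(minutes=minutes)
    return [r for r in logs if abs(r[1] - fault_time) <= delta]

fault_time = datetime(2022, 1, 1, 12, 0)
logs = [
    ("sn1", datetime(2022, 1, 1, 11, 57), "cpu | caterr"),
    ("sn1", datetime(2022, 1, 1, 11, 40), "memory | ecc"),
    ("sn1", datetime(2022, 1, 1, 12, 4), "hdd | error"),
]
# Number of logs captured by each window size.
print([len(logs_in_window(logs, fault_time, m)) for m in WINDOWS_MIN])
```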

Feature Engineering

Time-difference features

These reflect the interval between failure logs and normal logs. Construction:

  • Compute the time difference between each log time and the fault time, then derive features grouped by sn and server_model.

  • Statistics of the time difference: [max, min, median, std, var, skew, sum, mode]

  • Quantiles of the time difference: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

  • Time difference: [max, min, median, std, var, skew, sum]
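The statistical and quantile features can be computed with the stdlib `statistics` module; `diffs` is a made-up group of time differences, and `skew` is hand-rolled since the module lacks it:

```python
import statistics

# Hypothetical time differences (seconds) between each log and the fault
# time for one (sn, server_model) group.
diffs = [30.0, 60.0, 60.0, 120.0, 300.0]

def skew(xs):
    """Population skewness (third standardized moment)."""
    m = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return sum((x - m) ** 3 for x in xs) / (len(xs) * sd ** 3) if sd else 0.0

feats = {
    "max": max(diffs), "min": min(diffs),
    "median": statistics.median(diffs),
    "std": statistics.pstdev(diffs), "var": statistics.pvariance(diffs),
    "skew": skew(diffs), "sum": sum(diffs),
    "mode": statistics.mode(diffs),
}
# Quantile features at [0.1, ..., 0.9]: nine decile cut points.
quantiles = statistics.quantiles(diffs, n=10)
print(feats["max"], feats["mode"], len(quantiles))
```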

Keyword features

Keywords carry class-specific signal. To build keyword features you must first find the keywords; there are two ways to determine them:

  • Scheme 1: TF-IDF-encode msg, feed it into a linear model, compute each word's weight per class, and take the TOP 20 per class.

  • Scheme 2: tokenize msg on '|', count word frequencies per class, and take the TOP 20 per class.

  • Take the union of schemes 1 and 2 as the final keyword set.

  • For each keyword, record whether it appears in msg.
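A minimal sketch of the presence features, assuming a hypothetical final keyword set:

```python
# Hypothetical keyword set (the union of schemes 1 and 2 described above)
# and 0/1 presence flags per msg.
KEYWORDS = ["processor", "memory", "ecc", "hdd", "fpga", "bus"]

def keyword_flags(msg):
    """One binary feature per keyword: does it appear in msg?"""
    text = " ".join(t.strip().lower() for t in msg.split("|"))
    return {f"kw_{k}": int(k in text) for k in KEYWORDS}

print(keyword_flags("Memory | ECC | correctable error"))
```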

Statistical features: group by the categorical features and build statistics so that the information hidden in those features is fully exposed.

  • Grouped by sn, statistics of server_model: [count, nunique, freq, rank]

  • Grouped by sn, statistics of the logs msg, msg_0, msg_1, msg_2: [count, nunique, freq, rank]
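A stdlib sketch of the grouped statistics (count, nunique, and freq shown; rank is omitted for brevity); the rows and column indices are illustrative:

```python
from collections import Counter

# Hypothetical log rows: (sn, server_model, msg).
rows = [
    ("sn1", "M1", "memory | ecc"),
    ("sn1", "M1", "memory | ecc"),
    ("sn1", "M1", "hdd | error"),
    ("sn2", "M2", "processor | caterr"),
]

def group_stats(rows, sn, col):
    """count / nunique / freq (count of the most common value) for one sn."""
    values = [r[col] for r in rows if r[0] == sn]
    counts = Counter(values)
    return {
        "count": len(values),
        "nunique": len(counts),
        "freq": counts.most_common(1)[0][1],
    }

print(group_stats(rows, "sn1", 2))  # statistics of msg for sn1
```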

W2V features: reflect the semantic information of msg

  • Group by sn and sort msg by time; for each sn, treat the sorted msg values as one sequence and extract embedding features from it.
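A sketch of building the per-sn, time-ordered msg sequences; the embedding step itself (e.g. training gensim's Word2Vec on these sequences) is not shown here:

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical rows: (sn, log_time, msg). Each sn's msgs, sorted by time,
# form one sequence; a Word2Vec model would then be trained on these
# sequences to obtain embedding features.
rows = [
    ("sn1", datetime(2022, 1, 1, 12, 5), "hdd | error"),
    ("sn1", datetime(2022, 1, 1, 11, 57), "memory | ecc"),
    ("sn2", datetime(2022, 1, 1, 10, 0), "processor | caterr"),
]

def msg_sequences(rows):
    """Map each sn to its msgs in chronological order."""
    by_sn = defaultdict(list)
    for sn, t, msg in rows:
        by_sn[sn].append((t, msg))
    return {sn: [m for _, m in sorted(v)] for sn, v in by_sn.items()}

print(msg_sequences(rows)["sn1"])  # msgs in time order
```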

TF-IDF features

Group by fault_id (sn + fault_time), concatenate the msg values of each fault_id into one sequence, and extract TF-IDF features from it.
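A plain TF-IDF computation over per-fault_id documents; the real solution presumably uses a library vectorizer (e.g. scikit-learn's, whose smoothed IDF differs slightly from this textbook form), and the documents below are made up:

```python
import math
from collections import Counter

# Hypothetical documents: msg text concatenated per fault_id (sn + fault_time).
docs = {
    "sn1_t1": "memory ecc memory error",
    "sn2_t1": "processor caterr",
    "sn3_t1": "memory ecc",
}

def tfidf(docs):
    """Textbook TF-IDF: tf = term count / doc length, idf = log(N / df)."""
    n = len(docs)
    df = Counter()
    for text in docs.values():
        df.update(set(text.split()))
    out = {}
    for fid, text in docs.items():
        tokens = text.split()
        tf = Counter(tokens)
        out[fid] = {t: (c / len(tokens)) * math.log(n / df[t])
                    for t, c in tf.items()}
    return out

weights = tfidf(docs)
print(round(weights["sn2_t1"]["processor"], 3))
```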

Feature selection

Feature selection mainly uses adversarial validation: discard the original labels and relabel the training set as 1 and the test set as 0, merge the two sets, train a model, and compute the AUC. If the AUC exceeds a set threshold, delete the most important feature and retrain, repeating until the AUC falls below the threshold.
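The adversarial-validation loop can be sketched as follows. The rank-based `auc` is standard, but the single-feature "model" and its importance measure are deliberate stand-ins for the CatBoost/LightGBM classifier, and the threshold is illustrative:

```python
# Adversarial validation: labels are 1 = training set, 0 = test set.
# Features whose values separate the two sets well are dropped.

def auc(labels, scores):
    """ROC AUC via the rank-sum (Mann-Whitney U) formulation, tie-aware."""
    pairs = sorted(zip(scores, labels))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # average of 1-based ranks i+1..j
        for k in range(i, j):
            if pairs[k][1] == 1:
                rank_sum_pos += avg_rank
        i = j
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def adversarial_select(features, labels, threshold=0.6):
    """Drop the most train/test-separating feature until AUC <= threshold."""
    feats = dict(features)
    while feats:
        sep = {}
        for name, vals in feats.items():
            a = auc(labels, vals)
            sep[name] = max(a, 1 - a)  # direction-agnostic separability
        top = max(sep, key=sep.get)
        if sep[top] <= threshold:
            break
        del feats[top]  # in the real procedure the model is retrained here
    return sorted(feats)

labels = [1, 1, 1, 0, 0, 0]
features = {
    "leaky": [5, 6, 7, 1, 2, 3],   # perfectly separates train vs test
    "stable": [1, 2, 3, 3, 2, 1],  # similar distribution in both sets
}
print(adversarial_select(features, labels))  # → ['stable']
```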

During model training, pseudo-labeling is used: predictions on the A/B test sets with confidence > 0.85 are treated as trusted samples and added to the training set to enlarge it.
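A sketch of the pseudo-label selection rule, with made-up predictions (`add_pseudo_labels` is a hypothetical helper):

```python
# Pseudo-labeling: test-set predictions whose max class probability exceeds
# the confidence threshold are appended to the training set with their
# predicted label.
CONF_THRESHOLD = 0.85

def add_pseudo_labels(train_X, train_y, test_X, test_probs):
    new_X, new_y = list(train_X), list(train_y)
    for x, probs in zip(test_X, test_probs):
        conf = max(probs)
        if conf > CONF_THRESHOLD:
            new_X.append(x)
            new_y.append(probs.index(conf))  # predicted class as the label
    return new_X, new_y

train_X, train_y = [[0.1], [0.9]], [0, 1]
test_X = [[0.2], [0.8], [0.5]]
test_probs = [[0.95, 0.05], [0.10, 0.90], [0.55, 0.45]]  # hypothetical
X2, y2 = add_pseudo_labels(train_X, train_y, test_X, test_probs)
print(len(X2), y2)  # third test sample is below the threshold → not added
```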

Reference:

Tianchi algorithm competition: runner-up solution for fault diagnosis based on large-scale logs!
