Summary of log feature selection (based on Tianchi competition)
2022-07-08 01:47:00 【Mark_ Aussie】
The data comes from the third Alibaba Cloud PanJiu AIOps algorithm competition.
Official address: Problem statement and data of the third Alibaba Cloud PanJiu AIOps algorithm competition - Tianchi competition - Alibaba Cloud Tianchi (aliyun.com)
Runner-up solution on GitHub: AI-Competition/3rd_PanJiu_AIOps_Competition at main · yz-intelligence/AI-Competition · GitHub
The competition provides fault work orders and log data. Analysing the structure of msg shows that it can be split on the | character. In the real business scenario, logs generated in the 5/10/15/30 minutes (or longer) before and after a fault may be related to that fault.
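As a rough illustration of the two ideas above, the sketch below splits msg on | and counts how many logs fall within each time window around a fault. The record layout (the "time" and "msg" keys), the helper names and the sample rows are assumptions made for this example, not the competition's actual schema.

```python
from datetime import datetime, timedelta

WINDOWS_MIN = (5, 10, 15, 30)  # window sizes mentioned in the text, in minutes


def split_msg(msg):
    """Split a raw msg on '|' and strip whitespace around each part."""
    return [part.strip() for part in msg.split("|") if part.strip()]


def logs_in_window(logs, fault_time, minutes):
    """Return the logs whose time lies within +/- `minutes` of fault_time."""
    delta = timedelta(minutes=minutes)
    return [log for log in logs if abs(log["time"] - fault_time) <= delta]


# Hypothetical sample data, not taken from the competition files.
fault_time = datetime(2020, 1, 1, 12, 0)
logs = [
    {"time": datetime(2020, 1, 1, 11, 57), "msg": "Processor CPU0 | IERR | Asserted"},
    {"time": datetime(2020, 1, 1, 11, 48), "msg": "Memory DIMM3 | Correctable ECC | Asserted"},
    {"time": datetime(2020, 1, 1, 12, 20), "msg": "System Boot | Initiated by power up | Asserted"},
    {"time": datetime(2020, 1, 1, 13, 30), "msg": "Drive Slot | Drive Present | Asserted"},
]

# Number of logs inside each window around the fault.
window_counts = {m: len(logs_in_window(logs, fault_time, m)) for m in WINDOWS_MIN}
parts = split_msg(logs[0]["msg"])
print(window_counts)
print(parts)
```

Each window can then be aggregated separately, so that the same fault yields one set of features per window size.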


sn is the server serial number; the fault work orders contain more than 13,700 distinct sn values.
The server model server_model and the server serial number sn are in a one-to-many relationship (one model, many serial numbers).
After TF-IDF encoding, msg is fed into a linear model, and eli5 is used to obtain, for each class, the contribution of each word in msg. The higher the weight, the more the word contributes to distinguishing that class.
Classes 0 and 1 are CPU-related faults: processor has the highest weight, and its discriminative power is not very high.
Class 2 is memory-related faults: the highest-weighted words are memory, mem and ecc.
Class 3 is other fault types: the highest-weighted words are hdd, fpga and bus, so these may be hardware-related faults.

Main steps: data preprocessing, feature engineering, feature selection, model training, model fusion.
Data preprocessing: split the logs into different time windows according to their interval from the fault time; normalize msg on special symbols.
Feature engineering: mainly builds keyword features, time-difference features, TF-IDF word-frequency statistical features, W2V features, statistical features, and New Data features.
Feature selection: adversarial validation is used to select features, which keeps the training and test sets consistent and improves the model's generalization on the test set.
Model training: CatBoost and LightGBM are trained with pseudo-labeling.
Model fusion: the CatBoost and LightGBM predictions are blended with weights of 8:2 to obtain the final prediction.
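The weighted fusion step can be sketched as a plain weighted average of class probabilities. This assumes the weight of 8 applies to CatBoost and 2 to LightGBM, and that both models output one probability per class; the numbers below are made up.

```python
def blend(proba_a, proba_b, weight_a=0.8, weight_b=0.2):
    """Weighted average of two per-class probability matrices (lists of rows)."""
    return [
        [weight_a * pa + weight_b * pb for pa, pb in zip(row_a, row_b)]
        for row_a, row_b in zip(proba_a, proba_b)
    ]


# Hypothetical class probabilities for two samples and four classes.
catboost_proba = [[0.7, 0.1, 0.1, 0.1], [0.2, 0.2, 0.5, 0.1]]
lightgbm_proba = [[0.4, 0.4, 0.1, 0.1], [0.1, 0.1, 0.2, 0.6]]

blended = blend(catboost_proba, lightgbm_proba)
# Final prediction: the class with the highest blended probability.
predicted = [row.index(max(row)) for row in blended]
print(predicted)
```

In the second sample LightGBM alone would pick class 3, but the 8:2 weighting lets the CatBoost vote for class 2 win.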
In the real business scenario, alarm logs may be generated before a fault occurs, and a log storm may follow it. For each fault work order, new log data is constructed using different time splits, and statistical features are built after aggregating the logs.
Feature engineering:
Time-difference features
Reflect the interval between fault logs and normal logs. Construction method:
Compute the time difference between each log time and the fault time, then derive group features by sn and server_model.
Statistical features of the time difference: [max, min, median, std, var, skew, sum, mode]
Quantile features of the time difference: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
Time difference: [max, min, median, std, var, skew, sum]
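The statistics listed above could be computed for one group of logs roughly as follows. This is a sketch using only the standard library; the function name, the use of population (not sample) statistics, and the choice of seconds as the unit are assumptions.

```python
import statistics
from datetime import datetime


def time_diff_features(log_times, fault_time):
    """Summarise the (log time - fault time) differences, in seconds.

    Needs at least two log times, because of statistics.quantiles.
    """
    diffs = [(t - fault_time).total_seconds() for t in log_times]
    mean = statistics.fmean(diffs)
    std = statistics.pstdev(diffs)
    # Population skewness; defined here as 0.0 when all differences are equal.
    skew = 0.0 if std == 0 else sum((d - mean) ** 3 for d in diffs) / (len(diffs) * std ** 3)
    feats = {
        "max": max(diffs),
        "min": min(diffs),
        "median": statistics.median(diffs),
        "std": std,
        "var": statistics.pvariance(diffs),
        "skew": skew,
        "sum": sum(diffs),
        "mode": statistics.mode(diffs),
    }
    # n=10 yields nine cut points, i.e. the 0.1 ... 0.9 quantiles.
    cuts = statistics.quantiles(diffs, n=10, method="inclusive")
    for i, q in enumerate(cuts, start=1):
        feats[f"q{i / 10:.1f}"] = q
    return feats


# Hypothetical log times for a single sn around one fault.
fault_time = datetime(2020, 1, 1, 12, 0)
log_times = [
    datetime(2020, 1, 1, 11, 50),
    datetime(2020, 1, 1, 11, 55),
    datetime(2020, 1, 1, 11, 55),
    datetime(2020, 1, 1, 12, 0),
    datetime(2020, 1, 1, 12, 10),
]
feats = time_diff_features(log_times, fault_time)
print(feats)
```

Applying the function once per sn (or per server_model) group gives the grouped variants of these features.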
Keyword features
Capture the influence of keywords on each class. To construct keyword features, the keywords must first be found. There are two ways to determine them:
Method 1: TF-IDF-encode msg, feed it into a linear model, compute the keyword weights for each class, and take the top 20 per class.
Method 2: tokenize msg on ' | ', count the word frequency within each class, and take the top 20 per class.
The union of Method 1 and Method 2 gives the final keyword set.
For each keyword, a feature records whether it appears in msg.
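A small sketch of Method 2 and of the final presence features. The Method 1 keywords are represented by a hand-written placeholder set, matching is done on whole '|'-separated tokens, and all names and sample messages are assumptions for illustration.

```python
from collections import Counter, defaultdict


def tokenize(msg):
    """Split msg on '|' and lowercase each part."""
    return [part.strip().lower() for part in msg.split("|") if part.strip()]


def top_keywords_by_class(msgs, labels, k=20):
    """Method 2: the k most frequent tokens within each class."""
    counts = defaultdict(Counter)
    for msg, label in zip(msgs, labels):
        counts[label].update(tokenize(msg))
    return {label: {tok for tok, _ in c.most_common(k)} for label, c in counts.items()}


def keyword_features(msg, keywords):
    """One 0/1 feature per keyword: does it appear as a token in msg?"""
    tokens = set(tokenize(msg))
    return {f"kw_{kw}": int(kw in tokens) for kw in keywords}


# Hypothetical messages and class labels.
msgs = [
    "Processor | IERR | Asserted",
    "Processor | Configuration Error",
    "Memory | ECC | Asserted",
    "Memory | ECC",
]
labels = [0, 0, 2, 2]

per_class = top_keywords_by_class(msgs, labels, k=2)
# Placeholder for the keywords that Method 1 (linear-model weights) would return.
method1_keywords = {"processor", "ecc"}
keywords = sorted(set().union(*per_class.values()) | method1_keywords)

feats = keyword_features("Memory | ECC | Deasserted", keywords)
print(feats)
```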
Statistical features: built by grouping on the categorical features, so that the information hidden in the categorical features is fully exposed.
Grouped by sn, statistical features of server_model: [count, nunique, freq, rank]
Grouped by sn, statistical features of the logs msg, msg_0, msg_1, msg_2: [count, nunique, freq, rank]
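The grouped statistics might look like this with pandas. The text does not define freq and rank precisely, so the definitions used here (share of the most common value, and dense rank of the count across sn) are assumptions, as are the column values.

```python
import pandas as pd

# Hypothetical log rows; msg_0 stands for the first '|'-separated part of msg.
logs = pd.DataFrame({
    "sn": ["A", "A", "A", "B", "B"],
    "server_model": ["SM1", "SM1", "SM1", "SM2", "SM2"],
    "msg_0": ["processor", "processor", "memory", "hdd", "hdd"],
})

# count and nunique of msg_0 within each sn.
stats = logs.groupby("sn").agg(
    msg_0_count=("msg_0", "count"),
    msg_0_nunique=("msg_0", "nunique"),
).reset_index()

# freq (assumed definition): share of an sn's logs taken by its most common msg_0.
stats["msg_0_freq"] = (
    logs.groupby("sn")["msg_0"]
    .agg(lambda s: s.value_counts(normalize=True).iloc[0])
    .values
)
# rank (assumed definition): dense rank of the log count across sn values.
stats["msg_0_count_rank"] = stats["msg_0_count"].rank(method="dense", ascending=False)

# Attach the per-sn statistics back onto every log row.
features = logs.merge(stats, on="sn", how="left")
print(features)
```

The same pattern applies to server_model, msg, msg_1 and msg_2 by swapping the column name.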
W2V features: reflect the semantic information of msg
Group by sn and sort msg by time; for each sn, treat the sorted msg values as a sequence and extract embedding features.
TF-IDF features
Group by fault_id (sn + fault_time); for each fault_id, concatenate the msg values into one sequence and extract TF-IDF features.
Feature selection
Feature selection mainly relies on adversarial validation. The original label is removed from the training and test sets and replaced by a new one: 1 for the training set, 0 for the test set. The two sets are merged, a model is trained on them, and the AUC is computed. If the AUC is greater than the set threshold, the feature with the highest importance is deleted and the model is retrained, until the AUC falls below the threshold.
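The loop described above can be sketched as follows. This version uses a scikit-learn RandomForestClassifier and out-of-fold AUC as stand-ins; the model, the threshold of 0.7, the function name and the synthetic data are all assumptions, not the settings of the original solution.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict


def adversarial_select(X_train, X_test, names, threshold=0.7):
    """Drop the most important feature until train and test are hard to tell apart."""
    X = np.vstack([X_train, X_test])
    # New label: 1 for training rows, 0 for test rows.
    y = np.r_[np.ones(len(X_train)), np.zeros(len(X_test))]
    keep = list(range(X.shape[1]))
    dropped = []
    while keep:
        model = RandomForestClassifier(n_estimators=50, random_state=0)
        # Out-of-fold probabilities give an honest AUC estimate.
        proba = cross_val_predict(model, X[:, keep], y, cv=5, method="predict_proba")[:, 1]
        if roc_auc_score(y, proba) <= threshold:
            break
        # Refit on all rows to read feature importances, then drop the top one.
        model.fit(X[:, keep], y)
        worst = keep[int(np.argmax(model.feature_importances_))]
        dropped.append(names[worst])
        keep.remove(worst)
    return [names[i] for i in keep], dropped


# Synthetic data: the first feature is shifted in the test set, the others are not.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))
X_test = rng.normal(size=(200, 3))
X_test[:, 0] += 3.0
names = ["shifted", "stable_a", "stable_b"]

kept, dropped = adversarial_select(X_train, X_test, names)
print(kept, dropped)
```

A feature whose distribution differs between training and test sets makes the two easy to separate, so removing it pushes the AUC back towards 0.5.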
During model training, pseudo-labeling is used: from the predictions on test sets A and B, samples with confidence > 0.85 are selected as trusted samples and added to the training set to increase the sample size.
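A minimal sketch of the selection step, assuming "confidence" means the highest predicted class probability of a sample. The feature rows, labels and probabilities below are invented.

```python
def select_pseudo_labels(probabilities, threshold=0.85):
    """Return (row index, predicted class) for confident test predictions."""
    selected = []
    for i, row in enumerate(probabilities):
        confidence = max(row)
        if confidence > threshold:
            selected.append((i, row.index(confidence)))
    return selected


# Hypothetical training data and test-set predictions (four classes).
train_X = [[0.1], [0.2]]
train_y = [0, 1]
test_X = [[0.3], [0.4], [0.5]]
test_proba = [
    [0.90, 0.05, 0.03, 0.02],
    [0.40, 0.30, 0.20, 0.10],
    [0.05, 0.05, 0.88, 0.02],
]

pseudo = select_pseudo_labels(test_proba)
# Append the trusted test samples, with their predicted labels, to the training set.
aug_X = train_X + [test_X[i] for i, _ in pseudo]
aug_y = train_y + [label for _, label in pseudo]
print(pseudo, aug_y)
```

The model is then retrained on the augmented set; the second test sample is left out because no class reaches the 0.85 threshold.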
Reference:
Tianchi algorithm competition: runner-up solution for fault diagnosis based on large-scale logs!