Baseline for indoor user movement time series classification in the 2020 CCF BDCI training competition

2020-11-10 11:27:00 osc-u 2koojuzp

Introduction to the competition

Competition title: Indoor user movement time series data classification

Track: Training track

Background: As data accumulates, the demand for processing massive amounts of time series information is becoming increasingly prominent. Time series classification, one of the key tasks in time series analysis, has wide and varied applications. Its goal is to assign a discrete label to a time series. Traditional feature extraction algorithms use statistical information from the series as the basis for classification. In recent years, time series classification based on deep learning has made great progress: because features are extracted end to end, deep learning avoids tedious manual feature design. Classifying time series effectively, that is, grouping sequences of a given form from a complex dataset into the same set, is of great significance for both academic research and industrial applications.

Task: Given these practical needs and the progress of deep learning, this training competition aims to build a general time series classification algorithm. Participants are expected to build an accurate time series classification model and to explore more robust representations of time series features.

Competition link: https://www.datafountain.cn/competitions/484

Data overview

The data is compiled from the open UCI datasets available online (desensitized). It covers 2 classes of time series. Datasets of this kind are widely used in business scenarios that involve time series classification.

File category       File name                 Contents
Training set        train.csv                 Training dataset label file; the label column is CLASS
Test set            test.csv                  Test dataset file; no labels
Field description   Field description.xlsx    Detailed description of the training set / test set XXX fields
Submission sample   Ssample_submission.csv    Contains only two fields, ID and CLASS
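
Since the submission file has only the two fields ID and CLASS, a quick format check before uploading can save a wasted submission. The following is a minimal sketch using only the standard library. The function name validate_submission is hypothetical, and the rule that CLASS must parse to 0 or 1 is an assumption based on the two label values in the training data, not a documented platform requirement.

```python
import csv
import io


def validate_submission(csv_text):
    """Return a list of problems found in a submission; an empty list means none were found."""
    reader = csv.DictReader(io.StringIO(csv_text))
    problems = []
    # The submission sample has exactly these two fields.
    if reader.fieldnames != ["ID", "CLASS"]:
        problems.append("unexpected header: {}".format(reader.fieldnames))
        return problems
    for line_no, row in enumerate(reader, start=2):
        try:
            value = float(row["CLASS"])
        except (TypeError, ValueError):
            problems.append("line {}: CLASS is not a number".format(line_no))
            continue
        # Assumption: only the two class labels 0 and 1 are valid.
        if value not in (0.0, 1.0):
            problems.append("line {}: CLASS must be 0 or 1".format(line_no))
    return problems


good = "ID,CLASS\n210,0\n211,1\n"
bad = "ID,CLASS\n210,0\n211,2\n212,abc\n"
print(validate_submission(good))
print(validate_submission(bad))
```

To check a real file, read it as text and pass the contents to the function.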

Data analysis

This is a binary classification problem. Inspecting the training set shows that the dataset is very small (210 samples) yet has many features (240), and that the labels 0 and 1 are evenly distributed (about half each). Under these conditions, applying a neural network directly would mean training too many parameters on too little data, which leads to unsatisfactory results. A tree model would require tuning a number of hyperparameters to fit the data, which is also cumbersome. For these reasons, this article uses one of the simplest options, a support vector machine, and it turns out to give good results.
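
The concern about parameter count can be made concrete with a little arithmetic. The sketch below counts the weights and biases of a small fully connected network on 240 input features and compares the total with the 210 training samples. The helper mlp_param_count and the hidden layer width of 64 are illustrative assumptions, not something used in the original baseline.

```python
def mlp_param_count(layer_sizes):
    """Count weights plus biases of a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


n_samples, n_features = 210, 240   # figures from the data analysis
hidden_units = 64                  # hypothetical hidden layer width

# One hidden layer and a single output unit for the binary label.
n_params = mlp_param_count([n_features, hidden_units, 1])
print(n_params)                # total trainable parameters
print(n_params / n_samples)    # parameters per training sample
```

Even this small network has far more trainable parameters than there are training samples, which is the situation the analysis warns about.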

Baseline Program

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.svm import SVR

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

# Separate features and labels
X_train_c = train.drop(['ID', 'CLASS'], axis=1).values
y_train_c = train['CLASS'].values
X_test_c = test.drop(['ID'], axis=1).values

nfold = 5
kf = KFold(n_splits=nfold, shuffle=True, random_state=2020)
prediction1 = np.zeros(len(X_test_c))

for i, (train_index, valid_index) in enumerate(kf.split(X_train_c, y_train_c)):
    X_train, label_train = X_train_c[train_index], y_train_c[train_index]
    X_valid, label_valid = X_train_c[valid_index], y_train_c[valid_index]
    # SVR is a regressor: it is fitted on the 0/1 labels and its output is rounded to a class
    clf = SVR(kernel='rbf', C=1, gamma='scale')
    clf.fit(X_train, label_train)
    # Accuracy on the held-out fold after rounding the regression output
    valid_pred = np.clip(np.round(clf.predict(X_valid)), 0, 1)
    print("Fold {}: validation accuracy {:.4f}".format(i + 1, np.mean(valid_pred == label_valid)))
    # Average the test predictions over the folds
    prediction1 += clf.predict(X_test_c) / nfold

# Round the averaged output to a class and keep it inside {0, 1}
result1 = np.clip(np.round(prediction1), 0, 1).astype(int)
# Use the IDs from test.csv (210 to 313 in this dataset)
df = pd.DataFrame({'ID': test['ID'], 'CLASS': result1})
df.to_csv("baseline.csv", index=False)
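
The baseline fits a regressor (SVR) and rounds its output. As a point of comparison, here is a minimal sketch of the same idea with a true classifier (SVC) and stratified folds, which keep the 0/1 ratio equal in every fold. This is an alternative, not the method that produced the reported score. The synthetic data from make_classification merely stands in for train.csv with the same shape (210 samples, 240 features), and the StandardScaler step is an added assumption, so the printed accuracies say nothing about the competition data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in with the same shape as the training set (assumption).
X, y = make_classification(n_samples=210, n_features=240,
                           n_informative=20, random_state=2020)

# Scale the features, then fit an RBF-kernel support vector classifier.
model = make_pipeline(StandardScaler(),
                      SVC(kernel="rbf", C=1, gamma="scale"))

# Stratified folds preserve the class ratio in each split.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=2020)
scores = cross_val_score(model, X, y, cv=skf, scoring="accuracy")

print(scores)         # accuracy on each of the five folds
print(scores.mean())  # mean cross-validated accuracy
```

To try this on the real data, replace X and y with the feature matrix and CLASS column loaded from train.csv.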

Submission results

Submitting the baseline gives a score of 0.83653846154.
Because the training data is split into five folds, the score of a submission may fluctuate slightly.
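
A short calculation helps to read the score. The sketch below assumes the score is plain accuracy over all test rows, which the source does not state. Under that assumption, the reported value corresponds to a whole number of correct predictions, and each additional correct or incorrect prediction moves the score by a bit less than one percentage point.

```python
# Test IDs run from 210 to 313, so there are 104 test rows.
n_test = len(range(210, 314))
reported_score = 0.83653846154

# Assumption: the score is accuracy, i.e. correct predictions / test rows.
n_correct = round(reported_score * n_test)
step = 1 / n_test  # change in score per prediction that flips

print(n_test, n_correct)
print(n_correct / n_test)  # should match the reported score closely
print(step)
```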

Copyright notice
This article was written by [osc-u 2koojuzp]. Please include a link to the original when reposting. Thank you.