
Baseline for indoor user time series data classification (2020 CCF BDCI training competition)

2020-11-10 11:27:00 【osc-u 2koojuzp】


Indoor user time series data classification

Introduction to the contest question
Data brief
Data analysis
Baseline Program
Submit results

Introduction to the contest question

Competition title: Indoor user movement time series data classification

Track: Training track

Background: As data accumulates, the demand for processing massive amounts of time series information is becoming increasingly prominent. Time series classification, one of the important tasks in time series analysis, has wide and varied applications. Its purpose is to assign a discrete label to a time series. Traditional feature extraction algorithms use statistical information from the time series as the basis for classification. In recent years, time series classification based on deep learning has made great progress. With end-to-end feature extraction, deep learning can avoid tedious manual feature design. Classifying time series effectively, that is, grouping sequences that share a certain form out of a complex data set into the same set, is of great significance for academic research and industrial applications.

Task: Based on the practical needs above and the progress of deep learning, this training competition aims to build a general time series classification algorithm. By building an accurate time series classification model for this problem, participants are encouraged to explore more robust representations of time series features.

Competition link: https://www.datafountain.cn/competitions/484

Data brief

The data is collated from the open UCI data sets on the Internet (desensitized). The dataset covers 2 different classes of time series. Datasets of this kind are widely used in business scenarios for time series classification.

File category	File name	Contents
Training set	train.csv	Training data file, with the label CLASS
Test set	test.csv	Test data file, without labels
Field description	Field description.xlsx	Description of the XXX fields of the training set / test set
Submission sample	Ssample_submission.csv	Contains only two fields: ID and CLASS

Data analysis

This is a binary classification problem. Inspecting the training set shows that the amount of data is very small (210 samples) while the number of features is large (240), and that the labels 0 and 1 in the training data are very evenly distributed (about half each). Given this, directly using a neural network would mean too many parameters to train, leading to unsatisfactory results. A tree model would need some hyperparameter tuning to fit the data, which is also complicated. Based on this analysis, this article uses the simplest option, a support vector machine, for classification, and the results show that it performs well.
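
To make the "too many parameters" argument concrete, here is a rough count for a small fully connected network. The hidden width of 64 and the helper name mlp_param_count are assumptions chosen for illustration; only the 210 samples and 240 features come from the text above.

```python
def mlp_param_count(layer_sizes):
    """Count weights and biases of a dense network with the given layer widths."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


n_samples, n_features = 210, 240   # figures quoted in the text
hidden = 64                        # assumed hidden width, illustration only

# One hidden layer followed by a single output unit.
params = mlp_param_count([n_features, hidden, 1])
print(params)                      # 15489 trainable parameters
print(params / n_samples)          # parameters per training sample
```

Even this small network has far more parameters than there are training samples, which is the imbalance the paragraph above points to.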

Baseline Program

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.svm import SVR

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

# Separate features and labels
X_train_c = train.drop(['ID', 'CLASS'], axis=1).values
y_train_c = train['CLASS'].values
X_test_c = test.drop(['ID'], axis=1).values

nfold = 5
kf = KFold(n_splits=nfold, shuffle=True, random_state=2020)
# Running average of the test predictions over all folds
prediction1 = np.zeros(len(X_test_c))

for i, (train_index, valid_index) in enumerate(kf.split(X_train_c, y_train_c)):
    print("\nFold {}".format(i + 1))
    X_train, label_train = X_train_c[train_index], y_train_c[train_index]
    X_valid, label_valid = X_train_c[valid_index], y_train_c[valid_index]
    # SVR is a regressor: it is fit on the 0/1 labels and its output is rounded later
    clf = SVR(kernel='rbf', C=1, gamma='scale')
    clf.fit(X_train, label_train)
    x1 = clf.predict(X_valid)
    y1 = clf.predict(X_test_c)
    # Validation accuracy after rounding the regression output to the nearest label
    print("valid acc: {:.4f}".format(np.mean(np.round(x1) == label_valid)))
    prediction1 += y1 / nfold

# Round the averaged output, clip it to the valid labels 0/1 and store as integers
result1 = np.clip(np.round(prediction1), 0, 1).astype(int)
id_ = range(210, 314)  # test IDs, hard-coded as in the original baseline
df = pd.DataFrame({'ID': id_, 'CLASS': result1})
df.to_csv("baseline.csv", index=False)
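
The baseline fits SVR, which is a regressor, and rounds its output to get class labels. A minimal sketch of one alternative, which is not the author's method, is to fit the classifier SVC behind a StandardScaler and estimate accuracy with stratified folds. The synthetic data from make_classification only mimics the 210 x 240 shape described in this article; the feature scaling step and the n_informative value are assumptions made for illustration, and the scores printed here say nothing about the real competition data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for train.csv: 210 rows, 240 features, 2 balanced classes.
X, y = make_classification(n_samples=210, n_features=240,
                           n_informative=20, random_state=2020)

# Scale features inside the pipeline so each fold is scaled on its own training part.
model = make_pipeline(StandardScaler(),
                      SVC(kernel='rbf', C=1, gamma='scale'))

# Stratified folds keep the 0/1 ratio roughly equal in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2020)
scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')

print(scores)          # one accuracy value per fold
print(scores.mean())   # average cross-validated accuracy
```

To try this on the competition files, X and y would be replaced by the X_train_c and y_train_c arrays built in the baseline.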

Submit results

Submitting the baseline gives a score of 0.83653846154.
Because the data is split into 5 folds, the score of the submitted results may fluctuate slightly.
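
As a rough sanity check on how much the score can move, the sketch below assumes the leaderboard metric is plain accuracy over the 104 test IDs hard-coded in the baseline (range(210, 314)). The metric is not stated in this article, so this is only an assumption.

```python
from fractions import Fraction

n_test = len(range(210, 314))   # 104 test IDs, as hard-coded in the baseline
reported = 0.83653846154        # score quoted in the text

# Closest number of correct rows if the metric were plain accuracy (assumption).
n_correct = round(reported * n_test)
implied = Fraction(n_correct, n_test)

# Score change caused by flipping a single test prediction.
step = 1 / n_test

print(n_correct, float(implied), step)
```

Under that assumption the reported score matches 87 correct rows out of 104, and each changed prediction shifts the score by a little under 0.01.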

Copyright notice
This article was written by [osc-u 2koojuzp]. Please include a link to the original when reposting. Thank you.