当前位置：网站首页>【Pick-in】Advertising-information flow cross-domain CTR estimation (to be updated)

【Pick-in】Advertising-information flow cross-domain CTR estimation (to be updated)

2022-08-04 16:02:00 【YKbsmn】

赛题介绍

广告推荐主要基于用户对广告的历史曝光、点击等行为进行建模,如果只是使用广告域数据,用户行为数据稀疏,行为类型相对单一.而引入同一媒体的跨域数据,可以获得同一广告用户在其他域的行为数据,深度挖掘用户兴趣,丰富用户行为特征.引入其他媒体的广告用户行为数据,也能丰富用户和广告特征.

本赛题希望选手基于广告日志数据,用户基本信息和跨域数据优化广告ctr预估准确率.目标域For the AD domain,源域Recommend the domain for information flow,通过获取用户在信息流域中曝光、点击信息流等行为数据,进行用户兴趣建模,帮助广告域CTR的精准预估.

Task1 Competition registration and try

比赛网址：https://developer.huawei.com/consumer/cn/activity/digixActivity/digixdetail/101655281685926449?ha_source=co&ha_sourceId=89000234

After signing up -> 下载比赛数据

Baseline尝试

#安装相关依赖库 如果是windows系统,cmd命令框中输入pip安装,参考上述环境配置
#!pip install sklearn
#!pip install pandas

#---------------------------------------------------
#导入库
import pandas as pd

#----------------数据探索----------------
# 只使用Target domain user behavior数据
train_ads = pd.read_csv('./train/train_data_ads.csv',
    usecols=['log_id', 'label', 'user_id', 'age', 'gender', 'residence', 'device_name',
            'device_size', 'net_type', 'task_id', 'adv_id', 'creat_type_cd'])

test_ads = pd.read_csv('./test/test_data_ads.csv',
    usecols=['log_id', 'user_id', 'age', 'gender', 'residence', 'device_name',
    'device_size', 'net_type', 'task_id', 'adv_id', 'creat_type_cd'])
    
# 数据集采样  
train_ads = pd.concat([
    train_ads[train_ads['label'] == 0].sample(70000),
    train_ads[train_ads['label'] == 1].sample(10000),
])


#----------------模型训练----------------
# 加载训练逻辑回归模型
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(
    train_ads.drop(['log_id', 'label', 'user_id'], axis=1),
    train_ads['label']
)


#----------------结果输出----------------
# 模型预测与生成结果文件
test_ads['pctr'] = clf.predict_proba(
    test_ads.drop(['log_id', 'user_id'], axis=1),
)[:, 1]
test_ads[['log_id', 'pctr']].to_csv('submission.csv',index=None)

提交结果为：

Task2 比赛数据分析

导入相关库

#安装相关依赖库 如果是windows系统,cmd命令框中输入pip安装,参考上述环境配置
#!pip install sklearn
#!pip install pandas
#!pip install catboost
# 如果有下载Anaconda,可以创建虚拟环境,然后输入：conda install -c https://conda.anaconda.org/conda-forge catboost
#---------------------------------------------------
#导入库
#----------------数据探索----------------
import pandas as pd
import numpy as np
import os
import gc
import matplotlib.pyplot as plt
from tqdm import *
#----------------核心模型----------------
from catboost import CatBoostClassifier
from sklearn.linear_model import SGDRegressor, LinearRegression, Ridge
#----------------交叉验证----------------
from sklearn.model_selection import StratifiedKFold, KFold
#----------------评估指标----------------
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, log_loss
#----------------Ignore the alarm----------------
import warnings
warnings.filterwarnings('ignore')

对`Target domain user behavior`进行分析

对于训练集和测试集,How much is the proportion of user registration？

train_data_ads = pd.read_csv('./train/train_data_ads.csv')
test_data_ads = pd.read_csv('./test/test_data_ads.csv')
train_id_ads = train_data_ads['user_id'].unique()
test_id_ads = test_data_ads['user_id'].unique()
len(train_id_ads), len(test_id_ads) # 在训练集和测试集中,去重的id个数

#Statistical two array elements same number
#方法一：In front of the userid已经去重了,但是这里不加set会报错
dup_id_len = len(set(train_id_ads) & set(test_id_ads))
dup_id_len
#方法二：
#duplicate_id = [x for x in train_id_ads if x in test_id_ads]
#dup_id_len = len(duplicate_id)

#对于训练集 和 测试集,How much is the proportion of user registration？
ratio_dup_train = dup_id_len / len(train_id_ads)
ratio_dup_test = dup_id_len / len(test_id_ads)
ratio_dup_train, ratio_dup_test

How much of the statistics field numerical fields,How many non numeric fields？

train_data_ads.dtypes

test_data_ads.dtypes

Statistics which users properties（年龄、性别、手机设备等）与 Label the strongest correlation？

DataFrame.corr()函数使用说明如下：

DataFrame.corr(method='pearson', min_periods=1)

作用：
data.corr()表示了data中的两个变量之间的相关性,取值范围为[-1,1],取值接近-1,表示反相关,类似反比例函数,取值接近1,表正相关.

参数说明：
method：可选值为{‘pearson’, ‘kendall’, ‘spearman’}
pearson：Pearson相关系数来衡量两个数据集合是否在一条线上面,即针对线性数据的相关系数计算,针对非线性数据便会有误差.
kendall：用于反映分类变量相关性的指标,即针对无序序列的相关系数,非正太分布的数据.
spearman：非线性的,非正太分析的数据的相关系数.
min_periods：样本最少的数据量.
返回值：各类型之间的相关系数DataFrame表格.

train_data_ads.corr(method = 'kendall')['label']

结果,The top three, respectively is：

对`The source domain user behavior`进行分析

The source domain user behavior 与 Target domain user behavior The proportion of training set and testing set user registration, respectively is how much？
How much of the statistics field numerical fields,How many non numeric fields？

原网站

版权声明
本文为[YKbsmn]所创，转载请带上原文链接，感谢
https://yzsam.com/2022/216/202208041542391529.html