PySpark: low-frequency feature processing
2022-08-03 07:40:00 【WGS.】
Low-frequency feature values appear too rarely for a model to learn them well, so it is better to map them all to a single default value and learn them together.
Take one feature as an example: replace its low-frequency values with default, and let default take part in encoding and training. When a new value is encountered at evaluation time, it can also be filled with default. This addresses both the poor representation of low-frequency values and the cold-start problem.
from pyspark.sql import SparkSession

# Assumption: the original post does not show how `ss` is created;
# a default SparkSession is used here.
ss = SparkSession.builder.getOrCreate()

tmpd = [
    {'model': 'AVA', 'city': '苏州', 'y': 0}, {'model': 'AVA', 'city': '苏州', 'y': 0},
    {'model': 'TNY', 'city': '青岛', 'y': 0}, {'model': 'AVA', 'city': '青岛', 'y': 0},
    {'model': 'TNY', 'city': '青岛', 'y': 0}, {'model': 'TNY', 'city': '青岛', 'y': 0},
    {'model': 'TNY', 'city': '青岛', 'y': 0}, {'model': 'AVA', 'city': '上海', 'y': 0},
    {'model': 'Mi', 'city': '上海', 'y': 0}, {'model': 'Mi', 'city': '上海', 'y': 0},
    {'model': 'Mi', 'city': '上海', 'y': 0}, {'model': 'fla', 'city': '北京', 'y': 0},
]
tmpd = ss.createDataFrame(tmpd)
tmpd.show()
+----+-----+---+
|city|model| y|
+----+-----+---+
|苏州| AVA| 0|
|苏州| AVA| 0|
|青岛| TNY| 0|
|青岛| AVA| 0|
|青岛| TNY| 0|
|青岛| TNY| 0|
|青岛| TNY| 0|
|上海| AVA| 0|
|上海| Mi| 0|
|上海| Mi| 0|
|上海| Mi| 0|
|北京| fla| 0|
+----+-----+---+
- Demo: feature values that have at most 3 records and no clicks are replaced with default
from pyspark.sql import functions as fn


def row_count2(row):
    # row = (feature value, list of labels collected for that value)
    uid, y = row[0], row[1]
    clicks = sum(y)       # number of positive labels
    lens = len(y)         # total number of records
    pvs = lens - clicks   # records without a click
    return uid, pvs, clicks, lens


def low_frequency(df):
    ''' low frequency processing '''
    # Per-column threshold: a value with at most this many records is a candidate
    low_enc_dict = {'city': 3, 'model': 3}
    for c in low_enc_dict.keys():
        dfpg = df.groupby(c).agg(fn.collect_list('y').alias('y')).rdd.map(row_count2).toDF(schema=[c, 'pvs', 'clicks', 'lens'])
        # print(dfpg.count())
        # dfpg.orderBy(['lens', 'clicks'], ascending=[0, 0]).show(10)
        # dfpg.orderBy(['lens', 'clicks'], ascending=[1, 1]).show(50)
        # dfpg.filter(dfpg['lens'] <= low_enc_dict[c]).filter(dfpg['clicks'] == 0).select('lens').agg({'lens': 'sum'}).show()
        # Collect the values that are both rare and never clicked
        tlst = dfpg.filter(dfpg['lens'] <= low_enc_dict[c]).filter(dfpg['clicks'] == 0).select(c).collect()
        lowlst = [row[0] for row in tlst]
        # Replace those values with the shared 'default' token
        df = df.withColumn(c, fn.udf(lambda x: 'default' if x in lowlst else x)(fn.col(c)))
        # print(c, len(lowlst), df.filter(df[c] == 'default').count())
    return df
tmpd = low_frequency(df=tmpd)
tmpd.show()
+-------+-------+---+
| city| model| y|
+-------+-------+---+
|default| AVA| 0|
|default| AVA| 0|
| 青岛| TNY| 0|
| 青岛| AVA| 0|
| 青岛| TNY| 0|
| 青岛| TNY| 0|
| 青岛| TNY| 0|
| 上海| AVA| 0|
| 上海|default| 0|
| 上海|default| 0|
| 上海|default| 0|
|default|default| 0|
+-------+-------+---+
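The function above only rewrites the training DataFrame. To also fill new values with default at evaluation time, as described at the top, the set of values that were kept has to be remembered and reused. Below is a minimal pure-Python sketch of that idea, not the author's implementation: the helper names `fit_kept_values` and `apply_default` and the evaluation city 广州 are hypothetical, and the rule (at most 3 records and no clicks) is copied from the demo.

```python
from collections import defaultdict

DEFAULT = "default"  # shared placeholder token, as in the article


def fit_kept_values(rows, column, max_count=3, label="y"):
    """Return the values of `column` that are NOT low-frequency.

    A value is low-frequency when it has at most `max_count` records
    and zero clicks (sum of labels == 0).
    """
    counts = defaultdict(int)
    clicks = defaultdict(int)
    for row in rows:
        counts[row[column]] += 1
        clicks[row[column]] += row[label]
    return {v for v in counts
            if not (counts[v] <= max_count and clicks[v] == 0)}


def apply_default(rows, column, kept):
    """Replace every value outside `kept` (rare or unseen) with DEFAULT."""
    return [{**row, column: row[column] if row[column] in kept else DEFAULT}
            for row in rows]


# Same city distribution as the example data: 苏州 x2, 青岛 x5, 上海 x4, 北京 x1
train = ([{"city": "苏州", "y": 0}] * 2 + [{"city": "青岛", "y": 0}] * 5
         + [{"city": "上海", "y": 0}] * 4 + [{"city": "北京", "y": 0}])

kept = fit_kept_values(train, "city")       # fitted on training data only
train_encoded = apply_default(train, "city", kept)

# 广州 never appears in training, so it falls back to the default token
evaluation = [{"city": "青岛", "y": 0}, {"city": "广州", "y": 0}]
eval_encoded = apply_default(evaluation, "city", kept)
print(eval_encoded)
```

Keeping the set of retained values, rather than the list of low-frequency ones, is what lets a value never seen in training be treated the same way as a rare one.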