当前位置:网站首页>pyspark @udf loop using variable problem
pyspark @udf loop using variable problem
2022-08-03 07:41:00 【WGS。】
问题描述
通过@udfway to add two columns,udfThe broadcast variable applied to each column is different.But in a few rounds,udfThe global variables accepted in are all first-round loops.伪代码如下:
from pyspark.sql.functions import udf
tlowdata = ss.createDataFrame([{
'suuid': 'DDD1', 'oaid': '00-01', 'y': 1},
{
'suuid': 'DOOD', 'oaid': '00-02', 'y': 0},
{
'suuid': '009-1234', 'oaid': 'default1', 'y': 0},
{
'suuid': 'DDD1', 'oaid': 'ttt', 'y': 0},
{
'suuid': 'www', 'oaid': 'fwao', 'y': 0},
{
'suuid': 'www', 'oaid': 'fff1', 'y': 0},])
tlowdata.show()
@udf
def tmp_udf(uid):
return str(tmplst.value)
cols = ['suuid', 'oaid']
lst = [[1, 2, 3], [4, 5, 6]]
global tmplst
for i in range(len(lst)):
tmplst = lst[i]
tmplst = sc.broadcast(tmplst)
print(tmplst.value)
tlowdata = tlowdata.withColumn(c[i] + '_flag', tmp_udf(fn.col(c[i])))
tmplst.unpersist()
tlowdata.show()
+--------+--------+---+
| oaid| suuid| y|
+--------+--------+---+
| 00-01| DDD1| 1|
| 00-02| DOOD| 0|
|default1|009-1234| 0|
| ttt| DDD1| 0|
| fwao| www| 0|
| fff1| www| 0|
+--------+--------+---+
[1, 2, 3]
[4, 5, 6]
+--------+--------+---+----------+---------+
| oaid| suuid| y|suuid_flag|oaid_flag|
+--------+--------+---+----------+---------+
| 00-01| DDD1| 1| [1, 2, 3]|[1, 2, 3]|
| 00-02| DOOD| 0| [1, 2, 3]|[1, 2, 3]|
|default1|009-1234| 0| [1, 2, 3]|[1, 2, 3]|
| ttt| DDD1| 0| [1, 2, 3]|[1, 2, 3]|
| fwao| www| 0| [1, 2, 3]|[1, 2, 3]|
| fff1| www| 0| [1, 2, 3]|[1, 2, 3]|
+--------+--------+---+----------+---------+
理论上,在新增oaid_flag
这一列的时候,应该是[4, 5, 6]
,But no matter how many times it is looped,udfThe broadcast variable accepted in is always [1, 2, 3]
.I doubt itspark2.4版本的问题了…
解决方案
放弃@udf,使用lambda udf
或者使用UserDefinedFunction
- lambda udf
# lambda udf
tmp_udf = fn.udf(lambda x: str(tmplst.value))
tlowdata = tlowdata.withColumn(c[i] + '_flag', tmp_udf(fn.col(c[i])))
- UserDefinedFunction
# UserDefinedFunction
from pyspark.sql.udf import UserDefinedFunction
def tmp_udf(uid):
return str(tmplst.value)
tlowdata = tlowdata.withColumn(c[i] + '_flag', UserDefinedFunction(lambda x: tmp_udf(x))(fn.col(c[i])))
+--------+--------+---+
| oaid| suuid| y|
+--------+--------+---+
| 00-01| DDD1| 1|
| 00-02| DOOD| 0|
|default1|009-1234| 0|
| ttt| DDD1| 0|
| fwao| www| 0|
| fff1| www| 0|
+--------+--------+---+
[1, 2, 3]
[4, 5, 6]
+--------+--------+---+----------+---------+
| oaid| suuid| y|suuid_flag|oaid_flag|
+--------+--------+---+----------+---------+
| 00-01| DDD1| 1| [1, 2, 3]|[4, 5, 6]|
| 00-02| DOOD| 0| [1, 2, 3]|[4, 5, 6]|
|default1|009-1234| 0| [1, 2, 3]|[4, 5, 6]|
| ttt| DDD1| 0| [1, 2, 3]|[4, 5, 6]|
| fwao| www| 0| [1, 2, 3]|[4, 5, 6]|
| fff1| www| 0| [1, 2, 3]|[4, 5, 6]|
+--------+--------+---+----------+---------+
没想明白是什么原因,Temporarily solved by the above method.I'll update when I figure it out.
边栏推荐
猜你喜欢
随机推荐
【Shell】3万字图文讲解带你快速掌握shell脚本编程
drop database出现1010
Charles capture shows
solution Postman will return to results generated CSV file to the local interface
多线程打印ABC(继承+进阶)
线程基础(二)
qt学习之旅--MinGW32编译opencv3.0.0
现货黄金分析的主要流派
DSP Trick:向量长度估算
测试用例设计方法之因果图详解
【图像去雾】基于matlab暗通道和非均值滤波图像去雾【含Matlab源码 2011期】
伦敦银现货市场如何使用多条均线?
【图像去噪】基于matlab稀疏表示KSVD图像去噪【含Matlab源码 2016期】
【着色器实现Glow可控局部发光效果_Shader效果第十三篇】
华为设备配置BFD与接口联动(触发与BFD联动的接口物理状态变为Down)
keepalived安装部署
C语言入门实战(14):选择排序
postman将接口返回结果生成json文件到本地
数据库表结构文档 生成工具screw的使用
9月考,如何选择靠谱正规的培训机构?