Scala104 - Built-in datetime functions for Spark.sql
2022-08-04 18:32:00 【51CTO】
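Sometimes we register a DataFrame directly with df.createOrReplaceTempView("temp")
to create a temporary view and do the computation in SQL. Some Spark SQL syntax differs from HQL, so here are some notes.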
- <scala.version>2.11.12</scala.version>
- <spark.version>2.4.3</spark.version>
import org.apache.spark.sql.SparkSession

// Configure a local session used for the examples below
val builder = SparkSession
  .builder()
  .appName("learningScala")
  .config("spark.executor.heartbeatInterval", "60s")
  .config("spark.network.timeout", "120s")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.kryoserializer.buffer.max", "512m")
  .config("spark.dynamicAllocation.enabled", false)
  .config("spark.sql.inMemoryColumnarStorage.compressed", true)
  .config("spark.sql.inMemoryColumnarStorage.batchSize", 10000)
  .config("spark.sql.broadcastTimeout", 600)
  .config("spark.sql.autoBroadcastJoinThreshold", -1)
  .config("spark.sql.crossJoin.enabled", true)
  .master("local[*]")
val spark = builder.getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
builder: org.apache.spark.sql.SparkSession.Builder
spark: org.apache.spark.sql.SparkSession
import spark.implicits._ // needed for toDF and the $"col" syntax outside spark-shell

// Sample data: start and end times as strings
var df1 = Seq(
  (1, "2019-04-01 11:45:50", 11.15, "2019-04-02 11:45:49"),
  (2, "2019-05-02 11:56:50", 10.37, "2019-05-02 11:56:51"),
  (3, "2019-07-21 12:45:50", 12.11, "2019-08-21 12:45:50"),
  (4, "2019-08-01 12:40:50", 14.50, "2020-08-03 12:40:50"),
  (5, "2019-01-06 10:00:50", 16.39, "2019-01-05 10:00:50")
).toDF("id", "startTimeStr", "payamount", "endTimeStr")
// Cast the string columns to timestamp
df1 = df1.withColumn("startTime", $"startTimeStr".cast("Timestamp"))
  .withColumn("endTime", $"endTimeStr".cast("Timestamp"))
df1.printSchema
df1.show()
root
|-- id: integer (nullable = false)
|-- startTimeStr: string (nullable = true)
|-- payamount: double (nullable = false)
|-- endTimeStr: string (nullable = true)
|-- startTime: timestamp (nullable = true)
|-- endTime: timestamp (nullable = true)
+---+-------------------+---------+-------------------+-------------------+-------------------+
| id| startTimeStr|payamount| endTimeStr| startTime| endTime|
+---+-------------------+---------+-------------------+-------------------+-------------------+
| 1|2019-04-01 11:45:50| 11.15|2019-04-02 11:45:49|2019-04-01 11:45:50|2019-04-02 11:45:49|
| 2|2019-05-02 11:56:50| 10.37|2019-05-02 11:56:51|2019-05-02 11:56:50|2019-05-02 11:56:51|
| 3|2019-07-21 12:45:50| 12.11|2019-08-21 12:45:50|2019-07-21 12:45:50|2019-08-21 12:45:50|
| 4|2019-08-01 12:40:50| 14.5|2020-08-03 12:40:50|2019-08-01 12:40:50|2020-08-03 12:40:50|
| 5|2019-01-06 10:00:50| 16.39|2019-01-05 10:00:50|2019-01-06 10:00:50|2019-01-05 10:00:50|
+---+-------------------+---------+-------------------+-------------------+-------------------+
df1: org.apache.spark.sql.DataFrame = [id: int, startTimeStr: string ... 4 more fields]
df1: org.apache.spark.sql.DataFrame = [id: int, startTimeStr: string ... 4 more fields]
timestamp to string
Convert a timestamp to a string in a given format.
- date_format converts a timestamp to the corresponding string
- The string format is given as a pattern such as "yyyyMMdd"
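The cell that produced the output in this section is not preserved in the text; only the echoed query string is. A minimal sketch of what such a cell could look like, assuming a local SparkSession and rebuilding just the start times from the sample data (the names `starts` and `formatted` are illustrative and do not come from the original):

```scala
import org.apache.spark.sql.SparkSession

// Reuses the running session if there is one; otherwise starts a local one.
val spark = SparkSession.builder().master("local[*]").appName("dateFormatSketch").getOrCreate()
import spark.implicits._

// Assumption: only the start times are needed here, cast to timestamp as in df1.
val starts = Seq(
  "2019-04-01 11:45:50",
  "2019-05-02 11:56:50",
  "2019-07-21 12:45:50",
  "2019-08-01 12:40:50",
  "2019-01-06 10:00:50"
).toDF("startTimeStr")
  .withColumn("startTime", $"startTimeStr".cast("timestamp"))
starts.createOrReplaceTempView("temp")

// Same query as the echoed `sql` string in the output.
val sql = """
SELECT date_format(startTime,'yyyyMMdd') AS yyyyMMdd,
       date_format(startTime,'yyyy-MM-dd') AS yyyy_MM_dd,
       date_format(startTime,'yyyy') AS yyyy
FROM temp
"""
val formatted = spark.sql(sql)
formatted.printSchema()
formatted.show()
```

All three result columns are strings, which is why the schema in the output reports `string` for each of them.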
root
|-- yyyyMMdd: string (nullable = true)
|-- yyyy_MM_dd: string (nullable = true)
|-- yyyy: string (nullable = true)
+--------+----------+----+
|yyyyMMdd|yyyy_MM_dd|yyyy|
+--------+----------+----+
|20190401|2019-04-01|2019|
|20190502|2019-05-02|2019|
|20190721|2019-07-21|2019|
|20190801|2019-08-01|2019|
|20190106|2019-01-06|2019|
+--------+----------+----+
sql: String =
"
SELECT date_format(startTime,'yyyyMMdd') AS yyyyMMdd,
date_format(startTime,'yyyy-MM-dd') AS yyyy_MM_dd,
date_format(startTime,'yyyy') AS yyyy
FROM TEMP
"
timestamp to date
- to_date converts a timestamp to the date type
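The cell for this step is also missing from the text, but the echoed query and the `df2` line in the output suggest its shape. A minimal sketch under that assumption, rebuilding only the start and end times from the sample data (the name `times` is illustrative):

```scala
import org.apache.spark.sql.SparkSession

// Reuses the running session if there is one; otherwise starts a local one.
val spark = SparkSession.builder().master("local[*]").appName("toDateSketch").getOrCreate()
import spark.implicits._

// The five (start, end) pairs from the sample data, cast to timestamp.
val times = Seq(
  ("2019-04-01 11:45:50", "2019-04-02 11:45:49"),
  ("2019-05-02 11:56:50", "2019-05-02 11:56:51"),
  ("2019-07-21 12:45:50", "2019-08-21 12:45:50"),
  ("2019-08-01 12:40:50", "2020-08-03 12:40:50"),
  ("2019-01-06 10:00:50", "2019-01-05 10:00:50")
).toDF("startTimeStr", "endTimeStr")
  .select(
    $"startTimeStr".cast("timestamp").as("startTime"),
    $"endTimeStr".cast("timestamp").as("endTime"))
times.createOrReplaceTempView("temp")

// Same query as the echoed `sql` string in the output.
val sql = """
SELECT startTime,endTime,
       to_date(startTime) AS startDate,
       to_date(endTime) AS endDate
FROM temp
"""
val df2 = spark.sql(sql)
df2.printSchema()
df2.show()
```

The resulting `df2` keeps both the timestamp and the date columns, so later queries can compare how a function behaves on each type.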
root
|-- startTime: timestamp (nullable = true)
|-- endTime: timestamp (nullable = true)
|-- startDate: date (nullable = true)
|-- endDate: date (nullable = true)
+-------------------+-------------------+----------+----------+
| startTime| endTime| startDate| endDate|
+-------------------+-------------------+----------+----------+
|2019-04-01 11:45:50|2019-04-02 11:45:49|2019-04-01|2019-04-02|
|2019-05-02 11:56:50|2019-05-02 11:56:51|2019-05-02|2019-05-02|
|2019-07-21 12:45:50|2019-08-21 12:45:50|2019-07-21|2019-08-21|
|2019-08-01 12:40:50|2020-08-03 12:40:50|2019-08-01|2020-08-03|
|2019-01-06 10:00:50|2019-01-05 10:00:50|2019-01-06|2019-01-05|
+-------------------+-------------------+----------+----------+
sql: String =
SELECT startTime,endTime,
to_date(startTime) AS startDate,
to_date(endTime) AS endDate
FROM TEMP
df2: org.apache.spark.sql.DataFrame = [startTime: timestamp, endTime: timestamp ... 2 more fields]
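Computing time differences
- The day-difference function datediff works on timestamp columns as well as date columns. The unit is calendar days, not 24-hour periods.
- The month-difference function months_between also works on both. The month length can look variable (30 or 31 days), but the fractional part is computed on a fixed 31-day month unless both values fall on the same day of the month or are both month-ends, in which case the result is a whole number. For timestamps the time of day also counts toward the fraction.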
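To see why calendar days and 24-hour periods differ, take the first row of the sample data, where the end time is one second short of a full day after the start. The sketch below uses plain `java.time` rather than Spark, purely to illustrate the arithmetic; it is not how Spark implements datediff.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit

val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
val start = LocalDateTime.parse("2019-04-01 11:45:50", fmt)
val end = LocalDateTime.parse("2019-04-02 11:45:49", fmt)

// Complete 24-hour periods elapsed: 23h59m59s is less than one.
val elapsedDays = ChronoUnit.DAYS.between(start, end)
// Calendar dates crossed, which is the quantity datediff reports.
val calendarDays = ChronoUnit.DAYS.between(start.toLocalDate(), end.toLocalDate())

println(s"elapsed 24h periods = $elapsedDays, calendar days = $calendarDays")
// prints "elapsed 24h periods = 0, calendar days = 1"
```

The calendar-day count is 1 for this pair, whether the inputs are timestamps or dates.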
df2.createOrReplaceTempView("temp")
var sql = """
SELECT startTime,
       endTime,
       datediff(endTime,startTime) AS dayInterval1,
       datediff(endDate,startDate) AS dayInterval2,
       months_between(endTime,startTime) AS monthInterval1,
       months_between(endDate,startDate) AS monthInterval2
FROM TEMP
"""
// spark.sql(sql).printSchema
spark.sql(sql).show()
+-------------------+-------------------+------------+------------+--------------+--------------+
| startTime| endTime|dayInterval1|dayInterval2|monthInterval1|monthInterval2|
+-------------------+-------------------+------------+------------+--------------+--------------+
|2019-04-01 11:45:50|2019-04-02 11:45:49| 1| 1| 0.03225769| 0.03225806|
|2019-05-02 11:56:50|2019-05-02 11:56:51| 0| 0| 0.0| 0.0|
|2019-07-21 12:45:50|2019-08-21 12:45:50| 31| 31| 1.0| 1.0|
|2019-08-01 12:40:50|2020-08-03 12:40:50| 368| 368| 12.06451613| 12.06451613|
|2019-01-06 10:00:50|2019-01-05 10:00:50| -1| -1| -0.03225806| -0.03225806|
+-------------------+-------------------+------------+------------+--------------+--------------+
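The month values in the table can be reproduced by hand. The sketch below is a plain-Scala approximation of the rule in Spark's documentation for months_between (whole months when the days of month match or both are month-ends, otherwise a fraction based on 31 days per month, rounded to 8 digits). The helper `monthsBetweenSketch` is hypothetical and written only for illustration; it is not Spark's own code.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

// Hypothetical helper mirroring the documented rule; not Spark's implementation.
def monthsBetweenSketch(endStr: String, startStr: String): Double = {
  val end = LocalDateTime.parse(endStr, fmt)
  val start = LocalDateTime.parse(startStr, fmt)
  val wholeMonths =
    (end.getYear - start.getYear) * 12 + (end.getMonthValue - start.getMonthValue)
  val endIsMonthEnd = end.getDayOfMonth == end.toLocalDate().lengthOfMonth()
  val startIsMonthEnd = start.getDayOfMonth == start.toLocalDate().lengthOfMonth()
  val raw =
    if (end.getDayOfMonth == start.getDayOfMonth || (endIsMonthEnd && startIsMonthEnd)) {
      // Same day of month, or both month-ends: time of day is ignored.
      wholeMonths.toDouble
    } else {
      // Otherwise the leftover days and seconds are divided by a fixed 31-day month.
      val secondsPerDay = 86400.0
      val leftoverSeconds =
        (end.getDayOfMonth - start.getDayOfMonth) * secondsPerDay +
          (end.toLocalTime().toSecondOfDay() - start.toLocalTime().toSecondOfDay())
      wholeMonths + leftoverSeconds / (31 * secondsPerDay)
    }
  BigDecimal(raw).setScale(8, BigDecimal.RoundingMode.HALF_UP).toDouble
}

// Row 1 as timestamps: one day minus one second, over 31 days.
val row1Timestamps = monthsBetweenSketch("2019-04-02 11:45:49", "2019-04-01 11:45:50")
// Row 1 as dates (midnight): exactly 1/31.
val row1Dates = monthsBetweenSketch("2019-04-02 00:00:00", "2019-04-01 00:00:00")
// Row 3: same day of month, so a whole number.
val row3 = monthsBetweenSketch("2019-08-21 12:45:50", "2019-07-21 12:45:50")
// Row 4: twelve months plus 2/31.
val row4 = monthsBetweenSketch("2020-08-03 12:40:50", "2019-08-01 12:40:50")

println(Seq(row1Timestamps, row1Dates, row3, row4).mkString(", "))
```

Under this rule, row 1 differs between the timestamp and date columns only because the timestamps are one second short of a full day, and row 2 is 0.0 because both values fall on the same day of the month.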
2020-03-24, Jiulonghu, Jiangning District, Nanjing