Learning the Parquet File Format
2022-07-27 14:23:00 【wankunde】
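Learning goals
- Parquet as a columnar storage format
- The main read/write flow of a Parquet file and the APIs it goes through
- Spark's optimizations for reading and writing Parquet files
- How Spark implements vectorized data reading
Parquet file storage structure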

As an example, here is the metadata of a real Parquet file, printed with parquet-tools:
parquet-tools meta --debug part-00000-95a6898f-c2aa-4e89-86a6-4f17a2a8fe26.c000.snappy.parquet
creator: parquet-mr version 1.10.1 (build a89df8f9932b6ef6633d06069e50c9b7970bebd1)
extra: org.apache.spark.version = 3.0.0
extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"biz_id","type":"string","nullable":true,"metadata":{"comment":"营销/问答业务统一 trace id"}},{"name":"scene_id","type":"integer","nullable":true,"metadata":{"comment":"场景"}},{"name":"store_id","type":"string","n [more]...
file schema: spark_schema
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
biz_id: OPTIONAL BINARY O:UTF8 R:0 D:1
scene_id: OPTIONAL INT32 R:0 D:1
store_id: OPTIONAL BINARY O:UTF8 R:0 D:1
store_name: OPTIONAL BINARY O:UTF8 R:0 D:1
buyer_nick: OPTIONAL BINARY O:UTF8 R:0 D:1
trigger_time_in_ms: OPTIONAL INT64 R:0 D:1
dispatch_time_in_ms: OPTIONAL INT64 R:0 D:1
arrived_time_in_ms: OPTIONAL INT64 R:0 D:1
assistant_nick: OPTIONAL BINARY O:UTF8 R:0 D:1
trade_id: OPTIONAL BINARY O:UTF8 R:0 D:1
paid_time_in_ms: OPTIONAL INT64 R:0 D:1
order_fee: OPTIONAL INT64 O:DECIMAL R:0 D:1
order_number: OPTIONAL INT32 R:0 D:1
indirect_order_fee: OPTIONAL INT64 O:DECIMAL R:0 D:1
indirect_order_number: OPTIONAL INT32 R:0 D:1
is_arrived: REQUIRED BOOLEAN R:0 D:0
result: REQUIRED BINARY O:UTF8 R:0 D:0
sentence: OPTIONAL F:1
.list: REPEATED F:1
..element: REQUIRED BINARY O:UTF8 R:1 D:2
task: REQUIRED F:5
.task_id: OPTIONAL INT64 R:0 D:1
.round_id: OPTIONAL INT64 R:0 D:1
.round: OPTIONAL INT32 R:0 D:1
.entry_name: OPTIONAL BINARY O:UTF8 R:0 D:1
.strategy: OPTIONAL BINARY O:UTF8 R:0 D:1
row group 1: RC:350242 TS:114447905 // the test file has only one row group; with several, this block is printed once per row group
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
biz_id: BINARY SNAPPY DO:0 FPO:4 SZ:12781311/14711901/1.15 VC:350242 ENC:BIT_PACKED,RLE,PLAIN
scene_id: INT32 SNAPPY DO:0 FPO:12781315 SZ:141/135/0.96 VC:350242 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
store_id: BINARY SNAPPY DO:0 FPO:12781456 SZ:625362/651136/1.04 VC:350242 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
store_name: BINARY SNAPPY DO:0 FPO:13406818 SZ:661669/720686/1.09 VC:350242 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
buyer_nick: BINARY SNAPPY DO:0 FPO:14068487 SZ:4926923/6019612/1.22 VC:350242 ENC:BIT_PACKED,RLE,PLAIN
trigger_time_in_ms: INT64 SNAPPY DO:0 FPO:18995410 SZ:1141626/1301269/1.14 VC:350242 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
dispatch_time_in_ms: INT64 SNAPPY DO:0 FPO:20137036 SZ:2021695/2802163/1.39 VC:350242 ENC:BIT_PACKED,RLE,PLAIN
arrived_time_in_ms: INT64 SNAPPY DO:0 FPO:22158731 SZ:1866923/2578762/1.38 VC:350242 ENC:BIT_PACKED,RLE,PLAIN
assistant_nick: BINARY SNAPPY DO:0 FPO:24025654 SZ:944211/1151429/1.22 VC:350242 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
trade_id: BINARY SNAPPY DO:0 FPO:24969865 SZ:5252882/8012876/1.53 VC:350242 ENC:BIT_PACKED,RLE,PLAIN
paid_time_in_ms: INT64 SNAPPY DO:0 FPO:30222747 SZ:469431/566161/1.21 VC:350242 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
order_fee: INT64 SNAPPY DO:0 FPO:30692178 SZ:194907/233987/1.20 VC:350242 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
order_number: INT32 SNAPPY DO:0 FPO:30887085 SZ:70615/87743/1.24 VC:350242 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
indirect_order_fee: INT64 SNAPPY DO:0 FPO:30957700 SZ:241548/282655/1.17 VC:350242 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
indirect_order_number: INT32 SNAPPY DO:0 FPO:31199248 SZ:79962/98381/1.23 VC:350242 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
is_arrived: BOOLEAN SNAPPY DO:0 FPO:31279210 SZ:5913/43819/7.41 VC:350242 ENC:BIT_PACKED,PLAIN
result: BINARY SNAPPY DO:0 FPO:31285123 SZ:86680/88435/1.02 VC:350242 ENC:BIT_PACKED,PLAIN_DICTIONARY
sentence:
.list:
..element: BINARY SNAPPY DO:0 FPO:31371803 SZ:32418858/73409092/2.26 VC:377833 ENC:RLE,PLAIN
task:
.task_id: INT64 SNAPPY DO:0 FPO:63790661 SZ:666110/714412/1.07 VC:350242 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
.round_id: INT64 SNAPPY DO:0 FPO:64456771 SZ:664298/711780/1.07 VC:350242 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
.round: INT32 SNAPPY DO:0 FPO:65121069 SZ:129151/131726/1.02 VC:350242 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
.entry_name: BINARY SNAPPY DO:0 FPO:65250220 SZ:43694/64940/1.49 VC:350242 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
.strategy: BINARY SNAPPY DO:0 FPO:65293914 SZ:43637/64805/1.49 VC:350242 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
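In the schema part of the dump, each leaf column carries R (maximum repetition level) and D (maximum definition level). Under the Parquet nesting rules, R counts the REPEATED fields on the path from the root to the leaf, and D counts the fields on that path that are not REQUIRED. The sketch below recomputes the levels for a few columns from this file; the function name `max_levels` and the list-of-strings input are illustrative choices, not part of any Parquet API.

```python
from typing import List, Tuple


def max_levels(path_repetitions: List[str]) -> Tuple[int, int]:
    """Return (max repetition level, max definition level) for one leaf column.

    path_repetitions holds the repetition type of every field on the path,
    from the top-level field down to the leaf (hypothetical input format).
    """
    # R: only REPEATED fields add a repetition level.
    r = sum(1 for rep in path_repetitions if rep == "REPEATED")
    # D: every field that may be absent (OPTIONAL or REPEATED) adds a definition level.
    d = sum(1 for rep in path_repetitions if rep != "REQUIRED")
    return r, d


# Paths taken from the schema dump above.
print(max_levels(["OPTIONAL"]))                          # biz_id           -> (0, 1)
print(max_levels(["REQUIRED"]))                          # is_arrived       -> (0, 0)
print(max_levels(["OPTIONAL", "REPEATED", "REQUIRED"]))  # sentence.list.element -> (1, 2)
print(max_levels(["REQUIRED", "OPTIONAL"]))              # task.task_id     -> (0, 1)
```

The results agree with the R and D values printed by parquet-tools, for example `R:1 D:2` for `sentence.list.element`.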
Abbreviations used in the row group section:
- RC = Record Count, TS = Total Byte Size
- DO = DictionaryPageOffset
- FPO = FirstDataPageOffset
- SZ: first value = TotalSize (compressed), second value = TotalUncompressedSize, third value = ratio = TotalUncompressedSize / TotalSize
- VC = ValueCount
- ENC = Encodings
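To make the SZ field concrete, here is a minimal sketch that parses one column-chunk line of the text output and recomputes the ratio. It assumes the line layout shown in this dump (parquet-tools 1.10.x); the regular expression and the name `parse_column_chunk` are made up for illustration and are not part of parquet-tools.

```python
import re
from typing import Dict

# Matches lines such as:
# biz_id: BINARY SNAPPY DO:0 FPO:4 SZ:12781311/14711901/1.15 VC:350242 ENC:BIT_PACKED,RLE,PLAIN
LINE_RE = re.compile(
    r"^\.*(?P<name>\w+):\s+(?P<type>\w+)\s+(?P<codec>\w+)\s+"
    r"DO:(?P<do>\d+)\s+FPO:(?P<fpo>\d+)\s+"
    r"SZ:(?P<sz>\d+)/(?P<usz>\d+)/(?P<ratio>[\d.]+)\s+"
    r"VC:(?P<vc>\d+)\s+ENC:(?P<enc>\S+)$"
)


def parse_column_chunk(line: str) -> Dict[str, object]:
    """Parse one column-chunk line (hypothetical helper for this dump format)."""
    m = LINE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not a column chunk line: {line!r}")
    total_size = int(m["sz"])            # compressed bytes
    uncompressed_size = int(m["usz"])    # bytes before compression
    return {
        "name": m["name"],
        "codec": m["codec"],
        "first_data_page_offset": int(m["fpo"]),
        "total_size": total_size,
        "total_uncompressed_size": uncompressed_size,
        # ratio = TotalUncompressedSize / TotalSize, as in the legend above
        "ratio": round(uncompressed_size / total_size, 2),
        "value_count": int(m["vc"]),
        "encodings": m["enc"].split(","),
    }


chunk = parse_column_chunk(
    "biz_id: BINARY SNAPPY DO:0 FPO:4 SZ:12781311/14711901/1.15 "
    "VC:350242 ENC:BIT_PACKED,RLE,PLAIN"
)
print(chunk["ratio"])  # 1.15
# In this particular dump, FPO + TotalSize equals the FPO of the next column (scene_id).
print(chunk["first_data_page_offset"] + chunk["total_size"])  # 12781315
```

The second printed value matches `FPO:12781315` on the `scene_id` line, which suggests the column chunks in this file are laid out back to back.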