Learning the Parquet file format
2022-07-27 15:37:00 【wankunde】
Learning goals
- The structure of a Parquet file as a columnar storage format
- The main read and write flow for Parquet files and the interfaces it calls
- Spark's optimizations for reading and writing Parquet files
- How Spark performs vectorized reads
Parquet file storage structure

For example, here is the meta information of a real Parquet file:
parquet-tools meta --debug part-00000-95a6898f-c2aa-4e89-86a6-4f17a2a8fe26.c000.snappy.parquet
creator: parquet-mr version 1.10.1 (build a89df8f9932b6ef6633d06069e50c9b7970bebd1)
extra: org.apache.spark.version = 3.0.0
extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"biz_id","type":"string","nullable":true,"metadata":{"comment":" marketing / Unified Q & a business trace id"}},{"name":"scene_id","type":"integer","nullable":true,"metadata":{"comment":" scene "}},{"name":"store_id","type":"string","n [more]...
file schema: spark_schema
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
biz_id: OPTIONAL BINARY O:UTF8 R:0 D:1
scene_id: OPTIONAL INT32 R:0 D:1
store_id: OPTIONAL BINARY O:UTF8 R:0 D:1
store_name: OPTIONAL BINARY O:UTF8 R:0 D:1
buyer_nick: OPTIONAL BINARY O:UTF8 R:0 D:1
trigger_time_in_ms: OPTIONAL INT64 R:0 D:1
dispatch_time_in_ms: OPTIONAL INT64 R:0 D:1
arrived_time_in_ms: OPTIONAL INT64 R:0 D:1
assistant_nick: OPTIONAL BINARY O:UTF8 R:0 D:1
trade_id: OPTIONAL BINARY O:UTF8 R:0 D:1
paid_time_in_ms: OPTIONAL INT64 R:0 D:1
order_fee: OPTIONAL INT64 O:DECIMAL R:0 D:1
order_number: OPTIONAL INT32 R:0 D:1
indirect_order_fee: OPTIONAL INT64 O:DECIMAL R:0 D:1
indirect_order_number: OPTIONAL INT32 R:0 D:1
is_arrived: REQUIRED BOOLEAN R:0 D:0
result: REQUIRED BINARY O:UTF8 R:0 D:0
sentence: OPTIONAL F:1
.list: REPEATED F:1
..element: REQUIRED BINARY O:UTF8 R:1 D:2
task: REQUIRED F:5
.task_id: OPTIONAL INT64 R:0 D:1
.round_id: OPTIONAL INT64 R:0 D:1
.round: OPTIONAL INT32 R:0 D:1
.entry_name: OPTIONAL BINARY O:UTF8 R:0 D:1
.strategy: OPTIONAL BINARY O:UTF8 R:0 D:1
row group 1: RC:350242 TS:114447905 // This test file has only one RowGroup; if there were several, this block would repeat for each one
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
biz_id: BINARY SNAPPY DO:0 FPO:4 SZ:12781311/14711901/1.15 VC:350242 ENC:BIT_PACKED,RLE,PLAIN
scene_id: INT32 SNAPPY DO:0 FPO:12781315 SZ:141/135/0.96 VC:350242 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
store_id: BINARY SNAPPY DO:0 FPO:12781456 SZ:625362/651136/1.04 VC:350242 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
store_name: BINARY SNAPPY DO:0 FPO:13406818 SZ:661669/720686/1.09 VC:350242 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
buyer_nick: BINARY SNAPPY DO:0 FPO:14068487 SZ:4926923/6019612/1.22 VC:350242 ENC:BIT_PACKED,RLE,PLAIN
trigger_time_in_ms: INT64 SNAPPY DO:0 FPO:18995410 SZ:1141626/1301269/1.14 VC:350242 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
dispatch_time_in_ms: INT64 SNAPPY DO:0 FPO:20137036 SZ:2021695/2802163/1.39 VC:350242 ENC:BIT_PACKED,RLE,PLAIN
arrived_time_in_ms: INT64 SNAPPY DO:0 FPO:22158731 SZ:1866923/2578762/1.38 VC:350242 ENC:BIT_PACKED,RLE,PLAIN
assistant_nick: BINARY SNAPPY DO:0 FPO:24025654 SZ:944211/1151429/1.22 VC:350242 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
trade_id: BINARY SNAPPY DO:0 FPO:24969865 SZ:5252882/8012876/1.53 VC:350242 ENC:BIT_PACKED,RLE,PLAIN
paid_time_in_ms: INT64 SNAPPY DO:0 FPO:30222747 SZ:469431/566161/1.21 VC:350242 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
order_fee: INT64 SNAPPY DO:0 FPO:30692178 SZ:194907/233987/1.20 VC:350242 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
order_number: INT32 SNAPPY DO:0 FPO:30887085 SZ:70615/87743/1.24 VC:350242 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
indirect_order_fee: INT64 SNAPPY DO:0 FPO:30957700 SZ:241548/282655/1.17 VC:350242 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
indirect_order_number: INT32 SNAPPY DO:0 FPO:31199248 SZ:79962/98381/1.23 VC:350242 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
is_arrived: BOOLEAN SNAPPY DO:0 FPO:31279210 SZ:5913/43819/7.41 VC:350242 ENC:BIT_PACKED,PLAIN
result: BINARY SNAPPY DO:0 FPO:31285123 SZ:86680/88435/1.02 VC:350242 ENC:BIT_PACKED,PLAIN_DICTIONARY
sentence:
.list:
..element: BINARY SNAPPY DO:0 FPO:31371803 SZ:32418858/73409092/2.26 VC:377833 ENC:RLE,PLAIN
task:
.task_id: INT64 SNAPPY DO:0 FPO:63790661 SZ:666110/714412/1.07 VC:350242 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
.round_id: INT64 SNAPPY DO:0 FPO:64456771 SZ:664298/711780/1.07 VC:350242 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
.round: INT32 SNAPPY DO:0 FPO:65121069 SZ:129151/131726/1.02 VC:350242 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
.entry_name: BINARY SNAPPY DO:0 FPO:65250220 SZ:43694/64940/1.49 VC:350242 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
.strategy: BINARY SNAPPY DO:0 FPO:65293914 SZ:43637/64805/1.49 VC:350242 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
Description of the abbreviated fields:
- RC = Record Count, TS = Total Byte Size
- DO = DictionaryPageOffset
- FPO = FirstDataPageOffset
- SZ: first value = TotalSize, second value = TotalUncompressedSize, third value = the ratio TotalUncompressedSize / TotalSize
- VC = ValueCount
- ENC = Encodings
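To make the abbreviations concrete, the following is a minimal sketch that parses two column-chunk lines copied from the output above and recomputes the SZ ratio. The regular expression and the helper name `parse_chunk` are illustrative assumptions, not part of parquet-tools; the pattern only assumes the line layout shown in this output.

```python
import re

# Two column-chunk lines copied verbatim from the parquet-tools output above.
LINES = [
    "biz_id: BINARY SNAPPY DO:0 FPO:4 SZ:12781311/14711901/1.15 VC:350242 ENC:BIT_PACKED,RLE,PLAIN",
    "scene_id: INT32 SNAPPY DO:0 FPO:12781315 SZ:141/135/0.96 VC:350242 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE",
]

# Hypothetical pattern: it assumes the exact field order printed above.
PATTERN = re.compile(
    r"^(?P<name>[\w.]+):\s+(?P<type>\w+)\s+(?P<codec>\w+)\s+"
    r"DO:(?P<do>\d+)\s+FPO:(?P<fpo>\d+)\s+"
    r"SZ:(?P<size>\d+)/(?P<raw>\d+)/(?P<ratio>[\d.]+)\s+"
    r"VC:(?P<vc>\d+)\s+ENC:(?P<enc>[\w,]+)$"
)


def parse_chunk(line):
    """Parse one column-chunk line into a dict of typed fields."""
    match = PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized column-chunk line: {line!r}")
    g = match.groupdict()
    return {
        "name": g["name"],
        "type": g["type"],
        "codec": g["codec"],
        "dictionary_page_offset": int(g["do"]),   # DO
        "first_data_page_offset": int(g["fpo"]),  # FPO
        "total_size": int(g["size"]),             # SZ, first value
        "uncompressed_size": int(g["raw"]),       # SZ, second value
        "printed_ratio": float(g["ratio"]),       # SZ, third value
        "value_count": int(g["vc"]),              # VC
        "encodings": g["enc"].split(","),         # ENC
    }


chunks = [parse_chunk(line) for line in LINES]

for chunk in chunks:
    # Third SZ value = TotalUncompressedSize / TotalSize, rounded to 2 places.
    ratio = chunk["uncompressed_size"] / chunk["total_size"]
    print(chunk["name"], round(ratio, 2), chunk["printed_ratio"])

# In this sample, biz_id's FPO plus its TotalSize lands exactly on scene_id's FPO.
next_offset = chunks[0]["first_data_page_offset"] + chunks[0]["total_size"]
print(next_offset == chunks[1]["first_data_page_offset"])
```

A ratio below 1, as for scene_id (141 bytes stored against 135 uncompressed), means the compressed chunk is slightly larger than the uncompressed one. The offset check is an observation about the lines in this particular file, not a general guarantee about every Parquet writer.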