
Understanding the execution principle of `show create table`

2022-06-09 22:42:00 Data warehouse practitioner

This article walks through the source-level flow of the `show create table` command: how Spark SQL interacts with the Hive metastore, queries the table's metadata, and splices the final result that is shown to the user.

This article also grew out of a discussion in the source-code reading group. A little background first:

We usually pay close attention to query statements such as `select`, but rarely to how a statement like `show create table` is executed, and it is genuinely hard to find blog posts covering it. Taking that question as a starting point, I spent two hours digging through the source code and reached the basic conclusions below.

Several group members asked me to record a video; I'll write this article first and record a short screencast afterwards.

Now let's dig in. Source code can be dry reading, but it is also a window through which we can see how things really work.

This article is based on Spark 3.2.

Outline of this article:

1、Writing a local test class that simulates reading tables from Hive
2、The correspondence between Hive entity classes and metastore tables and fields
3、Tracing the execution flow through the source code

1、Writing a local test class that simulates reading tables from Hive

When reading the Spark SQL source code, we mostly generate test data with something like `df.createOrReplaceTempView("xxx")` for convenience. That is enough for more than 90% of what we want to study, but it cannot simulate the Hive case, and building a remote Hive environment to connect to takes a lot of effort.

Fortunately, inside the Spark SQL source project we can extend `TestHiveSingleton` to simulate Hive without setting one up.

(This is discussed in more detail in the source-code reading group.)

The test class code is as follows:
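The original post shows the test class as a screenshot. Below is a minimal sketch of what such a class looks like; the class name and table definition are illustrative, while `TestHiveSingleton` and `QueryTest` are the actual traits from Spark's `sql/hive` test sources:

```scala
import org.apache.spark.sql.QueryTest
import org.apache.spark.sql.hive.test.TestHiveSingleton

// Extending TestHiveSingleton gives the suite an embedded Hive-backed
// SparkSession, so no external Hive deployment is required.
class ShowCreateTableDebugSuite extends QueryTest with TestHiveSingleton {

  test("show create table against a simulated Hive metastore") {
    // Create a partitioned table, mirroring the `orders` example used below
    spark.sql(
      """CREATE TABLE orders (id INT, make STRING, price INT, state STRING, month INT)
        |USING parquet
        |PARTITIONED BY (state, month)""".stripMargin)

    // Print the generated DDL; a breakpoint here lets us step into
    // ShowCreateTableCommand and the metastore calls it triggers.
    spark.sql("SHOW CREATE TABLE orders").show(truncate = false)
  }
}
```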

2、The correspondence between Hive entity classes and metastore tables and fields

MTable (class) --> TBLS (table)

MDatabase (class) --> DBS (table)

MStorageDescriptor (class) --> SDS (table)

MFieldSchema (class) --> TYPE_FIELDS (table)

partitionKeys (field of MTable) --> PARTITION_KEYS (table)

parameters (field of MTable) --> TABLE_PARAMS (table)

The following JDO mapping configuration defines the correspondence between the class fields and the table columns:

<class name="MTable" table="TBLS" identity-type="datastore" detachable="true">
  <datastore-identity>
    <column name="TBL_ID"/>
  </datastore-identity>
  <index name="UniqueTable" unique="true">
    <column name="TBL_NAME"/>
    <column name="DB_ID"/>
  </index>
  <field name="tableName">
    <column name="TBL_NAME" length="256" jdbc-type="VARCHAR"/>
  </field>
  <field name="database">
    <column name="DB_ID"/>
  </field>
  <field name="partitionKeys" table="PARTITION_KEYS" >
    <collection element-type="MFieldSchema"/>
    <join>
      <primary-key name="PARTITION_KEY_PK">
        <column name="TBL_ID"/>
        <column name="PKEY_NAME"/>
      </primary-key>
      <column name="TBL_ID"/>
    </join>
    <element>
      <embedded>
        <field name="name">
          <column name="PKEY_NAME" length="128" jdbc-type="VARCHAR"/>
          </field>
        <field name="type">
          <column name="PKEY_TYPE" length="767" jdbc-type="VARCHAR" allows-null="false"/>
        </field>
        <field name="comment" >
          <column name="PKEY_COMMENT" length="4000" jdbc-type="VARCHAR" allows-null="true"/>
        </field>
      </embedded>
    </element>
  </field>
  <field name="sd" dependent="true">
    <column name="SD_ID"/>
  </field>
  <field name="owner">
    <column name="OWNER" length="767" jdbc-type="VARCHAR"/>
  </field>
  <field name="createTime">
    <column name="CREATE_TIME" jdbc-type="integer"/>
  </field>
  <field name="lastAccessTime">
    <column name="LAST_ACCESS_TIME" jdbc-type="integer"/>
  </field>
  <field name="retention">
    <column name="RETENTION" jdbc-type="integer"/>
  </field>
  <field name="parameters" table="TABLE_PARAMS">
    <map key-type="java.lang.String" value-type="java.lang.String"/>
    <join>
      <column name="TBL_ID"/>
    </join>
    <key>
       <column name="PARAM_KEY" length="256" jdbc-type="VARCHAR"/>
    </key>
    <value>
       <column name="PARAM_VALUE" length="32672" jdbc-type="VARCHAR"/>
    </value>
  </field>
  <field name="viewOriginalText" default-fetch-group="false">
    <column name="VIEW_ORIGINAL_TEXT" jdbc-type="LONGVARCHAR"/>
  </field>
  <field name="viewExpandedText" default-fetch-group="false">
    <column name="VIEW_EXPANDED_TEXT" jdbc-type="LONGVARCHAR"/>
  </field>
  <field name="rewriteEnabled">
    <column name="IS_REWRITE_ENABLED"/>
  </field>
  <field name="tableType">
    <column name="TBL_TYPE" length="128" jdbc-type="VARCHAR"/>
  </field>
</class>
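To make the mapping concrete, here is a hypothetical query against the metastore database that joins the mapped tables directly (table and column names come from the mapping above; `default.orders` is this article's running example):

```sql
-- Fetch a table's id and type plus its TBLPROPERTIES via the mapped tables
SELECT t.TBL_ID,
       t.TBL_TYPE,
       p.PARAM_KEY,
       p.PARAM_VALUE
FROM TBLS t
JOIN DBS d               ON t.DB_ID  = d.DB_ID
LEFT JOIN TABLE_PARAMS p ON t.TBL_ID = p.TBL_ID
WHERE d."NAME" = 'default'
  AND t.TBL_NAME = 'orders';
```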

3、Tracing the execution flow through the source code

By adding a `println`, we can output the physical execution plan of `show create table orders` and see that the class actually executed is `ShowCreateTableCommand`.
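You can see this for yourself with a quick sketch like the following, run inside any Spark session where the `orders` table exists:

```scala
val df = spark.sql("SHOW CREATE TABLE orders")
// The executed plan wraps ShowCreateTableCommand
println(df.queryExecution.executedPlan)
```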

The code flow is as follows:

Two core methods are involved:

Querying the Hive metastore (`ObjectStore.getMTable`):

  • `mtbl = (MTable) query.execute(table, db)` corresponds to the following SQL, which fetches the table's basic information (`tbl_id`, `tbl_type`, etc.):

SELECT
   DISTINCT 'org.apache.hadoop.hive.metastore.model.MTable' AS NUCLEUS_TYPE,
  A0.CREATE_TIME,
  A0.LAST_ACCESS_TIME,
  A0.OWNER,
  A0.RETENTION,
  A0.IS_REWRITE_ENABLED,
  A0.TBL_NAME,
  A0.TBL_TYPE,
  A0.TBL_ID 
FROM
  TBLS A0
  LEFT OUTER JOIN DBS B0 ON A0.DB_ID = B0.DB_ID
WHERE
  A0.TBL_NAME = ?
  AND B0."NAME" = ?

The SQL captured while debugging:

The correspondence between the SQL columns and the entity class fields:

The debugging process looks like this:

You can see that after this method executes, the basic fields have been populated.

  • `pm.retrieve(mtbl)` corresponds to the following SQL, which fetches `database` (`MDatabase`), `sd` (`MStorageDescriptor`), `parameters`, and `partitionKeys`:

SELECT
  B0."DESC",
  B0.DB_LOCATION_URI,
  B0."NAME",
  B0.OWNER_NAME,
  B0.OWNER_TYPE,
  B0.DB_ID,
  C0.INPUT_FORMAT,
  C0.IS_COMPRESSED,
  C0.IS_STOREDASSUBDIRECTORIES,
  C0.LOCATION,
  C0.NUM_BUCKETS,
  C0.OUTPUT_FORMAT,
  C0.SD_ID,
  A0.VIEW_EXPANDED_TEXT,
  A0.VIEW_ORIGINAL_TEXT
FROM
  TBLS A0
  LEFT OUTER JOIN DBS B0 ON A0.DB_ID = B0.DB_ID
  LEFT OUTER JOIN SDS C0 ON A0.SD_ID = C0.SD_ID
WHERE
  A0.TBL_ID = ?

The SQL captured while debugging:

The correspondence between the SQL columns and the entity class fields:

The debugging process looks like this:

When `parameters` and `partitionKeys` are actually materialized, another callback fires to fetch the schema:

Generating the final output from the Hive metadata (`ShowCreateTableCommand.run`):


private def showCreateDataSourceTable(metadata: CatalogTable, builder: StringBuilder): Unit = {
  // columns
  showDataSourceTableDataColumns(metadata, builder)
  // table options: storage format, etc.
  showDataSourceTableOptions(metadata, builder)
  // partitioning / bucketing columns
  showDataSourceTableNonDataColumns(metadata, builder)
  // table comment
  showTableComment(metadata, builder)
  // location
  showTableLocation(metadata, builder)
  // e.g. TBLPROPERTIES
  showTableProperties(metadata, builder)
}

The final spliced result:

CREATE TABLE `default`.`orders` (
  `id` INT,
  `make` STRING,
  `type` STRING,
  `price` INT,
  `pdate` STRING,
  `customer` STRING,
  `city` STRING,
  `state` STRING,
  `month` INT)
USING parquet
PARTITIONED BY (state, month)
TBLPROPERTIES (
  'transient_lastDdlTime' = '1651553453')
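This assembled DDL can also be retrieved programmatically rather than displayed; a sketch (the command returns a single-column row holding the generated statement):

```scala
// SHOW CREATE TABLE returns one row containing the generated DDL string
val ddl = spark.sql("SHOW CREATE TABLE default.orders").head().getString(0)
println(ddl)
```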

Copyright notice

This article was created by [Data warehouse practitioner]. When reposting, please include a link to the original. Thanks.
https://yzsam.com/2022/160/202206092154126971.html