
Understanding the execution principle of `show create table`

2022-06-09 22:42:00 Data warehouse practitioner

This article walks through the source-level flow of the `show create table` command: how Spark SQL interacts with the Hive metastore, queries the table's metadata, and splices the final result that is shown to the user.

This article also grew out of a discussion in the source-code reading group. A little background first:

We usually pay close attention to query statements such as `select`, but rarely to how a statement like `show create table` is executed, and it is genuinely hard to find blog posts covering it. Taking that question as a starting point, I spent two hours digging through the source code and reached the basic conclusions below.

Several group members asked me to record a video; I'll write this article first and record a short screencast afterwards.

Now let's dig in. Source code can be dry reading, but it is also a window through which we can see how things really work.

This article is based on Spark 3.2.

Outline of this article:

1、Writing a local test class that simulates reading tables from Hive
2、The correspondence between Hive entity classes and metastore tables and fields
3、Tracing the execution flow through the source code

1、Writing a local test class that simulates reading tables from Hive

When reading the Spark SQL source code, we mostly generate test data with something like `df.createOrReplaceTempView("xxx")` for convenience. That is enough for more than 90% of what we want to study, but it cannot simulate the Hive case, and building a remote Hive environment to connect to takes a lot of effort.

Fortunately, inside the Spark SQL source project we can extend `TestHiveSingleton` to simulate Hive without setting one up.

(This is discussed in more detail in the source-code reading group.)

The test class code is as follows:
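The original post shows the test class as a screenshot. Below is a minimal sketch of what such a class looks like; the class name and table definition are illustrative, while `TestHiveSingleton` and `QueryTest` are the actual traits from Spark's `sql/hive` test sources:

```scala
import org.apache.spark.sql.QueryTest
import org.apache.spark.sql.hive.test.TestHiveSingleton

// Extending TestHiveSingleton gives the suite an embedded Hive-backed
// SparkSession, so no external Hive deployment is required.
class ShowCreateTableDebugSuite extends QueryTest with TestHiveSingleton {

  test("show create table against a simulated Hive metastore") {
    // Create a partitioned table, mirroring the `orders` example used below
    spark.sql(
      """CREATE TABLE orders (id INT, make STRING, price INT, state STRING, month INT)
        |USING parquet
        |PARTITIONED BY (state, month)""".stripMargin)

    // Print the generated DDL; a breakpoint here lets us step into
    // ShowCreateTableCommand and the metastore calls it triggers.
    spark.sql("SHOW CREATE TABLE orders").show(truncate = false)
  }
}
```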

2、The correspondence between Hive entity classes and metastore tables and fields

MTable (class) --> TBLS (table)

MDatabase (class) --> DBS (table)

MStorageDescriptor (class) --> SDS (table)

MFieldSchema (class) --> TYPE_FIELDS (table)

partitionKeys (field of MTable) --> PARTITION_KEYS (table)

parameters (field of MTable) --> TABLE_PARAMS (table)

The following JDO mapping configuration defines the correspondence between the class fields and the table columns:

<class name="MTable" table="TBLS" identity-type="datastore" detachable="true">
  <datastore-identity>
    <column name="TBL_ID"/>
  </datastore-identity>
  <index name="UniqueTable" unique="true">
    <column name="TBL_NAME"/>
    <column name="DB_ID"/>
  </index>
  <field name="tableName">
    <column name="TBL_NAME" length="256" jdbc-type="VARCHAR"/>
  </field>
  <field name="database">
    <column name="DB_ID"/>
  </field>
  <field name="partitionKeys" table="PARTITION_KEYS" >
    <collection element-type="MFieldSchema"/>
    <join>
      <primary-key name="PARTITION_KEY_PK">
        <column name="TBL_ID"/>
        <column name="PKEY_NAME"/>
      </primary-key>
      <column name="TBL_ID"/>
    </join>
    <element>
      <embedded>
        <field name="name">
          <column name="PKEY_NAME" length="128" jdbc-type="VARCHAR"/>
          </field>
        <field name="type">
          <column name="PKEY_TYPE" length="767" jdbc-type="VARCHAR" allows-null="false"/>
        </field>
        <field name="comment" >
          <column name="PKEY_COMMENT" length="4000" jdbc-type="VARCHAR" allows-null="true"/>
        </field>
      </embedded>
    </element>
  </field>
  <field name="sd" dependent="true">
    <column name="SD_ID"/>
  </field>
  <field name="owner">
    <column name="OWNER" length="767" jdbc-type="VARCHAR"/>
  </field>
  <field name="createTime">
    <column name="CREATE_TIME" jdbc-type="integer"/>
  </field>
  <field name="lastAccessTime">
    <column name="LAST_ACCESS_TIME" jdbc-type="integer"/>
  </field>
  <field name="retention">
    <column name="RETENTION" jdbc-type="integer"/>
  </field>
  <field name="parameters" table="TABLE_PARAMS">
    <map key-type="java.lang.String" value-type="java.lang.String"/>
    <join>
      <column name="TBL_ID"/>
    </join>
    <key>
       <column name="PARAM_KEY" length="256" jdbc-type="VARCHAR"/>
    </key>
    <value>
       <column name="PARAM_VALUE" length="32672" jdbc-type="VARCHAR"/>
    </value>
  </field>
  <field name="viewOriginalText" default-fetch-group="false">
    <column name="VIEW_ORIGINAL_TEXT" jdbc-type="LONGVARCHAR"/>
  </field>
  <field name="viewExpandedText" default-fetch-group="false">
    <column name="VIEW_EXPANDED_TEXT" jdbc-type="LONGVARCHAR"/>
  </field>
  <field name="rewriteEnabled">
    <column name="IS_REWRITE_ENABLED"/>
  </field>
  <field name="tableType">
    <column name="TBL_TYPE" length="128" jdbc-type="VARCHAR"/>
  </field>
</class>
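To make the mapping concrete, here is a hypothetical query against the metastore database that joins the mapped tables directly (table and column names come from the mapping above; `default.orders` is this article's running example):

```sql
-- Fetch a table's id and type plus its TBLPROPERTIES via the mapped tables
SELECT t.TBL_ID,
       t.TBL_TYPE,
       p.PARAM_KEY,
       p.PARAM_VALUE
FROM TBLS t
JOIN DBS d               ON t.DB_ID  = d.DB_ID
LEFT JOIN TABLE_PARAMS p ON t.TBL_ID = p.TBL_ID
WHERE d."NAME" = 'default'
  AND t.TBL_NAME = 'orders';
```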

3、Tracing the execution flow through the source code

By adding a `println`, we can output the physical execution plan of `show create table orders` and see that the class actually executed is `ShowCreateTableCommand`.
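You can see this for yourself with a quick sketch like the following, run inside any Spark session where the `orders` table exists:

```scala
val df = spark.sql("SHOW CREATE TABLE orders")
// The executed plan wraps ShowCreateTableCommand
println(df.queryExecution.executedPlan)
```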

The code flow is as follows:

Two core methods are involved:

Querying the Hive metastore (`ObjectStore.getMTable`):

  • `mtbl = (MTable) query.execute(table, db)` corresponds to the following SQL, which fetches the table's basic information (`tbl_id`, `tbl_type`, etc.):

SELECT
   DISTINCT 'org.apache.hadoop.hive.metastore.model.MTable' AS NUCLEUS_TYPE,
  A0.CREATE_TIME,
  A0.LAST_ACCESS_TIME,
  A0.OWNER,
  A0.RETENTION,
  A0.IS_REWRITE_ENABLED,
  A0.TBL_NAME,
  A0.TBL_TYPE,
  A0.TBL_ID 
FROM
  TBLS A0
  LEFT OUTER JOIN DBS B0 ON A0.DB_ID = B0.DB_ID
WHERE
  A0.TBL_NAME = ?
  AND B0."NAME" = ?

The SQL captured while debugging:

The correspondence between the SQL columns and the entity class fields:

The debugging process looks like this:

You can see that after this method executes, the basic fields have been populated.

  • `pm.retrieve(mtbl)` corresponds to the following SQL, which fetches `database` (`MDatabase`), `sd` (`MStorageDescriptor`), `parameters`, and `partitionKeys`:

SELECT
  B0."DESC",
  B0.DB_LOCATION_URI,
  B0."NAME",
  B0.OWNER_NAME,
  B0.OWNER_TYPE,
  B0.DB_ID,
  C0.INPUT_FORMAT,
  C0.IS_COMPRESSED,
  C0.IS_STOREDASSUBDIRECTORIES,
  C0.LOCATION,
  C0.NUM_BUCKETS,
  C0.OUTPUT_FORMAT,
  C0.SD_ID,
  A0.VIEW_EXPANDED_TEXT,
  A0.VIEW_ORIGINAL_TEXT
FROM
  TBLS A0
  LEFT OUTER JOIN DBS B0 ON A0.DB_ID = B0.DB_ID
  LEFT OUTER JOIN SDS C0 ON A0.SD_ID = C0.SD_ID
WHERE
  A0.TBL_ID = ?

The SQL captured while debugging:

The correspondence between the SQL columns and the entity class fields:

The debugging process looks like this:

When `parameters` and `partitionKeys` are actually materialized, another callback fires to fetch the schema:

Generating the final output from the Hive metadata (`ShowCreateTableCommand.run`):


private def showCreateDataSourceTable(metadata: CatalogTable, builder: StringBuilder): Unit = {
  // columns
  showDataSourceTableDataColumns(metadata, builder)
  // table options: storage format, etc.
  showDataSourceTableOptions(metadata, builder)
  // partitioning / bucketing columns
  showDataSourceTableNonDataColumns(metadata, builder)
  // table comment
  showTableComment(metadata, builder)
  // location
  showTableLocation(metadata, builder)
  // e.g. TBLPROPERTIES
  showTableProperties(metadata, builder)
}

The final spliced result:

CREATE TABLE `default`.`orders` (
  `id` INT,
  `make` STRING,
  `type` STRING,
  `price` INT,
  `pdate` STRING,
  `customer` STRING,
  `city` STRING,
  `state` STRING,
  `month` INT)
USING parquet
PARTITIONED BY (state, month)
TBLPROPERTIES (
  'transient_lastDdlTime' = '1651553453')
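This assembled DDL can also be retrieved programmatically rather than displayed; a sketch (the command returns a single-column row holding the generated statement):

```scala
// SHOW CREATE TABLE returns one row containing the generated DDL string
val ddl = spark.sql("SHOW CREATE TABLE default.orders").head().getString(0)
println(ddl)
```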

Copyright notice

This article was created by [Data warehouse practitioner]. When reposting, please include a link to the original. Thanks.
https://yzsam.com/2022/160/202206092154126971.html