OLAP analysis engine: Kylin 4.0
2022-06-25 04:45:00 【Youpei】
Video: https://www.bilibili.com/video/BV1Wt4y1s7rZ?share_source=copy_web
Kylin 4.0

- The underlying build engine uses Spark to connect to data sources such as Hadoop, Hive, and CSV.
- The constructed cube is stored as Parquet files on HDFS. A cube here is a special structure composed of dimensions; it is effectively a precomputed model in which every combination of the dimensions has been arranged in advance.
- Kylin's metadata, much like Hive's, is stored in MySQL.
- The routing layer works like a dispatcher: if an existing cube satisfies the query, Kylin reads the Parquet files; otherwise the query is converted to Spark SQL and run against the connected data source.
- The query engine parses the SQL statements sent from the user's client.
- The REST server exposes an API for clients, which can also be used to submit SQL.
- Because the build is a Spark program, its resources can be managed and scheduled by YARN or Kubernetes.
In short: given the fact table and the dimensions and measures you specify, Kylin automatically precomputes a system of 2^n - 1 cuboids, one per non-empty combination of the n dimensions.
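The 2^n - 1 figure is just the number of non-empty subsets of the n dimensions, each subset being one cuboid. A minimal sketch of that enumeration (the class and method names here are illustrative, not part of Kylin's API):

```java
import java.util.ArrayList;
import java.util.List;

public class CuboidEnum {
    // Enumerate every non-empty subset of the given dimensions.
    // Each subset corresponds to one precomputed cuboid in the cube.
    public static List<List<String>> cuboids(List<String> dims) {
        int n = dims.size();
        List<List<String>> result = new ArrayList<>();
        for (int mask = 1; mask < (1 << n); mask++) { // 2^n - 1 non-empty bitmasks
            List<String> cuboid = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                if ((mask & (1 << i)) != 0) cuboid.add(dims.get(i));
            }
            result.add(cuboid);
        }
        return result;
    }

    public static void main(String[] args) {
        // 3 dimensions -> 2^3 - 1 = 7 cuboids
        System.out.println(cuboids(List.of("year", "region", "product")).size());
    }
}
```

With three dimensions you get 7 cuboids; each extra dimension roughly doubles the count, which is why the pruning techniques discussed later (aggregation groups, derived dimensions) matter.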
Quick start
Docker setup
# Pull the image
docker pull apachekylin/apache-kylin-standalone:4.0.0
# Start the container
docker run -d \
-m 8G \
-p 7070:7070 \
-p 8088:8088 \
-p 50070:50070 \
-p 8032:8032 \
-p 8042:8042 \
-p 2181:2181 \
apachekylin/apache-kylin-standalone:4.0.0
After startup, open http://ip:7070/kylin in a browser to reach the web UI.
The supported data sources are currently Hive tables and CSV files; reading from Hive also synchronizes the Hive table metadata.

Import the prepared test tables directly.
- Model creation flow: create the model — select the relevant tables — select dimensions — select measures — set partitioning and filters

- Cube creation flow: create the cube (a model must be selected) — select dimensions (from the model's dimensions) — select measures (from the model's fact-table fields) — refresh settings (Parquet file merge intervals: by default 7 days for a small merge and 28 days for a large one) — advanced settings (aggregation groups, rowkey, cuboids) — Kylin parameter overrides (defaults come from kylin.properties) — final confirmation

- Build the cube

- While the build runs, a corresponding application also appears in YARN

- Once the build completes, you can run SQL queries; compared with the same query in Hive, it is very fast
Query considerations
The join type is specified when the model is created; unless query pushdown is enabled, SQL written with a different join type will report an error.
The queryable fields are specified when the model and cube are defined; unless query pushdown is enabled, selecting other fields will also report an error.
The aggregation functions are likewise specified in advance; unless query pushdown is enabled, aggregating other fields will also report an error.
The fact table must come first in the join, with dimension tables after it.
REST API calls
- Execute a SQL query with curl
curl -X POST -H "Authorization: Basic QURNSU46S1lMSU4=" -H 'Content-Type: application/json' -d '{"sql":"select dname,sum(sal) from emp e join dept d on e.deptno = d.deptno group by dname","project":"FirstProject"}' http://172.16.6.14:7070/kylin/api/query
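The `Authorization: Basic QURNSU46S1lMSU4=` header used above is simply the Base64 encoding of the default `ADMIN:KYLIN` credentials. A small sketch showing how to derive it for your own user (the class name is illustrative):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class KylinAuth {
    // Build the value for the "Authorization: Basic ..." HTTP header
    public static String basicAuth(String user, String password) {
        String raw = user + ":" + password;
        return "Basic " + Base64.getEncoder()
                .encodeToString(raw.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        System.out.println(basicAuth("ADMIN", "KYLIN")); // Basic QURNSU46S1lMSU4=
    }
}
```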
- A scheduled build script, which can be run with a scheduling tool such as Azkaban or Oozie
#!/bin/bash
# Get cube_name from the 1st argument
cube_name=$1
# Get the build date from the 2nd argument, defaulting to yesterday
if [ -n "$2" ]
then
do_date=$2
else
do_date=`date -d '-1 day' +%F`
fi
# Timestamp of the build date at 00:00:00 UTC (see the note below)
start_date_unix=`date -d "$do_date 08:00:00" +%s`
# Convert the second-level timestamp to milliseconds
start_date=$(($start_date_unix*1000))
# Timestamp of the build date at 24:00 (one day later, in ms)
stop_date=$(($start_date+86400000))
curl -X PUT -H "Authorization: Basic QURNSU46S1lMSU4=" -H 'Content-Type: application/json' -d '{"startTime":'$start_date', "endTime":'$stop_date', "buildType":"BUILD"}' http://172.16.6.14:7070/kylin/api/cubes/$cube_name/build
# Note: Kylin's time zone was not changed, so it only understands UTC. 00:00 UTC is 08:00 in UTC+8, which is why the script uses "$do_date 08:00:00" to compensate for the offset.
Query pushdown
With query pushdown enabled, queries that no built cube can answer are executed with Spark SQL directly against Hive.
- In kylin.properties under the conf directory:
kylin.query.pushdown.runner-class-name=org.apache.kylin.query.pushdown.PushDownRunnerSparkImpl
Query engine
- Sparder behaves like a long-running spark-shell: once started, it holds on to its resources. By default Kylin launches it on the first query, so the first query is usually slow; you can start it eagerly instead by setting the parameter to true:
kylin.query.auto-sparder-context-enabled=true
- HDFS directories
- Temporary job files: /project_name/job_tmp
- Cuboid file storage: /project_name/parquet/cube_name/segment_name_XXX
- Dimension table snapshots: /project_name/table_snapshot
- Spark run logs: /project_name/spark_logs

- Kylin query-related parameters
#### Spark master ####
#kylin.query.spark-conf.spark.master=yarn
#### Driver cores ####
#kylin.query.spark-conf.spark.driver.cores=1
#### Driver memory ####
#kylin.query.spark-conf.spark.driver.memory=4G
#### Driver off-heap (overhead) memory ####
#kylin.query.spark-conf.spark.driver.memoryOverhead=1G
#### Executor cores ####
#kylin.query.spark-conf.spark.executor.cores=1
#### Number of executors ####
#kylin.query.spark-conf.spark.executor.instances=1
#### Executor memory ####
#kylin.query.spark-conf.spark.executor.memory=4G
#### Executor off-heap (overhead) memory ####
#kylin.query.spark-conf.spark.executor.memoryOverhead=1G
Cube build optimization
Derived dimensions
- Normally, n dimension fields require 2^n - 1 cuboids to be created.
- With derived dimensions, the primary key of a dimension table stands in for that table's other dimension columns. Suppose two dimension tables each have three dimension fields: normally that means 2^6 - 1 = 63 cuboids, but using only the two primary keys it is 2^2 - 1 = 3 cuboids.
- This is not generally recommended: Kylin's strength lies in precomputing the cube, and derived dimensions shift work from build time to query time, reducing precomputation, which is the wrong way around.
- When building the cube and choosing each dimension's type, the Derived keyword marks a derived dimension.

Aggregation groups
- Aggregation groups target cuboids that will never be used in the actual environment, so they can be skipped during precomputation.
- Mandatory dimensions: every built cuboid must include this dimension; cuboids without it are not built. Note that a mandatory dimension cannot appear alone; the following is also wrong

- Hierarchy dimensions: a prerequisite relationship between dimensions; whenever B appears, A must appear (A -> B)

- Joint dimensions: the listed dimensions must always appear together

- A practical cube-building application

Cube parameter tuning
Use appropriate Spark resources; the cube build can also tune the Spark resources it runs with through its own parameters.

Global dictionary: used mainly for count-distinct. Integer values can be deduplicated directly with a bitmap; for String values, Kylin first builds a string-to-integer mapping and then deduplicates the mapped values with the bitmap.
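The idea behind the global dictionary can be sketched with standard Java classes: assign each string a dense integer id (the "dictionary"), then set that id's bit in a bitmap, and the distinct count is the bitmap's cardinality. This is only an illustration of the technique, not Kylin's actual implementation:

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

public class GlobalDictSketch {
    private final Map<String, Integer> dict = new HashMap<>(); // string -> dense id
    private final BitSet bitmap = new BitSet();                // ids seen so far

    public void add(String value) {
        // Assign the next free id on first sight, then mark it in the bitmap.
        int id = dict.computeIfAbsent(value, v -> dict.size());
        bitmap.set(id);
    }

    public int distinctCount() {
        return bitmap.cardinality();
    }

    public static void main(String[] args) {
        GlobalDictSketch d = new GlobalDictSketch();
        for (String s : new String[]{"a", "b", "a", "c", "b"}) d.add(s);
        System.out.println(d.distinctCount()); // 3
    }
}
```

Mapping strings to dense integers is what makes the bitmap small: without the dictionary, arbitrary strings have no natural bit position to occupy.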

- Snapshot table optimization: each snapshot table corresponds to a Hive dimension table

Query performance optimization
- Sort columns: the column order used when building the rowkey; you can drag to reorder the columns. On the Rowkeys page, besides the column ordering, you can also shard by a chosen column; sharding can improve query performance and maps to how Kylin stores the underlying Parquet files.

- Reduce small or unevenly sized Parquet files

Connection tool integration
JDBC
<dependencies>
<dependency>
<groupId>org.apache.kylin</groupId>
<artifactId>kylin-jdbc</artifactId>
<version>4.0.1</version>
</dependency>
</dependencies>
import java.sql.*;

public class KylinTest {
    public static void main(String[] args) throws Exception {
        // Kylin JDBC driver
        String KYLIN_DRIVER = "org.apache.kylin.jdbc.Driver";
        // Kylin URL
        String KYLIN_URL = "jdbc:kylin://172.16.6.14:7071/FirstProject";
        // Kylin username
        String KYLIN_USER = "ADMIN";
        // Kylin password
        String KYLIN_PASSWD = "KYLIN";
        // Register the driver
        Class.forName(KYLIN_DRIVER);
        // Get a connection
        Connection connection = DriverManager.getConnection(KYLIN_URL, KYLIN_USER, KYLIN_PASSWD);
        // Prepare the SQL
        PreparedStatement ps = connection.prepareStatement(
                "select dname, sum(sal) from emp e join dept d on e.deptno = d.deptno group by dname");
        // Execute the query
        ResultSet resultSet = ps.executeQuery();
        // Iterate and print the results
        while (resultSet.next()) {
            System.out.println(resultSet.getString(1) + ":" + resultSet.getDouble(2));
        }
    }
}
MDX integration
It can be deployed with Docker. Pull the image:
docker pull apachekylin/apache-kylin-standalone:kylin-4.0.1-mondrian
Start the container:
docker run -d \
-m 8G \
-p 7070:7070 \
-p 7080:7080 \
-p 8088:8088 \
-p 50070:50070 \
-p 8032:8032 \
-p 8042:8042 \
-p 2181:2181 \
apachekylin/apache-kylin-standalone:kylin-4.0.1-mondrian
- Kylin page: http://127.0.0.1:7070/kylin/login
- MDX for Kylin page: http://127.0.0.1:7080
- HDFS NameNode page: http://127.0.0.1:50070
- YARN ResourceManager page: http://127.0.0.1:8088