ClickHouse Study Notes (7): Table Query Optimization
2022-07-29 05:09:00 【阳光里哭泣的狗】
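Single table
prewhere
prewhere serves the same purpose as where: it filters data. The difference is that it first reads only the columns used in the filter condition, applies the filter, and only then reads the remaining columns listed in select for the rows that passed, which reduces I/O. explain syntax prints the query after optimization, so it can be used to check whether a where clause was moved to prewhere: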
explain syntax select WatchID,
JavaEnable,
Title,
GoodEvent,
EventTime,
EventDate,
CounterID,
ClientIP,
ClientIP6,
RegionID,
UserID,
CounterClass,
OS,
UserAgent,
URL,
Referer,
URLDomain,
RefererDomain,
Refresh,
IsRobot,
RefererCategories,
URLCategories,
URLRegions,
RefererRegions,
ResolutionWidth,
ResolutionHeight,
ResolutionDepth,
FlashMajor,
FlashMinor,
FlashMinor2
from datasets.hits_v1 where UserID='3198390223272470366';
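Automatically moving where to prewhere is enabled by default. To turn it off: set optimize_move_to_prewhere=0;
In some scenarios the automatic rewrite does not happen and prewhere has to be specified by hand, so simply writing prewhere in place of where is the simplest and clearest approach.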
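As a minimal sketch of the manual form, the query below uses the same filter as the example above but writes prewhere explicitly. The shortened column list is only for illustration; the table and filter value are taken from the example above.

```sql
-- Explicit PREWHERE: the UserID column is read and filtered first,
-- then the other selected columns are read only for the matching rows.
SELECT WatchID,
       Title,
       EventTime,
       URL
FROM datasets.hits_v1
PREWHERE UserID = '3198390223272470366';
```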
Data sampling
SELECT Title,count(*) AS PageViews
FROM hits_v1
SAMPLE 0.1
WHERE CounterID =57
GROUP BY Title
ORDER BY PageViews DESC LIMIT 100;
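SAMPLE 0.1 runs the query on roughly 10% of the data (for example, about 100 rows out of 1000), so the result is an approximation, not the exact figure.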
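Because only about a tenth of the rows are read, the raw counts from a sampled query are also about a tenth of the real ones. A common way to get an estimate of the full count is to scale the aggregate by the inverse of the sampling ratio. The sketch below assumes the table was created with a SAMPLE BY clause (as in the hits_v2 definition later in this article), since SAMPLE only works on tables that have a sampling key.

```sql
-- Scale the sampled count by 10 to estimate the full-table page views.
SELECT Title, count() * 10 AS PageViews
FROM hits_v1
SAMPLE 0.1
WHERE CounterID = 57
GROUP BY Title
ORDER BY PageViews DESC LIMIT 100;
```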
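Column pruning and partition pruning
In short, column pruning means selecting named columns instead of *, and partition pruning means reading only the partitions you need.
When the data volume is large, avoid select *. The fewer columns you read, the less I/O is consumed and the better the performance.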
select WatchID,
JavaEnable,
Title,
GoodEvent,
EventTime,
EventDate,
CounterID,
ClientIP,
ClientIP6,
RegionID,
UserID
from datasets.hits_v1;
Partition pruning means reading only the partitions that are needed, by specifying them in the filter condition.
select WatchID,
JavaEnable,
Title,
GoodEvent,
ClientIP6,
RegionID,
UserID
from datasets.hits_v1
where EventDate='2014-03-23';
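To see which partitions exist, and therefore what a filter on EventDate can skip, the system.parts table can be queried. This is only a sketch: it assumes hits_v1 lives in the datasets database and is partitioned by month of EventDate, like the hits_v2 table defined later in this article.

```sql
-- List the active partitions of hits_v1 and how many rows each holds.
SELECT partition, sum(rows) AS row_count
FROM system.parts
WHERE database = 'datasets'
  AND table = 'hits_v1'
  AND active
GROUP BY partition
ORDER BY partition;
```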

order by combined with where and limit
When running order by on data sets of more than ten million rows, combine it with a where condition and a limit clause.
SELECT UserID,Age
FROM hits_v1
PREWHERE CounterID=57
ORDER BY Age DESC LIMIT 1000

Avoid building virtual columns
That is, try not to use as to create a new computed column in the result set.
For example: select a/b as t from test
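One way to follow this advice, sketched here with the same hypothetical test table and columns a and b, is to return the raw columns and do the division in the application that consumes the result.

```sql
-- Return the stored columns as they are; compute a / b on the client side.
SELECT a, b
FROM test;
```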
uniqCombined instead of distinct
uniqCombined performs approximate deduplication.
count(distinct ...) uses uniqExact, which performs exact deduplication.
select count(distinct rand()) from hits_v1;

SELECT uniqCombined(rand()) from datasets.hits_v1;
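The rand() examples above are mainly useful for comparing speed. To compare the results themselves, both functions can be run on a real column in one query. A small sketch, using the UserID column of the same table:

```sql
-- Exact and approximate distinct counts side by side.
SELECT uniqExact(UserID)    AS exact_users,
       uniqCombined(UserID) AS approx_users
FROM datasets.hits_v1;
```

The two numbers should be close; whether the approximation error is acceptable depends on the use case.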

Multiple tables
Preparation
Create a small table, so that the join examples do not run out of memory
CREATE TABLE visits_v2
ENGINE = CollapsingMergeTree(Sign)
PARTITION BY toYYYYMM(StartDate)
ORDER BY (CounterID, StartDate, intHash32(UserID), VisitID)
SAMPLE BY intHash32(UserID)
SETTINGS index_granularity = 8192
as select * from visits_v1 limit 10000;
Create a result table to hold the output, so that the client is not overwhelmed rendering a huge result set
CREATE TABLE hits_v2
ENGINE = MergeTree()
PARTITION BY toYYYYMM(EventDate)
ORDER BY (CounterID, EventDate, intHash32(UserID))
SAMPLE BY intHash32(UserID)
SETTINGS index_granularity = 8192
as select * from hits_v1 where 1=0;
Prefer in over join
insert into hits_v2
select a.* from hits_v1 a where a.CounterID in (select CounterID from visits_v1);

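If you must use join
!!! The small table must go on the right. ClickHouse loads the right-hand table into memory and compares it against the left-hand table. Whichever join type is used, only the right-hand table is loaded into memory.
When the large table is on the right, the query fails outright because there is not enough memory.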
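A minimal sketch of the rule, using the small visits_v2 table from the preparation step as the right-hand side. The count() is only there to keep the output small; it is not part of the original example.

```sql
-- Large table (hits_v1) on the left, small table (visits_v2) on the right,
-- so only the 10000-row table has to be held in memory for the join.
SELECT count()
FROM hits_v1 AS a
INNER JOIN visits_v2 AS b ON a.CounterID = b.CounterID;
```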
