Clickhouse SQL优化技巧

2023-08-16

Clickhouse SQL优化技巧

1.使用分区
clickhouse的表，走索引和非索引效率差距很大，在使用一个表进行查询时，必须限制索引字段。避免扫描全表
确定索引分区字段，可以用show create table default.ods_user，查看本地表的建表语句，partition by 的字段就是分区字段。
如果需要限制的时间和分区字段不是同一个字段时，可以扩大分区字段取数区间，然后再过滤

2.distinct 和 group by
优先使用group by，distinct满足不了的情况，可以使用group by，
如果count(distinct uid)查询响应过长，可以使用groupBitmap(uid)代替

3.global in，global not in 和bitmap
当涉及到集合in的操作，如果字段是Int类型，可以转化成bitmap进行计算，效率会大幅提高

-- global in
select uid from xxx where uid global in (select uid from yyyy)

-- bitmap
with (select groupbitmapState(toUint64(uid)) from yyyy) as bitmap_uid
select uid from xxx where bitmapContains(bitmap_uid,uid)==1

-- global not in
select uid from xxx where uid global in (select uid from yyyy)

-- bitmap
with (select groupbitmapState(toUint64(uid)) from yyyy) as bitmap_uid
select uid from xxx where bitmapContains(bitmap_uid,uid)==0

4.使用with子句代替重复查询，查询耗时有可能减半
-- 重复子查询
select t1.*, t2.*
from (
select *
from test1_all
where uid global in (
select distinct uid
from test_all
where toDate(totalDate) = yesterday()
)
) as t1 global
left join(
select distinct uid
from test2_all
where uid global not in (
select uid
from test_all
where toDate(totalDate) = yesterday()
)
) as t2 using uid;

-- 使用with子句+bitmap代替

with(
select groupBitmapState(uid)
from test_all
where toDate(totalDate) = yesterday()
) as bm
select t1.*, t2.*
from (
select *
from test1_all
where bitmapContains(bm)==1
) as t1 global
left join(
select *
from test2_all/
where bitmapContains(bm)==0
) as t2 using uid;

5.global in 和 in
in子查询，使用global in
in 枚举值，可以使用in，例如 age in (10,18)

6.global all left join
能够提前过滤的左右表数据，提前过滤
左边大表，右边小表。切忌小表join大表

join条件，数值类比字符串类型快

global all left join与global left join并无区别

7.为什么查询sql有时候能执行
为什么有些sql，有时候能运行有时候不能运行，因为我们的集群的ck机器，有些是128g内存，有些是256g内存的，all表是随机路由的，如果路由到256g的可以运行，128g可能就会失败了，对应此种情况，我们做出的解决方案是配置一个新的lb路由，将大查询sql全部路由到256g的机器上

8.优化整体思路
首先检查是否使用分区字段，必须使用分区字段避免扫描全表
分而治之，如果一个业务的SQL，如果按某个字段，分成N批执行，最终的结果不变，那么就可以采取分批的方式优化，比如mod(uid,10)=batch,这样来实现分10批执行。
提前缩小数据量，如果涉及到的表数据过大，可以提前过滤数据，再进行join操作，而不是join之后再加where条件过滤。尽量做到在每个select都能够过滤一部分数据