# HiveQL / Spark SQL Common Functions

1. Window functions: basic structure; row_number/rank/dense_rank; sum/avg/max/min; lag/lead/first_value/last_value; ntile; count over
2. Array functions: array; collect_list/collect_set; array_contains; split
3. Regular-expression functions: like; regexp/rlike; regexp_extract; regexp_replace; common patterns
4. String functions: concat; concat_ws; instr; substr
5. Date and time functions: from_unixtime/unix_timestamp; pdate2date/date2pdate; date_sub/add_months; datediff; trunc
6. Exploding arrays into rows: explode; lateral view explode; lateral view posexplode
7. Percentile functions: percentile & percentile_approx
8. JSON functions: get_json_object; json_tuple
9. Miscellaneous: greatest; rand; max_pt; coalesce/nvl
10. References
# 1. Window Functions

## 1.1 Basic structure

A window function = an analytic function + an OVER clause.

- Analytic function: `sum()`, `max()`, `row_number()`, and so on.
- OVER clause: `over (partition by user_id order by date [asc|desc])` — split the rows into partitions by `user_id`, then sort each partition by `date`; `asc` is ascending (the default), `desc` is descending.
## 1.2 row_number / rank / dense_rank: ranking functions

- `row_number() over (...)`: consecutive ranks with no ties.
- `rank() over (...)`: ties share a rank and leave gaps after them.
- `dense_rank() over (...)`: ties share a rank without leaving gaps.

```sql
SELECT
  user_id,
  -- consecutive ranks, no ties
  row_number() over(partition by user_id order by date) as row_number,
  -- ties share a rank and leave gaps
  rank() over(partition by user_id order by date) as rank,
  -- ties share a rank without gaps
  dense_rank() over(partition by user_id order by date) as dense_rank
FROM
  app_launch
```

user_id | date | row_number | rank | dense_rank |
---|---|---|---|---|
12345 | 20190101 | 1 | 1 | 1 |
12345 | 20190101 | 2 | 1 | 1 |
12345 | 20190102 | 3 | 3 | 2 |
## 1.3 sum / avg / max / min: aggregate functions

With an `order by` in the OVER clause, the aggregate runs from the start of the partition to the current row:

```sql
SELECT
  user_id,
  sum(salary) over(partition by user_id order by date) as sum, -- running total
  avg(salary) over(partition by user_id order by date) as avg, -- running average
  max(salary) over(partition by user_id order by date) as max, -- running max
  min(salary) over(partition by user_id order by date) as min  -- running min
FROM
  app_launch
```

user_id | date | salary | sum | avg | max | min |
---|---|---|---|---|---|---|
12345 | 201901 | 100 | 100 | 100 | 100 | 100 |
12345 | 201902 | 200 | 300 | 150 | 200 | 100 |
12345 | 201903 | 300 | 600 | 200 | 300 | 100 |

Without an `order by`, the aggregate covers every row in the partition:

```sql
SELECT
  user_id,
  sum(salary) over(partition by user_id) as sum, -- total over the whole partition
  avg(salary) over(partition by user_id) as avg, -- average over the whole partition
  max(salary) over(partition by user_id) as max, -- max over the whole partition
  min(salary) over(partition by user_id) as min  -- min over the whole partition
FROM
  app_launch
```

user_id | date | salary | sum | avg | max | min |
---|---|---|---|---|---|---|
12345 | 201901 | 100 | 600 | 200 | 300 | 100 |
12345 | 201902 | 200 | 600 | 200 | 300 | 100 |
12345 | 201903 | 300 | 600 | 200 | 300 | 100 |
## 1.4 lag / lead / first_value / last_value: neighboring rows

- `lag(col, n, default) over (...)`: the value of `col` n rows before the current row in the window.
- `lead(col, n, default) over (...)`: the value of `col` n rows after the current row.

```sql
SELECT
  user_id,
  lag(date, 1, "") over(partition by user_id order by date) as lag,
  lead(date, 1) over(partition by user_id order by date) as lead,
  first_value(date) over(partition by user_id order by date) as first_value,
  first_value(date) over(partition by user_id order by date desc) as first_value_desc
FROM
  app_launch
```

user_id | date | lag | lead | first_value | first_value_desc |
---|---|---|---|---|---|
12345 | 20190101 | "" | 20190102 | 20190101 | 20190103 |
12345 | 20190102 | 20190101 | 20190103 | 20190101 | 20190103 |
12345 | 20190103 | 20190102 | NULL | 20190101 | 20190103 |

Note that `last_value` with an `order by` uses the default frame (unbounded preceding to the current row), so it returns the current row's value rather than the partition's last one; to get the partition's last value, sort in the opposite direction and take `first_value` as above:

```sql
SELECT
  user_id,
  last_value(date) over(partition by user_id order by date) as last_value,
  last_value(date) over(partition by user_id order by date desc) as last_value2
FROM
  app_launch
```

user_id | date | last_value | last_value2 |
---|---|---|---|
12345 | 20190101 | 20190101 | 20190103 |
12345 | 20190102 | 20190102 | 20190102 |
12345 | 20190103 | 20190103 | 20190101 |
## 1.5 ntile: bucketing into n groups

`ntile(n)` distributes the ordered rows of a partition into `n` buckets, numbered from 1, and returns each row's bucket number. When the rows cannot be divided evenly, the lower-numbered buckets receive the extra rows, so no two buckets differ in size by more than one. `n` must be a positive integer literal; `PARTITION BY` defines the partitions `ntile` is applied to, and `ORDER BY` fixes the order in which rows are assigned to buckets:

```sql
NTILE(n) OVER (
  PARTITION BY <expression>[{,<expression>...}]
  ORDER BY <expression> [ASC|DESC][{,<expression>...}]
)
```

The bucket number can then be used to select, say, the top or bottom n-th fraction of the rows.

User income table `user_salary`:

user_id | salary |
---|---|
1 | 100 |
2 | 200 |
3 | 300 |

```sql
SELECT
  user_id,
  salary,
  ntile(2) over(order by salary) as rn
FROM
  user_salary
```

user_id | salary | rn |
---|---|---|
1 | 100 | 1 |
2 | 200 | 1 |
3 | 300 | 2 |

```sql
-- average income of the top-50% earners
SELECT
  avg(salary) as avg_salary
FROM
(
  SELECT
    user_id,
    salary,
    ntile(2) over(order by salary desc) as rn
  FROM
    user_salary
) a
WHERE rn = 1
```
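The bucketing rule described above — sizes differ by at most one, lower-numbered buckets get the extras — can be sketched in Python (an illustrative model, not Hive's implementation):

```python
def ntile(n, rows):
    """Assign 1-based bucket numbers to already-ordered rows.

    Bucket sizes differ by at most one; lower-numbered buckets receive
    the extra rows, mirroring Hive's ntile.
    """
    q, r = divmod(len(rows), n)
    out = []
    for b in range(1, n + 1):
        # the first r buckets each hold one extra row
        out.extend([b] * (q + (1 if b <= r else 0)))
    return out

print(ntile(2, [100, 200, 300]))  # [1, 1, 2] — matches the user_salary example
```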
## 1.6 count over

```sql
SELECT
  *,
  count(1) over(partition by department) as cnt1,
  count(1) over(order by department) as cnt2,
  count(1) over(partition by department order by name) as cnt3
FROM
  dm_homed.test_count_over
```

`cnt1` is the row count of the whole department; `cnt2` is a running count across departments (with only an `order by`, the default RANGE frame includes every peer row with the same `department`); `cnt3` is a running count within each department, ordered by `name`.

name | department | salary | cnt1 | cnt2 | cnt3 |
---|---|---|---|---|---|
王五 | 人事 | 4000 | 1 | 1 | 1 |
刘二 | 研发 | 6500 | 3 | 4 | 1 |
赵六 | 研发 | 7000 | 3 | 4 | 2 |
陈一 | 研发 | 6500 | 3 | 4 | 3 |
张三 | 财务 | 5000 | 2 | 6 | 1 |
李四 | 财务 | 5200 | 2 | 6 | 2 |
# 2. Array Functions

## 2.1 array: build an array

`array(val1, val2, val3, ...)` builds an array from its arguments. Elements are accessed as `arr[index]`, with `index` starting at 0:

```sql
SELECT ARRAY(0,1,2);
-- returns [0,1,2]

SELECT ARRAY(0,1,2)[1];
-- returns 1
```
## 2.2 collect_list / collect_set: aggregate a column into an array

Both collect a column's values within each group into an array; the difference is that `collect_list` keeps duplicates while `collect_set` de-duplicates. The argument to `collect_set` must be a string, hence the cast below.

user_id | date |
---|---|
12345 | 1001 |
12345 | 1001 |
12345 | 1002 |

```sql
SELECT
  user_id,
  concat_ws(',', collect_list(cast(date as string))) as dates_list,
  concat_ws(',', collect_set(cast(date as string))) as dates_set
FROM
  app_launch
GROUP BY
  user_id
```

user_id | collect_list | collect_set |
---|---|---|
12345 | 1001,1001,1002 | 1001,1002 |
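In Python terms the two aggregations look like this (a rough analogy; note that Hive's `collect_set` does not promise any element order, while `dict.fromkeys` keeps first-seen order):

```python
dates = ['1001', '1001', '1002']

# collect_list keeps duplicates
as_list = ','.join(dates)

# collect_set de-duplicates (order shown here is first-seen, which Hive
# does not guarantee)
as_set = ','.join(dict.fromkeys(dates))

print(as_list, as_set)  # 1001,1001,1002 1001,1002
```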
## 2.3 array_contains: membership test

Returns true/false depending on whether the array contains the element:

```sql
SELECT array_contains(ARRAY(0,1), 0);
-- returns true
```
## 2.4 split: split a string into an array

```sql
SELECT split('装修避坑,选材经验,设计案例', ',');
-- returns ["装修避坑", "选材经验", "设计案例"]
```
# 3. Regular-Expression Functions

## 3.1 like: pattern match

Used mainly in WHERE clauses to filter rows; `%` is the wildcard; returns true/false:

```sql
SELECT "news_home/electric, news_home" like '%news_home%';
-- returns true
SELECT "news_house/electric" like '%news_home%';
-- returns false
```
## 3.2 regexp / rlike: regex match

Also used mainly in WHERE clauses; returns true/false for a regular-expression match:

```sql
-- category  = "news_home/electric, news_home"
-- category1 = "news_home, news_home/electric"
SELECT category regexp 'news_home$';
-- returns true
SELECT category1 regexp 'news_home$';
-- returns false

-- all digits: "^[0-9]+$"
SELECT "123123123" RLIKE "^[0-9]+$";
-- returns true
SELECT "123123123a" RLIKE "^[0-9]+$";
-- returns false

-- all letters: "^[A-Za-z]+$"
SELECT "sdkfjsdfa" RLIKE "^[A-Za-z]+$";
-- returns true
SELECT "sdkfjsdfa1231231" RLIKE "^[A-Za-z]+$";
-- returns false

-- contains Chinese characters: "[\u4E00-\u9FFF]+"
SELECT 'news_house我们electric' RLIKE '[\u4E00-\u9FFF]+';
-- returns true
SELECT 'news_house/electric' RLIKE '[\u4E00-\u9FFF]+';
-- returns false
```
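Hive's RLIKE uses Java regular expressions, but the character-class patterns above behave the same under Python's `re` module, which makes them easy to sanity-check locally:

```python
import re

# fullmatch plays the role of the ^...$ anchors
is_digits = re.fullmatch('[0-9]+', '123123123') is not None
is_letters = re.fullmatch('[A-Za-z]+', 'sdkfjsdfa') is not None

# search mirrors the unanchored "contains Chinese" test
has_cjk = re.search('[\u4E00-\u9FFF]', 'news_house我们electric') is not None
no_cjk = re.search('[\u4E00-\u9FFF]', 'news_house/electric') is None

print(is_digits, is_letters, has_cjk, no_cjk)  # True True True True
```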
## 3.3 regexp_extract: extract a match

Extracts the part of a string that matches a pattern; its limitation is that it returns a single match, not all of them.

`regexp_extract(str, pattern, index)` — the third argument selects which capture group to return (0 is the whole match).

```sql
-- data = "news_home/electric, news_home, news_home/electric, news_home"

SELECT regexp_extract(data, '(news_home(?!/))', 0);
-- (?!/) asserts the next character is not '/'
-- returns news_home

SELECT regexp_extract(data, '.*(news_home),|.*,(news_home)$', 1);
-- matches a news_home in the middle or at the end
-- returns news_home

SELECT regexp_extract('id=1,num=1,id=2,num=2', 'id=(.*?),.*id=(.*?),.*', 1);
-- returns 1
SELECT regexp_extract('id=1,num=1,id=2,num=2', 'id=(.*?),.*id=(.*?),.*', 2);
-- returns 2
SELECT regexp_extract('id=1,num=1', 'id=(.*?),.*id=(.*?),.*', 2);
-- returns "" when nothing matches

-- extracting part of a statement
SELECT regexp_extract('insert overwrite table ecom.dim_shop_df partition', '.*overwrite table (.*?) partition.*', 1);
-- returns ecom.dim_shop_df

-- using alternation
SELECT regexp_extract('insert overwrite table ecom.dim_shop_df partition', '.*(overwrite table|OVERWRITE TABLE).*?\\.(.*?) (partition|PARTITION).*', 2);
-- returns dim_shop_df
```
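As a quick local sanity check, the group extraction above can be modeled with Python's `re` (Hive actually runs Java regex, but these patterns behave identically; the helper name `regexp_extract` below is just an illustrative stand-in):

```python
import re

# lazy (.*?) groups grab the shortest text before the next literal
m = re.search(r'id=(.*?),.*id=(.*?),.*', 'id=1,num=1,id=2,num=2')
first_id, second_id = m.group(1), m.group(2)

def regexp_extract(s, pattern, idx):
    # Hive returns '' when the pattern does not match
    m = re.search(pattern, s)
    return m.group(idx) if m else ''

print(first_id, second_id)  # 1 2
```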
## 3.4 regexp_replace: replace by pattern

Used in SELECT clauses to replace the parts of a string matching a pattern with some other content:

```sql
-- data = "装修避坑: 0.358001,选材经验: 0.828523,设计案例: 0.329146"

SELECT regexp_replace(data, ':[^,]*', '');
-- removes each ": 0.xxxxxx" score segment
-- returns "装修避坑,选材经验,设计案例"

SELECT regexp_replace('[1645357130369064,1645005576984579]', '\\[|\\]', '');
-- strips [ and ]; '[' must be escaped with two backslashes
-- returns "1645357130369064,1645005576984579"
```
## 3.5 Common patterns

```sql
-- first run of digits
select regexp_extract('耐克123/Nike', '([0-9]+)', 1);
-- returns 123

-- first run of letters
select regexp_extract('abc耐克/Nike', '([a-zA-Z]+)', 1);
-- returns abc

-- first run of Chinese characters
select regexp_extract('耐克/Nike', '([\u4E00-\u9FFF]+)', 1);
-- returns 耐克

-- keep only letters: replace every non-letter run with ''
select regexp_replace('abc耐克/Nike', '([^a-zA-Z]+)', '');
-- returns abcNike
```
# 4. String Functions

## 4.1 concat: concatenate strings

```sql
SELECT concat("welcome", " ", "to", " ", "beijing");
-- returns "welcome to beijing"
```
## 4.2 concat_ws: join with a separator

`concat_ws(sep, ...)` accepts either an array or a list of strings:

```sql
-- data_list = ["装修避坑", "选材经验", "设计案例"]
SELECT concat_ws(',', data_list);
-- returns "装修避坑,选材经验,设计案例"

SELECT concat_ws(',', "装修避坑", "选材经验", "设计案例");
-- returns "装修避坑,选材经验,设计案例"
```
## 4.3 instr: find a substring

`instr(str, substr)` returns the 1-based position of the first occurrence of `substr` in `str`, or 0 if it is not found. (The four-argument form `instr(str, substr, start, nth)` is Oracle's; Hive's built-in takes only the two arguments shown.)

```sql
SELECT instr('syranmo', 'syra'); -- returns 1
SELECT instr('syranmo', 'ranm'); -- returns 3
SELECT instr('syranmo', 'abcd'); -- returns 0
```
## 4.4 substr: take a substring

`substr(string str, int start, int len)` returns the substring of `str` starting at 1-based position `start` with length `len`; when `len` is omitted, it runs from `start` to the end of the string. A negative `start` counts from the end.

```sql
SELECT substr('12345', 3);
-- returns 345

SELECT substr('12345', 1, 4);
-- returns 1234

SELECT substr('12345', -2, 2);
-- returns 45
```
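The 1-based and negative-start semantics map onto Python slicing like this (a small sketch; it does not handle every Hive edge case such as `start = 0`):

```python
def substr(s, start, length=None):
    # Hive's substr is 1-based; a negative start counts from the end
    idx = len(s) + start if start < 0 else start - 1
    return s[idx:] if length is None else s[idx:idx + length]

print(substr('12345', 3), substr('12345', 1, 4), substr('12345', -2, 2))
# 345 1234 45
```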
# 5. Date and Time Functions

## 5.1 from_unixtime / unix_timestamp: timestamp conversion

- `from_unixtime(ts, format)`: unix timestamp → formatted date string.
- `unix_timestamp(date_str, format)`: formatted date string → unix timestamp.

```sql
-- from_unixtime
SELECT from_unixtime(1562160194, 'yyyyMMdd HH:mm:ss');
-- 20190703 21:23:14

SELECT from_unixtime(1562160194, 'yyyy-MM-dd');
-- 2019-07-03

SELECT from_unixtime(1562160194, 'yyyyMMdd');
-- 20190703

-- unix_timestamp
SELECT unix_timestamp('2019-07-03 21:23:14', 'yyyy-MM-dd HH:mm:ss');
-- 1562160194
```
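The conversion depends on the session time zone; the outputs above correspond to UTC+8. The round trip can be reproduced in Python with an explicit offset (a sketch assuming that UTC+8 session zone):

```python
from datetime import datetime, timedelta, timezone

cst = timezone(timedelta(hours=8))  # UTC+8, matching the outputs above

# from_unixtime(1562160194, 'yyyyMMdd HH:mm:ss')
dt = datetime.fromtimestamp(1562160194, tz=cst)
formatted = dt.strftime('%Y%m%d %H:%M:%S')

# unix_timestamp(...) — the reverse direction
ts = int(dt.timestamp())

print(formatted, ts)  # 20190703 21:23:14 1562160194
```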
## 5.2 pdate2date / date2pdate: date-format conversion

These convert between the 'yyyy-MM-dd' and 'yyyyMMdd' formats (they are platform-specific UDFs rather than Hive built-ins):

```sql
SELECT pdate2date('2020-07-23');
-- 20200723

SELECT date2pdate('20200723');
-- 2020-07-23
```
## 5.3 date_sub / add_months: shift by n days / months

- `date_sub('${DATE}', n)`: the date n days before DATE; a positive n subtracts, a negative n adds.
- `add_months('${DATE}', n)`: the date n months after DATE; a positive n adds, a negative n subtracts.

```sql
-- date_sub
select date_sub('2020-07-09', 4);
-- 2020-07-05

select date_sub('2020-07-09', -4);
-- 2020-07-13

-- add_months
select add_months('2020-07-09', -1);
-- 2020-06-09

select add_months('2020-03-31', -1);
-- 2020-02-29 (the day is clamped to the last day of the shorter month)
```
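The clamping behavior in the last example can be modeled in Python (a simplified sketch of one common semantics; exact month-end behavior varies somewhat across Hive/Spark versions):

```python
import calendar
from datetime import date

def add_months(d, n):
    # shift by n months, clamping the day to the target month's last day,
    # which reproduces add_months('2020-03-31', -1) = 2020-02-29
    m = d.year * 12 + (d.month - 1) + n
    year, month = divmod(m, 12)
    last_day = calendar.monthrange(year, month + 1)[1]
    return date(year, month + 1, min(d.day, last_day))

print(add_months(date(2020, 3, 31), -1))  # 2020-02-29
```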
## 5.4 datediff: difference between dates

`datediff(end_date, start_date)` returns `end_date - start_date` in days; both dates must be in 'yyyy-MM-dd' format.

```sql
SELECT datediff('2019-07-23', '2019-07-01');
-- 22
```
## 5.5 trunc: first day of the month / of the bimonthly period

`trunc('${DATE}', 'MM')` truncates a date to the first day of its month:

```sql
-- first day of the current month
select trunc('2020-07-09', 'MM');
-- 2020-07-01

-- first day of the previous month
select trunc(add_months('2020-07-09', -1), 'MM');
-- 2020-06-01

-- first day of the previous bimonthly period
select trunc(if(month('${DATE}') % 2 = 0, add_months('${DATE}', -3), add_months('${DATE}', -2)), 'MM');
-- 2022-01-01

-- last day of the previous bimonthly period
select date_sub(trunc(if(month('${DATE}') % 2 = 0, add_months('${DATE}', -1), add_months('${DATE}', 0)), 'MM'), 1);
-- 2022-02-28
```

(The last two outputs assume `${DATE}` falls in March or April 2022.)
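The bimonthly arithmetic is easy to get wrong, so here is the same rule in Python (a sketch assuming periods (Jan–Feb), (Mar–Apr), …, as in the SQL above):

```python
from datetime import date

def prev_bimonth_first_day(d):
    # an even month is the second month of its period, so step back
    # 3 months; an odd month is the first, so step back 2
    back = 3 if d.month % 2 == 0 else 2
    m = d.year * 12 + (d.month - 1) - back
    year, month = divmod(m, 12)
    return date(year, month + 1, 1)

print(prev_bimonth_first_day(date(2022, 3, 15)))  # 2022-01-01
```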
# 6. Exploding Arrays into Rows

## 6.1 explode: explode a single column

`explode(array)` turns a row holding an array into one row per element; on its own it supports exploding only a single column:

```sql
SELECT explode(array(0,1,2)) as num;
```

num |
---|
0 |
1 |
2 |
## 6.2 lateral view explode

`lateral view explode(...)` explodes one column into multiple rows while keeping the other columns unchanged:

```sql
SELECT
  user_id,
  vid2
FROM
  dm_homed.ods_server_impressions_log_hourly
  lateral view explode(split(version_id, ',')) vid_table as vid2
WHERE
  date = '${date}'
```

user_id | vid2 |
---|---|
111 | 1201 |
111 | 1202 |
111 | 1203 |
111 | 1204 |
## 6.3 lateral view posexplode: explode with element positions

Scenario 1: explode a column and keep each element's position in the array.

rule_id | prod_list |
---|---|
111 | 1203,1202,1201,1204 |

```sql
WITH prod_table AS (
  SELECT
    '111' as rule_id,
    '1203,1202,1201,1204' as prod_list
)

SELECT
  rule_id,
  product_id,
  rnk
FROM
  prod_table
  lateral view posexplode(split(prod_list, ',')) p as rnk, product_id
```

rule_id | product_id | rnk |
---|---|---|
111 | 1203 | 0 |
111 | 1202 | 1 |
111 | 1201 | 2 |
111 | 1204 | 3 |

Scenario 2: explode two columns at once, pairing elements that share the same position into one row.

rule_id | prod_list | gmv_list |
---|---|---|
111 | 1203,1202,1201,1204 | 300,200,100,400 |

```sql
WITH prod_table AS (
  SELECT
    '111' as rule_id,
    '1203,1202,1201,1204' as prod_list,
    '300,200,100,400' as gmv_list
)

SELECT
  rule_id,
  product_id,
  gmv
FROM
  prod_table
  lateral view posexplode(split(prod_list, ',')) p as rnk1, product_id
  lateral view posexplode(split(gmv_list, ',')) p as rnk2, gmv
WHERE
  rnk1 = rnk2
```

rule_id | product_id | gmv |
---|---|---|
111 | 1203 | 300 |
111 | 1202 | 200 |
111 | 1201 | 100 |
111 | 1204 | 400 |
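Scenario 2 is positional pairing: the cross product of the two posexplodes filtered by `rnk1 = rnk2` keeps exactly the same-index pairs, which in Python is just a `zip`:

```python
prod_list = '1203,1202,1201,1204'.split(',')
gmv_list = '300,200,100,400'.split(',')

# the rnk1 = rnk2 filter keeps only same-position pairs, i.e. a zip
pairs = list(zip(prod_list, gmv_list))
print(pairs)
# [('1203', '300'), ('1202', '200'), ('1201', '100'), ('1204', '400')]
```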
# 7. Percentile Functions

## 7.1 percentile & percentile_approx

- `percentile(col, p)`: the exact p-th percentile. It only accepts integral columns; for doubles use `percentile_approx`.
- `percentile_approx(col, p, acc)`: an approximate p-th percentile; `acc` defaults to 10000 (accuracy of one in ten thousand).

```sql
-- exact 25th percentile of score
SELECT percentile(score, 0.25);

-- approximate 25th percentile; on large data the approximation is much faster
SELECT percentile_approx(score, 0.25, 10000);
```
# 8. JSON Functions

## 8.1 get_json_object: fetch one value from a JSON object

`get_json_object(json_str, '$.key')`:

```sql
SELECT get_json_object('{"sku_id":712557421,"stock_num":100}', '$.sku_id');
-- 712557421

SELECT get_json_object('{"sku_id":712557421,"stock_num":100}', '$.stock_num');
-- 100
```
## 8.2 json_tuple: fetch several values at once

`json_tuple(json_str, key1, key2, ...)`:

```sql
SELECT
  sku_id,
  stock_num
FROM
(
  SELECT '{"sku_id":7125,"stock_num":0}' as sku
) a
lateral view json_tuple(sku, 'sku_id', 'stock_num') a as sku_id, stock_num
```

sku_id | stock_num |
---|---|
7125 | 0 |

To parse a JSON array, strip the brackets, make the `},{` object boundaries splittable, explode, then parse each object:

```sql
-- parse an array of json objects
SELECT
  sku_id,
  stock_num
FROM
(
  SELECT
    regexp_replace(
      regexp_replace(sku, '\\[|\\]', ''), -- strip the brackets
      '\\},\\{', '\\}###\\{'              -- mark object boundaries for splitting
    ) as sku_list
  FROM
  (
    SELECT
      '[{"sku_id":7125,"stock_num":0},{"sku_id":7126,"stock_num":10}]' as sku
  ) a
) a
lateral view explode(split(sku_list, '###')) sku_infos as sku -- split on ###
lateral view json_tuple(sku, 'sku_id', 'stock_num') a as sku_id, stock_num
```

sku_id | stock_num |
---|---|
7125 | 0 |
7126 | 10 |
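For reference, the regex-split-explode pipeline above produces the same rows a real JSON parser would; in Python the equivalent is a one-liner:

```python
import json

sku = '[{"sku_id":7125,"stock_num":0},{"sku_id":7126,"stock_num":10}]'

# each array element becomes one (sku_id, stock_num) row
rows = [(o['sku_id'], o['stock_num']) for o in json.loads(sku)]
print(rows)  # [(7125, 0), (7126, 10)]
```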
# 9. Miscellaneous

## 9.1 greatest: maximum across columns

`greatest(col1, col2, ...)` returns the maximum of its arguments:

```sql
SELECT greatest(1, 2, 3, 4);
-- 4
```
## 9.2 rand: random numbers

- `rand()`: a random double in the range [0, 1).
- `rand(int seed)`: with a seed, produces a stable, reproducible sequence.

```sql
SELECT rand();
-- 0.590256460382232

SELECT rand();
-- 0.699686875752123

SELECT rand(100);
-- 0.6841403791584381

SELECT rand(100);
-- 0.6841403791584381
```
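The seeded-determinism idea carries over to most languages; here it is with Python's `random` (an analogy, not Hive's generator — the actual values differ):

```python
import random

# like rand(seed): the same seed reproduces the same first value
a = random.Random(100).random()
b = random.Random(100).random()
print(a == b, 0.0 <= a < 1.0)  # True True
```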
## 9.3 max_pt: latest partition of a table

`max_pt('db.table', 'partition_column', 'partition_format')` returns the table's largest partition value.

- Advantage: it reads the table's actual latest partition, which avoids hitting an empty partition on reruns.
- Drawback: when backfilling, it always reads the latest data rather than the data as of that day.
- Note: the partition column must be a string, not bigint, so prefer declaring partition columns as string.

```sql
show partitions dm_temai.repeat_product_records_dict;
-- date=20200712
-- date=20200713
-- date=20200714
-- date=20200715

SELECT max_pt('dm_temai.repeat_product_records_dict', 'date', 'yyyyMMdd');
-- 20200715
```
## 9.4 coalesce / nvl: first non-null value

```sql
SELECT coalesce(NULL, '1', '2');
-- 1

SELECT coalesce(NULL, NULL, NULL);
-- NULL

SELECT nvl(1, NULL);
-- 1

SELECT nvl(NULL, 1);
-- 1
```