
# HiveQL / Spark SQL Common Functions

1. Window functions: basic structure; row_number/rank/dense_rank; sum/avg/max/min; lag/lead/first_value/last_value; ntile; count over
2. Array functions: array; collect_list/collect_set; array_contains; split
3. Regular expression functions: like; regexp/rlike; regexp_extract; regexp_replace; common patterns
4. String functions: concat; concat_ws; instr; substr
5. Date/time functions: from_unixtime/unix_timestamp; pdate2date/date2pdate; date_sub/add_months; datediff; trunc
6. Converting columns to rows: explode; lateral view explode; lateral view posexplode
7. Percentile functions: percentile & percentile_approx
8. JSON functions: get_json_object; json_tuple
9. Others: greatest; rand; max_pt; coalesce/nvl
## 1. Window Functions

### 1.1 Basic structure

Window function = analytic function + over clause

- Analytic functions: sum(), max(), row_number(), etc.
- over clause: over (partition by user_id order by date desc/asc)

Rows are partitioned by the user_id column; within each partition they are sorted by the date column. asc = ascending (the default), desc = descending.
### 1.2 row_number/rank/dense_rank: ranking functions

a) row_number() over(): consecutive ranks, no ties (1, 2, 3, ...)
b) rank() over(): tied rows share a rank and leave gaps (1, 1, 3, ...)
c) dense_rank() over(): tied rows share a rank without gaps (1, 1, 2, ...)
```sql
SELECT
  user_id,
  -- consecutive ranks, no ties
  row_number() over(partition by user_id order by date) as row_number,
  -- tied rows share a rank and leave gaps
  rank() over(partition by user_id order by date) as rank,
  -- tied rows share a rank without gaps
  dense_rank() over(partition by user_id order by date) as dense_rank
FROM
  app_launch
```

| user_id | date | row_number | rank | dense_rank |
|---|---|---|---|---|
| 12345 | 20190101 | 1 | 1 | 1 |
| 12345 | 20190101 | 2 | 1 | 1 |
| 12345 | 20190102 | 3 | 3 | 2 |
### 1.3 sum/avg/max/min: aggregate functions

With order by, the aggregate runs from the partition start to the current row:

```sql
SELECT
  user_id,
  sum(salary) over(partition by user_id order by date) as sum, -- running total up to the current row
  avg(salary) over(partition by user_id order by date) as avg, -- running average up to the current row
  max(salary) over(partition by user_id order by date) as max, -- running max up to the current row
  min(salary) over(partition by user_id order by date) as min  -- running min up to the current row
FROM
  app_launch
```

| user_id | date | salary | sum | avg | max | min |
|---|---|---|---|---|---|---|
| 12345 | 201901 | 100 | 100 | 100 | 100 | 100 |
| 12345 | 201902 | 200 | 300 | 150 | 200 | 100 |
| 12345 | 201903 | 300 | 600 | 200 | 300 | 100 |
Without order by, the aggregate covers the whole partition:

```sql
SELECT
  user_id,
  sum(salary) over(partition by user_id) as sum, -- total over the whole partition
  avg(salary) over(partition by user_id) as avg, -- average over the whole partition
  max(salary) over(partition by user_id) as max, -- max over the whole partition
  min(salary) over(partition by user_id) as min  -- min over the whole partition
FROM
  app_launch
```

| user_id | date | salary | sum | avg | max | min |
|---|---|---|---|---|---|---|
| 12345 | 201901 | 100 | 600 | 200 | 300 | 100 |
| 12345 | 201902 | 200 | 600 | 200 | 300 | 100 |
| 12345 | 201903 | 300 | 600 | 200 | 300 | 100 |
### 1.4 lag/lead/first_value/last_value: values from other rows

- lag(col, n, default) over(): value of col from the n-th row above the current row within the window
- lead(col, n, default) over(): value of col from the n-th row below the current row within the window
```sql
SELECT
  user_id,
  lag(date, 1, "") over(partition by user_id order by date) as lag,
  lead(date, 1) over(partition by user_id order by date) as lead,
  first_value(date) over(partition by user_id order by date) as first_value,
  first_value(date) over(partition by user_id order by date desc) as first_value_desc
FROM
  app_launch
```

| user_id | date | lag | lead | first_value | first_value_desc |
|---|---|---|---|---|---|
| 12345 | 20190101 | "" | 20190102 | 20190101 | 20190103 |
| 12345 | 20190102 | 20190101 | 20190103 | 20190101 | 20190103 |
| 12345 | 20190103 | 20190102 | NULL | 20190101 | 20190103 |
```sql
SELECT
  user_id,
  last_value(date) over(partition by user_id order by date) as last_value,
  last_value(date) over(partition by user_id order by date desc) as last_value2
FROM
  app_launch
```

| user_id | date | last_value | last_value2 |
|---|---|---|---|
| 12345 | 20190101 | 20190101 | 20190103 |
| 12345 | 20190102 | 20190102 | 20190102 |
| 12345 | 20190103 | 20190103 | 20190101 |
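The ascending last_value above equals each row's own date because, when ORDER BY is present, the default window frame is `rows between unbounded preceding and current row`. A sketch of an explicit frame (against the same app_launch table) that returns the true partition-wide last value regardless of sort direction:

```sql
SELECT
  user_id,
  -- with an explicit frame covering the whole partition, every row
  -- gets the partition's final date (20190103 in the example above)
  last_value(date) over(
    partition by user_id order by date
    rows between unbounded preceding and unbounded following
  ) as last_value
FROM
  app_launch
```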
### 1.5 ntile: split into n buckets (percentiles, n-way slices)

NTILE(n) distributes the ordered rows of a partition as evenly as possible into n buckets and assigns each row its bucket number, starting from 1. If the rows cannot be split evenly, lower-numbered buckets are filled first, and bucket sizes differ by at most 1.

n must be a positive integer literal; bucket numbers range from 1 to n. PARTITION BY splits the result set of the FROM clause into the partitions to which NTILE() is applied, and ORDER BY specifies the order in which bucket numbers are assigned within each partition.

```sql
NTILE(n) OVER (
  PARTITION BY <expression>[{,<expression>...}]
  ORDER BY <expression> [ASC|DESC][{,<expression>...}]
)
```

You can then filter on the bucket number to take the top or bottom fraction of the data.

User salary table: user_salary
| user_id | salary |
|---|---|
| 1 | 100 |
| 2 | 200 |
| 3 | 300 |
```sql
SELECT
  user_id,
  salary,
  ntile(2) over(order by salary) as rn
FROM
  user_salary
```

| user_id | salary | rn |
|---|---|---|
| 1 | 100 | 1 |
| 2 | 200 | 1 |
| 3 | 300 | 2 |
```sql
-- average salary of the top 50% of users by income
SELECT
  avg(salary) as avg_salary
FROM
(
  SELECT
    user_id,
    salary,
    ntile(2) over(order by salary desc) as rn
  FROM
    user_salary
) a
WHERE rn = 1
```

### 1.6 count over

```sql
SELECT
  *,
  count(1) over(partition by department) as cnt1,              -- rows in the department
  count(1) over(order by department) as cnt2,                  -- cumulative count in department order (peer rows included)
  count(1) over(partition by department order by name) as cnt3 -- running count within the department
FROM
  dm_homed.test_count_over
```

| name | department | salary | cnt1 | cnt2 | cnt3 |
|---|---|---|---|---|---|
| 王五 | 人事 | 4000 | 1 | 1 | 1 |
| 刘二 | 研发 | 6500 | 3 | 4 | 1 |
| 赵六 | 研发 | 7000 | 3 | 4 | 2 |
| 陈一 | 研发 | 6500 | 3 | 4 | 3 |
| 张三 | 财务 | 5000 | 2 | 6 | 1 |
| 李四 | 财务 | 5200 | 2 | 6 | 2 |
## 2. Array Functions

### 2.1 array: build an array

array(val1, val2, val3, ...): builds an array from multiple values.

Array elements can be accessed as `array_name[index]`; the index starts at 0:

```sql
SELECT ARRAY(0,1,2);
-- returns [0,1,2]

SELECT ARRAY(0,1,2)[1];
-- returns 1
```

### 2.2 collect_list/collect_set: turn a grouped column into an array

Both return a column of a group as an array; the difference is that collect_list keeps duplicates while collect_set deduplicates.

The argument to collect_set must be of string type (hence the cast below).
| user_id | date |
|---|---|
| 12345 | 1001 |
| 12345 | 1001 |
| 12345 | 1002 |
```sql
SELECT
  user_id,
  concat_ws(',', collect_list(cast(date as string))) as date_list, -- keeps duplicates
  concat_ws(',', collect_set(cast(date as string))) as date_set    -- deduplicated
FROM
  app_launch
GROUP BY
  user_id
```

| user_id | collect_list | collect_set |
|---|---|---|
| 12345 | 1001,1001,1002 | 1001,1002 |
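In distributed execution the element order of collect_list/collect_set is not guaranteed. Wrapping the result in sort_array (a Hive built-in) gives a deterministic output; a sketch against the same app_launch table:

```sql
SELECT
  user_id,
  -- sort_array orders the collected elements ascending, so the
  -- concatenated string is stable across runs
  concat_ws(',', sort_array(collect_list(cast(date as string)))) as date_list
FROM
  app_launch
GROUP BY
  user_id
```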
### 2.3 array_contains: check membership

Returns true/false depending on whether the array contains the given element.

```sql
SELECT array_contains(ARRAY(0,1), 0);
-- returns true
```

### 2.4 split: split a string into an array

```sql
SELECT split('装修避坑,选材经验,设计案例', ',');
-- returns ["装修避坑", "选材经验", "设计案例"]
```
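Note that split's second argument is a regular expression, not a plain delimiter, so regex metacharacters such as `.` or `|` must be escaped:

```sql
-- '.' and '|' are regex metacharacters; escape them with two backslashes
SELECT split('a.b.c', '\\.');
-- returns ["a","b","c"]

SELECT split('a|b|c', '\\|');
-- returns ["a","b","c"]
```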
## 3. Regular Expression Functions

### 3.1 like: pattern matching

Used mainly in the WHERE clause to filter rows; % is the wildcard. Returns true/false.

```sql
SELECT "news_home/electric, news_home" like '%news_home%';
-- returns true
SELECT "news_house/electric" like '%news_home%';
-- returns false
```

### 3.2 regexp/rlike: regex predicate

Used mainly in the WHERE clause to filter rows. Returns true/false.
```sql
category  = "news_home/electric, news_home"
category1 = "news_home, news_home/electric"

SELECT category regexp 'news_home$';
-- returns true
SELECT category1 regexp 'news_home$';
-- returns false

-- digits only: "^[0-9]+$"
SELECT "123123123" RLIKE "^[0-9]+$";
-- returns true
SELECT "123123123a" RLIKE "^[0-9]+$";
-- returns false

-- letters only: "^[A-Za-z]+$"
SELECT "sdkfjsdfa" RLIKE "^[A-Za-z]+$";
-- returns true
SELECT "sdkfjsdfa1231231" RLIKE "^[A-Za-z]+$";
-- returns false

-- contains Chinese characters: "[\u4E00-\u9FFF]+"
SELECT 'news_house我们electric' RLIKE '[\u4E00-\u9FFF]+';
-- returns true
SELECT 'news_house/electric' RLIKE '[\u4E00-\u9FFF]+';
-- returns false
```

### 3.3 regexp_extract: extract by regex

Extracts the part of a string that matches a pattern; its drawback is that it cannot extract all matches, only one.
regexp_extract(str, pattern, index)

The third argument selects which part of the match to return: 0 is the whole match, 1 the first capture group, and so on.
```sql
data = "news_home/electric, news_home, news_home/electric, news_home"

SELECT regexp_extract(data, '(news_home(?!/))', 0);
-- (?!/) means "not followed by '/'"
-- returns news_home

SELECT regexp_extract(data, '.*(news_home),|.*,(news_home)$');
-- matches a news_home in the middle or at the end
-- returns news_home

SELECT regexp_extract('id=1,num=1,id=2,num=2', 'id=(.*?),.*id=(.*?),.*', 1);
-- 1
SELECT regexp_extract('id=1,num=1,id=2,num=2', 'id=(.*?),.*id=(.*?),.*', 2);
-- 2
SELECT regexp_extract('id=1,num=1', 'id=(.*?),.*id=(.*?),.*', 2);
-- no match: returns ""

-- extract the part that satisfies a condition
SELECT regexp_extract('insert overwrite table ecom.dim_shop_df partition', '.*overwrite table (.*?) partition.*', 1);
-- ecom.dim_shop_df

-- using alternation
SELECT regexp_extract('insert overwrite table ecom.dim_shop_df partition', '.*(overwrite table|OVERWRITE TABLE).*?\\.(.*?) (partition|PARTITION).*', 2);
-- dim_shop_df (group 2 captures the table name after the db prefix)
```

### 3.4 regexp_replace: replace by regex

Used in the SELECT clause to replace specific parts of a string with given content.
```sql
data = "装修避坑: 0.358001,选材经验: 0.828523,设计案例: 0.329146"

SELECT regexp_replace(data, ': [0-9.]+', '');
-- removes every ": <score>" part
-- returns "装修避坑,选材经验,设计案例"

SELECT regexp_replace('[1645357130369064,1645005576984579]', '\\[|\\]', '');
-- removes [ and ]; '[' must be escaped with two backslashes
-- returns "1645357130369064,1645005576984579"
```

### 3.5 Common patterns

```sql
-- first run of digits
select regexp_extract('耐克123/Nike', '([0-9]+)', 1);
-- 123

-- first run of letters
select regexp_extract('abc耐克/Nike', '([a-zA-Z]+)', 1);
-- abc

-- first run of Chinese characters
select regexp_extract('耐克/Nike', '([\u4E00-\u9FFF]+)', 1);
-- 耐克

-- keep only letters: replace every non-letter run with ''
select regexp_replace('abc耐克/Nike', '([^a-zA-Z]+)', '');
-- abcNike
```
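Since regexp_extract returns only a single match, a common workaround for "extract all matches" is to delete everything else with regexp_replace and then split; a sketch:

```sql
-- strip each ",num=<digits>" fragment, then split what remains into an array
SELECT split(regexp_replace('id=1,num=1,id=2,num=2', ',?num=[0-9]+', ''), ',');
-- returns ["id=1","id=2"]
```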
## 4. String Functions

### 4.1 concat: concatenate strings

```sql
SELECT concat("welcome", " ", "to", " ", "beijing");
-- returns "welcome to beijing"
```

### 4.2 concat_ws: join with a separator

```sql
data_list = ["装修避坑", "选材经验", "设计案例"]
SELECT concat_ws(',', data_list);
-- returns "装修避坑,选材经验,设计案例"

SELECT concat_ws(',', "装修避坑", "选材经验", "设计案例");
-- returns "装修避坑,选材经验,设计案例"
```

### 4.3 instr: find a substring

instr(str1, str2, start_index, nth_appearance): returns the position of str2 in str1, or 0 if not found.
```sql
SELECT instr('syranmo', 'syra'); -- returns 1
SELECT instr('syranmo', 'ranm'); -- returns 3
SELECT instr('syranmo', 'abcd'); -- returns 0
```

### 4.4 substr: take a substring

substr(string str1, int start, int len): returns the substring of str1 starting at position start with length len; if len is omitted, it returns everything from start to the end of the string.
```sql
SELECT substr('12345', 3);
-- 345

SELECT substr('12345', 1, 4);
-- 1234

SELECT substr('12345', -2, 2);
-- 45
```

## 5. Date/Time Functions

### 5.1 from_unixtime/unix_timestamp: timestamp conversion

from_unixtime(ts, date_format): converts a Unix timestamp into a date string of the given format.
unix_timestamp(date_str, date_format): converts a date string of the given format into a Unix timestamp.
```sql
-- from_unixtime
SELECT from_unixtime(1562160194, 'yyyyMMdd HH:mm:ss');
-- 20190703 21:23:14

SELECT from_unixtime(1562160194, 'yyyy-MM-dd');
-- 2019-07-03

SELECT from_unixtime(1562160194, 'yyyyMMdd');
-- 20190703

-- unix_timestamp
SELECT unix_timestamp('2019-07-03 21:23:14', 'yyyy-MM-dd HH:mm:ss');
-- 1562160194
```

### 5.2 pdate2date/date2pdate: date format conversion

Note: these two appear to be platform-specific UDFs rather than open-source Hive built-ins.

```sql
SELECT pdate2date('2020-07-23');
-- 20200723

SELECT date2pdate('20200723');
-- 2020-07-23
```
### 5.3 date_sub/add_months: dates n days/months away

date_sub('${DATE}', n): returns the date n days before DATE; positive n goes back, negative n goes forward.
add_months('${DATE}', n): returns the date n months after DATE; positive n goes forward, negative n goes back.
```sql
-- date_sub
select date_sub('2020-07-09', 4);
-- 2020-07-05

select date_sub('2020-07-09', -4);
-- 2020-07-13

-- add_months
select add_months('2020-07-09', -1);
-- 2020-06-09

select add_months('2020-03-31', -1);
-- 2020-02-29
```

### 5.4 datediff: date difference

datediff(end_date, start_date): returns the difference between the two dates in days, i.e. end_date - start_date.
Both end_date and start_date must be in 'yyyy-MM-dd' format.
```sql
SELECT datediff('2019-07-23', '2019-07-01');
-- 22
```

### 5.5 trunc: first day of the month / bimonthly period

trunc('${DATE}', 'MM')
```sql
-- first day of the current month
select trunc('2020-07-09', 'MM');
-- 2020-07-01

-- first day of the previous month
select trunc(add_months('2020-07-09', -1), 'MM');
-- 2020-06-01

-- first day of the previous bimonthly period
select trunc(if(month('${DATE}') % 2 = 0, add_months('${DATE}', -3), add_months('${DATE}', -2)), 'MM');
-- 2022-01-01

-- last day of the previous bimonthly period
select date_sub(trunc(if(month('${DATE}') % 2 = 0, add_months('${DATE}', -1), add_months('${DATE}', 0)), 'MM'), 1);
-- 2022-02-28
```
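For the symmetric case, Hive also ships a built-in last_day function that returns the last day of a month directly, without combining trunc and date_sub:

```sql
SELECT last_day('2020-07-09');
-- 2020-07-31
```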
## 6. Converting Columns to Rows

### 6.1 explode: explode a single field

explode(array): turns one row holding an array into one row per element; it only supports exploding a single field.

```sql
SELECT explode(array(0,1,2)) as num;
```

| num |
|---|
| 0 |
| 1 |
| 2 |
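explode also accepts a map, producing one row per entry with separate key and value columns:

```sql
-- one row per map entry: (a, 1) and (b, 2)
SELECT explode(map('a', 1, 'b', 2)) as (k, v);
```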
### 6.2 lateral view explode

lateral view explode(...): splits one column into multiple rows while all other columns stay unchanged.

```sql
SELECT
  user_id,
  vid2
FROM
  dm_homed.ods_server_impressions_log_hourly
  lateral view explode(split(version_id, ',')) vid_table as vid2
WHERE
  date = '${date}'
```

| user_id | vid2 |
|---|---|
| 111 | 1201 |
| 111 | 1202 |
| 111 | 1203 |
| 111 | 1204 |
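One pitfall: a plain lateral view drops rows whose array is empty or NULL. Adding OUTER keeps such rows, with NULL in the exploded column; a sketch against the same log table:

```sql
SELECT
  user_id,
  vid2
FROM
  dm_homed.ods_server_impressions_log_hourly
  -- OUTER keeps rows with an empty/NULL version_id (vid2 becomes NULL)
  lateral view outer explode(split(version_id, ',')) vid_table as vid2
WHERE
  date = '${date}'
```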
### 6.3 lateral view posexplode: explode with position

Scenario 1: explode a list and keep each element's position in the array.
| rule_id | prod_list |
|---|---|
| 111 | 1203,1202,1201,1204 |
```sql
WITH prod_table AS (
  SELECT
    '111' as rule_id,
    '1203,1202,1201,1204' as prod_list
)

SELECT
  rule_id,
  product_id,
  rnk
FROM
  prod_table
  lateral view posexplode(split(prod_list, ',')) p as rnk, product_id
```

| rule_id | product_id | rnk |
|---|---|---|
| 111 | 1203 | 0 |
| 111 | 1202 | 1 |
| 111 | 1201 | 2 |
| 111 | 1204 | 3 |
Scenario 2: explode two lists at once and align their elements by position into one row.
| rule_id | prod_list | gmv_list |
|---|---|---|
| 111 | 1203,1202,1201,1204 | 300,200,100,400 |
```sql
WITH prod_table AS (
  SELECT
    '111' as rule_id,
    '1203,1202,1201,1204' as prod_list,
    '300,200,100,400' as gmv_list
)

SELECT
  rule_id,
  product_id,
  gmv
FROM
  prod_table
  lateral view posexplode(split(prod_list, ',')) p as rnk1, product_id
  lateral view posexplode(split(gmv_list, ',')) q as rnk2, gmv
WHERE
  rnk1 = rnk2
```

| rule_id | product_id | gmv |
|---|---|---|
| 111 | 1203 | 300 |
| 111 | 1202 | 200 |
| 111 | 1201 | 100 |
| 111 | 1204 | 400 |
## 7. Percentile Functions

### 7.1 percentile & percentile_approx

percentile(col, p): exact p-th percentile of col.
percentile_approx(col, p, acc): approximate p-th percentile; acc controls the accuracy and defaults to 10000 (one part in ten thousand).
```sql
-- exact 25th percentile of score
SELECT percentile(score, 0.25);

-- approximate 25th percentile; accuracy defaults to 10000 (1/10000);
-- on large datasets the approximate version is much faster
SELECT percentile_approx(score, 0.25, 10000);
```

## 8. JSON Functions

### 8.1 get_json_object: get one value from a JSON string

get_json_object(json_str, '$.key1')
```sql
SELECT get_json_object('{"sku_id":712557421,"stock_num":100}', '$.sku_id');
-- 712557421

SELECT get_json_object('{"sku_id":712557421,"stock_num":100}', '$.stock_num');
-- 100
```

### 8.2 json_tuple: get several values at once

json_tuple(json_str, key1, key2, ...)
```sql
SELECT
  sku_id,
  stock_num
FROM
(
  SELECT '{"sku_id":7125,"stock_num":0}' as sku
) a
lateral view json_tuple(sku, 'sku_id', 'stock_num') b as sku_id, stock_num
```

| sku_id | stock_num |
|---|---|
| 7125 | 0 |
```sql
-- parse JSON in array form
SELECT
  sku_id,
  stock_num
FROM
(
  SELECT
    regexp_replace(
      regexp_replace(sku, '\\[|\\]', ''), -- strip the brackets
      '\\},\\{', '\\}###\\{'              -- turn },{ into }###{ so we can split safely
    ) as sku_list
  FROM
  (
    SELECT
      '[{"sku_id":7125,"stock_num":0},{"sku_id":7126,"stock_num":10}]' as sku
  ) a
) b
lateral view explode(split(sku_list, '###')) sku_infos as sku -- split on ###
lateral view json_tuple(sku, 'sku_id', 'stock_num') c as sku_id, stock_num
```

| sku_id | stock_num |
|---|---|
| 7125 | 0 |
| 7126 | 10 |
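When only one element of the array is needed, get_json_object can index into it directly, assuming the engine's JsonPath subset supports array indexing with [i], which avoids the replace-and-split dance above:

```sql
-- [1] selects the second element of the top-level JSON array
SELECT get_json_object(
  '[{"sku_id":7125,"stock_num":0},{"sku_id":7126,"stock_num":10}]',
  '$[1].sku_id');
```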
## 9. Others

### 9.1 greatest: maximum across columns

greatest(col1, col2, ...): returns the maximum value across the given columns.
```sql
SELECT greatest(1, 2, 3, 4);
-- 4
```

### 9.2 rand: random numbers

rand(): returns a random double in the range [0, 1).
rand(int seed): with a seed, returns a reproducible random sequence.

```sql
SELECT rand();
-- 0.590256460382232
SELECT rand();
-- 0.699686875752123

SELECT rand(100);
-- 0.6841403791584381
SELECT rand(100);
-- 0.6841403791584381
```

### 9.3 max_pt: largest partition of a table

max_pt('db_name.table_name', 'partition_name', 'partition_format'): returns the largest partition of the table. (This is a platform-provided function, e.g. in MaxCompute, not an open-source Hive built-in.)
- Advantage: it reads the largest existing partition of the table, which avoids hitting an empty partition when a job is rerun.
- Disadvantage: when backfilling historical data it cannot pin that day's partition; it always reads the latest data.
- Note: the partition column must be a string, not a bigint, so prefer declaring partition columns as string.

Usage example:
```sql
show partitions dm_temai.repeat_product_records_dict;
-- date=20200712
-- date=20200713
-- date=20200714
-- date=20200715

SELECT max_pt('dm_temai.repeat_product_records_dict', 'date', 'yyyyMMdd');
-- 20200715
```

### 9.4 coalesce/nvl: first non-null value

```sql
SELECT coalesce(NULL, '1', '2');
-- 1

SELECT coalesce(NULL, NULL, NULL);
-- NULL

SELECT nvl(1, NULL);
-- 1

SELECT nvl(NULL, 1);
-- 1
```
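A typical use of coalesce is supplying defaults after a LEFT JOIN, where columns from the right side are NULL for unmatched rows (the users and stats tables here are hypothetical):

```sql
SELECT
  a.user_id,
  coalesce(b.cnt, 0) as cnt -- 0 for users without a stats row
FROM
  users a
  left join stats b on a.user_id = b.user_id
```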