hiveSql 百度面试题-连续签到领金币
创始人
2024-05-07 15:53:39
0

hiveSql 百度面试题-连续签到领金币

    • 需求
    • 分析
    • 实现
    • 最后

需求

用户在文章页可以每天签到,签到按照某种规则可获取金币,需统计每个用户每月获取金币数。

有用户签到明细表:

CREATE TABLE tmp_tb_user_log (uid INT COMMENT '用户ID',artical_id INT COMMENT '视频ID',in_time string COMMENT '进入时间',out_time string COMMENT '离开时间',sign_in int COMMENT '是否签到1是0否'
)

记录用户进入签到页,点击签到按钮后退出log数据。
部分表数据明细如下:

uidartical_idin_timeout_timesign_in
102022-07-07 10:00:002022-07-07 10:00:091
102022-07-08 10:00:002022-07-08 10:00:091
102022-07-09 10:00:002022-07-09 10:00:421
102022-07-10 10:00:002022-07-10 10:00:091
102022-07-11 23:59:552022-07-11 23:59:591
102022-07-12 10:00:282022-07-12 10:00:501
102022-07-13 10:00:282022-07-13 10:00:501
102022-07-14 11:00:282022-07-14 11:00:501
102022-07-15 11:59:282022-07-16 00:01:201

领取金币规则:

(1)只要用户签到就能获取一枚金币
(2)连续签到第三天和第七天分别可以在每天一枚的基础上额外再获取2枚和6金币
(3)连续签到7天后进行重置,按照规则(1)和(2)进行新一轮领币
(4)用户进入时间和离开时间如果跨天 按照进入时间计算打卡时间

上述举例明细中,用户id:1在2022-07-07到2022-07-15号连续签到,按照上述金币规则,每天的获取金币数据如下:

uidartical_idin_timeout_timesign_incoin
102022-07-07 10:00:002022-07-07 10:00:0911
102022-07-08 10:00:002022-07-08 10:00:0911
102022-07-09 10:00:002022-07-09 10:00:4213
102022-07-10 10:00:002022-07-10 10:00:0911
102022-07-11 23:59:552022-07-11 23:59:5911
102022-07-12 10:00:282022-07-12 10:00:5011
102022-07-13 10:00:282022-07-13 10:00:5017
102022-07-14 11:00:282022-07-14 11:00:5011
102022-07-15 11:59:282022-07-16 00:01:2011

用户id:1在2022-07-09是连续签到3天获得3金币,在2022-07-13连续签到7天,获得7金币。7月份共获得17金币。

分析

逆反思想思考,想要获取以上结果,得知道每个用户的有效签到连续日期分组,同一个用户可能有多个连续签到数据段,在每个数据段内需要row_number() 标行号,按照需求的金币规则计算每个连续签到数据段内的获取金币数。

  • 1.如果我们已经有如下数据:
uidartical_idin_timeout_timesign_inrank
102022-07-07 10:00:002022-07-07 10:00:0911
102022-07-08 10:00:002022-07-08 10:00:0912
102022-07-09 10:00:002022-07-09 10:00:4213
102022-07-10 10:00:002022-07-10 10:00:0914
102022-07-11 23:59:552022-07-11 23:59:5915
102022-07-12 10:00:282022-07-12 10:00:5016
102022-07-13 10:00:282022-07-13 10:00:5017
102022-07-14 11:00:282022-07-14 11:00:5018
102022-07-15 11:59:282022-07-16 00:01:2019

rank是每个用户每个连续签到数据段内按照in_time升序序号,如果有了这个序号,用这个序号对7取余后,处理余数为3和0(7天的整数,即每一个连续7天),其余都是1金币。就可以算出每个用户每天的获取金币数。

  • 2.如果想要得到步骤1的数据,前提是先将每个用户每次的联系签到日期分到同一组中,这里很容易想到利用重分组思想,在用户每次签到相对于前一天变化的日期数据处标记数据为1,连续签到数据标记为0,再sum() over() 按照签到时间累加,即将每个用户每次联系签到数据分到同一组。可见举例数据:

假设有数据如下:

uidartical_idin_timeout_timesign_in
102022-07-07 10:00:002022-07-07 10:00:091
102022-07-08 10:00:002022-07-08 10:00:091
102022-07-09 10:00:002022-07-09 10:00:421
102022-07-10 10:00:002022-07-10 10:00:090
102022-07-11 23:59:552022-07-11 23:59:591
102022-07-13 10:00:282022-07-13 10:00:501
102022-07-14 11:00:282022-07-14 11:00:501

用户在2022-07-10号进入签到页了 ,但是没有签到(sign_in为0),在2022-07-13直接没有来,即该用户应该有3段连续签到数据段。
分别是7号到9号,11号 和 13号到14号。分组应该是这样:

uidartical_idin_timeout_timesign_ingroup_name
102022-07-07 10:00:002022-07-07 10:00:0910
102022-07-08 10:00:002022-07-08 10:00:0910
102022-07-09 10:00:002022-07-09 10:00:4210
102022-07-10 10:00:002022-07-10 10:00:0901
102022-07-11 23:59:552022-07-11 23:59:5912
102022-07-13 10:00:282022-07-13 10:00:5013
102022-07-14 11:00:282022-07-14 11:00:5013

可以看到group_name字段一共有0到3 四个分组,但是组号1是没有签到的。如何实现这种分组?

  • 3.实现重分组,将每个用户的每段连续签到日期分到同一组。
    构造每次签到相对于前一天变化的日期数据标记为1,连续日期标记为0,再sum() over()开窗累加即可。可见数据:
uidartical_idin_timeout_timesign_insum_over
102022-07-07 10:00:002022-07-07 10:00:0910
102022-07-08 10:00:002022-07-08 10:00:0910
102022-07-09 10:00:002022-07-09 10:00:4210
102022-07-10 10:00:002022-07-10 10:00:0901
102022-07-11 23:59:552022-07-11 23:59:5911
102022-07-13 10:00:282022-07-13 10:00:5011
102022-07-14 11:00:282022-07-14 11:00:5010

sum_over字段:

  • 2022-07-10号相对于2022-07-09号是变化的(sign_in变为0),标记为1;

  • 2022-07-11号相对于2022-07-10号是变化的(sign_in变为1),标记为1;

  • 2022-07-13号相对于2022-07-11号是变化的(日期不连续),标记为1;
    有sum_over字段,sum(sum_over) over(partition by uid order by in_time) 即得到步骤1中的group_name。
    那构造sum_over需要注意什么?再往上一步想。

  • 4.这一步就是具体的分析什么是相对于前一条数据是否变化
    首先要明确,断签的条件有两个:
    1、没来,数据表现为没有当天数据
    2、来了没签到,数据表现为有当天数据,但是sign_in是0
    所以要标注这两个条件以明确步骤3中的sum_over到底应该是1还是0

相对于前一条数据,很容易想到开窗函数lead() over() 或者 lag() over()。不清楚这两个开窗函数用法的同学可以查看俺写的另一篇介绍开窗函数的文章开窗函数,很nice的。

if(datediff(cast(in_time as datetime), cast(lag(in_time,1,in_time) over(partition by uid order by in_time) as datetime),'dd') > 1, 1, 0) as flag_days,
if(lag(sign_in,1,sign_in) over(partition by uid order by in_time) != sign_in, 1, 0) as sign_flag

flag_days:当前行减去上一行日期取天数差,如果差值大于1 ,标记为1,否则标记为0。这里理解为当前数据相对于上一条数据是连续日期为0,否则为1
sign_flag:当前行sign_in与上一行sign_in不一样则标记为1,否则标记为0。这里理解为从签到未签到 和 从未签到签到都被标记为1,其他标记为0。

有了上述flag_days 和 sign_flag。步骤3中的sum_over,很容易理解是当flag_days 和sign_flag都是0则sum_over标记为0,否则是1,怎么理解?
flag_days = 0 :当前行与上一行日期连续
sign_flag = 0 :当前行与上一行sign_in值没有变化(这里先不管是不是签到了,等于1,因为连续的没签到也是单独分为一组,不会和签到的分到同一组)
所以有变化的数据行(日期断连 或者 sign_in变化)被标记为1,其他为0。得到sum_over。

实现

1、准备数据


DROP TABLE IF EXISTS tmp_tb_user_log;
CREATE TABLE tmp_tb_user_log (uid INT COMMENT '用户ID',artical_id INT COMMENT '视频ID',in_time string COMMENT '进入时间',out_time string COMMENT '离开时间',sign_in int COMMENT '是否签到'
) lifecycle 3;INSERT INTO tmp_tb_user_log(uid, artical_id, in_time, out_time, sign_in) VALUES(1, 0, '2022-07-07 10:00:00', '2022-07-07 10:00:09', 1),(1, 0, '2022-07-08 10:00:00', '2022-07-08 10:00:09', 1),(1, 0, '2022-07-09 10:00:00', '2022-07-09 10:00:42', 1),(1, 0, '2022-07-10 10:00:00', '2022-07-10 10:00:09', 1),(1, 0, '2022-07-11 23:59:55', '2022-07-11 23:59:59', 1),(1, 0, '2022-07-12 10:00:28', '2022-07-12 10:00:50', 1),(1, 0, '2022-07-13 10:00:28', '2022-07-13 10:00:50', 1),(1, 0, '2022-07-14 11:00:28', '2022-07-14 11:00:50', 1),(1, 0, '2022-07-15 11:59:28', '2022-07-16 00:01:20', 1),(2, 0, '2022-10-01 10:00:28', '2022-10-01 10:00:50', 1),(2, 0, '2022-10-02 10:00:01', '2022-10-02 10:01:50', 1),(2, 0, '2022-10-03 11:00:55', '2022-10-03 11:00:59', 1),(2, 0, '2022-10-04 11:00:45', '2022-10-04 11:00:55', 0),(2, 0, '2022-10-05 11:00:53', '2022-10-05 11:00:59', 1),(2, 0, '2022-10-06 11:00:45', '2022-10-06 11:00:55', 1),(3, 0, '2022-07-07 10:00:00', '2022-07-07 10:00:09', 1),(3, 0, '2022-07-08 10:00:00', '2022-07-08 10:00:09', 1),(3, 0, '2022-07-09 10:00:00', '2022-07-09 10:00:42', 1),(3, 0, '2022-07-11 23:59:55', '2022-07-11 23:59:59', 1),(3, 0, '2022-07-12 10:00:28', '2022-07-12 10:00:50', 1),(3, 0, '2022-07-13 10:00:28', '2022-07-13 10:00:50', 1),(3, 0, '2022-07-14 11:00:28', '2022-07-14 11:00:50', 1),(3, 0, '2022-07-15 11:59:28', '2022-07-16 00:01:20', 1),(4, 0, '2022-07-07 10:00:00', '2022-07-07 10:00:09', 1),(4, 0, '2022-07-08 10:00:00', '2022-07-08 10:00:09', 1),(4, 0, '2022-07-09 10:00:00', '2022-07-09 10:00:42', 1),(4, 0, '2022-07-10 10:00:00', '2022-07-10 10:00:09', 1),(4, 0, '2022-07-11 23:59:55', '2022-07-11 23:59:59', 1),(4, 0, '2022-07-12 10:00:28', '2022-07-12 10:00:50', 0),(4, 0, '2022-07-13 10:00:28', '2022-07-13 10:00:50', 1),(4, 0, '2022-07-14 11:00:28', '2022-07-14 11:00:50', 1),(4, 0, '2022-07-15 11:59:28', '2022-07-16 00:01:20', 1);select * from tmp_tb_user_log;

2、构造flag_days 和 sign_flag

select *,if(datediff(cast(in_time as datetime), cast(lag(in_time,1,in_time) over(partition by uid order by in_time) as datetime),'dd') > 1, 1, 0) as flag_days,if(lag(sign_in,1,sign_in) over(partition by uid order by in_time) != sign_in, 1, 0) as sign_flag
from tmp_tb_user_log

3、构造sum_over 和 group_name

select t.uid,t.artical_id,t.in_time,t.out_time,t.sign_in,sum(if(flag_days = 0 and sign_flag = 0,0,1)) over(partition by uid order by in_time) as group_name
from(select *,if(datediff(cast(in_time as datetime), cast(lag(in_time,1,in_time) over(partition by uid order by in_time) as datetime),'dd') > 1, 1, 0) as flag_days,if(lag(sign_in,1,sign_in) over(partition by uid order by in_time) != sign_in, 1, 0) as sign_flagfrom tmp_tb_user_log) t

4、构造irank

select t1.uid,t1.artical_id,t1.in_time,t1.out_time,t1.sign_in,t1.group_name,row_number() over(partition by t1.uid, t1.group_name order by t1.in_time) as irank
from(select t.uid,t.artical_id,t.in_time,t.out_time,t.sign_in,sum(if(flag_days = 0 and sign_flag = 0,0,1)) over(partition by uid order by in_time) as group_namefrom(select *,if(datediff(cast(in_time as datetime), cast(lag(in_time,1,in_time) over(partition by uid order by in_time) as datetime),'dd') > 1, 1, 0) as flag_days,if(lag(sign_in,1,sign_in) over(partition by uid order by in_time) != sign_in, 1, 0) as sign_flagfrom tmp_tb_user_log) t) t1

5、处理金币获取规则

select t2.uid,t2.artical_id,t2.in_time,t2.out_time,t2.sign_in,t2.group_name,t2.irank,case when t2.irank % 7 = 3 then 3 when t2.irank % 7 = 0 then 7 else 1 end as coin
from(select t1.uid,t1.artical_id,t1.in_time,t1.out_time,t1.sign_in,t1.group_name,row_number() over(partition by t1.uid, t1.group_name order by t1.in_time) as irankfrom(select t.uid,t.artical_id,t.in_time,t.out_time,t.sign_in,sum(if(flag_days = 0 and sign_flag = 0,0,1)) over(partition by uid order by in_time) as group_namefrom(select *,if(datediff(cast(in_time as datetime), cast(lag(in_time,1,in_time) over(partition by uid order by in_time) as datetime),'dd') > 1, 1, 0) as flag_days,if(lag(sign_in,1,sign_in) over(partition by uid order by in_time) != sign_in, 1, 0) as sign_flagfrom tmp_tb_user_log) t) t1) t2

6、分组计算每个用户每个月获取金币数

select t3.uid,substr(t3.in_time,1,7) as pmonth,sum(if(t3.sign_in = 1,coin,0)) as coins
from(select t2.uid,t2.artical_id,t2.in_time,t2.out_time,t2.sign_in,t2.group_name,t2.irank,case when t2.irank % 7 = 3 then 3 when t2.irank % 7 = 0 then 7 else 1 end as coinfrom(select t1.uid,t1.artical_id,t1.in_time,t1.out_time,t1.sign_in,t1.group_name,row_number() over(partition by t1.uid, t1.group_name order by t1.in_time) as irankfrom(select t.uid,t.artical_id,t.in_time,t.out_time,t.sign_in,sum(if(flag_days = 0 and sign_flag = 0,0,1)) over(partition by uid order by in_time) as group_namefrom(select *,if(datediff(cast(in_time as datetime), cast(lag(in_time,1,in_time) over(partition by uid order by in_time) as datetime),'dd') > 1, 1, 0) as flag_days,if(lag(sign_in,1,sign_in) over(partition by uid order by in_time) != sign_in, 1, 0) as sign_flagfrom tmp_tb_user_log) t) t1) t2) t3
group by t3.uid,substr(t3.in_time,1,7);

最后

喜欢的点赞、关注、收藏吧~ 你的支持是最大的创作动力~~

相关内容

热门资讯

监控摄像头接入GB28181平... 流程简介将监控摄像头的视频在网站和APP中直播,要解决的几个问题是:1&...
Windows10添加群晖磁盘... 在使用群晖NAS时,我们需要通过本地映射的方式把NAS映射成本地的一块磁盘使用。 通过...
protocol buffer... 目录 目录 什么是protocol buffer 1.protobuf 1.1安装  1.2使用...
在Word、WPS中插入AxM... 引言 我最近需要写一些文章,在排版时发现AxMath插入的公式竟然会导致行间距异常&#...
【PdgCntEditor】解... 一、问题背景 大部分的图书对应的PDF,目录中的页码并非PDF中直接索引的页码...
Fluent中创建监测点 1 概述某些仿真问题,需要创建监测点,用于获取空间定点的数据࿰...
educoder数据结构与算法...                                                   ...
MySQL下载和安装(Wind... 前言:刚换了一台电脑,里面所有东西都需要重新配置,习惯了所...
修复 爱普生 EPSON L4... L4151 L4153 L4156 L4158 L4163 L4165 L4166 L4168 L4...
MFC文件操作  MFC提供了一个文件操作的基类CFile,这个类提供了一个没有缓存的二进制格式的磁盘...