前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >百度大数据面试SQL-无效搜索

百度大数据面试SQL-无效搜索

作者头像
数据仓库晨曦
发布2024-06-27 20:39:06
910
发布2024-06-27 20:39:06
举报
文章被收录于专栏:数据仓库技术数据仓库技术

一、题目

现有一份用户搜索日志,包含用户ID,时间,用户搜索内容。定义 无效搜索:如果用户下一次搜索内容中包含本次搜索内容,则认为本次搜索为无效搜索。请查询用户无效搜索记录

样例数据

代码语言:javascript
复制
+---------+---------------------+------------------------+
| user_id |     search_time      |     search_content      |
+---------+---------------------+------------------------+
|       1 | 2022-01-01 10:00:00 |          apple           |
|       1 | 2022-01-01 11:30:00 |    banana and apple      |
|       1 | 2022-01-01 12:45:00 |        fruit salad       |
|       1 | 2022-01-01 15:00:00 |         apple pie        |
|       1 | 2022-01-01 16:20:00 |   applesauce recipe      |
|       2 | 2022-01-01 10:00:00 |        cat food          |
|       2 | 2022-01-01 11:30:00 |   wet vs dry cat food    |
|       2 | 2022-01-01 12:45:00 | homemade cat food recipe |
|       2 | 2022-01-01 14:00:00 | cat food brands to avoid |
|       2 | 2022-01-01 16:20:00 | best cat food for i...   |
|       3 | 2022-01-01 10:00:00 |          book            |
|       3 | 2022-01-01 11:30:00 |  books like Harry Potter|
|       3 | 2022-01-01 13:00:00 |best selling books ...   |
|       3 | 2022-01-01 14:30:00 |   bookstores near me    |
|       3 | 2022-01-01 15:45:00 |    how to publish a book|
+---------+---------------------+------------------------+

二、分析

  1. 有序计算,即对上下两行数据进行比较,考察的是lead()函数,用法参考:hive开窗函数-lag和lead函数
  2. 字符串包含判断,instr()函数。 INSTR(str, substr) 其中,str是要搜索的字符串,substr是要查找的子字符串。该函数返回子字符串在指定字符串中第一次出现的位置,如果未找到则返回0。

维度

评分

题目难度

⭐️⭐️⭐️

题目清晰度

⭐️⭐️⭐️⭐️⭐

业务常见度

⭐️⭐️⭐️⭐️

三、SQL

1.查询出下一行数据,并把下一行搜索内容作为新字段放到本行

执行SQL

代码语言:javascript
复制
select user_id,
       search_time,
       search_content,
       lead(search_content) over (partition by user_id order by search_time asc) as next_search_content
from user_search_log

查询结果

代码语言:javascript
复制
+----------+------------------------+---------------------------------+---------------------------------+
| user_id  |      search_time       |         search_content          |       next_search_content       |
+----------+------------------------+---------------------------------+---------------------------------+
| 1        | 2022-01-01 10:00:00.0  | apple                           | banana and apple                |
| 1        | 2022-01-01 11:30:00.0  | banana and apple                | fruit salad                     |
| 1        | 2022-01-01 12:45:00.0  | fruit salad                     | apple pie                       |
| 1        | 2022-01-01 15:00:00.0  | apple pie                       | applesauce recipe               |
| 1        | 2022-01-01 16:20:00.0  | applesauce recipe               | NULL                            |
| 2        | 2022-01-01 10:00:00.0  | cat food                        | wet vs dry cat food             |
| 2        | 2022-01-01 11:30:00.0  | wet vs dry cat food             | homemade cat food recipe        |
| 2        | 2022-01-01 12:45:00.0  | homemade cat food recipe        | cat food brands to avoid        |
| 2        | 2022-01-01 14:00:00.0  | cat food brands to avoid        | best cat food for indoor cats   |
| 2        | 2022-01-01 16:20:00.0  | best cat food for indoor cats   | NULL                            |
| 3        | 2022-01-01 10:00:00.0  | book                            | books like Harry Potter         |
| 3        | 2022-01-01 11:30:00.0  | books like Harry Potter         | best selling books of all time  |
| 3        | 2022-01-01 13:00:00.0  | best selling books of all time  | bookstores near me              |
| 3        | 2022-01-01 14:30:00.0  | bookstores near me              | how to publish a book           |
| 3        | 2022-01-01 15:45:00.0  | how to publish a book           | NULL                            |
+----------+------------------------+---------------------------------+---------------------------------+
2.比较搜索内容是否为下一次搜索内容的子字符串,给判断逻辑打标记(如果是返回1,否则返回0)

执行SQL

代码语言:javascript
复制
select user_id,
       search_time,
       search_content,
       lead(search_content) over (partition by user_id order by search_time asc)                                      as next_search_content,
       If(instr(lead(search_content) over (partition by user_id order by search_time asc), search_content) > 0, 1,
          0)                                                                                                          as flag
from user_search_log

查询结果

代码语言:javascript
复制
+----------+------------------------+---------------------------------+---------------------------------+-------+
| user_id  |      search_time       |         search_content          |       next_search_content       | flag  |
+----------+------------------------+---------------------------------+---------------------------------+-------+
| 1        | 2022-01-01 10:00:00.0  | apple                           | banana and apple                | 1     |
| 1        | 2022-01-01 11:30:00.0  | banana and apple                | fruit salad                     | 0     |
| 1        | 2022-01-01 12:45:00.0  | fruit salad                     | apple pie                       | 0     |
| 1        | 2022-01-01 15:00:00.0  | apple pie                       | applesauce recipe               | 0     |
| 1        | 2022-01-01 16:20:00.0  | applesauce recipe               | NULL                            | 0     |
| 2        | 2022-01-01 10:00:00.0  | cat food                        | wet vs dry cat food             | 1     |
| 2        | 2022-01-01 11:30:00.0  | wet vs dry cat food             | homemade cat food recipe        | 0     |
| 2        | 2022-01-01 12:45:00.0  | homemade cat food recipe        | cat food brands to avoid        | 0     |
| 2        | 2022-01-01 14:00:00.0  | cat food brands to avoid        | best cat food for indoor cats   | 0     |
| 2        | 2022-01-01 16:20:00.0  | best cat food for indoor cats   | NULL                            | 0     |
| 3        | 2022-01-01 10:00:00.0  | book                            | books like Harry Potter         | 1     |
| 3        | 2022-01-01 11:30:00.0  | books like Harry Potter         | best selling books of all time  | 0     |
| 3        | 2022-01-01 13:00:00.0  | best selling books of all time  | bookstores near me              | 0     |
| 3        | 2022-01-01 14:30:00.0  | bookstores near me              | how to publish a book           | 0     |
| 3        | 2022-01-01 15:45:00.0  | how to publish a book           | NULL                            | 0     |
+----------+------------------------+---------------------------------+---------------------------------+-------+

3.限制标签为1,查询出最后结果

执行SQL

代码语言:javascript
复制
select user_id,
       search_time,
       search_content
from (select user_id,
             search_time,
             search_content,
             if(instr(lead(search_content) over (partition by user_id order by search_time asc), search_content) > 0, 1,
                0) as flag
      from user_search_log) t
where flag = 1;

查询结果

代码语言:javascript
复制
+----------+------------------------+-----------------+
| user_id  |      search_time       | search_content  |
+----------+------------------------+-----------------+
| 1        | 2022-01-01 10:00:00.0  | apple           |
| 2        | 2022-01-01 10:00:00.0  | cat food        |
| 3        | 2022-01-01 10:00:00.0  | book            |
+----------+------------------------+-----------------+

四、建表语句和数据插入

代码语言:javascript
复制
--建表语句
CREATE TABLE user_search_log (
  user_id STRING,
  search_time TIMESTAMP,
  search_content STRING
) STORED AS PARQUET;
--插入数据
INSERT INTO user_search_log
VALUES
  ('1', '2022-01-01 10:00:00', 'apple'),
  ('1', '2022-01-01 11:30:00', 'banana and apple'),
  ('1', '2022-01-01 12:45:00', 'fruit salad'),
  ('1', '2022-01-01 15:00:00', 'apple pie'),
  ('1', '2022-01-01 16:20:00', 'applesauce recipe'),
  ('2', '2022-01-01 10:00:00', 'cat food'),
  ('2', '2022-01-01 11:30:00', 'wet vs dry cat food'),
  ('2', '2022-01-01 12:45:00', 'homemade cat food recipe'),
  ('2', '2022-01-01 14:00:00', 'cat food brands to avoid'),
  ('2', '2022-01-01 16:20:00', 'best cat food for indoor cats'),
  ('3', '2022-01-01 10:00:00', 'book'),
  ('3', '2022-01-01 11:30:00', 'books like Harry Potter'),
  ('3', '2022-01-01 13:00:00', 'best selling books of all time'),
  ('3', '2022-01-01 14:30:00', 'bookstores near me'),
  ('3', '2022-01-01 15:45:00', 'how to publish a book');
本文参与 腾讯云自媒体同步曝光计划,分享自微信公众号。
原始发表:2024-06-24,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 数据仓库技术 微信公众号,前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体同步曝光计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
目录
  • 一、题目
  • 二、分析
  • 三、SQL
    • 1.查询出下一行数据,并把下一行搜索内容作为新字段放到本行
      • 2.比较搜索内容是否为下一次搜索内容的子字符串,给判断逻辑打标记(如果是返回1,否则返回0)
    • 3.限制标签为1,查询出最后结果
    • 四、建表语句和数据插入
    相关产品与服务
    大数据
    全栈大数据产品,面向海量数据场景,帮助您 “智理无数,心中有数”!
    领券
    问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档