Learning Regular Expressions - Marking Up Text with HTML

Author: 用户1148526
Published: 2023-10-14 09:48:53
Column: Hadoop数据仓库 (Hadoop Data Warehouse)

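I. Requirements

Using Coleridge's poem in rime.txt as the sample text, add HTML5 tags to plain text with regular expressions. The rime.txt file is available on GitHub at https://github.com/michaeljamesfitzgerald/Introducing-Regular-Expressions. To save space, only an excerpt of the text is used as test data.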

II. Implementation

1. Insert the test data

drop table if exists t1;
create table t1 (a text);
insert into t1 values (
'THE RIME OF THE ANCYENT MARINERE, IN SEVEN PARTS.

ARGUMENT.

How a Ship having passed the Line was driven by Storms to the cold Country towards the South Pole; and how from thence she made her course to the tropical Latitude of the Great Pacific Ocean; and of the strange things that befell; and in what manner the Ancyent Marinere came back to his own Country.

I.

     It is an ancyent Marinere,
       And he stoppeth one of three:
     "By thy long grey beard and thy glittering eye
       "Now wherefore stoppest me?

     "The Bridegroom\'s doors are open\'d wide
       "And I am next of kin;
     "The Guests are met, the Feast is set,--
       "May\'st hear the merry din.--

II.

     The Sun came up upon the right,
       Out of the Sea came he;
     And broad as a weft upon the left
       Went down into the Sea.

     And the good south wind still blew behind,
       But no sweet Bird did follow
     Ne any day for food or play
       Came to the Marinere\'s hollo!

III.

     I saw a something in the Sky
       No bigger than my fist;
     At first it seem\'d a little speck
       And then it seem\'d a mist:
     It mov\'d and mov\'d, and took at last
       A certain shape, I wist.

     A speck, a mist, a shape, I wist!
       And still it ner\'d and ner\'d;
     And, an it dodg\'d a water-sprite,
       It plung\'d and tack\'d and veer\'d.

IV.

     "I fear thee, ancyent Marinere!
       "I fear thy skinny hand;
     "And thou art long and lank and brown
       "As is the ribb\'d Sea-sand.

     "I fear thee and thy glittering eye
       "And thy skinny hand so brown"--
     Fear not, fear not, thou wedding guest!
       This body dropt not down.

V.

     O sleep, it is a gentle thing
       Belov\'d from pole to pole!
     To Mary-queen the praise be yeven
     She sent the gentle sleep from heaven
       That slid into my soul.

     The silly buckets on the deck
       That had so long remain\'d,
     I dreamt that they were fill\'d with dew
       And when I awoke it rain\'d.

VI.

           FIRST VOICE.
     "But tell me, tell me! speak again,
       "Thy soft response renewing--
     "What makes that ship drive on so fast?
       "What is the Ocean doing?"

           SECOND VOICE.
     "Still as a Slave before his Lord,
       "The Ocean hath no blast:
     "His great bright eye most silently
       "Up to the moon is cast--

VII.

     This Hermit good lives in that wood
       Which slopes down to the Sea.
     How loudly his sweet voice he rears!
     He loves to talk with Marineres
       That come from a far Contrée.

     He kneels at morn and noon and eve--
       He hath a cushion plump:
     It is the moss, that wholly hides
       The rotted old Oak-stump.'
);

2. Add the tags with a SQL query

with 
t1 as (select regexp_replace                       -- add the header tags
(a, '^(.*)$',
'<!DOCTYPE html>
<html lang="en">
 <head>
  <title>$1</title>
 </head>
<body>
<h1>$1</h1>',1,1,'m') a from t1),
t2 as (select regexp_replace                       -- add the footer tags
(a, '($)',concat('$1',char(10),char(10),'</body>',char(10),'</html>')) a from t1),
t3 as (select regexp_replace                       -- add the heading tags
(a, '^(ARGUMENT\\.|((I{1,3}|IV|VI{0,2})\\.))$','<h2>$1</h2>', 1,0,'m') a from t2),
t4 as (select regexp_replace                       -- add the paragraph tags
(a, '((?<=<h2>ARGUMENT\\.</h2>\\n).*?(?=\\n<h2>I\\.</h2>)|(?<=<h2>I\\.</h2>\\n).*?(?=\\n</body>))','<p>$1</p>', 1,0,'n') a from t3),
t5 as (select regexp_replace                       -- add the line-break tags
(a, '^([ ]{5,7}.*)', '$1<br/>',1,0,'m') a from t4),
t6 as (select regexp_replace(regexp_replace        -- add the blank-line tags
(a, '^$', '<br/>',1,0,'m'),'<br/>','',1,1,'m') a from t5)
select * from t6;

The query result is shown below. Basic HTML5 header and footer, heading, paragraph, and line-break tags have been added to the original text:

<!DOCTYPE html>
<html lang="en">
 <head>
  <title>THE RIME OF THE ANCYENT MARINERE, IN SEVEN PARTS.</title>
 </head>
<body>
<h1>THE RIME OF THE ANCYENT MARINERE, IN SEVEN PARTS.</h1>

<h2>ARGUMENT.</h2>
<p>
How a Ship having passed the Line was driven by Storms to the cold Country towards the South Pole; and how from thence she made her course to the tropical Latitude of the Great Pacific Ocean; and of the strange things that befell; and in what manner the Ancyent Marinere came back to his own Country.
</p>
<h2>I.</h2>
<p>
     It is an ancyent Marinere,<br/>
       And he stoppeth one of three:<br/>
     "By thy long grey beard and thy glittering eye<br/>
       "Now wherefore stoppest me?<br/>
<br/>
     "The Bridegroom's doors are open'd wide<br/>
       "And I am next of kin;<br/>
     "The Guests are met, the Feast is set,--<br/>
       "May'st hear the merry din.--<br/>
<br/>
<h2>II.</h2>
<br/>
     The Sun came up upon the right,<br/>
       Out of the Sea came he;<br/>
     And broad as a weft upon the left<br/>
       Went down into the Sea.<br/>
<br/>
     And the good south wind still blew behind,<br/>
       But no sweet Bird did follow<br/>
     Ne any day for food or play<br/>
       Came to the Marinere's hollo!<br/>
<br/>
<h2>III.</h2>
<br/>
     I saw a something in the Sky<br/>
       No bigger than my fist;<br/>
     At first it seem'd a little speck<br/>
       And then it seem'd a mist:<br/>
     It mov'd and mov'd, and took at last<br/>
       A certain shape, I wist.<br/>
<br/>
     A speck, a mist, a shape, I wist!<br/>
       And still it ner'd and ner'd;<br/>
     And, an it dodg'd a water-sprite,<br/>
       It plung'd and tack'd and veer'd.<br/>
<br/>
<h2>IV.</h2>
<br/>
     "I fear thee, ancyent Marinere!<br/>
       "I fear thy skinny hand;<br/>
     "And thou art long and lank and brown<br/>
       "As is the ribb'd Sea-sand.<br/>
<br/>
     "I fear thee and thy glittering eye<br/>
       "And thy skinny hand so brown"--<br/>
     Fear not, fear not, thou wedding guest!<br/>
       This body dropt not down.<br/>
<br/>
<h2>V.</h2>
<br/>
     O sleep, it is a gentle thing<br/>
       Belov'd from pole to pole!<br/>
     To Mary-queen the praise be yeven<br/>
     She sent the gentle sleep from heaven<br/>
       That slid into my soul.<br/>
<br/>
     The silly buckets on the deck<br/>
       That had so long remain'd,<br/>
     I dreamt that they were fill'd with dew<br/>
       And when I awoke it rain'd.<br/>
<br/>
<h2>VI.</h2>
<br/>
           FIRST VOICE.<br/>
     "But tell me, tell me! speak again,<br/>
       "Thy soft response renewing--<br/>
     "What makes that ship drive on so fast?<br/>
       "What is the Ocean doing?"<br/>
<br/>
           SECOND VOICE.<br/>
     "Still as a Slave before his Lord,<br/>
       "The Ocean hath no blast:<br/>
     "His great bright eye most silently<br/>
       "Up to the moon is cast--<br/>
<br/>
<h2>VII.</h2>
<br/>
     This Hermit good lives in that wood<br/>
       Which slopes down to the Sea.<br/>
     How loudly his sweet voice he rears!<br/>
     He loves to talk with Marineres<br/>
       That come from a far Contrée.<br/>
<br/>
     He kneels at morn and noon and eve--<br/>
       He hath a cushion plump:<br/>
     It is the moss, that wholly hides<br/>
       The rotted old Oak-stump.<br/>
</p>
</body>
</html>

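III. Analysis

The implementation chains six common table expressions (the WITH clause), each building on the previous one, and calls regexp_replace seven times in total to add the tags.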

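1. Add the header tags

The following regexp_replace call adds the header tags.

regexp_replace
(a, '^(.*)$',
'<!DOCTYPE html>
<html lang="en">
 <head>
  <title>$1</title>
 </head>
<body>
<h1>$1</h1>',1,1,'m')
  • Multiline mode m is used, so ^ and $ match at the start and end of every line rather than only at the start and end of the whole text.
  • The regular expression ^(.*)$ matches a single line of the source text and places it in a capture group.
  • The position argument 1 and the occurrence argument 1 restrict the replacement to the first match, that is, the first line.
  • The html, head, title, body, and h1 tags are added, with $1 referring to the capture group.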
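
As a rough illustration, the same step can be sketched with Python's re module. This is a stand-in under stated assumptions, not the article's MySQL code: Python writes the backreference as \1 instead of $1, count=1 plays the role of the occurrence argument, and the name add_header is hypothetical.

```python
import re

# Replacement template: \1 is the captured first line (assumption: Python's
# re syntax is used here in place of MySQL's $1).
HEADER = (
    "<!DOCTYPE html>\n"
    '<html lang="en">\n'
    " <head>\n"
    "  <title>\\1</title>\n"
    " </head>\n"
    "<body>\n"
    "<h1>\\1</h1>"
)


def add_header(text: str) -> str:
    """Wrap the first line of text in the HTML5 header (hypothetical helper)."""
    # re.M makes ^ and $ match at line boundaries; count=1 stops after
    # the first match, so only the title line is replaced.
    return re.sub(r"^(.*)$", HEADER, text, count=1, flags=re.M)


if __name__ == "__main__":
    print(add_header("THE RIME OF THE ANCYENT MARINERE, IN SEVEN PARTS.\n\nARGUMENT."))
```

Without count=1 every line, including the empty ones, would be wrapped in a header of its own, which is why the original query passes 1 as the occurrence.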

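2. Add the footer tags

The following regexp_replace call adds the footer tags.

regexp_replace(a, '($)',concat('$1',char(10),char(10),'</body>',char(10),'</html>'))
  • The default match mode is used (neither multiline nor dotall), so $ matches only at the end of the whole text.
  • The regular expression ($) matches the single end-of-text position (a zero-width assertion) and places the empty match in a capture group.
  • The concat function appends a newline, a blank line, and then </body> and </html> at that position.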
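
The zero-width match at the end of the text can be sketched in Python as follows. This is only an approximation of the MySQL call: it assumes the text does not end with a newline (as in the test data), because Python's $ would otherwise also match just before that final newline. The name add_footer is hypothetical.

```python
import re


def add_footer(text: str) -> str:
    """Append the closing tags after the last character (hypothetical helper)."""
    # No flags: "$" is a zero-width match at the end of the string, so the
    # replacement is inserted there and no existing character is consumed.
    # \1 is the (empty) captured match, mirroring '$1' in the SQL version.
    return re.sub(r"($)", "\\1\n\n</body>\n</html>", text)


if __name__ == "__main__":
    print(add_footer("       The rotted old Oak-stump."))
```

Because the match is empty, the capture group contributes nothing to the output; a plain string concatenation would give the same result.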

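3. Add the heading tags

The poem is divided into seven parts, each introduced by a Roman numeral. There is also an "ARGUMENT" heading. The following regexp_replace call captures the heading and the Roman numerals and wraps them in <h2> tags.

regexp_replace(a, '^(ARGUMENT\\.|((I{1,3}|IV|VI{0,2})\\.))$','<h2>$1</h2>', 1,0,'m')
  • Multiline mode m is used, so ^ and $ match at every line boundary and the replacement can apply to many lines.
  • The regular expression ^(ARGUMENT\\.|((I{1,3}|IV|VI{0,2})\\.))$ matches the ARGUMENT heading line and every Roman numeral line (I through VII), and places the match in a capture group. The doubled backslash is MySQL string escaping; the regex engine sees \. (a literal period).
  • The occurrence argument 0 replaces all matches.
  • The <h2> and </h2> tags are added, with $1 referring to the capture group.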
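
To see which lines the heading pattern accepts, here is a minimal Python sketch. It assumes Python's re syntax (\1 rather than $1, a single backslash before the period in a raw string); the names HEADING and add_headings are hypothetical.

```python
import re

# I{1,3} covers I, II, III; IV covers IV; VI{0,2} covers V, VI, VII.
HEADING = re.compile(r"^(ARGUMENT\.|((I{1,3}|IV|VI{0,2})\.))$", re.M)


def add_headings(text: str) -> str:
    """Wrap each heading line in <h2> tags (hypothetical helper)."""
    # No count argument: every matching line is replaced, like occurrence 0.
    return HEADING.sub(r"<h2>\1</h2>", text)


if __name__ == "__main__":
    sample = "ARGUMENT.\n\nI.\n\n     I saw a something in the Sky\n\nVII."
    print(add_headings(sample))
```

The ^ and $ anchors matter here: a verse line such as "I saw a something in the Sky" starts with I but is not followed by a period and the end of the line, so it is left alone.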

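4. Add the paragraph tags

The following regexp_replace call adds the paragraph tags.

regexp_replace(a, '((?<=<h2>ARGUMENT\\.</h2>\\n).*?(?=\\n<h2>I\\.</h2>)|(?<=<h2>I\\.</h2>\\n).*?(?=\\n</body>))','<p>$1</p>', 1,0,'n')
  • Two pairs of paragraph tags are added: the first paragraph is the body of the ARGUMENT section, and the second is the entire body of the poem.
  • Because the paragraphs contain newlines, dotall mode n is required so that . also matches the newline character \n.
  • (?<=<h2>ARGUMENT\\.</h2>\\n).*?(?=\\n<h2>I\\.</h2>) uses lookaround to match the text between <h2>ARGUMENT.</h2> and <h2>I.</h2>.
  • (?<=<h2>I\\.</h2>\\n).*?(?=\\n</body>) uses lookaround to match the text between <h2>I.</h2> and </body>.
  • The match is placed in a capture group.
  • The occurrence argument 0 replaces all matches.
  • The <p> and </p> tags are added around both paragraphs, with $1 referring to the capture group.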
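
The lookaround step can be sketched in Python as below. This is an illustration under assumptions, not the article's code: re.S is Python's dotall flag (the counterpart of MySQL's n), the backreference is \1, and the names PARAGRAPH and add_paragraphs are hypothetical. The input is assumed to already carry the <h2> and </body> tags from the earlier steps.

```python
import re

# Lookbehind and lookahead assert the surrounding tags without consuming
# them, so only the text in between ends up in the capture group.
PARAGRAPH = re.compile(
    r"((?<=<h2>ARGUMENT\.</h2>\n).*?(?=\n<h2>I\.</h2>)"
    r"|(?<=<h2>I\.</h2>\n).*?(?=\n</body>))",
    re.S,  # dotall: "." also matches newlines, so a match can span lines
)


def add_paragraphs(text: str) -> str:
    """Wrap the ARGUMENT body and the poem body in <p> tags (hypothetical helper)."""
    return PARAGRAPH.sub(r"<p>\1</p>", text)


if __name__ == "__main__":
    sample = (
        "<h2>ARGUMENT.</h2>\n\nHow a Ship...\n\n"
        "<h2>I.</h2>\n\n     It is an ancyent Marinere,\n\n</body>\n</html>"
    )
    print(add_paragraphs(sample))
```

Each match begins and ends with the newline of an adjacent blank line, which is why <p> and </p> land on lines of their own in the query result.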

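5. Add the line-break tags

The following regexp_replace call marks up the individual lines of verse.

regexp_replace(a, '^([ ]{5,7}.*)', '$1<br/>',1,0,'m')
  • Multiline mode m is used, so ^ matches at the start of every line and the replacement can apply to many lines.
  • The regular expression ^([ ]{5,7}.*) matches every line that begins with at least five spaces and places it in a capture group. [ ]{5,7} consumes five to seven of the spaces and .* takes the rest of the line, which is why the more deeply indented FIRST VOICE. and SECOND VOICE. lines are matched as well.
  • The occurrence argument 0 replaces all matches.
  • The line-break tag <br/> is appended to each line of verse, with $1 referring to the capture group.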
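
A short Python sketch shows which lines the indentation pattern picks up. It assumes Python's re syntax (\1 rather than $1); the name add_line_breaks is hypothetical.

```python
import re


def add_line_breaks(text: str) -> str:
    """Append <br/> to every line indented by five or more spaces (hypothetical helper)."""
    # re.M lets ^ match at each line start; "." does not match a newline
    # here, so .* stops at the end of the current line.
    return re.sub(r"^([ ]{5,7}.*)", r"\1<br/>", text, flags=re.M)


if __name__ == "__main__":
    sample = (
        "<h2>VI.</h2>\n"
        "           FIRST VOICE.\n"
        '     "But tell me, tell me! speak again,\n'
        '       "Thy soft response renewing--'
    )
    print(add_line_breaks(sample))
```

Lines with fewer than five leading spaces, such as the <h2> headings, are not matched and stay unchanged.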

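6. Add the blank-line tags

The following two nested regexp_replace calls add tags for the blank lines.

regexp_replace(regexp_replace(a, '^$', '<br/>',1,0,'m'),'<br/>','',1,1,'m')
  • Multiline mode m is used so that ^ and $ match at every line boundary.
  • The regular expression ^$ matches an empty line.
  • The inner regexp_replace replaces every empty line with a <br/> tag.
  • The outer regexp_replace replaces the first <br/> with an empty string, restoring the blank line between the h1 and the first h2.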
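
The two-pass approach can be sketched in Python as follows. This is a stand-in under assumptions: count=1 mirrors the occurrence argument of the outer call, the input is assumed not to end with a newline, and the name mark_blank_lines is hypothetical.

```python
import re


def mark_blank_lines(text: str) -> str:
    """Turn empty lines into <br/>, then undo the first one (hypothetical helper)."""
    # Pass 1: with re.M, ^$ matches the zero-width position of an empty line.
    marked = re.sub(r"^$", "<br/>", text, flags=re.M)
    # Pass 2: remove only the first <br/>, which sits on the blank line
    # after the <h1> because that line precedes every line of verse.
    return re.sub(r"<br/>", "", marked, count=1)


if __name__ == "__main__":
    sample = (
        "<h1>TITLE</h1>\n\n<h2>ARGUMENT.</h2>\n"
        "     verse one<br/>\n\n     verse two<br/>"
    )
    print(mark_blank_lines(sample))
```

The second pass relies on ordering: it works only because the blank line after the h1 comes before any <br/> added to the verse lines in the previous step.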

Originally published on the author's personal site/blog on 2023-05-19.
