IT人写技术文档,例如我自己写博客,用的最多的就是 markdown. 但是在浏览器中看到的这些博客都是以 html 的格式展示在人们的面前的. 所以一个自然的问题就是markdown怎么变成html的?
众所周知,markdown和html都是全球通用的标记语言,那么从一种语言要转换为另一种语言不就是编译吗? 这学期刚好学了编译原理. 正好用上.
ps:这里墙裂向大家安利一下 coding 迪士尼陈屹老师的免费课程《自己动手用java写编译器》,一节课讲原理,一节课写代码,知行合一,落到实处.
这里并不想一次性写一个非常完善的markdown转html的语法解析器. 只是想将仅仅包含标题和正文的markdown文档严格遵从编译原理的流程步骤转换为html.
也就是我们写这个编译器的步骤如下
为什么要严格遵从上述编译原理的框架? 因为只有这样,这个编译器的扩展性才更好,才能为后续写更复杂的markdown语法转html编译器打下基础框架. 而不是靠灵光一闪的技巧性处理, 那种是很难维护和扩展的.
什么叫做仅仅包含标题和正文? 举个例子,咱们想要处理的markdown文档长下面的样子
### 人类補完计划 Nerv 绝密 \n
不能逃避、不能逃避、不能逃避,知道何谓痛苦的人比较能温柔待人。这和软弱是不同的。 \n
##这是最优先事项 \n
结果人类的敌人还是敌人, 并非使徒. \n
#### 人类终将補完 \n
肉体归于LCL之海, 灵魂回到莉莉丝之卵, 和亚当握手吧!\n
审判日到来之时, 人类还是使徒? 贤者必将做出抉择!
# 死海古卷的仪式被碇源堂改了! SEELE 对 Nerv 绝不会善罢甘休
编译后的html文档如下
<h3>
人类補完计划 Nerv 绝密
</h3>
<span>
不能逃避、不能逃避、不能逃避,知道何谓痛苦的人比较能温柔待人。这和软弱是不同的。
</span>
<br/>
<h2>
这是最优先事项
</h2>
<span>
结果人类的敌人还是敌人, 并非使徒.
</span>
<br/>
<h4>
人类终将補完
</h4>
<span>
肉体归于LCL之海, 灵魂回到莉莉丝之卵, 和亚当握手吧!
</span>
<br/>
<span>
审判日到来之时, 人类还是使徒? 贤者必将做出抉择!
</span>
<br/>
<h1>
死海古卷的仪式被碇源堂改了! SEELE 对 Nerv 绝不会善罢甘休
</h1>
首先,我们要制定语法. 我们不难 YY 出如下巴克斯范式
article -> title article | line article | ε
title -> POUND line
line -> TEXT LF
其中遵从国际惯例, 小写英文单词表示非终结符, 大写英文单词表示终结符. 关于语法中的token和symbol的说明如下
#
字符.显然,上述语法其实还可以再用一下边角替换再精简一些,变成标准的 LL(1) 语法
article -> title_or_line article | ε
title_or_line -> title | line
title -> POUND line
line -> TEXT LF
但是注意上述巴克斯范式的第二条,其实是不利于我们写 LL(1) 的,而只利于写递归下降. 所以要进行单子替换,于是得到我们最终的语法推导规则
article -> title_or_line article | ε
title_or_line -> POUND line | line
line -> TEXT LF
这里有必要指出的一点就是,因为 LF 被解释为终结符, 所以一行后面只能跟一个换行符. 也就是
### hello world! \n \n Almost every child will complain about their parents sometimes. \n
将编译失败. 因为 hello world! 后面跟了两个 \n
, 但是
### hello world! \n Almost every child will complain about their parents sometimes. \n
将编译成功, 编译为
<h3> hello world! </h3><span> Almost every child will complain about their parents sometimes. </span><br/>
如果要想允许文本后面跟多个换行符,就不能将 LF 解释为终结符,而要将上述语法扩展为
article -> title_or_line article | ε
title_or_line -> POUND line | line
line -> TEXT lf
lf -> LF lf
本文简单起见,并未做这一点. 还有一点,语法要求任何文本都要以换行符 '\n'
结束. 这是因为巴克斯范式 line -> TEXT LF
所致. 如果要打破这一点,同样可以再改动一下语法为
article -> title_or_line article | ε
title_or_line -> POUND line | line
line -> TEXT lf
lf -> LF | ε
同样,本文简单起见,也并未做这一点.
综上所述,我们要求每个终结符 TEXT 都以恰好一个换行符为结束.
本例中的token仅仅有 POUND、TEXT、LF 三个. 只需要提取文本中相应的字符串即可。显然,词法提取如果简单的采用顺序读入然后各种 if...else 的处理的话, 程序将显得异常臃肿. 这里建立了一个词法状态机来进行词法提取
各个弧的转移字符如下
1 #
2 \3 #
4 \5 n
6 空格
7 \8 [^\]
9 \10 [^\]
11 [^\]
12 [^#\]
13 [^空格]
14 空格
15 #
16 \
因为我们 yy 出来的语法是一个标准的 LL(1) 语法,所以使用标准的 PDA 属性化语法解析即可.
该语法的代码生成比较简单,甚至比算术表达式解析还要简单,因为只需要每解析得到一个终结符的时候就进行相应的html打标签即可.
但是代码生成阶段往往要考虑是否存在属性下传(自顶向下语法解析考虑的是属性下传, 自底向上语法解析考虑的是属性上传),这里显然要下传一个属性就是 解析到的文本内容 到底是作为标题还是正文. 所以我们需要下传的是这个属性. 也就是下面代码md2html.js中的 fa.
为了让我的代码更加流行, 我拿起很久没撸过的 JavaScript 搞了一个js版本, 用的es6的语法. js 不是很熟, 用的不好, 各位大佬将就一下哈~
// 词性
const TOKEN_END = 0
const TOKEN_POUND = 1
const TOKEN_TEXT = 2
const TOKEN_LF = 3
// 词法解析节点
const LEX_NL = 0
const LEX_POUND = 1
const LEX_BACK_SLASH = 2
const LEX_TEXT = 3
const LEX_SPACE = 4
// 字符属性
const CHARACTER_POUND = 0
const CHARACTER_BACK_SLASH = 1
const CHARACTER_N = 2
const CHARACTER_SPACE = 3
const CHARACTER_OTHER = 4
// 语法节点
const GRAMMAR_STATE_ARTICLE = 0
const GRAMMAR_STATE_LINE = 1
const GRAMMAR_STATE_TITLE_OR_LINE = 2
const GRAMMAR_STATE_POUND = 3
const GRAMMAR_STATE_TEXT = 4
const GRAMMAR_STATE_LF = 5
// 文本属性
const ATTR_TITLE = 0
const ATTR_TEXT = 1
// 辣鸡 JavaScript, 连栈都要手撸~ 囧~
let Stack = (function () {
const items = new WeakMap();
class Stack {
constructor () {
items.set(this, []);
}
push(element) {
let s = items.get(this);
s.push(element);
}
pop() {
let s = items.get(this);
return s.pop();
}
peek() {
let s = items.get(this);
return s[s.length - 1];
}
isEmpty() {
return items.get(this).length === 0;
}
size() {
return items.get(this).length;
}
clear() {
items.set(this, []);
}
}
return Stack;
})();
let token, cur, pointer = 0, markdown, totLen;
let symbol = [];
let trans = new Map();
let stk = new Stack();
let fa = {
first : null,
second: null
}
let ans = ""
// 初始化 trans
let row0 = new Map()
row0[CHARACTER_POUND] = LEX_POUND, row0[CHARACTER_SPACE] = LEX_SPACE, row0[CHARACTER_BACK_SLASH] = LEX_BACK_SLASH, row0[CHARACTER_N] = LEX_TEXT, row0[CHARACTER_OTHER] = LEX_TEXT
trans[LEX_NL] = row0
let row1 = new Map()
row1[CHARACTER_POUND] = LEX_POUND, row1[CHARACTER_SPACE] = LEX_TEXT, row1[CHARACTER_BACK_SLASH] = LEX_BACK_SLASH, row1[CHARACTER_N] = LEX_TEXT, row1[CHARACTER_OTHER] = LEX_TEXT
trans[LEX_POUND] = row1
let row2 = new Map()
row2[CHARACTER_POUND] = LEX_TEXT, row2[CHARACTER_SPACE] = LEX_TEXT, row2[CHARACTER_BACK_SLASH] = LEX_BACK_SLASH, row2[CHARACTER_N] = LEX_NL, row2[CHARACTER_OTHER] = LEX_TEXT
trans[LEX_BACK_SLASH] = row2
let row3 = new Map()
row3[CHARACTER_POUND] = LEX_TEXT, row3[CHARACTER_SPACE] = LEX_TEXT, row3[CHARACTER_BACK_SLASH] = LEX_BACK_SLASH, row3[CHARACTER_N] = LEX_TEXT, row3[CHARACTER_OTHER] = LEX_TEXT
trans[LEX_TEXT] = row3
let row4 = new Map()
row4[CHARACTER_POUND] = LEX_POUND, row4[CHARACTER_SPACE] = LEX_SPACE, row4[CHARACTER_BACK_SLASH] = LEX_BACK_SLASH, row4[CHARACTER_N] = LEX_TEXT, row4[CHARACTER_OTHER] = LEX_TEXT
trans[LEX_SPACE] = row4
let getType = (c) => {
switch (c)
{
case '#':
return CHARACTER_POUND;
case '\\':
return CHARACTER_BACK_SLASH;
case ' ':
return CHARACTER_SPACE;
case 'n':
return CHARACTER_N;
default:
return CHARACTER_OTHER;
}
}
let adv = () => {
let cmd, i;
cur = LEX_NL, symbol = [];
do
{
if (pointer == totLen)
{
return TOKEN_END;
}
cmd = markdown[pointer++];
symbol.push(cmd);
cur = trans[cur][getType(cmd)];
} while (cur ^ LEX_NL && cur ^ LEX_POUND);
if (cur == LEX_NL)
{
i = 0;
while (symbol[i] == ' ')
{
++i;
}
if (symbol.length - i == 2)
{
return TOKEN_LF;
}
pointer -= 2;
symbol.pop(), symbol.pop();
return TOKEN_TEXT;
}
else if (cur == LEX_POUND)
{
symbol = []
while (cmd == '#')
{
symbol.push(cmd);
cmd = markdown[pointer++]
}
if (cmd != undefined)
{
--pointer
}
return TOKEN_POUND;
}
}
let gen_title = () => {
ans += "<h";
ans += symbol.length;
ans += ">"
}
let gen_p = () => {
if (fa.first == ATTR_TEXT) {
ans += "<span style=\"color:black;\">"
}
for (let i = 0; symbol[i]; i++)
{
if (symbol[i] == ' ')
{
ans += " ";
}
else
{
ans += symbol[i]
}
}
if (fa.first == ATTR_TITLE)
{
ans += "</h"
ans += fa.second
ans += ">"
}
else if (fa.first == ATTR_TEXT) {
ans += "</span>"
}
}
let gen_br = () => {
if (fa.first == ATTR_TEXT) {
ans += "<br/>"
}
}
export const parse = (md) => {
markdown = md;
pointer = 0;
totLen = md.length;
token = adv();
stk.push(GRAMMAR_STATE_ARTICLE);
while (!stk.isEmpty() && token)
{
let action = stk.peek(); stk.pop();
switch (action)
{
case GRAMMAR_STATE_ARTICLE:
if (token)
{
stk.push(GRAMMAR_STATE_ARTICLE);
stk.push(GRAMMAR_STATE_TITLE_OR_LINE); // __cdecl 压栈
}
break;
case GRAMMAR_STATE_TITLE_OR_LINE:
stk.push(GRAMMAR_STATE_LINE);
if (token == TOKEN_POUND)
{
stk.push(GRAMMAR_STATE_POUND);
}
break;
case GRAMMAR_STATE_LINE:
stk.push(GRAMMAR_STATE_LF);
stk.push(GRAMMAR_STATE_TEXT);
break;
case GRAMMAR_STATE_POUND:
if (token == TOKEN_POUND)
{
gen_title(); // html代码生成
fa.first = ATTR_TITLE, fa.second = symbol.length; // 下传父节点属性
}
else
{
return console.log("编译失败..."), 0;
}
token = adv();
break;
case GRAMMAR_STATE_TEXT:
if (token == TOKEN_TEXT)
{
gen_p();
}
else
{
return console.log("编译失败..."), 0;
}
token = adv();
break;
case GRAMMAR_STATE_LF:
if (token == TOKEN_LF)
{
gen_br();
fa.first = ATTR_TEXT;
}
else
{
return console.log("编译失败..."), 0;
}
token = adv();
break;
}
}
console.log(ans)
return console.log("编译成功!"), ans;
}
要在vue中使用它很简单.
<template>
...
<span class = "field" v-html = "md"></span>
</template>
<script>
import { parse } from '@/api/md2html';
export default {
data() {
return {
md: '### 人类终将補完 \\n'
+ ' 肉身归于LCL之海, 灵魂归于莉莉丝之卵. 审批之日终将来临. 这是早就写在死海古卷里面的事情, 无法改变的, 碇源渡.'
}
},
mounted() {
this.md = parse(this.md);
},
}
</script>
markdown文档如下
### hello world! \n Almost every child will complain about their parents sometimes. \n ##No matter what happen to us \n But ignore about the unhappy time, our parents love us all the time. \n #### because when people stay together for a long time \n No matter what happen to us, they will stand by our sides. We should be grateful to them and try to understand them.\n # It is natural \n
编译后的html如下
<h3> hello world! </h3><br/><p> Almost every child will complain about their parents sometimes. </p><br/><h2>No matter what happen to us </h2><br/><p> But ignore about the unhappy time, our parents love us all the time. </p><br/><h4> because when people stay together for a long time </h4><br/><p> No matter what happen to us, they will stand by our sides. We should be grateful to them and try to understand them.</p><br/><h1> It is natural </h1><br/>
浏览器展示如下