Look 0 of my work involves HTML, well maybe 1-2 percent does; however, about 60% of my work involves regular expressions, grammar, lexical scanning and syntactic parsing, so it still irks me, and will irk me beyond my grave, when people say shit like ‘Don’t parse HTML/Markdown/etc with regex! Use a parser generator!’
So this is stupid, because most people know that HTML and Markdown are not the type of languages that require a push-down parser, or even a simple LL(1) recursive-descent parser! Unless by ‘parser generator’ they mean ‘lexer generator’ or ‘PEG generator’, they are wrong, or at least, partly incorrect.
Like my diabetes, they are not grammatically Type 2 (Chomsky-wise, Context-Free); rather, they are Type 3 (Chomsky-wise, Regular).
It’s preferred if you don’t do a syntax-directed lexical translation of Markdown or HTML, and it’s best if you build a tree. I learned that making Mukette and I am currently using my implementation of ASDL to build a tree. But truth is, unlike Context-Free languages, like any non-markup language, it is ENTIRELY possible to do a syntax-directed translation of HTML and Markdown, using pre-compiled, or runtime-compiled regex.
You will have to introduce states to make it a proper Automata, but even that is not required. I once did a syntax-directed translation of Markdown to HTML in AWK! With just one extra state.
I don’t remember the copypasta that was talk of the town 10 years ago, I was a kid back then (17) and I could not dig it up. But it’s a troll that has stuck with me ever since.
Maybe, just maybe, a PEG paser generator could have been what they meant. But even then, PEG generators generate a recursive-descent parser most of the times.
In fact, I dare you to use Byacc, Btacc, Bison, Racc, PYLR, ANTLR, peg(1), leg(1), PackCC or any of these LALR or LL parser generators to parse a markup language. You’ll have a very bad time, it is not impossible, it’s just an overkill.
TL;DR: Most markup languages, like HTML or Markdown, are best lexed, not parsed! Even if you wish to make a tree out of it. But for syntax-directed translations, REs would do.
Thanks.
PS: If you translate a markup language into a tree, you can translate that tree into other markup languages. That’s what Pandoc does. Pandoc is hands-down the best piece of tool I have laid my hands on.
This advice mostly applies to people who are less experienced and less familiar with just how complex HTML can be. As for other languages - if you’re doing regex on markdown, you’ll probably be fine (but you should verify if you’re writing something for the general case that must not fail). But in HTML’s case:
- You have nested languages (CSS and JS)
- You have tag-specific rules (
img
andlink
end in, but
div
must end in a separate closing tag) - Browsers use error correction to try to make sense of invalid HTML, like inserting missing tags. Many websites rely on this behavior.
If you’re trying to use Regex to parse a specific website’s HTML, you’ll be able to get what you want eventually, but as a general HTML parser, there will always be some website that breaks your assumptions.
HTML parsers scare me. I already knew it was a big job, but this blog post sealed the deal that HTML, err… the web’s interpretation of HTML(?), is one heck of a mess.
https://jakearchibald.com/2023/against-self-closing-tags-in-html/
The things is, if you try to parse html with a regex, you will always be able to construct valid html that will break it, no matter how complex your regex becomes.
Maybe you shouldn’t use regex to parse html?
How many combinations and levels of italics, bold and
strikethrough, combined with escaped chars like * can your program handle?How many combinations and levels of *italics*, **bold** and ~~strikethrough~~, combined with escaped chars like \* can your program handle?
btw this is the ASDL I wrote:
%{ #include "mukette.h" static Arena *ast_scratch = NULL; #define ALLOC(size) arena_alloc(ast_scratch, size) %} md_linkage = Hyperlink(md_compound? name, string url) | Image(string? alt, string path) ; md_inline = Italic(md_compound italic) | Bold(md_compound bold) | BoldItalic(md_compound bold_italic) | Strikethrough(md_compound strike_through) | InlineCode(string inline_code) | Linkage(md_linkage linkage) | RegularText(string regular_text) ; md_header_level = H1 | H2 | H3 | H4 | H5 | H6 ; md_line = Header(md_compound text, md_header_level level) | Indented(md_compound text, usize num_indents) | LinkReference(identifier name, string url, string title) | HorizontalLine ; md_compound = (md_inline* compound) ; md_table_row = (md_compound cells, size num_cell) ; md_table = (md_table_row* rows, size num_rows) ; md_ol_item = (int bullet_num, md_list_item item) ; md_ul_item = (char bullet_char, md_list_item item) ; md_list_item = TextItem(string text) | OrderedItem(md_ol_item ordered_item) | UnorderedItem(md_ul_item unordered_item) | NestedList(md_list nested_list) ; md_list = (md_list_item* items) ; md_block = Pargraph(md_compound* paragraph) | BlockQuote(md_compound* block_quote)2 | CodeListing(identifier? label, string code) | Table(md_table table) | List(md_list list) | Line(md_line line) ; markdown = (md_block* blocks) ; %% static inline void init_tree_scratch(void) { ast_scratch = arena_init(AST_ARENA_SIZE); } static inline void free_tree_scratch(void) { arena_free(ast_scratch); }
I had an easier time parsing ASDL with Yacc. I still can’t tell whether a grammar is LR, LL or RE, but I can tell that Markdown is not CFG.
I just updated ASDL: https://github.com/Chubek/ZephyrASDL
Apologies if I am too late on the documetnation. I am still trying to improve it by using it myself. I also wish to add an Ocaml target.
md_inline and md_compound use each other, and not only at the end xor the beginning of the rule, making this a non-type 3 grammar.
Sorry for the late response, i wanted to do a better response but don’t have the time for that currently.
Thanks. I actually have a parse-related question which I will post somewhere soon (as in 2-3 minute).