How browsers parse HTML
By Jane Doe, April 2026 — a long-enough body so Readability accepts it.
This article explains the three main phases a browser goes through to turn HTML bytes into a DOM tree, the first step on the way to pixels: decoding the byte stream, tokenization, and DOM construction. Plenty of words here so Readability doesn't bail out on charThreshold.
Phases
- Decode bytes into characters
- Tokenize into start/end tags
- Build the DOM
  - Parents adopt children
  - Implicit elements get inserted
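The first two phases can be sketched in a few lines of TypeScript. This is a minimal illustration, not how a real browser does it: the `Token` type and the `decode` and `tokenize` helpers are hypothetical names, UTF-8 is assumed as the encoding, and attributes, comments, and error recovery are ignored.

```typescript
// Hypothetical token shape for illustration; real tokenizers also emit
// comments, doctypes, and attributes.
type Token =
  | { kind: "start"; name: string }
  | { kind: "end"; name: string }
  | { kind: "text"; data: string };

// Phase 1: decode bytes into characters. UTF-8 is an assumption here;
// browsers work out the encoding from headers, a BOM, or <meta> tags.
function decode(bytes: Uint8Array): string {
  return new TextDecoder("utf-8").decode(bytes);
}

// Phase 2: split the characters into start tags, end tags, and text.
function tokenize(input: string): Token[] {
  const tokens: Token[] = [];
  const tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>/g;
  let last = 0;
  let m: RegExpExecArray | null;
  while ((m = tagPattern.exec(input)) !== null) {
    // Anything between the previous tag and this one is text.
    if (m.index > last) {
      tokens.push({ kind: "text", data: input.slice(last, m.index) });
    }
    const name = m[2].toLowerCase();
    tokens.push(m[1] === "/" ? { kind: "end", name } : { kind: "start", name });
    last = m.index + m[0].length;
  }
  // Trailing text after the final tag.
  if (last < input.length) {
    tokens.push({ kind: "text", data: input.slice(last) });
  }
  return tokens;
}
```

For example, `tokenize(decode(new Uint8Array([60, 112, 62, 72, 105, 60, 47, 112, 62])))` decodes the bytes to `<p>Hi</p>` and yields a start tag, a text token, and an end tag.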
An example
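// Tokenize the input, then convert each token to a node.
// `tokenize` and `toNode` are assumed to be defined elsewhere.
function parse(input: string): Node[] {
    return tokenize(input).map(toNode);
}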
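The `parse` function above maps tokens to nodes one-to-one, so it does not yet show how parents adopt children or how implicit elements appear. A rough sketch of that step, assuming a stack of open elements, might look like the following. The `Token` and `ElementNode` shapes and the `buildTree` name are hypothetical, and the only implicit element handled is a missing `html`; real parsers also insert `head`, `body`, `tbody`, and others.

```typescript
// Hypothetical shapes for illustration; these are not the real DOM interfaces.
type Token =
  | { kind: "start"; name: string }
  | { kind: "end"; name: string }
  | { kind: "text"; data: string };

interface ElementNode {
  name: string;
  children: (ElementNode | string)[];
}

function buildTree(tokens: Token[]): ElementNode {
  const root: ElementNode = { name: "#document", children: [] };
  const open: ElementNode[] = [root]; // stack of currently open elements

  // The current parent adopts the new element, which becomes the new parent.
  const openElement = (name: string): void => {
    const el: ElementNode = { name, children: [] };
    open[open.length - 1].children.push(el);
    open.push(el);
  };

  for (const token of tokens) {
    // Implicit element: if content arrives before any <html>, insert one.
    const isHtmlStart = token.kind === "start" && token.name === "html";
    if (open.length === 1 && token.kind !== "end" && !isHtmlStart) {
      openElement("html");
    }
    if (token.kind === "start") {
      openElement(token.name);
    } else if (token.kind === "text") {
      open[open.length - 1].children.push(token.data);
    } else {
      // End tag: pop back to the matching open element, if there is one.
      const i = open.map((el) => el.name).lastIndexOf(token.name);
      if (i > 0) open.length = i;
    }
  }
  return root;
}
```

Feeding it the tokens for `<p>Hi</p>` produces a `#document` root whose only child is an implicit `html` element, which in turn contains the `p` element holding the text `Hi`.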
"Browsers are the most complex software you'll touch." — old internet wisdom
Comparison table
| Phase | Output |
|---|---|
| Decode | Code points |
| Tokenize | Tokens |
| Construct | DOM tree |
Read more in the follow-up post or look at the diagram. Any plain
paragraph is fine — Readability just needs ~200 chars of running text
to stay confident this is the article body. So here we are with extra
padding to clear that bar without straining anyone's eyeballs.