A Crash Course in Markdig
In this article I'm going to teach you how to use Markdig, which is a C# Markdown processing library. I'll cover basic usage, using built-in extensions, parsing YAML front matter, and even writing a custom extension that automatically generates an HTML table of contents from Markdown headings. The table of contents on this page was generated using the extension we'll create.
Getting started
First, install the Markdig package from NuGet. Converting Markdown to HTML is as simple as this:
var markdown =
"""
# This is a heading

This is a paragraph and [this is a link](http://www.google.com)

This is a second paragraph with an inline URI http://johnh.co
""";
var html = Markdown.ToHtml(markdown);
html will contain:
<h1>This is a heading</h1>
<p>This is a paragraph and <a href="http://www.google.com">this is a link</a></p>
<p>This is a second paragraph with an inline URI http://johnh.co</p>
This is a good start! We have our HTML, but you may already notice a few things:
- The inline URI isn't converted to a link at all
- Links won't open in a new tab
- Links aren't rewritten to ensure they're https
Getting the markup we want is the focus of the remainder of this article.
A whirlwind tour of Markdig internals
To understand how best to configure, or extend, Markdig to produce the HTML we want, it's best to have a rough grasp of how Markdig works.
To create HTML from Markdown, Markdig first needs to parse a Markdown document to understand its meaning. The output of this parsing process is an Abstract Syntax Tree (AST). The AST is a combination of Block and Inline elements that represent the fundamental building blocks of a Markdown document. Let's say we have the following Markdown document:
# How to Make Money
Call `moneytree.grow()` to increase the money on the tree.
Parsing this document would give us an AST structured roughly like this:
MarkdownDocument
├─ HeadingBlock
│ └─ LiteralInline ("How to Make Money")
└─ ParagraphBlock
├─ LiteralInline ("Call ")
├─ CodeInline ("moneytree.grow()")
└─ LiteralInline (" to increase the money on the tree.")
We've gone from a string that has no contextual meaning to an AST that describes the structure of our document. For example, as part of this process the parser has taken the string # How to Make Money, and turned it into a HeadingBlock that contains a LiteralInline.
An AST provides the context an HTML renderer can use to know what tags it should insert into the output, and what their contents should be. If we generated HTML from our tree, we would expect a heading that has some text, followed by a paragraph that contains some text, some inline code, and more text. But notice there is nothing restricting us to only creating HTML from this – we could absolutely extend Markdig to turn this tree into XML or a PDF if we wanted to.
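To see the tree for the document above, here is a minimal sketch that parses it and prints the type name of every node. It assumes the Markdig package is installed and uses the default pipeline. The exact nodes printed may vary a little between Markdig versions, so treat the output as a rough guide rather than a specification.

```csharp
using System;
using Markdig;
using Markdig.Syntax;

var markdown = "# How to Make Money\n\nCall `moneytree.grow()` to increase the money on the tree.";

// parse only; no HTML is generated at this point
var document = Markdown.Parse(markdown);

// Descendants() walks every block and inline beneath the document node
foreach (var node in document.Descendants())
{
    Console.WriteLine(node.GetType().Name);
}
```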
To produce the AST, Markdig has a MarkdownPipeline that contains Block parsers and Inline parsers. It's these parsers that determine the final output of the AST. If we want to customise our HTML, we have a few options:
- Use a built-in extension, if one exists, that covers our needs
- Create a custom parser to modify how the AST is constructed
- Modify the AST after it's been created
- Configure an HtmlRenderer instance to control how HTML is generated
Modifying the AST after it's been created naturally requires a second pass. If you want to write the least amount of code, this is likely your best bet if no existing extension serves your needs. But if you need to modify the AST while it's being constructed, or require better performance, then implementing a parser is the way to go.
I'll cover parsers below, where we'll build a custom extension to create a table of contents.
(Aside: If you'd like to know more about ASTs and how they relate to creating a compiler, I'd encourage you to read the excellent Crafting Interpreters book by Robert Nystrom. It's a high-quality book that walks you through building a programming language called Lox, and you can read it online for free.)
Using built-in extensions
Markdig comes with many built-in extensions. One of these is PipeTableExtension, which we can use to convert tables defined in Markdown into HTML tables. To tell Markdig we want to use this extension, we need to modify its default pipeline. We can do that like so:
var pipeline = new MarkdownPipelineBuilder()
.UsePipeTables()
.Build();
var markdown =
"""
| Price | Units Sold |
|-------|------------|
| £4 | 670 |
""";
// make sure to pass pipeline as a parameter here,
// otherwise the PipeTableExtension won't be used
var html = Markdown.ToHtml(markdown, pipeline);
UsePipeTables() is a predefined extension method that adds the PipeTableExtension into our pipeline. The resulting HTML is:
<table>
<thead>
<tr>
<th>Price</th>
<th>Units Sold</th>
</tr>
</thead>
<tbody>
<tr>
<td>£4</td>
<td>670</td>
</tr>
</tbody>
</table>
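If you'd rather not pick extensions one by one, Markdig also provides UseAdvancedExtensions(), which enables a broad set of the built-in extensions (pipe tables among them) in a single call. This is a minimal sketch. Bear in mind that it switches on more than you may need, such as automatic heading ids, so the output can differ from the default pipeline in other ways too.

```csharp
using System;
using Markdig;

var pipeline = new MarkdownPipelineBuilder()
    // enables most built-in extensions, including pipe tables
    .UseAdvancedExtensions()
    .Build();

var markdown = "| Price | Units Sold |\n|-------|------------|\n| £4 | 670 |";

var html = Markdown.ToHtml(markdown, pipeline);
Console.WriteLine(html);
```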
Modifying an AST after it's been created
We can now tackle two of the problems from the first example at the top of this page, namely converting inline URIs to links and making links open in a new tab:
var pipeline = new MarkdownPipelineBuilder()
.UseAutoLinks()
.Build();
var markdown =
"""
[This](http://one.co) is a regular link.
This http://two.co is an inline URI.
This <http://three.co> is an autolink.
""";
var document = Markdown.Parse(markdown, pipeline);
foreach (var link in document.Descendants<LinkInline>())
{
if (!link.IsImage)
{
link.GetAttributes().AddPropertyIfNotExist("target", "_blank");
}
}
foreach (var link in document.Descendants<AutolinkInline>())
{
link.GetAttributes().AddPropertyIfNotExist("target", "_blank");
}
var html = document.ToHtml();
This outputs:
<p>
<a href="http://one.co" target="_blank">This</a> is a regular link.
This <a href="http://two.co" target="_blank">http://two.co</a> is an inline URI.
This <a href="http://three.co" target="_blank">http://three.co</a> is an autolink.
</p>
The process is as follows:
- Configure our pipeline to use the AutoLinkExtension, which will convert inline URIs to links
- Parse a Markdown document to produce an AST
- Find all child nodes in the AST that represent links, excluding those that are images
- Add a target="_blank" attribute to any if they don't already have one
- Generate our HTML
It's worth noting that we've done a few things differently here. This is the first time we've used Markdown.Parse(). We're using it so we can get access to the AST before it's used to generate HTML. Make sure to pass pipeline as a parameter to it. If you don't, the default pipeline will be used, meaning AutoLinkExtension won't take part in the parsing process. I made this mistake a few times when figuring out how the library worked.
We're also walking the AST by way of the Descendants<>() method. Because it returns an IEnumerable<T>, we can use it with all of our favourite LINQ methods. We could also have written our functionality like so:
foreach (var node in document.Descendants())
{
if (node is LinkInline link && !link.IsImage)
{
link.GetAttributes().AddPropertyIfNotExist("target", "_blank");
}
else if (node is AutolinkInline autolink)
{
autolink.GetAttributes().AddPropertyIfNotExist("target", "_blank");
}
}
This produces the same HTML as before, but it requires only one pass through the AST. Using Descendants<>() is merely a nice convenience when working with a single node type.
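Because Descendants<>() plays well with LINQ, queries over the AST stay compact. As an illustration, this sketch collects the target of every non-image link in a document. The sample Markdown and the variable names are my own, not something Markdig prescribes.

```csharp
using System;
using System.Linq;
using Markdig;
using Markdig.Syntax;
using Markdig.Syntax.Inlines;

var markdown = "[This](http://one.co) is a link and ![this](http://one.co/logo.png) is an image.";
var document = Markdown.Parse(markdown);

// images are also LinkInline nodes, so filter them out
var urls = document
    .Descendants<LinkInline>()
    .Where(link => !link.IsImage)
    .Select(link => link.Url)
    .ToList();

foreach (var url in urls)
{
    Console.WriteLine(url); // prints "http://one.co"
}
```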
Configuring an HtmlRenderer instance to control how HTML is generated
It's time to rewrite our links to use https.
If you've been following along, you may have noticed AutoLinkExtension which, judging by the AutoLinkOptions type, would seem designed to handle opening links in new tabs and rewriting URIs to enforce https. Unfortunately, I found this extension to behave inconsistently in a number of cases, and I don't recommend using it for anything other than recognising plain inline URIs.
Instead, we can make use of HtmlRenderer.LinkRewriter:
var pipeline = new MarkdownPipelineBuilder()
// again, we're only using this to turn the inline URI into a link
.UseAutoLinks()
.Build();
using var writer = new StringWriter();
var renderer = new HtmlRenderer(writer)
{
LinkRewriter = link => link.Replace("http://", "https://")
};
var markdown =
"""
[this](http://one.co) is a regular link.
This http://two.co is an inline URI.
<http://three.co> is an autolink.
""";
var document = Markdown.Parse(markdown, pipeline);
// notice we're using our renderer to produce HTML now
var html = renderer.Render(document);
This gives us:
<p>
<a href="https://one.co">this</a> is a regular link.
This <a href="https://two.co">http://two.co</a> is an inline URI.
<a href="https://three.co">http://three.co</a> is an autolink.
</p>
It's rewritten the links, but notice how the link text for the inline URI and the autolink hasn't been updated. I'm surprised HtmlRenderer.LinkRewriter doesn't also apply the replacement to the link text for http://two.co and http://three.co, considering that in both cases we don't supply any link text and the URI is used instead. If this behaviour isn't what you want, you'll need to either replace things in the AST yourself, or hook into HtmlRenderer.ObjectWriteBefore to rewrite links on the fly.
That said, the easiest way to rewrite all of our links is plain old string.Replace():
html = html.Replace("http://", "https://");
But keep in mind that this rewrites every occurrence of http://, including text in code blocks, so it's a sledgehammer approach that's only appropriate in the simplest of cases.
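A similar caveat applies, on a smaller scale, to the LinkRewriter shown earlier: Replace() touches every http:// in a URI, including one buried in a query string. Here is a sketch of a stricter rewriter that only changes a leading scheme. ForceHttps is a name I've made up for illustration.

```csharp
using System;

Console.WriteLine(ForceHttps("http://one.co"));
// prints "https://one.co"

Console.WriteLine(ForceHttps("https://two.co/?next=http://three.co"));
// prints the input unchanged, because only a leading http:// is rewritten

// hypothetical helper: swaps the scheme only when the link starts with http://
static string ForceHttps(string link) =>
    link.StartsWith("http://", StringComparison.OrdinalIgnoreCase)
        ? "https://" + link.Substring("http://".Length)
        : link;
```

Since LinkRewriter is a Func<string, string>, the helper can be assigned to it directly with LinkRewriter = ForceHttps.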
Extracting YAML front matter
Front matter is a way of specifying metadata in Markdown documents. Here's an example:
---
title: I like jokes
date-published: 2025-02-22
tags:
- funny
- jokes
---
How does the ocean say hello?

It waves.
Here, we've defined several pieces of metadata – title, date-published, and a collection of tags – in a front matter block which is typically enclosed in triple dashes. Following this block, we have some content as normal. If we attempt to create HTML from this document, we'll see the front matter gets turned into HTML, which isn't what we want. We need to be able to extract it somehow.
We can achieve this with a combination of YamlFrontMatterExtension and the YamlDotNet library, which we'll need to install from NuGet. YamlFrontMatterExtension will parse front matter, adding it as a YamlFrontMatterBlock to the AST. HtmlRenderer will then ignore this block so our metadata doesn't incorrectly end up in our HTML. So how do we parse the contents of that block? This is where YamlDotNet comes in.
First, we'll need a class to deserialise our front matter to:
public class Article
{
[YamlMember(Alias = "title")]
public string Title { get; set; }
[YamlMember(Alias = "date-published")]
public DateTime Published { get; set; }
[YamlMember(Alias = "tags")]
public List<string> Tags { get; set; }
}
Next, we set up our pipeline, create a YamlDotNet deserialiser, and parse our document:
var pipeline = new MarkdownPipelineBuilder()
.UseYamlFrontMatter()
.Build();
// this is from the YamlDotNet.Serialization namespace
var deserialiser = new DeserializerBuilder().Build();
var document = Markdown.Parse(markdown, pipeline);
And now we get to the fun stuff:
var yaml = document.Descendants<YamlFrontMatterBlock>().FirstOrDefault();
if (yaml == null)
{
throw new InvalidOperationException("YAML block is missing");
}
var frontMatter = yaml.Lines.ToString();
var article = deserialiser.Deserialize<Article>(frontMatter);
var html = document.ToHtml(pipeline);
We start out by trying to get a YamlFrontMatterBlock from the AST and throw if one wasn't found. If the program continues, we can assume a block was found, so we can use its Lines property to get its string content. We then take that data and deserialise it to an Article instance. At this stage, we can access our metadata via the properties on article. And finally, we generate the HTML, which gives us:
<p>How does the ocean say hello?</p>
<p>It waves.</p>
This is pretty cool and it didn't take much code to accomplish.
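One thing to watch for: as far as I can tell, YamlDotNet's default behaviour is to throw when the YAML contains a key with no matching property on the target type. If your front matter may carry extra keys, a deserialiser built with IgnoreUnmatchedProperties() should skip them instead. This is a minimal sketch in which the Post class and the author key are invented for illustration.

```csharp
using System;
using YamlDotNet.Serialization;

// "author" has no matching property on Post
var yaml = "title: I like jokes\nauthor: someone\n";

var deserialiser = new DeserializerBuilder()
    // skip keys that don't map to a property rather than throwing
    .IgnoreUnmatchedProperties()
    .Build();

var post = deserialiser.Deserialize<Post>(yaml);
Console.WriteLine(post.Title); // prints "I like jokes"

// hypothetical type, cut down from the Article class above
public class Post
{
    [YamlMember(Alias = "title")]
    public string Title { get; set; } = "";
}
```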
Creating an extension to generate a table of contents
In this section, we'll create a custom extension that touches on a variety of different extension points in Markdig. When rendering HTML from a Markdown document, if we've added :::toc to the document, our extension will recognise it and automatically generate a table of contents from the headings in that document. The extension also supports multiple levels by adding CSS classes to the output that reflect the level of each heading.
The table of contents on this page was produced by this extension.
The approach
There are several different things we need to implement, so let's go over the approach before I sling lots of code at you.
We're going to define a TableOfContentsBlock to represent a table of contents, which we'll insert into an AST when the :::toc token is found in a document. In order to detect that token, we'll implement a parser, TableOfContentsBlockParser, which will be called when the opening character : is read by Markdig. If our parser finds :::toc, it will add our new block to the AST. At this stage, we can't do anything else because Markdig hasn't finished parsing the rest of the document. That means we don't know what headings are available yet – we need to do some post-processing.
In order to do that, we're going to make use of the MarkdownPipelineBuilder.DocumentProcessed event, which triggers once Markdig has finished parsing a document. Once that event fires, we'll handle it and add the contents of the headings to our TableOfContentsBlock. This completes the AST side of things, and so we turn our attention to the HTML generation.
To accomplish this part, we need to implement HtmlObjectRenderer<T>, which in our case will be HtmlObjectRenderer<TableOfContentsBlock>. This will write out all of the HTML for our table of contents. We'll also use this renderer to generate the correct links for our headings so that if a user clicks them, they'll be taken straight to the content.
Finally, we need to make Markdig aware of our new types. We'll do this in TableOfContentsExtension, which will implement the IMarkdownExtension interface, and this will allow us to add it to the default pipeline just as we've seen with other extensions.
The block
Here we go! Let's start by defining TableOfContentsBlock:
public class TableOfContentsBlock : ContainerBlock
{
public TableOfContentsBlock(BlockParser? parser) : base(parser)
{
}
}
We're inheriting from ContainerBlock so we can add the collection of headings to it. This is important because it's what will form the structure of our table of contents. We need this so the HTML renderer will know when to add a new item to the table of contents and what content belongs to that item.
The parser
Now for the parser:
public class TableOfContentsBlockParser : BlockParser
{
private const string _token = ":::toc";
public TableOfContentsBlockParser()
{
// important: if you don't set this, the parser won't be called!
OpeningCharacters = [':'];
}
public override BlockState TryOpen(BlockProcessor processor)
{
// stop processing if we're in a code block
if (processor.IsCodeIndent)
{
return BlockState.None;
}
var line = processor.Line;
if (!line.MatchLowercase(_token))
{
return BlockState.None;
}
// ensure the rest of the line contains only whitespace
// (start is the first character after the token, so it must be checked too)
var start = line.Start + _token.Length;
var end = line.End;
while (end >= start)
{
char c = line[end];
if (!c.IsWhitespace())
{
return BlockState.None;
}
end--;
}
var block = new TableOfContentsBlock(this);
processor.NewBlocks.Push(block);
// advance the parsing algorithm's position past our token
// if we don't do this, our parser will be called again and again...
processor.Line.Start += _token.Length;
return BlockState.BreakDiscard;
}
}
We start by ensuring our parser inherits from BlockParser. All block parsers must inherit from BlockParser, otherwise they can't be added to the block parsing pipeline. We set OpeningCharacters = [':'] in the constructor which tells Markdig we want our parser to be called when the parsing algorithm encounters a : character. Naturally, this will only happen if the parser is added to the pipeline.
The TryOpen method is where the actual parsing happens. The logic in this method has been designed to ensure the :::toc token appears on a line all by itself. If it's surrounded by whitespace, that's fine. But if there are any other characters, then the parser will stop. This is to enforce the idea that the token represents a block.
With that in mind, we first check to see if we're in a code block. If we are, we return BlockState.None. BlockState.None signals that our parser hasn't found a match for its content, so it can't handle the current character. Next, we check to see if the current line being processed starts with :::toc and stop processing if it doesn't.
There are a few things worth mentioning here. The Markdig parsing algorithm first parses all blocks in a document, followed by all inlines within those blocks. The algorithm checks for blocks line by line, which means that when we're given a line to process in a BlockParser, line.Start will point at the first non-trivial character (trivia are characters like whitespace). Although the documentation doesn't mention it, MatchLowercase() and its companion method Match() start their search from line.Start. This wasn't obvious to me, and I initially avoided them because it seemed they might find a match anywhere in the current line.
We then ensure the rest of the line only contains whitespace. If this check succeeds, we now know we're dealing with our :::toc token and it's time to add the TableOfContentsBlock to the AST. We do that by creating a new block, passing the current parser to it, and adding it to the processor's list of new blocks. We make sure to advance the position of the parsing algorithm by the length of our token so it can continue on its way. If we don't do this, our parser will be called again repeatedly on the very first : character causing an infinite loop.
Finally, we return BlockState.BreakDiscard to signal to the algorithm that we're ending a block and to discard the current line.
The extension
We've reached the point where we've inserted our block into the AST, but we haven't added the heading information to it yet. We'll handle that in our extension. The extension is also going to hook up the renderer, which I haven't shown yet. We'll get to the renderer's code in the next section, but here's the code for the extension:
public class TableOfContentsExtension : IMarkdownExtension
{
public void Setup(MarkdownPipelineBuilder pipeline)
{
pipeline
.BlockParsers
.AddIfNotAlready(new TableOfContentsBlockParser());
pipeline.DocumentProcessed += document =>
{
var toc = document
.Descendants<TableOfContentsBlock>()
.FirstOrDefault();
if (toc == null)
{
return;
}
// this parser won't be used, as we're adding elements
// after document parsing has finished, but heading blocks
// expect a parser to be given to them
var parser = new HeadingBlockParser();
var headings = document
.Descendants<HeadingBlock>()
.Where(block => block.Parent != toc);
foreach (var heading in headings)
{
toc.Add(new TableOfContentsHeadingBlock(heading, parser));
}
};
}
public void Setup(MarkdownPipeline pipeline, IMarkdownRenderer renderer)
{
if (renderer is HtmlRenderer htmlRenderer)
{
htmlRenderer
.ObjectRenderers
.AddIfNotAlready<TableOfContentsBlockRenderer>();
}
}
}
We implement the IMarkdownExtension interface so we can register our extension with Markdig later on. In the first Setup method, we start by adding our block parser to the pipeline's list of block parsers. Next, we add an event handler for the DocumentProcessed event, which fires once the parsing of a document is complete. In our handler, we check to see if the document contains a TableOfContentsBlock, and stop if it doesn't. If it's found, we query for the HeadingBlocks that don't have toc as a parent, and add them to the list of blocks in toc using a new type, TableOfContentsHeadingBlock, which I'll introduce in a moment.
Why are we checking for a parent? To answer that, let's look at the following document:
# Introduction
Some introductory text.
:::toc
# Another heading
Notice the :::toc token comes after the first heading. Descendants<HeadingBlock>() returns the HeadingBlocks lazily, in document order. The first one, # Introduction, is returned and a copy of it is added to toc. As the enumeration continues through the document, it eventually reaches toc and returns the heading we just added to it! That's definitely asking for trouble. To prevent this causing problems, we filter out blocks whose parent is toc. Alternatively, we could've done this instead:
var headings = document
.Descendants<HeadingBlock>()
.ToList();
This would give us all of the headings upfront.
Let's now look at TableOfContentsHeadingBlock:
public class TableOfContentsHeadingBlock : HeadingBlock
{
public TableOfContentsHeadingBlock(HeadingBlock headingBlock, BlockParser parser)
: this(parser)
{
var containerInline = new ContainerInline();
if (headingBlock.Inline != null)
{
foreach (var inline in headingBlock.Inline)
{
if (inline is LiteralInline literal)
{
containerInline.AppendChild(new LiteralInline(literal.Content));
}
else if (inline is CodeInline code)
{
containerInline.AppendChild(new CodeInline(code.Content));
}
}
}
Inline = containerInline;
Level = headingBlock.Level;
}
public TableOfContentsHeadingBlock(BlockParser parser) : base(parser)
{
}
}
We're inheriting from HeadingBlock so we have access to the Inline and Level properties, the latter of which determines the level of the heading (e.g. h1, h2, etc.). We take all of the literal and code inlines, add them to a container inline, and finally store that collection in our TableOfContentsHeadingBlock. The reason for doing all of this is Markdig won't allow us to add elements that already have a parent into an AST, so we need to create new instances. This makes sense because blocks or inlines being shared in the AST could have all sorts of unintended side-effects when they are modified.
Finally, in the second Setup method, we register our new renderer, which we'll cover next.
The HTML renderer
To recap, at this point we've added a custom block to an AST, and added a bunch of content to that block. We now need a way to render that content to HTML, which is where our renderer comes in:
public class TableOfContentsBlockRenderer : HtmlObjectRenderer<TableOfContentsBlock>
{
protected override void Write(HtmlRenderer renderer, TableOfContentsBlock obj)
{
// if there is nothing in the block, stop
if (obj.Count == 0)
{
return;
}
renderer.EnsureLine();
renderer.Write("<nav class=\"table-of-contents\">");
renderer.WriteLine();
renderer.Write("<h2>Table of Contents</h2>");
renderer.WriteLine();
renderer.Write("<ul>");
renderer.WriteLine();
var headings = obj.Descendants<TableOfContentsHeadingBlock>();
foreach (var heading in headings)
{
var id = heading
.GetAttributes()
.Id ?? LinkHelper.Urilize(GetHeadingText(heading.Inline!), true);
renderer.Write($"<li class=\"level-{heading.Level}\"><a href=\"#{id}\">");
foreach (var inline in heading.Inline!)
{
if (inline is LiteralInline literal)
{
renderer.WriteEscape(literal.Content);
}
else if (inline is CodeInline code)
{
renderer
.Write("<code>")
.WriteEscape(code.Content)
.Write("</code>");
}
}
renderer.Write("</a></li>");
renderer.WriteLine();
}
renderer.Write("</ul>");
renderer.WriteLine();
renderer.Write("</nav>");
renderer.EnsureLine();
}
private static string GetHeadingText(ContainerInline container)
{
var sb = new StringBuilder();
foreach (var inline in container)
{
if (inline is LiteralInline literal)
{
sb.Append(literal.Content);
}
else if (inline is CodeInline code)
{
sb.Append(code.Content);
}
}
return sb.ToString();
}
}
Most of this is writing out HTML, including adding a CSS class based on the heading level. This is what allows us to target individual levels for styling.
The other thing of interest is we're ensuring links to headings have the correct fragment associated with them. We need the fragment portion of links to match the id of the heading we want to link to. For example, if we have <h2 id="my-heading">My Heading</h2>, our link needs to have a fragment like so: <a href="example.html#my-heading">example</a>.
The AutoIdentifierExtension, which we'll add to our pipeline shortly, takes care of ensuring the headings themselves have ids. AutoIdentifierExtension passes all of a heading's inline content to Markdig's LinkHelper.Urilize helper to do this, which is exactly what we're doing in our renderer to build the link fragments. This is what makes the two pieces match, allowing users to click a link to take them to the respective heading.
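To make the matching concrete, here is a small sketch that calls LinkHelper.Urilize directly with the same arguments our renderer uses. The heading text is just a sample.

```csharp
using System;
using Markdig.Helpers;

// the second argument restricts the generated id to ASCII characters
var id = LinkHelper.Urilize("Part one", true);

Console.WriteLine(id); // prints "part-one"
```

A heading rendered with id="part-one" can then be reached from a link whose href is #part-one.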
Wiring it all up
We're almost done! To hook things together, we need to construct and use a pipeline that has our new functionality in it:
var pipeline = new MarkdownPipelineBuilder()
.UseAutoIdentifiers()
.Use<TableOfContentsExtension>()
.Build();
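If you'd like the extension to read like the built-in ones, you could wrap the Use<>() call in your own extension method. This is only a sketch: UseTableOfContents and its containing class are names I've invented, and the empty TableOfContentsExtension below is a stand-in so the snippet compiles on its own. In practice you'd use the real class from earlier.

```csharp
using Markdig;
using Markdig.Renderers;

var pipeline = new MarkdownPipelineBuilder()
    .UseAutoIdentifiers()
    .UseTableOfContents()
    .Build();

public static class TableOfContentsPipelineExtensions
{
    // hypothetical helper, named to match Markdig's built-in Use...() methods
    public static MarkdownPipelineBuilder UseTableOfContents(this MarkdownPipelineBuilder pipeline)
    {
        return pipeline.Use<TableOfContentsExtension>();
    }
}

// stand-in only: replace with the TableOfContentsExtension defined earlier
public class TableOfContentsExtension : IMarkdownExtension
{
    public void Setup(MarkdownPipelineBuilder pipeline) { }

    public void Setup(MarkdownPipeline pipeline, IMarkdownRenderer renderer) { }
}
```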
Now, let's define a document to test with:
# Introduction
Welcome to the longest blog post ever. It took me ages. I need to sleep now.
:::toc
## Part one
Something about part one.
## Part two
Something about part two.
Finally, as we've done previously, let's generate the HTML:
var html = Markdown.ToHtml(markdown, pipeline);
This gives us:
<h1 id="introduction">Introduction</h1>
<p>Welcome to the longest blog post ever. It took me ages. I need to sleep now.</p>
<nav class="table-of-contents">
<h2>Table of Contents</h2>
<ul>
<li class="level-1"><a href="#introduction">Introduction</a></li>
<li class="level-2"><a href="#part-one">Part one</a></li>
<li class="level-2"><a href="#part-two">Part two</a></li>
</ul>
</nav>
<h2 id="part-one">Part one</h2>
<p>Something about part one.</p>
<h2 id="part-two">Part two</h2>
<p>Something about part two.</p>
And that's our table of contents! Users can click the links to jump straight to the headings, and we can style each level individually if we'd like.
Closing thoughts
Markdig is a great library to use. It's clear the extension points have been well designed and it feels pretty fast too. That said, some of the built-in extensions are a bit hit-and-miss. I also can't help but wonder what the point of AutolinkInline is. It seems to me LinkInline should handle autolinks so consumers don't have to learn the hard way that they need to work with both. But that's a small price to pay for something so extensible.
Adding functionality is pretty straightforward once you familiarise yourself with the extension points, but you can expect to write a fair amount of code if you need something more involved.