From WordPress to Markdown


11 Jul 2020

I’ve recently transitioned from writing on WordPress to hosting my own blog on GitHub Pages, which uses Jekyll for static site generation, which in turn uses Liquid for templating.

There were about a hundred posts on my old blog and I wanted to bring them all into my new site to keep everything in one place (though I’m keeping the WordPress blog for historical purposes).

In this meta post I’ll describe some of the steps taken to convert the WordPress posts into markdown and the changes I needed to make to support features from WordPress.

I used Python for the processing.

XML file

WordPress offers a way to download all the posts and their metadata as XML, which is very convenient. Python, in turn, has a library to parse XML files:

import xml.etree.ElementTree as ET

tree = ET.parse(filename)
root = tree.getroot()

root can be used to search for elements within the XML tree. If you’ve used JavaScript DOM APIs like document.getElementById(), the idea is very similar:

root.findall('./channel/item')

One difficulty I had was matching tags containing namespaces such as:

<wp:post_type>post</wp:post_type>

where wp is the namespace prefix. Running:

node.find("wp:post_type")

won’t do; you need to provide an additional parameter, a mapping from namespace prefixes to URIs. They’re listed in the XML itself (as xmlns attributes), in my case:

<rss version="2.0"
  xmlns:excerpt="http://wordpress.org/export/1.2/excerpt/"
  xmlns:wp="http://wordpress.org/export/1.2/"
>

They can be retrieved programmatically with:

namespaces = dict(
    [node for _, node in ET.iterparse(filename, events=['start-ns'])]
)

Finally we can search by wp:post_type:

node.find("wp:post_type", namespaces)

Translation

To facilitate the translation between WordPress’s HTML-ish markup (it also includes non-HTML shortcodes such as [sourcecode], [caption], etc.) and markdown, I wrote a parser to generate an intermediate AST and then a markdown generator to convert the AST to markdown.

The hardest case was converting images with captions, since there were several variants, some of which required restructuring the AST. For example:

[caption]
  <a>
    <img />
  </a>
  Some caption
[/caption]

The AST was:

{
  type: 'caption',
  children: [
    {type: 'a', children: [{type: 'img', props: ...}]},
    {type: 'text', text: 'some caption'}
  ]
}

which became:

<figure>
<a><img /></a>
<figcaption>Some caption</figcaption>
</figure>

There was also the case of a linked image without a caption:

<a>
  <img />
</a>

which became:

<figure>
  <a>
     <img />
  </a>
</figure>

And finally, plain images:

<img />

to

<figure>
  <img />
</figure>

I opted to wrap them all in <figure> for consistency and to have fewer output formats.
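
Just for illustration, the normalization could look something like the sketch below; render_children() here is a minimal serializer written for the sketch, while the real generator handles props, nesting and markdown output.

def render_children(nodes):
    # Minimal serializer for this sketch only.
    out = []
    for n in nodes:
        if n['type'] == 'text':
            out.append(n['text'])
        elif n['type'] == 'img':
            out.append('<img />')
        else:
            children = render_children(n.get('children', []))
            out.append(f"<{n['type']}>{children}</{n['type']}>")
    return ''.join(out)

def to_figure(node):
    # Wrap caption / linked / plain image nodes in a single <figure> output.
    if node['type'] == 'caption':
        # Separate the image (possibly wrapped in <a>) from the caption text.
        images = [c for c in node['children'] if c['type'] != 'text']
        caption = ' '.join(
            c['text'] for c in node['children'] if c['type'] == 'text'
        )
        return (
            '<figure>\n'
            f'{render_children(images)}\n'
            f'<figcaption>{caption}</figcaption>\n'
            '</figure>'
        )
    # <a><img /></a> and bare <img /> get the same wrapper, minus the caption.
    return f'<figure>\n{render_children([node])}\n</figure>'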

LaTeX

MathJax works well with Jekyll-generated pages. The only change needed is to include its script in the page template:

<script type="text/javascript" async src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/MathJax.js?config=TeX-MML-AM_CHTML"></script>

This post was very helpful.
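
With the script in place, equations can be written directly in the markdown source using kramdown’s $$ delimiters (kramdown is Jekyll’s default markdown engine), for example:

$$ \sum_{i = 1}^{n} i = \frac{n(n+1)}{2} $$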

Gists

I started using Gists on WordPress because the [sourcecode][/sourcecode] shortcode from WordPress was subpar. Gists have their own set of issues though, like the extra frame and occasional failures to load on mobile. Jekyll has nice support for code highlighting via Liquid’s highlight tag, so I just needed to download the contents.

GitHub has a REST API to download Gists, but the rate limit is very low (1 gist/min) when logged out, while logged in it’s ~100x higher. The official GitHub documentation was not very clear on how to programmatically authenticate as your own user, but this post was very helpful.

We can then use the requests Python library to get the data:

import os
import requests

# Careful not to include your token in code! Read it from the environment.
token = os.environ['GITHUB_TOKEN']

resp = requests.get(
    f"https://api.github.com/gists/{gist_id}",
    headers={
        'content-type':'application/json',
        'accept':'application/vnd.github.v3+json',
        'Authorization': f"token {token}",
    }
)

if resp.status_code != 200:
    message = resp.json().get('message')
    raise Exception(f'Failed to fetch gist {gist_id}. Reason: {message}')

payload = resp.json()
# A gist may contain multiple files; take the first one.
file = list(payload.get('files').values())[0]

code = file['content']
lang = file['language']
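
With the code and language in hand, the Gist embed can be replaced by a Liquid highlight block. A rough sketch (the Gist API returns capitalized language names like "Python", so they are lowercased here to match the highlighter’s lexer names; some languages may need extra mapping):

block = '\n'.join([
    f'{{% highlight {lang.lower()} %}}',
    code,
    '{% endhighlight %}',
])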

HTTPS

I wanted to change my blog to use HTTPS, since Chrome now shows a “not secure” warning on HTTP-only sites.

GitHub supports HTTPS on its Pages, but if you have a custom domain it doesn’t work out of the box.

I have a custom domain, kuniga.me, which I registered with DreamHost. I don’t fully grasp the intricacies of CNAME and A records, but this post made it work.

Comments

I opted to use Facebook’s comments plugin to allow social interaction with my posts. The reason is that most people have a Facebook account and it handles spam and bots. I might have been partial in that choice ;)

The integration is very easy. It associates comments with a given URL, so I added the following snippet at the end of the post template:

<div class="fb-comments" data-href="http://kuniga.me//blog/2020/07/11/from-wordpress-to-jekyll.html" />

The only downside is losing all the comments from the previous posts.

Analytics

I opted to use Google Analytics. I just needed to embed a script at the top of every page template, and it automatically gathers data, which is helpful for determining the popularity of posts.
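
For reference, the embed is the standard gtag.js snippet, along these lines (GA_MEASUREMENT_ID is a placeholder; the exact snippet comes from the Google Analytics admin page):

<script async src="https://www.googletagmanager.com/gtag/js?id=GA_MEASUREMENT_ID"></script>
<script>
  window.dataLayer = window.dataLayer || [];
  function gtag(){dataLayer.push(arguments);}
  gtag('js', new Date());
  gtag('config', 'GA_MEASUREMENT_ID');
</script>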