Generating Atom and RSS with Pandoc
On feeds
As part of revamping and reorganizing my site recently, I decided to add support for Atom and RSS feeds. I know I am a little behind the times here. Feeds on the Web had a big moment more than ten years ago, but since Google Reader was shut down, they have mostly lived a kind of shadow existence. The other big tech companies stopped supporting them, and many people aren’t even aware that they exist anymore. Browsers don’t make them obvious or have built-in tools to subscribe to them, though you can still get add-ons.
But feeds are still around, and still an important part of the independent Web, and so in the name of being the change you want to see in the world, I put one together for my site.
Pandoc
I use Pandoc to generate the HTML on this site from Markdown input. Pandoc has two features that can be repurposed to generate feeds: it has a citation processor which allows storing bibliographic information in a YAML metadata file, and it has a templating system where that metadata is exposed.
A feed is, basically, just a series of citations to webpages—that is, hyperlinks!—together with some other metadata which is also used in bibliographies, like a title and an author. So by writing a YAML file containing the “citation” data representing the feed entries and processing it with an Atom or RSS template, you can output valid Atom and RSS feeds with Pandoc.
Metadata file
I save the citation metadata in a file called feeds.yaml, which looks like this:
title: recursewithless.net
references:
- title: Generating Atom and RSS feeds with Pandoc
  issued: 2024-07-07
  URL: https://recursewithless.net/projects/pandoc-feeds.html
  abstract: How I set up feeds for recursewithless.net
- title: Chairs restoration project
  issued: 2024-07-04
  URL: https://recursewithless.net/projects/chairs-restoration.html
  abstract: I bought some 120 year old chairs and they needed some
    work
...
There is a title field for the whole feed, and then a list of references representing updates on the site, each of which has a title, a URL, an issued date, and possibly an abstract (which could contain as much content as you like, including a copy of a whole post or page). It’s simple metadata in a format which is easy to update and keep tidy by hand.
One could generate this file from the contents of other files (say, all the files in a blog/ subdirectory), but for now I want to keep things simple. Maintaining the file by hand makes it easy to represent multiple updates to the same page as different entries in the feed, and to tailor the abstract for feed readers.
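If you did want to generate the file, a minimal sketch in Python might look like the following. It assumes the entries have already been collected into dictionaries; how titles and dates are extracted from the pages is left open. The function name render_feed_yaml and the dictionary layout are hypothetical, not anything Pandoc provides. Strings are quoted with json.dumps, on the assumption that a JSON string is also an acceptable YAML double-quoted scalar for ordinary titles and abstracts.

```python
import json
from datetime import date


def render_feed_yaml(feed_title, entries):
    """Render feed metadata in the same layout as the feeds.yaml above."""

    def quote(text):
        # JSON string syntax doubles as a YAML double-quoted scalar, which
        # protects colons, quotes and other special characters in the text.
        return json.dumps(text, ensure_ascii=False)

    lines = [f"title: {quote(feed_title)}", "references:"]
    # Newest entries first, as in the hand-written file.
    for entry in sorted(entries, key=lambda e: e["issued"], reverse=True):
        lines.append(f"- title: {quote(entry['title'])}")
        lines.append(f"  issued: {entry['issued'].isoformat()}")
        lines.append(f"  URL: {quote(entry['URL'])}")
        if entry.get("abstract"):
            lines.append(f"  abstract: {quote(entry['abstract'])}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    example_entries = [
        {
            "title": "Chairs restoration project",
            "issued": date(2024, 7, 4),
            "URL": "https://recursewithless.net/projects/chairs-restoration.html",
        },
        {
            "title": "Generating Atom and RSS feeds with Pandoc",
            "issued": date(2024, 7, 7),
            "URL": "https://recursewithless.net/projects/pandoc-feeds.html",
            "abstract": "How I set up feeds for recursewithless.net",
        },
    ]
    print(render_feed_yaml("recursewithless.net", example_entries), end="")
```

The output could be redirected into feeds.yaml before the Pandoc step runs. The trade-off is the one described above: a generated file makes it harder to add several hand-tailored entries for the same page.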
Atom
Atom template
To generate an Atom feed from this metadata, we need a template. Here’s what mine looks like:
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <author>
    <name>rwl</name>
    <email>rwl@recursewithless.net</email>
  </author>
  <title>$title$</title>
  <id>https://recursewithless.net/</id>
  <link rel="self" href="https://recursewithless.net/atom.xml" />
  <updated>$updated$</updated>
$for(references)$
  <entry>
    <id>$references.URL$</id>
    <title>$references.title$</title>
    <updated>$references.issued$T00:00:00+02:00</updated>
    <link href="$references.URL$" />
    <summary>$references.abstract$</summary>
  </entry>
$endfor$
</feed>
There’s a header with some metadata (author, title, id, link) for the whole feed, followed by a list of <entry> items. These are generated by looping over the references variable in the metadata.
The only notable things here are the dates: the header contains an <updated> tag representing the last time the whole feed was updated (more on that in a moment). There is also an updated tag for each entry.
Unfortunately, the Atom and RSS standards require dates to be represented in different formats: Atom uses RFC 3339, while RSS uses the much older RFC 822. As a best effort to support them both I decided to keep the dates in YYYY-MM-DD format in the metadata. To get that into full RFC-3339 format which passes the W3C’s feed validator, I append a timestamp for midnight in my timezone after the issued field of each entry: T00:00:00+02:00.
Atom Makefile recipe
I then run Pandoc via a Makefile recipe to generate the actual Atom feed file, which I called atom.xml, and which is included in the build for my whole site:
atom.xml: feeds.yaml lib/templates/atom.xml
	pandoc -M updated="$$(date --iso-8601='seconds')" \
	  --metadata-file=feeds.yaml \
	  --template=lib/templates/atom.xml \
	  -t html \
	  -o atom.xml < /dev/null
Note that I set an additional metadata variable (-M) here called updated using the Unix date program. This gives the date and time when the build actually runs, which is filled into the updated tag in the feed header.
For some reason, despite what the standards say, date’s RFC 3339 output format (which looks like 2024-07-07 20:06:00+02:00) doesn’t pass the W3C’s Atom validator, but its ISO 8601 output format (which looks like 2024-07-07T20:06:00+02:00; note the “T”) does, so that’s what I’m using. I’m not sure whether the validator or my version of date is wrong; if you know, please explain to me what I should do here.
I tell Pandoc that it’s generating “HTML” (-t html), even though it’s really generating XML, just to suppress a warning about an unknown output format. And I redirect standard input from /dev/null because there is no additional input file that needs to be processed, just the metadata file. (Without this, Pandoc waits forever for input from the terminal.)
And that’s it!
RSS
RSS is basically the same, but there are a couple of quirks to take care of, again related to dates.
RSS template
Here’s the template for RSS:
<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>$title$</title>
    <description>Updates from rwl</description>
    <link>https://recursewithless.net/</link>
    <atom:link href="https://recursewithless.net/rss.xml" rel="self" type="application/rss+xml" />
    <lastBuildDate>$updated$</lastBuildDate>
    <ttl>1440</ttl>
$for(references)$
    <item>
      <title>$references.title$</title>
      <description>$references.abstract$ (Updated $references.issued$)</description>
      <link>$references.URL$</link>
      <guid>$references.URL$@$references.issued$</guid>
    </item>
$endfor$
  </channel>
</rss>
This is essentially the same information, just with slightly different tag names. The validator recommends adding the <atom:link rel="self" ...> element with the URL of the feed itself, as in Atom. It also recommends a <guid> element for each item in the feed with a unique identifier for the item, which is allowed to be any text. I combine the URL with the date there so that different updates to the same URL on different days will get different identifiers.
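One caveat of this scheme is that two updates to the same URL on the same day would end up with the same identifier. A small check can flag that before publishing. It is sketched here in Python, assuming the entries have been loaded into dictionaries with the same keys as feeds.yaml; rss_guid and duplicate_guids are hypothetical helper names.

```python
from collections import Counter


def rss_guid(entry):
    """Build the identifier used in the <guid> element: URL@issued."""
    return f"{entry['URL']}@{entry['issued']}"


def duplicate_guids(entries):
    """Return the guids that occur more than once, in first-seen order."""
    counts = Counter(rss_guid(entry) for entry in entries)
    return [guid for guid, count in counts.items() if count > 1]


if __name__ == "__main__":
    example_entries = [
        {"URL": "https://example.org/page.html", "issued": "2024-07-04"},
        {"URL": "https://example.org/page.html", "issued": "2024-07-07"},
    ]
    # Same URL on different days: no clashes are reported.
    print(duplicate_guids(example_entries))
```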
Note also that I don’t put any <pubDate> tag in the items. This is because according to the standard, that element can only contain an RFC 822 date, and I couldn’t see a way to convert YYYY-MM-DD dates to RFC 822 within Pandoc’s templating system. Fortunately, <pubDate> is optional, so I just leave it out and instead put the issued date in a note at the end of the <description> tag.
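If the conversion were done outside Pandoc, as a pre-processing step, it is short. The following is only a sketch in Python, assuming a fixed +02:00 offset to match the midnight timestamp used in the Atom template; the helper names are hypothetical. It relies on the standard library’s email.utils.format_datetime, which produces the day-name, day, month, year layout that RSS date elements use.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime

# Assumption: the same fixed +02:00 offset the Atom template appends.
LOCAL_TZ = timezone(timedelta(hours=2))


def midnight(day, tz=LOCAL_TZ):
    """Parse a YYYY-MM-DD string as midnight in the given timezone."""
    return datetime.strptime(day, "%Y-%m-%d").replace(tzinfo=tz)


def to_rfc822(day):
    """Format a YYYY-MM-DD date the way RSS date elements expect."""
    return format_datetime(midnight(day))


def to_rfc3339(day):
    """Format the same date the way Atom expects, for comparison."""
    return midnight(day).isoformat()


if __name__ == "__main__":
    print(to_rfc822("2024-07-07"))   # Sun, 07 Jul 2024 00:00:00 +0200
    print(to_rfc3339("2024-07-07"))  # 2024-07-07T00:00:00+02:00
```

The RFC 822 string would then have to be stored in an extra field of each entry (say, a hypothetical pubDate key) for the template to pick up, which is more machinery than the hand-maintained file needs.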
RSS Makefile recipe
Finally, here’s the Makefile recipe for the RSS feed:
rss.xml: feeds.yaml lib/templates/rss.xml
	pandoc -M updated="$$(date '+%a, %d %b %Y %T %z')" \
	  --metadata-file=feeds.yaml \
	  --template=lib/templates/rss.xml \
	  -t html \
	  -o rss.xml < /dev/null
Again, everything is the same as for Atom, except that my version of date doesn’t have a built-in RFC 822 output, so I generate it directly with a format string that satisfies the W3C’s validator.
And with that, I can put a new entry in feeds.yaml, run make all, and my website feeds get updated along with the rest of it!