iCalendar Support Part 2: a real parser!

First published .

Over the last month, I've continued the work I started in Part 1, building an honest-to-goodness parser that reads iCalendar data into Elisp data structures. Over the last week, I reached an important milestone: the code can now parse and reproduce all the examples in RFC5545, the iCalendar standard, as well as all my own calendar data.

It parses iCalendar data like this:

BEGIN:VEVENT
UID:19970610T172345Z-AF23B2@example.com
DTSTAMP:19970610T172345Z
DTSTART:19970714T170000Z
DTEND:19970715T040000Z
SUMMARY:Bastille Day Party
END:VEVENT

to Elisp parse trees like this:

(icalendar-vevent #s(icalendar-meta ...)
                  nil
  ((icalendar-uid #s(icalendar-meta ...)
      (icalendar-text #s(icalendar-meta ...)
         "19970610T172345Z-AF23B2@example.com"
         nil)
      nil)
   (icalendar-dtstamp #s(icalendar-meta ...)
      (icalendar-date-time #s(icalendar-meta ...)
         (45 23 17 10 6 1997 2 -1 0)
         nil)
      nil)
   (icalendar-dtstart #s(icalendar-meta ...)
      (icalendar-date-time #s(icalendar-meta ...)
         (0 0 17 14 7 1997 1 -1 0)
         nil)
      nil)
   (icalendar-dtend #s(icalendar-meta ...)
      (icalendar-date-time #s(icalendar-meta ...)
         (0 0 4 15 7 1997 2 -1 0)
         nil)
      nil)
   (icalendar-summary #s(icalendar-meta ...)
      (icalendar-text #s(icalendar-meta ...)
         "Bastille Day Party")
         nil)
      nil))

The code is available on sr.ht at this URL:

https://sr.ht/~wyleyr/icalendar-el/

and there are some instructions in the README to get started, if you want to play around with it. Here I want to share some of the decisions I made to get to this point and the reasons for them, and a couple of wrong turns I took along the way.

Representing the iCalendar data types

The first step in creating a parser is to decide what data structures it should produce. Following a suggestion by Adam Porter (which in retrospect I may have misinterpreted), I started by defining cl-struct types for each of the basic iCalendar data types: dates, date-times, text, etc. This turned out to be a bad idea. I quickly found myself writing a bunch of boilerplate code to convert between my structs and the "native" Emacs representations of these values, and abandoned the effort. Instead, I decided to use the native representations as much as possible, and create simple new representations (mostly just based on lists) where none already existed. I hope this will make the library maximally useful for people calling it from elsewhere in Emacs.

With that decision made for the basic data types, I still needed to decide on a representation for the parse tree. Here I followed Ihor Radchenko's suggestion and based the representation on Org mode's parse tree, defined in org-element-ast.el. I'm already relatively familiar with this representation, I like that it's is easy to work with, and I like that it's already proven its worth in a more complicated use case.

In my custom-rolled parse tree, have not implemented the lazy evaluation which helps make Org's parser fast. That would be a premature optimization at this point, and I suspect it may not be needed at all for iCalendar data (which unlike Org data will be small and static in "most" cases, since most people don't edit iCalendar data by hand). But Ihor is planning to make org-element-ast a more generalized library in Emacs, and if I ever need this optimization, using the same basic parse tree structure means it will be easy to switch to that library.

Expanding the define-* macros

In Part 1 I had already settled on a design that was inspired by cl-icalendar, and is an old Lisp tradition: using macros to turn a specification (in this case, the standard) into code. This makes the specification both readable by the programmer and enforceable by the program. The parser is built around four central macros, each of which corresponds to one of the major categories in the iCalendar grammar: define-type for value types, define-param for parameters, define-property for properties, and define-component for components. At the end of Part 1, these macros didn't do much except define some regular expressions and a matching function to provide syntax highlighting in iCalendar mode.

My first thought, then, was to simply expand these macros to accept more arguments and do more work. I added arguments to codify the semantics given in the standard, and defined functions in the macro body to parse each of the defined types. (ical:define-property ical:dtstart "DTSTART" ...), for example, would define the functions ical:parse-dtstart-property, ical:read-dtstart-property-value, ical:print-dtstart-property-node, and so on. Because these functions knew about the :default-type keyword argument passed to the macro, they could parse values of the appropriate type or raise errors if they weren't present.

This design turned out to be a mistake. Defining the parsing functions in the macro bodies resulted in an explosion of symbols, and made loading the parser very slow. I eventually changed to a design where the parsing functions were defined once, outside the macros, and accept a type symbol as their first argument. The macros now just cache the type-related metadata in their arguments as symbol properties on the defined type symbol, which the parsing functions can look up and use as needed.

I should have made this change much earlier. As the parsing functions in the macro bodies grew more complex, debugging them became painful: redefining them required reloading the whole library, and I would often get inscrutable errors at macro expansion time that were hard to locate and debug. Everything became instantly much easier, cleaner, and snappier when I finally realized I could just have a single parsing function that accepts a type symbol argument, instead of one parsing function for each type. An old lesson, learned again: Lisp macros are powerful, but you shouldn't use them where a function will do.

One part of the design that I'm still unsure about is the distinction I've introduced between "parsing" and "reading". Basically, parsing consists of matching a regular expression, followed by reading of the matched string to an Elisp data structure. The advantage of drawing this distinction is that reading looks like a pure function: whereas parsing depends on global state like the current buffer and the value of point, reading can "just" be a pure function that maps a string to something else. Unfortunately, looks can be deceiving. Except for very simple readers, reading still depends on a different piece of state: the match data, which is set up by the corresponding parsing function. This makes the reader functions more difficult to use and test independently of the parsing functions.

I'm not sure how to resolve this at the moment. Perhaps the easiest thing would be to pass the match data explicitly, as an optional argument, when a reader requires it. That would restore their purity, but it doesn't simplify the process of calling them: the caller still has to do the match, and make the match data available. Another option would be to do the matching inside the reader itself; but the parsing functions also need access to the match data (e.g. they should throw an error, instead of calling a reader, when a value doesn't match), and it would be silly to perform the match twice. I need to think about this some more.

Documentation

I focused a lot on making good documentation as I went along. This is because I think Emacs' documentation system is one of its best features, and because the lack of documentation for and in the existing icalendar.el code is one of the things that makes it most difficult to work with. So each of the define-* macros accepts a docstring, computes some additional documentation, and saves it all in the type-documentation property of the relevant symbol. This documentation is then accessible via M-x describe-symbol (i.e., C-h o). I spent a lot of time adding documentation for each value type, parameter, property and component based on the text of the standard, which I hope strikes a good balance of providing enough information to be useful without being too verbose. Each definition also has a link to the relevant section of the standard, so it's easy to get all the details if you need them.

I actually didn't know about describe-symbol until recently, but it's an extremely useful help function. I saw that cl-lib provides documentation for types and structs via describe-symbol, and in figuring out how it does this, I discovered the variable describe-symbol-backends. (So at least something good came out of my experiment with structs!) I wrote a simple backend, icalendar-describe-symbol-backend, to provide documentation of all the iCalendar type symbols, as well as rx regular expressions. This has already saved me from having to go back to the standard quite a few times.

Tests

Another feature I added relatively late was a file of ERT tests for the parser. These are simple tests that run a parsing function on a snippet of iCalendar data, then print the resulting parse tree to a string, and pass if the printed result is equal to the snippet. Taking the time to enter all the examples from RFC5545 as snippets, and getting the resulting tests to pass, gave me a lot more confidence that my parser is doing the right thing in every case. Again, I should have done this much earlier.

I had been aware of ERT before, but never used it myself. It is really slick: it's super easy to define and run tests, it was easy to build a macro around test definitions to reduce redundancy, and it's a pleasure to use interactively. I was also impressed with how fast the tests run. It's a wonderful Emacs feature and one I will rely on more as I continue this project.

What's next?

The next thing I want to work on is validation. The parser currently does not do anything to check that the structure of the parse tree is actually allowed by the standard. It is probably good to be flexible about this when parsing external data, but not when printing a locally-grown parse tree to iCalendar format. There are various constraints, such as which properties are required in which components, that can and should be checked before the data is written to a file or sent over the wire. Some of these are relatively complicated and will require functions specific to the case of a single node type.