BNF Grammar
At this point I want to take a moment and cover the concept of language grammars. In my experience most parsing resources will assume that you have at least a little knowledge about the concept making these resources difficult to use if you don't. If you already know a bit about language grammars, feel free to skip this page.
Essentially grammars are a way to write a definition for a language with the goal of making it easier to talk about that language. I am using the term language here pretty broadly to mean any agreed upon data format. That means that the 8601 Duration format could be thought of as a language though most people would not consider it one.
There are many options to choose from when trying to document a grammar, just like there are many language options to choose from when building software. I am going to use the Backus-Naur Form (BNF). We will going to walk through a BNF grammar describing the ISO 8601 Duration format piece by piece but if you wanted to look at the full thing you can scroll to the bottom of this page.
For those unfamiliar with grammar forms, they consist of a series of rules,
the left side of a rule is a name, the right side is a description of what
that rule means, in BNF the sides are separated by ::=
.
I find it easiest to read a grammar from the bottom up. For our duration format
we would start with the rule digit
<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
On the right side of this rule we see each number from 0 to 9 separated by a
|
, which here means or, so a digit is a single digit number from 0
through 9. Next we have the integer
rule.
<integer> ::= <digit> | <integer><digit>
An integer
is either a single digit or an integer
followed by a digit
.
This is where things can get a little confusing as this style of
notation might feel backwards, at least it does to me. Lets use the
example 999
, if you think about it starting with the right most 9
,
you would assign that position <integer><digit>
, the <integer>
here
would point to the middle 9
, this would also be assigned <integer><digit>
and again the <integer>
would represent the left most 9
, this would
finally be assigned <digit>
. Here is a little flow chart to hopefully
help visualize what I am trying to say.
<integer> = 9
┗━━━━━━━━━━━┓
<integer> = <integer>9
┗━━━━━━━━━━━┓
<integer> = <integer>9
When the right side of a rule looks like this it is referred to as a
"left recursive" rule. In this case we could also write it as <digit><integer>
making it "right recursive". The more important take away is that if you see a
rule in its own definition then it could go on forever in one direction or the other.
Moving up the grammar, next we have the remainder
rule.
<remainder> ::= .<integer>
This rule is defined by a .
followed by an integer
. At this point it
should become clear that as we move up the grammar, each rule will combine
the previous rules, possibly with some additions which is why I like to
start from the bottom. Next up we have number
.
<number> ::= <integer> | <remainder> | <integer><remainder>
A number
is either an integer
or a remainder
or an integer
followed
by a remainder
. This is a bit verbose but essentially we need a way
to say that both are optional but at least one must exist. So 0
would
work, also .877
would work and finally 0.877
would work.
The next few rules finally start to get into the specifics of the format.
There are two rules for each of our number + letter pairs. Since we are
using BNF the only operator we get is the |
for or, some other grammar
notations use other operators to make things a little more concise. If you have
ever written a regular expression, you would be familiar with the +
or ?
operators for declaring recursion or optional values. In BNF we are required
to create a new rule for the optional case if we want to have both optional
and non-optional. Take the seconds
and seconds-opt
rules as an example.
<seconds-opt> ::= <seconds> | ""
<seconds> ::= <number>S
The bottom one fits with what we went over on the previous page, a number
followed by the capital letter S
. The top one is just a way to make the
previous rule optional. There is an entry like the above for each of our
duration parts. After those we get to the time
rule, this rule will
hopefully make it clear why we needed those optional rules.
<time> ::= T<hours-opt><minths-opt><seconds> |
T<hours-opt><minths><seconds-opt> |
T<hours><minths-opt><seconds-opt>
minth is used here as both
M
onth andM
inute
This rule has 3 options, all three start with the letter T
and each is
followed by all 3 of the time rules, in each case one of the time rules
is not optional. This is saying that each of these parts can be optional
but not all of them can be absent. So T1H
or T2M1S
are okay but T
is not a valid time
. Skipping ahead to the date
rule we see a similar
pattern.
<date> ::= <years-opt><minths-opt><weeks-opt><days> |
<years-opt><minths-opt><weeks><days-opt> |
<years-opt><minths><weeks-opt><days-opt> |
<years><minths-opt><weeks-opt><days-opt>
So a date
can be 1D
or 1M1W
but couldn't just be empty. You might
have noticed that we again have an optional version of time
, just
like we did for the duration unit rules. When we look at the duration
rule, the top rule in the grammar, it should become clears as to why.
<duration> ::= P<date-length><time-length-opt> |
P<time-length>
As we covered in the previous page, durations can have from 1 to 7
number letter pairs. So this rule is saying that we can have a duration
only with a date part (P1M
) or a date part and a time part time part
(P1DT1H
) or just a time part(PT1S
) but it cannot be empty (P
).
If you haven't already below is the full BNF grammar I wrote for this the ISO 8601 Duration format.
<duration> ::= P<date-length><time-length-opt> |
P<time-length>
<date> ::= <years-opt><minths-opt><weeks-opt><days> |
<years-opt><minths-opt><weeks><days-opt> |
<years-opt><minths><weeks-opt><days-opt> |
<years><minths-opt><weeks-opt><days-opt>
<time-opt> ::= <time-length> | ""
<time> ::= T<hours-opt><minths-opt><seconds> |
T<hours-opt><minths><seconds-opt> |
T<hours><minths-opt><seconds-opt>
<years-opt> ::= <years> | ""
<years> ::= <number>Y
<minths-opt> ::= <months-or-minutes> | ""
<minths> ::= <number>M
<weeks-opt> ::= <weeks> | ""
<weeks> ::= <number>W
<days-opt> ::= <days> | ""
<days> ::= <number>D
<hours-opt> ::= <hours> | ""
<hours> ::= <number>H
<seconds-opt> ::= <seconds> | ""
<seconds> ::= <number>S
<number> ::= <integer> | <remainder> | <integer><remainder>
<remainder> ::= .<integer>
<integer> ::= <digit> | <integer><digit>
<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9