The joy of pattern matching – part II

The Template matcher

In “The joy of pattern matching – part I” we used a simple template language as an example to talk about pattern matching. Let us review that language again. It includes sequences of upper- and lowercase characters where lowercase characters are literals and uppercase characters represent any lowercase character sequence.

Consider now that we try to match the template sequence “aAa” with the sequence “abba”.

aAa = abba

Matching aAa against abba

Would the Erlang matcher be able to do this match if the sequences were represented as Erlang lists, binary strings, or whatever Erlang term we might represent the sequences with? No, not really. The reason is, that for Erlang matching to work [1], the left-hand side must have the same structure as the right-hand side – a fair constraint as this makes the matching process both more efficient and more well-defined (such as having at most one possible match, not many, to choose from). In our example above, one could say, that the two sides do not have the same structure because the A in “aAa” can, in our template language, represent any number of lowercase characters, thus it has a different structure than the right-hand side flat list, which has exactly four elements. So the Erlang matcher would not handle this. It does not magically realise that A must be the string “bb” in order for the two sequences to match.

If one were to unify the structures of the two sides in order to help the Erlang matcher, one could construct an example such as

[a, A, a] = [a, [b, b], a]

instead. Both sides now have three elements and the Erlang matcher is therefore able to match the A against the intended [b, b] (A is matched to be the complete list element [b, b]), but by doing this structure change we also turned the problem into a simpler one. Therefore, the answer to the question if the Erlang matcher would be able to match “aAa = abba” and figure out that A should be “bb”, remains “no”. The Erlang matcher is just not built that way.

As mentioned in part I there is some syntactic sugar in Erlang to match the tail portion of a list. A match expression such as “aA = aba” would then actually be possible in Erlang and can be written either as

"a" ++ A = "aba".

or

[$a | A] = [$a, $b, $c].

, if you prefer the list syntax. This works because the tail of a list (the part after the “|” character above, or after the “++” in the string syntax version) is always part of the list structure, thus both sides in this case share the same structure – a flat list. It does “not” work however if the prefix (“a” in this case) is a bound variable – it can only work on literal strings unfortunately. Erlang cannot either do the reverse – to match A ++ “a” = “aba”. This is an “illegal pattern” in Erlang because there is no way to match out a unbound prefix out of a list like this in Erlang. There is not even a syntax available for it. Even worse would be to attempt matching A ++ “b” ++ “c” ++ B = “abcd” which gives you yet another “illegal pattern” of course. The Erlang pattern matcher is cool and all, but it does not do magic for you.

All hope is not lost however. We can still have fun with more advanced pattern matching in Erlang as a more adept pattern matcher (a matcher that would support the powers of our template language) is fairly easy to write in Erlang (you just won’t get the nice, compact syntactic sugar for it as you get with the built-in Erlang matcher). So let us, as an exercise, now build that template matcher and see where that takes us, shall we?

Building the template matcher

In our template language we let uppercase characters represent placeholders (or variable names) that could match against lowercase sequences (where the empty sequence is the shortest such sequence). If we want such a template matcher library to be a bit more useful in different programming contexts we could instead let the sequences contain any characters and define the template variables to have a more seldom seen pattern – such as for instance the pattern $(VARIABLE_NAME). This means that templates could now look like the following examples.

"$(A)"
"Stefan $(SURNAME)"
"Because I think $(SOMETHING) is better than $(SOMETHING_ELSE)"
"a$(A)$(A)a"

Examples of template strings supported by our general template matcher

The last of the examples shows a repeated variable – in the string “a$(A)$(A)a” the variable A occurs twice. When we now implement the template matcher we would like to maintain the Erlang approach that says that, once a variable is bound to a variable it will remain bound to that variable and the match should fail if it would cause a variable to change its value at a later stage in the matching process.

Implementation

First, let us define the API. It is a function called match that takes two arguments – a template string and a string that you should attempt to match the template against, so calling match(Template, String) would be equivalent to do a Template = Sequence in our template language.

-export([match/2]).

-spec match(list(), list()) -> map() | {error, any()}.
match(Template, String) ->

The matching process then. It consists of iterating through the template and string arguments while maintaining a dictionary of discovered variable bindings. If we manage to reach the end of both strings at the same time, without detecting a conflict where a variable is re-bound to a new, different value we have found a match. The high-level function would look like the following piece of code where we start the matching process with an empty dictionary (line 6 – empty maps are denoted #{} in Erlang). The return value of the match function is the resulting dictionary of variable bindings which can be used for other computations, or, if the match fails, an error tuple {error, Reason}.

match(Template, String) ->
    try
        match(Template, String, #{})
    catch
        error:{badmatch,{error,eof}} ->
            {error, eof}
    end.

What does match/3 look like then? First of all, if the template is empty and the string is empty, then we have found a match and we return the resulting dictionary

match([], [], D) ->
    D;

If the beginning of the template string is the start of a variable – “$(” – we read the variable name up until its ending “)” (line 16), return the variable name and what remains of the template string after the variable and check if this variable is already bound in the dictionary (line 18). If the variable is not bound (line 19) we return the resulting dictionary where we iteratively try to bind and match the variable name to a larger and larger value, starting with the empty value [], in what remains of the template against the string. If, on the other hand, the variable already had a value (CurrentValue – line 22) we replace the variable name in the template with the value of that variable and return the result of trying to match the new template against the string using the dictionary, D.

match("$(" ++ Rest, String, D) ->
    {ok, VarName, RemainsAfterVariable} = read_var_name(Rest),
    case maps:get(VarName, D, undefined) of
        undefined -> %% if we don't have an existing binding we try all possible bindings
            bind(VarName, [], RemainsAfterVariable, String, D);

        CurrentValue -> %% have existing binding, substitute the value of the variable and try to match
            match(lists:flatten([CurrentValue | RemainsAfterVariable]), String, D)
    end;

If the first characters in the template and the string are equal (line 25 – see how we use the Erlang’s matcher to enforce equality by using C in both first and second argument) and is not the beginning of a variable name – does not start with “$(” – we return the match result we get when we just remove the first character in both arguments and then match the remaining strings against each other (note: since this clause comes after the clause above we know that C is not the beginning of “$(” in this case because otherwise we would have chosen that clause instead)

  
match([C | TemplateRest], [C | StringRest], D) ->
    match(TemplateRest, StringRest, D);

Otherwise, if none of the clauses above was chosen (because they did not match) we know that we do not have a match, so we return an error tuple.

match(_, _, _) ->
    {error, no_match}.

If it were not for the length of the name, the function bind should have perhaps better been called bind_and_match, because what it does is to try and bind a variable name to a value and then return the result of the first successful match it finds (or none). So what does the bind function look like?

First, if there is nothing remaining in the Template nor the String we simply extend the dictionary, D, with the VarName pointing to the Value we have found so far and return the dictionary (the lists:reverse is because we build up the value in reverse as we scan through the string – a well-known Erlang approach)

bind(VarName, Value, [], [], D) ->
    D#{VarName => lists:reverse(Value)};

Then, if we try to bind a variable VarName when there is nothing left after the variable in the template, then the value of that variable is simply what remains in the string, so we extend D by letting VarName bind to what is left in the string

bind(VarName, _, [], String, D) when length(String) > 0 ->
    D#{VarName => String};

If we reach the end of the string but we have characters left in the template, we extend the dictionary, D, with the Value we have found so far and then return the result of matching the remainder of the template against the empty string – which would of course fail unless the template contains only variables that can be bound to zero length values.

 
bind(VarName, Value, Template, [], D) when length(Template) > 0 ->
    NewD = D#{VarName => lists:reverse(Value)},
    match(Template, [], NewD);

If the rest of the template is the same as the string, we bind the accumulated value AccValue to the variable name and return the dictionary, D, (once again, note how the Erlang matcher constrains the two variables to be equal – this clause would not be taken unless its arguments matches the clause, thus argument 3 and 4 must be equal.

bind(VarName, AccValue, String, String, D) ->
    D#{VarName => lists:reverse(AccValue)};

Then, lastly, we are in the process of finding the value for VariableName and have so far found the value AccValue and the rest of the string starts with some character C (line 46). We try to create a new dictionary, NewD, by binding the variable to AccValue (line 47 – once again we do lists:reverse because we have accumulated the value in reverse), then we try to match the remainder using this new dictionary (line 48). If this fails (line 49), we instead add C to the accumulated variable value and try, recursively, to bind the variable name to that value instead, using what is left of the string after having removed C (line 50). However, if we find a match (on line 48) we return yet another dictionary, YetAnotherD (line 53), which is our resulting dictionary.

bind(VarName, AccValue, Template, [C | StringRest] = String, D) ->
    NewD = D#{VarName => lists:reverse(AccValue)},
    case match(Template, String, NewD) of
        {error, _Reason} ->
            bind(VarName, [C | AccValue], Template, StringRest, D);

        YetAnotherD ->
            YetAnotherD
    end.

(The read_var_name function is left as an exercise for the reader, or you can have a look at the full code in the git repository[2])

Test drive

So now we can try our template matcher with some examples

1> template:match("$(A)", "Hello world!").
#{"A" => "Hello world!"}
2> template:match("Stefan $(SURNAME)", "Stefan Hellkvist").
#{"SURNAME" => "Hellkvist"}
3> template:match("Because I think $(SOMETHING) is better than $(SOMETHING_ELSE)", "Because I think Mesi is better than Ronaldo").
#{"SOMETHING" => "Mesi","SOMETHING_ELSE" => "Ronaldo"}
4> template:match("a$(A)$(A)a", "abba").
#{"A" => "b"}
5> template:match("$(A)ab$(A)$(B)$(A)", "abb").
#{"A" => [],"B" => "b"}
6> template:match("a$(A)a$(B)", "abaa").
#{"A" => "b","B" => "a"}
7> template:match("a$(A)$(B)a", "abba").
#{"A" => [],"B" => "bb"}

, where [] in the answers denote the empty string (strings and lists are the same thing in Erlang…unfortunately).

Note that, the 7th example is our example from part I which has more than one solution. Although our method manages to find a match (A=””, B=”bb”) it might not be the one you expected to see perhaps. The template matcher will give you the first solution it finds, not all of them.

The template library on github also includes the reverse function of the match – the replace. This function takes a template and a dictionary and will replace variables in the template with whatever binding these variables have in the dictionary. That implementation is much more straight forward though.

In part III of this series about pattern matching I will make use of this template match (and replace) library to try to do some rudimentary natural language processing.

References
  1. http://erlang.org/doc/reference_manual/expressions.html#pattern
  2. https://github.com/peffis/template

The joy of pattern matching – part I

The passion

Let us assume a template language of all character sequences containing any number of lowercase and uppercase characters ([A-Za-z]*), where lowercase characters are literals – constants with a name – and uppercase characters are placeholders for any sequence of lowercase characters (including the empty sequence).

"abc"
"aABd"
""
"abcasdasdlkjasdLKJKJDASSASaaS"

Some examples of the template language

Then, given such a template, play the game of trying to match the template with a sequence containing only lowercase characters, where “match” is the problem of finding values for all the uppercase characters so that the two sequences become equal. For instance, try to figure out what sequences the uppercase characters would represent in the below examples:

"abA" = "abc"
"aAa" = "abba"
"aAaA" = "acdacd"
"aABa" = "abba"

Match is denoted with the character '='

I’m curious. Did you enjoy playing this game? And did you think about all the different solutions for the fourth example? For me, even this simple game gives me a very odd satisfaction which I cannot really explain. Could it be some kind of OCD[1]?

I returned to using Erlang some time around year 2000 after having spent several years building systems in Pascal, C, C++, Python, Perl, PHP, Javascript and (far too much) Java. From the start, Erlang just felt “right” and over the years I have grown more and more fond of it. I often wonder why it has the appeal it has on me. Why does not Go (I do a lot of Go programming as well) have the same appeal as Erlang? I sometimes give vague reasons like “Erlang matches well how I solve computer science problems” without knowing what I really mean, or I try to find more well-defined reasons like “on a multi-core CPU it really is an awesome way to make good use of the hardware” or “the fail-fast methodology with supervision trees makes for very robust software”, but the more I think about it, what really attracts me, is the pattern matching abilities of this beautiful language in combination with its functional programming paradigm.

In computer science, pattern matching is the act of checking a given sequence of tokens for the presence of the constituents of some pattern. […] The patterns generally have the form of either sequences or tree structures. Uses of pattern matching include outputting the locations (if any) of a pattern within a token sequence, to output some component of the matched pattern, and to substitute the matching pattern with some other token sequence (i.e., search and replace).

– from Pattern matching, Wikipedia[2]

In Erlang the matching operator is the equal sign, ‘=’ (which is a bit confusing for people coming from other programming languages where an equal sign means “assign a value to a variable”). In Erlang, the equal sign simply means that you match the left-hand side with whatever is on the right-hand side with the side effect that you, if the match succeeds, bind any variables found in the pattern on the left-hand side. In Erlang, once a variable is bound to a value as a result of a match, it will always represent that value within that scope, forever. You can never “re-bind” a variable to a new value (or “re-assign” if you think in more traditional terms). This can be shown in our invented sequence language from above with the following three sequence example

"abcd"
"aAd"
"Abc"

A non-matching example

Here we realize that the three sequences together cannot match as the first two sequences would bind A to bc (in some imaginary “context” where we store our bindings) which would make the third sequence bcbc, which does not match the other two sequences. In Erlang, when you try to match and fail doing so, you get an exception, like in the below Erlang example

A = 1,
A = 2.
** exception error: no match of right hand side value c

, where you get an exception on line 2 because A is on line 1 bound to the integer 1 and integer 1 cannot ever match integer 2.

1 = 2.
** exception error: no match of right hand side value 2

but the below code is fine of course as 1 matches 1 (although it is code seldom seen in real life I suppose)

1 = 1.

Are you still reading this? Ok, then perhaps you must have the same odd fascination for pattern matching as I do, or perhaps you suffer from something else ­čÖé

The pattern matching in Erlang [3] not only makes your code expressive and therefore short (which is an important beauty aspect of a program), but it also gives you these extremely charming puzzles (like the game above) to solve while you program which brings some kind of creative energy to the programming experience. There just is something charming with taking two patterns and see how they would match (if they would match at all). Pattern matching is however by no means unique to Erlang. It exists in many other languages such as Snobol, Haskell and nowadays somewhat even in Python and Javascript. In languages where it does not exist however I nowadays dearly miss its powers.

Pattern matching in Erlang

Pattern matching plays a large part in any Erlang program. One example is how Erlang “selects” function clauses to execute as in the example below where sum_list/2 has two clauses starting on lines 6 and 9 and the clause to execute is selected depending on matching of the two arguments (where in this case the value of the second argument plays no role) – if the list is empty, take the first clause, otherwise select the second clause.

%% calculate the sum of all numbers in a list
sum_list(Ls) ->
    sum_list(Ls, 0).


sum_list([], C) ->
    C;

sum_list([A | Rest], C) ->
    sum_list(Rest, C + A).

You can match on any term, any type, in Erlang just as long as the left-hand side of the match follows the same structure as the right-hand side. The left-hand side may include unbound variables but the right-hand side can only include bound variables.

You can match lists

[1, A | B] = [1, 2, 3, 4, 5].
%% A = 2,
%% B = [3, 4, 5]

You can match strings (which are lists, but there is a syntactical sugar for matching strings which is worth noting)

"prefix" ++ A = "prefixa".
%% A = "a"

You can match binaries (where you have the option to use special bit-field specifiers)

%% matching out the header fields from a binary IP Packet
<<?IP_VERSION:4, HLen:4, SrvcType:8, TotLen:16,
      ID:16, Flgs:3, FragOff:13,
      TTL:8, Proto:8, HdrChkSum:16,
      SrcIP:32,
      DestIP:32, RestDgram/binary>> = Packet

You can match maps

#{a := A, b := #{c := C}} = #{a => 1, b => #{c => 2}}.
%% A = 1
%% C = 2

You can match tuples

{A, B} = {1, 2}.
%% A = 1
%% B = 2

…and well, any term constructed from any combination of types you can think of

{A, [#{key := "abc" ++ C}, B]} = {a, [#{key => "abcdef"}, 2]}.
%% A = a
%% B = 2
%% C = "def"

Is this not beautiful? Making use of matching can make your code very dense and expressive and, to some (like me), it also enhances the pleasure of writing code. Pattern matching can also be used to enforce constraints on your data structures so that, if a match fails, you know that your program is in a bad state and should be restarted (in Erlang you just let things crash and have some supervisor restart the failed service).

I will continue writing about pattern matching in more general terms in part two of this post.

References
  1. https://en.wikipedia.org/wiki/Obsessive%E2%80%93compulsive_disorder
  2. https://en.wikipedia.org/wiki/Pattern_matching
  3. http://erlang.org/doc/reference_manual/patterns.html