Archive for October, 2009

Parser combinators are as expressive as possible

Wednesday, October 28th, 2009 by Nils Anders Danielsson

What is the expressive strength of parser combinators? Which languages can be expressed using them? The question depends on the definition of parser combinators used. In our last meeting I showed that a simple parser combinator library which I have implemented in Agda is as expressive as possible: it can be used to recognise exactly those languages (sets of bit strings) for which a decision procedure can be implemented in Agda.

Consider the following language of parser (or perhaps recogniser) combinators:

data P : Bool → Set where
  ∅   : P false
  ε   : P true
  tok : Bool → P false
  _∣_ : ∀ {n₁ n₂} → P n₁ →        P n₂  → P (n₁ ∨ n₂)
  _·_ : ∀ {n₁ n₂} → P n₁ → ∞? n₁ (P n₂) → P (n₁ ∧ n₂)

The parsers are indexed on their nullability; the index is true iff the parser accepts the empty string. The first four combinators are unsurprising: failure, the recogniser for the empty string, the recogniser for a given token, and symmetric choice. The last combinator, sequencing, is more interesting: the second argument is allowed to be coinductive if the first argument does not accept the empty string:

∞? : Bool → Set → Set
∞? true  A =   A
∞? false A = ∞ A

(Read ∞ A as a thunk, i.e. a suspended computation of type A.) This ensures that grammars can be infinite, but only in a controlled way. If you check out the code you will find a formal semantics for these combinators, and a (verified and total) procedure which decides whether a given string is a member of a given language defined using the combinators. This shows that every language definable using the combinators is decidable. (For more details, see Mixing Induction and Coinduction.)

What about the other direction? Given a decision procedure of type List Bool → Bool a corresponding grammar can be produced. The idea is to build an infinite tree with one “node” for every possible input string, and to make this “node” accepting iff the decision procedure accepts the given string:

accept-if-true : ∀ n → P n
accept-if-true true  = ε
accept-if-true false = ∅

grammar : (f : List Bool → Bool) → P (f [])
grammar f = tok true  · ♯ grammar (f ∘ _∷_ true )
          ∣ tok false · ♯ grammar (f ∘ _∷_ false)
          ∣ accept-if-true (f [])

(Read ♯_ as “delay”, i.e. as the constructor of a thunk.) For a formal proof of the correctness of the grammar, see the Agda code. I have also generalised the proof to proper parser combinators, for which the statement is that, for every R, every function of type List Bool → List R can be realised using parser combinators (if set equality is used for the list of results); if the bind combinator is allowed, then the result easily generalises to arbitrary token types.

The language defined by grammar f above can be parsed using deterministic recursive descent with one token of lookahead; grammar f could be viewed as an “infinite LL(1)” grammar. I suspect that this grammar scheme can be implemented using many parser combinator libraries. I also suspect that the result above is already known. It has been known for a long time that a potentially infinite grammar can represent any countable language (I found this mentioned in Marvin Solomon’s PhD thesis from 1977), and given that fact the result above is only a small step away. Pointers to the literature are welcome.

Factorising folds for faster functions

Friday, October 16th, 2009 by Graham

When I visited Galois a couple of years ago, I worked with Andy Gill on formalising the worker/wrapper transformation that is used within GHC to improve the performance of programs. The theory that we came up with was simple and general, but was based upon modelling recursive programs in terms of a fixed point operater, and hence wasn’t able to exploit additional structure that may be present in many programs.

In today’s talk (based upon joint work with Mauro Jaskelioff and Andy Gill), I showed how the worker/wrapper theory can be specialised to the structured pattern of recursion encapsulated by fold operators, and discussed the resulting benefits. I gave one simple example, deriving fast reverse, and one more complicated example, verifying the correctness of an efficient implementation of substitution from Voigtlander’s MPC 2008 paper. Further details are in our new paper.

Left Kan extensions of containers

Saturday, October 3rd, 2009 by Thorsten

I find that it is often helpful to look at operations on functors instead of datatypes. An example is the exponential of functors which turns out to be an operation on containers. Recently, I have been looking at left Kan extensions, or short Lans. Left Kan extensions are defined as the left adjoint to composition, i.e. the set of natural transformations from F to G \circ J is naturally isomorphic to the set of natural transformations \mathrm{Lan}_J\,F to G.

There is a useful explicit representation of Lans using coends:
\mathrm{Lan}_J\,F = \int^X(J\,X \to A)\times F\,X
(for simplicity I assume that the range of the functors is Set).
The coend is actually the quotient of a large \Sigma-type:
 \dots = (\Sigma X.(J\,X \to A)\times F\,X)/\sim
where \sim is the equivalence relation generated by
 (X,g\circ J\,h,x) \sim (Y,g,F\,h\,x)
for any h\in X\to Y.

Neil Ghani found a nice application of Lans to give initial algebra semantics to GADTs. As an example consider an encoding of Fin as a GADT:

\begin{code}
data Fin : Set -> Set where
  fz : Fin (Maybe a)
  fs : Fin a -> Fin (Maybe a)
\end{code}
Can we understand Fin as an initial algebra of an endofunctor on the category of endofunctors. A problem is the occurence of Maybe on the right hand side. Using Lans we can shift this on the right, i.e. Fin = \mu H.\mathrm{Lan}_\mathrm{Maybe}\,(1 + H).

What has this to do with containers? A container is a functor of the form F\,X = \Sigma s\in S.P\,s \to X where S\in\mathrm{Set} and P\in S\to\mathrm{Set}. We write F = S \lhd P. Now given also G = T\lhd Q we can use the coend formula to work out the container representing \mathrm{Lan}_F\,G:
 \begin{eqnarray*}
\mathrm{Lan}_F\,G\,A & = & \int^X (F\,X \to A) \times G\,X \\
& = & \int^X (\Sigma s\in S.P\,s \to X) \to A) \times \Sigma t\in T.Q\,t \to X \\
& \simeq & \Sigma t\in T.\int^X (\Sigma s\in S.P\,s \to X) \to A) \times Q\,t \to X \\
& \simeq & \Sigma t\in T.(\Sigma s\in S.P\,s \to Q\,t) \to A)\\
& = & t \in T \lhd \Sigma s\in S.P\,s \to Q\,t
\end{eqnarray*}
The interesting step isthe 4th equality from the top. I worked it out using the explicit representation of coends as a \Sigma-type, but maybe there is a more abstract way to prove it. I also checked the equations by plugging it into the definition of a Lan as an adjunction using the explicit representation of container morphisms.

Now if we apply the formula to Lan_{\mathrm{Maybe}}\,G, where \mathrm{Maybe} = b\in 2 \lhd b=1 we obtain
Lan_{\mathrm{Maybe}}\,G = t\in T \lhd \Sigma b\in 2.(b=1)\to Q\,t
which can be further simplified to
 \dots = \in T \lhd 1 + Q\,t
Hence, Lan_{\mathrm{Maybe}}\,G\,A = A \times G\,A. Indeed, we could have proven this result directly using the Yoneda lemma (which also shows that it doesn’t depend on G being a container). Applying this isomorphism to Fin it turns out that there is an isomorphic datatype which isn’t a GADT, namely:

\begin{code}
data Fin' : Set -> Set where
  fz' : a -> Fin' a
  fs' : a -> Fin' a -> Fin' a
\end{code}
I leave it to the reader to work out the isomorphism between Fin and Fin’.

Btw, what about Right Kan Extensions? I can show that if F is a container then \mathrm{Ran}_F\,G always exists (even if G is not a container). However, \mathrm{Ran}_F\,G isn’t necessary a container even if both F and G are.