Friday, October 10, 2008

The Birth of a Language

Matlab: An imperfect marriage

Matlab is great. Matlab has a function for whatever you want to do with a matrix (convolve it, invert it, sautee it, etc...). Matlab, at it's core, is an elegant array language every bit as powerful as APL, K and their ilk (while saving us from their painful syntax).

Matlab is also an abomination. Everything about the language except its matrix processing capacities was designed by an apparent drunkard. Cell arrays, handle graphics, strings as naked character arrays, dynamic return values, oh God! By far the most annoying aspect of using Matlab is the terrible sight of red-texted error message after running some CPU-intensive code for hours. It's just heartbreaking. And you know what? This is America, and we can do better.

Introducing Zahir


The language I want to code in looks an awful lot like Matlab but has better manners. If Matlab plans on giving up the ghost in the middle of a computation, it should have the decency of stopping itself. Of course, given that Matlab is a dynamically typed language (and a sloppy one at that) this is often a difficult request to fulfill. Which is why we need a new language. A better language. A statically typed language. I call it Zahir. Why "Zahir"? Well, read Borges' story; it's a commentary on my unfortunate coding style.

Aside from safety, are there any other benefits to going static? Yes!
Beyond the blissful protection from easily prevented run-time exceptions, we also get a lot of performance for very little effort. Statically compiled languages, especially of the functional variety, are orders of magnitude easier to optimize that dynamic languages. For example, OCaml gets stunning performance with a barely-optimizing lazy-loaf of a compiler. I'm not going to prematurely optimize the Zahir code (hell, for now it's just a treewalking interpreter), but I expect that speedy performance is ultimately in the cards for this language. Beyond the big two (safety and speed), there's a raft of lesser but still worthwhile benefits that accompany a static type system (such as algebraic datatypes and pattern matching).

Zahir may be The One, but is it ready?


I don't know how other languages got created, but mine sure isn't coming out in some magical 14-hour coding-session-of-brilliance. I've been working on and off on the Zahir language for a long time and despite not even nearing alpha quality I still end up cutting away nearly as much code as I add.

Recently I've felt some optimism that I'm finally crawling towards a suitably powerful language skeleton, onto which the matrix-flesh of a good Matlab replacement can be sewn. I'm writing this post to give myself a boost of clarity and organization.

The Zahir Type System


Implicit scalar expansion

This is the feature which makes imitating Matlab-style programming possible. When you say sin(x) in Matlab it doesn't matter whether x is a scalar or a matrix of arbitrary dimension. sin simply handles its arguments intelligently and lets the user write elegant code which reflects the underlying mathematical formulae. Zahir has to provide the same semantics but with the added benefit of static type checking. The powerful static languages (ML dialects and Haskell) wisely stay away from this sort of compile-time voodoo. For them, however, scientific and engineering applications are second class citizens which can survive on the scraps of inconvenient syntax thrown their way.

One reasonable-seeming transformation I'm playing with right now is inspired by the NTD semantics of the SequenceL language. When x is a Sequence the code sin(x) can be transformed into map(sin, x) by the type checker. Giving a context-dependent type to the symbol sin is a risky endeavour and certainly complicates type checking. Nonetheless, for an engineering oriented language, it's worth it.

Polyvariadic Functions
For the above map transformation to really fulfill the big role I'm planning for it in Zahir, the map function will have to be considerably more powerful than its equivalents found in ML or Haskell. Specifically, I need map to work for functions of any number of arguments. This would make map a polyvariadic function. What's a polyvariadic function and why do we need these monstrosities? Well, think of adding two vectors: a + b. In Zahir, this corresponds to calling the function + with the arguments (a,b). If scalar expansion works correctly, then we get element-wise array operations for free. Once transformed this snippet becomes map(`+, a, b). Though it looks simple, this code is devilishly hard to assign a static type. As of now, the best type I can conceive of for map is map :: ((... 'a#n -> 'b), ... 't('a#n)) -> 't('b) | Sequence 't. This jumbled mess says that map take a function of n arguments, a variable number of sequences, and returns a sequence of the function's return type. What's that Sequence 't doing after the vertical bar? Well, that's a type constraint, which brings us to...

Type and Constructor Classes
To pull of generic scalar expansion over arbitrary datatypes (which might be unwise, but nonetheless appeals to me), I need some form of overloading. The most elegant solution here seems to be to dip into the Haskell universe and borrow their class system. In Haskell, type and constructor classes are sets of types over which common functions have been defined. A Sequence for example, could be characterized by this declaration:


class Sequence 't where
map :: ('a -> 'b), 't('a) -> 't('b)
end


If we're using polyvariadic functions as mentioned above, Sequences start looking a little funkier:


class Sequence 't where
map :: ((... 'a#n -> 'b), ... 't('a#n)) -> 't('b)
end


To include matrices as instances of Sequence, we would write:

instance Sequence mat where
map(f, ... args) = %% some definition here
end


Note that the above is a constructor class instance over a generic parameter 'a. This means that our definition of map will work for any matrix regardless of whether it's a mat{num} or mat{complicatedNonsenseObject}.