Copyright (C) 2016, 2017, H. Conrad Cunningham
Advisory: The HTML version of this document may require use of a browser that supports the display of MathML. A good choice as of October 2017 is a recent version of Firefox from Mozilla.
TODO:
TODO
Few computer science graduates will design and implement a general-purpose programming language during their careers. However, many graduates will design and implement---and all likely will use---special-purpose languages in their work. These special-purpose languages are often called domain-specific languages [Wikipedia].
Paul Hudak defines a domain-specific language (or DSL) as "a programming language tailored to a particular application domain" [Hudak 1998], that is, to a particular kind of problem.
General-purpose languages (GPLs), such as Java, Python, C, and Haskell, seek to be broadly applicable across many domains. They can, in theory, compute any function that is computable by a finite procedure; they are said to be Turing complete.
DSLs might be Turing complete, but often they are not. DSLs are little languages [Bentley 1986] that "trade generality for expressiveness" [Mernik 2005].
Ideally, a DSL should enable experts in an application area to program without programming---that is, to express the problems they want the computer to solve using familiar concepts and notations, without having to master the intricacies of programming in a general-purpose language [Hudak 1998, van Deursen 2000].
For example, the DSLpic
(long available on Unix-based computers) enables writers to produce line drawings in typeset documents; they can focus on the layout of the drawings without being required to develop programs in C (the primary general-purpose language used on Unix) [Bentley 1986].
Other DSLs on the Unix platform include:
lex
for generating lexical analyzers
yacc
for generating parsers)
make
for building software from its various sources
Markup languages are also DSLs.
HTML is a DSL for formatting documents on the Web.
LaTeX is a powerful markup language used for books and articles (especially in STEM disciplines).
Markdown is a simple markup language for documents that may be needed in various formats. This document is written using the variant of Markdown supported by the Pandoc tool.
The designers of a DSL must select relevant concepts, notations, and processes from the application domain and incorporate them into the DSL design [Hudak 1998].
What is and is not a DSL is a fuzzy concept. Not all writers agree on the definition.
Martin Fowler suggests that one key characteristic of a DSL is its language nature [Fowler 2011].
Fowler argues that a DSL must be fluent. That is, the meaning of the language comes not just from the individual expressions (i.e., the words) but also from how the expressions are composed together (i.e., into sentences and paragraphs). It has both a vocabulary and a grammar.
Fowler also suggests that another key characteristic of a DSL is its limited expressiveness [Fowler 2011]. He argues that a DSL should have only those features needed to support its target domain. A DSL should not attempt to solve all problems for all users for all time. It should not seek to define an entire software system. It should instead focus on providing an effective, uncluttered solution to a specific aspect of the overall system.
Fowler classifies DSLs into two styles [Fowler 2008 2011]:
Although the terminology is relatively new, the ideas are not.
An external DSL is a language that is different from the main programming language for an application, but that is interpreted by or translated into a program in the main language. The external DSL is a standalone language with its own syntax and semantics.
The Unix little languages pic
, lex
, yacc
, and make
exhibit this style. They are separate textual languages with their own syntax and semantics, but they are processed by C programs (and may also generate C programs).
External DSLs may use ad hoc techniques (e.g., hand-coded recursive descent parsers), parser-generation tools (e.g., lex
and yacc
in the C/Unix environment and Happy and Alex on the Haskell Platform), or parsing libraries (e.g., the Haskell library Parsec, Scala's parser combinator library, and the Lua library LPeg).
Example external DSLs in the instructor's notes for other classes include some of the Scala-based Secret Panel (State Machine) and the Lua-based Lair Configuration DSLs.
To distinguish between external DSLs and GPLs, Fowler cites the need for a DSL to limit its features to the minimal set needed for the specific domain [Fowler 2011].
For example, Fowler considers the programming language R a GPL not a DSL. Although R has special-purpose features to support statistical programming, it has a full set of general purpose features for a wide range of programming tasks.
An internal DSL transforms the main programming language itself into the DSL--the DSL is embedded in the main language [Fowler 1998].
The techniques for constructing internal DSLs vary from language to language.
The language Lisp (which was defined in the 1960s) supports syntactic macros, a convenient mechanism for extending the language by adding application-specific features that are expanded at compile time. The Lisp macro approach has been refined and included in languages such as Scheme, Clojure, and Elixir.
Internal DSLs in the language Ruby exploit the language's flexible syntax and runtime reflexive metaprogramming facilities. The Ruby on Rails web framework includes several such internal DSLs. (I have also implemented a little language for surveys as internal DSL in Ruby [Cunningham 2008].)
Haskell's algebraic data type system has stimulated research on "embedded" DSLs for several domains including reactive animation and music [Hudak 1998, 2000].
In object-oriented languages, internal DSLs may also exploit object structures and subtyping. Example internal DSLs in the instructor's notes include the following:
Scala-based Computer Configuration DSL
Scala-based Email Message Building DSL
Lua-based Lair Configuration DSL
DSLs in the instructor's notes
Ruby-based Survey DSL
To distinguish between an internal DSL and a standard command-query Application Programmer Interface (API), Fowler cites the need for a DSL to be fluent [Fowler 2011]. The command-query API provides the vocabulary; the DSL provides the grammar for composing the vocabulary "words" into "sentences". The implementation of a DSL is often supported by an API with a fluent interface, a vocabulary of operations designed to be composed smoothly into larger operations.
The difference between shallow and deep embeddings of an internal DSL concerns the relationship between the implementations of a DSL's syntax and its semantics.
In a shallow embedding of an internal DSL, the implementation's types and data structures directly represent the semantics of the domain but do not represent the syntactic structure of the domain objects.
For example, the regular expression package from the Thompson Haskell textbook [Thompson 2011], section 12.3, is a shallow embedding of the regular expression concept. It models the semantics but not the syntax of regular expressions. It uses functions to represent the regular expressions and higher order functions (combinators) to combine the regular expressions in valid ways.
Similarly, the Scala-based Computer Configuration and Email Message Building DSLs and the Lua-based Lair DSLs are relatively shallow embeddings of the DSLs.
The advantage of a shallow embedding is that it provides a simple implementation of the semantics of the domain. It is usually straightforward to modify the semantics by adding new operations. If these capabilities are all that one needs, then a shallow embedding is convenient.
A disadvantage is that it is sometimes difficult to relate what happens during execution to the syntactic structure of the program, especially when errors occur.
In a deep embedding of an internal DSL, the implementation's types and data structures model both the syntax and semantics of the domain. That is, it represents the domain objects using abstract syntax trees.
For example, section 19.4 of the Thompson Haskell textbook [Thompson 2011] redesigns the regular expression package as a deep embedding. It introduces types that represent the syntactic structure of the regular expressions as well as their semantics.
The advantage of a deep embedding is that, in addition to manipulating the semantics of the domain, one can also manipulate the syntactic representation of the domain objects. The syntactic representation can be analyzed, transformed, and translated in a many ways that are not possible with a shallow embedding.
As an example, consider the deep embedding of the regular expression DSL. It can enable replacement of one regular expression by an equivalent simpler one, such as replacing (a*)*
by a*
.
Of course, the disadvantages of deep embedding are that they are more complex to develop, understand, and modify than shallow embeddings.
What about the Expression Language we discuss in a separate case study?
The concrete syntax and semantics of the Expression Language is different from Haskell, so at that level it is an external DSL. The parsers recognize valid arithmetic expressions in the input text and create an appropriate abstract syntax tree inside the Haskell program. The abstract syntax tree differs from the textual arithmetic expression and from its parse tree.
However, the abstract syntax tree does capture the essential aspects of the syntax. And the abstract syntax tree itself can be considered a deep embedding of an internal DSL for the abstract syntax. For the remainder of the processing of the expression, the abstract syntax tree preserves the important syntactic (structural) features of the arithmetic expressions.
We transformed the abstract syntax trees by simplifying them (and by generating their symbolic derivatives). We also translated the expressions to instruction sequences for a Stack Virtual Machine.
The accompanying Sandwich DSL case study (in Haskell and Scala) gives another example of how one can create a deeply embedded DSL in a simple situation.
When we look around, we can see DSLs everywhere!
Consider how I create this document on DSLs. I write the text with the text editor Emacs, which includes a number of extensions, perhaps written in DSL(s). I indicate the document's structure using a DSL, Pandoc's dialect [MacFarlane 2017] of the markup language Markdown. If I add mathematical notation to the document, I write these in a subset of the LaTeX DSL. I then execute the pandoc
tool on the input file.
Pandoc (which itself is written in Haskell) reads the Markdown input, converts it to an abstract syntax tree internally, and then writes an appropriate HTML (a DSL) output file (that you are likely reading). I also direct Pandoc to write a LaTeX (a DSL) output file, on which I execute the tool pdflatex
to create a PDF document. I could generate documents in other standard formats such as Microsoft Word's .docx
format (essentially a DSL) and EPUB (a DSL).
For the HTML output, I direct Pandoc to generate MathML, a standard DSL in the XML family that describes mathematical expressions for display on the Web. (It is currently supported by the FireFox browser but not all others, which is why I recommend viewing these documents with FireFox.)
I wrote the original version of this document directly in HTML before I started using Pandoc. Other documents used in this course were written originally using LaTeX. I used Pandoc to convert these documents to Markdown, which gave me the starting point for my recent changes.
To add a new input format to Pandoc requires a new reader program that can parse the input and generate an appropriate abstract syntax tree.
To add a new output format to Pandoc requires a new writer program that can access the abstract syntax tree and generate appropriately formatted output.
Pandoc's abstract syntax tree is made available to writers using either Haskell algebraic data types or JSON (JavaScript Object Notation) structures.
JSON is an external DSL in that it uses a subset of JavaScript's concrete syntax to express the structure of the data. But, because it is JavaScript code, it also defines an equivalent internal data structure, so it also has aspects of an internal DSL. Pandoc could use a JSON Schema to define the supported format. A JSON Schema is a JSON document with a specific format (a DSL) that defines the format of other JSON documents (other DSLs).
My workflow uses many DSLs directly or indirectly.
The whole conversion process in Pandoc revolves around the abstract syntax tree.
TODO: Revise and refine this to better address the design of Haskell DSLs.
In a previous module, we examine families of related functions to define generic, higher-order functions to capture the computational patterns for each family. We seek to raise the level of abstraction in our programs.
Design of DSLs is similar, except that we seek to design a language to express the family members in some application domain rather than design a higher-order function.
We should first analyze the domain systematically and then use the results to design an appropriate DSL syntax and semantics. We analyze the domain using Scope-Commonality-Variability (SCV) analysis [Coplien 1998] and produce four outputs.
scope -- the boundaries of the domain. That is, identify what we must address and what we can ignore.
terminology -- the definitions of the specialized terms, or concepts, relevant to the domain.
commonalities -- the aspects of the domain that do not change from one application to another within the domain. We sometimes call these the frozen spots.
variabilities -- the aspects of the domain that may change from one application to another within the domain. We sometimes call these the hot spots.
In the SCV analysis, we must seek to identify all the implicit assumptions in the application domain. These implicit assumptions need to be made explicit in the DSL's design and implementation.
We use the SCV analysis to guide our choices for elements of the DSL design [Mernik 2005, Thibault 1999]. The scope focuses our attention on what we are trying to accomplish. The terminology and commonalities suggest the DSL statements and constructs. The commonalities also suggest the semantics of the constructs and the nature of the underlying computational model. The variabilities represent syntactic elements to which the DSL programmer can assign values.
Drawing on the experience in designing, implementing, and evolving the JMock DSL (which provides mock objects for testing), Freeman and Pryce make four recommendations for constructing an internal DSL in Java [Freeman 2006]. To some extent, these recommendations apply to design of all DSLs.
Separate syntax and semantics (interpretation) into separate layers.
The concern of the syntax layer is to provide a fluent, readable language for users familiar with the application domain. These may not be programmers--or at least not programmers who are experts in the host language.
The concern of the semantic layer is to provide a correct, efficient, and maintainable interpreter for the language. The developers and maintainers of this layer are typically experts in the host language who can take advantage of the implementation language's capabilities and idioms.
As with most nontrivial software development tasks, mixing these concerns can make the implementation difficult to develop, understand, and maintain. It is better to translate the syntax to to an appropriate semantic model (e.g., an abstract syntax tree) for processing, hiding the details of each behind well-designed interfaces.
Use, and perhaps abuse, the host language and its conventions to enable the writing of readable DSL programs.
For internal DSLs, the syntax layer may need to violate the conventional programming styles and naming conventions of the host language to achieve the desired readability and fluency.
For example, DSLs in object-oriented languages may use method chaining and method cascading extensively to achieve the desired fluency. These practices usually discouraged in the usual programming practices.
Don’t trap the user in the internal DSL.
The DSL should encapsulate its internal implementation details to avoid unexpected dependence on implementation features that might change over time and to enable naive users to use the DSL safely.
However, it is difficult to anticipate all possible uses of a DSL over time. Thus it is helpful to enable expert users of the host language to extend the DSL by providing alternative implementations of key abstractions.
The implementation of the DSL should itself be approached as a software family. Although the DSL syntax may look different than the host language, the DSL implementation should seek work seamlessly with other host language programs.
We should track the changes that expert users make. These help identify possible future enhancements of the DSL and its implementation.
Map error reports to the syntax layer rather than to the semantics layer, which is hidden from the DSL user.
Good error reports are critical to a successful DSL. As much as possible, errors in both syntax and semantics should be stated in terms of the syntactic structure of the specific DSL program. This is often difficult to accomplish, but it is important because it is unlikely that the DSL's users will be familiar with the internal details of the implementation.
Deep embedding of DSLs can make it easier to trace errors back to recognizable syntactic structures.
TODO
TODO
In Fall 2016, I adapted and revised much of this work for possible use in CSci 450, but I did not use it that semester. These notes are based, in part, on a previous HTML-source handout on domain specific languages for exercises in Fall 2014 CSci 450 and Spring 2016 CSci 555 classes. The notes draw ideas from several of the references listed in the final section.
In Spring 2017, I continued to develop these notes, adding discussion of the boundaries between DSLs and other computing artifacts, of the Expression Language case study, and of my use of DSLs in the workflow for producing this document.
I continue to develop these notes in Summer and Fall 2017.
I maintain these notes as text in Pandoc's dialect of Markdown using embedded LaTeX markup for the mathematical formulas and then translate the notes to HTML, PDF, and other forms as needed. The HTML version of this document may require use of a browser that supports the display of MathML.
Domain-specific languages (DSLs); language nature; fluency; limited expressiveness; DSLs versus general-purpose programming languages; DSLs versus APIs; external versus internal DSLs; shallow versus deep embedding of internal DSLs; use of algebraic data types to implement DSLs.