Panscript

From LXF Wiki


Panscript is a proposal to reinvent the language of the web. One language for creating, processing and delivering page content, whether on the web or in printed documents or, hopefully, other access media such as speech synthesis and Braille.

Please feel free to join in!

If you want to talk something over before editing this page, use the Discussion page.

The problem with page authoring

Modern web pages are a horrible mix of languages with widely differing grammars and syntaxes. For example a typical client page contains:

  • HTML or XHTML, plus maybe some DHTML. All written in the form:
<element property1="foo" property2="bar" ...>Content</element>
  • CSS. Written in the form:
element.class {property1:foo; property2:bar; ...}
or embedded in the HTML as:
style="property1:foo; property2:bar; ...;"
  • javascript, aka ECMAscript, which looks vaguely like:
instruction1;
instruction2 {
   lots of stuff inside curly brackets
}

Then, the source page actually stored on the server may also include such luxuries as:

  • SSI, PHP, Rails, ASP, JSP, CFML, Wikitext, etc. etc. which all look different again.
  • Maybe some XSLT, SQL or other languages, used to access and pre-process data from back-end resources. Yet more oddities.

Meanwhile it is quite common to want a document available both in printed form and as a set of web pages.

The original content itself may be created in a variety of formats, some open ones being:

  • Wordprocessors, of which ODF (Open Office native format) is at least accessible but horribly verbose to the point of obscurity, and PDF is also common but totally obscure and not semantics-friendly.
  • Bitmap images, these days typically PNG but also GIF and JPEG.
  • Vector images, usually reliant on a proprietary tool to convert them to usable output such as SVG or Flash.
  • Mathematical equations, in TeX/LaTeX or MathML.

Often these source documents will contain features (headers, footers, page breaks and numbering, page cross-references and such) which are unsuited to web pages. Likewise, HTML and friends contain links and stuff which are not suited to print publishing. Re-purposing content can involve a great deal of effort.

It's all horrible! Uh, sorry, did you catch that? It's horrible! All of it!!

The big idea

The idea began when Guy Inchbald got fed up with the prospect of learning one awful language after another - HTML, XHTML, javascript, CSS, PHP, ... just to get a web page looking and working the way it should even when printed out. One language to do it all - why not? All you will need to do is learn one language, dump your stuff on the page and go. So that's what panscript is all about - one language for creating, processing and delivering page content, whether on the web or in printed documents or, hopefully, other access media such as speech synthesis and Braille.

Clearly, this language will be very rich - probably as complex as many human languages. But it should bring major benefits, such as:

  • A single "way of doing things", including:
    • A common syntax used for laying out the stuff.
    • A common grammar for detailing everything unambiguously and readably.
  • The ability to seamlessly embed one level of functionality within another, for example to invoke a programmatic script within a stylesheet.

Other anticipated features include:

  • Modularity, so that certain sub-sets of the language apply only to certain uses - for example the client-side and server-side subsets would differ, though there would be a 'core' subset common to both. Crucially, any user agent would ignore anything that did not belong in the agent's specified sub-set: so a single page can contain (say) both hyperlinking and print formatting.
  • Object-friendly, for compatibility with other widely-used schemas on the existing web.

For example an author could first learn the styling and layout markup, then move on to client-side scripting or page print formatting, already able to understand the basics of the code and needing only to expand their vocabulary a little. Or an application developer could start designing the page layout, again finding they have many of the skills already in place.

XML was the community's first stab at much of this, but it suffers from being long-winded and repetitive (full of word salad) to the point of incomprehensibility - thus defeating the original reason for having all those words in the first place. Also, because every document has to be parsed in full before anything can be done with it, processing it is a slow business - and repeatedly processing all that repetitive word salad is badly clogging up the world of web services. It's time to move on.

Feature wish list

Many of the existing languages have some great features we wouldn't want to lose:

  • HTML. Hyperlinking, of course. Frankly, I can't think of anything else. Oh, yes, except for the cellpadding property for tables - doing that in CSS without endless repetition requires complex class definition. And frames provide a rudimentary form of transclusion - they may have been politically incorrect since the last century but, for some reason known only to those of us who like to get things done, they just won't go away.
  • X(HT)ML. As for HTML, plus XML is modular so you can build different vocabularies for different things. XML also uses a generic parsing engine that understands all languages within the family and so has a small footprint. Another bonus is that, with careful schema design, semantics or meaning is naturally embodied in the names of the markup tags.
  • CSS. Well, it prettifies and it's powerful. And it cascades. But it's otherwise quite horrible.
  • javascript. Limits client-side functionality, in order to improve security and prevent things like covert disk access. Reasonably object-aware.
  • SSI. Brilliant idea for inserting stuff from elsewhere. You can create anything from a simple text block to a complex programmatic function and re-use it over and over in different pages.
  • Wikitext. Dead easy to learn and quick to write, using repetitive key presses to achieve complex variations. And transclusion - Mediawiki's buzzword for its simple-to-use equivalent of SSI.
  • TeX/LaTeX and chums. Mathematical equations and document layout, all done without the yards of verbosity and semantics-vs-style doublespeak that MathML and its chums bring.
  • SVG. Graphical elements in the same syntax as the text content (e.g. XHTML).
  • SQL. The most natural language for talking to databases, although fighting off challenges from XML, Java and the "noSQL" movement. A cool feature of Oracle SQL is aliases. You can set up an alias say a for my_database.client_tablespace.address_table. From then on, every time your query refers to that table, you just type "a".
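To give a flavour of the alias idea, here is a minimal sketch using the SQLite engine bundled with Python rather than Oracle. Table aliases of this kind are ordinary SQL and last for the duration of one query. The shortened table name and the sample rows are made up for illustration.

```python
import sqlite3

# An in-memory database with a made-up address table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE address_table (name TEXT, town TEXT)")
conn.executemany(
    "INSERT INTO address_table VALUES (?, ?)",
    [("Holmes", "London"), ("Poirot", "London"), ("Marple", "St Mary Mead")],
)

# "a" stands in for address_table everywhere else in the query.
rows = conn.execute(
    "SELECT a.name FROM address_table AS a WHERE a.town = ? ORDER BY a.name",
    ("London",),
).fetchall()
```

The longer the real table path, the more the one-letter alias saves, which is the property panscript wants to borrow.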

Sheesh! How we have to cherry-pick to get a decent feature set!

Structures and models

Different kinds of language have different ideas about how to organise information.

  • Markup languages such as HTML tend to separate information into structure, content and style.
  • Programming languages such as javascript tend to distinguish instructions from data, with various kinds of formal breakdown into finer distinctions such as constants and variables.
  • Data access languages often have a hierarchical approach to data structures. SQL is pretty much based around tables, while networks and file systems have their own hierarchical structures and ways of organising them.
  • Object-oriented approaches recognise classes, properties and methods, besides of course objects. This is often overlaid on one of the other models, for example the XML DOM (Document Object Model) is overlaid on the XML markup.

So one problem is how to respect and interact with all four kinds of model. If we want recursive hierarchies, for example calling a script from within a stylesheet that is itself within another script, then we will need to be very clear about these things.

A particular problem occurs with the concept of style, where the W3C community has got into a pretty muddle:

  • In "Strict" XHTML, style is regarded as pure visual decoration, a property of some existing structural element. These properties are specified using a separate, specialist language such as CSS.
  • Web designers must often create arbitrary structural elements (such as divs, spans and tables) that have no other function than to provide visual styling. Such visual effects include complex backgrounds, pretty interactive buttons, and boxes around groups of form (GUI) elements that are related at a "What-I-want-to-do" level.
  • When formatting content for delivery in other media, such as page-based printing, then XML style languages such as XSL-FO invoke more complex "style" elements such as headers, footers, page numbering, paper sizes, the special requirements of output formats such as PDF or Braille, and so on.
  • In XSLT, "style" takes on a wider meaning in yet another way, to include the re-purposing of data by, for example, filtering the output from a database query or converting an XML page to a different structural schema.

We need to understand the differing natures of all these kinds of "style", and how to relate to them. For now, notice that some come closest to "markup", others to "programming".

Another problem revolves around the identifying of different types of structural element. Here are some examples:

  • Data typing. What types of data are allowed (n-bit integer, floating point, text, etc), and how can the type of a piece of data be changed if it needs to be? PHP is loosely typed, which saves a lot of pain, but when it decides to change the type for some reason the consequences are not always predictable.
  • Schema design. In a database, should postal addresses be stored as properties (columns) of people's names, or should they have their own address table, or should each part (Line1, Line2, town, county, postcode) have its own table? How should a web form be designed, if an existing database schema does not suit the form entry fields needed by the user?
  • When is a piece of code to be executed, and when is it to be treated as text or data to be acted upon by other code? For example a client-side script must be passed unprocessed by the server, but then executed by the client.

Verbosity

Back in the bad old days there came a time when most presentation formats, such as postscript pages or Word documents, had source code that was at best barely human-readable.

XML was intended to cure this by using long-winded "human-readable" markup, and was described by its designers as a "verbose" language. Sadly, it all went too far, causing two unfortunate effects:

  • Word salad. A typical XML document is so full of long wordy-wordy markup that its content and meaning are as hard to discover as ever. OK they are readable, but you have got to find them in the salad bowl first! This is made all the harder because XML has worked so hard to make the markup look as much like ordinary text content as possible. So you're not looking for tomatoes in a green salad, you are looking for bits of Webb's Wonder in among all the other lettuces.
  • Process-intensiveness. All that verbiage has to be pulled in and ploughed through by the poor computer, every time it opens the document. If it's a plain parser (i.e. not using the DOM), it then has to plough through that salad bowl every time it references anything in the document. With the rise of multi-layered XML-to-XML connectivity in the delivery web services, this is becoming a serious obstacle to effective delivery, and while computers grow faster every year, it will be a long time before battery-hungry mobile devices get powerful enough to cope.

Yet any practical web language must be human-readable. This is because the humble text editor is just too darn useful ever to die - it will always be an important development tool.

So -- how to square the circle?

Web 2

"Web 2" means different things to different people.

To some it is the interactive web - blogs, wikis, instant messaging and YouTube. This is one of the main areas where a single development language, which is also a simple-to-learn markup language, can bring great benefits.

To others, Web 2 is the semantic web. There are two approaches to embodying semantics in web content:

  • Create yet another specialist markup language to embody the semantic meaning of each element. An example would be an XML implementation of RDF.
  • Add semantic extensions to an existing markup language. (X)HTML microformats do this by adding semantic values to element properties such as class or rel.

Semantics can be multi-layered. Consider an XHTML (strict) element with some microformatted information, such as this fragment from an imaginary article on Sherlock Holmes:

<h2><span class="address">221b Baker Street</span></h2>

The "h2" tag provides semantic information about the place of the fragment within the article - it is a second-level heading. Meanwhile the "address" value provides semantic information about the place of the fragment within Sherlock Holmes' life. Now, suppose we want to add some arbitrary third kind of semantic, say that the address is fictional, or that further information is available to subscribers, or ... . We need a general, extensible semantic framework whose syntax does not distinguish between levels in the way that XHTML tends to.

Also, we want to avoid the risk of snowballing complexity. Suppose I want to add markup compatible with both a "semantic web" XML standard foo and a popular microformat vBar. Here is another imaginary fragment which is trying to do this:

<p class="address">
  <foo:FOO xmlns:foo="http://www.w3.org/2009/04/21-foo-syntax">
    <foo:address class="vBar">
      <foo:line1 class="firstline">221b Baker Street</foo:line1>
    </foo:address>
  </foo:FOO>
</p>

But will the separate standards be able to recognise or ignore each other's markup as required? Will a vBar engine find its marker inside a foo markup tag? How do we indicate that the paragraph class references a CSS stylesheet, while the foo:address class references the vBar microformat? And so on. Oh, and fancy debugging that load of spaghetti? Call it human-readable? I don't!

So it would be nice to define a single "right" way of doing things. One thing seems clear: all those semantic elements and namespace URIs wished on us by XML are just horrible (and the more they get used, the more they weigh web services down with the massive processing overload of all that verbiage). Adding semantics as properties of a single element is far neater. Here's something more like the kind of structure I envisage:

<p class="address" foo="address.line1" vBar="address.firstline">221b Baker Street</p>

And if you really want to link with your XML based web services, then you can knock up a nice XSLT on your application server and install a couple more CPUs, can't you? :-p

Other languages

No language, no matter how powerful, will ever exist in isolation. It will always have to interact with other languages. So it must be made possible, even easy, to embed "islands" in a foreign language structure, and likewise to embrace foreign islands.

Why "panscript"?

I wanted a good, memorable name. I also wanted it to start with 'P', so that the "LAMP" architecture could adopt it seamlessly (dream on!).

Pan was the ancient Greek god of shepherds and their flocks, so who better to name my language after than someone who looks after lots of similar things? (Useless factoid: Pan was also the god of popular music. In Kenneth Grahame's book The Wind in the Willows, Pan makes an appearance as the piper at the gates of dawn. Rock fans will immediately recognise the title of the first Pink Floyd album.)

The prefix Pan-, meaning "all", is also ancient Greek, so panscript has a neat double-meaning.

Roadmap

This needs a collaborative effort. I have neither the time nor the skills to do it all myself.

I don't intend to go very deep until the basic idea and syntax have been thrashed out. Think of this page as a kind of working whiteboard.

  1. Set down an intelligible draft of the top-level scheme of things and the basic syntax. - done.
  2. Choose a name and move the content of this page there. - done.
  3. Get the top-level scheme of things knocked into shape, and the basic syntax defined.
  4. Define the definition, media.text and comment syntax and options to a basic functional level. This also needs to include links for hyperlinks, transclusion and images.
  5. Develop a proof-of-concept Viewer (possibly an XSLT / Firefox plugin or a bit of PHP).
  6. Extend this roadmap and keep following it. :-)

Current issues

Things to sort out with the draft spec:

  • You may notice that some thing or another has more than one name. This is because I am still making my mind up what to call it, but I don't want to stop writing about it in the meantime.
  • There are some places where material might be added, whose effect is not specified. For example in a panscript definition element's content, but outside of any contained elements: [[ rogue material [contained element] ]]. Need to iron these out.
  • Not too happy at the way closing ids can be subsumed into the properties if an element has no content. Need to define how properties are specified, before I can figure the cleanest approach.
  • Alias declarations need to be formalised. Some options/thoughts:
    • Make them all properties of the Definition element. This ensures the alias is defined before it is encountered in the code.
    • Make them properties of the associated class/element. Makes parsing a nasty business, if the alias gets used before it is defined.
    • Define class aliases using one syntax, and element aliases (c.f. bookmarks) another. This may be needed, as the former acts more widely than the latter.
  • I have not addressed semantics, beyond the brief discussion of The big idea.

Recent thoughts

These need thinking through and working back into the page:

  • Need to make it clearer that the generic semantic is an overall container in which different sub-languages can be written (somewhat similar to the concept in XML, but it needs to be a bit more flexible).
  • HTML 5 addresses more weaknesses of HTML 4 than I realised (but not all!).

pangraph

pangraph is a vector graphics language suited to hand-coding and to being parsed into SVG by a wiki server.

I'm sketching out the basics without reference to the present panscript syntax. The idea is to then compare the two syntaxes and pick the best features of both.

Basic concepts

The re-use of modular code leads to the idea of a document as a collection of disparate fragments, stitched together by some common framework. So the highest-level constructs, and where we need to start, are those that create this common framework: a language for stitching fragments into, whether the fragment is written in the same language or another one. The parallel with XML data islands should not be missed.

Basic syntax

The fundamental building block of Panscript is a plain text file called a module or script.

Modules are re-usable - any module will probably invoke many other modules.

Structure of a module

A script contains a hierarchy of elements, or objects.

Just to give a flavour, here is a simple "Hello world" example:

[p0.1/My First script; Copyright Guy Inchbald, 2007. Licensed under the GPLv3.
[[
    [m [[Hello world]] ]
]]
My First script]

The detailed syntax borrows a little from the MediaWiki idea of repetitive key strokes, preferably unshifted. The general syntax for a Panscript element comprises a sequence of entities:

[class/id; properties [[content]] id]

where the entities are defined as:

  • [...] (square brackets) are the opening and closing tags which bound the element.
  • / (forward slash) is a separator between the element's class and name, or identifier, while ; (semicolon) is the separator between the id and the properties. All elements and nested elements within the script are assumed to be numbered sequentially, starting from 0 (zero). The whole of /id; is optional - if it is omitted then the sequential number is used as the id.
  • The second, closing occurrence of the id is also optional. Its purpose is to make the code more readable, and to help track down typos (missing or extra brackets).
  • The properties are also optional.
  • The content is bounded by double square brackets [[...]]. It may be empty, or may contain the basic information expected (text, code, etc), and/or may contain one or more nested elements. If it is empty and there is no closing id, then the [[ ]] may be omitted (or, to put it another way, if [[ ]] is omitted and the closing id is present, then the closing id will be treated as a property).
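The rules above can be sketched as a small recursive parser. This is only an illustration under stated assumptions, not a reference implementation: it ignores escaping, assumes the properties contain no square brackets, treats the first "/" in the header as the class/id separator, and numbers elements in the order their opening brackets appear. The names Element and parse_element are invented for the sketch.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Element:
    cls: str
    id: str
    properties: str = ""
    content: list = field(default_factory=list)  # text chunks and nested Elements
    closing_id: str = ""


def _add_text(elem, text):
    # Keep a text chunk only if it is more than white space.
    if text.strip():
        elem.content.append(text.strip())


def parse_element(src, pos=0, numbers=None):
    """Parse the element whose '[' is at src[pos]; return (Element, next index)."""
    if numbers is None:
        numbers = count()              # sequential ids start from 0
    if not src.startswith("[", pos):
        raise ValueError(f"expected '[' at offset {pos}")
    number = str(next(numbers))
    pos += 1

    # Header: everything up to the content opener '[[' or the closing ']'.
    end = pos
    while end < len(src) and src[end] not in "[]":
        end += 1
    header = src[pos:end].strip()
    if "/" in header:                  # class/id; properties
        cls, _, rest = header.partition("/")
        ident, _, props = rest.partition(";")
    else:                              # no /id; so the sequential number is the id
        cls, props = (header.split(None, 1) + ["", ""])[:2]
        ident = ""
    elem = Element(cls.strip(), ident.strip() or number, props.strip())
    pos = end

    if src.startswith("[[", pos):      # optional [[content]]
        pos += 2
        text_start = pos
        while not src.startswith("]]", pos):
            if pos >= len(src):
                raise ValueError("unterminated [[content]]")
            if src[pos] == "[":        # a nested element begins here
                _add_text(elem, src[text_start:pos])
                child, pos = parse_element(src, pos, numbers)
                elem.content.append(child)
                text_start = pos
            else:
                pos += 1
        _add_text(elem, src[text_start:pos])
        pos += 2
        close = src.find("]", pos)     # anything before ']' is the closing id
        if close < 0:
            raise ValueError("missing closing ']'")
        elem.closing_id = src[pos:close].strip()
        pos = close
    if not src.startswith("]", pos):
        raise ValueError(f"expected ']' at offset {pos}")
    return elem, pos + 1


source = """[p0.1/My First script; Copyright Guy Inchbald, 2007. Licensed under the GPLv3.
[[
    [m [[Hello world]] ]
]]
My First script]"""
root, end = parse_element(source)
```

Run on the "Hello world" script, the outer element keeps its explicit id while the nested m element falls back to its sequential number, 1.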

White space

In general, any sequence of white space characters (spaces, tabs, returns) acts as a simple separator, as if it were a single space character. Exceptions occur for certain kinds of text content. Where the Panscript syntax shows no space between entities, white space may be freely inserted, for example the following is equally valid:

[class /id ;
   properties
   [[
 
      content
 
   ]]
id ]
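One way to check that two layouts really are equivalent is to break each into tokens and compare the results. The sketch below does this with a single regular expression. It is a simplification: it treats / and ; as separators everywhere and takes no account of the text-content exceptions mentioned above. The names TOKEN and tokens are invented for the example.

```python
import re

# Delimiters first (so '[[' wins over '['), then runs of ordinary characters.
# White space never appears in a token; it only separates them.
TOKEN = re.compile(r"\[\[|\]\]|[\[\]/;]|[^\s\[\]/;]+")


def tokens(source):
    """Return the list of tokens in source, with all white space discarded."""
    return TOKEN.findall(source)


compact = "[class/id; properties [[content]] id]"
spread = """[class /id ;
   properties
   [[

      content

   ]]
id ]"""

same = tokens(compact) == tokens(spread)
```

Both layouts yield the same eleven tokens, so a parser working from the token list cannot tell them apart.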

Aliases

An alias is an alternative id for an element or class. For example if something called thingummajig.whatsit exists, then we might want to create thing as an alias for it. Then, every time we need to reference the thingummajig.whatsit, we need only write thing.

This allows our code to be human-readable, but not to run away with the word salad problem.

One or more aliases may be established for any element or class. The default id for any element is its sequential number in the script. Any other id provided is effectively an alias for this number.

Some aliases are reserved (predefined), others may be user defined.
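As a rough sketch of how alias lookup might behave, the table below maps each alias to its target and follows chains until it reaches a name that is not an alias. The reserved entries are the short forms proposed elsewhere on this page; the class name AliasTable and its methods are invented for the example, and refusing to redefine a reserved alias is an assumption rather than a settled rule.

```python
class AliasTable:
    # Short forms proposed on this page for the top-level classes.
    RESERVED = {"p": "panscript", "m": "media", "x": "executable",
                "d": "data", "c": "comment"}

    def __init__(self):
        self._targets = dict(self.RESERVED)

    def define(self, alias, target):
        """Register a user-defined alias (assumed: reserved ones are fixed)."""
        if alias in self.RESERVED:
            raise ValueError(f"{alias!r} is a reserved alias")
        self._targets[alias] = target

    def resolve(self, name):
        """Follow aliases until a name that is not itself an alias is reached."""
        seen = set()
        while name in self._targets:
            if name in seen:
                raise ValueError(f"circular alias involving {name!r}")
            seen.add(name)
            name = self._targets[name]
        return name


table = AliasTable()
table.define("thing", "thingummajig.whatsit")
table.define("My First script", "0")   # an explicit id is an alias for the number
```

With this in place, table.resolve("thing") gives the full name, and the script's title resolves to its sequential number 0.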

Escaping text

Text markup code always needs an escape system so that it is possible to include reserved code characters like [ and ] in text.

It is tempting to reserve \ for the single-character escape as in \[ and \], including escaping a \ character as in \\.
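A minimal sketch of the single-character scheme, assuming \ is indeed chosen as the escape and that only [, ] and \ itself are reserved. The function names are invented for the example.

```python
RESERVED = "[]\\"   # the two bracket characters plus the escape character itself


def escape(text):
    """Put a backslash in front of every reserved character."""
    return "".join("\\" + ch if ch in RESERVED else ch for ch in text)


def unescape(text):
    """Drop each escaping backslash and keep the character that follows it."""
    out = []
    chars = iter(text)
    for ch in chars:
        if ch == "\\":
            try:
                ch = next(chars)       # the escaped character, taken literally
            except StopIteration:
                raise ValueError("dangling backslash at end of text") from None
        out.append(ch)
    return "".join(out)


sample = "array[0] = C:\\temp"
escaped = escape(sample)
```

Escaping and then unescaping returns the original text, which is the property any scheme chosen here needs to have.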

To escape a string, approaches to consider are:

  • Distinct start and end escape tags (perhaps even create an 'escape' element to contain the string). But how do you escape an end tag (as you will want to in the documentation explaining it all), i.e. how can we tell whether a given occurrence of the closing sequence is a genuine tag or its text representation? It's recursive, you can get in a mess.
  • An escape tag such as three backslashes to both begin and end the sequence: \\\ escaped string goes here \\\. "Passive" escape tags in the string would still need to be actively escaped, as \\ for a single backslash and \\\\\\ for three consecutive backslashes. The odd number - 3 - of backslashes ensures that it cannot be confused for a series of escaped backslashes. But what happens if we meet say \\\[ - is this the string escape with [ as the first character in the string, or an escaped \ followed by an escaped [ ? One possible answer - for a string beginning with a reserved character, add two further backslashes: \\\\\[....
  • In text, use an alternative entity that is rendered the same as the reserved character. Numeric character codes are often used, while HTML uses for example &gt; to escape >. Effective, but clunky if needed a lot and there are all those codes to learn.
  • Add a count to the escape tag, e.g. \\\20 escapes the next 20 characters. But who wants to keep count?

Element classes

There are (provisionally) five top-level classes of element (kinds of stuff) that a script can contain:

  • script definition. This is the highest level, and contains any or all of the remaining four.
  • Content for rendering.
  • Executable code.
  • Information used by something else, such as data tables or style specifications.
  • Hidden or passive material such as comments (Could this be treated as data?).

Anything outside the script definition will be ignored. This allows a script to be embedded in other kinds of language.

The script definition

The first element in any script is its definition. This element contains all the others. The syntax for the definition uses the following values:

[panscriptversion/Name of script; properties [[content]] Name of script]

Where:

  • class = panscriptversion, where:
    version is the version number of the Panscript specification. If the version is not provided, the local host should default to the most recent stable Panscript specification available to it.
    the alias p is defined, allowing pversion. Where a script is embedded in another language it may or may not be possible, or wise, to do this.
  • id is the name or title of the script, and is an alias for 0 (zero). Where no id is provided, the default name will be "0". Where a script is embedded in another language, an alias must be provided.
  • properties include things like copyright notices, script version number, and so on.
  • content is the remainder of the script.

For example, here is an empty script (i.e. with no content):

[p0.1/Empty script; Copyright Guy Inchbald, 2007. Licensed under the GPLv3. Empty script]

Rendered (media) content

This is general media content (text, graphic, etc. maybe eventually audio and stuff) to be rendered by the viewing agent.

[media/id; format [[content]] id]

Where:

  • class = media:
    the alias m is defined, allowing m to be written in place of media.
  • id is the name of the content element, by which it can be found (similar to an HTML bookmark).
  • format defines the media type and format which the user agent should assume. Where format is not present, Unicode UTF-8 text is the default.
  • content is the actual media content.

Here is a very simple "Hello world" example:

[m [[Hello world]]]

To create a functional script we put it inside a top-level object, something like:

[p0.1/Example script; Copyright Guy Inchbald, 2007. Licensed under the GPLv3. 
   [[
   [m [[Hello world]]]
   ]]
Example script]

I may come back and define some sub-classes such as media-text, media-image (aliases t and i respectively). Who knows.

Executable code

I don't know a lot about programming languages in general, but here's what seems to be a workable approach:

[executable/id; parameters [[instructions]] id]

Where:

  • class = executable:
    the alias x is defined, allowing x to be written in place of executable.
  • id is the name of the code element, by which it is called by other code elements.
  • parameters include variables passed to/by the element, and any correspondences between global and local names.
  • instructions is the executable code.

Data

Data is stuff that is available for other elements, such as executables or content, to draw on.

[data/id; format [[data]] id]

Where:

  • class = data:
    the alias d is defined, allowing d to be written in place of data.
  • id is the name of the data element, by which it is found by other elements.
  • format defines the purpose and layout of the data. Examples will probably include "array nxn", "csv", "hex", "style", and so on.
  • data is the actual data content.

Passive comments

Comments are indispensable for adding helpful explanations and for hiding unused stuff.

[comment/id; comment]

Where:

  • class = comment:
    the alias c is defined, allowing c, as in [c comment].
  • id is the (optional) name of the comment, by which it can be identified by the user.
  • comment is the actual comment.

Note that a comment has no content entity, and is effectively an empty element. Any nested elements within the comment entity will be ignored. Any closing id will also technically be treated as part of the comment field, though this does not matter. Thus, in the following element, the "[[ ]] id" is all treated as contained within the comment field.

[comment/id; comment [[ ]] id]

This is a bit unsatisfactory, as it breaks a basic rule of grammar about the [[ ]] container. But it is necessary, since commenting-out blocks of code will often place such brackets in the comment area. Well, it's not strictly necessary, but writing [c [[comment]]] every time would be more of a pain than [c comment].
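To illustrate the commenting-out behaviour, here is a rough sketch that removes comment elements from a script by balancing square brackets, so that any elements nested inside the comment go with it. It assumes brackets inside a comment are balanced and unescaped, and that a comment's opening bracket is never written hard up against a preceding [. The name strip_comments is invented for the example.

```python
import re

# A '[' not preceded by another '[', then 'comment' or its alias 'c',
# then a character that ends the class name.
COMMENT_OPEN = re.compile(r"(?<!\[)\[(?:comment|c)(?=[\s/;\]])")


def strip_comments(src):
    """Return src with every comment element removed."""
    out = []
    pos = 0
    while True:
        m = COMMENT_OPEN.search(src, pos)
        if m is None:
            out.append(src[pos:])
            break
        out.append(src[pos:m.start()])
        depth = 0
        i = m.start()
        while i < len(src):            # find the ']' that balances the opener
            if src[i] == "[":
                depth += 1
            elif src[i] == "]":
                depth -= 1
                if depth == 0:
                    break
            i += 1
        if depth != 0:
            raise ValueError("unterminated comment")
        pos = i + 1
    return "".join(out)


script = "[p0.1/Demo; [[ [c hidden [m [[old]]] ] [m [[Hello]]] ]] Demo]"
visible = strip_comments(script)
```

The commented-out m element disappears along with its comment wrapper, while the live one is left untouched.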

Sub-classes

Many sub-classes of the high-level classes will be needed. The syntax is simply:

class.subclass

You may wonder why there are so few top-level ones. For example there are likely to be sub-classes such as javascript, css, image, heading1 and so on. Why are these commonplace things not high-level classes themselves? Wouldn't that make them easier to type, too?

Well, firstly, we can distinguish for example x.javascript from m.javascript and c.javascript. The first of these will be executed, the second treated as rendered media (text) content (very useful for tutorials on javascript!) and the third is commented-out. So the developer can plug code in and out, try it out and present it to the reader and so on, and move from one mode to another just by changing a single character in the code.

Secondly, using aliases we can create javascript or ecma or js or whatever as an alias for executable.javascript. So when we want to add some javascript we don't have to write <script class="javascript"> ... </script> or even [x.javascript [[ ... ]]] but simply [js [[ ... ]]].

So along with the many standard sub-classes that will be needed, there will probably be even more standard aliases.
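The one-character switch between modes can be sketched as a lookup on the top-level part of the class name. The action names ("execute", "render" and so on) and the js alias entry are placeholders chosen for the example, not part of any agreed specification.

```python
# Short forms for the top-level classes, as proposed on this page.
SHORT_FORMS = {"m": "media", "x": "executable", "d": "data", "c": "comment"}

# A hypothetical standard alias for a whole class.subclass name.
STANDARD_ALIASES = {"js": "executable.javascript"}

# What a user agent might do with each top-level class (names are placeholders).
ACTIONS = {"executable": "execute", "media": "render",
           "data": "store", "comment": "ignore"}


def action_for(cls):
    """Decide how to treat an element from the top-level part of its class."""
    cls = STANDARD_ALIASES.get(cls, cls)      # js -> executable.javascript
    top = cls.split(".", 1)[0]                # executable.javascript -> executable
    top = SHORT_FORMS.get(top, top)           # x -> executable
    return ACTIONS[top]


modes = [action_for(c) for c in ("x.javascript", "m.javascript", "c.javascript", "js")]
```

The same javascript sub-class is executed, rendered or ignored depending only on the single leading character.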

There might be a need to create further sub-subclasses, such as media.heading.2 or media.list.ordered, and so on. Again, aliases make such things manageable.

The outside world

Links

Links to other objects - Panscript modules or anything else - are embedded in one of the main element types. Not yet sure whether they go in as properties or content, or either depending on their purpose.

The Panscript language is designed so that paths such as high/medium/low/lower/nearly reached me/hello blur away the structural implementation - which is the filename, which the script definition, etc. I have a gut feeling that this is a Good Thing, but need to flesh the principle out a bit.

Embedding modules in other languages

A typical code object such as a web page or a script may contain several languages. Where multiple Panscript modules are embedded in such a page, each script definition must have an explicit and distinct id. Otherwise they would all default to "0", and it would not be possible to find any given script.

Going to script 0 will always find the first script in the page. If there is only a single script in the page then you can get away with the default id, but this is not recommended in case you later come back and add another script before it.
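Here is a rough sketch of how a host page might be searched for embedded scripts. It assumes a script definition can be recognised by [p or [panscript, an optional version number and then /id, and that brackets inside a script are balanced and unescaped. Everything else in the page is skipped. The function names are invented for the example.

```python
import re

# Assumed opening of a definition: "[p" or "[panscript", version, "/", id.
DEFINITION = re.compile(r"\[(?:panscript|p)[0-9.]*\s*/\s*(?P<id>[^;\[\]]*)")


def closing_index(text, start):
    """Index of the ']' that balances the '[' at text[start], or -1."""
    depth = 0
    for i in range(start, len(text)):
        if text[i] == "[":
            depth += 1
        elif text[i] == "]":
            depth -= 1
            if depth == 0:
                return i
    return -1


def find_scripts(page):
    """Return (id, source) pairs for each script in page, in document order."""
    scripts = []
    pos = 0
    while True:
        m = DEFINITION.search(page, pos)
        if m is None:
            break
        end = closing_index(page, m.start())
        if end < 0:
            break                      # unbalanced brackets: give up
        scripts.append((m.group("id").strip(), page[m.start():end + 1]))
        pos = end + 1
    return scripts


def get_script(page, ref):
    """Find a script by id; "0" always means the first script in the page."""
    scripts = find_scripts(page)
    if ref == "0" and scripts:
        return scripts[0][1]
    for ident, source in scripts:
        if ident == ref:
            return source
    raise KeyError(ref)


page = (
    "<html><body>"
    "[p0.1/First; demo [[ [m [[Hello]]] ]] First]"
    "<hr>"
    "[p0.1/Second; demo [[ [m [[World]]] ]] Second]"
    "</body></html>"
)
found = [ident for ident, _ in find_scripts(page)]
```

Asking for "0" returns whichever script happens to come first, which is why adding a script earlier in the page would silently change what "0" points at.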

Embedding other languages in modules

These are embedded as the content of an appropriate kind of high-level module.

Specifying the language might be done as a property of the module, or as a sub-class. Haven't thought about this yet.