[FoRK] Programming lang etc. (details for Stephen, comment for JAR)

Stephen Williams sdw at lig.net
Sat Nov 14 02:09:29 PST 2009


Jeff Bone wrote:
>
> Re:  JAR:  exactly.  Understood, agreed.
> ...
>
> Basically there are three general use cases for such things:  
> human-to-human (either different humans or same-human, over either 
> space or time), human-machine (config files, output files for human 
> consumption, etc.) and machine-to-machine (most markup scenarios, 
> realistically speaking;  OTW protocols and serialization formats, 
> etc.)  I contend that a big part of the problem is the baked-in 
> assumption that you have to optimize on one or at most two of these.  
> OGDL, YAML, various wiki markups, UNIX cookie jars and record jars, 
> and other examples abound to the contrary.  And the biggest problem 
> faced in any of these scenarios today, IMHO, is the lack of 
> type-safety in representation coupled with tenability in the reading 
> and writing dimensions.  Common wisdom would have it that you can't 
> have your lunch and eat it too, particularly w/ tradeoffs in parser 
> complexity (as in, inherent computational complexity) --- but I think 
> we've got far better potential state-of-the-art at present than we're 
> seeing used anywhere...

Some initial thoughts:
OGDL is pretty good: among the simplest, most flexible, and most 
readable tree/graph capable data formats.  It has a few flaws:
    Values should more easily be arbitrary.  OGDL needs a << mechanism.
    There should be some optional way to include types with a value.  
This could be done via a path though.
    The path syntax isn't bad, however it could use a lot more 
expressiveness, perhaps via scripting escape.  Plus graph regex.  See 
SPARQL, XPath.
    Various simple additions would be useful and still not make it very 
complex.  For instance:
        Need a dereference path operator, with a shortcut for "use last".  
This allows saying "this value is the same as what is at that path".  
This might be a good choice for a general variable / expression 
replacement mechanism.  ($PATH...)
        Should be able to reference a schema (as below) via URI.
        Should support namespaces, with implied namespaces being 
common.  All Linux commands could share the Linux name space and common 
types (IP address, path...) for instance.
        Both '.' and '/' are good path separators, sometimes in different 
circumstances.  Occasionally yet another separator would be useful.
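
A rough sketch of the dereference idea, in Python rather than OGDL.  It 
assumes the tree is held as nested dicts and that a value starting with 
"$" means "same as the value at that path".  The "$" convention here and 
the names lookup / resolve are only illustrations, not OGDL syntax.

```python
def lookup(tree, path):
    """Walk a '.' or '/' separated path through nested dicts."""
    node = tree
    for key in path.replace("/", ".").split("."):
        if key:  # skip empty segments from leading or doubled separators
            node = node[key]
    return node


def resolve(tree, path, max_hops=16):
    """Look up a path, following hypothetical '$path' references."""
    value = lookup(tree, path)
    hops = 0
    while isinstance(value, str) and value.startswith("$"):
        hops += 1
        if hops > max_hops:
            raise ValueError("reference chain too long (cycle?)")
        value = lookup(tree, value[1:])
    return value


# Illustrative data only: eth1 reuses the mask defined under eth0.
config = {
    "eth0": {"ip": "192.168.0.10", "mask": "255.255.255.0"},
    "eth1": {"ip": "10.0.0.1", "mask": "$eth0.mask"},
}
```

Here resolve(config, "eth1.mask") follows the reference to eth0's mask, 
and "eth0/ip" works as well as "eth0.ip" since both separators are 
accepted.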

All data instances are on a spectrum of self-describingness that ranges 
from almost no self description to almost complete.  I call this the 
"degree of externalization".  You can think of this as the amount of 
information that must be shared between sender and receiver through some 
other channel than the data instance.  Packed binary bits are not 
self-describing; XML, or better XML+schema, is closer to the fully 
self-described end.  As you can find in some of what I wrote for W3C EXI 
(slightly expanded in concept but more concise here), a full description 
of data always consists of:

o  Identity (tag / field names, semantic concept identification, etc.)
o  Structure (framing, hierarchy, containment, association)
o  Type definition (scalars, high level)
o  Validation information (mostly what XML and OGDL schemas are for)
o  Redundantly removed information (data or metadata: every structure 
contains exactly these elements so you can count on that in encoding, 
these are frequently used strings / structure / template, this is the 
frequency of occurrence for character / word / type...)
o  Encoding choices (encode this as a 32 bit network order int, UTF-8 
string, restricted range character string (with escapes, I invented 
that...))

Whatever is not explicit in the schema (if used) or in the encoded data 
instance is implicit, either in the code (parse, serialize, application) 
or in human interpretation.
Frequently, a particular encoding+schema has information that can be 
used for more than one purpose.  For instance, while XML Schema was 
designed to be used for validation, XML EXI uses it as a source for 
encoding choices, type definition, and redundancy removal.  This allows 
a schema-informed EXI mode to produce a nearly optimal binary 
representation of the data with everything externalized, while still 
being able to encode arbitrary streams of XML.
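
To make the two ends of the spectrum concrete, here is a small Python 
sketch (not EXI itself) that encodes the same record both ways.  The 
record, its field names, and the LAYOUT string are made up for 
illustration; the point is only that the packed form is tiny because 
identity, structure, and type all live outside the data.

```python
import json
import struct

# Illustrative record; the field names are assumptions for this sketch.
record = {"port": 8080, "ttl": 64, "addr": (192, 168, 0, 10)}

# Self-describing end: names and structure travel with the data.
as_json = json.dumps(record).encode("utf-8")

# Externalized end: sender and receiver must already agree on this layout
# (network byte order: unsigned 16-bit port, unsigned 8-bit ttl,
# then 4 unsigned address bytes).
LAYOUT = "!HB4B"
as_packed = struct.pack(LAYOUT, record["port"], record["ttl"], *record["addr"])


def unpack(data):
    """Rebuild the record; names and types come from shared knowledge."""
    port, ttl, *addr = struct.unpack(LAYOUT, data)
    return {"port": port, "ttl": ttl, "addr": tuple(addr)}
```

The packed form is 7 bytes and is meaningless without LAYOUT and the 
field names in unpack(); the JSON form is several times larger but can be 
read with no other channel.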

YAML is interesting because while being reasonably simple, it is also 
reasonably readable (although not nearly as concise as OGDL), while also 
communicating both type and preferred language-level data structure 
(map, hash, vector).  The OGDL schema is more at the validation end of 
the pool.  I'm far more interested in tight information communication.  
Validation is secondary, necessarily incomplete at the schema level, and 
not desirable anyway in many real-world circumstances.

So far, in this round of consideration, I'm interested in some minor 
additions to OGDL, unification with YAML/JSON (either/or), and 
particularly: a YAML block, without data, could be used as a fuller 
schema for OGDL data.  For instance, in the ifconfig example, all the 
children are map entries, some values are IPv4 addresses, there is a set 
instance, etc.  Since the map keys are in the data, the schema would 
just have ":" for instance.  For data that didn't include the tag/key, 
the schema might have "physical:" to add it.  Ideally, this could be 
packaged into a dense schema line in many cases.  Unix commands could 
output the schema line if requested, with the data in OGDL by default or 
in YAML by request.
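
A sketch of what "schema kept apart from the data" might look like, in 
Python.  The data is an untyped tree of strings, as a text format would 
deliver it, and a separate hypothetical schema says which keys hold IPv4 
addresses, integers, or sets.  The key names, the SCHEMA table, and 
apply_schema are all assumptions for illustration, not OGDL or YAML 
features.

```python
import ipaddress

# Hypothetical schema: key name -> converter for that key's value.
SCHEMA = {
    "inet": ipaddress.IPv4Address,
    "mask": ipaddress.IPv4Address,
    "mtu": int,
    "flags": lambda text: set(text.split()),
}


def apply_schema(data, schema):
    """Return a typed copy of a string-only tree, using an external schema."""
    typed = {}
    for key, value in data.items():
        if isinstance(value, dict):
            typed[key] = apply_schema(value, schema)  # nested map
        else:
            typed[key] = schema.get(key, str)(value)  # default: keep string
    return typed


# Illustrative ifconfig-like data, all values still plain strings.
raw = {
    "eth0": {
        "inet": "192.168.0.10",
        "mask": "255.255.255.0",
        "mtu": "1500",
        "flags": "UP BROADCAST RUNNING",
    }
}
typed = apply_schema(raw, SCHEMA)
```

The data stays as terse as before; everything about types was supplied 
through the other channel.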

Clearly, all common logical (email, URI, date, GPS, image, etc.) and 
physical types should be directly supported in some sense.  At least 
some should have formal and sloppy versions (date) for machine-complete 
vs. user-entered / legacy data.  Layered encoding specifications should 
be made for a superset of RDF/OWL/NoSQL semantic data, SQL data 
(preferably with semantic schema mapping upconversion, i.e. defining the 
missing predicate (in the canonical RDF triple) in terms of 
columns/keys/foreign keys), etc.
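
As an illustration of formal vs. sloppy versions of one type, here is a 
minimal Python sketch for dates.  The function names and the list of 
accepted sloppy formats are assumptions chosen for the example; a real 
sloppy parser would need far more care (ambiguous day/month order, 
locales, partial dates).

```python
from datetime import datetime


def parse_formal(text):
    """Formal version: accept only year-month-day, e.g. 2009-11-14."""
    return datetime.strptime(text, "%Y-%m-%d").date()


# Formats tried in order for user-entered / legacy input.
# Note: %b (month abbreviation) depends on the current locale.
SLOPPY_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d %b %Y", "%b %d, %Y")


def parse_sloppy(text):
    """Sloppy version: normalize whitespace, then try each known format."""
    cleaned = " ".join(text.split())
    for fmt in SLOPPY_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")
```

Both return the same date type, so downstream code need not care which 
version was used at the edge; parse_formal simply refuses anything that 
is not already in canonical form.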

The OGDL binary format is naive.  There can be a very good binary 
encoding; however, that is a separate problem, except that poor choices 
in the text-based readable format can make the binary equivalent 
inefficient.

There are many similar formats with good ideas, dot for instance.  This 
is a data format, not a (document) markup format.  Both are needed 
frequently, sometimes together.  It would be nice to have the pure 
data-describing ability of OGDL while supporting document definition as 
in Wiki/reST (aka reStructuredText)/similar.  In particular, semantic 
markup / data markup of text or document markup of data should both be 
cleanly possible.

http://docutils.sourceforge.net/rst.html
http://sphinx.pocoo.org/  (Can anyone recommend a good document / web / 
PDF system that is better / more full-featured than this?  One suitable 
for highly simplified but effective web publishing and corporate / 
development document needs?)

Thanks,
Stephen
>
> Regarding your "Multiarity(tm)" etc...  loved it!  Thanks, Tom. :-)  
> You're spot on.
>
> Dr. Ernie writes:
>
>> where everything is a string
>
> Just to be clear, that's the *opposite* of what I'm talking about.  
> I'd prefer an environment where *nothing* is a string (except *actual* 
> strings.)  Everything's a well-typed value and *very* few if any 
> interesting data types have to be "tunneled" inside strings.  But 
> those well-typed values can be explicitly constructed and 
> unambiguously inferred from the lexical syntax involved.
>
>> which implies using sigils for variables
>
> In general this isn't really a problem just with shell languages, it's 
> a problem with any language that admits symbols-as-value-types.  In 
> such languages you appear to have a strict choice (with a few 
> exceptions, to be discussed below) --- either symbols are unquoted and 
> unevaluated by default, and must be explicitly dereferenced somehow to 
> get the value (if any) they might be bound to in some context, OR you 
> have to quote them in order to use them as values in themselves.  (Or, 
> you can punt and just have strings, which is what all but a very few 
> languages do.)  For the most part you can't have it both ways.
>
> You can get away from that in some limited context by having some 
> special evaluation rules.  Schemes and all UNIX shells have a useful 
> convention:  the first symbol or subexpression in an expression is 
> taken to be a variable referencing (function returning, etc.) a 
> function, and is implicitly dereferenced and applied to the arguments 
> (which in Scheme are just expressions that are eagerly evaluated, 
> while in the shell they are only lightly parsed and flat and 
> dereferencing is explicit.)  But such evaluation semantics along with 
> the syntactic ability to distinguish expressions / commands / whatever 
> the bigger-than-word language unit is, gives you a tool to use as a 
> language designer.  (Nb. note that Scheme achieves something 
> interesting by allowing either a symbol or a functor in first 
> position;  this leads, with a little thought, to a really interesting 
> gestalt:  the semantics of programming language-style variables (named 
> slots, as opposed to e.g. mathematical or logical variables, etc.) in 
> general can be understood in terms of functors.)
>
> The generalization of symbols to hierarchical constructs that are 
> still simultaneously both names and first-class objects --- let's call 
> them "path expressions" --- is a pretty interesting thing.  Consider 
> the following in a typical object language of some kind:
>
>   a.b.c
>
> This should generally be understood as "look up the value of the name 
> c in the namespace obtained by looking up the value of the name b in 
> the namespace obtained by looking up the value of the name a in the 
> (global, local, depending on context) namespace."  Dereference, 
> typically, is implicit.  Consider the similarity to the familiar
>
>   /foo/bar/baz
>
> What's the difference?  Well, for one thing, in shell-like languages 
> we can construct the latter, pass it around, etc. in shells w/o 
> assuming that it's going to be dereferenced at any given point and / 
> or yield anything particular.  To be fair it's because the shell only 
> treats it as an opaque string (modulo things like dirname and its 
> path-munging shell shortcut friends) but there's no reason why we 
> can't think about such things as objects in their own right.
>
> This leads to a really interesting set of potential evaluation rules 
> that minimize (but don't entirely eliminate) the kind of dollar-itis 
> that you find in most shell languages.  And FWIW, the first characters 
> of each of these:
>
>   ./foo
>   /foo
>   ~/foo
>
> Can all be understood as special dereference operators that name a 
> unique context in which the symbol foo is to be dereferenced.
>
> -- 
>
> So to be clear:  I'm *not* a fan of the sigils and crap syntactic line 
> noise that you find all over the place in e.g. most shells and in Perl 
> etc.  That's actually *exactly* what I'd like to minimize!  But in an 
> interactive context, and with first-class symbols and other value 
> types, it's unlikely that you can eliminate (at least) the use of e.g. 
> "$" as a prefix dereference operator when you want to get the value 
> that's "bound to" or implied by certain value-holding types that are 
> values in themselves.
>
> $0.02,
>
>
> jb
>
>
>


