ESPX - an ECMAScript Parser for (almost) XML, with namespaces

Version 20020313


TinyXSL - XML transform in-Script mini-Language

Version -0.73

(both for down-level web user agents without built-in XML support)

March 13, 2002 release

Here's the download.

Copyright (c) 2000, 2001, 2002 Cyril Jandia ( http://www.cjandia.com/ )

See the file copying.txt for copying permission.

TinyXSL demo (uses the ESPX parser and TinyXSL processor)


Abstract

As its unimaginative name suggests, "ESPX" is an ECMAScript-coded parser for a subset of XML 1.0: there is no DTD support yet (neither external nor internal subset).

However, since version 20010206 it fully supports the syntax and the (name-scoping) semantics that XML namespaces add to XML. Also, as a main implementation goal, ESPX was written with strict ECMAScript compliance in mind - see "Tested user agents" below.

As far as performance is concerned, please see "The performance issue" below.

Anyway, this should be considered a beta release.

For the impatient, here's the source code as well as a simple demo. See also: the FAQ, which is a TinyXSL demo (as you know, Small is beautiful ;^)


Basic Testing

Here are the results of running ESPX against some of James Clark's [xmltests].

For convenience, there is also an all-in-one ZIP file.

Also, for comparison, here are the results of running the same tests against:

MSXML (version 3, Service Pack 1): msxml-test.txt
XML for <SCRIPT> (version 1.1): xml4script-test.txt


Frequently Asked Questions Here


Changes from previous releases

Note that, since ESPX now needs at least a minimum of user feedback, the pace of revisions should slow down (or even stop for a while). However, see "From here ..." below.

Changes in this release

This version fixes/adds support for the following (bug fixes and/or design changes first):

Changes in version 20020212.075

Changes in version 20020112.076

Changes in version 20020110.077

Changes in version 20020109.078

Changes in version 20011228.079

Changes in version 20011205.080

Changes in version 20011116.081

Changes in version 20010411.082

Changes in version 20010101

Initial release.


ESPX / TinyXSL files supported


Description

ESPX is not a validating parser. It does not read any form of internal DTD subset either. At a minimum, it checks the document for basic well-formedness: proper element nesting, attribute assignments, and the use of character and predefined entity references (e.g., &#160;, &amp;...). Finally, it builds an unoptimized tree data structure in memory to represent the parsed document.
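To give an idea of what an element nesting check involves, here is a minimal sketch. It is not ESPX's code: the function name checkNesting and the regular expression are assumptions made for illustration, and the sketch deliberately ignores comments, CDATA sections and ">" characters inside attribute values.

```javascript
// Illustrative sketch only (not taken from ESPX): a stack-based check that
// every start tag is closed by a matching end tag, in the right order.
function checkNesting(xml) {
  // Groups: 1 = "/" for an end tag, 2 = element name, 4 = "/" for an empty-element tag.
  var tagPattern = /<(\/?)([A-Za-z_:][\w:.\-]*)([^>]*?)(\/?)>/g;
  var open = [];   // names of the elements currently open
  var m;
  while ((m = tagPattern.exec(xml)) !== null) {
    if (m[1] === "/") {
      // End tag: it must match the most recently opened element.
      if (open.length === 0 || open.pop() !== m[2]) return false;
    } else if (m[4] !== "/") {
      // Start tag (empty-element tags such as <b/> need no end tag).
      open.push(m[2]);
    }
  }
  // Well nested only if nothing is left open.
  return open.length === 0;
}
```

For instance, checkNesting("<a><b></a></b>") reports a nesting error because the end tag for "a" arrives while "b" is still open.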

Since there is no support for DTD declarations, ESPX has no other choice than to treat attribute values as CDATA (all whitespace is kept).

Also, CR/LF sequences and lone CR characters are normalized to LF once and for all on input, just before parsing.
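As an illustration, this kind of line-end normalization can be done with a single regular expression. The function name normalizeLineEndings is an assumption; ESPX's actual code may well proceed differently.

```javascript
// Illustrative sketch only: turn CR/LF pairs and lone CR characters into LF.
function normalizeLineEndings(text) {
  // "\r\n?" matches a CR optionally followed by an LF, so a CR/LF pair
  // is replaced as a whole and never yields two line feeds.
  return text.replace(/\r\n?/g, "\n");
}
```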

What is built in memory

As it parses the document, ESPX's XMLParser object builds a DOM-like tree data structure. Note that this tree is not compliant with the official DOM (see DOM Level 1). At most, you can rely on the semantics of more or less universal features such as nodeName, nodeValue, nodeType, parentNode and so on. But you won't find any equivalent of insertBefore() and the like.

The parse result tree is given by the XMLParser.document property. The same parser object may be reused multiple times to parse different documents. See the <script> tag at the top of simple.htm to see how all this is put to work.
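The following sketch shows the kind of code that can be written against such a tree while relying only on nodeName and parentNode. The three nodes are hand-built stand-ins, not output from ESPX; the function name pathToRoot and the numeric nodeType values (borrowed from DOM Level 1) are assumptions.

```javascript
// Hand-built stand-ins for parsed nodes; a real tree would come from
// the parser's document property. nodeType values follow DOM Level 1
// (9 = document, 1 = element), which is an assumption for ESPX.
var doc = { nodeName: "#document", nodeType: 9, nodeValue: null, parentNode: null };
var faq = { nodeName: "faq", nodeType: 1, nodeValue: null, parentNode: doc };
var entry = { nodeName: "entry", nodeType: 1, nodeValue: null, parentNode: faq };

// Build a "/faq/entry"-style path by walking up the parentNode links.
function pathToRoot(node) {
  var path = "";
  while (node && node.parentNode) {   // stop at the node that has no parent
    path = "/" + node.nodeName + path;
    node = node.parentNode;
  }
  return path === "" ? "/" : path;
}
```

Here pathToRoot(entry) gives "/faq/entry", and pathToRoot(doc) gives "/".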


The performance issue

On a three-year-old Pentium II (350 MHz, 64 MB) running NS 4.7 on Win98, ESPX parses a 12 KB document and builds the DOM-like tree in less than 0.6 seconds, and does the same in less than 1.5 seconds for a document twice as big (perf1.xml and perf2.xml, in which markup represents roughly 33% of the document size).

However, for documents above 36 KB, be aware that the parsing and tree-building times currently observed are simply not acceptable (more than 2 seconds). So there is considerable room for improvement in this area. FYI: as an order of magnitude, and for the same small documents (< 50 KB), ESPX appears on average to be between fifty and one hundred times slower than Microsoft's C++-coded MSXML.

Note that if markup is sparse, representing less than, say, 5% of the document size, then ESPX performs better (for example, 1.3 seconds under IE for the 90 KB spars90k.xml).
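To estimate where a given document falls between "33% markup" and "sparse", something like the following rough sketch can be used. The function name markupRatio is an assumption, and the measure is crude: it counts every character between a "<" and the next ">" as markup and ignores entity references.

```javascript
// Illustrative sketch only: fraction of a document's characters that
// sit inside tags (including the angle brackets themselves).
function markupRatio(xml) {
  if (xml.length === 0) return 0;
  var tags = xml.match(/<[^>]*>/g) || [];   // null when there is no tag at all
  var markupChars = 0;
  for (var i = 0; i < tags.length; i++) {
    markupChars += tags[i].length;
  }
  return markupChars / xml.length;
}
```

For "<a>hello</a>", 7 of the 12 characters are markup, so the ratio is about 0.58.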


Things to know for a proper use

Design choices

Implementation-related


Examples

For now, there are only simple.htm, databind.htm and the FAQ as a TinyXSL sample. Others may follow.


From here ...

Future work is likely to include the following (the most urgent items, in no particular order):

See the current limitations below.

Implementation note for an XPath subset:
ideally, I guess, a smart implementation would exploit the fact that ECMAScript is a reflective language, and would thus compile expressions written in such a subset into ECMAScript functions.
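A minimal sketch of that idea, under loud assumptions: the "subset" here is reduced to slash-separated child element names, the function name compilePath is hypothetical, and nodes are assumed to expose a childNodes array, which this document does not confirm for ESPX. The sketch also uses JSON.stringify to quote names safely, so it targets a current ECMAScript engine rather than the down-level browsers ESPX supports.

```javascript
// Illustrative sketch only: compile a path such as "list/item" into an
// ECMAScript function, using the standard Function constructor.
function compilePath(path) {
  var steps = path.split("/");
  var src = "var current = [root], next, i, j, kids;\n";
  for (var s = 0; s < steps.length; s++) {
    // One generated loop per step: keep the children whose name matches.
    src += "next = [];\n" +
           "for (i = 0; i < current.length; i++) {\n" +
           "  kids = current[i].childNodes || [];\n" +   // childNodes is an assumption
           "  for (j = 0; j < kids.length; j++) {\n" +
           "    if (kids[j].nodeName === " + JSON.stringify(steps[s]) + ") next.push(kids[j]);\n" +
           "  }\n" +
           "}\n" +
           "current = next;\n";
  }
  src += "return current;";
  return new Function("root", src);   // the compiled selector
}

// Usage on a hand-built tree (not produced by ESPX):
var sampleRoot = {
  nodeName: "#document",
  childNodes: [{
    nodeName: "list",
    childNodes: [{ nodeName: "item" }, { nodeName: "item" }, { nodeName: "other" }]
  }]
};
var selectItems = compilePath("list/item");
var items = selectItems(sampleRoot);   // the two "item" nodes
```

The point of compiling is that the path is analyzed once; the resulting function can then be applied to many trees without re-reading the expression.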

Of course, if you can devise an interesting use of ESPX or, better yet, implement any of the preceding, I can only invite you to join in the effort.


Pending ...


Wish list

For ESPX

For TinyXSL


Supported HTML 4.0 entities

Most of them, including:

From Latin-1 Entities:
Columns: character, entity reference, decimal reference, hex reference, rendering in the browser (via the entity, then via the decimal reference)
no-break space = non-breaking space &nbsp; &#160; &#xA0;    
inverted exclamation mark &iexcl; &#161; &#xA1; ¡ ¡
cent sign &cent; &#162; &#xA2; ¢ ¢
pound sign &pound; &#163; &#xA3; £ £
currency sign &curren; &#164; &#xA4; ¤ ¤
yen sign = yuan sign &yen; &#165; &#xA5; ¥ ¥
broken bar = broken vertical bar &brvbar; &#166; &#xA6; ¦ ¦
section sign &sect; &#167; &#xA7; § §
diaeresis = spacing diaeresis &uml; &#168; &#xA8; ¨ ¨
copyright sign &copy; &#169; &#xA9; © ©
feminine ordinal indicator &ordf; &#170; &#xAA; ª ª
left-pointing double angle quotation mark = left pointing guillemet &laquo; &#171; &#xAB; « «
not sign &not; &#172; &#xAC; ¬ ¬
soft hyphen = discretionary hyphen &shy; &#173; &#xAD; ­ ­
registered sign = registered trade mark sign &reg; &#174; &#xAE; ® ®
macron = spacing macron = overline = APL overbar &macr; &#175; &#xAF; ¯ ¯
degree sign &deg; &#176; &#xB0; ° °
plus-minus sign = plus-or-minus sign &plusmn; &#177; &#xB1; ± ±
superscript two = superscript digit two = squared &sup2; &#178; &#xB2; ² ²
superscript three = superscript digit three = cubed &sup3; &#179; &#xB3; ³ ³
acute accent = spacing acute &acute; &#180; &#xB4; ´ ´
micro sign &micro; &#181; &#xB5; µ µ
pilcrow sign = paragraph sign &para; &#182; &#xB6; ¶ ¶
middle dot = Georgian comma = Greek middle dot &middot; &#183; &#xB7; · ·
cedilla = spacing cedilla &cedil; &#184; &#xB8; ¸ ¸
superscript one = superscript digit one &sup1; &#185; &#xB9; ¹ ¹
masculine ordinal indicator &ordm; &#186; &#xBA; º º
right-pointing double angle quotation mark = right pointing guillemet &raquo; &#187; &#xBB; » »
vulgar fraction one quarter = fraction one quarter &frac14; &#188; &#xBC; ¼ ¼
vulgar fraction one half = fraction one half &frac12; &#189; &#xBD; ½ ½
vulgar fraction three quarters = fraction three quarters &frac34; &#190; &#xBE; ¾ ¾
inverted question mark = turned question mark &iquest; &#191; &#xBF; ¿ ¿
Latin capital letter A with grave = Latin capital letter A grave &Agrave; &#192; &#xC0; À À
Latin capital letter A with acute &Aacute; &#193; &#xC1; Á Á
Latin capital letter A with circumflex &Acirc; &#194; &#xC2; Â Â
Latin capital letter A with tilde &Atilde; &#195; &#xC3; Ã Ã
Latin capital letter A with diaeresis &Auml; &#196; &#xC4; Ä Ä
Latin capital letter A with ring above = Latin capital letter A ring &Aring; &#197; &#xC5; Å Å
Latin capital letter AE = Latin capital ligature AE &AElig; &#198; &#xC6; Æ Æ
Latin capital letter C with cedilla &Ccedil; &#199; &#xC7; Ç Ç
Latin capital letter E with grave &Egrave; &#200; &#xC8; È È
Latin capital letter E with acute &Eacute; &#201; &#xC9; É É
Latin capital letter E with circumflex &Ecirc; &#202; &#xCA; Ê Ê
Latin capital letter E with diaeresis &Euml; &#203; &#xCB; Ë Ë
Latin capital letter I with grave &Igrave; &#204; &#xCC; Ì Ì
Latin capital letter I with acute &Iacute; &#205; &#xCD; Í Í
Latin capital letter I with circumflex &Icirc; &#206; &#xCE; Î Î
Latin capital letter I with diaeresis &Iuml; &#207; &#xCF; Ï Ï
Latin capital letter ETH &ETH; &#208; &#xD0; Ð Ð
Latin capital letter N with tilde &Ntilde; &#209; &#xD1; Ñ Ñ
Latin capital letter O with grave &Ograve; &#210; &#xD2; Ò Ò
Latin capital letter O with acute &Oacute; &#211; &#xD3; Ó Ó
Latin capital letter O with circumflex &Ocirc; &#212; &#xD4; Ô Ô
Latin capital letter O with tilde &Otilde; &#213; &#xD5; Õ Õ
Latin capital letter O with diaeresis &Ouml; &#214; &#xD6; Ö Ö
multiplication sign &times; &#215; &#xD7; × ×
Latin capital letter O with stroke = Latin capital letter O slash &Oslash; &#216; &#xD8; Ø Ø
Latin capital letter U with grave &Ugrave; &#217; &#xD9; Ù Ù
Latin capital letter U with acute &Uacute; &#218; &#xDA; Ú Ú
Latin capital letter U with circumflex &Ucirc; &#219; &#xDB; Û Û
Latin capital letter U with diaeresis &Uuml; &#220; &#xDC; Ü Ü
Latin capital letter Y with acute &Yacute; &#221; &#xDD; Ý Ý
Latin capital letter THORN &THORN; &#222; &#xDE; Þ Þ
Latin small letter sharp s = ess-zed &szlig; &#223; &#xDF; ß ß
Latin small letter a with grave = Latin small letter a grave &agrave; &#224; &#xE0; à à
Latin small letter a with acute &aacute; &#225; &#xE1; á á
Latin small letter a with circumflex &acirc; &#226; &#xE2; â â
Latin small letter a with tilde &atilde; &#227; &#xE3; ã ã
Latin small letter a with diaeresis &auml; &#228; &#xE4; ä ä
Latin small letter a with ring above = Latin small letter a ring &aring; &#229; &#xE5; å å
Latin small letter ae = Latin small ligature ae &aelig; &#230; &#xE6; æ æ
Latin small letter c with cedilla &ccedil; &#231; &#xE7; ç ç
Latin small letter e with grave &egrave; &#232; &#xE8; è è
Latin small letter e with acute &eacute; &#233; &#xE9; é é
Latin small letter e with circumflex &ecirc; &#234; &#xEA; ê ê
Latin small letter e with diaeresis &euml; &#235; &#xEB; ë ë
Latin small letter i with grave &igrave; &#236; &#xEC; ì ì
Latin small letter i with acute &iacute; &#237; &#xED; í í
Latin small letter i with circumflex &icirc; &#238; &#xEE; î î
Latin small letter i with diaeresis &iuml; &#239; &#xEF; ï ï
Latin small letter eth &eth; &#240; &#xF0; ð ð
Latin small letter n with tilde &ntilde; &#241; &#xF1; ñ ñ
Latin small letter o with grave &ograve; &#242; &#xF2; ò ò
Latin small letter o with acute &oacute; &#243; &#xF3; ó ó
Latin small letter o with circumflex &ocirc; &#244; &#xF4; ô ô
Latin small letter o with tilde &otilde; &#245; &#xF5; õ õ
Latin small letter o with diaeresis &ouml; &#246; &#xF6; ö ö
division sign &divide; &#247; &#xF7; ÷ ÷
Latin small letter o with stroke = Latin small letter o slash &oslash; &#248; &#xF8; ø ø
Latin small letter u with grave &ugrave; &#249; &#xF9; ù ù
Latin small letter u with acute &uacute; &#250; &#xFA; ú ú
Latin small letter u with circumflex &ucirc; &#251; &#xFB; û û
Latin small letter u with diaeresis &uuml; &#252; &#xFC; ü ü
Latin small letter y with acute &yacute; &#253; &#xFD; ý ý
Latin small letter thorn &thorn; &#254; &#xFE; þ þ
Latin small letter y with diaeresis &yuml; &#255; &#xFF; ÿ ÿ

From Entities for Symbols and Greek Letters:
Columns: character, entity reference, decimal reference, hex reference, rendering in the browser (via the entity, then via the decimal reference)
Latin small f with hook = function = florin &fnof; &#402; &#x192; ƒ ƒ
bullet = black small circle &bull; &#8226; &#x2022; • •
horizontal ellipsis = three dot leader &hellip; &#8230; &#x2026; … …
trade mark sign &trade; &#8482; &#x2122; ™ ™

From Special Entities:
Columns: character, entity reference, decimal reference, hex reference, rendering in the browser (via the entity, then via the decimal reference)
quotation mark = APL quote &quot; &#34; &#x22; " "
ampersand &amp; &#38; &#x26; & &
less-than sign &lt; &#60; &#x3C; < <
greater-than sign &gt; &#62; &#x3E; > >
Latin capital ligature OE &OElig; &#338; &#x152; Œ Œ
Latin small ligature oe &oelig; &#339; &#x153; œ œ
Latin capital letter S with caron &Scaron; &#352; &#x160; Š Š
Latin small letter s with caron &scaron; &#353; &#x161; š š
Latin capital letter Y with diaeresis &Yuml; &#376; &#x178; Ÿ Ÿ
modifier letter circumflex accent &circ; &#710; &#x2C6; ˆ ˆ
small tilde &tilde; &#732; &#x2DC; ˜ ˜
en space &ensp; &#8194; &#x2002;
em space &emsp; &#8195; &#x2003;
thin space &thinsp; &#8201; &#x2009;
zero width non-joiner &zwnj; &#8204; &#x200C;
zero width joiner &zwj; &#8205; &#x200D;
left-to-right mark &lrm; &#8206; &#x200E;
right-to-left mark &rlm; &#8207; &#x200F;
en dash &ndash; &#8211; &#x2013; – –
em dash &mdash; &#8212; &#x2014; — —
left single quotation mark &lsquo; &#8216; &#x2018; ‘ ‘
right single quotation mark &rsquo; &#8217; &#x2019; ’ ’
single low-9 quotation mark &sbquo; &#8218; &#x201A; ‚ ‚
left double quotation mark &ldquo; &#8220; &#x201C; “ “
right double quotation mark &rdquo; &#8221; &#x201D; ” ”
double low-9 quotation mark &bdquo; &#8222; &#x201E; „ „
dagger &dagger; &#8224; &#x2020; † †
double dagger &Dagger; &#8225; &#x2021; ‡ ‡
per mille sign &permil; &#8240; &#x2030; ‰ ‰
single left-pointing angle quotation mark &lsaquo; &#8249; &#x2039; ‹ ‹
single right-pointing angle quotation mark &rsaquo; &#8250; &#x203A; › ›
euro sign &euro; &#8364; &#x20AC; € €
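
As an illustration of how such a table can be put to work, here is a minimal sketch of reference decoding. It is not ESPX's code: the names ENTITIES and decodeReferences are assumptions, only a handful of the entities listed above are included, and String.fromCharCode limits it to code points up to 65535 (which covers every entry in these tables).

```javascript
// Illustrative sketch only: a few entity names from the tables above,
// mapped to their decimal code points.
var ENTITIES = {
  quot: 34, amp: 38, lt: 60, gt: 62,
  nbsp: 160, copy: 169, eacute: 233, euro: 8364
};

// Replace &name;, &#decimal; and &#xhex; references by their characters.
// References to unknown names are left untouched.
function decodeReferences(text) {
  var pattern = /&(#x[0-9A-Fa-f]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);/g;
  return text.replace(pattern, function (whole, ref) {
    if (ref.charAt(0) === "#") {
      var code = ref.charAt(1) === "x"
        ? parseInt(ref.substring(2), 16)    // hexadecimal form, e.g. &#xA0;
        : parseInt(ref.substring(1), 10);   // decimal form, e.g. &#160;
      return String.fromCharCode(code);
    }
    return Object.prototype.hasOwnProperty.call(ENTITIES, ref)
      ? String.fromCharCode(ENTITIES[ref])
      : whole;
  });
}
```

Because the replacement is done in a single pass, "&amp;lt;" decodes to the text "&lt;" and not to "<".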


Limitations

There is no documentation except this document and comments in the source code.

Also, apart from bugs yet to be discovered, the implementation needs improvement in several areas, including:


Tested user agents (to be updated regularly)

These have been successfully tested with ESPX / TinyXSL:

Platform | Product name | Version(s) | Built-in XML support versions ? | ECMAScript implementation level used for ESPX
Mac | Microsoft Internet Explorer | 5.0 | 5.x and above | ???
Windows | Microsoft Internet Explorer | 4.x, 5.x | 5.x and above | JScript 3 in 4.x browsers (latest is JScript 5.5 (?))
Windows, Linux | Netscape Navigator | 4.x, 6.0 | 6.0 and above | JavaScript1.2 in 4.x browsers (latest is JavaScript1.5 (?))
Windows | Opera | 5.0 | ??? | JavaScript1.2 (?)


Reporting bugs

Please report bugs to me. When reporting bugs, please be sure to include easy-to-reproduce test cases for either IE 4.x or 5.x, or NS 4.x or 6.0. I'm also interested in testing feedback for the Linux platform and for WMLScript, if applicable. Create a zip file containing all the necessary files, and attach it to your email.

Ideas, comments, suggestions for improvements, especially bug fixes, are always welcome, as usual. Thanks in advance.

March 13, 2002

Cyril Jandia