ESPX - an ECMAScript Parser for (almost) XML, with namespaces |
TinyXSL - XML transform in-Script mini-Language |
Here's the download.
Copyright (c) 2000, 2001, 2002 Cyril Jandia ( http://www.cjandia.com/ )
See the file copying.txt for copying permission.
Abstract |
As its unimaginative name suggests, "ESPX" is an ECMAScript-coded parser for a subset of XML 1.0: there is no DTD support yet (neither external nor internal subset). However, since version 20010206 it comes with full support for the XML namespaces syntax and the (name-scoping) semantics they add to XML. Also, as a main implementation goal, ESPX was written with strict ECMAScript compliance in mind; see "Tested user agents" below.
As far as performance is concerned, please see "The performance issue" below.
Anyway, this should be considered a beta release.
For the impatient, here's the source code as well as a simple demo. See also: the FAQ, which is a TinyXSL demo (as you know, Small is beautiful ;^)
Contents |
Basic Testing |
[xmltests].
For convenience, there is also an all-in-one ZIP file.
Also, for comparison, here are the results of running, against these tests:
msxml-test.txt
<SCRIPT> (in its version 1.1): xml4script-test.txt
Frequently Asked Questions Here |
Changes from previous releases |
Note that, as ESPX now needs at least a minimum of user feedback, the pace of revisions should slow down (or even stop for a while). However, see "From here ..." below.
This version fixes/adds support for the following (bug fixes and/or design changes first):
XMLParser._lookForInvalidCharacters() (internal) utility function;
TinyXSLProcessor.getVersion() returns -0.73;
XMLParser.getVersion() now returns 20020313
XMLDocument.getElementsByTagName() (thanx to Gaurav Pal ;o)
XMLNode.uniqueID(), XMLDocument.createWhatever());
TinyXSLProcessor.getVersion() returns -0.75;
XMLParser.getVersion() now returns 20020212
]]> no longer allowed in element content (per http://www.w3.org/TR/REC-xml#NT-CharData);
<!DOCTYPE ...>;
TinyXSLProcessor.getVersion() returns -0.76;
XMLParser.getVersion() now returns 20020112
<!DOCTYPE ...> is now silently ignored (no more "unsupported document type declaration" error);
TinyXSLProcessor.getVersion() returns -0.77;
XMLParser.getVersion() now returns 20020110
TinyXSLProcessor.getVersion() returns -0.78;
XMLParser.getVersion() now returns 20020109
xml:space, lang, etc.) :^(
TinyXSLProcessor.getVersion() returns -0.79;
XMLParser.getVersion() now returns 20011228
xml:base;
XMLParser.xmlBase property, which you can use in a fashion similar to that of XMLParser.xmlLang;
TinyXSLProcessor.getVersion() returns -0.80;
XMLParser.getVersion() now returns 20011205
TinyXSLProcessor.getVersion() returns -0.81;
XMLParser.getVersion() now returns 20011116
TinyXSLProcessor.getVersion() returns -0.82;
XMLParser.getVersion() now returns 20010411
Initial release.
ESPX / TinyXSL files supported |
Description |
ESPX is not a validating parser. It does not read any form of internal DTD subset either. At a minimum, it checks the document for basic well-formedness: proper element nesting, attribute assignments, and the use of character and predefined entity references (e.g., &#160;, &amp; ...). Eventually, it builds an unoptimized tree data structure in memory to represent the parsed document.
Since there is no support for DTD declarations in any form, ESPX has no other choice than to treat attribute values as CDATA (all whitespace is kept).
Also, CR/LF sequences and lone CR characters are normalized to LF once and for all on input, just before parsing.
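The normalization described above can be sketched in a single regular-expression replacement. This is only an illustration of the rule, not ESPX's actual code; the function name normalizeLineEndings is an assumption made for this example.

```javascript
// Sketch only (hypothetical helper, not taken from the ESPX source).
// Replaces every CR LF pair and every lone CR with a single LF.
function normalizeLineEndings(text) {
  // "\r\n?" matches a CR optionally followed by an LF, so a CR LF pair
  // is consumed as one unit and never produces two line feeds.
  return text.replace(/\r\n?/g, "\n");
}

var normalized = normalizeLineEndings("a\r\nb\rc\nd");
```

Doing this once on the whole input lets the rest of a parser treat LF as the only line terminator.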
As it parses the document, ESPX's XMLParser object tries to build a DOM-like tree data structure. Note that the latter is not compliant with the official DOM (see DOM Level 1). At most, you can rely on the semantics of more or less universal features like nodeName, nodeValue, nodeType, parentNode and so on. But you won't find any equivalent for insertBefore() and the like.
The parse result tree is given by the XMLParser.document property. The same parser object may be reused multiple times to parse different documents. See the <script> tag at the top of simple.htm to see how all this is put to work.
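As a rough illustration of walking such a tree, here is a depth-first traversal that relies only on nodeName, on the theNode[theIndex] child access and on the childCount property mentioned under "Things to know for a proper use". The tree below is built by hand with a hypothetical makeNode() helper rather than by XMLParser, and the "#text" node name is borrowed from the DOM as an assumption; simple.htm remains the reference for real usage.

```javascript
// Sketch only: depth-first walk over an ESPX-style tree.
// Assumes children are reached as node[index] and counted by node.childCount.
function collectNodeNames(node, names) {
  var i;
  names = names || [];
  names.push(node.nodeName);
  for (i = 0; i < node.childCount; i++) {
    collectNodeNames(node[i], names);
  }
  return names;
}

// Hypothetical stand-in for a parsed tree (not the ESPX node constructor).
function makeNode(nodeName, children) {
  var node = { nodeName: nodeName, childCount: children.length }, i;
  for (i = 0; i < children.length; i++) {
    node[i] = children[i];
    children[i].parentNode = node;
  }
  return node;
}

var sampleTree = makeNode("faq", [
  makeNode("question", [makeNode("#text", [])]),
  makeNode("answer", [])
]);
var sampleNames = collectNodeNames(sampleTree); // names in document order
```

With a real parse result, the starting node would be reached through XMLParser.document instead of sampleTree.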
The performance issue |
On a 3-year-old Pentium II (350 MHz, 64 MB) running NS 4.7 on Win98, ESPX parses a 12 KB document and builds the DOM-like tree in less than 0.6 seconds, and does the same in less than 1.5 seconds for a document twice as big (perf1.xml and perf2.xml, where markup represents roughly 33% of document size).
However, for documents above 36 KB, be aware that the parsing/tree-building times currently experienced are simply not acceptable (more than 2 seconds). So there is considerable room for improvement in this area. FYI: as an order of magnitude, and for the same small documents (< 50 KB), ESPX appears on average to be between fifty and one hundred times slower than Microsoft's C++-coded MSXML.
Note that if markup is sparse, representing less than, say, 5% of document size, then ESPX performs better (for example, 1.3 seconds under IE for the 90 KB spars90k.xml).
Things to know for a proper use |
ECMAScript uses Unicode);
standalone="yes");
&nbsp;) has to be written &#160; or &#xA0; (however, see "Supported HTML 4.0 entities" below);
&amp;, &apos;, &gt;, &lt; and &quot; (however, see "Supported HTML 4.0 entities" below);
NodeList: to access a node's children you have to do it ECMAScript's preferred way, that is, theNode[theIndex] (of course, there is a theNode.childCount property to know the size of the family);
NamedNodeMap for elements' attributes either, instead use: theElementNode.attributes[nameOfAttribute], or preferably, theElementNode.get/setAttribute(nameOfAttribute); since attributes is an ECMAScript array, you can also enumerate all attributes the usual way in ECMAScript:
  var attr;
  for (attr in theElement.attributes) {
    // do something with theElement.attributes[attr]
  }
xmlText() is an attempt to provide something similar to Microsoft's MSXML DOMDocument's xml property (implemented in ESPX as a simple recursive function returning the XML source text recomposed from the tree data structure); this one helps with debugging;
xml:space and xml:lang pre-declared attributes are properly honored on a per-element basis; also, in ESPX the meaning of xml:space's "default" value is controlled by the XMLParser.preserveWhiteSpace boolean property (false strips insignificant white space, while true keeps it all); as far as xml:lang's default is concerned, it is given by XMLParser.xmlLang (which is a string property);
XMLDocumentFactory plays a role similar to that of the DOM's DOMImplementation.
ECMAScript user agents which are still around, ESPX does not take advantage of throw/try ... catch-based error detection/handling; it seems even something like NS 4.7 doesn't know what to do with a try ... catch statement;
private or protected in more serious (read: strongly-typed) OO languages; please do not use them unless you can't do your business any other way;
XMLParser() constructor function for currently recognized error cases; test this part carefully;
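To give an idea of what "a simple recursive function returning the XML source text recomposed from the tree" can look like, here is a minimal sketch in the spirit of xmlText(). It is not the ESPX implementation: the serialize(), escapeText() and escapeAttribute() names are made up for this example, and the node shape is an assumption (nodeType 1 for elements and 3 for text as in DOM Level 1, attributes as a plain name-to-value map, children as node[0] to node[childCount - 1]).

```javascript
// Sketch only: an xmlText()-like serializer over a hand-built tree.
function escapeText(s) {
  return String(s).replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

function escapeAttribute(s) {
  return escapeText(s).replace(/"/g, "&quot;");
}

function serialize(node) {
  var out, name, i;
  if (node.nodeType === 3) {           // text node (assumed DOM Level 1 value)
    return escapeText(node.nodeValue);
  }
  out = "<" + node.nodeName;           // element node
  for (name in node.attributes) {
    // skip anything inherited from Object.prototype
    if (Object.prototype.hasOwnProperty.call(node.attributes, name)) {
      out += " " + name + '="' + escapeAttribute(node.attributes[name]) + '"';
    }
  }
  if (!node.childCount) {
    return out + "/>";                 // empty element
  }
  out += ">";
  for (i = 0; i < node.childCount; i++) {
    out += serialize(node[i]);         // recurse into each child
  }
  return out + "</" + node.nodeName + ">";
}

// Hypothetical sample tree, built by hand for the example.
var textNode = { nodeType: 3, nodeName: "#text", nodeValue: "Tom & Jerry", childCount: 0 };
var emptyNode = { nodeType: 1, nodeName: "br", attributes: {}, childCount: 0 };
var root = { nodeType: 1, nodeName: "p", attributes: { "xml:lang": "en" },
             childCount: 2, 0: textNode, 1: emptyNode };
var xmlSource = serialize(root);
```

The sketch ignores comments, processing instructions and CDATA sections, which a complete serializer would also have to recompose.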
Examples |
For now, only simple.htm, databind.htm and the FAQ as TinyXSL sample. Others may follow.
From here ... |
As far as future work directions are concerned, they are likely to include, for the most urgent in any order:
XMLParser, if only for use by TinyXSL;
ECMAScript's property of being a reflexive language and thus would compile expressions written in such a subset into ECMAScript functions.
Of course, if you find yourself able to devise an interesting use of ESPX and, better yet, to implement any of the preceding, I can do no better than invite you to join in the effort.
Pending ... |
Wish list |
Supported HTML 4.0 entities |
Most of them, including:
From Latin-1 Entities:
Character | Entity | Decimal | Hex | Rendering in the browser (entity) | Rendering in the browser (decimal) |
---|---|---|---|---|---|
no-break space = non-breaking space | &nbsp; | &#160; | &#xA0; | | |
inverted exclamation mark | &iexcl; | &#161; | &#xA1; | ¡ | ¡ |
cent sign | &cent; | &#162; | &#xA2; | ¢ | ¢ |
pound sign | &pound; | &#163; | &#xA3; | £ | £ |
currency sign | &curren; | &#164; | &#xA4; | ¤ | ¤ |
yen sign = yuan sign | &yen; | &#165; | &#xA5; | ¥ | ¥ |
broken bar = broken vertical bar | &brvbar; | &#166; | &#xA6; | ¦ | ¦ |
section sign | &sect; | &#167; | &#xA7; | § | § |
diaeresis = spacing diaeresis | &uml; | &#168; | &#xA8; | ¨ | ¨ |
copyright sign | &copy; | &#169; | &#xA9; | © | © |
feminine ordinal indicator | &ordf; | &#170; | &#xAA; | ª | ª |
left-pointing double angle quotation mark = left pointing guillemet | &laquo; | &#171; | &#xAB; | « | « |
not sign = discretionary hyphen | &not; | &#172; | &#xAC; | ¬ | ¬ |
soft hyphen = discretionary hyphen | &shy; | &#173; | &#xAD; | | |
registered sign = registered trade mark sign | &reg; | &#174; | &#xAE; | ® | ® |
macron = spacing macron = overline = APL overbar | &macr; | &#175; | &#xAF; | ¯ | ¯ |
degree sign | &deg; | &#176; | &#xB0; | ° | ° |
plus-minus sign = plus-or-minus sign | &plusmn; | &#177; | &#xB1; | ± | ± |
superscript two = superscript digit two = squared | &sup2; | &#178; | &#xB2; | ² | ² |
superscript three = superscript digit three = cubed | &sup3; | &#179; | &#xB3; | ³ | ³ |
acute accent = spacing acute | &acute; | &#180; | &#xB4; | ´ | ´ |
micro sign | &micro; | &#181; | &#xB5; | µ | µ |
pilcrow sign = paragraph sign | &para; | &#182; | &#xB6; | ¶ | ¶ |
middle dot = Georgian comma = Greek middle dot | &middot; | &#183; | &#xB7; | · | · |
cedilla = spacing cedilla | &cedil; | &#184; | &#xB8; | ¸ | ¸ |
superscript one = superscript digit one | &sup1; | &#185; | &#xB9; | ¹ | ¹ |
masculine ordinal indicator | &ordm; | &#186; | &#xBA; | º | º |
right-pointing double angle quotation mark = right pointing guillemet | &raquo; | &#187; | &#xBB; | » | » |
vulgar fraction one quarter = fraction one quarter | &frac14; | &#188; | &#xBC; | ¼ | ¼ |
vulgar fraction one half = fraction one half | &frac12; | &#189; | &#xBD; | ½ | ½ |
vulgar fraction three quarters = fraction three quarters | &frac34; | &#190; | &#xBE; | ¾ | ¾ |
inverted question mark = turned question mark | &iquest; | &#191; | &#xBF; | ¿ | ¿ |
Latin capital letter A with grave = Latin capital letter A grave | &Agrave; | &#192; | &#xC0; | À | À |
Latin capital letter A with acute | &Aacute; | &#193; | &#xC1; | Á | Á |
Latin capital letter A with circumflex | &Acirc; | &#194; | &#xC2; | Â | Â |
Latin capital letter A with tilde | &Atilde; | &#195; | &#xC3; | Ã | Ã |
Latin capital letter A with diaeresis | &Auml; | &#196; | &#xC4; | Ä | Ä |
Latin capital letter A with ring above = Latin capital letter A ring | &Aring; | &#197; | &#xC5; | Å | Å |
Latin capital letter AE = Latin capital ligature AE | &AElig; | &#198; | &#xC6; | Æ | Æ |
Latin capital letter C with cedilla | &Ccedil; | &#199; | &#xC7; | Ç | Ç |
Latin capital letter E with grave | &Egrave; | &#200; | &#xC8; | È | È |
Latin capital letter E with acute | &Eacute; | &#201; | &#xC9; | É | É |
Latin capital letter E with circumflex | &Ecirc; | &#202; | &#xCA; | Ê | Ê |
Latin capital letter E with diaeresis | &Euml; | &#203; | &#xCB; | Ë | Ë |
Latin capital letter I with grave | &Igrave; | &#204; | &#xCC; | Ì | Ì |
Latin capital letter I with acute | &Iacute; | &#205; | &#xCD; | Í | Í |
Latin capital letter I with circumflex | &Icirc; | &#206; | &#xCE; | Î | Î |
Latin capital letter I with diaeresis | &Iuml; | &#207; | &#xCF; | Ï | Ï |
Latin capital letter ETH | &ETH; | &#208; | &#xD0; | Ð | Ð |
Latin capital letter N with tilde | &Ntilde; | &#209; | &#xD1; | Ñ | Ñ |
Latin capital letter O with grave | &Ograve; | &#210; | &#xD2; | Ò | Ò |
Latin capital letter O with acute | &Oacute; | &#211; | &#xD3; | Ó | Ó |
Latin capital letter O with circumflex | &Ocirc; | &#212; | &#xD4; | Ô | Ô |
Latin capital letter O with tilde | &Otilde; | &#213; | &#xD5; | Õ | Õ |
Latin capital letter O with diaeresis | &Ouml; | &#214; | &#xD6; | Ö | Ö |
multiplication sign | &times; | &#215; | &#xD7; | × | × |
Latin capital letter O with stroke = Latin capital letter O slash | &Oslash; | &#216; | &#xD8; | Ø | Ø |
Latin capital letter U with grave | &Ugrave; | &#217; | &#xD9; | Ù | Ù |
Latin capital letter U with acute | &Uacute; | &#218; | &#xDA; | Ú | Ú |
Latin capital letter U with circumflex | &Ucirc; | &#219; | &#xDB; | Û | Û |
Latin capital letter U with diaeresis | &Uuml; | &#220; | &#xDC; | Ü | Ü |
Latin capital letter Y with acute | &Yacute; | &#221; | &#xDD; | Ý | Ý |
Latin capital letter THORN | &THORN; | &#222; | &#xDE; | Þ | Þ |
Latin small letter sharp s = ess-zed | &szlig; | &#223; | &#xDF; | ß | ß |
Latin small letter a with grave = Latin small letter a grave | &agrave; | &#224; | &#xE0; | à | à |
Latin small letter a with acute | &aacute; | &#225; | &#xE1; | á | á |
Latin small letter a with circumflex | &acirc; | &#226; | &#xE2; | â | â |
Latin small letter a with tilde | &atilde; | &#227; | &#xE3; | ã | ã |
Latin small letter a with diaeresis | &auml; | &#228; | &#xE4; | ä | ä |
Latin small letter a with ring above = Latin small letter a ring | &aring; | &#229; | &#xE5; | å | å |
Latin small letter ae = Latin small ligature ae | &aelig; | &#230; | &#xE6; | æ | æ |
Latin small letter c with cedilla | &ccedil; | &#231; | &#xE7; | ç | ç |
Latin small letter e with grave | &egrave; | &#232; | &#xE8; | è | è |
Latin small letter e with acute | &eacute; | &#233; | &#xE9; | é | é |
Latin small letter e with circumflex | &ecirc; | &#234; | &#xEA; | ê | ê |
Latin small letter e with diaeresis | &euml; | &#235; | &#xEB; | ë | ë |
Latin small letter i with grave | &igrave; | &#236; | &#xEC; | ì | ì |
Latin small letter i with acute | &iacute; | &#237; | &#xED; | í | í |
Latin small letter i with circumflex | &icirc; | &#238; | &#xEE; | î | î |
Latin small letter i with diaeresis | &iuml; | &#239; | &#xEF; | ï | ï |
Latin small letter eth | &eth; | &#240; | &#xF0; | ð | ð |
Latin small letter n with tilde | &ntilde; | &#241; | &#xF1; | ñ | ñ |
Latin small letter o with grave | &ograve; | &#242; | &#xF2; | ò | ò |
Latin small letter o with acute | &oacute; | &#243; | &#xF3; | ó | ó |
Latin small letter o with circumflex | &ocirc; | &#244; | &#xF4; | ô | ô |
Latin small letter o with tilde | &otilde; | &#245; | &#xF5; | õ | õ |
Latin small letter o with diaeresis | &ouml; | &#246; | &#xF6; | ö | ö |
division sign | &divide; | &#247; | &#xF7; | ÷ | ÷ |
Latin small letter o with stroke = Latin small letter o slash | &oslash; | &#248; | &#xF8; | ø | ø |
Latin small letter u with grave | &ugrave; | &#249; | &#xF9; | ù | ù |
Latin small letter u with acute | &uacute; | &#250; | &#xFA; | ú | ú |
Latin small letter u with circumflex | &ucirc; | &#251; | &#xFB; | û | û |
Latin small letter u with diaeresis | &uuml; | &#252; | &#xFC; | ü | ü |
Latin small letter y with acute | &yacute; | &#253; | &#xFD; | ý | ý |
Latin small letter thorn | &thorn; | &#254; | &#xFE; | þ | þ |
Latin small letter y with diaeresis | &yuml; | &#255; | &#xFF; | ÿ | ÿ |
From Entities for Symbols and Greek Letters:
Character | Entity | Decimal | Hex | Rendering in the browser (entity) | Rendering in the browser (decimal) |
---|---|---|---|---|---|
Latin small f with hook = function = florin | &fnof; | &#402; | &#x192; | ƒ | ƒ |
bullet = black small circle | &bull; | &#8226; | &#x2022; | • | • |
horizontal ellipsis = three dot leader | &hellip; | &#8230; | &#x2026; | … | … |
trade mark sign | &trade; | &#8482; | &#x2122; | ™ | ™ |
From Special Entities:
Character | Entity | Decimal | Hex | Rendering in the browser (entity) | Rendering in the browser (decimal) |
---|---|---|---|---|---|
quotation mark = APL quote | &quot; | &#34; | &#x22; | " | " |
ampersand | &amp; | &#38; | &#x26; | & | & |
less-than sign | &lt; | &#60; | &#x3C; | < | < |
greater-than sign | &gt; | &#62; | &#x3E; | > | > |
Latin capital ligature OE | &OElig; | &#338; | &#x152; | Œ | Œ |
Latin small ligature oe | &oelig; | &#339; | &#x153; | œ | œ |
Latin capital letter S with caron | &Scaron; | &#352; | &#x160; | Š | Š |
Latin small letter s with caron | &scaron; | &#353; | &#x161; | š | š |
Latin capital letter Y with diaeresis | &Yuml; | &#376; | &#x178; | Ÿ | Ÿ |
modifier letter circumflex accent | &circ; | &#710; | &#x2C6; | ˆ | ˆ |
small tilde | &tilde; | &#732; | &#x2DC; | ˜ | ˜ |
en space | &ensp; | &#8194; | &#x2002; | | |
em space | &emsp; | &#8195; | &#x2003; | | |
thin space | &thinsp; | &#8201; | &#x2009; | | |
zero width non-joiner | &zwnj; | &#8204; | &#x200C; | | |
zero width joiner | &zwj; | &#8205; | &#x200D; | | |
left-to-right mark | &lrm; | &#8206; | &#x200E; | | |
right-to-left mark | &rlm; | &#8207; | &#x200F; | | |
en dash | &ndash; | &#8211; | &#x2013; | – | – |
em dash | &mdash; | &#8212; | &#x2014; | — | — |
left single quotation mark | &lsquo; | &#8216; | &#x2018; | ‘ | ‘ |
right single quotation mark | &rsquo; | &#8217; | &#x2019; | ’ | ’ |
single low-9 quotation mark | &sbquo; | &#8218; | &#x201A; | ‚ | ‚ |
left double quotation mark | &ldquo; | &#8220; | &#x201C; | “ | “ |
right double quotation mark | &rdquo; | &#8221; | &#x201D; | ” | ” |
double low-9 quotation mark | &bdquo; | &#8222; | &#x201E; | „ | „ |
dagger | &dagger; | &#8224; | &#x2020; | † | † |
double dagger | &Dagger; | &#8225; | &#x2021; | ‡ | ‡ |
per mille sign | &permil; | &#8240; | &#x2030; | ‰ | ‰ |
single left-pointing angle quotation mark | &lsaquo; | &#8249; | &#x2039; | ‹ | ‹ |
single right-pointing angle quotation mark | &rsaquo; | &#8250; | &#x203A; | › | › |
euro sign | &euro; | &#8364; | &#x20AC; | € | € |
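One simple way to support such entities is a name-to-code-point lookup table consulted while expanding references. The following is only a sketch of that idea, not the way ESPX does it: the ENTITY_CODE_POINTS table and the expandReferences() function are names invented for this example, and only a handful of entries are included.

```javascript
// Sketch only: expand numeric and named references with a lookup table.
// Hypothetical names; a real table would list every supported entity.
var ENTITY_CODE_POINTS = {
  amp: 38, lt: 60, gt: 62, quot: 34, apos: 39,
  nbsp: 160, copy: 169, eacute: 233, euro: 8364
};

function expandReferences(text) {
  return text.replace(/&(#x[0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);/g,
    function (match, body) {
      var code;
      if (body.charAt(0) === "#") {
        // numeric reference: hexadecimal (&#x...;) or decimal (&#...;)
        code = (body.charAt(1) === "x")
          ? parseInt(body.substring(2), 16)
          : parseInt(body.substring(1), 10);
      } else if (Object.prototype.hasOwnProperty.call(ENTITY_CODE_POINTS, body)) {
        code = ENTITY_CODE_POINTS[body];
      } else {
        return match; // unknown name: left untouched in this sketch
      }
      // String.fromCharCode() only covers code points up to U+FFFF,
      // which is enough for every entity listed in this section.
      return String.fromCharCode(code);
    });
}

var expanded = expandReferences("caf&eacute; &copy; 5&euro;");
```

Because the replacement is done in a single pass, text such as &amp;lt; expands to &lt; and is not decoded a second time. A strict parser would report an unknown entity name as an error instead of leaving it in place.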
Limitations |
There is no documentation except this document and comments in the source code.
Also, and apart from bugs to discover, the implementation is in need of improvement in several areas, including:
throw/try ... catch at all, the code is sometimes too tricky and/or boring.
Tested user agents (to be updated regularly) |
These have been successfully tested with ESPX / TinyXSL:
Platform | Product name | Version(s) | Built-in XML support versions ? | ECMAScript implementation level used for ESPX |
---|---|---|---|---|
Mac | Microsoft Internet Explorer | 5.0 | 5.x and above | ??? |
Windows | Microsoft Internet Explorer | 4.x, 5.x | 5.x and above | JScript 3 in 4.x browsers (latest is JScript 5.5 (?)) |
Windows, Linux | Netscape Navigator | 4.x, 6.0 | 6.0 and above | JavaScript1.2 in 4.x browsers (latest is JavaScript1.5 (?)) |
Windows | Opera | 5.0 | ??? | JavaScript1.2 (?) |
Reporting bugs |
Please report bugs to me. When reporting bugs, please be sure to include easy-to-reproduce test cases for either IE 4.x or 5.x, or NS 4.x or 6.0. I'm also interested in feedback from testing on the Linux platform and with WMLScript, if applicable. Create a zip file containing all the necessary files, and attach it to your email.
Ideas, comments, suggestions for improvements, especially bug fixes, are always welcome, as usual. Thanks in advance.
March 13, 2002
Cyril Jandia