What Parser? IE 5 Versus Others


XML is – XSL is – present state of XML (MaML, SVG, BioML, etc.) – CML is – prior state of CML

Advantages of XML – single documents – comments on this document

Parsing XML as a tree

Browsers – IE5 as only compliant browser – use that

White space considerations

XSL  template sequence

Plugin is – applet is

How applets work

How each applet works – click on them!

SVG

FOP

Elements described as <element> attributes as @attribute

(beware element vs element! – assume XML element – like tag – unless declared as a periodic element)

(element and attributes names and values, when used, will be indicated by italics)

(the community is defined as the group using this language – since XML is extensible, additional markup can be added by anyone but only that markup, conventions etc that are used and accepted by the community are of any worth).

Emphasise extensibility of XML!

Separation of content from style

Watch out for careful use of information/data/etc

Introduction, problem, past solutions, html and plugins, applets, XML – what is – advantages, disadvantages, CML, previous state of development – CML DTD and simple examples, XML parsers and IE5- ChiMeraL demonstration - how does it work – cp various formats of different types of data and xml equivalents, deeper look at cml structure - xsl stylesheets, css stylesheets – local and server side transforms – DOM manipulation –Applets – JME, JMOL, MARVIN, SDA, JSPECT, format conversion.– namespaces - schema and how to link both to xml, - interactivity –Perl - future concepts ‘round trip’ – cml tool and understanding via applets – linking to databases and repositories – further development of cml syntax. Appendices – lists of terms, lists of elements – DTD – Schema and IE schema - suggested CML document structure – stylesheet templates – Perl scripts –

Browser support – lack of it

Document Layout

Title – avoid TLA! – keywords – Chemical Markup Language – Java Applets – XML – CML – development – IE5 – XSL – stylesheets – JavaScript

Title - Every paper for the Journal must be accompanied by a summary (50–250 words) setting out briefly and clearly the main objects and results of the work; it should give a reader a clear idea of what has been achieved. The summary should be essentially independent of the main text; however, names or partial names of compounds may be accompanied by the numbers referring to the corresponding displayed formulae in the body of the text. The summary of a paper reporting a crystal structure determination should make it clear that such a determination has been performed.

e.g.

The use of rhodium catalysts modified with bulky phosphabenzenes as π-acceptor ligands as highly efficient hydroformylation catalysts for terminal and internal alkenes is described. The crystal structure determination of a catalytically active phosphabenzene complex is reported. The synthesis of (Z)- and (E)-1-azido-1,4-diphenylbut-1-ene 6a and b is described and the products of their thermal decomposition are reported. The synthesis of 5H-pyrrolo[1,2-a]azepine 2 and of 7H-pyrrolo[1,2-a]azepin-7-one 3 via the common dihydroazepinone intermediate 11 is also described.

5.1.3 Introduction.—This should give clearly and briefly, with relevant references, both the nature of the problem under investigation and its background.

5.1.4 Results and Discussion.—It is usual for the results to be presented first, followed by a discussion of their significance. Only strictly relevant results should be presented, and figures, tables and equations should be used for purposes of clarity and brevity. The use of flow diagrams and reaction schemes is encouraged. Data must not be reproduced in more than one form, e.g. in both figures and tables, without good reason.

5.1.5 Experimental.— - probably not relevant – replace with descriptions of ‘bits’ and discussion

5.1.6 Acknowledgements.—Contributors other than co-authors may be acknowledged in a separate paragraph at the end of the paper; acknowledgements should be as brief as possible. Titles, Mr, Mrs, Miss, Dr, Professor, etc., should be given but not degrees.

5.1.7 Dedications.—Dedications are not permitted.

5.1.8 Bibliographic References.—These should be given on a separate sheet at the end of the manuscript; for details see Section 5.7.

5.1.9 Graphical Abstract.—A representative scheme or structural formulae should be given for the contents list. No more than one sentence of text may be used. The maximum space available is 4 × 9.5 cm. Authors are advised to consult a recent copy of the Journal for examples.

5.3.5 Headings.—(a) Main sections (Introduction, Experimental, Discussion, etc.): left-aligned, bold, initial capital letter for first word only, no final full stop.

(b) Main side-heading: bold, initial capital for first word only, no final full stop.

(c) Subsidiary side-heading: indented, bold, initial capital letter for first word only, final full stop, text run on from heading.

(d) Further subdivision: indented, italic, initial capital for first word only, final full stop, text run on from heading.

used, require an explanatory footnote.

Done

Papers

Chemcom

Sent in (HTML/XML versions) but no luck as yet - need to kick people

ChemInt 2000 proposal

Sent in as doc/html/XML versions

XML - CML 'structure'

Why need CML

What is XML – content and format

How differ from HTML

What is CML

Overall structures

namespaces

sorted I think - we must get the schema 'oked' and up to the xml-cml.org site since I can't link to it unless it's there! (there isn't much point to a schema if IE can't validate against it).

<cml xmlns="

This would define a default namespace. I think there can only be ONE in a document and they are deprecated by many people, so rewrite as:

<cml:cml xmlns:cml="

<molecule xmlns="

This inherits the namespace above. The namespace should be the same CML one. I do not think it a good idea if chimeral, egon, etc. have the same element names as CML and use different semantics. So:

<molecule> will inherit the namespace above. If you don't trust inheritance, write:

<cml:molecule>

</molecule>

<spectrum xmlns="

write <chimeral:spectrum

xmlns:chimeral="

</spectrum>

<reaction xmlns="

This is part of the CML namespace so should be

<cml:reaction>

</reaction>

</cml>

this should then be:

</cml:cml>

</cml>

</document>

>This sets up a wrapper xml - then defines cml within the existing

>CML 1.0 namespace - then molecule/spectrum etc. with our own

>version of CML and own namespace. This avoids us declaring our

>CML as *the* CML..
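The point about inheritance can be checked with a small Python sketch using the standard xml.etree.ElementTree parser. The namespace URI below is a placeholder invented for illustration (the real CML URI is elided in the notes above); the sketch only shows that a prefixed element and a child inheriting a default namespace resolve to the same qualified name.

```python
import xml.etree.ElementTree as ET

# Placeholder URI, assumed for illustration only; not the real CML namespace.
CML_NS = "http://example.org/cml-placeholder"

# The same document written with an explicit prefix and with a default namespace.
prefixed = f'<cml:cml xmlns:cml="{CML_NS}"><cml:molecule/></cml:cml>'
defaulted = f'<cml xmlns="{CML_NS}"><molecule/></cml>'


def child_names(xml_text):
    """Return the namespace-qualified tag of every child of the root element."""
    root = ET.fromstring(xml_text)
    return [child.tag for child in root]


# Both spellings give the parser the same qualified name for <molecule>.
print(child_names(prefixed))
print(child_names(defaulted))
```

So the choice between inheritance and explicit prefixes is about readability for humans; a namespace-aware parser sees the same element either way.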

Links and ids

Unique ids are used to link to and href used to link from – requires parser support

Wrapper

<?xml version="1.0"?>
<document>

Document represents any XML compliant document, this might contain CML, XHTML, MaML etc.

<cml title="" id="" xmlns="x-schema:cml_schema_ie_02.xml">

The default namespace should point to the CML schema. The cml element may then consist of a number of subelements - common ones would be molecule, reaction or chimeral:spectrum

</document>

Molecule

MDL ISIS Draw .skc file;

Since this is a program-specific format, it’s of little use for a web site. It is possible to export an image as a gif but, as explained before, such a solution loses all chemical information, can’t be machine read and has serious scaling problems. Hence a variety of chemical formats have been developed for the transfer and display of molecular structures. Perhaps the most commonly used for small and medium sized structures is the MDL mol format.

My preferred syntax

Legacy Transfer and Storage format

MDL .mol file; (cut down)

Row  Caffeine.mol
1    caffeine.mol
2    -ISIS- 04110015063D
3
4    24 25 0 0 0 0 0 0 0 0999 V2000
5    -2.8709 -1.0499 0.1718 C 0 0 0 0 0 0 0 0 0 0 0 0
6    -2.9099 0.2747 0.1062 N 0 0 0 0 0 0 0 0 0 0 0 0
7    -1.8026 0.9662 -0.1184 C 0 0 0 0 0 0 0 0 0 0 0 0
8    -0.6411 0.2954 -0.2316 C 0 0 0 0 0 0 0 0 0 0 0 0
9    -0.6549 -1.0889 -0.1279 C 0 0 0 0 0 0 0 0 0 0 0 0
10   -1.7352 -1.7187 0.0624 N 0 0 0 0 0 0 0 0 0 0 0 0
11   etc..
12   1 2 1 0 0 0 0
13   1 6 1 0 0 0 0
14   1 13 2 0 0 0 0
15   2 3 1 0 0 0 0
16   2 12 1 0 0 0 0
17   3 4 1 0 0 0 0
18   etc..
19   M END

-uses standard ASCII text

-fairly easy for human to read but requires knowledge of syntax

-most common small/medium molecule format (large molecules tend to use .pdb – more complex and not considered here)

-rows 1-3: allow commenting of the file – generally the first is a title or file name and the second gives its source

-row 4: the first number is the total number of atoms in the molecule, the second is the total number of bonds (note that hydrogens are often ignored)

-rows 5-10: list each atom in the molecule, along with its x y z Cartesian coordinates (in Å) and element type (remaining columns are rarely used)

-rows 12-17: list the bonds in the molecule; columns 1 and 2 refer to the atoms the bond is between and column 3 gives its bond order.

-Row 19: M END is a standard string indicating the end of the mol file.

-Mol files are extremely sensitive to white space; this can cause serious problems when parsing

-Note that a real mol file for caffeine would of course be much longer
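The row layout described above can be sketched in Python as follows. This is a minimal illustration, not a full mol reader: it splits fields on whitespace for readability, whereas a real mol file is column-based (hence the white space sensitivity noted above), and the two-atom sample is a hypothetical trimmed file built from the first two atoms of the caffeine listing.

```python
def parse_mol(text):
    """Parse the title, atom block and bond block of a small mol file."""
    lines = text.splitlines()
    title = lines[0].strip()               # row 1: title or file name
    counts = lines[3].split()              # row 4: counts line
    n_atoms, n_bonds = int(counts[0]), int(counts[1])

    atoms = []
    for line in lines[4:4 + n_atoms]:      # atom block: x y z element ...
        x, y, z, element = line.split()[:4]
        atoms.append((float(x), float(y), float(z), element))

    bonds = []
    start = 4 + n_atoms
    for line in lines[start:start + n_bonds]:   # bond block: atom atom order ...
        first, second, order = line.split()[:3]
        bonds.append((int(first), int(second), int(order)))

    return title, atoms, bonds


# Hypothetical trimmed sample: two atoms and one bond from the listing above.
SAMPLE = "\n".join([
    "caffeine.mol",
    "-ISIS- 04110015063D",
    "",
    "2 1 0 0 0 0 0 0 0 0999 V2000",
    "-2.8709 -1.0499 0.1718 C 0 0 0 0 0 0 0 0 0 0 0 0",
    "-2.9099 0.2747 0.1062 N 0 0 0 0 0 0 0 0 0 0 0 0",
    "1 2 1 0 0 0 0",
    "M END",
])

title, atoms, bonds = parse_mol(SAMPLE)
print(title, atoms, bonds)
```

Whitespace splitting breaks as soon as two fields run together, which is the parsing problem the notes warn about; a production reader would slice fixed columns instead.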

Display of mol files;

MDL Chemscape Chime and Rasmol are able to display a large number of standard molecule formats as rotatable 3d objects;

Proposed CML Transfer and Storage format

-Perl (a powerful text search and replace language) was used to write a converter able to take any MDL mol file and convert the structural information into valid XML. Converting the above file yields;

XML file (cut down)

Caffeine.xml

<?xml version="1.0"?>
<document>
  <cml title="caffeine_3d" id="cml_caffeine_3d">
    <molecule title="caffeine_3d" id="mol_caffeine_3d">
      <list title="atoms">
        <atom id="caffeine_3d_a_1">
          <integer builtin="atomId">1</integer>
          <float builtin="x3" units="A">-2.8709</float>
          <float builtin="y3" units="A">-1.0499</float>
          <float builtin="z3" units="A">0.1718</float>
          <string builtin="elementType">C</string>
        </atom>
        <atom id="caffeine_3d_a_2">
          <integer builtin="atomId">2</integer>
          <float builtin="x3" units="A">-2.9099</float>
          <float builtin="y3" units="A">0.2747</float>
          <float builtin="z3" units="A">0.1062</float>
          <string builtin="elementType">N</string>
        </atom>
        etc..
      </list>
      <list title="bonds">
        <bond id="caffeine_3d_b_1">
          <integer title="bondId">1</integer>
          <integer builtin="atomRef">1</integer>
          <integer builtin="atomRef">2</integer>
          <integer builtin="order" convention="MDL">1</integer>
        </bond>
        <bond id="caffeine_3d_b_2">
          <integer title="bondId">2</integer>
          <integer builtin="atomRef">1</integer>
          <integer builtin="atomRef">6</integer>
          <integer builtin="order" convention="MDL">1</integer>
        </bond>
        etc..
      </list>
    </molecule>
  </cml>
</document>

-as can be seen, the XML is much longer than the original mol – terseness is of minimal importance in XML

-all data is fully marked up and explicitly defined – no prior knowledge of the mol format is needed, only knowledge of standard XML conventions.

-The XML file has a tree structure, with each component being a discrete entity. Components can be added or removed without destroying the tree, now or at some point in the future. For example, a large collection of mol files can be easily converted to CML and then concatenated into a single file. Searches, queries and comparisons within this file can then be carried out with XSL stylesheets. Other XML marked-up information can be added, for example CML spectra or reactions, XHTML text or MaML calculations.

-Note that each CML component has a unique id. Components can be linked or queried by this id – for example an atom mapping would link the ids of molecule A’s atoms to the ids of molecule B’s

-Full analysis of CML syntax will follow
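The converter described above was written in Perl and is not reproduced in these notes. As a stand-in, here is a minimal Python sketch of the same idea: given atoms and bonds already read from a mol file, it builds a molecule element in the layout of Caffeine.xml. The function name and the tuple layout of its inputs are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET


def atoms_to_cml(name, atoms, bonds):
    """Build a <molecule> element following the layout shown in Caffeine.xml."""
    molecule = ET.Element("molecule", title=name, id=f"mol_{name}")

    atom_list = ET.SubElement(molecule, "list", title="atoms")
    for number, (x, y, z, element) in enumerate(atoms, start=1):
        # The id is document unique; atomId repeats the mol file numbering.
        atom = ET.SubElement(atom_list, "atom", id=f"{name}_a_{number}")
        ET.SubElement(atom, "integer", builtin="atomId").text = str(number)
        for axis, value in (("x3", x), ("y3", y), ("z3", z)):
            coord = ET.SubElement(atom, "float", builtin=axis, units="A")
            coord.text = f"{value:.4f}"
        ET.SubElement(atom, "string", builtin="elementType").text = element

    bond_list = ET.SubElement(molecule, "list", title="bonds")
    for number, (first, second, order) in enumerate(bonds, start=1):
        bond = ET.SubElement(bond_list, "bond", id=f"{name}_b_{number}")
        ET.SubElement(bond, "integer", title="bondId").text = str(number)
        ET.SubElement(bond, "integer", builtin="atomRef").text = str(first)
        ET.SubElement(bond, "integer", builtin="atomRef").text = str(second)
        order_el = ET.SubElement(bond, "integer", builtin="order", convention="MDL")
        order_el.text = str(order)

    return molecule


# First two atoms and first bond of the caffeine listing.
atoms = [(-2.8709, -1.0499, 0.1718, "C"), (-2.9099, 0.2747, 0.1062, "N")]
bonds = [(1, 2, 1)]
molecule = atoms_to_cml("caffeine_3d", atoms, bonds)
print(ET.tostring(molecule, encoding="unicode"))
```

Building the tree through an XML library rather than by string concatenation means the output is always well formed, which matters once many converted files are concatenated into one document.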

Parsing and display of CML

-full descriptions of CML chimeral will follow

-CML can be stylesheet converted to any HTML or text format. This being the case, it becomes possible to convert from CML back to an older format and hence display that information with existing tools. In our case we chose to use Java applets (since they are platform independent and tend to be more flexible than plugins). Since the stylesheet can select only the information necessary from the XML document, a single format can then be used to store and display structures, spectra, reactions or any other chemical information the author wishes.

Converting CML to a different legacy format .XYZ
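In these notes the conversion is done with an XSL stylesheet. The Python sketch below is only a stand-in to show the selection logic: it pulls the element type and x3/y3/z3 values out of each atom and writes them in XYZ layout (atom count, a comment line, then one line per atom). The function name and the trimmed two-atom input are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed two-atom molecule in the Caffeine.xml layout.
CML = """<molecule title="caffeine_3d" id="mol_caffeine_3d">
  <list title="atoms">
    <atom id="caffeine_3d_a_1">
      <integer builtin="atomId">1</integer>
      <float builtin="x3" units="A">-2.8709</float>
      <float builtin="y3" units="A">-1.0499</float>
      <float builtin="z3" units="A">0.1718</float>
      <string builtin="elementType">C</string>
    </atom>
    <atom id="caffeine_3d_a_2">
      <integer builtin="atomId">2</integer>
      <float builtin="x3" units="A">-2.9099</float>
      <float builtin="y3" units="A">0.2747</float>
      <float builtin="z3" units="A">0.1062</float>
      <string builtin="elementType">N</string>
    </atom>
  </list>
</molecule>"""


def cml_to_xyz(cml_text):
    """Select only the fields an XYZ file needs and ignore everything else."""
    molecule = ET.fromstring(cml_text)
    rows = []
    for atom in molecule.iter("atom"):
        element = atom.find("string[@builtin='elementType']").text
        coords = [atom.find(f"float[@builtin='{axis}']").text
                  for axis in ("x3", "y3", "z3")]
        rows.append(" ".join([element] + coords))
    header = [str(len(rows)), molecule.get("title", "")]
    return "\n".join(header + rows)


xyz = cml_to_xyz(CML)
print(xyz)
```

The resulting text is what would be handed to an applet such as Jmol; anything in the CML that the conversion does not ask for is simply skipped.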

Building applet code and displaying the converted CML using Jmol

Additional Chemical Information

-often there is a requirement to mark up information for which there is no specific element or attribute in the DTD. In these cases, general data elements are used and distinguished by unique title attributes. These data elements are: string, integer, float, stringArray, integerArray, floatArray and floatMatrix. Using these elements avoids having to define additional chemical XML (which can be done but could not be considered in the same namespace as the DTD). Authors should attempt to use these existing elements before defining their own.

For example, after converting caffeine.mol file, one might add the following information;

This information doesn’t come from the mol file or any other legacy format (while such information might well be supplied within comments, it is not identified and hence is only human readable). Instead it was extracted from various databases – in particular the ChemFinder online database. The author may decide to add any information he or she wishes in this way – simply by adding further data elements in appropriate places in the CML document. This markup will be ignored by tools or stylesheets that aren’t expecting it – hence a good reason to use standardised titles and syntax. Hence title=”melting point” might be expected whilst title=”mltpt” might not.

Caffeine.mol

..
<formula>C8 H10 N4 O2</formula>
<string title="CAS">58-08-2</string>
<string title="ACX">I1001269</string>
<string title="DOT">UN 1544</string>
<string title="RTECS">EV6475000</string>
<float title="molecule weight">194.19</float>
<float title="melting point" units="degC">238</float>
<float title="specific gravity">1.23</float>
<float title="water solubility" convention="g per 100 mL at 23 degC">1-5</float>
<string title="comments">White powder or white glistening needles usually melted together. LIGHT SENSITIVE</string>
<list title="alternate names">
  <string title="name">1,3,7-Trimethylxanthine</string>
  <string title="name">3,7-dihydro-1,3,7-trimethyl-1H-Purine-2,6-dione</string>
  <string title="name">1,3,7-Trimethyl-2,6-dioxopurine</string>
  <string title="name">7-Methyltheophylline</string>
  etc..
</list>
..

The ChiMeraL stylesheets expect to find the above elements and will display them as a simple HTML table.

As can be seen – the stylesheet also expects to find <float title="melting point" units="degC">238</float> but fails since this element isn’t in caffeine.xml – hence leaves the space blank. From this it can be seen that stylesheets need to be developed concurrently with the XML documents.
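The behaviour described here, where an expected title that is missing gives a blank cell and an unexpected title is ignored, can be imitated with a short Python sketch. It is not the ChiMeraL stylesheet; the fragment, the list of expected titles and the function name are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the melting point is stored under a non-standard title.
FRAGMENT = """<molecule title="caffeine" id="mol_caffeine">
  <string title="CAS">58-08-2</string>
  <float title="molecule weight">194.19</float>
  <float title="mltpt" units="degC">238</float>
</molecule>"""

# Titles the (imaginary) display code has been written to expect.
EXPECTED_TITLES = ["CAS", "molecule weight", "melting point"]


def table_rows(xml_text, titles):
    """Return (title, value) pairs, leaving the value blank when it is absent."""
    molecule = ET.fromstring(xml_text)
    by_title = {child.get("title"): (child.text or "") for child in molecule}
    return [(title, by_title.get(title, "")) for title in titles]


rows = table_rows(FRAGMENT, EXPECTED_TITLES)
for title, value in rows:
    print(f"{title}: {value}")
```

The value 238 is present in the data but never displayed, because it sits under "mltpt" rather than the expected "melting point"; this is the practical argument for standardised titles.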

<molecule title="" id="">
  <formula></formula>
  <!-- Molecule specific information should be included here as strings/floats/integers - add more as required, these are examples. Attribute 'title' is used to indicate the element's contents. -->
  <string title="CAS"></string>
  <string title="ACX"></string>
  <string title="DOT"></string>
  <string title="RTECS"></string>
  <float title="molecularweight"></float>
  <float title="meltingpoint" units="degC | K"></float>
  <float title="boilingpoint" units="degC | K"></float>
  <float title="specific gravity"></float>
  <float title="evaporation rate" units=""></float>
  <float title="flashpoint" units="degC | K"></float>
  <float title="vapor density" units=""></float>
  <float title="water solubility" convention="g/100 mL at 23 degC"></float>
  <list title="alternate names">
    <string title="name"></string>
  </list>
  <string title="comments"></string>
  <!-- Small molecular structures - this format is preferred but rather verbose. The value for <integer builtin="atomId"> would normally be that used in the MDL .mol format whereas id must be document unique. -->
  <list title="atoms">
    <!-- repeat -->
    <atom id="mol_a_1">
      <integer builtin="atomId">1</integer>
      <float builtin="x3" units="A"></float>
      <float builtin="y3" units="A"></float>
      <float builtin="z3" units="A"></float>
      <string builtin="elementType"></string>
      <integer builtin="hydrogenCount"></integer>
      <integer builtin="formalCharge"></integer>
    </atom>
    <!-- /repeat -->
  </list>
  <!-- 2D structures will use builtin="x2 | y2" but are otherwise the same -->
  <!-- Large molecular structures - this format is terse but much harder to format/refer to in XSL -->
  <atomArray>
    <stringArray title="label">a1 a2 a3 a4 a5 a6</stringArray>
    <stringArray builtin="elementType">C O H H H H</stringArray>
    <floatArray builtin="x3">-0.748 0.558 -1.293 -1.263 -0.699 0.716</floatArray>
    <floatArray builtin="y3">-0.015 0.420 0.202 0.754 -0.934 1.404</floatArray>
    <floatArray builtin="z3">0.024 -0.278 -0.901 0.600 0.609 0.137</floatArray>
    <integerArray builtin="formalCharge"></integerArray>
  </atomArray>
  <!-- Bond lists are used for small molecules - large ones will probably ignore bonds and calculate directly -->
  <list title="bonds">
    <!-- repeat -->
    <bond id="mol_b_1">
      <integer title="bondId">1</integer>
      <integer builtin="atomRef"></integer>
      <integer builtin="atomRef"></integer>
      <integer builtin="order"></integer>
    </bond>
    <!-- /repeat -->
  </list>
</molecule>

Spectrum

Describe JCAMP – CML based on it

*Background*

-Standard spectra format for computer readability.

-I have specs for IR (10), MS (11), NMR (12) - I'll have to go looking for more but these will do as a start.

-Files are ASCII text and use distinctive ## labelled data records (LDR).

*Map to XML*

-Need to map LDRs to appropriate XML elements. Currently no XML (CML) standard exists for spectra - need to develop something sensible.

-XML structure should be common over IR/UV/NMR/MS etc.

-We will only consider *uncompressed* JCAMP

-Mapping probably best done with Perl - this allows simple search and replace of text strings to element tags. Hence need to get a grounding in Perl.

-NO information should be lost on JCAMP->XML conversion.

*Problems*

-The possibility of user defined LDRs (##$MYLDR)

-The complexity of JCAMP 'standard' - 35 basic plus dozens of different LDRs for each area (IR, RAMAN, NMR, MS etc.). No way we can come up with a uniform element/sub element/attribute structure (particularly since these LDRs are likely to evolve).

*Suggest*

-Simply map each LDR directly to an <element>, e.g. ##TITLE= benzene becomes <TITLE>benzene</TITLE> and then place the entire thing inside <Spectra Convention="JCAMP 4.24">

-This will make a DTD impossible but I'm not sure we have much choice.

-The point would be to allow the XSL transforms of the data - ignoring elements the stylesheet doesn't understand. Hence, common LDRs like title, molform and the spectra information can be pulled out and formatted/sent to applets.
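The suggested one-to-one mapping can be sketched in Python (the notes propose Perl; Python is used here only for illustration). Each ##LABEL= value record becomes an element named after the label, wrapped in the Spectra element. Two details are assumptions of this sketch and not part of the proposal: labels containing characters that are not legal in XML names (such as the $ of user defined LDRs) are rewritten with an underscore and the original label is kept in an ldr attribute so that nothing is lost, and lines without a ## label are appended to the preceding record.

```python
import re
import xml.etree.ElementTree as ET

LDR_PATTERN = re.compile(r"^##([^=]+)=(.*)$")


def element_name(label):
    """Turn an LDR label into a legal XML element name (assumed convention)."""
    name = re.sub(r"[^A-Za-z0-9_-]", "_", label)
    if not name or name[0].isdigit() or name[0] == "-":
        name = "_" + name
    return name


def jcamp_to_xml(text):
    """Map every labelled data record to a child of <Spectra>."""
    spectra = ET.Element("Spectra", Convention="JCAMP 4.24")
    current = None
    for line in text.splitlines():
        match = LDR_PATTERN.match(line)
        if match:
            label = match.group(1).strip()
            name = element_name(label)
            current = ET.SubElement(spectra, name)
            if name != label:
                current.set("ldr", label)   # keep the original label
            current.text = match.group(2).strip()
        elif current is not None and line.strip():
            # Unlabelled line (e.g. a row of data): belongs to the previous LDR.
            current.text = (current.text + "\n" + line).strip()
    return spectra


# Hypothetical four-record sample, including one user defined LDR.
SAMPLE = "\n".join([
    "##TITLE= benzene",
    "##JCAMP-DX= 4.24",
    "##$MYLDR= user defined value",
    "##END=",
])

root = jcamp_to_xml(SAMPLE)
print(ET.tostring(root, encoding="unicode"))
```

A stylesheet can then pick out TITLE or the spectral data by element name and skip anything it does not recognise, as proposed above.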

What Parser? IE 5 Versus Others