A system for describing chemical formulas on the Web.

Overview of methods for describing chemical formulas

Popular science article. A comparative analysis of methods and systems for the digital representation of chemical formulas. The systems discussed are XyMTeX, SMILES, InChI, MDL Molfile and CharChem.

Introduction

Chemistry has a history spanning several hundred years, but only in recent years has the accumulated information been actively migrated to digital storage media. This raised the question of how to digitize chemical formulas. Computer science has developed many effective techniques for handling text, graphics, sound and video. Similar techniques have been developed for chemical formulas, but they are still evolving and have not yet settled into a stable form.

Few would dispute that the Internet has become the most promising source of information in our time. This applies above all to HTML, the technology that literally opens a window onto the world through web browsers. This article therefore focuses on techniques for displaying chemical formulas in web browsers.

Strictly speaking, HTML stands for HyperText Markup Language. But it has long been only the tip of an iceberg made up of many different information technologies, which together let a browser user easily find and use information from around the world. For brevity, I will use the term HTML in a broad sense from here on, covering all the features of modern browsers.

Primitive methods of representing chemical formulas

Unfortunately, the quality of chemical formulas on most sites is still quite low. Plain text is often used, and you frequently see notation such as CH3CH2OH. With markup you can get CH₃CH₂OH, which looks much better. In HTML this requires a more complicated description: CH<sub>3</sub>CH<sub>2</sub>OH. Because of the extra effort, not everyone bothers.
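
The markup step is mechanical enough to automate. As a minimal sketch (the function name formula_to_html is made up for this illustration, and it handles only plain digit indices, not charges or hydrates), a short Python function can wrap each index in <sub> tags:

```python
import re

def formula_to_html(formula: str) -> str:
    """Wrap digits that follow an element symbol or ')' in <sub> tags."""
    # The lookbehind skips leading coefficients such as the 2 in "2KNO3".
    return re.sub(r"(?<=[A-Za-z)])(\d+)", r"<sub>\1</sub>", formula)

print(formula_to_html("CH3CH2OH"))
```

For CH3CH2OH this prints the same markup as shown above. The result is still only presentation: the page knows nothing about the chemistry behind the string.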

Plain text, however, cannot describe a structural formula. This brings us to raster images, which is how most formulas on the Internet are presented.

More advanced sites use images produced by molecular editors. Reputable reference resources such as PubChem and ChemSpider use special software that automatically builds an image from a formal description of the molecule. This brings us close to the subject of the article: methods for the formalized description of chemical formulas.

Let me briefly list the main disadvantages of raster images:

  • Pictures always take up a fairly large amount of memory, which wastes processor power, memory, Internet traffic and disk space.
  • Editing an image is much harder than editing text. Photoshop is much harder to master than a text editor. And even with a molecular editor, pictures are still stored separately from the text, so it takes extra effort to make sure the right picture appears in the right place.
  • Raster images scale poorly. For example, if you enlarge a small picture, it turns into a grid of squares.
  • Images cannot be analyzed. It is impossible to determine the molecular mass or the name of a substance from an image alone. Such information has to be stored separately, which can be an additional source of errors.

Why are they so popular despite so many flaws? Simply because for a long time there was no alternative. Old HTML could display only text or bitmaps. But with the advent of HTML5 and SVG, everything changed.

Vector graphics are much more promising. For example, formulas in SVG format can be found on Wikipedia. But they have not yet come into wide use.

Systems for chemical formulas description

Now we come to the main part of the article. As soon as there was a need to store formulas in databases, methods for doing so appeared. Let us consider a few of the most common systems.

XyMTeX

XϒMTeX is based on the highly regarded typesetting system TeX.

Formulas are described using text commands. Special programs convert the source code into PDF or PostScript. For example, this description
{\red \bzdrv{1=={\blue OH};4=={\green NO$_{2}$}}}
produces this image:
$color(blue)OH$color(red)|\||`/<|$itemColor(green)NO2>`\\`|//
It is fairly easy to see that the \bzdrv command draws the benzene ring and that functional groups are attached to positions by number: 1 is the top, 4 is the bottom.

In fact, XyMTeX is a very extensive set of macros (such as \bzdrv) that extend the basic TeX command set (such as \red).

Here are a few examples to show the same substance:
\bzdrv{1==OH} \bzdrv{2==OH} \bzdrv[r]{1==OH} \bzdrv[l]{1==OH} \bzdrv[A]{1==OH}
OH|\||`/`\\`|// ||`/`\\`|//\/OH OH|\||`/`\\`|// OH|\\|`//`\`||/ OH|\|`/`\`|/_o
Here's a more complex structure:
\nonaheterovi[di]{5s==\cyclopropanev{2==(yl)}}% {2SB==CH$_{3}$;2SA==CH$_{2}$OH;3B==OH;4==CH$_{3}$;6SB==CH$_{3}$;6SA==HO;7D==O}
`//<|CH3>`\<_(x-1)_q3_q3>`|<`-dHO><_(A-120,w+)C`H3>/`|O|\//\<_(A-60,w+)CH3><_(x1,d+)CH2OH>|<\wOH>_#1`|

Summary

The TeX platform has another feature: you can define your own macros. This automates repetitive operations and makes working with a document more efficient.
An undoubted advantage of the system is high-quality print output, which makes XyMTeX very useful for typesetting chemistry books.
However, this way of representing chemical formulas is not widely used for publishing on the Internet.
Unlike the other systems discussed in this article, XyMTeX is intended only for rendering formulas. The molecular weight or molecular formula cannot be calculated from its descriptions.
Furthermore, XyMTeX is quite hard to learn and requires knowledge of the underlying TeX platform. Beginners should not expect to produce a couple of formulas quickly without studying the documentation: for each type of chemical structure you need to find the appropriate macro and study its options.

SMILES

SMILES is a distinctive system for describing chemical structures with short text strings. Its principles were developed in the 1980s. Over the past three decades, SMILES has acquired a host of improved versions and software implementations.

The system is so simple that its basic principles can be summarized in a few sentences.

Atoms are represented by the standard symbols of the chemical elements, in square brackets, for example [Na]. The brackets can be omitted for the "organic subset": B, C, N, O, P, S, F, Cl, Br and I. Hydrogen atoms can be omitted; they are added automatically.
You can specify an isotope, as in [235U] (for ²³⁵U), and a charge, as in [Fe+3] (for Fe³⁺).
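
To make the bracket notation concrete, here is a small Python sketch that picks such an atom apart. It is only an illustration under my own assumptions: the name parse_bracket_atom is invented, and the pattern covers just the isotope, symbol and charge shown above, not hydrogen counts, chirality or atom classes.

```python
import re

# Isotope number, element symbol, optional charge such as "+", "-" or "+3".
BRACKET_ATOM = re.compile(
    r"\[(?P<isotope>\d+)?(?P<symbol>[A-Z][a-z]?)(?P<charge>[+-]\d*)?\]"
)

def parse_bracket_atom(text):
    """Return (isotope or None, element symbol, integer charge)."""
    m = BRACKET_ATOM.fullmatch(text)
    if m is None:
        raise ValueError(f"not a supported bracket atom: {text!r}")
    isotope = int(m["isotope"]) if m["isotope"] else None
    charge = 0
    if m["charge"]:
        sign = 1 if m["charge"][0] == "+" else -1
        charge = sign * int(m["charge"][1:] or 1)  # a bare sign means 1
    return isotope, m["symbol"], charge

for atom in ("[Na]", "[235U]", "[Fe+3]", "[O-]"):
    print(atom, parse_bracket_atom(atom))
```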

Single bonds need not be written. The description BrCCO corresponds to Br-CH2-CH2-OH or Br\/\OH.
Double and triple bonds are indicated by the symbols = and #. For example, C=CC is H2C=CH-CH3, and C#CC is HC≡C-CH3.

Branches are described with parentheses. For example, CC(C)O is /(*`|*)\OH, and CC(=O)O is /`|O|\OH.

Rings are described with numeric labels:

C1CCCCC1 /\|`/`\`|
C1=CC=C2C=CC=CC2=C1 `//`\`||/\\|\//`|`\\`/
C1CC2=CC=CC=C2OC1C3=CC=CC=C3 /\//\||`/`\\<`|>`/O`\<`|>`/||`/`\\`|//\; $color(gray)@:N(n,a,l:.4)#&n_(A&a,L&l,N0)"&n"@(1,-135); @N(2,-90); @N(3,-90); @N(4,-90); @N(5,-45); @N(6,45); @N(7,90); @N(8,90); @N(9,90,.7); @N(10,90,.5); @N(11,-90); @N(12,45); @N(13,90); @N(14,135); @N(15,-135); @N(16,-90)
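
The rules listed so far (implicit hydrogens, bond symbols, branches and ring labels) are enough to compute a molecular formula from a description. The following Python sketch shows the idea. It is a toy under clearly stated assumptions, not a real SMILES parser: it accepts only the bracket-free organic subset written with uppercase symbols, uses a single default valence per element, and the name molecular_formula is mine.

```python
import re
from collections import Counter

# Usual valences for the bracket-free "organic subset" (lowest valence only).
VALENCE = {"B": 3, "C": 4, "N": 3, "O": 2, "P": 3, "S": 2,
           "F": 1, "Cl": 1, "Br": 1, "I": 1}
BOND_ORDER = {"-": 1, "=": 2, "#": 3}
TOKEN = re.compile(r"Cl|Br|[BCNOPSFI]|[-=#()]|\d")

def molecular_formula(smiles: str) -> str:
    atoms, used = [], []      # element symbol and used valence per atom
    prev, order = None, 1     # previous atom index, pending bond order
    branches, rings = [], {}  # branch stack, open ring labels
    pos = 0
    while pos < len(smiles):
        match = TOKEN.match(smiles, pos)
        if match is None:
            raise ValueError(f"unsupported SMILES syntax at position {pos}")
        token, pos = match.group(), match.end()
        if token in BOND_ORDER:
            order = BOND_ORDER[token]
        elif token == "(":
            branches.append(prev)         # remember where the branch starts
        elif token == ")":
            prev = branches.pop()         # return to the branch point
        elif token.isdigit():
            if token in rings:            # second occurrence closes the ring
                other, opened = rings.pop(token)
                bond = max(order, opened)
                used[prev] += bond
                used[other] += bond
            else:                         # first occurrence opens it
                rings[token] = (prev, order)
            order = 1
        else:                             # an atom symbol
            atoms.append(token)
            used.append(order if prev is not None else 0)
            if prev is not None:
                used[prev] += order
            prev, order = len(atoms) - 1, 1
    counts = Counter(atoms)
    # Implicit hydrogens fill whatever valence the written bonds leave free.
    counts["H"] = sum(max(VALENCE[a] - u, 0) for a, u in zip(atoms, used))
    # Hill order: C, then H, then the remaining elements alphabetically.
    symbols = sorted(s for s in counts if counts[s] > 0)
    if "C" in symbols:
        symbols.sort(key=lambda s: (s != "C", s != "H", s))
    return "".join(s + (str(counts[s]) if counts[s] > 1 else "") for s in symbols)

for s in ("CCO", "BrCCO", "C=CC", "CC(=O)O", "C1CCCCC1"):
    print(s, molecular_formula(s))
```

For CCO this gives C2H6O, and for C1CCCCC1 it gives C6H12. This is exactly the kind of analysis that a picture of the same molecule cannot provide.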

SMILES can describe not only individual molecules but also chemical reactions. For example, the notation

[I-].[Na+].C=CCBr>>[Na+].[Br-].C=CCI
indicates
I^- + Na^+ + H2C=CH-CH2Br -> Na^+ + Br^- + H2C=CH-CH2I
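
A reaction string of this kind is easy to take apart: ">" separates reactants, agents and products, and "." separates the individual species. The Python sketch below splits the example and checks that the heavy atoms balance. The function names are invented for the illustration, and the element counting is deliberately naive: it ignores implicit hydrogens and would miscount aromatic lowercase atoms.

```python
import re
from collections import Counter

def split_reaction(reaction):
    """Split 'reactants>agents>products' into three lists of species."""
    parts = reaction.split(">")
    if len(parts) != 3:
        raise ValueError("expected exactly two '>' characters")
    return [part.split(".") if part else [] for part in parts]

def heavy_atoms(species):
    """Count element symbols written explicitly in the given species."""
    counts = Counter()
    for s in species:
        counts.update(re.findall(r"[A-Z][a-z]?", s))
    return counts

reactants, agents, products = split_reaction(
    "[I-].[Na+].C=CCBr>>[Na+].[Br-].C=CCI"
)
print(reactants, agents, products)
print(heavy_atoms(reactants) == heavy_atoms(products))
```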

Classic SMILES allows the same molecule to be described in different ways, for example BrCC and CCBr.
However, there is a special version, Canonical SMILES, which yields a unique description of a molecule. Here the canonical form is CCBr. This is very useful when you want to look up a substance in a database by its description.

SMILES always works only with structural formulas. As a result, its descriptions of inorganic molecules and reaction equations look strange and redundant:

H2SO4 OS(=O)(=O)O
K2Cr2O7 [O-][Cr](=O)(=O)O[Cr](=O)(=O)[O-].[K+].[K+]
K3[Fe(CN)6] [C-]#N.[C-]#N.[C-]#N.[C-]#N.[C-]#N.[C-]#N.[K+].[K+].[K+].[Fe+3]

A SMILES description is written on a single line, with no spaces.

Summary

SMILES is the most popular system for describing molecules. Most software for processing chemical data supports this format.
The system is quite simple. A trained person can easily read and write descriptions of simple molecules without software. For large molecules, however, it is hard to manage without a molecular editor.
SMILES is intended mainly for the structural formulas of organic molecules, so it is much less convenient for inorganic chemistry.
The graphical appearance of a formula depends on the visualization algorithm. It is impossible to depict phenol in several ways, as was demonstrated for XyMTeX, so the use of SMILES for printing is very limited. Nor can you use color or add comments or numbering.

The IUPAC International Chemical Identifier (InChI)

This format for describing molecules is an IUPAC standard, which is a strong argument in its favor.
Its basic principle is that each substance has one and only one description.
Modern chemical databases may contain millions of records describing substances, so efficient search methods are required. InChI is designed to solve this problem.

InChI descriptions, like SMILES, are written as a text string and describe only the structural formula. The positions of atoms in space are likewise determined entirely by the visualization algorithm. There the similarity ends.

Here are some examples:
InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3 @:M(t,a,l:.4)<_(A&a,L&l,N0)$itemColor1(gray)"&t">@; CH3@:Mh(t)@M(&t,-90,.6)@(1)-CH2@Mh(2)-OH@Mh(3)
InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H \@M(2,-45)||@M(1,45)`/@M(3,90)`\\@M(5,135)`|@M(6,-135)//@M(4,-90)
InChI=1S/C6H8O2/c7-5-1-2-6(8)4-3-5/h1-4H2 `/@M(1,45)`-@M(2,135)`\@M(6,-115,.6)`-O@M(8,-115,.6)-/@M(4,-135)-@M(3,-45)\@M(5,-55,.6)=O@M(7,-55,.6)

Every InChI description is divided into layers separated by slash characters. The first layer is always the header, and the second is the molecular formula. The remaining layers (or sublayers) each begin with a specific letter. The prefix "c" marks the layer that defines the connections between atoms (except hydrogens), and "h" describes how many hydrogen atoms are attached to each of the other atoms. Other prefixes appear only when needed; they describe charges, chirality, isotopes and other properties of the molecule.
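
The layer structure can be seen by simply splitting the string on slashes. The Python sketch below does just that. It is a minimal illustration, not an InChI parser: the name parse_inchi is mine, the layers are returned as raw text, and if a prefix letter occurred twice the later layer would overwrite the earlier one.

```python
def parse_inchi(inchi):
    """Split an InChI string into its header, formula and prefixed layers."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("an InChI string must start with 'InChI='")
    version, formula, *rest = inchi[len(prefix):].split("/")
    layers = {"version": version, "formula": formula}
    for layer in rest:
        if layer:                      # skip empty layers, if any
            layers[layer[0]] = layer[1:]
    return layers

print(parse_inchi("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"))
```

For the ethanol example above, the "c" layer comes out as 1-2-3 (atom 1 bonded to atom 2, atom 2 to atom 3) and the "h" layer as 3H,2H2,1H3 (one hydrogen on atom 3, two on atom 2, three on atom 1).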

Correct numbering of the atoms is a very important step, but the rules are quite complex and take substantial effort to understand.

I should also mention the InChIKey.
An InChI string can be quite long for complex molecules, which makes it inefficient for database searches. But it is possible to generate a fixed-length hash key of 27 characters for each formula; its first block of 14 characters encodes the connectivity of the molecule.
Thus, a compact key can be obtained from the structural formula for each of the millions of compounds.
This is the InChIKey for benzene: UHOVQNZJYSORNB-UHFFFAOYSA-N.
And this is the InChIKey for vitamin B12 (whose mass is 17 times greater): RMRCNWBMXRMIRW-WYVZQNDMSA-L
The inverse transformation from a key back to the InChI description is impossible.
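
Both keys above have the same fixed layout: blocks of 14, 10 and 1 uppercase letters joined by hyphens. Because the length never changes, a key is easy to validate and to index. A small Python sketch (the helper names are invented for this illustration; it checks only the shape of the key, not whether it corresponds to a real compound):

```python
import re

# 14 letters, hyphen, 10 letters, hyphen, 1 letter: 27 characters in total.
INCHIKEY = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")

def looks_like_inchikey(key):
    """Check the shape of the key only."""
    return INCHIKEY.fullmatch(key) is not None

def first_block(key):
    """Return the first 14-letter block of a well-formed key."""
    if not looks_like_inchikey(key):
        raise ValueError(f"malformed InChIKey: {key!r}")
    return key[:14]

for key in ("UHOVQNZJYSORNB-UHFFFAOYSA-N", "RMRCNWBMXRMIRW-WYVZQNDMSA-L"):
    print(len(key), looks_like_inchikey(key), first_block(key))
```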

Summary

From the outset, InChI was narrowly focused on identification and database search, and it solves these problems successfully.
The InChI text is too complicated for a person to write by hand in a text editor, so molecular editors are the best way to create and modify such descriptions.

MDL Molfiles

This format is supported by most molecular editors. In addition, MOL files can be downloaded from online reference resources such as ChemSpider and ChEBI.

There is an extended version of this format that can carry additional information. Such files usually have the SDF extension and can be downloaded from PubChem.

Consider the description of the following molecule:
O^-`\N^+`|O|`/|`//N`\`||<`\wHO>/\\
  5-nitropyridin-3-ol
  ACD/Labs11211323262D

 10 10  0  0  1  0  0  0  0  0  2 V2000
   15.8780   -7.4592    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   15.8780   -8.7892    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   14.7261   -6.7942    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   14.7261   -9.4542    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   13.5743   -7.4592    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   13.5743   -8.7892    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   12.4225   -6.7942    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   17.0298   -6.7942    0.0000 N   0  3  0  0  0  0  0  0  0  0  0  0
   18.1816   -7.4592    0.0000 O   0  5  0  0  0  0  0  0  0  0  0  0
   17.0298   -5.4642    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
  5  7  1  1  0  0  0
  9  8  1  0  0  0  0
 10  8  2  0  0  0  0
  1  8  1  0  0  0  0
M  CHG  2   8   1   9  -1
M  END
Rows    Description
1       Header
2       Name of the molecular editor and date
3       Comment (empty)
4       Counts of atoms, bonds, ...; format version V2000
5-14    One atom per line: x, y, z coordinates and the element symbol.
15-24   One bond per line: indexes of the bonded atoms and the bond multiplicity.
And here, for comparison, is the description of the same substance in the other systems:
SMILES    O=[N+]([O-])c1cncc(O)c1
InChI     InChI=1/C5H4N2O3/c8-5-1-4(7(9)10)2-6-3-5/h1-3,8H
CharChem  O^-`\N^+`|O|`/|`//N`\`||<`\wHO>/\\
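
Although the MOL format is aimed at software, its row layout is regular enough to read with a few lines of Python. The sketch below parses the file shown above. It is an illustration under assumptions of my own: the name parse_molfile is invented, and the fields are split on whitespace for brevity, whereas the real format uses fixed-width columns that can run together in large molecules.

```python
from collections import Counter

MOLFILE = """\
  5-nitropyridin-3-ol
  ACD/Labs11211323262D

 10 10  0  0  1  0  0  0  0  0  2 V2000
   15.8780   -7.4592    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   15.8780   -8.7892    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   14.7261   -6.7942    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   14.7261   -9.4542    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   13.5743   -7.4592    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   13.5743   -8.7892    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   12.4225   -6.7942    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   17.0298   -6.7942    0.0000 N   0  3  0  0  0  0  0  0  0  0  0  0
   18.1816   -7.4592    0.0000 O   0  5  0  0  0  0  0  0  0  0  0  0
   17.0298   -5.4642    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
  5  7  1  1  0  0  0
  9  8  1  0  0  0  0
 10  8  2  0  0  0  0
  1  8  1  0  0  0  0
M  CHG  2   8   1   9  -1
M  END
"""

def parse_molfile(text):
    lines = text.splitlines()
    counts = lines[3].split()                 # row 4: the counts line
    n_atoms, n_bonds = int(counts[0]), int(counts[1])
    atoms = []
    for line in lines[4:4 + n_atoms]:         # x, y, z, element symbol
        x, y, z, symbol = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        first, second, order = (int(v) for v in line.split()[:3])
        bonds.append((first, second, order))
    charges = {}
    for line in lines[4 + n_atoms + n_bonds:]:
        if line.startswith("M  CHG"):         # pairs of atom index, charge
            fields = line.split()[3:]
            for atom, charge in zip(fields[0::2], fields[1::2]):
                charges[int(atom)] = int(charge)
    return {"name": lines[0].strip(), "atoms": atoms,
            "bonds": bonds, "charges": charges}

mol = parse_molfile(MOLFILE)
print(mol["name"], Counter(symbol for symbol, *_ in mol["atoms"]), mol["charges"])
```

The "M  CHG" row, which the table above does not cover, assigns charge +1 to atom 8 (the nitro nitrogen) and -1 to atom 9 (one of its oxygens).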

Summary

The MOL format stores the chemical and the graphical properties of a molecule together.
A MOL file is larger than the equivalent descriptions in the other systems by an order of magnitude or more.
Although the data are stored as text, the MOL format is aimed at molecular editors, not at humans.

CharChem

And of course, as the author of CharChem, I am going to tell you why it was necessary to invent yet another system.

Currently CharChem is the only system that can display chemical formulas in HTML directly from their descriptions. No additional software needs to be installed; all you need is a web browser.

But why not take an existing system and develop an HTML renderer for it? Because CharChem aims to make describing chemical formulas much easier than with any of the systems above.

Let us start with the fact that many substances and reaction equations are written with rational formulas. Consider what description is required for this reaction:

Cr₂O₃ + 2KNO₃ → K₂Cr₂O₇ + 2NO↑
System                  Description
CharChem                Cr2O3 + 2KNO3 -> K2Cr2O7 + 2NO"|^"
TeX (not even XyMTeX)   Cr$_{2}$O$_{3}$ + 2KNO$_{3}$ → K$_{2}$Cr$_{2}$O$_{7}$ + 2NO↑
HTML                    Cr<sub>2</sub>O<sub>3</sub> + 2KNO<sub>3</sub> → K<sub>2</sub>Cr<sub>2</sub>O<sub>7</sub> + 2NO↑
SMILES                  O=[Cr]O[Cr]=O.[N+](=O)([O-])[O-].[K+]>>[O-][Cr](=O)(=O)O[Cr](=O)(=O)[O-].[K+].[K+].[N]=O

As you can see, the CharChem description is the simplest and shortest. The HTML and TeX descriptions carry no chemical information, and SMILES does not allow coefficients.
By the way, CharChem includes a module that calculates the coefficients automatically.
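
To show what calculating coefficients involves, here is a small Python sketch that balances the reaction above by brute force. This is not how CharChem does it; it is my own illustration with invented function names, it understands only flat formulas without parentheses or charges, and it simply tries small integer coefficients until the atom counts on both sides match.

```python
import re
from collections import Counter
from itertools import product

def element_counts(formula):
    """Count atoms in a flat formula such as 'K2Cr2O7' (no parentheses)."""
    counts = Counter()
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] += int(number) if number else 1
    return counts

def balance(reactants, products, max_coefficient=10):
    """Return the first tuple of coefficients that balances the equation."""
    species = [element_counts(f) for f in reactants + products]
    n_left = len(reactants)
    for coeffs in product(range(1, max_coefficient + 1), repeat=len(species)):
        left, right = Counter(), Counter()
        for i, (k, counts) in enumerate(zip(coeffs, species)):
            side = left if i < n_left else right
            for element, n in counts.items():
                side[element] += k * n
        if left == right:
            return coeffs
    return None

print(balance(["Cr2O3", "KNO3"], ["K2Cr2O7", "NO"]))
```

The search returns (1, 2, 1, 2), that is, Cr2O3 + 2KNO3 -> K2Cr2O7 + 2NO. A practical implementation would solve the linear system instead of enumerating, but the enumeration makes the balance condition easy to see.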

Now consider how structural formulas are described. SMILES is the easiest to understand. To describe ethyl alcohol it seems enough to write CCO: three characters. What could be simpler?
In CharChem you need to write four characters: /\OH or \/OH. But such notation is clear even to someone who knows chemistry but knows nothing about systems for the formalized description of molecules. You can also write CH3-CH2-OH or C2H5OH.

Like XyMTeX, CharChem allows the same molecule to be described in different ways:

XyMTeX \bzdrv{2==OH} \bzdrv[l]{1==OH} \bzdrv[A]{1==OH}
Formula ||`/`\\`|//\/OH OH|\\|`//`\`||/ OH|\|`/`\`|/_o
CharChem ||`/`\\`|//\/OH OH|\\|`//`\`||/ OH|\|`/`\`|/_o
At first glance, the XyMTeX descriptions may look friendlier than the CharChem ones. But first you need to study the manual to learn that benzene is denoted by bzdrv and what options it takes. And what if the substance is not benzene but cyclohexane, or something entirely unfamiliar? Then it is back to the manual.
In CharChem it is enough to remember that the stick characters draw a line from left to right (or top to bottom), and that a preceding backtick reverses the direction: right to left (or bottom to top). This is enough to draw a wide variety of structures.
To "read" a description, you just follow the sticks as they move from point to point.
On the test bench you can enter a description and see the result.
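
The "follow the sticks" rule can be tried out in a few lines of Python. The sketch below is not CharChem's renderer. It is a toy that assumes unit-length bonds, a 30° slope for / and \, and a vertical | (these numbers are my assumption, chosen so that six sticks form a regular hexagon), and it understands nothing except the three stick characters and the backtick.

```python
import math

# Assumed directions in degrees; the y axis points down, as on a screen.
ANGLE = {"/": -30, "\\": 30, "|": 90}

def trace(description):
    """Return the list of points visited while following the sticks."""
    x = y = 0.0
    points = [(x, y)]
    reverse = False
    for ch in description:
        if ch == "`":                 # the backtick flips the next stick
            reverse = True
            continue
        if ch not in ANGLE:
            raise ValueError(f"unsupported character: {ch!r}")
        angle = math.radians(ANGLE[ch])
        dx, dy = math.cos(angle), math.sin(angle)
        if reverse:
            dx, dy = -dx, -dy
        x, y = x + dx, y + dy
        points.append((x, y))
        reverse = False
    return points

ring = trace(r"/\|`/`\`|")            # the cyclohexane description from the SMILES section
print(len(ring) - 1, "bonds, end point:", ring[-1])
```

Following the six sticks of the cyclohexane description brings the pen back to its starting point (up to rounding error), which is why the ring closes without any explicit ring label.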

Here are some more complicated examples which demonstrate capabilities of CharChem:

Anthocyanin α-cyclodextrin C60 fullerene
@:Anthocyanin(R1,R2,R3,R4,R5,R6,R7)`||/<&R1>\\<&R2>|<&R3>`//`\`/|<&R4>`//`\`|`\`//<&R7>|<&R6>\\<&R5>/`|/O^+\\@; @:Tx(a,tx)<_(A&a,L.5,N0)$itemColor1(gray)"&tx">@; @Anthocyanin(@Tx(-150,3')`|$itemColor1(#F00){R1},@Tx(-90,4')/$itemColor1(#00F){R2},@Tx(-50,5')\$itemColor1(#0A0){R3},@Tx(90,3)\$itemColor1(#a243a2){R4},@Tx(40,5)|$itemColor1(#1d1d8f){R5},@Tx(90,6)`/$itemColor1(#00da00){R6},@Tx(-90,7)`\$itemColor1(#a14242){R7}) $L(.9)O\@:ACDg()_q6<_(a-60,d+)O_p6H>_p6<_(a-60,w+)O_p6H>_p6<_(a-120,L.7,w+)H><_p6<_(a-60,w+)_pO_(a85,L.7)H>_p6O_p6_(a0,L.7,w+)H>_q6@()O|@ACDg()O`/@ACDg()O`\@ACDg()O`|@ACDg()O/@ACDg() {}_(x2.92,y3.96,N0)$color(#999)_(x.92)_(x.32,y-.92)_(x-.76,y-.52)_(x-.76,y.52)_#2# $color(#666)_(x-.52,y.72,N2)_(x.48,y.84)_(x1.08,N2)_(x.44,y-.84)_(N2)#3; #4_(x.84,y-.24,N2)_(x.64,y.64)_(x-.32,y1.04,N2)_#10; #9_(x.64,y.4)_(x.96,y-.68)_#13; #12_(x.6,y-.44)_(x-.36,y-1.16)_(x-.8)<_#11>_(x-.84,y-.6,N2)_(x-.84,y.4)<=#5>_(x-.88,y-.36)_(x-.88,y.6,N2)# _(x.12,y.92)<=#6>_(x-.6,y.72)_(x.32,y1,N2)<_#7>_(x-.2,y.76)_(x.92,y.64)<_#8>_(x.52,y.6,N2)_(x1.36)=#14; #15_(x.64,y-.36,N2)_(x.44,y-1.24)=#16; #17_(x-.08,y-.72,N2)_(x-1.12,y-.8)_(x-.72,y.16,N2)<_#19># _(x-1.2)<_#21>_(x-.76,y-.16,N2)_(x-1.08,y.84)_(x-.04,y.72,N2)<_#22>_(x-.36,y1.16)<_#24>_(x-.36,y.68,N2)_(x.4,y1.24)<=#26># $color(#333)_(x.32,y.44)_(x1.28,y.92,N2)<_#28>_(x1.24,y-.16)_(x1.28,y.16)<_#29>_(x1.2,y-.96,N2)<_#30># _(x.24,y-1.16)_(x.52,y-1.16)<_#31>_(x-.52,y-1.44,N2)<_#32>_(x-1.08,y-.6)_(x-.88,y-.88)<_#33>_(x-1.56,N2)<_#36># _(x-.92,y.88)_(x-1.08,y.6)<_#37>_(x-.44,y1.48,N2)<_#40>_(x.52,y1.12)<_#42>$color(#000)_(x1.2,y-0.4,N2)_(x1.52,y1.08)<=#44># _(x1.52,y-1.12)<=#47>_(x-.6,y-1.76)<=#50>_(x-1.88)<=#53>_#57

If you look at the middle formula (α-cyclodextrin), you will notice that it consists of six identical fragments. A molecule of δ-cyclodextrin consists of nine such fragments. CharChem lets you define macros, so it is enough to describe a single unit; the rest are generated automatically.

The CharChem system has two main principles:

  1. Most chemical formulas can be described with simple constructs that resemble the notation most familiar to chemists.
  2. Simple constructs always have functional limitations that prevent them from describing more complex and rare formulas. Therefore, there are also more complex constructs that resemble XyMTeX macros.
Thus, the system combines simplicity, compactness and expressive power. In addition, the author hopes to extend CharChem's mathematical apparatus over time, so that it can solve various computational problems involving chemical formulas.

Summary

Formulas are drawn from text descriptions directly in the HTML page. No additional software needs to be installed.
The system has no narrow specialization: it handles both simple rational formulas and complex structural formulas. It can describe not only single substances but also reactions.
The system is easy to learn and use. Descriptions are fairly compact and can be created and edited as text, without a molecular editor.
Repetitive operations can be automated with macros.
So far, the image quality is not good enough for printing. However, this is a shortcoming of the renderer, not of the description language, and work on improving CharChem is ongoing.
At present there is no canonical form, so the system is not well suited to searching for substances in a database.

Comparative analysis

In conclusion, here is a table that summarizes the main characteristics of the systems discussed.

Capability                             XyMTeX  SMILES     InChI  MDL Molfiles  CharChem
Color design                           YES     NO         NO     NO            YES
Molecular formulas                     YES     NO         NO     partially     YES
Chemical reactions                     YES     YES        NO     NO            YES
Choosing appearance for the formula    YES     partially  NO     YES           YES
Polymers                               YES     NO         NO     NO            YES
Macrocommands                          YES     NO         NO     NO            YES
Analysis of formulas                   NO      YES        YES    YES           YES
Representation in the canonical form   NO      YES        YES    NO            NO

End of article. Author: PeterWin. November, 2013.