Overview of methods for describing chemical formulas
Popular science article. A comparative analysis of methods and systems for the digital representation of chemical formulas. Systems discussed: XyMTeX, SMILES, InChI, MDL Molfile and easyChem.
Introduction
Chemistry has a history spanning several hundred years, but only in recent years has the accumulated information been actively migrated to digital storage media. This raised the question of how to digitize chemical formulas. Computer science has developed many effective techniques for handling text, graphics, sound and video. Similar techniques have been developed for chemical formulas, but they are still evolving and have not yet settled into a stable form.
The Internet has arguably become the most promising source of information in our time. This applies first of all to HTML, which literally opens a window onto the world through web browsers. In this article I will therefore focus on techniques for displaying chemical formulas in web browsers. Strictly speaking, HTML stands for HyperText Markup Language. But it has long been just the tip of an iceberg made up of many different information technologies, which together let a browser user easily locate and use a wide variety of information from around the world. For brevity, I will use the term HTML hereinafter in a broad sense that includes all the features of modern browsers.
Primitive methods of chemical formulas representation
Unfortunately, the quality of chemical formula presentation on most sites is still quite low. Plain text is often used, and records like CH3CH2OH are common. With markup you can get CH₃CH₂OH, which looks much better. In HTML this requires a more complicated description: CH<sub>3</sub>CH<sub>2</sub>OH. Because of this extra effort, not everyone uses it. Moreover, a structural formula cannot be described with plain text at all. This brings us to raster images: most of the formulas found on the Internet are presented this way.
More progressive sites use images produced with molecular editors. Reputable reference resources such as PubChem and ChemSpider use special software that automatically builds an image from a formal description of the molecule. This brings us very close to the subject of the article: methods for the formalized description of chemical formulas. First, I want to briefly enumerate the main disadvantages of raster images:
Why are they so popular if they have so many flaws? Simply because for a long time there were no alternatives. Old HTML could display only text or bitmaps. But with the advent of HTML5 and SVG everything changed. Vector graphics are much more promising. For example, formulas in SVG format can be found on Wikipedia. But they have not yet been widely adopted.
Systems for chemical formulas description
Now begins the main part of the article. Naturally, as soon as the need arose to store formulas in a database, methods for doing so appeared. Consider a few of the most common systems.
• XyMTeX
XϒMTeX is based on the highly acclaimed typesetting system TeX. A formula is described using text commands. The source code is converted by special programs into PDF or PostScript format. For example, this description
{\red \bzdrv{1=={\blue OH};4=={\green NO$_{2}$}}}
gives the image:
$color(blue)OH$color(red)|\||`/<|$itemColor(green)NO2>`\\`|//
It is fairly easy to see that the benzene ring is drawn by the command \bzdrv,
and that functional groups are attached to nodes by number: 1 is the upper node, 4 is the lower one.
In fact, XyMTeX is a very extensive set of macros (such as \bzdrv) that extend the basic set of TeX commands (like \red). Here are a few examples to show the same substance:
\nonaheterovi[di]{5s==\cyclopropanev{2==(yl)}}%
{2SB==CH$_{3}$;2SA==CH$_{2}$OH;3B==OH;4==CH$_{3}$;6SB==CH$_{3}$;6SA==HO;7D==O}
`//<|CH3>`\<_(x-1)_q3_q3>`|<`-dHO><_(A-120,w+)C`H3>/`|O|\//\<_(A-60,w+)CH3><_(x1,d+)CH2OH>|<\wOH>_#1`|
Summary
The TeX platform also lets authors define their own macros.
This automates repetitive operations and makes working with a document more efficient.
The undoubted advantage of the system is high-quality printed output.
XyMTeX is therefore very useful for typesetting chemistry books.
However, this way of representing chemical formulas is not widely used for publishing on the Internet.
Unlike the other systems discussed in this article, XyMTeX
is intended only for rendering formulas.
That is, neither the molecular weight nor the molecular formula can be calculated from such a description.
Furthermore, XyMTeX is quite hard to learn and requires knowledge of the underlying TeX platform.
Newcomers should not expect to produce a couple of formulas quickly without studying the documentation:
for each type of chemical structure, one has to find the appropriate macro and study its options.
• SMILES
SMILES is a distinctive system for describing chemical structures with short text strings. Its principles were developed in the 1980s. Over the past three decades, SMILES has acquired a host of improved versions and software implementations. The system is so simple that its basic principles can be summarized in a few sentences.
Atoms are represented by the standard symbols of the chemical elements, in square brackets. For example, [Na].
Brackets can be omitted for the "organic subset" of B, C, N, O, P, S, F, Cl, Br, and I.
Hydrogen atoms can be omitted; they are added automatically.
Single bonds need not be written.
Branches are described with parentheses.
Cycles are described by means of numerical marks.
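The rules above are simple enough to apply mechanically. As a rough illustration, the following Python sketch counts the atoms written explicitly in a SMILES string, using only the bracket rule and the organic subset listed above. The function name `count_atoms` and the sample strings are assumptions chosen for illustration. Aromatic lowercase atoms, implicit hydrogens and validation are deliberately left out, so this is not a full SMILES parser.

```python
import re
from collections import Counter

# Either a bracket atom such as [Na+] or [13C], or an unbracketed atom
# from the organic subset. "Cl" and "Br" are tried before single letters
# so that "Cl" is not read as carbon followed by a stray "l".
ATOM_RE = re.compile(r"\[\d*([A-Z][a-z]?)[^\]]*\]|(Cl|Br|[BCNOPSFI])")


def count_atoms(smiles):
    """Count explicitly written atoms; implicit hydrogens are ignored."""
    counts = Counter()
    for match in ATOM_RE.finditer(smiles):
        symbol = match.group(1) or match.group(2)
        counts[symbol] += 1
    return counts


if __name__ == "__main__":
    print(dict(count_atoms("CCO")))          # ethanol: {'C': 2, 'O': 1}
    print(dict(count_atoms("[Na+].[Cl-]")))  # {'Na': 1, 'Cl': 1}
```

Bond symbols, parentheses and ring-closure digits are simply skipped by the regular expression, which is why branches and cycles do not disturb the count.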
SMILES can describe not only individual molecules, but also chemical reactions. For example, the record:
[I-].[Na+].C=CCBr>>[Na+].[Br-].C=CCI
indicates
I^- + Na^+ + H2C=CH-CH2Br -> Na^+ + Br^- + H2C=CH-CH2I
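A reaction record like the one above can be taken apart with ordinary string operations. The sketch below assumes the common reaction layout `reactants>agents>products`, where `>>` simply means that the agents part is empty, and where a dot separates the individual species. The helper name `split_reaction` is hypothetical.

```python
def split_reaction(reaction_smiles):
    """Split a reaction SMILES into lists of reactants, agents and products."""
    parts = reaction_smiles.split(">")
    if len(parts) != 3:
        raise ValueError("expected the form reactants>agents>products")
    # Dots separate species; empty strings come from an empty section.
    return tuple([s for s in part.split(".") if s] for part in parts)


if __name__ == "__main__":
    reactants, agents, products = split_reaction(
        "[I-].[Na+].C=CCBr>>[Na+].[Br-].C=CCI"
    )
    print(reactants)  # ['[I-]', '[Na+]', 'C=CCBr']
    print(agents)     # []
    print(products)   # ['[Na+]', '[Br-]', 'C=CCI']
```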
The classic version of SMILES allows the same molecule to be described in different ways.
SMILES always operates only on structural formulas. Therefore, descriptions of inorganic molecules and reaction equations look strange and redundant.
A SMILES description is written on a single line, with no spaces.
Summary
SMILES is the most popular system for describing molecules.
Most systems for digital processing of chemical data support this format.
The system is quite simple.
A trained person can easily read and write descriptions of simple molecules without software.
However, for large molecules it is hardly possible to do without a molecular editor.
SMILES is intended mainly for describing the structural formulas of organic molecules, so it is much less convenient for inorganic chemistry.
The graphic appearance of a formula depends on the visualization algorithm.
It is impossible to depict phenol in several ways, as was demonstrated for XyMTeX.
That is, the use of SMILES for printing is very limited.
Also, you cannot use color or add any comments or pagination.
• The IUPAC International Chemical Identifier (InChI)
This format for describing molecules is an IUPAC standard, which is a strong argument in its favor. InChI descriptions, like SMILES, are written as a text string and can describe only the structural formula. Likewise, the position of nodes in space is determined entirely by the visualization algorithm. That is where the similarity ends. Here are some examples:
Every InChI description is divided into layers separated by slash characters. The first layer is always the header, and the second is the molecular formula. The remaining layers (or sublayers) begin with a specific letter. The prefix "c" marks the section defining the connections between atoms (except hydrogens), and "h" describes how many hydrogen atoms are attached to each of the other atoms. Other prefixes appear only when needed; they describe charges, chirality, isotopes and other properties of the molecule. Correct numbering of the nodes is very important, but the rules are quite complex and take substantial effort to understand.
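The layer structure described above can be illustrated with a few lines of Python. This is a minimal sketch: the helper `split_inchi_layers` is hypothetical, and the ethanol identifier used as input is the standard published one rather than an example from this article. The sketch only separates layers by their prefix letter. It does not interpret the atom numbering, and it assumes that each prefix occurs at most once.

```python
def split_inchi_layers(inchi):
    """Split an InChI string into header, formula and prefixed layers."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    header, formula, *rest = inchi[len(prefix):].split("/")
    layers = {"header": header, "formula": formula}
    for layer in rest:
        if layer:
            # The first letter names the layer, e.g. "c" or "h".
            layers[layer[0]] = layer[1:]
    return layers


if __name__ == "__main__":
    ethanol = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
    for name, value in split_inchi_layers(ethanol).items():
        print(name, value)
```

For ethanol, the "c" layer reads 1-2-3: atoms 1 and 2 are the carbons and atom 3 is the oxygen, joined in a chain. The "h" layer then assigns one hydrogen to atom 3, two to atom 2 and three to atom 1.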
I should also mention the InChIKey.
Summary
From the outset, the InChI system was narrowly focused on the problems of identification and database search.
These problems have clearly been solved successfully.
The textual InChI description is too complicated for a person to compose by hand in a text editor.
Therefore, molecular editors are the best way to create and modify such descriptions.
• MDL Molfiles
This format is supported by most molecular editors. In addition, MOL files can be downloaded from online reference resources such as ChemSpider and ChEBI. There is an enhanced version of this format that allows additional information to be included. Such files usually have the SDF extension. You can download them from PubChem. Consider the description of the following molecule:
O^-`\N^+`|O|`/|`//N`\`||<`\wHO>/\\
5-nitropyridin-3-ol
  ACD/Labs11211323262D

 10 10  0  0  1  0  0  0  0  0  2 V2000
   15.8780   -7.4592    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   15.8780   -8.7892    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   14.7261   -6.7942    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   14.7261   -9.4542    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   13.5743   -7.4592    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   13.5743   -8.7892    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   12.4225   -6.7942    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   17.0298   -6.7942    0.0000 N   0  3  0  0  0  0  0  0  0  0  0  0
   18.1816   -7.4592    0.0000 O   0  5  0  0  0  0  0  0  0  0  0  0
   17.0298   -5.4642    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
  5  7  1  1  0  0  0
  9  8  1  0  0  0  0
 10  8  2  0  0  0  0
  1  8  1  0  0  0  0
M  CHG  2   8   1   9  -1
M  END
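Although the format is aimed at software, its layout is regular: three header lines, a counts line, an atom block, a bond block and property lines starting with "M". The following Python sketch reads the file shown above under those assumptions. The function `parse_molfile` is a hypothetical helper, not part of any particular toolkit. It takes the atom and bond counts from the first two three-character fields of the counts line and then splits the remaining records on whitespace, which works for this small example but is not a complete V2000 reader.

```python
from collections import Counter

MOLFILE = """\
5-nitropyridin-3-ol
  ACD/Labs11211323262D

 10 10  0  0  1  0  0  0  0  0  2 V2000
   15.8780   -7.4592    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   15.8780   -8.7892    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   14.7261   -6.7942    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   14.7261   -9.4542    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   13.5743   -7.4592    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   13.5743   -8.7892    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   12.4225   -6.7942    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   17.0298   -6.7942    0.0000 N   0  3  0  0  0  0  0  0  0  0  0  0
   18.1816   -7.4592    0.0000 O   0  5  0  0  0  0  0  0  0  0  0  0
   17.0298   -5.4642    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
  5  7  1  1  0  0  0
  9  8  1  0  0  0  0
 10  8  2  0  0  0  0
  1  8  1  0  0  0  0
M  CHG  2   8   1   9  -1
M  END
"""


def parse_molfile(text):
    """Read name, atoms, bonds and charges from a small V2000 Molfile."""
    lines = text.splitlines()
    counts = lines[3]
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
    atoms = []
    for line in lines[4:4 + n_atoms]:
        x, y, z, symbol = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        first, second, order = (int(v) for v in line.split()[:3])
        bonds.append((first, second, order))
    charges = {}
    for line in lines[4 + n_atoms + n_bonds:]:
        if line.startswith("M  CHG"):
            # Fields after the entry count are (atom number, charge) pairs.
            fields = line.split()[3:]
            for i in range(0, len(fields), 2):
                charges[int(fields[i])] = int(fields[i + 1])
    return {"name": lines[0], "atoms": atoms, "bonds": bonds, "charges": charges}


if __name__ == "__main__":
    mol = parse_molfile(MOLFILE)
    print(mol["name"], Counter(symbol for symbol, *_ in mol["atoms"]))
    print(mol["charges"])
```

Hydrogen atoms are implicit in this file, so the element count covers only the ten heavy atoms. The two charges cancel, which matches the nitro group drawn with N+ and O-.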
Summary
The MOL format stores the chemical and graphic properties of a molecule together.
A MOL file is on the order of ten times larger than equivalent descriptions in other systems.
Although the data are stored as text, the MOL format is aimed at molecular editors, not at humans.
• CharChem
And of course, as the author of CharChem, I am going to tell you why it was necessary to invent yet another system. Currently CharChem is the only system that can display chemical formulas in HTML directly from their descriptions. No additional software needs to be installed; you only need a web browser. But why not take an existing system and develop an HTML renderer for it? Because describing chemical formulas with CharChem is, I claim, much easier than with any of the systems above. Let's start with the fact that many substances and reaction equations are written with rational formulas. Let's see what description is required for the reaction:
Cr2O3 + 2KNO3 -> K2Cr2O7 + 2NO"|^"
As you can see, the CharChem description is the simplest and shortest.
The HTML and TeX descriptions contain no chemical information, and SMILES does not allow coefficients to be entered.
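To show what "chemical information" means in practice, here is a small Python sketch that checks the atom balance of the reaction above from its plain-text form. It is independent of CharChem: the helpers `count_side` and `is_balanced` are hypothetical, the gas mark is omitted, and only simple formulas without parentheses or charges are handled.

```python
import re
from collections import Counter

TERM_RE = re.compile(r"^(\d*)(.+)$")             # optional coefficient, then formula
ELEMENT_RE = re.compile(r"([A-Z][a-z]?)(\d*)")   # element symbol, optional index


def count_side(side):
    """Sum the atoms on one side of an equation such as 'Cr2O3 + 2KNO3'."""
    total = Counter()
    for term in side.split("+"):
        coefficient, formula = TERM_RE.match(term.strip()).groups()
        k = int(coefficient) if coefficient else 1
        for symbol, index in ELEMENT_RE.findall(formula):
            total[symbol] += k * (int(index) if index else 1)
    return total


def is_balanced(equation):
    """Return True when both sides contain the same atoms."""
    left, right = equation.split("->")
    return count_side(left) == count_side(right)


if __name__ == "__main__":
    print(is_balanced("Cr2O3 + 2KNO3 -> K2Cr2O7 + 2NO"))  # True
```

Each side contains two chromium, two potassium, two nitrogen and nine oxygen atoms. A check of this kind needs the elements and coefficients, which a purely typographic description such as HTML subscripts does not expose.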
Now consider the principles of describing structural formulas.
The easiest to understand is SMILES.
It would seem that, to describe ethyl alcohol, it is enough to write CCO.
Like XyMTeX, CharChem allows the same molecule to be described in different ways:
In CharChem it is enough to remember that stick characters denote a line drawn from left to right (or top to bottom), and that a backtick (backward apostrophe) reverses the direction: right to left (or bottom to top). This is enough to portray a wide variety of structures. To "read" a description, you trace the movement of the sticks. On the test stand you can enter a description and see the result. Here are some more complicated examples that demonstrate the capabilities of CharChem:
If you look at the middle formula (α-cyclodextrin), you will notice that it consists of six identical fragments. A molecule of δ-cyclodextrin consists of nine such fragments. CharChem supports macros, so it suffices to describe only one unit; the rest are generated automatically. The CharChem system has two main principles:
Summary
Formulas are drawn from text descriptions directly in the HTML page.
No additional software needs to be installed.
There is no specialization: one system handles both simple rational formulas and complex structural formulas.
Not only single substances but also reactions can be described.
The system is easy to learn and use. Descriptions are rather compact.
Descriptions can be created and edited as text, without a molecular editor.
Repetitive operations can be automated with macros.
So far, the quality of the images is not sufficient for printing.
However, this disadvantage concerns the rendering system, not the description of molecules.
Work on improving CharChem is actively under way.
At present, canonical descriptions are not supported.
That is, the system is not well suited to algorithms that search for substances in a database.
Comparative analysis
In conclusion, here is a table that summarizes the main characteristics of the systems discussed.
End of article. Author: PeterWin. November, 2013.