ECCC 2 - Canonical numbering of 3-D Molecules

Chr. Benecke, A. Kerber, R. Laue

Adaption to 2-D Structures


In order to apply the previously described algorithm for a canonical numbering of 2D Structures, we first determine which information can be used to describe the structure.

Building the Initial Classification

Of course, using the atom type description, we can build initial equivalence classes of atoms. This classification can easily be refined taking into account, for example, the number of double bonds an atom is connected with, the number of H-atoms connected to a heavy atom or other typical atom properties ( e.g. atom is part of an aromatic ring,...).
Please remember, that the more properties you use, the faster the algorithm will get.
To ensure that equivalent molecules with different atom numbering start with the same initial classification, we have to determine a total order with respect to the atom properties.

In our example, this order can be defined as follows:

  • if (Name of Atom1 > Name of Atom2 in lexicogr. order)
    then Atom1 > Atom2; end; else
    if (Name of Atom1 < Name of Atom2 in lexicogr. order)
    then Atom1 < Atom2; end; else:
    Atom names are equivalent. Then
  • if (# of H-atoms adjacent to Atom1 > # of H-atoms adjacent to Atom2)
    then Atom1 > Atom2; end; else
    if (# of H-atoms adjacent to Atom1 < # of H-atoms adjacent to Atom2)
    then Atom1 < Atom2; end; else:
    Number of H-atoms adjacent are the same. Then
  • if (# of DoubleBonds of Atom1 > # of DoubleBonds of Atom2)
    then Atom1 > Atom2; end; else
    if (# of DoubleBonds of Atom1 < # of DoubleBonds of Atom2)
    then Atom1 < Atom2; end; else:
    Number of DoubleBonds are the same. Then
  • if (Atom1 is part of arom. ring and Atom2 is not part of arom. ring)
    then Atom1 > Atom2; end; else
    if (Atom1 is not part of arom. ring and Atom2 is part of arom. ring)
    then Atom1 < Atom2; end; else:
    Both atoms are part of arom. ring. Then
  • if (...)
    .
  • .
    .
  • .
    .
  • else:
    Atom1 and Atom2 are equivalent. END.

    Using this order, the starting classification should be fine (that means, even if the molecule consists of only e.g. C-atoms, there is a good chance to get at least two separate classes which drastically reduces the time spent for canonical numbering).

    Having found this classification, two cases can occur.

    Using the Iterated Classification

    In the second case, we use the atoms in the single membered classes in order to refine the other classes. First we memorize the classes with only one member in the order induced by their members.
    Starting with the largest class we memorized but did not use for refinement yet, we split all other classes that consist of more than one atom. Using the connectivity information of the molecule, the class to be splitted is divided into one class that contains atoms which are connected to the one the memorized class consists of and another class that contains the atoms which are not connected to it.
    Of course, there is not always a new class created, but if the resulting classes now only contain one atom, they are also memorized in a fixed order (e.g. the class with the atoms connected to the atom we are refining with, are memorized first.).

    This procedure is used as long as there is a memorized class available which we did not use for refinement so far. If, after that, all classes consist of only a single atom, the end of the algorithm is reached. If not, we have to go on with the Backtracking.

    Backtracking

    Starting the Backtracking part, we first determine the class containing the fewest number of atoms. This strategy is expected to act best in this case.
    Before choosing an atom out of this class to continue with, we have to set a rule for this choice:

    Assume, that we already memorized three classes with a single element member. We now rearrange the connection table for the molecule, so that these atoms (A1, A2 and A3) are the first to occur in a row. In addition to that, we only take into account the lower half of the connection table, for obvious reasons.

    A1 *
    A2 1 *
    A3 1 0 *
    A4 . . . *
    .
    .

    In this examplary connection table, A2 is connected to A1, A3 to A1, and A3 not to A2.
    We will now define an order for connection tables that enables us to decide, whether one connection table is larger than another. This is done in a very natural way, just by appending the numbers in the rows to the preceding row. Here, it will look like: 1 1 0 . . . Another Connection table is said to be larger than this, if the first n numbers are equivalent but the (n+1). number is larger than the one in the first connection table.

    What we can do now, is select only those atoms of the class as a new A4, where the connection table is larger than it would become if we had chosen another atom.
    If only one atom fullfills this condition, we do not need to start the backtracking, but split the actual class in one containing only this atom and another class containing the rest of the atoms. After memorizing the new single atom member class, we continue with the iterated classification.
    If there are more than one atoms whose choice would lead to the largest connection table, we have to build new branches for the backtracking algorithm. However, all the up to this point largest connection tables must be stored so we can immediately cut off useless branches when a larger connection tables occurs.


    If no more refinement is necessary, that means if all classes consist of one single atom, the canonical numbering of the structure is the list of the memorized classes containing only one atom. To show this in a practical example, please follow the link to the next chapter.
    Previous: Canonical Representatives for Nested Structures
    Next: Example

    © Chr. Benecke, Oct. 1995