Monday, February 1, 2016

Checksums for circular biological sequences

The SEGUID checksum

Data pertaining to biological sequences such as DNA, RNA and protein are often stored and transferred in electronic form.

This information is often stored simultaneously on several locations while people are working on or with them. This is a potential source of error, since it is easy to introduce small errors that are hard to spot. 

Biological sequences are essentially information, so cryptographic checksums can be used to verify the integrity of sequences just like any other type of information.

Cryptographic checksums have been implemented for protein sequences to provide a stable identifier that only depends on the primary sequence. 

A checksum called the SEquence Global Unique IDentifier (SEGUID) was suggested by Babnigg and Giometti 2006. SEGUID is the Secure Hash Algorithm 1 (SHA-1) checksum calculated on the biological  sequence in uppercase and displayed using the base64 encoding.

The SEGUID identifier has been used to create translation tables between different databases holding the same sequences but typically using different id numbers or ways to identify the sequence.

At 27 characters, the checksum is relatively short. The SEGUID for the DNA sequence Gattaca is:


The SHA-1 algorithm has been broken, meaning that so called "hash collisions" can be constructed given enough time and resources. A hash collision means that the same checksum two different pieces of information gives the same checksum.

No hash collisions has been reported for the SHA-1 algorithm by accident and widely used softwares such as GIT still use SHA-1. Alternatives such as the more secure SHA-3 produces quite a bit longer checksums, and are for this reason less readable. 

The url-safe uSEGUID checksum

Unfortunately the original SEGUID checksum used the original Base64 encoding. This encoding contains the "+" and "/" characters. For instance the DNA sequence CAGG gives the SEGUID:


The backslash "/"and the "+" prevents the use of the checksum in URLs or as a part of a filename on most operating systems. There is an alternative Base64 encoding cells Base64url that substitutes "/"and "+" for "_" and "-" solving this problem. The uSEGUID (short for "url safe SEGUID") is defined as the SEGUID checksum but with Base64url encoding. The uSEGUID for CAGG is:


It is worth noting that the uSEGUID and SEGUID checksums can be constructed from each other by two character substitutions.

The cSEGUID checksum for circular sequences

The same storage and transmission problems that apply to protein sequences also apply to circular DNA sequences, such as plasmids. However, the uSEGUID checksum is not directly useful for circular DNA sequences, since there is up to 2n unique and equivalent representations for a double stranded circular sequence of length n.

For example, if we consider the circular double stranded 6 bp DNA sequence AGCCTA, the twelve sequences below are equal representations:


In my own line of work, I often have to evaluate plasmids constructed in-silico by students. A unique checksum for the correct sequence would make it easier to do this.

For this reason I developed the cSEGUID checksum as a general solution for this problem. The cSEGUID is defined as the uSEGUID checksum calculated from the lexicographically smallest rotation of any rotation of the sequence itself or its reverse complement. The smallest rotation oAGCCTA is marked in blue above.

The cSEGUID for the DNA sequence AGCCTA is:


This is also the uSEGUID of AAGCCT which is the smallest rotation of the sequence.

The cSEGUID is guaranteed to be as unique as the uSEGUID since a circular string that is not a concatenation of two substrings is guaranteed to have only one smallest rotation.

The lSEGUID checksum for linear double stranded DNA sequences

For completeness there should be a checksum definition for linear double stranded DNA molecules. These can take several shapes as they come with either blunt or staggered ends:
Molecule                 Repr #1          Repr #2

blunt dsDNA:             GATT             AATC
                         CTAA             TTAG

5' overhang:            aGATT            aAATC             
                         CTAAa            TTAGa

3' overhang:             GATTa            AATCa
                        aCTAA            aTTAG

5' and 3' overhang:     aGATTa            AATC
                         CTAA            aTTAGa

3' and 5' overhang:      GATT            aAATCa
                        aCTAAa            TTAG

The table above describes five different double stranded DNA molecules. The two columns contain equivalent representations of the molecules. A checksum for linear DNA sequences should give different values for each of the molecules, but should be the same for each representation. I have defined a checksum called lSEGUID that fulfils these criteria.

The lSEGUID checksum for a blunt DNA sequence is defined as the uSEGUID checksum of the lexicographically smallest of the upper (watson) or lower (crick) strands. For instance, the lSEGUID for the molecule below is the uSEGUID for AACT, since this is lexicographically smaller than GATT.

Molecule                 Repr #1          Repr #2

blunt dsDNA:             GATT             AATC
                         CTAA             TTAG

For DNA sequences that are not blunt, the algorithm starts by selecting the  lexicographically smallest representation. The smallest representation of the staggered molecule below has been marked in blue.

Molecule                 Repr #1          Repr #2

5' overhang:             aGATT            aAATC             
                          CTAAa            TTAGa

Starting from the smallest representation, The lSEGUID checksum it defined as the uSEGUID checksum of a string concatenation called "repr" below with the following definition:

repr = chr(65)*upper_overhang+watson+chr(10)+chr(65)*lower_overhang+crick

The string above can easily be printed in any computer language to produce the representation. Chr(65) is the ASCII single white space character " " and chr(10) is the ASCII line break "\n". The upper_overhang and  lower_overhang are integers describing the number of white spaces needed in order to produce the correct stagger. 

For the molecule below, upper_overhang is zero and lower_overhang has has a value of one.


For the molecule above, the repr string becomes:

                    "aAATC\n TTAGa"

The uSEGUID of "aAATC\n TTAGa" is:  


Implementations of uSEGUID, cSEGUID and lSEGUID are available from the pydna Python package in the pydna.utils module

For uSEGUID, cSEGUID and lSEGUID of a blunt DNA molecule, there is also a standalone seguid calculator software and an online app.

No comments:

Post a Comment