Data pertaining to biological sequences such as DNA, RNA and protein are often stored and transferred in electronic form.
This information is often stored simultaneously on several locations while people are working on or with them. This is a potential source of error, since it is easy to introduce small errors that are hard to spot.
Biological sequences are essentially information, so cryptographic checksums can be used to verify the integrity of sequences just like any other type of information.
Cryptographic checksums have been implemented for protein sequences to provide a stable identifier that only depends on the primary sequence.
A checksum called the
SEquence
Global Unique
IDentifier (SEGUID) was suggested by Babnigg and Giometti 2006. SEGUID is the Secure Hash Algorithm 1 (
SHA-1) checksum calculated on the biological sequence in uppercase and displayed using the
base64 encoding.
The SEGUID identifier has been used to create translation tables between different databases holding the same sequences but typically using different id numbers or ways to identify the sequence.
At 27 characters, the checksum is relatively short. The SEGUID for the DNA sequence
Gattaca is:
tp2jzeCM2e3W4yxtrrx09CMKa/8
The SHA-1 algorithm has been broken, meaning that so called "
hash collisions" can be constructed given enough time and resources. A hash collision means that the same checksum two different pieces of information gives the same checksum.
No hash collisions has been reported for the SHA-1 algorithm by accident and widely used softwares such as GIT still use SHA-1. Alternatives such as the more secure SHA-3 produces quite a bit longer checksums, and are for this reason less readable.
The url-safe uSEGUID checksum
Unfortunately the original SEGUID checksum used the original Base64 encoding. This encoding contains the "+" and "/" characters. For instance the DNA sequence CAGG gives the SEGUID:
uZdvA+J+luF/IK4TAj+GBTMz688
The backslash "/"and the "+" prevents the use of the checksum in URLs or as a part of a filename on most operating systems. There is an alternative Base64 encoding cells
Base64url that substitutes "/"and "+" for "_" and "-" solving this problem. The uSEGUID (short for "url safe SEGUID") is defined as the SEGUID checksum but with Base64url encoding. The uSEGUID for CAGG is:
uZdvA-J-luF_IK4TAj-GBTMz688
It is worth noting that the uSEGUID and SEGUID checksums can be constructed from each other by two character substitutions.
The cSEGUID checksum for circular sequences
The same storage and transmission problems that apply to protein sequences also apply to circular DNA sequences, such as plasmids. However, the uSEGUID checksum is not directly useful for circular DNA sequences, since there is up to 2n unique and equivalent representations for a double stranded circular sequence of length n.
For example, if we consider the circular double stranded 6 bp DNA sequence AGCCTA, the twelve sequences below are equal representations:
AGCCTA TAGGCT
GCCTAA AGGCTT
CCTAAG GGCTTA
CTAAGC GCTTAG
TAAGCC CTTAGG
AAGCCT TTAGGC
In my own line of work, I often have to evaluate plasmids constructed
in-silico by students. A unique checksum for the correct sequence would make it easier to do this.
For this reason I developed the cSEGUID checksum as a general solution for this problem. The cSEGUID is defined as the uSEGUID checksum calculated from the
lexicographically smallest rotation of any rotation of the sequence itself or its reverse complement. The smallest rotation o
f AGCCTA is marked in blue above.
The cSEGUID for the DNA sequence
AGCCTA is:
OQ1RGvO0Y6C-zYuUxVjE84O-yvI
This is also the uSEGUID of AAGCCT which is the smallest rotation of the sequence.
The cSEGUID is guaranteed to be as unique as the uSEGUID since a circular string that is not a concatenation of two substrings is guaranteed to have only one smallest rotation.
The lSEGUID checksum for linear double stranded DNA sequences
For completeness there should be a checksum definition for linear double stranded DNA molecules. These can take several shapes as they come with either blunt or staggered ends:
Molecule Repr #1 Repr #2
blunt dsDNA: GATT AATC
CTAA TTAG
5' overhang: aGATT aAATC
CTAAa TTAGa
3' overhang: GATTa AATCa
aCTAA aTTAG
5' and 3' overhang: aGATTa AATC
CTAA aTTAGa
3' and 5' overhang: GATT aAATCa
aCTAAa TTAG
The table above describes five different double stranded DNA molecules. The two columns contain equivalent representations of the molecules. A checksum for linear DNA sequences should give different values for each of the molecules, but should be the same for each representation. I have defined a checksum called lSEGUID that fulfils these criteria.
The lSEGUID checksum for a
blunt DNA sequence is defined as the uSEGUID checksum of the
lexicographically smallest of the upper (watson) or lower (crick) strands. For instance, the lSEGUID for the molecule below is the uSEGUID for AACT, since this is lexicographically smaller than GATT.
Molecule Repr #1 Repr #2
blunt dsDNA: GATT AATC
CTAA TTAG
For DNA sequences that are not blunt, the algorithm starts by selecting the lexicographically smallest representation. The smallest representation of the staggered molecule below has been marked in blue.
Molecule Repr #1 Repr #2
5' overhang: aGATT aAATC
CTAAa TTAGa
Starting from the smallest representation, The lSEGUID checksum it defined as the uSEGUID checksum of a string concatenation called "repr" below with the following definition:
repr = chr(65)*upper_overhang+watson+chr(10)+chr(65)*lower_overhang+crick
The string above can easily be printed in any computer language to produce the representation. Chr(65) is the ASCII single white space character " " and chr(10) is the ASCII line break "\n". The upper_overhang and lower_overhang are integers describing the number of white spaces needed in order to produce the correct stagger.
For the molecule below, upper_overhang is zero and lower_overhang has has a value of one.
For the molecule above, the repr string becomes:
"aAATC\n TTAGa"
The uSEGUID of "aAATC\n TTAGa" is:
zjuf6OAJQNP1nSAUtAnSOHi5BOA
Implementations of uSEGUID, cSEGUID and lSEGUID are available from the pydna Python package in the
pydna.utils module.