DHDS [Download V1.0]
Distributed Homolog Detection System
The Distributed Homolog Detection System (DHDS) is a program built on a standard distributed processing design consisting of one central database and multiple processing servers. DHDS applies a series of rigorous statistical analyses to establish homology between transport proteins in TCDB.org on a large scale. It has been used to compare selected classes of transport protein families within themselves and to compare different classes of families, allowing new relationships to be defined. DHDS is fast and efficient and, most importantly, it is fully scalable and can handle a virtually unlimited number of jobs. It is also designed to be highly reliable, so that no job is left incomplete or damaged. This reliability is built into its failsafe protocols, which are described in detail in this paper. DHDS is written primarily in object-oriented PHP with some Perl components and is driven by a MySQL database. It can run on any Unix-based operating system, with minimum requirements of PHP 5.0, MySQL 4.0 and 32 MB of RAM.
DHDS uses rigorous statistical algorithms to establish homology between two proteins, two families of proteins, or repeat sequences within a protein or family of proteins. These techniques depend on the ‘Superfamily principle’ [Doolittle, 1981; 1986; Saier, 1994], which states that if ‘A’ is homologous to ‘B’, and ‘B’ is homologous to ‘C’, then ‘A’ is homologous to ‘C’. Much care is taken to account for unusual residue compositions, particularly in cases where proteins have a disproportionate percentage of hydrophobic residues or where multiple short repeat sequences comprise a significant fraction of the proteins or protein segments being compared. Once DHDS has aligned two proteins so as to maximize identities and similarities and minimize gaps, it requires a minimal comparison score of 10 standard deviations by default. This corresponds to a probability of about 10^-24 that this degree of similarity arose by chance. The minimal score requirement can be changed by the user. Negative and positive controls are often established by comparing the family in question to one that is unrelated and to one that is related, respectively, in order to establish good minimal values for the situation at hand.
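The two ideas in this paragraph, the score threshold and the transitive Superfamily principle, can be illustrated with a minimal Python sketch. It is not DHDS code. It assumes that comparison scores behave like standard normal deviates (which is what makes 10 standard deviations correspond to a chance probability on the order of 10^-24), and the function names and example scores are hypothetical.

```python
import math


def upper_tail_probability(z):
    """One-sided probability that a standard normal variable exceeds z."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def group_by_transitivity(proteins, scored_pairs, min_sd=10.0):
    """Group proteins into families: any pair scoring >= min_sd is linked,
    and links are followed transitively (A~B and B~C implies A~C)."""
    parent = {p: p for p in proteins}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving keeps trees shallow
            p = parent[p]
        return p

    for a, b, sd in scored_pairs:
        if sd >= min_sd:
            parent[find(a)] = find(b)

    groups = {}
    for p in proteins:
        groups.setdefault(find(p), set()).add(p)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical scores in standard deviations; C-D falls below the threshold.
pairs = [("A", "B", 14.2), ("B", "C", 11.0), ("C", "D", 4.5)]
print(upper_tail_probability(10))  # on the order of 1e-24
print(group_by_transitivity(["A", "B", "C", "D"], pairs))
# prints [['A', 'B', 'C'], ['D']]
```

Here ‘A’ and ‘C’ end up in the same group even though they were never compared directly, which is the point of the principle.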
A DHDS server application can run quietly in the background of a computer system, allowing the server or workstation to be used for other purposes; it does not demand a dedicated work environment. DHDS connects to the central server, where it finds a list of jobs in the MySQL database in the following format: [ACCESSION#1, ACCESSION#2, FAMILY#1, FAMILY#2, TCID#1, TCID#2, MACHINE, STATUS]. A status of ‘0’ is the default and means that the job is currently unclaimed and not being processed by any machine. DHDS claims X jobs at a time; X can be set by the user depending on the server’s hardware specifications. The servers at the Saier Lab handle 5 simultaneous comparisons. Once a job is claimed by a server, its status is set to ‘1’ (“in progress”) and the server’s IP address is saved in the ‘machine’ column to keep track of its activity. For each job claimed, the DHDS server spawns three self-sustaining, fully independent tasks or “daemons” named TASKd, PROTOCOL1d and PROTOCOL2d. The ‘d’ after each process name stands for ‘daemon’, the usual Unix term for a background process. These three daemons have a simple hierarchy: TASKd is the first to be launched and is the parent of both PROTOCOL1d and PROTOCOL2d.
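The claiming step can be sketched as follows. This is an illustration only: it uses Python's built-in sqlite3 module with an in-memory database as a stand-in for the central MySQL database, and the table name, column names, placeholder accessions and the claim_jobs function are assumptions rather than the actual DHDS schema or code.

```python
import sqlite3

UNCLAIMED, IN_PROGRESS = 0, 1


def create_queue(conn):
    # Hypothetical table mirroring the job format described in the text.
    conn.execute(
        """CREATE TABLE jobs (
               id INTEGER PRIMARY KEY,
               accession1 TEXT, accession2 TEXT,
               family1 TEXT, family2 TEXT,
               tcid1 TEXT, tcid2 TEXT,
               machine TEXT,
               status INTEGER NOT NULL DEFAULT 0)"""
    )


def claim_jobs(conn, machine_ip, max_jobs=5):
    """Claim up to max_jobs unclaimed jobs for machine_ip; return their ids."""
    claimed = []
    with conn:  # commits on success, rolls back on error
        rows = conn.execute(
            "SELECT id FROM jobs WHERE status = ? ORDER BY id LIMIT ?",
            (UNCLAIMED, max_jobs),
        ).fetchall()
        for (job_id,) in rows:
            # The status check in WHERE guards against a job that another
            # server claimed between our SELECT and this UPDATE.
            cur = conn.execute(
                "UPDATE jobs SET status = ?, machine = ? "
                "WHERE id = ? AND status = ?",
                (IN_PROGRESS, machine_ip, job_id, UNCLAIMED),
            )
            if cur.rowcount == 1:
                claimed.append(job_id)
    return claimed


conn = sqlite3.connect(":memory:")
create_queue(conn)
conn.executemany(
    "INSERT INTO jobs (accession1, accession2, family1, family2, tcid1, tcid2) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    [(f"ACC{i}A", f"ACC{i}B", "FAM1", "FAM2", "TCID1", "TCID2") for i in range(7)],
)
conn.commit()

first = claim_jobs(conn, "10.0.0.1")   # a server limited to 5 jobs
second = claim_jobs(conn, "10.0.0.2")  # a second server takes what is left
print(first, second)  # prints [1, 2, 3, 4, 5] [6, 7]
```

Making the UPDATE conditional on the job still being unclaimed is one simple way to keep two servers from working on the same job; the source does not say how DHDS itself prevents double claims.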
The function of TASKd is simple. It spawns PROTOCOL1d and PROTOCOL2d and monitors their status. TASKd detects when either process completes or fails, updates the job’s progress status on the central server, and transfers completed protein comparisons to the central server for further analysis.
PROTOCOL1d (P1d) is the first daemon spawned by TASKd. This process collects important information about the two proteins being compared. Two instances are run, one for each accession number. First, P1d performs an NCBI BLAST search and picks up close relatives. It uses a default E-value cutoff of 0.005, which has been established to be a good value for detecting a reasonable number of close homologs while minimizing false positives. P1d requires at least 500 BLAST results. Should there be fewer, it performs a second iteration and tries to retrieve 500. If 500 still cannot be found, it continues regardless, using what it found.
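The retrieval logic described above can be sketched in Python as follows. The run_blast callable is a hypothetical stand-in for whatever performs the actual NCBI BLAST search; it is not an NCBI API, and the gi labels and E-values in the demonstration are made up.

```python
MAX_E_VALUE = 0.005  # default cutoff from the text
MIN_HITS = 500       # target number of BLAST results


def collect_homologs(run_blast, accession, max_e=MAX_E_VALUE, min_hits=MIN_HITS):
    """Return gi identifiers with E-value <= max_e, best hits first.

    run_blast(accession, iteration) is a hypothetical callable returning
    (gi, e_value) pairs. A second iteration is run only if the first one
    yields fewer than min_hits; after that, whatever was found is used.
    """
    hits = {}
    for iteration in (1, 2):
        for gi, e_value in run_blast(accession, iteration):
            if e_value <= max_e:
                hits[gi] = min(e_value, hits.get(gi, e_value))
        if len(hits) >= min_hits:
            break
    return sorted(hits, key=hits.get)


# Stand-in search with made-up hits, using a small target for illustration.
iterations_run = []


def fake_blast(accession, iteration):
    iterations_run.append(iteration)
    if iteration == 1:
        return [("gi1", 1e-50), ("gi2", 1e-10), ("gi3", 0.2)]
    return [("gi1", 1e-50), ("gi2", 1e-10), ("gi4", 1e-3), ("gi5", 0.004), ("gi6", 0.01)]


result = collect_homologs(fake_blast, "ACC0001", min_hits=4)
print(result)  # prints ['gi1', 'gi2', 'gi4', 'gi5']
```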
P1d then creates a list of gi numbers from the NCBI BLAST results. This list is sent to NCBI’s Protein Entrez to generate a TinyXML file containing sequences and protein information in an easy-to-parse format. To maximize efficiency, this protein list must be trimmed to a reasonable size. This is done with the CD-HIT program [Li & Godzik, 2006], which eliminates all redundancies and all sequences with a percent identity greater than X%, so that only one protein from each set of retrieved sequences with greater than X% identity is retained. X can be 70-100; the value is determined automatically by P1d depending on the number of sequences found. One problem with CD-HIT is that the retained sequences are often fragments of complete proteins. P1d therefore uses a modified version of CD-HIT included in the Make_Table5 program [Yen et al., 2009], so that only sequences of ‘normal’ length are retained. It works as follows: the script summarizes the sizes of all the proteins obtained by the BLAST search, and a decision is made to exclude presumed fragmentary sequences. The results are tabulated and summarized, each protein is given an abbreviated symbol, and the sequences are stored in FASTA format in one .FAA file. This yields a list of FASTA sequences derived from a single query sequence or accession number.
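The fragment-exclusion and FASTA-writing steps might look roughly like the sketch below. The text does not state the rule Make_Table5 uses to decide which sequences are fragments, so the cutoff here (a fraction of the median length) is purely an assumption for illustration, as are the function names and the toy sequences.

```python
from statistics import median


def drop_fragments(sequences, min_fraction=0.7):
    """Keep sequences at least min_fraction of the median length.

    sequences maps an abbreviated symbol to its amino acid sequence.
    The median-based cutoff is an assumed rule, not the documented one.
    """
    cutoff = min_fraction * median(len(s) for s in sequences.values())
    return {name: s for name, s in sequences.items() if len(s) >= cutoff}


def to_fasta(sequences, width=60):
    """Render sequences as FASTA text, wrapping residues at `width` columns."""
    lines = []
    for name, seq in sequences.items():
        lines.append(">" + name)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Toy data: the third entry is much shorter than the others.
retrieved = {"Aaa1": "M" * 100, "Bbb1": "M" * 110, "Ccc1": "M" * 30}
kept = drop_fragments(retrieved)
print(sorted(kept))  # prints ['Aaa1', 'Bbb1']
fasta_text = to_fasta(kept)  # this text would be written to the .FAA file
```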
These results are then transferred to and stored on the DHDS central server. Should another DHDS server require filtered BLAST results derived from the same accession number, they can be retrieved immediately from the central server to avoid unnecessary processing. If one DHDS server requires information about an accession number that is currently being processed by another server, P1d waits for the other server to upload its results to the central DHDS server instead of running the same analysis simultaneously and wasting processing power. Upon completion, P1d writes a file called ‘status’ containing information about its job and whether it completed successfully. This file is in turn read by its parent, TASKd, which then either sets the job status to ‘0’ for “failed” or leaves it as ‘1’ for “in progress”. Because ‘0’ is also the unclaimed state, a failed job becomes available to be claimed again.
PROTOCOL2d (P2d) is launched by TASKd after P1d has generated the two .FAA files, one from each accession number, for its job. P2d then begins a statistical analysis of the two lists of proteins by comparing them with a Perl script called SSearch [Pearson, 1998]. This program compares binary alignments with 500 randomly shuffled sequences and averages the two bit scores based on a standard curve; the result is converted to a comparison score expressed in standard deviations (S.D.). With this technique, most abnormal amino acid compositions are automatically corrected for. P2d then takes any results of 8 standard deviations or better from SSearch and runs the GSAT program [Reddy, 2010] on the specific segments of interest for which SSearch reported a good score. GSAT works by shuffling the two sequences twice per amino acid (three times per amino acid for sequences under 60 residues long) and comparing the alignment of the actual sequences with the alignments of the shuffled sequences; a Z value is reported in standard deviations. This method also helps to eliminate artifacts due to unusual amino acid compositions. The benefit of combining the two methods (SSearch and GSAT) is that it allows users to perform quantitative analyses when comparing results from both programs on a large scale, giving a much better idea of what score should be interpreted as a “good” result. These results are then saved as a serialized array and sent to the DHDS central server for analysis. Upon completion, P2d creates a text file named ‘result’ within its project directory containing the job status. In turn, TASKd reports the job’s status back to the central server and terminates all instances of the job.
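The shuffle-based statistic that both programs rely on can be illustrated with a short Python sketch. It is not SSearch or GSAT: the scoring scheme (match 2, mismatch -1, gap -2) is a toy assumption in place of a real substitution matrix, only one of the two sequences is shuffled, and the function names are hypothetical. Because shuffling preserves amino acid composition, a high score that is caused only by unusual composition is also high for the shuffled sequences and therefore yields a low Z value.

```python
import random
from statistics import mean, stdev


def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score under a toy scoring scheme."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
        best = max(best, max(cur))
        prev = cur
    return best


def shuffle_z_score(a, b, shuffles=500, seed=0):
    """Score of the real pair, in standard deviations above the mean score
    obtained when b is replaced by random shuffles of itself."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    real = local_score(a, b)
    background = []
    for _ in range(shuffles):
        residues = list(b)
        rng.shuffle(residues)
        background.append(local_score(a, "".join(residues)))
    sd = stdev(background)
    return (real - mean(background)) / sd if sd else float("inf")


# A made-up sequence compared with itself gives a very large Z value.
query = "MKTLLVAGIFSWYPNDERCQHG" * 2
print(shuffle_z_score(query, query, shuffles=200))
```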
The Autopilotd service is DHDS’s fourth process and sits highest in the DHDS process hierarchy. It is the program that actually distributes jobs to the lower-level daemons. Upon launch, Autopilotd checks the central server database for incomplete jobs claimed by its own machine and continues where it left off. If for whatever reason TASKd determines that the progress cannot be salvaged, it restarts the job from the beginning. Autopilotd actively monitors the processes it spawns and restarts any that malfunction, which contributes to DHDS’s reliability.
DHDS has many built-in failsafe protocols that make it extremely reliable. Each DHDS server handles individual HTTP requests in an object-oriented fashion, using the cURL libraries. Should an HTTP request fail, the server pauses for 20 seconds and tries again, up to a maximum of five times. If a job still fails after five trials, it is marked as incomplete and the server continues with a new job. Incomplete or broken jobs are sent back to the queue, shuffled, and randomly redistributed to other servers so that they may attempt to complete them. All computer systems of this nature are inevitably prone to communication errors, so several tools and features have been built into DHDS to prevent and remedy any errors that may occur. When waiting for another server to complete an accession number analysis with P1d, the waiting server automatically begins its own analysis if the other server’s process is more than five minutes old. This prevents any process from waiting indefinitely on a potentially broken remote process. DHDS uses the file transfer tools built into the Unix subsystem instead of relying on external libraries such as libSSH2, which makes communication much more reliable and expands compatibility. DHDS transfers files with SCP and requires an SSH-DSA public key exchange between each server and the central machine. The DHDS central machine has built-in failsafe features to ensure that it receives all available packages. Every 15 minutes, it checks its database for packages that are reported as complete and cross-references them with the files on its own hard drive. If files are missing, it connects individually to the server responsible for each such package and retrieves it.
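Three of these safeguards (the retry loop, the five-minute takeover rule and the periodic package check) are simple enough to sketch in Python. All function names and the job labels are hypothetical, and the sketch assumes that “five times” means five attempts in total, which the text leaves open.

```python
import time


def with_retries(request, attempts=5, pause_seconds=20, sleep=time.sleep):
    """Call request() until it succeeds, pausing between failed attempts.

    Returns (True, result) on success or (False, last_error) once all
    attempts are used up, so the caller can mark the job as incomplete.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return True, request()
        except OSError as err:  # ConnectionError and TimeoutError are OSErrors
            last_error = err
            if attempt < attempts:
                sleep(pause_seconds)
    return False, last_error


def should_take_over(started_at, now, max_age_seconds=300):
    """True if a remote analysis is over five minutes old and should be redone locally."""
    return now - started_at > max_age_seconds


def missing_packages(completed_in_db, files_on_disk):
    """Packages reported complete in the database but absent from disk."""
    return sorted(set(completed_in_db) - set(files_on_disk))


# Simulated request that fails twice and then succeeds; pauses are recorded
# instead of actually sleeping so the example runs instantly.
calls = {"n": 0}


def flaky_upload():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated network failure")
    return "ok"


pauses = []
outcome = with_retries(flaky_upload, sleep=pauses.append)
print(outcome, pauses)  # prints (True, 'ok') [20, 20]
print(missing_packages(["job1", "job2", "job3"], ["job1", "job3"]))  # prints ['job2']
```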
DHDS has a set of analysis tools designed to quickly recognize potential homology and eliminate virtually all unnecessary manual labor. DHDS generates an HTML document with a table listing the compared TCIDs and their accession numbers, with tabulated SSearch and GSAT scores in S.D., ordered from highest to lowest. Viewing an individual result brings up a similar table containing tabulated data about the proteins’ individual BLAST results. Each result is clearly labeled with its SSearch and GSAT scores in S.D., and a detailed GSAT analysis is provided on the same page. One of the most useful features is the integration of HMMGAP results [Reddy, 2010]. HMMGAP is an analysis tool that was designed specifically for DHDS but can operate independently and is available on TCDB’s Biotools page. HMMGAP determines the most similar regions of two protein sequences using the Smith-Waterman algorithm and aligns them using the Needleman-Wunsch algorithm, while highlighting and numbering all TMS regions. A user can thus quickly identify which TMSs align with which (see Fig. 1).
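For readers unfamiliar with the global alignment step, the following is a minimal Needleman-Wunsch sketch in Python. It is not HMMGAP code: it uses a toy scoring scheme (match 1, mismatch -1, linear gap penalty -2) instead of a protein substitution matrix, and it does not attempt the Smith-Waterman region selection or the TMS highlighting.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap  # a prefix aligned entirely against gaps
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),  # align two residues
                score[i - 1][j] + gap,             # gap in b
                score[i][j - 1] + gap,             # gap in a
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


print(needleman_wunsch("MKLV", "MLV"))  # prints (1, 'MKLV', 'M-LV')
```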
Fig. 1 (Displaying the best alignment of two segments with numbered TMS regions)
The results viewer also includes the Movable Sequence Visualizer Tool (MSVT) [Reddy, 2010]. MSVT was likewise designed specifically for DHDS but functions independently and can also be found on TCDB’s Biotools page. MSVT allows users to drag and drop entire sequences and experiment with different alignments. The best alignment options are highlighted in pink, and TMS regions are presented in bold print and underlined. Users can also quickly identify a residue’s number by hovering the mouse pointer over the amino acid residue in question (see Fig. 2).
Fig. 2 (Displaying two entire sequences with the best alignment highlighted pink, in drag & drop interface)
It is vital to have clear and precise analysis tools when dealing with the tremendous amounts of data DHDS generates. Using the tools specially designed for DHDS along with a set of pre-existing tools that have been integrated into the results viewer, potential homologies can be located in a fraction of the time otherwise required. The external tools included in the results viewer are: (1) the Web-based Hydropathy, Amphipathicity and Topology program (WHAT) [Zhai & Saier, 2001a], which uses a sliding window of 19 residues for alpha-helices or 9 residues for beta-strands to determine and plot the hydropathy, amphipathicity, secondary structure and predicted transmembrane topology along any protein sequence; and (2) HMMTOP [Tusnády & Simon], a combined transmembrane topology and signal peptide predictor used to predict orientation in the membrane. While WHAT presents a graphical depiction of average hydropathy, HMMTOP shows the predicted positions of the TMSs along the linear sequence of the protein, thus allowing identification of the specific residues that make up the transmembrane segments of each protein.
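The sliding-window averaging behind a hydropathy plot can be sketched as below. The text does not say which hydropathy scale WHAT uses; the Kyte-Doolittle scale is assumed here for illustration, and the function name and the toy sequence are hypothetical.

```python
# Kyte-Doolittle hydropathy values (assumed scale; higher = more hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def sliding_hydropathy(sequence, window=19):
    """Average hydropathy of every full window along the sequence.

    Element k of the result covers residues k .. k + window - 1, so it is
    usually plotted at the window centre, position k + window // 2.
    """
    values = [KYTE_DOOLITTLE[residue] for residue in sequence.upper()]
    if len(values) < window:
        return []
    total = sum(values[:window])
    averages = [total / window]
    for i in range(window, len(values)):
        total += values[i] - values[i - window]  # slide the window by one
        averages.append(total / window)
    return averages


# Toy sequence: a hydrophobic stretch flanked by charged residues.
toy = "KKDE" + "LIVAL" * 4 + "EDKK"
profile = sliding_hydropathy(toy, window=19)
peak = max(range(len(profile)), key=profile.__getitem__)
print(len(profile), peak)  # number of windows and index of the most hydrophobic one
```

A window of 19 suits membrane-spanning alpha-helices, as stated in the text; passing window=9 gives the beta-strand setting.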