Sequence Table Notes
This sequence table is an attempt to capture all of the information required for a specific sequence that will be represented in the SynTom database.
The fields Name, Library and Sequence are currently available from TIGR directly -- every few days the FASTA-formatted sequences are transferred by FTP to syntom (the Linux box on my desktop). Those sequences can be quickly parsed to fill these fields.
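As an illustration of that parsing step, here is a minimal Python sketch. The FASTA reader itself is generic; the header layout assumed in to_fields ("name library") is purely hypothetical, since these notes do not describe how TIGR formats its header lines, and the function names are invented for the example.

```python
import io
from typing import Iterator, Optional, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header: Optional[str] = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts; emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def to_fields(header: str, sequence: str) -> dict:
    """Map one FASTA record onto the Name, Library and Sequence fields.

    ASSUMPTION: the header looks like "<name> <library>". The real TIGR
    header layout is not given in these notes, so adjust this split.
    """
    parts = header.split()
    name = parts[0] if parts else ""
    library = parts[1] if len(parts) > 1 else None
    return {"Name": name, "Library": library, "Sequence": sequence}


if __name__ == "__main__":
    demo = io.StringIO(">EST0001 LIB_A\nACGT\nTTGA\n>EST0002 LIB_B\nGGCC\n")
    for hdr, seq in read_fasta(demo):
        print(to_fields(hdr, seq))
```

Sequence lines that wrap over several lines are joined into one string, so the Sequence field holds the full read regardless of how the file was line-wrapped.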
The remaining fields are all "value added" -- other programs (most notably BLAST, phred, phrap and TIGR assembler) are required to produce these data.
Putative ID: This field is where the eventual sequence ID will be entered. I am currently using the single sequence reads to determine the putative ID. There is concern, however, that using the contig will give more accurate results, since longer sequences give more significant BLAST scores. I contend that this field should remain: if the contig is "real", the ID of its components (i.e. the individual sequences) should be the same as the ID of the contig. This will not always be the case, however. As an example, consider an EST that is composed mostly of a well-defined domain or motif. In a BLAST run the motif may match a whole family of sequences, while the contig may give a more significant hit to one well-defined sequence. For example, an EST containing a MADS box element may show significant similarity to one MADS box-containing sequence, while the entire contig has significant similarity to a different MADS box sequence.
Contig: This field will link each sequence to the contig that contains it. Singletons will have no value for this field. Contigging will be done with the program TIGR Assembler. TIGR currently sends me the *.lsm file containing the contig information they derive; however, I think it will be essential to have contigs created on a per-library basis. Therefore:
Contig Lib: A link from each sequence to the library-specific contig that contains it.
Same As: EST sequencing produces, by its very nature, a great number of redundant sequences. Although much of this information will be accessible through the contig table, as a BLAST postprocessing task each sequence which shares > 95% sequence similarity with another over its entire length should be listed in a "thesaurus" of sorts. This field may require many (max ~ 100?) links to additional sequences. It is not the same as a contig, since contigs are an attempt not only to find redundant clones but also to find clones that extend the length of the sequence read.
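The Same As postprocessing step could be sketched roughly as follows. This is only one possible reading of the rule: it assumes the BLAST hits have already been parsed into records carrying percent identity, alignment length and query length (the Hit type and its field names are hypothetical), and it interprets "over its entire length" as the alignment covering the whole query.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set


class Hit(NamedTuple):
    """One already-parsed BLAST hit (hypothetical record layout)."""
    query: str
    subject: str
    percent_identity: float
    align_length: int
    query_length: int


def build_same_as(
    hits: Iterable[Hit],
    min_identity: float = 95.0,
    min_coverage: float = 1.0,
) -> Dict[str, Set[str]]:
    """Collect, for each query, the subjects that look redundant with it."""
    same_as: Dict[str, Set[str]] = defaultdict(set)
    for hit in hits:
        if hit.query == hit.subject or hit.query_length <= 0:
            continue  # ignore self-hits and malformed records
        # ASSUMPTION: "entire length" means the alignment spans the query.
        coverage = hit.align_length / hit.query_length
        if hit.percent_identity > min_identity and coverage >= min_coverage:
            same_as[hit.query].add(hit.subject)
    return dict(same_as)


if __name__ == "__main__":
    demo = [
        Hit("est1", "est2", 98.0, 500, 500),  # kept
        Hit("est1", "est3", 98.0, 300, 500),  # partial overlap, dropped
        Hit("est1", "est4", 90.0, 500, 500),  # too divergent, dropped
    ]
    print(build_same_as(demo))
```

A partial overlap such as est1/est3 is deliberately excluded here: that kind of relationship extends the read and belongs in the contig, not in the thesaurus.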
Mapped: Each sequence which is informative will be mapped on the tomato genetic map. This information will be part of the map table (or series of tables), but we need a quick and easy way to see whether a particular sequence already has a map position associated with it.
Chromatograph: A link to the chromatograph file -- perhaps accessible through the web via a Java application.
Untrimmed: The untrimmed sequence -- the output from the phred read, including all low quality sequence as well as the vector sequence.
Quality: The phred quality scores for the sequence.
Length: The length of the sequence (trimmed?) in base pairs.
Arabidopsis: The main goal of SynTom is to be a useful tool for the analysis of Arabidopsis-Tomato synteny. The Arabidopsis field will hold a link to the highest-scoring Arabidopsis sequence in a BLAST comparison, and so provides an obvious link to the Arabidopsis table(s).
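To make the shape of one row easier to picture, the fields described so far can be sketched as a Python record. Everything about the representation is an assumption made for illustration: the class name, the attribute types, storing links as plain string keys, and computing Length from the trimmed sequence (the notes leave the trimmed/untrimmed question open).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SequenceRecord:
    """One row of the sequence table (illustrative layout only)."""
    # Available directly from the TIGR FASTA files.
    name: str
    library: str
    sequence: str
    # "Value added" fields, filled in by later processing.
    putative_id: Optional[str] = None
    contig: Optional[str] = None        # None for singletons
    contig_lib: Optional[str] = None    # library-specific contig
    same_as: List[str] = field(default_factory=list)  # redundant sequences
    mapped: bool = False                # quick flag; details live in the map table
    chromatograph: Optional[str] = None  # link to the chromatograph file
    untrimmed: Optional[str] = None     # raw phred read
    quality: List[int] = field(default_factory=list)  # phred scores
    arabidopsis: Optional[str] = None   # best Arabidopsis hit

    @property
    def length(self) -> int:
        # ASSUMPTION: Length is measured on the trimmed sequence.
        return len(self.sequence)

    @property
    def is_singleton(self) -> bool:
        return self.contig is None


if __name__ == "__main__":
    rec = SequenceRecord(name="EST0001", library="LIB_A", sequence="ACGTAC")
    print(rec.length, rec.is_singleton)
```

Keeping Mapped as a simple flag matches the stated need for a quick check, with the actual map position left to the map table.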
The BLAST parsing:
As mentioned earlier in the SynTom mission statement, the primary goal of this database is to be a useful tool for the analysis of Arabidopsis-Tomato synteny. In addition, however, supplemental information, including BLAST results for the sequences against other GenBank databases such as dbEST and the non-redundant database, is important in the assignment of function to ESTs and contigs as well as for the comparison of the tomato sequences to the Arabidopsis databases.
Each of the BLAST fields is a link to the highest-scoring sequence from a run of the chosen BLAST program against the chosen database.
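Picking the highest-scoring sequence for each query could look something like the sketch below. It assumes the BLAST report for one program/database combination has already been parsed into (query, subject, score) tuples; that tuple layout and the function name are hypothetical, and no particular BLAST output format is implied.

```python
from typing import Dict, Iterable, Tuple


def best_hits(
    hits: Iterable[Tuple[str, str, float]]
) -> Dict[str, Tuple[str, float]]:
    """Return, for each query, its highest-scoring (subject, score) pair.

    On a tie, the hit seen first is kept.
    """
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return best


if __name__ == "__main__":
    demo = [
        ("est1", "At_seq1", 120.0),
        ("est1", "At_seq2", 310.5),
        ("est2", "At_seq3", 88.0),
    ]
    for query, (subject, score) in best_hits(demo).items():
        print(query, subject, score)
```

Running this once per program/database pair would give one link per BLAST field for every sequence.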
BLASTN: compares a nucleotide query sequence against a nucleotide database. By its very nature, BLASTN only finds significant similarity between two sequences which have not diverged evolutionarily in any significant manner. Most Solanaceae sequences, for example, find their highest BLASTN score with other Solanaceae sequences.
TBLASTX: compares a nucleotide sequence translated in all six reading frames against a nucleotide database translated in all six reading frames. This tolerates a great deal of evolutionary divergence while still detecting common sequence motifs. Most significant Arabidopsis:Tomato matches are found with TBLASTX.
Sol: the Solanaceae database. Each month I download from GenBank all of the sequences available for the Solanaceae. To this FASTA-formatted file I add 1) sequences from TIGR, 2) sequences generated in the lab, and 3) sequences from commercial sources. All the sequences are from tomato, potato, petunia, tobacco and other members of the Solanaceae. Matches (especially BLASTN matches) to the Solanaceae database from solanaceous sequences give strong evidence that the function of the sequence in question is the same as that of the database sequence.
Nt: shorthand for the non-redundant nucleotide database at NCBI. Contains all the sequences EXCEPT FOR:
EST: dbEST -- the database of expressed sequence tags