SynTom Table Definitions - May 4, 1999
In order to determine what fields will be necessary and to get an idea of what types of data these fields should contain, I've attempted to "capture" the process through which I analyze sequence data.
Download data from available sources: we'll begin with the TIGR sequences. Things will be different for data sources which are not given to us as raw data (i.e. chromatograms).
Run phred to convert the chromatograms into sequence + quality
Phred provides us with two pieces of information: a) FASTA sequence and b) quality information
Run cross_match to remove vector contamination
Cross_match provides us with "clean" DNA sequence
a) Clean DNA sequence (max size=800,000; average size=740.9)
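A minimal sketch of how the maximum and average sequence sizes could be checked before loading, assuming the cleaned cross_match output is an ordinary FASTA file. The function names read_fasta and length_stats are hypothetical and are not part of phred or cross_match.

```python
from typing import Dict, Iterable, Tuple


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {name: sequence} dictionary."""
    records: Dict[str, str] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the sequence name is the first word after '>'.
            words = line[1:].split()
            name = words[0] if words else ""
            records[name] = ""
        elif name is not None:
            # Sequence data may be wrapped over several lines.
            records[name] += line
    return records


def length_stats(records: Dict[str, str]) -> Tuple[int, float]:
    """Return (max length, average length) over all sequences."""
    lengths = [len(seq) for seq in records.values()]
    if not lengths:
        return 0, 0.0
    return max(lengths), sum(lengths) / len(lengths)
```

Reading from a file would look like `length_stats(read_fasta(open(path)))`, where path is whatever file the cleaned sequences were written to.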
At this point the raw sequence information is loaded into the database
Creation of a FASTA formatted database
In order to begin forming contigs, each sequence must be BLASTed against the database
FOR firstsequence TO lastsequence
    Export fasta sequence
    blastall -i fastasequence -o fastasequence.out -d soldb -p blastn
    Is the first "hit" itself?
        NO: big problems -> note sequence in error.log
        YES: disregard first hit
    WHILE E is better than 1E-50
        Hitname is SIMILARTO fastasequence
    /end WHILE
/end FOR
(Should we add a loop to begin annotating? I.e. if the sequence is from GenBank and the E value is better than 1E-30, then PUTATIVE ID?)
End Result: Each sequence in the database will have a "list" of sequences in the database which it is similar to. List size (0,2000)
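The hit-filtering logic of the loop above might look like the following sketch. It assumes the BLAST report has already been parsed into a list of (hit name, E value) pairs sorted best-first; the parsing itself is not shown. The names similar_to and SelfHitMissing are hypothetical, and "better than 1E-50" is read as strictly smaller than 1E-50.

```python
from typing import List, Tuple

E_CUTOFF = 1e-50  # threshold taken from the notes


class SelfHitMissing(Exception):
    """Raised when the best hit is not the query itself (the error.log case)."""


def similar_to(query: str,
               hits: List[Tuple[str, float]],
               cutoff: float = E_CUTOFF) -> List[str]:
    """Return names of hits similar to the query, excluding the self hit.

    hits is assumed to be sorted best-first, as in a BLAST report.
    """
    # The first hit should be the query itself; otherwise flag the sequence.
    if not hits or hits[0][0] != query:
        raise SelfHitMissing(query)

    similar: List[str] = []
    for name, evalue in hits[1:]:  # disregard the first (self) hit
        if evalue >= cutoff:
            # Hits are sorted, so nothing further down can pass the cutoff.
            break
        similar.append(name)
    return similar
```

A caller would catch SelfHitMissing and write the sequence name to error.log, then store the returned list as the SIMILARTO list for that sequence.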
In order to have contigs quickly assembled, we will send the BLAST results to either phrap or asmg (I'm partial to asmg only because I've used it a bit more AND because TIGR gives me data in asmg format)
FOR firstsequence TO lastsequence
    Does the sequence have sequences SIMILARTO it?
        NO: mark sequence read
        YES: FOR each sequence in the SIMILARTO list
                 mark similar sequence read
                 export similar sequence as fasta (as we did for formatdb)
             mark sequence read
/end FOR
End Result: All the sequences are "clustered" in files
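The marking pass above can be sketched as follows. Two points are assumptions that the notes do not state: sequences already marked read are skipped when the outer loop reaches them, and the seed sequence is written into its own cluster. The function name cluster is hypothetical. Note that this is a single pass over the SIMILARTO lists, not a full transitive closure.

```python
from typing import Dict, List


def cluster(similar: Dict[str, List[str]]) -> List[List[str]]:
    """Group sequences into clusters using their SIMILARTO lists.

    similar maps each sequence name to its SIMILARTO list.
    Sequences with an empty list come back as one-member clusters.
    """
    read = set()                      # sequences already marked read
    clusters: List[List[str]] = []
    for name in similar:              # dicts keep insertion order
        if name in read:
            continue                  # assumption: skip sequences already placed
        read.add(name)
        members = [name]              # assumption: the seed joins its own cluster
        for other in similar[name]:
            if other not in read:
                read.add(other)
                members.append(other)
        clusters.append(members)
    return clusters
```

Each inner list would then be exported as one FASTA file and handed to the assembler.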
Run ASMG on each cluster file
Returns a *.asm file
Parse the asm files into the assembly table
We spoke about this briefly last time….
Sequence: consensus sequence (type sequence)
Assembly_id: the number of the assembly (N.B. this is not provided by asmg; it is meant to be provided by the database, numbered sequentially)
Method: method used to generate the assembly (asmg or phrap)
Redundancy: The average depth of the assembly. Depth is the number of component sequences at a given base in the assembly. (real)
Perc_N: The percentage of ambiguities in the consensus sequence. (real)
Seq#: The number of ESTs in the assembly. (integer)
Ed_date: date when the assembly was created (date)
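As a rough illustration of the two derived fields, Perc_N and Redundancy could be computed as below. This sketch assumes that "ambiguities" means N characters only, and that each component sequence is described by the 1-based, inclusive consensus coordinates (asm_lend, asm_rend) listed with the per-EST fields. The function names are hypothetical.

```python
from typing import Iterable, Tuple


def perc_n(consensus: str) -> float:
    """Percentage of N characters in the consensus sequence."""
    if not consensus:
        return 0.0
    n_count = sum(1 for base in consensus.upper() if base == "N")
    return 100.0 * n_count / len(consensus)


def redundancy(consensus_length: int,
               spans: Iterable[Tuple[int, int]]) -> float:
    """Average depth: total bases covered by components / consensus length.

    spans holds (asm_lend, asm_rend) pairs, assumed 1-based and inclusive.
    Summing the span lengths equals summing the depth over every base.
    """
    if consensus_length <= 0:
        return 0.0
    covered = sum(rend - lend + 1 for lend, rend in spans)
    return covered / consensus_length
```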
Each EST as part of an assembly then has the following fields.
Seq_name: the name of the sequence (points to the sequence table)
Asm_lend: left end of overlap in the consensus sequence (integer)
Asm_rend: right end of overlap in the consensus sequence (integer)
Seq_lend*: left end of overlap in the EST (integer)
Seq_rend*: right end of overlap in EST (integer)
Db: database where sequence is located (database name, e.g. soldb)
Offset: The base pair location where the EST first overlaps the assembly at the left end, minus 1. For example, if the asm_lend of the assembly is 42, the offset is 41.
Lsequence: nucleotides of gapped EST sequence in one line (type sequence)
Each consensus sequence is made up of many individual sequences.
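One way to picture the two record types is the sketch below, which uses Python dataclasses in place of real table definitions. The class names Assembly and AssemblyMember are hypothetical, the column types follow the ones given in the field lists, and Offset is derived from asm_lend rather than stored, which is a design choice made here only for illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class AssemblyMember:
    """One EST as part of an assembly."""
    seq_name: str    # points to the sequence table
    asm_lend: int    # left end of overlap in the consensus
    asm_rend: int    # right end of overlap in the consensus
    seq_lend: int    # left end of overlap in the EST
    seq_rend: int    # right end of overlap in the EST
    db: str          # database name, e.g. "soldb"
    lsequence: str   # gapped EST sequence on one line

    @property
    def offset(self) -> int:
        # Offset is asm_lend minus 1, as in the example in the notes.
        return self.asm_lend - 1


@dataclass
class Assembly:
    """One consensus sequence and its component ESTs."""
    assembly_id: int   # assigned sequentially by the database
    sequence: str      # consensus sequence
    method: str        # "asmg" or "phrap"
    ed_date: date      # date the assembly was created
    members: List[AssemblyMember] = field(default_factory=list)

    @property
    def seq_count(self) -> int:
        # Corresponds to the Seq# field: number of ESTs in the assembly.
        return len(self.members)
```

The one-to-many relationship (one Assembly, many AssemblyMember rows) is the part that would carry over to the actual table design.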
BLAST the assemblies and singletons against the world
FOR each assembly and each singleton
Export fasta formatted sequence
    blastall -p blastn -i fastasequence -o fastasequencen.nt -d nt
    blastall -p blastn -i fastasequence -o fastasequencen.est -d est
    blastall -p blastn -i fastasequence -o fastasequencen.ar -d arabidopsis
    blastall -p tblastx -i fastasequence -o fastasequencetx.nt -d nt
    blastall -p tblastx -i fastasequence -o fastasequencetx.est -d est
    blastall -p tblastx -i fastasequence -o fastasequencetx.ar -d arabidopsis
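The six command lines follow a regular pattern (two programs by three databases), so they could be generated rather than typed out. The sketch below only builds the argument lists and does not run anything. The function name blast_commands and the two lookup tables are hypothetical; the output-file naming copies the pattern used above, and -i is blastall's query-file option.

```python
from typing import List

# program -> tag inserted before the output extension
PROGRAMS = {"blastn": "n", "tblastx": "tx"}
# database name -> output file extension
DATABASES = {"nt": "nt", "est": "est", "arabidopsis": "ar"}


def blast_commands(fasta_path: str) -> List[List[str]]:
    """Build one blastall argument list per program/database combination."""
    commands: List[List[str]] = []
    for program, tag in PROGRAMS.items():
        for db, ext in DATABASES.items():
            out_path = f"{fasta_path}{tag}.{ext}"
            commands.append([
                "blastall", "-p", program,
                "-i", fasta_path,
                "-o", out_path,
                "-d", db,
            ])
    return commands
```

Each list could be passed to subprocess.run on a machine where blastall and the three databases are installed.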
Andreas Matern - May 4, 1999