procedure

SynTom Table Definitions - May 4, 1999

In order to determine what fields will be necessary and to get an idea of what types of data these fields should contain, I've attempted to "capture" the process through which I analyze sequence data.

Download data from available sources: we'll begin with the TIGR sequences. Things will be different for data sources which are not given to us as raw data (i.e. chromatograms).

Run phred to convert the chromatograms into sequence + quality

date of phred run (May 4, 1999)
version of phred (0.960108)
who ran phred (Maura Hart)
what platform (NT)

Phred provides us with two pieces of information: a) fasta sequence and b) quality information

raw fasta sequence file (max size = 800,000; average size = 741)
quality file (max size = 800,000; average size = 741)
90% of the sequences will be close to the average size; the max size comes from GenBank
fasta formatted files contain the name of the sequence as well as the sequence itself
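
To check size figures like the ones above against a real batch, something along these lines could be used. It is only a sketch: the function names and the sample records are made up for illustration, and it assumes the sequence name is the first whitespace-delimited token after the '>'.

```python
from statistics import mean


def read_fasta(lines):
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Assumption: the name is the first token of the header line.
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def length_stats(records):
    """Return (max length, average length) for (name, sequence) pairs."""
    lengths = [len(seq) for _, seq in records]
    return max(lengths), mean(lengths)


# Made-up records, only to show the shape of the data.
example = [">seq1 example read", "ACGTAC", "GT", ">seq2", "ACGT"]
records = list(read_fasta(example))
longest, average = length_stats(records)
```

The same reader works on an open file handle, since it only needs an iterable of lines.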

Library information

Organism (Lycopersicon esculentum)
Cultivar (TA496)
Tissue (tomato ovary)
who created the library (Alcala)
when was the library created (January 8, 1999)
who did the sequencing (TIGR)
when was the sequencing done (March 15, 1999)
vector (pBluescript SK(-))
restriction site 1 (EcoRI)
restriction site 2 (XhoI)
developmental stage (5 days pre-anthesis to 5 days post-anthesis)
host (XL1-Blue MRF')
library name (tomato ovary, TAMU)
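
As a first pass at what a library record might look like, here is a sketch in Python. The field names and types are my guesses (nothing here is a settled table definition); the values are the ones listed above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Library:
    # Field names and types are assumptions, not a final schema.
    name: str
    organism: str
    cultivar: str
    tissue: str
    developmental_stage: str
    created_by: str
    created_on: date
    sequenced_by: str
    sequenced_on: date
    vector: str
    restriction_site_1: str
    restriction_site_2: str
    host: str


tomato_ovary = Library(
    name="tomato ovary, TAMU",
    organism="Lycopersicon esculentum",
    cultivar="TA496",
    tissue="tomato ovary",
    developmental_stage="5 days pre-anthesis to 5 days post-anthesis",
    created_by="Alcala",
    created_on=date(1999, 1, 8),
    sequenced_by="TIGR",
    sequenced_on=date(1999, 3, 15),
    vector="pBluescript SK(-)",
    restriction_site_1="EcoRI",
    restriction_site_2="XhoI",
    host="XL1-Blue MRF'",
)
```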

Run cross_match to remove vector contamination

date of cross_match run (May 4, 1999)
version of cross_match (0.990319)
who ran cross_match (Maura Hart)
what platform (NT)
what vector database (vector.seq)
what date was the vector database made (April 13, 1999)
who made the vector database (Andreas Matern)
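
The phred and cross_match runs record the same kind of information (date, version, who, platform), so one record type could probably serve both. A sketch, with class and field names that are purely illustrative:

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ToolRun:
    tool: str
    version: str  # kept as text so "0.960108" keeps its exact form
    run_on: date
    run_by: str
    platform: str


@dataclass(frozen=True)
class VectorDatabase:
    name: str
    made_on: date
    made_by: str


phred_run = ToolRun("phred", "0.960108", date(1999, 5, 4), "Maura Hart", "NT")
cross_match_run = ToolRun(
    "cross_match", "0.990319", date(1999, 5, 4), "Maura Hart", "NT"
)
vector_db = VectorDatabase("vector.seq", date(1999, 4, 13), "Andreas Matern")
```

Whether the vector database belongs in its own table or as extra columns on the cross_match run is an open design question; the sketch keeps it separate.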

Cross_match provides us with "clean" DNA sequence

Clean DNA sequence (max size = 800,000; average size = 740.9)

At this point the raw sequence information is loaded into the database

Creation of a FASTA formatted database

export clean FASTA sequence to one large file (normally done by 'cat *.fasta >> database')
run formatdb to create a BLASTable "database"

date of formatdb run (May 4, 1999)
version of formatdb (2.0.8)
who ran formatdb (Maura Hart)
what platform (NT)
how many sequences were processed (57921)
name of the database (soldb)
location of database (D:\databases\sol\)
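
The export step and the "how many sequences were processed" figure can be produced together. The sketch below is a Python stand-in for the 'cat' approach, not a replacement for formatdb itself; the function name and the temporary files are made up, and unlike '>>' it overwrites the output file rather than appending to it.

```python
import tempfile
from pathlib import Path


def build_fasta_database(fasta_dir, database_path):
    """Concatenate every *.fasta file in fasta_dir into database_path.

    Returns the number of sequences written (one per '>' header line).
    """
    count = 0
    with open(database_path, "w") as out:
        for path in sorted(Path(fasta_dir).glob("*.fasta")):
            for line in path.read_text().splitlines():
                if not line.strip():
                    continue  # drop blank lines between records
                if line.startswith(">"):
                    count += 1
                out.write(line + "\n")
    return count


# Small made-up example in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.fasta").write_text(">a\nACGT\n")
    Path(tmp, "b.fasta").write_text(">b\nGGCC\n>c\nTTAA\n")
    database = Path(tmp, "database")
    sequence_count = build_fasta_database(tmp, database)
    database_text = database.read_text()
```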

In order to begin forming contigs, each sequence must be BLASTed against the database

FOR firstsequence TO lastsequence
{
    Export fasta sequence
    blastall -i fastasequence -o fastasequence.out -d soldb -p blastn
    Parse_blast.pl fastasequence.out
    Is the first "hit" itself?
        NO: big problems -> note sequence in error.log
        YES: disregard first hit
            WHILE E is better than 1E-50
                Hitname is SIMILARTO fastasequence
            /end WHILE
} /end FOR
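
The decision logic inside the loop can be sketched as below. This assumes the parser has already turned the BLAST report into a list of (hit name, E value) pairs, best hit first; that input shape, the function name, and the reading of "better than" as "strictly smaller than" are my assumptions.

```python
def similar_to(query_name, hits, threshold=1e-50):
    """Return (similar_names, error_message) for one query sequence.

    hits is a list of (hit_name, e_value) pairs ordered best first.
    error_message is None unless the first hit is not the query itself.
    """
    if not hits or hits[0][0] != query_name:
        # "Big problems": this would be noted in error.log.
        return [], f"{query_name}: first hit is not the query itself"
    similar = []
    for hit_name, e_value in hits[1:]:  # disregard the self hit
        if e_value >= threshold:
            break  # hits are sorted, so every later hit is worse too
        similar.append(hit_name)
    return similar, None


# Made-up hits for a query called "s1".
hits = [("s1", 0.0), ("s2", 1e-120), ("s3", 1e-60), ("s4", 1e-10)]
similar, error = similar_to("s1", hits)
_, missing_self_error = similar_to("s9", hits)
```

Here "s4" is left out because its E value is worse than the cutoff.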

(Should we add a loop to begin annotating? I.e., if the sequence is from GenBank and the E value is better than 1E-30, then PUTATIVE ID?)

End Result: Each sequence in the database will have a "list" of the sequences in the database that it is similar to. List size: 0 to 2000

Contig Formation

In order to have contigs quickly assembled, we will send the BLAST results to either phrap or asmg (I'm partial to asmg only because I've used it a bit more AND because TIGR gives me data in asmg format)

FOR firstsequence TO lastsequence
{
    Does the sequence have sequences SIMILARTO it?
        NO: mark sequence read
        YES:
            FOR each sequence in the SIMILARTO list
            {
                mark similar sequence read
                export similar sequence as fasta (as we did for formatdb)
            }
            export sequence
            mark sequence read
}

End Result: All the sequences are "clustered" in files
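
A Python sketch of the clustering pass follows. It reads the pseudocode fairly literally, with one assumption added: a sequence already marked read is skipped when the outer loop reaches it. The function name and the example lists are made up.

```python
def cluster_sequences(similar_to):
    """Group sequences into clusters using their SIMILARTO lists.

    similar_to maps each sequence name to its list of similar names,
    in database order. Returns a list of clusters (lists of names);
    each cluster corresponds to one exported file.
    """
    read = set()
    clusters = []
    for name, similar in similar_to.items():
        if name in read:
            continue  # assumption: already exported with an earlier cluster
        read.add(name)
        if not similar:
            continue  # nothing SIMILARTO it, so no cluster file
        read.update(similar)
        # Similar sequences are exported first, then the sequence itself.
        clusters.append(list(similar) + [name])
    return clusters


example_lists = {"A": ["B"], "B": ["A", "C"], "C": ["B"], "D": []}
clusters = cluster_sequences(example_lists)
```

In this example B ends up in two cluster files, once with A and once with C, because the inner loop does not check the read mark. If that turns out to be a problem, grouping by connected components of the SIMILARTO relation would be one alternative, but that is a different technique from the pass described above.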

Run ASMG on each cluster file

Returns a *.asm file

Parse the asm files into the assembly table

We spoke about this briefly last time.

Sequence: consensus sequence (type sequence)

Assembly_id: the number of the assembly (N.B. this is not provided by asmg; it is meant to be provided by the database, numbered sequentially)

Method: method used to generate the assembly (asmg or phrap)

Redundancy: The average depth of the assembly. Depth is the number of component sequences at a given base in the assembly. (real)

Perc_N: The percentage of ambiguities in the consensus sequence. (real)

Seq#: The number of ESTs in the assembly. (integer)

Ed_date: date when the assembly was created (date)

Each EST as part of an assembly then has the following fields.

Seq_name: the name of the sequence (points to the sequence table)

Asm_lend: left end of overlap in the consensus sequence (integer)

Asm_rend: right end of overlap in the consensus sequence (integer)

Seq_lend*: left end of overlap in the EST (integer)

Seq_rend*: right end of overlap in EST (integer)

Db: database where sequence is located (database name -- i.e. soldb)

Offset: the base pair location where the EST first overlaps the assembly at the left end, minus 1 (that is, asm_lend - 1). For example, if the asm_lend is 42, the offset is 41.

Lsequence: nucleotides of gapped EST sequence in one line (type sequence)
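
Redundancy, Perc_N and Offset can all be derived from the consensus and the per-EST coordinates. The sketch below assumes asm_lend and asm_rend are 1-based and inclusive, and that an "ambiguity" means an N in the consensus; both are my assumptions, as are the function names and the example values.

```python
def redundancy(consensus_length, placements):
    """Average depth: total bases covered by ESTs / consensus length.

    placements is a list of (asm_lend, asm_rend) pairs, assumed to be
    1-based and inclusive.
    """
    covered = sum(rend - lend + 1 for lend, rend in placements)
    return covered / consensus_length


def perc_n(consensus):
    """Percentage of N characters in the consensus sequence."""
    return 100.0 * consensus.upper().count("N") / len(consensus)


def offset(asm_lend):
    """Offset is the left end of the overlap in the consensus, minus 1."""
    return asm_lend - 1


# Made-up consensus with two component ESTs.
consensus = "ACGTNACGTA"
placements = [(1, 6), (3, 10)]
depth = redundancy(len(consensus), placements)
ambiguity = perc_n(consensus)
```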

Each consensus sequence is made up of many individual sequences.

SO:

Each assembly points to multiple sequences
Each sequence points to at most one assembly
Sequences which are not members of an assembly are considered singletons
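
One way these relationships could be laid out in tables is sketched below, using Python's sqlite3 module only because it is easy to try out. The table and column names are placeholders and not a proposal for the final schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sequence (seq_name TEXT PRIMARY KEY);
    CREATE TABLE assembly (
        assembly_id INTEGER PRIMARY KEY,  -- numbered by the database
        method TEXT
    );
    -- seq_name is the primary key, so a sequence can belong to at most
    -- one assembly, while an assembly can have many member rows.
    CREATE TABLE assembly_member (
        seq_name TEXT PRIMARY KEY REFERENCES sequence(seq_name),
        assembly_id INTEGER NOT NULL REFERENCES assembly(assembly_id)
    );
    """
)
conn.executemany(
    "INSERT INTO sequence VALUES (?)", [("est1",), ("est2",), ("est3",)]
)
conn.execute("INSERT INTO assembly (method) VALUES ('asmg')")
conn.executemany(
    "INSERT INTO assembly_member VALUES (?, 1)", [("est1",), ("est2",)]
)

# Singletons are the sequences with no row in assembly_member.
singletons = [
    row[0]
    for row in conn.execute(
        """
        SELECT s.seq_name
        FROM sequence s
        LEFT JOIN assembly_member m ON m.seq_name = s.seq_name
        WHERE m.seq_name IS NULL
        ORDER BY s.seq_name
        """
    )
]
```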

BLAST the assemblies and singletons against the world

FOR each assembly and each singleton
{
    Export fasta formatted sequence
    blastall -p blastn -i fastasequence -o fastasequencen.nt -d nt
    blastall -p blastn -i fastasequence -o fastasequencen.est -d est
    blastall -p blastn -i fastasequence -o fastasequencen.ar -d arabidopsis
    blastall -p tblastx -i fastasequence -o fastasequencetx.nt -d nt
    blastall -p tblastx -i fastasequence -o fastasequencetx.est -d est
    blastall -p tblastx -i fastasequence -o fastasequencetx.ar -d arabidopsis
} /end FOR
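
The six searches follow a regular pattern (two programs by three databases), so the command lines can be generated rather than typed. The sketch below only builds the argument lists and does not run anything; the function name is made up, the output naming follows the list above, and the input file is passed with blastall's lowercase -i flag.

```python
def blast_commands(fasta_path):
    """Build the six blastall command lines for one exported sequence."""
    programs = [("blastn", "n"), ("tblastx", "tx")]
    databases = [("nt", "nt"), ("est", "est"), ("arabidopsis", "ar")]
    commands = []
    for program, tag in programs:
        for database, extension in databases:
            output = f"{fasta_path}{tag}.{extension}"
            commands.append(
                ["blastall", "-p", program, "-i", fasta_path,
                 "-o", output, "-d", database]
            )
    return commands


commands = blast_commands("fastasequence")
```

Each list could then be handed to subprocess.run once the databases are in place.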

Andreas Matern - May 4, 1999