SYNTOM Meeting Notes - ALM - July 2, 1999

Note to self: when's the next meeting?

Last Time: (6/30/99)

Presented a number of tables, and made a lot of changes. Here's how my oversimplified ERM stood:

Divisions

Accessions

Taxa

People

Libraries

Vectors

Authorships

Sequences

Clone Synonyms

Clones

Genetic Codes

Taxa_Names

Merged_Nodes

Deleted_Nodes

The tables with Italicized names weren't discussed last time

The tables with underlined names were presented by Dave Schneider

The tables with bold italicized tables need more discussion

We ended the discussion after discussing the People table.

As so many changes were made, here are the new table definitions:

Libraries

Field_Name	Data_Type	Restrictions	Example
Library_id	Integer	Not null, primary key	327
Tissue	Varchar2(256)		Leaf
Accession	Integer	-> accession table	9 - TA496
Organism	Integer	-> taxa table	21 - L. esculentum
Developmental_Stage	Varchar2(256)		4 week old
Treatment	Varchar2(256)		Pto innoculated
Date_created	Date		May 11, 1999
Created_by	Integer	-> Lab table	3 - Martin
Host	Varchar2(256)		SOLR
Vector	Integer	-> Vector Table	7 - pBlueScript SK(-)
PFU	Integer		1010
Average_Insert_Size	Integer		350
Availability	Boolean		TRUE
Type	Varchar(256)		mRNA
Library_Name	Varchar2(256)		Tomato susceptible, Cornell
Comments	Varchar(256)		Blah blah blah
Distributed	Integer	-> Lab
Antibiotic	Varchar(n)
5' Directional
Restriction_Enzyme_1
Restriction_Enzyme_2

Vectors

Field_Name	Data_Type	Restrictions	Example
Vector_id	Integer	Not null, primary key	7
Vector_name	Varchar2(256)		pBlueScript SK(-)
Vector_sequence	Long		ACTG….
Vector_Vendor	VarChar(256)		Stratagene
Comments

Clones - This table will be modified when we get to TMF's tables

Field_Name	Data_Type	Restrictions	Example
Clone_id	Integer	Not null, primary key	7
Library_id	Integer	-> library	13
Plate_number
Well_number
Row_number
Comments

Clone_Synonyms

Field_Name	Data_Type	Restrictions	Example
Other_Clone_Name
Our_Clone_Name

People

Field_Name	Data_Type	Restrictions	Example
People_id		Not null is key
Last_name
First_name
Mail
Country
Phone
Fax
e-mail
Homepage
Initials
Project
Group		=> Labs	Martin
*Security*
Comments
Last_updated

Security is highlighted because our discussion ended there last time.

The people table needs to be entered by hand. We will not (as previously discussed) track everyone in GenBank. Only those individuals who are pertinent to the database are going to be entered here -- generally people in the main solanaceae labs.

It may be important to group people together in a lab (as shown in the Libraries table):

Labs

Field_Name	Data_Type	Restrictions	Example
Group_id	Integer	Not null is key
Group_name	String		Martin
Group_leader	Integer	-> People_id

These data also need to be entered by hand.

Authorships

Field_Name	Data_Type	Restrictions	Example
People_id
Library_id
Reference_id
Sequence_id
Others…

This allows multiple people to be part of a library, a sequence, a reference (if we choose to include them), etc.

So now it looks like it's time to start making the sequence tables. As mentioned before (http://syntom.cit.cornell.edu/database/Jun1599.htm) sequences come from two primary sources:

1 Raw sequencing data (chromatographs) generally produced outside of Cornell Theory Center

2 FASTA formatted file - generally NCBI or internally

How are these data retrieved:

As a "recipe":

1. Chromatographs are either downloaded via FTP or read off a CD-ROM

2. Once the files are stored locally, phred is run on the directory containing the files

3. Phred is the chromatograph reader and its output is two-fold, a FASTA formatted sequence for each chromatograph as well as a space delimited quality file.

4. Cross_match is then run on the phred-derived files. Cross match removes vector contamination from sequences. The program requires a vector database (which is FASTA formatted) to perform the screening. Cross_match provides FASTA formatted sequences as output.

Some notes:

Phred takes no parameters other than those related to input and output naming and format.

I've already gone over the GenBank fields so I won't do that again. But, the GenBank info is important and will be kept.

With those notes in mind:

Sequences - this table will get updated again when we talk about BLAST and contigs and cross_match output

Field_Name	Data_Type	Restrictions	Example
Sequence_id		Is key not null
Organism		-> taxa
Accession		-> accession
Clone		-> Clone_id

Chromatograph	Filename
Sequence_conversion_id	Integer	-> sequence_conversion
Raw_sequence	Long	Phred output
Quality		Phred output
Vector_screening_id	Integer	-> vector screening
Clean Sequence	Long
Last update:	Date

As the sequence conversion and vector stripping processes take place in batch, I think it would be easiest to break those pieces of sequence related information into separate tables:

Sequence_Conversion

Field_Name	Data_Type	Restrictions	Example
Conversion_id		Primary key
Conversion_program		Default phred
Conversion_version		Default 0.980904a
Conversion_person		-> people_id
Conversion_platform		Default NT
Trim	Boolean	Default 0

Trim is the only command line variable which alters the output of phred

Vector_Stripping

Field_Name	Data_Type	Restrictions	Example
Vector_stripping_id		Primary key
Vector_program		Default cross_match
Vector_version		Default 0.990319
Vector_platform		Default NT
Vector_person		-> people_id
Vector_database		-> vector_database_id
Penalty	Integer	Mismatch penalty
Gap_init	Integer	Gap initiation penalty
Gap_ext	Integer	Gap extension penalty
Ins_gap_ext	Intger	Insertion gap extension penalty
Del_gap_ext	Integer	Deletion gap extension penalty
Matrix	Varchar	Matrix instread of penalties	This isn't implemented yet in cross_match should be soon however
Raw	Bitflag	Use raw SW scores instead of complexity adjusted
Minmatch	Default=14	Minimum length of word to begin SW comparison
Maxmatch	Default=30	Maximum word length
Max_group_size
…

This is getting a bit arduous to type in, and no one probably cares too much other than to know that the most of the values can be defaulted, this needs to be changed only in batch. Docs for cross_match (explaining the parameters) is here: http://bozeman.mbt.washington.edu/phrap.docs/phrap.html.

How do I handle a vector database? Well, it's just a number of vectors in a fasta file. So, as we already have the vector table with a sequence field, therefore….

Vector_database table

Field_Name	Data_Type	Restrictions	Example
Vector_database_id	Integer	Is key not null
Vector_id	Integer	-> vector table

Now, where are we:

Not too bad.

In our dataflow model, we've just taken the chromatographs, run them through phred and cross_match and now we have cleaned sequences.

Points to note:

I haven't tried it yet, so I don't know how easy it is to call external programs.

Q: How do we, in practice, do all this data processing?

1) get sequences by ftp or cd (probably manually)

2) call phred

3) store phred output

4) make vector database

5) call cross_match

6) store cross_match output

As my data is stored flat-file now, it's easy. Just a bunch of perl scripts "wrapped" together by a "meta" perl script. I'm assuming I just need to alter them to do the data handling for Oracle?

Cross_match output and "trash" sequences

For the curious, here's what cross_match "cleaned" sequence looks like

>sequence_name 746 0 746 ABI

GGGGAGGGAAGGAGGCAGTTGAATAGGAAGACCAAACCGGGTGGAAAGTA

GATGGGCCCTAGGCGCGATCTAGATGTACTAACGAGATATAATTTTTATG

GATAAATAATTAACAGCCCAAATTTAATATATGATTGATTAGGAATCCAC

ATAACACATGATGCGTTCAACTTACAGGGAACGTGTCTTTACACCTATCA

TCAAACCCTAACACAGTAAAGATATTCAAATTCTTAAGAGCTAGTGAATT

GGGTAACAGCCTTTGTGCCTTCAGAGACGGCATGCTTAGCCAATTCACCA

GGAAGGACCAATCGAACAGCCGTCTGAATTTCCCGAGAAGTTATAGTAGG

CTTCTTCTCGTGCCGAATTCTTTGGATCCACTAGTGTCGACCTGCAGGCG

CGCXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

From the phred docs: (http://bozeman.mbt.washington.edu/phrap.docs/phred.html)

The FASTA header, as written by phred, contains the following fields:

>chromat_name 1323 15 548 ABI

where the chromatogram name immediately follows the header delimiter, which is ">", the first integer is the number of bases called by phred, the second integer is the number of bases trimmed off the beginning of the sequence, the third integer is the number of bases remaining following trimming, and the string describes the type of input file, which is either ABI or SCF.

As I generally do NOT let phred do the trimming, my values on the sequence header are always 0 and (slen). I've only received ABI files to date.

The X's in the sequence are the result of cross_match. Anything that gets SWAT'ed against vector with a high enough score is converted to an X. This is why I feel it necessary to store the un-cross_matched seqeunce as well as the processed sequence, so we can trace cross_match errors.

However, it's uninformative for the users of the database to have "trash" sequences in their input/ouput.

So, following TIGR's lead. Any sequence which is ³ 80% X is considered trash.

This is done (again) using Perl.

The same Perl script also removes the X's from the files and makes a new clean sequence which has the X's removed as well as the extra info on the header line. So my (fictitious) sequence from before looks like this after processing:

>sequence_name

GGGGAGGGAAGGAGGCAGTTGAATAGGAAGACCAAACCGGGTGGAAAGTA

GATGGGCCCTAGGCGCGATCTAGATGTACTAACGAGATATAATTTTTATG

GATAAATAATTAACAGCCCAAATTTAATATATGATTGATTAGGAATCCAC

ATAACACATGATGCGTTCAACTTACAGGGAACGTGTCTTTACACCTATCA

TCAAACCCTAACACAGTAAAGATATTCAAATTCTTAAGAGCTAGTGAATT

GGGTAACAGCCTTTGTGCCTTCAGAGACGGCATGCTTAGCCAATTCACCA

GGAAGGACCAATCGAACAGCCGTCTGAATTTCCCGAGAAGTTATAGTAGG

CTTCTTCTCGTGCCGAATTCTTTGGATCCACTAGTGTCGACCTGCAGGCG

Trash also screens sequences out if they are considered "odd". For example, if a sequence has a long string of XXXs in the middle, but none at the ends:

>sequence_name

GGGGAGGGAAGGAGGCAGTTGAATAGGAAGACCAAACCGGGTGGAAAGTA

GATGGGCCCTAGGCGCGATCTAGATGTACTAACGAGATATAATTTTTATG

GATAAATAATTAACAGCCCAAATTTAATATATGATTGATXXXXXXXXXXX

XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

XXXXXXXCTAACACAGTAAAGATATTCAAATTCTTAAGAGCTAGTGAATT

GGGTAACAGCCTTTGTGCCTTCAGAGACGGCATGCTTAGCCAATTCACCA

GGAAGGACCAATCGAACAGCCGTCTGAATTTCCCGAGAAGTTATAGTAGG

CTTCTTCTCGTGCCGAATTCTTTGGATCCACTAGTGTCGACCTGCAGGCG

Which doesn't make a lot of sense to me, then that seqeunce is "set aside" for further evaluation.

I use the unclean sequence, BLAST it against the non-redundant nucleotide database and look for matches.

So some new fields need to be added to sequence, and here's where the question of multiple longs per table comes into play.

Sequences - this table will get updated again when we talk about BLAST and contigs

Field_Name	Data_Type	Restrictions	Example
Sequence_id		Is key not null
Organism		-> taxa
Accession		-> accession
Clone		-> Clone_id

Chromatograph	Filename
Sequence_conversion_id	Integer	-> sequence_conversion
Raw_sequence	Long	Phred output
Quality		Phred output
Vector_screening_id	Integer	-> vector screening
Clean Sequence	Long
Last update:	Date
Raw_Sequence_length	Integer	Phred output
Cleaned_sequence_length	Integer	Trash ouput
Trash	Boolean	Trash output
Sequence	Long	The cleaned, trash-free sequence
Sequence_start	Integer	The bp of raw_seqeuence_length that begins the clean, trash-free sequence
Sequence_End	Integer	See seqeunce_start

How do we store BLAST?

How do we make contigs?

BLAST parameters (BLASTN, TBLASTX, etc.)

Assembler/Phrap/others….