SYNTOM Meeting Notes - ALM - July 8, 1999

Last Time: (7/2/99)

The notable changes to the tables are presented here in bold

Clone_Synonyms

Field_Name	Data_Type	Restrictions	Example
Other_Clone_Name
Our_Clone_Name
Clone_Source	-> Lab table

Labs

Field_Name	Data_Type	Restrictions	Example
Group_id	Integer	Not null is key
Group_name	String		Martin
Group_leader	Integer	-> People_id
Comments

Authorships

Field_Name	Data_Type	Restrictions	Example
People_id
Authorship_Type	"keyword"		Library, sequence, etc
Keyword_Value	Integer	-> ???	3

So, if People_id 12 authored library number 5, the entries would be:

People_id 12

Authorship_Type Library

Keyword_Value 5

How do we actually create that table definition?
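One possible answer, sketched in Python with the built-in sqlite3 module standing in for Oracle (an assumption made only so the example runs anywhere). The column types, the composite primary key and the helper name `authored` are guesses, not decisions from the meeting. Because Keyword_Value points at a different table depending on Authorship_Type, it probably cannot be declared as an ordinary foreign key; here it is a plain integer, and the type column says which table to look in.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Oracle (assumption).
conn = sqlite3.connect(":memory:")

conn.execute("""
    CREATE TABLE Authorships (
        People_id       INTEGER NOT NULL,  -- -> People table
        Authorship_Type TEXT    NOT NULL,  -- "keyword": Library, Sequence, ...
        Keyword_Value   INTEGER NOT NULL,  -- id in the table named by the type
        PRIMARY KEY (People_id, Authorship_Type, Keyword_Value)
    )
""")

# People_id 12 authored library number 5.
conn.execute("INSERT INTO Authorships VALUES (?, ?, ?)", (12, "Library", 5))


def authored(conn, people_id, authorship_type):
    """Return the ids of everything of one type authored by one person."""
    rows = conn.execute(
        "SELECT Keyword_Value FROM Authorships "
        "WHERE People_id = ? AND Authorship_Type = ? "
        "ORDER BY Keyword_Value",
        (people_id, authorship_type),
    )
    return [value for (value,) in rows]


libraries_by_12 = authored(conn, 12, "Library")
```

The price of this layout is that the database cannot check that library 5 really exists; that check would have to live in the loading scripts or in a trigger.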

Sequences

This is the first really complex idea we are trying to store. We came up with some interesting strategies last time, and I'd like to go over them a bit, as they will obviously influence other tables.

To summarize:

There are a number of sequence processing steps which take place to convert chromatographs to FASTA formatted (and therefore human readable) DNA sequence.

We'd like to store the parameters passed to the various processing applications, and we'd like our system to be inherently flexible, as many of the steps will change over time. This is what I'd like to talk about today.

chromatograph -> FASTA sequence -> enter finished sequence into the database

1. Are objects a way of capturing these data? Or will we be creating a type of "linked list" that tracks the different steps?

2. How do we actually call a program from Oracle? I'd like to at least try phred on some chromatographs to see how the process is done.

3. Does the database need to be "locked" while sequences are added?

Example of Phred_Parameter_File:

begin chem_list

"DP4%Ac{T3}" primer rhodamine

"DP4%Ac{T7}" primer rhodamine

. . .

"DP5%CEHV(KS)" primer rhodamine

"DP5%CEHV(SK)" primer rhodamine

"DP5%LR(KS)" primer rhodamine

"DP5%LR(SK)" primer rhodamine

"DP6%Ac{SP6}" primer rhodamine

. . .

"ET{-28m13rev}" primer energy-transfer

. . .

"DyeTerm{T7}-Set B" terminator rhodamine

. . .

end chem_list
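If the chemistry list itself is to be stored, it first has to be read. Below is a minimal Python sketch that pulls the quoted name and the two words after it out of a block like the one above. The function name `parse_chem_list` and the labels "chemistry" and "dye" for the second and third fields are my own; only the three fields shown above are read, and elided ". . ." lines are skipped.

```python
import shlex


def parse_chem_list(text):
    """Return (name, chemistry, dye) tuples from a chem_list block."""
    entries = []
    inside = False
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line == "begin chem_list":
            inside = True
        elif line == "end chem_list":
            inside = False
        elif inside and line.startswith('"'):
            # shlex honours the double quotes, so names containing
            # spaces (e.g. "DyeTerm{T7}-Set B") stay in one piece.
            fields = shlex.split(line)
            if len(fields) >= 3:
                entries.append(tuple(fields[:3]))
    return entries


sample = '''begin chem_list
"DP4%Ac{T3}" primer rhodamine
. . .
"ET{-28m13rev}" primer energy-transfer
"DyeTerm{T7}-Set B" terminator rhodamine
end chem_list
'''
chem_entries = parse_chem_list(sample)
```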

Old Sequence Table - this table is NOT what we're going to use….

Field_Name	Data_Type	Restrictions	Example
Sequence_id		Is key not null
Organism		-> taxa
Accession		-> accession
Clone		-> Clone_id

Chromatograph	Filename
Sequence_conversion_id	Integer	-> sequence_conversion
Raw_sequence	Long	Phred output
Quality		Phred output
Vector_screening_id	Integer	-> vector screening
Clean Sequence	Long
Last update:	Date

This table needs to be completely redone. One long per table, ability to represent ANY sequence. META-sequence information.

As the sequence conversion and vector stripping processes take place in batch, I think it would be easiest to break those pieces of sequence related information into separate tables:

Sequence_Conversion

Field_Name	Data_Type	Restrictions	Example
Conversion_id		Primary key
Conversion_program		Default phred
Conversion_version		Default 0.980904a
Conversion_person		-> people_id
Conversion_platform		Default NT
Trim	Boolean	Default 0

Trim is the only command line variable which alters the output of phred
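Since conversion runs in batch, one Sequence_Conversion row can describe a whole run, and each sequence from that run only needs to carry its Conversion_id. Here is a hedged sketch of that idea in Python, again with sqlite3 standing in for Oracle. The column types, the UNIQUE constraint and the helper `conversion_id_for` are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for Oracle (assumption)

conn.execute("""
    CREATE TABLE Sequence_Conversion (
        Conversion_id       INTEGER PRIMARY KEY,
        Conversion_program  TEXT    NOT NULL DEFAULT 'phred',
        Conversion_version  TEXT    NOT NULL DEFAULT '0.980904a',
        Conversion_person   INTEGER NOT NULL,           -- -> people_id
        Conversion_platform TEXT    NOT NULL DEFAULT 'NT',
        "Trim"              INTEGER NOT NULL DEFAULT 0, -- boolean as 0/1
        UNIQUE (Conversion_program, Conversion_version,
                Conversion_person, Conversion_platform, "Trim")
    )
""")


def conversion_id_for(conn, person, trim=0):
    """Return the id of the parameter row for this run, creating it if new.

    Program, version and platform are left at their table defaults.
    """
    conn.execute(
        'INSERT OR IGNORE INTO Sequence_Conversion (Conversion_person, "Trim") '
        "VALUES (?, ?)",
        (person, trim),
    )
    row = conn.execute(
        "SELECT Conversion_id FROM Sequence_Conversion "
        'WHERE Conversion_person = ? AND "Trim" = ? '
        "AND Conversion_program = 'phred' "
        "AND Conversion_version = '0.980904a' "
        "AND Conversion_platform = 'NT'",
        (person, trim),
    ).fetchone()
    return row[0]


first_batch = conversion_id_for(conn, person=12)
second_batch = conversion_id_for(conn, person=12)           # same settings
trimmed_batch = conversion_id_for(conn, person=12, trim=1)  # different settings
```

Two batches run with identical settings share one row, so the table stays small even when many sequences are loaded.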

Vector_Stripping

Field_Name	Data_Type	Restrictions	Example
Vector_stripping_id		Primary key
Vector_program		Default cross_match
Vector_version		Default 0.990319
Vector_platform		Default NT
Vector_person		-> people_id
Vector_database		-> vector_database_id
Penalty	Integer	Mismatch penalty
Gap_init	Integer	Gap initiation penalty
Gap_ext	Integer	Gap extension penalty
Ins_gap_ext	Integer	Insertion gap extension penalty
Del_gap_ext	Integer	Deletion gap extension penalty
Matrix	Varchar	Matrix instead of penalties	This isn't implemented in cross_match yet, but should be soon
Raw	Bitflag	Use raw SW scores instead of complexity adjusted
Minmatch	Default=14	Minimum length of word to begin SW comparison
Maxmatch	Default=30	Maximum word length
Max_group_size
…

This is getting a bit arduous to type in, and probably no one cares much beyond knowing that most of the values can be defaulted and that they need to be changed only in batch. The docs for cross_match (explaining the parameters) are here: http://bozeman.mbt.washington.edu/phrap.docs/phrap.html.

How do I handle a vector database? Well, it's just a number of vectors in a FASTA file, and we already have the vector table with a sequence field. Therefore….

Vector_database table

Field_Name	Data_Type	Restrictions	Example
Vector_database_id	Integer	Is key not null
Vector_id	Integer	-> vector table
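To hand such a database to cross_match, its vectors would have to be written back out as a FASTA file. The Python sketch below shows that step on plain in-memory data. The function name, the vector names and the sequences are made up, and the 50-column line width simply mirrors the sequence listings in these notes. Note that for one database to hold several vectors, the key would presumably have to be the (Vector_database_id, Vector_id) pair rather than Vector_database_id alone.

```python
def fasta_for_database(database_id, vector_database_rows, vectors, width=50):
    """Build FASTA text for every vector belonging to one vector database.

    vector_database_rows: (Vector_database_id, Vector_id) pairs.
    vectors: maps Vector_id to a (name, sequence) pair from the vector table.
    """
    lines = []
    for db_id, vector_id in vector_database_rows:
        if db_id != database_id:
            continue
        name, sequence = vectors[vector_id]
        lines.append(">" + name)
        # Wrap the sequence at a fixed width, one chunk per line.
        lines.extend(sequence[i:i + width]
                     for i in range(0, len(sequence), width))
    return "".join(line + "\n" for line in lines)


# Made-up example data: database 1 holds vectors 10 and 11.
rows = [(1, 10), (1, 11), (2, 12)]
vector_table = {
    10: ("vector_A", "ACGT" * 15),  # 60 bases, wraps onto two lines
    11: ("vector_B", "GGCC"),
    12: ("vector_C", "TTTT"),
}
fasta_text = fasta_for_database(1, rows, vector_table)
```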

In our dataflow model, we've just taken the chromatographs, run them through phred and cross_match and now we have cleaned sequences.

Points to note:

I haven't tried it yet, so I don't know how easy it is to call external programs.

Q: How do we, in practice, do all this data processing?

1) get sequences by ftp or cd (probably manually)

2) call phred

3) store phred output

4) make vector database

5) call cross_match

6) store cross_match output

As my data is stored in flat files now, it's easy: just a bunch of Perl scripts "wrapped" together by a "meta" Perl script. I'm assuming I just need to alter them to do the data handling for Oracle?
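The "meta" script idea can be sketched as follows, in Python rather than the Perl actually in use. Each step runs an external program and passes its output to a storage callback, which is where the Oracle loading would go. The helper names are hypothetical, and the commands below are harmless placeholders that just print text; the real phred and cross_match command lines are deliberately not guessed at here.

```python
import subprocess
import sys


def run_step(command):
    """Run one external program and return its standard output as text."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


def run_pipeline(steps, store):
    """Run (name, command) steps in order, handing each output to store()."""
    for name, command in steps:
        store(name, run_step(command))


# Placeholder commands standing in for the real programs (assumption):
# each one simply prints a line so that the sketch runs anywhere.
steps = [
    ("phred", [sys.executable, "-c", "print('raw sequence')"]),
    ("cross_match", [sys.executable, "-c", "print('screened sequence')"]),
]

collected = {}
run_pipeline(steps, collected.__setitem__)
```

Because `check=True` raises an error when a program exits with a failure code, a failed step stops the batch instead of storing partial output.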

Cross_match output and "trash" sequences

For the curious, here's what a cross_match "cleaned" sequence looks like:

>sequence_name 746 0 746 ABI

GGGGAGGGAAGGAGGCAGTTGAATAGGAAGACCAAACCGGGTGGAAAGTA

GATGGGCCCTAGGCGCGATCTAGATGTACTAACGAGATATAATTTTTATG

GATAAATAATTAACAGCCCAAATTTAATATATGATTGATTAGGAATCCAC

ATAACACATGATGCGTTCAACTTACAGGGAACGTGTCTTTACACCTATCA

TCAAACCCTAACACAGTAAAGATATTCAAATTCTTAAGAGCTAGTGAATT

GGGTAACAGCCTTTGTGCCTTCAGAGACGGCATGCTTAGCCAATTCACCA

GGAAGGACCAATCGAACAGCCGTCTGAATTTCCCGAGAAGTTATAGTAGG

CTTCTTCTCGTGCCGAATTCTTTGGATCCACTAGTGTCGACCTGCAGGCG

CGCXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

From the phred docs: (http://bozeman.mbt.washington.edu/phrap.docs/phred.html)

The FASTA header, as written by phred, contains the following fields:

>chromat_name 1323 15 548 ABI

where the chromatogram name immediately follows the header delimiter, which is ">", the first integer is the number of bases called by phred, the second integer is the number of bases trimmed off the beginning of the sequence, the third integer is the number of bases remaining following trimming, and the string describes the type of input file, which is either ABI or SCF.
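The header layout described above is simple enough to pull apart with a whitespace split. This is a small Python sketch; the class and field names are my own labels for the five fields, and it assumes the chromatogram name contains no spaces.

```python
from typing import NamedTuple


class PhredHeader(NamedTuple):
    name: str             # chromatogram name
    bases_called: int     # number of bases called by phred
    bases_trimmed: int    # bases trimmed off the beginning
    bases_remaining: int  # bases remaining after trimming
    file_type: str        # type of input file, ABI or SCF


def parse_phred_header(line):
    """Split a phred FASTA header line into its five fields."""
    if not line.startswith(">"):
        raise ValueError("not a FASTA header: %r" % line)
    name, called, trimmed, remaining, file_type = line[1:].split()
    return PhredHeader(name, int(called), int(trimmed), int(remaining), file_type)


header = parse_phred_header(">chromat_name 1323 15 548 ABI")
```

Stored this way, the three integers could go into their own columns rather than staying buried in the header text.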

As I generally do NOT let phred do the trimming, my values on the sequence header are always 0 and (slen). I've only received ABI files to date.

The X's in the sequence are the result of cross_match. Anything that gets SWAT'ed against vector with a high enough score is converted to an X. This is why I feel it necessary to store the un-cross_matched sequence as well as the processed sequence, so we can trace cross_match errors.

However, it's uninformative for the users of the database to have "trash" sequences in their input/output.

So, following TIGR's lead, any sequence which is ≥ 80% X is considered trash.

This is done (again) using Perl.
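The 80% rule amounts to a character count. Here is a minimal Python sketch of it (the real script is Perl; the function names are hypothetical, and treating an empty sequence as "not trash" is my own choice).

```python
TRASH_THRESHOLD = 0.80  # fraction of X at or above which a read is trash


def x_fraction(sequence):
    """Return the fraction of bases that cross_match masked with X."""
    bases = "".join(sequence.split()).upper()  # drop line breaks and spaces
    if not bases:
        return 0.0  # assumption: an empty sequence is not counted as trash
    return bases.count("X") / len(bases)


def is_trash(sequence, threshold=TRASH_THRESHOLD):
    """Apply the 80% rule to one screened sequence."""
    return x_fraction(sequence) >= threshold


mostly_vector = "XXXXXXXXAC"  # 8 of 10 bases masked
mostly_insert = "XXXACGTACG"  # 3 of 10 bases masked
```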

The same Perl script also makes a new clean sequence, with the X's removed along with the extra info on the header line. So my (fictitious) sequence from before looks like this after processing:

>sequence_name

GGGGAGGGAAGGAGGCAGTTGAATAGGAAGACCAAACCGGGTGGAAAGTA

GATGGGCCCTAGGCGCGATCTAGATGTACTAACGAGATATAATTTTTATG

GATAAATAATTAACAGCCCAAATTTAATATATGATTGATTAGGAATCCAC

ATAACACATGATGCGTTCAACTTACAGGGAACGTGTCTTTACACCTATCA

TCAAACCCTAACACAGTAAAGATATTCAAATTCTTAAGAGCTAGTGAATT

GGGTAACAGCCTTTGTGCCTTCAGAGACGGCATGCTTAGCCAATTCACCA

GGAAGGACCAATCGAACAGCCGTCTGAATTTCCCGAGAAGTTATAGTAGG

CTTCTTCTCGTGCCGAATTCTTTGGATCCACTAGTGTCGACCTGCAGGCG
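The cleaning step can be sketched as follows, in Python rather than the Perl actually used. The function name is hypothetical, and the sketch only does the two things described: it keeps the first word of the header and drops every X. Whether the real script also rewraps lines or trims anything else is not stated, so the 50-column rewrap here is an assumption.

```python
def clean_record(fasta_record, width=50):
    """Strip header extras and X's from one cross_match FASTA record."""
    lines = fasta_record.strip().splitlines()
    header, body = lines[0], lines[1:]
    name = header[1:].split()[0]  # keep only the sequence name
    # Join the sequence lines, drop whitespace, then remove masked bases.
    sequence = "".join("".join(body).split()).replace("X", "")
    wrapped = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "".join(line + "\n" for line in [">" + name] + wrapped)


record = ">sequence_name 10 0 10 ABI\nACGTX\n\nXXXXX\n"
cleaned = clean_record(record)
```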

Trash also screens sequences out if they are considered "odd". For example, if a sequence has a long string of XXXs in the middle, but none at the ends:

>sequence_name

GGGGAGGGAAGGAGGCAGTTGAATAGGAAGACCAAACCGGGTGGAAAGTA

GATGGGCCCTAGGCGCGATCTAGATGTACTAACGAGATATAATTTTTATG

GATAAATAATTAACAGCCCAAATTTAATATATGATTGATXXXXXXXXXXX

XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

XXXXXXXCTAACACAGTAAAGATATTCAAATTCTTAAGAGCTAGTGAATT

GGGTAACAGCCTTTGTGCCTTCAGAGACGGCATGCTTAGCCAATTCACCA

GGAAGGACCAATCGAACAGCCGTCTGAATTTCCCGAGAAGTTATAGTAGG

CTTCTTCTCGTGCCGAATTCTTTGGATCCACTAGTGTCGACCTGCAGGCG

That pattern doesn't make a lot of sense to me, so such a sequence is "set aside" for further evaluation.
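One way to flag the "X's in the middle, none at the ends" case automatically is shown in this Python sketch. The function name and the minimum run length of 20 are hypothetical; the notes do not say how long a run has to be before it counts as "long".

```python
import re


def has_internal_x_run(sequence, min_run=20):
    """True if a long run of X sits inside a read whose ends are unmasked."""
    bases = "".join(sequence.split()).upper()
    if bases.startswith("X") or bases.endswith("X"):
        return False  # masking at an end is the ordinary vector case
    # Neither end is X, so any long run found must lie in the middle.
    return re.search("X{%d,}" % min_run, bases) is not None


odd_read = "ACGT" + "X" * 25 + "ACGT"  # long masked stretch in the middle
normal_read = "ACGT" + "X" * 25        # masked stretch at the end
```

Reads flagged this way could be marked for manual review instead of being dropped.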

I use the unclean sequence, BLAST it against the non-redundant nucleotide database and look for matches.

How do we store BLAST?

How do we make contigs?

BLAST parameters (BLASTN, TBLASTX, etc.)

Assembler/Phrap/others….