SYNTOM Meeting Notes - ALM - July 8, 1999

 

 

Last Time: (7/2/99)

 

 

 


The notable changes to the tables are presented here in bold

 


Clone_Synonyms

Field_Name            Data_Type   Restrictions      Example
Other_Clone_Name
Our_Clone_Name
Clone_Source                      -> Lab table

 

 

Labs

Field_Name        Data_Type   Restrictions       Example
Group_id          Integer     Not null, is key
Group_name        String                         Martin
Group_leader      Integer     -> People_id
Comments

 

 

 

Authorships

Field_Name           Data_Type   Restrictions   Example
People_id
Authorship_Type      "keyword"                  Library, sequence, etc
Keyword_Value        Integer     -> ???         3

So, if People_id 12 authored library number 5, the entries would be:

 

People_id          12
Authorship_Type    Library
Keyword_Value      5

 

How do we actually create that table definition?
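
One way would be a small DDL script run through Perl/DBI.  A minimal sketch, assuming DBD::Oracle; the connect string, account, and column sizes below are placeholders, not decisions:

use strict;
use DBI;

# Sketch only: connect string, account, and column sizes are placeholders.
my $dbh = DBI->connect('dbi:Oracle:syntom', 'scott', 'tiger',
                       { RaiseError => 1, AutoCommit => 1 });

$dbh->do(q{
    CREATE TABLE Authorships (
        People_id        INTEGER       NOT NULL,  -- -> People table
        Authorship_Type  VARCHAR2(32)  NOT NULL,  -- keyword: 'Library', 'Sequence', ...
        Keyword_Value    INTEGER       NOT NULL   -- id in whatever table the keyword names
    )
});

# The example above: People_id 12 authored library number 5.
$dbh->do('INSERT INTO Authorships (People_id, Authorship_Type, Keyword_Value)
          VALUES (?, ?, ?)', undef, 12, 'Library', 5);

$dbh->disconnect;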


Sequences

 

This is the first really complex idea which we are trying to store.  We came up with some interesting strategies last time, and I'd like to go over them a bit, as they will obviously influence other tables.

 

To summarize:

 

There are a number of sequence-processing steps which take place to convert chromatographs to FASTA-formatted (and therefore human-readable) DNA sequence.

 

We'd like to store the parameters passed to the various processing applications, and we'd like our system to be inherently flexible, as many of the steps will change over time.  This is what I'd like to talk about today.

 

chromatograph  ->  FASTA sequence  ->  enter finished sequence into the database

 

 


1. Are objects a way of capturing these data?  Or will we be creating a type of "linked list" that tracks the different steps?

2. How do we actually call a program from Oracle?  I'd like to at least try phred on some chromatographs to see how the process is done (see the sketch after this list).

3. Does the database need to be "locked" while sequences are added?
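
For question 2, one low-tech option is to call nothing from inside Oracle at all and keep a wrapper script outside the database.  A minimal sketch for trying phred on a directory of chromatographs (directory and file names are placeholders; -id reads every chromatograph in a directory, -sa appends the called bases as FASTA, -qa appends the quality values):

use strict;

# Placeholders: a directory of chromatograph files and two output files.
my $chromat_dir = 'chromat_dir';
my $seq_file    = 'reads.fasta';
my $qual_file   = 'reads.fasta.qual';

system('phred', '-id', $chromat_dir,
                '-sa', $seq_file,
                '-qa', $qual_file) == 0
    or die "phred failed: $?";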

 

Example of Phred_Parameter_File:

 

begin chem_list

 

"DP4%Ac{T3}"                    primer          rhodamine

"DP4%Ac{T7}"                    primer          rhodamine

. . .

"DP5%CEHV(KS)"                  primer          rhodamine

"DP5%CEHV(SK)"                  primer          rhodamine

"DP5%LR(KS)"                    primer          rhodamine

"DP5%LR(SK)"                    primer          rhodamine

"DP6%Ac{SP6}"                   primer          rhodamine

. . .

 

"ET{-28m13rev}"                 primer          energy-transfer

 

. . .

 

"DyeTerm{T7}-Set B"             terminator      rhodamine

 

. . .

 

end chem_list

 

Old Sequence Table - this table is NOT what we're going to use….

 

Field_Name               Data_Type   Restrictions             Example
Sequence_id                          Is key not null
Organism                             -> taxa
Accession                            -> accession
Clone                                -> Clone_id
Chromatograph            Filename
Sequence_conversion_id   Integer     -> sequence_conversion
Raw_sequence             Long        Phred output
Quality                              Phred output
Vector_screening_id      Integer     -> vector screening
Clean Sequence           Long
Last update              Date

This table needs to be completely redone.  One LONG column per table (Oracle allows only one), the ability to represent ANY sequence, META-sequence information.

 

As the sequence conversion and vector stripping processes take place in batch, I think it would be easiest to break those pieces of sequence related information into separate tables:

Sequence_Conversion

 

Field_Name            Data_Type   Restrictions        Example
Conversion_id                     Primary key
Conversion_program                Default phred
Conversion_version                Default 0.980904a
Conversion_person                 -> people_id
Conversion_platform               Default NT
Trim                  Boolean     Default 0

 

Trim is the only command-line variable which alters the output of phred.

Vector_Stripping

 

Field_Name            Data_Type    Restrictions                                       Example
Vector_stripping_id                Primary key
Vector_program                     Default cross_match
Vector_version                     Default 0.990319
Vector_platform                    Default NT
Vector_person                      -> people_id
Vector_database                    -> vector_database_id
Penalty               Integer      Mismatch penalty
Gap_init              Integer      Gap initiation penalty
Gap_ext               Integer      Gap extension penalty
Ins_gap_ext           Integer      Insertion gap extension penalty
Del_gap_ext           Integer      Deletion gap extension penalty
Matrix                Varchar      Matrix instead of penalties                        This isn't implemented yet in cross_match; should be soon, however
Raw                   Bitflag      Use raw SW scores instead of complexity adjusted
Minmatch              Default=14   Minimum length of word to begin SW comparison
Maxmatch              Default=30   Maximum word length
Max_group_size

This is getting a bit arduous to type in, and no one probably cares too much other than to know that most of the values can be defaulted and would only ever be changed in batch.  The docs for cross_match (explaining the parameters) are here: http://bozeman.mbt.washington.edu/phrap.docs/phrap.html.
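
For concreteness, a typical vector-screening run from a wrapper script might look like the sketch below.  File names are placeholders; the option names are the documented cross_match ones from the table above, and -screen is what produces the X'ed-out copy shown later:

use strict;

# Sketch: screen the phred output against the vector database.  -screen writes
# reads.fasta.screen with vector-matching bases replaced by X's; the numeric
# values are just the defaults listed in the table above.
system('cross_match', 'reads.fasta', 'vector_db.fasta',
       '-minmatch', 14, '-maxmatch', 30,
       '-screen') == 0
    or die "cross_match failed: $?";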

 

 

How do I handle a vector database?  Well, it's just a number of vectors in a FASTA file, and we already have the vector table with a sequence field, therefore….

 

Vector_database table

 

Field_Name             Data_Type   Restrictions      Example
Vector_database_id     Integer     Is key not null
Vector_id              Integer     -> vector table
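
Writing the actual FASTA file for cross_match could then be a small Perl/DBI dump of those two tables.  A sketch; the Vectors table and its Vector_name/Sequence columns are hypothetical stand-ins for whatever the vector table really uses:

use strict;
use DBI;

my $dbh = DBI->connect('dbi:Oracle:syntom', 'scott', 'tiger', { RaiseError => 1 });
$dbh->{LongReadLen} = 1_000_000;   # in case the vector sequence is stored as a LONG

# Hypothetical names: Vectors, Vector_name, Sequence.
my $sth = $dbh->prepare(q{
    SELECT v.Vector_name, v.Sequence
      FROM Vector_database vd, Vectors v
     WHERE vd.Vector_database_id = ?
       AND vd.Vector_id = v.Vector_id
});
$sth->execute(1);   # vector database number 1, say

open my $out, '>', 'vector_db.fasta' or die "can't write vector_db.fasta: $!";
while (my ($name, $seq) = $sth->fetchrow_array) {
    print $out ">$name\n";
    # wrap at 50 bases per line, like the sequence examples below
    for (my $i = 0; $i < length $seq; $i += 50) {
        print $out substr($seq, $i, 50), "\n";
    }
}
close $out;
$dbh->disconnect;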

 

 

 

 

 

 

 

 

In our dataflow model, we've just taken the chromatographs, run them through phred and cross_match and now we have cleaned sequences.

 

Points to note:

 

I haven't tried it yet, so I don't know how easy it is to call external programs.

 

Q:  How do we, in practice, do all this data processing?

 

1) get sequences by ftp or cd (probably manually)

2) call phred

3) store phred output

4) make vector database

5) call cross_match

6) store cross_match output

 

As my data is stored in flat files now, it's easy.  Just a bunch of Perl scripts "wrapped" together by a "meta" Perl script.  I'm assuming I just need to alter them to do the data handling for Oracle?
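
If so, the "meta" script mostly just grows a DBI handle.  A rough sketch of steps 2 through 6, with the phred and cross_match calls as in the earlier sketches; the Sequences table and columns named in the INSERT are placeholders, not the final schema:

use strict;
use DBI;

# Steps 2 and 5: the external programs (step 4, the vector database, is the
# FASTA dump sketched earlier; step 1 stays manual).
system('phred', '-id', 'chromat_dir', '-sa', 'reads.fasta', '-qa', 'reads.fasta.qual') == 0
    or die "phred failed: $?";
system('cross_match', 'reads.fasta', 'vector_db.fasta', '-minmatch', 14, '-screen') == 0
    or die "cross_match failed: $?";

# Steps 3 and 6: store the raw (phred) and screened (cross_match) sequence for
# each read.  Table and column names are placeholders.
my $dbh = DBI->connect('dbi:Oracle:syntom', 'scott', 'tiger', { RaiseError => 1 });
my $ins = $dbh->prepare(q{
    INSERT INTO Sequences (Sequence_name, Raw_sequence, Clean_sequence)
    VALUES (?, ?, ?)
});

my %raw   = read_fasta('reads.fasta');
my %clean = read_fasta('reads.fasta.screen');
$ins->execute($_, $raw{$_}, $clean{$_}) for keys %raw;
$dbh->disconnect;

# Tiny FASTA reader: returns (name => sequence) pairs.
sub read_fasta {
    my ($file) = @_;
    my (%seq, $name);
    open my $fh, '<', $file or die "can't read $file: $!";
    while (<$fh>) {
        chomp;
        if (/^>(\S+)/) { $name = $1 }
        elsif (defined $name) { $seq{$name} .= $_ }
    }
    close $fh;
    return %seq;
}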

 

 

Cross_match output and "trash" sequences

 

For the curious, here's what a cross_match "cleaned" sequence looks like:

 

>sequence_name  746      0    746  ABI

GGGGAGGGAAGGAGGCAGTTGAATAGGAAGACCAAACCGGGTGGAAAGTA

GATGGGCCCTAGGCGCGATCTAGATGTACTAACGAGATATAATTTTTATG

GATAAATAATTAACAGCCCAAATTTAATATATGATTGATTAGGAATCCAC

ATAACACATGATGCGTTCAACTTACAGGGAACGTGTCTTTACACCTATCA

TCAAACCCTAACACAGTAAAGATATTCAAATTCTTAAGAGCTAGTGAATT

GGGTAACAGCCTTTGTGCCTTCAGAGACGGCATGCTTAGCCAATTCACCA

GGAAGGACCAATCGAACAGCCGTCTGAATTTCCCGAGAAGTTATAGTAGG

CTTCTTCTCGTGCCGAATTCTTTGGATCCACTAGTGTCGACCTGCAGGCG

CGCXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

 

 

From the phred docs:  (http://bozeman.mbt.washington.edu/phrap.docs/phred.html)

 

The FASTA header, as written by phred, contains the following fields:

 

>chromat_name 1323 15 548 ABI

 

where the chromatogram name immediately follows the header delimiter, which is ">", the first integer is the number of bases called by phred, the second integer is the number of bases trimmed off the beginning of the sequence, the third integer is the number of bases remaining following trimming, and the string describes the type of input file, which is either ABI or SCF.

 

As I generally do NOT let phred do the trimming, my values on the sequence header are always 0 and (slen).  I've only received ABI files to date.
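
Those four fields are easy to pull apart in a wrapper script; a small sketch using the example header from the docs:

# Sketch: split a phred FASTA header into its documented fields.
my $header = '>chromat_name 1323 15 548 ABI';
if ($header =~ /^>(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(ABI|SCF)/) {
    my ($name, $called, $trimmed, $remaining, $type) = ($1, $2, $3, $4, $5);
    print "$name: $called bases called, $trimmed trimmed, $remaining remaining ($type)\n";
}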

 

The X's in the sequence are the result of cross_match.  Anything that gets SWAT'ed against vector with a high enough score is converted to an X.  This is why I feel it necessary to store the un-cross_matched sequence as well as the processed sequence, so we can trace cross_match errors.

 

However, it's uninformative for the users of the database to have "trash" sequences in their input/output.

 

So, following TIGR's lead, any sequence which is ≥ 80% X is considered trash.

 

This is done (again) using Perl. 

 

The same Perl script also removes the X's from the files, making a new clean sequence with the X's stripped out and the extra info removed from the header line.  So my (fictitious) sequence from before looks like this after processing:

 

>sequence_name

GGGGAGGGAAGGAGGCAGTTGAATAGGAAGACCAAACCGGGTGGAAAGTA

GATGGGCCCTAGGCGCGATCTAGATGTACTAACGAGATATAATTTTTATG

GATAAATAATTAACAGCCCAAATTTAATATATGATTGATTAGGAATCCAC

ATAACACATGATGCGTTCAACTTACAGGGAACGTGTCTTTACACCTATCA

TCAAACCCTAACACAGTAAAGATATTCAAATTCTTAAGAGCTAGTGAATT

GGGTAACAGCCTTTGTGCCTTCAGAGACGGCATGCTTAGCCAATTCACCA

GGAAGGACCAATCGAACAGCCGTCTGAATTTCCCGAGAAGTTATAGTAGG

CTTCTTCTCGTGCCGAATTCTTTGGATCCACTAGTGTCGACCTGCAGGCG
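
A minimal sketch of that Perl step, covering just the ≥ 80% rule and the X/header cleanup shown above (the file plumbing and the handling of "odd" reads described below are left out):

# Sketch: apply the trash rule to one screened read and, if it survives,
# return a cleaned FASTA record (X's and extra header fields removed).
sub clean_screened {
    my ($header, $seq) = @_;                    # $seq: screened sequence, newlines removed

    my $x = ($seq =~ tr/Xx//);                  # count the X's
    return undef if !length($seq) || $x / length($seq) >= 0.80;   # trash

    (my $name  = $header) =~ s/^>(\S+).*/>$1/;  # keep only ">sequence_name"
    (my $clean = $seq)    =~ s/[Xx]//g;         # strip the X's
    return "$name\n$clean\n";                   # (line wrapping left out)
}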

 

 

Trash also screens out sequences if they are considered "odd".  For example, if a sequence has a long string of X's in the middle, but none at the ends:

 

>sequence_name

GGGGAGGGAAGGAGGCAGTTGAATAGGAAGACCAAACCGGGTGGAAAGTA

GATGGGCCCTAGGCGCGATCTAGATGTACTAACGAGATATAATTTTTATG

GATAAATAATTAACAGCCCAAATTTAATATATGATTGATXXXXXXXXXXX

XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

XXXXXXXCTAACACAGTAAAGATATTCAAATTCTTAAGAGCTAGTGAATT

GGGTAACAGCCTTTGTGCCTTCAGAGACGGCATGCTTAGCCAATTCACCA

GGAAGGACCAATCGAACAGCCGTCTGAATTTCCCGAGAAGTTATAGTAGG

CTTCTTCTCGTGCCGAATTCTTTGGATCCACTAGTGTCGACCTGCAGGCG

 

Since that doesn't make a lot of sense to me, the sequence is "set aside" for further evaluation.

 

I use the unclean sequence, BLAST it against the non-redundant nucleotide database and look for matches. 
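
That check is also just an external call from the wrapper; a sketch assuming the NCBI blastall front end with a locally installed nucleotide database (the database name 'nt' and the file names are placeholders):

use strict;

# Sketch: BLAST one "odd", unclean read against the nucleotide database.
system('blastall', '-p', 'blastn', '-d', 'nt',
       '-i', 'odd_read.fasta', '-o', 'odd_read.blastn') == 0
    or die "blastall failed: $?";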

 

How do we store BLAST?

How do we make contigs?

BLAST parameters (BLASTN, TBLASTX, etc.)

Assembler/Phrap/others….