SYNTOM Meeting Notes - ALM - July 2, 1999

 

Note to self:  when's the next meeting?

Last Time: (6/30/99)

 

Presented a number of tables, and made a lot of changes. Here's how my oversimplified ERM stood:

 

[ERM diagram: Divisions, Accessions, Taxa, People, Libraries, Vectors, Authorships, Sequences, Clone Synonyms, Clones, Genetic Codes, Taxa_Names, Merged_Nodes, Deleted_Nodes]

Tables with italicized names weren't discussed last time.

Tables with underlined names were presented by Dave Schneider.

Tables with bold italicized names need more discussion.


 

 

We ended last time's discussion at the People table.

 

As so many changes were made, here are the new table definitions:

 

Libraries

Field_Name           | Data_Type     | Restrictions          | Example
---------------------|---------------|-----------------------|----------------------------
Library_id           | Integer       | Not null, primary key | 327
Tissue               | Varchar2(256) |                       | Leaf
Accession            | Integer       | -> accession table    | 9 - TA496
Organism             | Integer       | -> taxa table         | 21 - L. esculentum
Developmental_Stage  | Varchar2(256) |                       | 4 week old
Treatment            | Varchar2(256) |                       | Pto inoculated
Date_created         | Date          |                       | May 11, 1999
Created_by           | Integer       | -> Lab table          | 3 - Martin
Host                 | Varchar2(256) |                       | SOLR
Vector               | Integer       | -> Vector table       | 7 - pBlueScript SK(-)
PFU                  | Integer       |                       | 10^10
Average_Insert_Size  | Integer       |                       | 350
Availability         | Boolean       |                       | TRUE
Type                 | Varchar2(256) |                       | mRNA
Library_Name         | Varchar2(256) |                       | Tomato susceptible, Cornell
Comments             | Varchar2(256) |                       | Blah blah blah
Distributed          | Integer       | -> Lab                |
Antibiotic           | Varchar2(n)   |                       |
5' Directional       |               |                       |
Restriction_Enzyme_1 |               |                       |
Restriction_Enzyme_2 |               |                       |

 

Vectors

Field_Name      | Data_Type     | Restrictions          | Example
----------------|---------------|-----------------------|-------------------
Vector_id       | Integer       | Not null, primary key | 7
Vector_name     | Varchar2(256) |                       | pBlueScript SK(-)
Vector_sequence | Long          |                       | ACTG…
Vector_Vendor   | Varchar2(256) |                       | Stratagene
Comments        |               |                       |

 

Clones - this table will be modified when we get to TMF's tables

Field_Name   | Data_Type | Restrictions          | Example
-------------|-----------|-----------------------|--------
Clone_id     | Integer   | Not null, primary key | 7
Library_id   | Integer   | -> library            | 13
Plate_number |           |                       |
Well_number  |           |                       |
Row_number   |           |                       |
Comments     |           |                       |

Clone_Synonyms

Field_Name       | Data_Type | Restrictions | Example
-----------------|-----------|--------------|--------
Other_Clone_Name |           |              |
Our_Clone_Name   |           |              |

 

 

People

Field_Name   | Data_Type | Restrictions          | Example
-------------|-----------|-----------------------|--------
People_id    |           | Not null, primary key |
Last_name    |           |                       |
First_name   |           |                       |
Mail         |           |                       |
Country      |           |                       |
Phone        |           |                       |
Fax          |           |                       |
e-mail       |           |                       |
Homepage     |           |                       |
Initials     |           |                       |
Project      |           |                       |
Group        |           | -> Labs               | Martin
Security     |           |                       |
Comments     |           |                       |
Last_updated |           |                       |

 

Security is highlighted because our discussion ended there last time.

 

The People table needs to be entered by hand.  We will not (as previously discussed) track everyone in GenBank.  Only those individuals who are pertinent to the database are going to be entered here -- generally people in the main Solanaceae labs.

 

It may be important to group people together in a lab (as shown in the Libraries table):

 

Labs

Field_Name   | Data_Type | Restrictions          | Example
-------------|-----------|-----------------------|--------
Group_id     | Integer   | Not null, primary key |
Group_name   | String    |                       | Martin
Group_leader | Integer   | -> People_id          |

 

 

 

These data also need to be entered by hand. 


 

Authorships

Field_Name   | Data_Type | Restrictions | Example
-------------|-----------|--------------|--------
People_id    |           |              |
Library_id   |           |              |
Reference_id |           |              |
Sequence_id  |           |              |
Others…      |           |              |
 

This allows multiple people to be part of a library, a sequence, a reference (if we choose to include them), etc.

 

So now it looks like it's time to start making the sequence tables.  As mentioned before (http://syntom.cit.cornell.edu/database/Jun1599.htm) sequences come from two primary sources:

 

1.        Raw sequencing data (chromatographs), generally produced outside of the Cornell Theory Center

2.        FASTA formatted files, generally from NCBI or internal sources

 

 

How are these data retrieved?

 

As a "recipe":

 

1.        Chromatographs are either downloaded via FTP or read off a CD-ROM.

2.        Once the files are stored locally, phred is run on the directory containing the files.

3.        Phred is the chromatograph reader, and its output is two-fold: a FASTA formatted sequence for each chromatograph as well as a space-delimited quality file.

4.        Cross_match is then run on the phred-derived files.  Cross_match removes vector contamination from sequences.  The program requires a vector database (which is FASTA formatted) to perform the screening.  Cross_match provides FASTA formatted sequences as output.
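The recipe above can be sketched as a small driver (shown here in Python; in practice these are Perl scripts).  The specific phred and cross_match flags (-id, -sa, -qa, -minmatch, -maxmatch, -screen) reflect typical usage of those tools and are assumptions, not taken from these notes; check them against the local installations.

```python
# Sketch of the chromatograph -> phred -> cross_match recipe.
# Flag names are typical phred/cross_match usage (an assumption here).

def phred_command(chromat_dir, seq_out, qual_out):
    """Run phred on a directory of chromatograph files, producing a
    FASTA sequence file and a space-delimited quality file."""
    return ["phred", "-id", chromat_dir, "-sa", seq_out, "-qa", qual_out]

def cross_match_command(seq_file, vector_db, minmatch=14, maxmatch=30):
    """Screen phred output against a FASTA vector database; with
    -screen, cross_match masks vector-matching bases with X's."""
    return ["cross_match", seq_file, vector_db,
            "-minmatch", str(minmatch), "-maxmatch", str(maxmatch),
            "-screen"]

# Each step would then be executed with subprocess.run(cmd, check=True).
```

The commands are built as argument lists rather than shell strings so the driver script can run them safely with subprocess.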

 

 


 

 

 

 


Some notes:

 

Phred takes no parameters other than those related to input and output naming and format.

I've already gone over the GenBank fields so I won't do that again.  But, the GenBank info is important and will be kept.

 

With those notes in mind:

 

Sequences - this table will get updated again when we talk about BLAST and contigs and cross_match output

Field_Name             | Data_Type | Restrictions           | Example
-----------------------|-----------|------------------------|--------
Sequence_id            |           | Not null, primary key  |
Organism               |           | -> taxa                |
Accession              |           | -> accession           |
Clone                  |           | -> Clone_id            |
Chromatograph          | Filename  |                        |
Sequence_conversion_id | Integer   | -> sequence_conversion |
Raw_sequence           | Long      | Phred output           |
Quality                |           | Phred output           |
Vector_screening_id    | Integer   | -> vector screening    |
Clean_Sequence         | Long      |                        |
Last_update            | Date      |                        |
 

As the sequence conversion and vector stripping processes take place in batch, I think it would be easiest to break those pieces of sequence related information into separate tables:

Sequence_Conversion

Field_Name          | Data_Type | Restrictions      | Example
--------------------|-----------|-------------------|--------
Conversion_id       |           | Primary key       |
Conversion_program  |           | Default phred     |
Conversion_version  |           | Default 0.980904a |
Conversion_person   |           | -> people_id      |
Conversion_platform |           | Default NT        |
Trim                | Boolean   | Default 0         |
 

 

Trim is the only command-line variable which alters the output of phred.


 

Vector_Stripping

Field_Name          | Data_Type  | Restrictions                                     | Example
--------------------|------------|--------------------------------------------------|--------
Vector_stripping_id |            | Primary key                                      |
Vector_program      |            | Default cross_match                              |
Vector_version      |            | Default 0.990319                                 |
Vector_platform     |            | Default NT                                       |
Vector_person       |            | -> people_id                                     |
Vector_database     |            | -> vector_database_id                            |
Penalty             | Integer    | Mismatch penalty                                 |
Gap_init            | Integer    | Gap initiation penalty                           |
Gap_ext             | Integer    | Gap extension penalty                            |
Ins_gap_ext         | Integer    | Insertion gap extension penalty                  |
Del_gap_ext         | Integer    | Deletion gap extension penalty                   |
Matrix              | Varchar    | Matrix instead of penalties                      | Not implemented yet in cross_match; should be soon, however
Raw                 | Bitflag    | Use raw SW scores instead of complexity-adjusted |
Minmatch            | Default=14 | Minimum length of word to begin SW comparison    |
Maxmatch            | Default=30 | Maximum word length                              |
Max_group_size      |            |                                                  |

 

This is getting a bit arduous to type in, and probably no one cares much beyond knowing that most of the values can be defaulted and need to be changed only in batch.  Docs for cross_match (explaining the parameters) are here: http://bozeman.mbt.washington.edu/phrap.docs/phrap.html.

 

 

How do I handle a vector database?  Well, it's just a number of vectors in a FASTA file.  Since we already have the Vectors table with a sequence field:
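Since the vector database is just the Vectors table written out as FASTA, generating it is a few lines of code.  A minimal sketch (in Python rather than Perl; the row layout and 50-base line wrap are illustrative assumptions):

```python
# Build the FASTA vector database that cross_match needs from rows of
# the Vectors table.  Row layout (id, name, sequence) is hypothetical.

def write_vector_database(vectors, path):
    """vectors: iterable of (vector_id, vector_name, vector_sequence)."""
    with open(path, "w") as fh:
        for vector_id, name, sequence in vectors:
            fh.write(">%s %s\n" % (vector_id, name))
            # Wrap the sequence at 50 bases per line, FASTA-style.
            for i in range(0, len(sequence), 50):
                fh.write(sequence[i:i + 50] + "\n")
```

The header carries both the vector_id and the vector_name so cross_match hits can be traced back to the Vectors table.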

 

Vector_database

Field_Name         | Data_Type | Restrictions          | Example
-------------------|-----------|-----------------------|--------
Vector_database_id | Integer   | Not null, primary key |
Vector_id          | Integer   | -> vector table       |

 

 

 

 

 

 

 

 

 

Now, where are we:

[Diagram: schema so far]

Not too bad.

 

In our dataflow model, we've just taken the chromatographs, run them through phred and cross_match and now we have cleaned sequences.

 

Points to note:

 

I haven't tried it yet, so I don't know how easy it is to call external programs.

 

Q:  How do we, in practice, do all this data processing?

 

1)       get sequences by ftp or cd (probably manually)

2)       call phred

3)       store phred output

4)       make vector database

5)       call cross_match

6)       store cross_match output

 

As my data is stored flat-file now, it's easy: just a bunch of Perl scripts "wrapped" together by a "meta" Perl script.  I'm assuming I just need to alter them to do the data handling for Oracle?

 

 

Cross_match output and "trash" sequences

 

For the curious, here's what a cross_match "cleaned" sequence looks like:

 

>sequence_name  746      0    746  ABI

GGGGAGGGAAGGAGGCAGTTGAATAGGAAGACCAAACCGGGTGGAAAGTA

GATGGGCCCTAGGCGCGATCTAGATGTACTAACGAGATATAATTTTTATG

GATAAATAATTAACAGCCCAAATTTAATATATGATTGATTAGGAATCCAC

ATAACACATGATGCGTTCAACTTACAGGGAACGTGTCTTTACACCTATCA

TCAAACCCTAACACAGTAAAGATATTCAAATTCTTAAGAGCTAGTGAATT

GGGTAACAGCCTTTGTGCCTTCAGAGACGGCATGCTTAGCCAATTCACCA

GGAAGGACCAATCGAACAGCCGTCTGAATTTCCCGAGAAGTTATAGTAGG

CTTCTTCTCGTGCCGAATTCTTTGGATCCACTAGTGTCGACCTGCAGGCG

CGCXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

 

 

From the phred docs:  (http://bozeman.mbt.washington.edu/phrap.docs/phred.html)

 

The FASTA header, as written by phred, contains the following fields:

 

>chromat_name 1323 15 548 ABI

 

where the chromatogram name immediately follows the header delimiter, which is ">", the first integer is the number of bases called by phred, the second integer is the number of bases trimmed off the beginning of the sequence, the third integer is the number of bases remaining following trimming, and the string describes the type of input file, which is either ABI or SCF.

 

As I generally do NOT let phred do the trimming, my values on the sequence header are always 0 and (slen).  I've only received ABI files to date.
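The header format quoted above is simple to parse.  A minimal sketch (function name is mine, not from the notes):

```python
# Parse the phred FASTA header described in the phred docs:
#   >chromat_name 1323 15 548 ABI
# fields: name, bases called, bases trimmed, bases remaining, file type.

def parse_phred_header(header):
    name, called, trimmed, remaining, ftype = header.lstrip(">").split()
    return name, int(called), int(trimmed), int(remaining), ftype
```

With trimming off (my usual setting), the parsed trimmed count is always 0 and remaining equals the sequence length.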

 

The X's in the sequence are the result of cross_match.  Anything that gets SWAT'ed against vector with a high enough score is converted to an X.  This is why I feel it necessary to store the un-cross_matched sequence as well as the processed sequence, so we can trace cross_match errors.

 

However, it's uninformative for the users of the database to have "trash" sequences in their input/output.

 

So, following TIGR's lead, any sequence which is ≥ 80% X is considered trash.

 

This is done (again) using Perl. 

 

The same Perl script also removes the X's from the files and makes a new clean sequence, with the X's removed as well as the extra info on the header line.  So my (fictitious) sequence from before looks like this after processing:

 

>sequence_name

GGGGAGGGAAGGAGGCAGTTGAATAGGAAGACCAAACCGGGTGGAAAGTA

GATGGGCCCTAGGCGCGATCTAGATGTACTAACGAGATATAATTTTTATG

GATAAATAATTAACAGCCCAAATTTAATATATGATTGATTAGGAATCCAC

ATAACACATGATGCGTTCAACTTACAGGGAACGTGTCTTTACACCTATCA

TCAAACCCTAACACAGTAAAGATATTCAAATTCTTAAGAGCTAGTGAATT

GGGTAACAGCCTTTGTGCCTTCAGAGACGGCATGCTTAGCCAATTCACCA

GGAAGGACCAATCGAACAGCCGTCTGAATTTCCCGAGAAGTTATAGTAGG

CTTCTTCTCGTGCCGAATTCTTTGGATCCACTAGTGTCGACCTGCAGGCG

 

 

Trash also screens sequences out if they are considered "odd".  For example, if a sequence has a long string of XXXs in the middle, but none at the ends:

 

>sequence_name

GGGGAGGGAAGGAGGCAGTTGAATAGGAAGACCAAACCGGGTGGAAAGTA

GATGGGCCCTAGGCGCGATCTAGATGTACTAACGAGATATAATTTTTATG

GATAAATAATTAACAGCCCAAATTTAATATATGATTGATXXXXXXXXXXX

XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

XXXXXXXCTAACACAGTAAAGATATTCAAATTCTTAAGAGCTAGTGAATT

GGGTAACAGCCTTTGTGCCTTCAGAGACGGCATGCTTAGCCAATTCACCA

GGAAGGACCAATCGAACAGCCGTCTGAATTTCCCGAGAAGTTATAGTAGG

CTTCTTCTCGTGCCGAATTCTTTGGATCCACTAGTGTCGACCTGCAGGCG

 

Since that doesn't make a lot of sense to me, such a sequence is "set aside" for further evaluation.
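The trash logic above can be sketched as follows (in Python rather than Perl).  The 80% cutoff is from these notes; the internal X-run length of 20 used to flag "odd" sequences is an illustrative assumption, as are the function and key names.

```python
# Sketch of the "trash" screening: a sequence >= 80% X is trash;
# otherwise terminal X runs are stripped to give the clean sequence
# plus its start/end coordinates, and a long internal X run flags the
# sequence to be set aside for manual evaluation.

def screen_sequence(seq, trash_cutoff=0.80, internal_run=20):
    if len(seq) and seq.count("X") / float(len(seq)) >= trash_cutoff:
        return {"trash": True}
    start = len(seq) - len(seq.lstrip("X"))   # leading X's stripped
    end = len(seq.rstrip("X"))                # position after last non-X
    core = seq[start:end]
    return {
        "trash": False,
        "set_aside": "X" * internal_run in core,  # "odd" internal run
        "sequence": core,
        "sequence_start": start + 1,              # 1-based bp coordinate
        "sequence_end": end,
    }
```

The start/end values here are what would populate the Sequence_start and Sequence_End fields proposed below.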

 

I use the unclean sequence, BLAST it against the non-redundant nucleotide database and look for matches. 
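For that check, the command line would look something like the following sketch (flags are standard NCBI legacy-BLAST usage, an assumption here, not from these notes):

```python
# Build a blastall command to search the unclean sequence against the
# non-redundant nucleotide database ("nr").

def blast_command(query_file, database="nr", program="blastn"):
    return ["blastall", "-p", program, "-d", database, "-i", query_file]
```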

 

So some new fields need to be added to Sequences, and here's where the question of multiple LONG columns per table comes into play (Oracle allows only one LONG column per table).


 

Sequences - this table will get updated again when we talk about BLAST and contigs

Field_Name              | Data_Type | Restrictions           | Example
------------------------|-----------|------------------------|--------
Sequence_id             |           | Not null, primary key  |
Organism                |           | -> taxa                |
Accession               |           | -> accession           |
Clone                   |           | -> Clone_id            |
Chromatograph           | Filename  |                        |
Sequence_conversion_id  | Integer   | -> sequence_conversion |
Raw_sequence            | Long      | Phred output           |
Quality                 |           | Phred output           |
Vector_screening_id     | Integer   | -> vector screening    |
Clean_Sequence          | Long      |                        |
Last_update             | Date      |                        |
Raw_Sequence_length     | Integer   | Phred output           |
Cleaned_sequence_length | Integer   | Trash output           |
Trash                   | Boolean   | Trash output           |
Sequence                | Long      | The cleaned, trash-free sequence |
Sequence_start          | Integer   | The bp of the raw sequence at which the clean, trash-free sequence begins |
Sequence_End            | Integer   | See Sequence_start     |

 

 

 

 

 

 

 

How do we store BLAST?

How do we make contigs?

BLAST parameters (BLASTN, TBLASTX, etc.)

Assembler/Phrap/others….