Note to self: when's the next meeting?
Presented a number of tables, and made a lot of changes. Here's how my oversimplified ERM stood:
Divisions Accessions Taxa People Libraries Vectors Authorships Sequences Clone Synonyms Clones
Genetic Codes Taxa_Names Merged_Nodes Deleted_Nodes
The tables with Italicized names weren't discussed last time
The tables with underlined names were presented by Dave Schneider
The tables with bold italicized tables need more discussion
We ended the discussion after discussing the People table.
As so many changes were made, here are the new table definitions:
Field_Name |
Data_Type |
Restrictions |
Example |
Library_id |
Integer |
Not null, primary key |
327 |
Tissue |
Varchar2(256) |
|
Leaf |
Accession |
Integer |
-> accession table |
9 - TA496 |
Organism |
Integer |
-> taxa table |
21 - L. esculentum |
Developmental_Stage |
Varchar2(256) |
|
4 week old |
Treatment |
Varchar2(256) |
|
Pto innoculated |
Date_created |
Date |
|
May 11, 1999 |
Created_by |
Integer |
-> Lab table |
3 - Martin |
Host |
Varchar2(256) |
|
SOLR |
Vector |
Integer |
-> Vector Table |
7 - pBlueScript SK(-) |
PFU |
Integer |
|
1010 |
Average_Insert_Size |
Integer |
|
350 |
Availability |
Boolean |
|
TRUE |
Type |
Varchar(256) |
|
mRNA |
Library_Name |
Varchar2(256) |
|
Tomato susceptible, Cornell |
Comments |
Varchar(256) |
|
Blah blah blah |
Distributed |
Integer |
-> Lab |
|
Antibiotic |
Varchar(n) |
|
|
5' Directional |
|
|
|
Restriction_Enzyme_1 |
|
|
|
Restriction_Enzyme_2 |
|
|
|
Field_Name |
Data_Type |
Restrictions |
Example |
Vector_id |
Integer |
Not null, primary key |
7 |
Vector_name |
Varchar2(256) |
|
pBlueScript SK(-) |
Vector_sequence |
Long |
|
ACTG…. |
Vector_Vendor |
VarChar(256) |
|
Stratagene |
Comments |
|
|
|
Field_Name |
Data_Type |
Restrictions |
Example |
Clone_id |
Integer |
Not null, primary key |
7 |
Library_id |
Integer |
-> library |
13 |
Plate_number |
|
|
|
Well_number |
|
|
|
Row_number |
|
|
|
Comments |
|
|
|
Field_Name |
Data_Type |
Restrictions |
Example |
Other_Clone_Name |
|
|
|
Our_Clone_Name |
|
|
|
Field_Name |
Data_Type |
Restrictions |
Example |
People_id |
|
Not null is key |
|
Last_name |
|
|
|
First_name |
|
|
|
Mail |
|
|
|
Country |
|
|
|
Phone |
|
|
|
Fax |
|
|
|
|
|
|
|
Homepage |
|
|
|
Initials |
|
|
|
Project |
|
|
|
Group |
|
=> Labs |
Martin |
Security |
|
|
|
Comments |
|
|
|
Last_updated |
|
|
|
Security is highlighted because our discussion ended there last time.
The people table needs to be entered by hand. We will not (as previously discussed) track everyone in GenBank. Only those individuals who are pertinent to the database are going to be entered here -- generally people in the main solanaceae labs.
It may be important to group people together in a lab (as shown in the Libraries table):
Field_Name |
Data_Type |
Restrictions |
Example |
Group_id |
Integer |
Not null is key |
|
Group_name |
String |
|
Martin |
Group_leader |
Integer |
-> People_id |
|
|
|
|
|
These data also need to be entered by hand.
Field_Name |
Data_Type |
Restrictions |
Example |
People_id |
|
|
|
Library_id |
|
|
|
Reference_id |
|
|
|
Sequence_id |
|
|
|
Others… |
|
|
|
This allows multiple people to be part of a library, a sequence, a reference (if we choose to include them), etc.
So now it looks like it's time to start making the sequence tables. As mentioned before (http://syntom.cit.cornell.edu/database/Jun1599.htm) sequences come from two primary sources:
1 Raw sequencing data (chromatographs) generally produced outside of Cornell Theory Center
2 FASTA formatted file - generally NCBI or internally
How are these data retrieved:
As a "recipe":
1. Chromatographs are either downloaded via FTP or read off a CD-ROM
2. Once the files are stored locally, phred is run on the directory containing the files
3. Phred is the chromatograph reader and its output is two-fold, a FASTA formatted sequence for each chromatograph as well as a space delimited quality file.
4. Cross_match is then run on the phred-derived files. Cross match removes vector contamination from sequences. The program requires a vector database (which is FASTA formatted) to perform the screening. Cross_match provides FASTA formatted sequences as output.
Some notes:
Phred takes no parameters other than those related to input and output naming and format.
I've already gone over the GenBank fields so I won't do that again. But, the GenBank info is important and will be kept.
With those notes in mind:
Field_Name |
Data_Type |
Restrictions |
Example |
Sequence_id |
|
Is key not null |
|
Organism |
|
-> taxa |
|
Accession |
|
-> accession |
|
Clone |
|
-> Clone_id |
|
|
|
|
|
Chromatograph |
Filename |
|
|
Sequence_conversion_id |
Integer |
-> sequence_conversion |
|
Raw_sequence |
Long |
Phred output |
|
Quality |
|
Phred output |
|
Vector_screening_id |
Integer |
-> vector screening |
|
Clean Sequence |
Long |
|
|
Last update: |
Date |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
As the sequence conversion and vector stripping processes take place in batch, I think it would be easiest to break those pieces of sequence related information into separate tables:
Field_Name |
Data_Type |
Restrictions |
Example |
Conversion_id |
|
Primary key |
|
Conversion_program |
|
Default phred |
|
Conversion_version |
|
Default 0.980904a |
|
Conversion_person |
|
-> people_id |
|
Conversion_platform |
|
Default NT |
|
Trim |
Boolean |
Default 0 |
|
Trim is the only command line variable which alters the output of phred
Field_Name |
Data_Type |
Restrictions |
Example |
Vector_stripping_id |
|
Primary key |
|
Vector_program |
|
Default cross_match |
|
Vector_version |
|
Default 0.990319 |
|
Vector_platform |
|
Default NT |
|
Vector_person |
|
-> people_id |
|
Vector_database |
|
-> vector_database_id |
|
Penalty |
Integer |
Mismatch penalty |
|
Gap_init |
Integer |
Gap initiation penalty |
|
Gap_ext |
Integer |
Gap extension penalty |
|
Ins_gap_ext |
Intger |
Insertion gap extension penalty |
|
Del_gap_ext |
Integer |
Deletion gap extension penalty |
|
Matrix |
Varchar |
Matrix instread of penalties |
This isn't implemented yet in cross_match should be soon however |
Raw |
Bitflag |
Use raw SW scores instead of complexity adjusted |
|
Minmatch |
Default=14 |
Minimum length of word to begin SW comparison |
|
Maxmatch |
Default=30 |
Maximum word length |
|
Max_group_size |
|
|
|
… |
|
|
|
This is getting a bit arduous to type in, and no one probably cares too much other than to know that the most of the values can be defaulted, this needs to be changed only in batch. Docs for cross_match (explaining the parameters) is here: http://bozeman.mbt.washington.edu/phrap.docs/phrap.html.
How do I handle a vector database? Well, it's just a number of vectors in a fasta file. So, as we already have the vector table with a sequence field, therefore….
Field_Name |
Data_Type |
Restrictions |
Example |
Vector_database_id |
Integer |
Is key not null |
|
Vector_id |
Integer |
-> vector table |
|
|
|
|
|
Now, where are we:
Not too bad.
In our dataflow model, we've just taken the chromatographs, run them through phred and cross_match and now we have cleaned sequences.
I haven't tried it yet, so I don't know how easy it is to call external programs.
Q: How do we, in practice, do all this data processing?
1) get sequences by ftp or cd (probably manually)
2) call phred
3) store phred output
4) make vector database
5) call cross_match
6) store cross_match output
As my data is stored flat-file now, it's easy. Just a bunch of perl scripts "wrapped" together by a "meta" perl script. I'm assuming I just need to alter them to do the data handling for Oracle?
For the curious, here's what cross_match "cleaned" sequence looks like
>sequence_name 746
0 746 ABI
GGGGAGGGAAGGAGGCAGTTGAATAGGAAGACCAAACCGGGTGGAAAGTA
GATGGGCCCTAGGCGCGATCTAGATGTACTAACGAGATATAATTTTTATG
GATAAATAATTAACAGCCCAAATTTAATATATGATTGATTAGGAATCCAC
ATAACACATGATGCGTTCAACTTACAGGGAACGTGTCTTTACACCTATCA
TCAAACCCTAACACAGTAAAGATATTCAAATTCTTAAGAGCTAGTGAATT
GGGTAACAGCCTTTGTGCCTTCAGAGACGGCATGCTTAGCCAATTCACCA
GGAAGGACCAATCGAACAGCCGTCTGAATTTCCCGAGAAGTTATAGTAGG
CTTCTTCTCGTGCCGAATTCTTTGGATCCACTAGTGTCGACCTGCAGGCG
CGCXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
From the phred docs: (http://bozeman.mbt.washington.edu/phrap.docs/phred.html)
The
FASTA header, as written by phred, contains the following fields:
>chromat_name
1323 15 548 ABI
where the chromatogram name immediately follows the header delimiter, which is ">", the first integer is the number of bases called by phred, the second integer is the number of bases trimmed off the beginning of the sequence, the third integer is the number of bases remaining following trimming, and the string describes the type of input file, which is either ABI or SCF.
As I generally do NOT let phred do the trimming, my values on the sequence header are always 0 and (slen). I've only received ABI files to date.
The X's in the sequence are the result of cross_match. Anything that gets SWAT'ed against vector with a high enough score is converted to an X. This is why I feel it necessary to store the un-cross_matched seqeunce as well as the processed sequence, so we can trace cross_match errors.
However, it's uninformative for the users of the database to have "trash" sequences in their input/ouput.
So, following TIGR's lead. Any sequence which is ³ 80% X is considered trash.
This is done (again) using Perl.
The same Perl script also removes the X's from the files and makes a new clean sequence which has the X's removed as well as the extra info on the header line. So my (fictitious) sequence from before looks like this after processing:
>sequence_name
GGGGAGGGAAGGAGGCAGTTGAATAGGAAGACCAAACCGGGTGGAAAGTA
GATGGGCCCTAGGCGCGATCTAGATGTACTAACGAGATATAATTTTTATG
GATAAATAATTAACAGCCCAAATTTAATATATGATTGATTAGGAATCCAC
ATAACACATGATGCGTTCAACTTACAGGGAACGTGTCTTTACACCTATCA
TCAAACCCTAACACAGTAAAGATATTCAAATTCTTAAGAGCTAGTGAATT
GGGTAACAGCCTTTGTGCCTTCAGAGACGGCATGCTTAGCCAATTCACCA
GGAAGGACCAATCGAACAGCCGTCTGAATTTCCCGAGAAGTTATAGTAGG
CTTCTTCTCGTGCCGAATTCTTTGGATCCACTAGTGTCGACCTGCAGGCG
Trash also screens sequences out if they are considered "odd". For example, if a sequence has a long string of XXXs in the middle, but none at the ends:
>sequence_name
GGGGAGGGAAGGAGGCAGTTGAATAGGAAGACCAAACCGGGTGGAAAGTA
GATGGGCCCTAGGCGCGATCTAGATGTACTAACGAGATATAATTTTTATG
GATAAATAATTAACAGCCCAAATTTAATATATGATTGATXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXCTAACACAGTAAAGATATTCAAATTCTTAAGAGCTAGTGAATT
GGGTAACAGCCTTTGTGCCTTCAGAGACGGCATGCTTAGCCAATTCACCA
GGAAGGACCAATCGAACAGCCGTCTGAATTTCCCGAGAAGTTATAGTAGG
CTTCTTCTCGTGCCGAATTCTTTGGATCCACTAGTGTCGACCTGCAGGCG
Which doesn't make a lot of sense to me, then that seqeunce is "set aside" for further evaluation.
I use the unclean sequence, BLAST it against the non-redundant nucleotide database and look for matches.
So some new fields need to be added to sequence, and here's where the question of multiple longs per table comes into play.
Field_Name |
Data_Type |
Restrictions |
Example |
Sequence_id |
|
Is key not null |
|
Organism |
|
-> taxa |
|
Accession |
|
-> accession |
|
Clone |
|
-> Clone_id |
|
|
|
|
|
Chromatograph |
Filename |
|
|
Sequence_conversion_id |
Integer |
->
sequence_conversion |
|
Raw_sequence |
Long |
Phred output |
|
Quality |
|
Phred output |
|
Vector_screening_id |
Integer |
-> vector
screening |
|
Clean Sequence |
Long |
|
|
Last update: |
Date |
|
|
Raw_Sequence_length |
Integer |
Phred output |
|
Cleaned_sequence_length |
Integer |
Trash ouput |
|
Trash |
Boolean |
Trash output |
|
Sequence |
Long |
The cleaned, trash-free sequence |
|
Sequence_start |
Integer |
The bp of raw_seqeuence_length that begins the clean, trash-free sequence |
|
Sequence_End |
Integer |
See seqeunce_start |
|
|
|
|
|
How do we store BLAST?
How do we make contigs?
BLAST parameters (BLASTN, TBLASTX, etc.)
Assembler/Phrap/others….