The notable changes to the tables are presented here in bold
| Field_Name | Data_Type | Restrictions | Example |
| Other_Clone_Name | | | |
| Our_Clone_Name | | | |
| Clone_Source | -> Lab table | | |

| Field_Name | Data_Type | Restrictions | Example |
| Group_id | Integer | Not null is key | |
| Group_name | String | | Martin |
| Group_leader | Integer | -> People_id | |
| Comments | | | |

| Field_Name | Data_Type | Restrictions | Example |
| People_id | | | |
| Authorship_Type | "keyword" | | Library, sequence, etc |
| Keyword_Value | Integer | -> ??? | 3 |
So, if People_id 12 authored library number 5, the entries would be:
People_id 12
Authorship_Type Library
Keyword_Value 5
How do we actually create that table definition?
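One possible answer, as a minimal sketch only: the example uses Python with SQLite standing in for Oracle, and the table name `authorship`, the allowed keyword list, and the helper names are assumptions made for illustration. Because Keyword_Value points at a different table depending on Authorship_Type (the "-> ???" above), it is declared as a plain integer here; the database cannot enforce that reference as an ordinary foreign key, so the application would have to.

```python
import sqlite3

# Hypothetical DDL: SQLite stands in for Oracle; column names follow the table above.
DDL = """
CREATE TABLE authorship (
    people_id       INTEGER NOT NULL,
    authorship_type TEXT    NOT NULL
        CHECK (authorship_type IN ('Library', 'Sequence')),  -- assumed keyword list
    keyword_value   INTEGER NOT NULL,  -- id in the table named by authorship_type
    PRIMARY KEY (people_id, authorship_type, keyword_value)
);
"""


def make_authorship_db():
    """Create an in-memory database holding only the authorship table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(DDL)
    return conn


def authored(conn, people_id):
    """Return (authorship_type, keyword_value) pairs for one person."""
    rows = conn.execute(
        "SELECT authorship_type, keyword_value FROM authorship "
        "WHERE people_id = ? ORDER BY authorship_type, keyword_value",
        (people_id,),
    )
    return rows.fetchall()


conn = make_authorship_db()
# People_id 12 authored library number 5.
conn.execute("INSERT INTO authorship VALUES (12, 'Library', 5)")
print(authored(conn, 12))
```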
Sequence processing is the first really complex idea which we are trying to store. We came up with some interesting strategies last time, and I'd like to go over them a bit as they will obviously influence other tables.
To summarize:
There are a number of sequence processing steps which take place to convert chromatographs to FASTA-formatted (and therefore human-readable) DNA sequence.
We'd like to store the parameters passed to the various processing applications, and we'd like our system to be inherently flexible, as many of the steps will change over time. This is what I'd like to talk about today.
chromatograph -> FASTA sequence -> Enter finished sequence into the database
1. Are objects a way of capturing these data? Or will we be creating a type of "linked list" that tracks the different steps?
2. How do we actually call a program from Oracle? I'd like to at least try phred on some chromatographs to see how the process is done.
3. Does the database need to be "locked" while sequences are added?
begin chem_list
"DP4%Ac{T3}" primer rhodamine
"DP4%Ac{T7}" primer rhodamine
. . .
"DP5%CEHV(KS)" primer rhodamine
"DP5%CEHV(SK)" primer rhodamine
"DP5%LR(KS)" primer rhodamine
"DP5%LR(SK)" primer rhodamine
"DP6%Ac{SP6}" primer rhodamine
. . .
"ET{-28m13rev}" primer energy-transfer
. . .
"DyeTerm{T7}-Set B" terminator rhodamine
. . .
end chem_list
| Field_Name | Data_Type | Restrictions | Example |
| Sequence_id | | Is key not null | |
| Organism | | -> taxa | |
| Accession | | -> accession | |
| Clone | | -> Clone_id | |
| Chromatograph | Filename | | |
| Sequence_conversion_id | Integer | -> sequence_conversion | |
| Raw_sequence | Long | Phred output | |
| Quality | | Phred output | |
| Vector_screening_id | Integer | -> vector screening | |
| Clean Sequence | Long | | |
| Last update: | Date | | |
This table needs to be completely redone. One long per table, ability to represent ANY sequence. META-sequence information.
As the sequence conversion and vector stripping processes take place in batch, I think it would be easiest to break those pieces of sequence related information into separate tables:
| Field_Name | Data_Type | Restrictions | Example |
| Conversion_id | | Primary key | |
| Conversion_program | | Default phred | |
| Conversion_version | | Default 0.980904a | |
| Conversion_person | | -> people_id | |
| Conversion_platform | | Default NT | |
| Trim | Boolean | Default 0 | |
Trim is the only command line variable which alters the output of phred.
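The defaults in the conversion table mean a batch can be recorded by supplying only what differs from the usual run. A minimal sketch, again using Python with SQLite in place of Oracle; the table name `sequence_conversion`, the helper `new_conversion_batch`, and the storage of the Boolean as 0/1 are assumptions for illustration.

```python
import sqlite3

# Hypothetical DDL mirroring the conversion table above (SQLite, not Oracle).
DDL = """
CREATE TABLE sequence_conversion (
    conversion_id       INTEGER PRIMARY KEY,
    conversion_program  TEXT    DEFAULT 'phred',
    conversion_version  TEXT    DEFAULT '0.980904a',
    conversion_person   INTEGER,            -- -> people_id
    conversion_platform TEXT    DEFAULT 'NT',
    trim                INTEGER DEFAULT 0   -- Boolean stored as 0/1
);
"""

COLUMNS = {
    "conversion_program", "conversion_version",
    "conversion_person", "conversion_platform", "trim",
}


def new_conversion_batch(conn, person_id, **overrides):
    """Insert one batch row; columns not supplied take the table defaults."""
    unknown = set(overrides) - COLUMNS
    if unknown:
        raise ValueError(f"unknown columns: {sorted(unknown)}")
    values = {"conversion_person": person_id, **overrides}
    names = ", ".join(values)              # names were checked against COLUMNS
    marks = ", ".join("?" for _ in values)
    cur = conn.execute(
        f"INSERT INTO sequence_conversion ({names}) VALUES ({marks})",
        tuple(values.values()),
    )
    return cur.lastrowid


conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
batch_id = new_conversion_batch(conn, person_id=12)
print(conn.execute(
    "SELECT conversion_program, conversion_version, trim "
    "FROM sequence_conversion WHERE conversion_id = ?", (batch_id,)
).fetchone())
```

Every sequence converted in the same batch would then carry the same Sequence_conversion_id, so the parameters are stored once per batch rather than once per sequence.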
| Field_Name | Data_Type | Restrictions | Example |
| Vector_stripping_id | | Primary key | |
| Vector_program | | Default cross_match | |
| Vector_version | | Default 0.990319 | |
| Vector_platform | | Default NT | |
| Vector_person | | -> people_id | |
| Vector_database | | -> vector_database_id | |
| Penalty | Integer | Mismatch penalty | |
| Gap_init | Integer | Gap initiation penalty | |
| Gap_ext | Integer | Gap extension penalty | |
| Ins_gap_ext | Integer | Insertion gap extension penalty | |
| Del_gap_ext | Integer | Deletion gap extension penalty | |
| Matrix | Varchar | Matrix instead of penalties | Not yet implemented in cross_match, but should be soon |
| Raw | Bitflag | Use raw SW scores instead of complexity adjusted | |
| Minmatch | Default=14 | Minimum length of word to begin SW comparison | |
| Maxmatch | Default=30 | Maximum word length | |
| Max_group_size | | | |
| … | | | |
This is getting a bit arduous to type in, and probably no one cares much beyond knowing that most of the values can be defaulted and need to be changed only in batch. Docs for cross_match (explaining the parameters) are here: http://bozeman.mbt.washington.edu/phrap.docs/phrap.html.
How do I handle a vector database? Well, it's just a number of vectors in a FASTA file. So, as we already have the vector table with a sequence field, a vector database can simply list the vectors it contains:
| Field_Name | Data_Type | Restrictions | Example |
| Vector_database_id | Integer | Is key not null | |
| Vector_id | Integer | -> vector table | |
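Since a vector database is just a set of vectors in a FASTA file, the file could be regenerated from these two tables whenever cross_match needs it. A minimal sketch in Python with SQLite standing in for Oracle; the `name` column on the vector table and the function `vector_database_fasta` are assumptions for illustration. Note that if one Vector_database_id is to list several vectors, the key has to be the pair (Vector_database_id, Vector_id), which is what the sketch assumes.

```python
import sqlite3

# Hypothetical schema: only the columns needed for the example.
SCHEMA = """
CREATE TABLE vector (
    vector_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,      -- assumed column
    sequence  TEXT NOT NULL
);
CREATE TABLE vector_database (
    vector_database_id INTEGER NOT NULL,
    vector_id          INTEGER NOT NULL REFERENCES vector(vector_id),
    PRIMARY KEY (vector_database_id, vector_id)
);
"""


def vector_database_fasta(conn, vector_database_id, width=50):
    """Return the FASTA text for every vector listed in one vector database."""
    rows = conn.execute(
        "SELECT v.name, v.sequence FROM vector_database d "
        "JOIN vector v ON v.vector_id = d.vector_id "
        "WHERE d.vector_database_id = ? ORDER BY v.vector_id",
        (vector_database_id,),
    )
    lines = []
    for name, seq in rows:
        lines.append(">" + name)
        # Wrap the sequence at a fixed line width.
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "".join(line + "\n" for line in lines)


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO vector VALUES (1, 'vecA', 'ACGTACGT')")
conn.execute("INSERT INTO vector_database VALUES (7, 1)")
print(vector_database_fasta(conn, 7), end="")
```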
In our dataflow model, we've just taken the chromatographs, run them through phred and cross_match and now we have cleaned sequences.
I haven't tried it yet, so I don't know how easy it is to call external programs.
Q: How do we, in practice, do all this data processing?
1) get sequences by ftp or cd (probably manually)
2) call phred
3) store phred output
4) make vector database
5) call cross_match
6) store cross_match output
As my data is stored in flat files now, it's easy: just a bunch of Perl scripts "wrapped" together by a "meta" Perl script. I'm assuming I just need to alter them to do the data handling for Oracle?
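The "meta" script idea can be sketched as a small driver that runs each external program and keeps a record of exactly what was run. This is an illustration in Python rather than the Perl scripts described here; `run_step` and the log structure are hypothetical, and the stand-in command only prints a string, because the real phred and cross_match command lines are not given in these notes.

```python
import subprocess
import sys


def run_step(name, command, log):
    """Run one external program; record the command line and its output."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    log.append({"step": name, "command": list(command), "stdout": result.stdout})
    return result.stdout


log = []
# Stand-in for a real step: a pipeline would pass the phred or cross_match
# command line here instead of this trivial Python one-liner.
output = run_step("demo", [sys.executable, "-c", "print('ACGT')"], log)
print(output.strip(), len(log))
```

Each log entry holds the parameters that the conversion and vector stripping tables are meant to store, so the same driver could write them to the database after every step.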
For the curious, here's what cross_match "cleaned" sequence looks like:
>sequence_name 746 0 746 ABI
GGGGAGGGAAGGAGGCAGTTGAATAGGAAGACCAAACCGGGTGGAAAGTA
GATGGGCCCTAGGCGCGATCTAGATGTACTAACGAGATATAATTTTTATG
GATAAATAATTAACAGCCCAAATTTAATATATGATTGATTAGGAATCCAC
ATAACACATGATGCGTTCAACTTACAGGGAACGTGTCTTTACACCTATCA
TCAAACCCTAACACAGTAAAGATATTCAAATTCTTAAGAGCTAGTGAATT
GGGTAACAGCCTTTGTGCCTTCAGAGACGGCATGCTTAGCCAATTCACCA
GGAAGGACCAATCGAACAGCCGTCTGAATTTCCCGAGAAGTTATAGTAGG
CTTCTTCTCGTGCCGAATTCTTTGGATCCACTAGTGTCGACCTGCAGGCG
CGCXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
From the phred docs: (http://bozeman.mbt.washington.edu/phrap.docs/phred.html)
The FASTA header, as written by phred, contains the following fields:
>chromat_name 1323 15 548 ABI
where the chromatogram name immediately follows the header delimiter, which is ">", the first integer is the number of bases called by phred, the second integer is the number of bases trimmed off the beginning of the sequence, the third integer is the number of bases remaining following trimming, and the string describes the type of input file, which is either ABI or SCF.
As I generally do NOT let phred do the trimming, my values on the sequence header are always 0 and (slen). I've only received ABI files to date.
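The header fields described in the phred docs can be pulled apart with a few lines of code. A minimal sketch in Python, assuming the header sits on a single line with whitespace-separated fields; the names `PhredHeader` and `parse_phred_header` are made up for this example.

```python
from typing import NamedTuple


class PhredHeader(NamedTuple):
    name: str             # chromatogram name
    bases_called: int     # number of bases called by phred
    trimmed_start: int    # bases trimmed off the beginning
    bases_remaining: int  # bases remaining after trimming
    file_type: str        # ABI or SCF


def parse_phred_header(line):
    """Split a phred FASTA header line into its five fields."""
    if not line.startswith(">"):
        raise ValueError("not a FASTA header: %r" % line)
    fields = line[1:].split()
    if len(fields) != 5:
        raise ValueError("expected 5 fields, got %d" % len(fields))
    name, called, trimmed, remaining, file_type = fields
    return PhredHeader(name, int(called), int(trimmed), int(remaining), file_type)


header = parse_phred_header(">chromat_name 1323 15 548 ABI")
print(header.name, header.bases_called, header.file_type)
```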
The X's in the sequence are the result of cross_match. Anything that gets SWAT'ed against vector with a high enough score is converted to an X. This is why I feel it necessary to store the un-cross_matched sequence as well as the processed sequence, so we can trace cross_match errors.
However, it's uninformative for the users of the database to have "trash" sequences in their input/output.
So, following TIGR's lead, any sequence which is ≥ 80% X is considered trash.
This is done (again) using Perl.
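The 80% rule is simple enough to state in code. A minimal sketch in Python (not the Perl script mentioned above); the function name `is_trash` and the decision to count an empty read as trash are assumptions.

```python
def is_trash(sequence, threshold_percent=80):
    """Return True if at least threshold_percent of the bases are X."""
    seq = "".join(sequence.split()).upper()  # drop line breaks and spaces
    if not seq:
        return True  # assumption: an empty read carries no usable sequence
    # Integer arithmetic avoids floating-point edge cases at exactly 80%.
    return seq.count("X") * 100 >= threshold_percent * len(seq)


print(is_trash("ACGT" + "X" * 16))   # 16 of 20 bases masked
print(is_trash("ACGTACGT" + "X" * 2))
```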
The same Perl script also removes the X's from the files and makes a new clean sequence which has the X's removed as well as the extra info on the header line. So my (fictitious) sequence from before looks like this after processing:
>sequence_name
GGGGAGGGAAGGAGGCAGTTGAATAGGAAGACCAAACCGGGTGGAAAGTA
GATGGGCCCTAGGCGCGATCTAGATGTACTAACGAGATATAATTTTTATG
GATAAATAATTAACAGCCCAAATTTAATATATGATTGATTAGGAATCCAC
ATAACACATGATGCGTTCAACTTACAGGGAACGTGTCTTTACACCTATCA
TCAAACCCTAACACAGTAAAGATATTCAAATTCTTAAGAGCTAGTGAATT
GGGTAACAGCCTTTGTGCCTTCAGAGACGGCATGCTTAGCCAATTCACCA
GGAAGGACCAATCGAACAGCCGTCTGAATTTCCCGAGAAGTTATAGTAGG
CTTCTTCTCGTGCCGAATTCTTTGGATCCACTAGTGTCGACCTGCAGGCG
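The cleaning step just described (drop the X's, keep only the name on the header line) might look like the following. This is a sketch in Python rather than the Perl actually used; `clean_record` and the 50-column line width are assumptions for illustration.

```python
def clean_record(header, sequence, width=50):
    """Return a FASTA record with X's removed and only the name in the header."""
    if not header.startswith(">") or len(header.split()) == 0:
        raise ValueError("not a FASTA header: %r" % header)
    name = header[1:].split()[0]  # discard the counts and file type
    bases = "".join(sequence.split()).replace("X", "")
    lines = [">" + name]
    # Re-wrap the remaining bases at a fixed line width.
    lines.extend(bases[i:i + width] for i in range(0, len(bases), width))
    return "\n".join(lines)


print(clean_record(">sequence_name 746 0 746 ABI", "ACGTAC\nGTXXXX\nXXXXXX"))
```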
Trash also screens sequences out if they are considered "odd". For example, if a sequence has a long string of XXXs in the middle, but none at the ends:
>sequence_name
GGGGAGGGAAGGAGGCAGTTGAATAGGAAGACCAAACCGGGTGGAAAGTA
GATGGGCCCTAGGCGCGATCTAGATGTACTAACGAGATATAATTTTTATG
GATAAATAATTAACAGCCCAAATTTAATATATGATTGATXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXCTAACACAGTAAAGATATTCAAATTCTTAAGAGCTAGTGAATT
GGGTAACAGCCTTTGTGCCTTCAGAGACGGCATGCTTAGCCAATTCACCA
GGAAGGACCAATCGAACAGCCGTCTGAATTTCCCGAGAAGTTATAGTAGG
CTTCTTCTCGTGCCGAATTCTTTGGATCCACTAGTGTCGACCTGCAGGCG
which doesn't make a lot of sense to me, then that sequence is "set aside" for further evaluation.
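One way to flag such "odd" sequences automatically is to look for a long run of X's that touches neither end. A rough sketch in Python; the function name and the `min_run` cutoff of 20 are assumptions, since the notes do not say how long a run has to be before it counts.

```python
import re


def has_internal_vector_block(sequence, min_run=20):
    """True if a run of at least min_run X's sits in the middle, with none at the ends."""
    seq = "".join(sequence.split()).upper()
    if not seq or seq.startswith("X") or seq.endswith("X"):
        return False  # masking at an end is the expected case
    return re.search("X{%d,}" % min_run, seq) is not None


print(has_internal_vector_block("ACGT" + "X" * 30 + "ACGT"))
print(has_internal_vector_block("ACGT" + "X" * 30))
```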
I use the unclean sequence, BLAST it against the non-redundant nucleotide database and look for matches.
How do we store BLAST?
How do we make contigs?
BLAST parameters (BLASTN, TBLASTX, etc.)
Assembler/Phrap/others….