Dave Schneider presented the Taxonomy tables. Here they are, as I understand them, as an ERD.
It's hard to read like this, so I'll try to remember to attach it to this handout as well….
Please note:
1. I don't know what the DIVISIONS table links to
2. I made the ACCESSIONS table as a placeholder as per the meeting notes - we'll need more discussion on how to create and fill that table
3. The data source for all of this information (with the notable exception of the ACCESSIONS table and the inherited_chlorocode field in the TAXA table) is NCBI -- we can download a flatfile database and parse it into Oracle.
While constructing the sequence tables, I tried to keep two things in mind:
1. How do we get the data? and
2. What questions do we want to ask / What data do we need to keep track of?
As mentioned in earlier meetings, one of the problems we are facing in our design considerations is that our SynTom database is not only a plant genome database similar to SolGenes, RiceGenes, AtDB, etc., but it also needs to track laboratory information from a number of different labs. Secondarily, my design considerations include questions related to the bioinformatics of SynTom. I believe it is important to track not only the "processed" data that most users will interact with (cleaned sequence, contigs, BLAST results, etc.) but also the "raw" data (chromatographs, etc.), as well as the programs and parameters used to convert the raw data into processed data (phred, TIGR assembler, etc.).
So, where to begin?
It is important for the design of this database to understand the flow of data into the database. I'm going to try to show the process by which data enters the database through Data Flow Diagrams (DFDs) and through task lists (which require less drawing on my part). We've gone over this before in previous meetings, but in order to be clear, I'm going to summarize the data flow as I believe this information is important for the proper design of the tables.
This is a brief overview of how sequences come into being.
1. DNA from a specific organism is isolated. For expressed sequences, the tissue and the conditions under which the isolation takes place are very important
2. The DNA is then cloned into a vector - generally a bacterial vector, but fungal and viral vectors exist as well
3. The vector containing the DNA of interest is "injected" into a host organism, again generally a bacterium
4. This clone is then grown -- increasing the number of copies of our DNA of interest
5. The vector DNA (insert included) is isolated from the clone and then sequenced
6. The end result of the sequencing is an electropherogram sometimes called a chromatogram or a chromatograph.
7. Chromatographs are either downloaded via FTP or read off a CD-ROM
8. Once the files are stored locally, phred is run on the directory containing the files
9. Phred is the chromatograph reader; its output is twofold: a FASTA-formatted sequence for each chromatograph and a space-delimited quality file.
10. Cross_match is then run on the phred-derived files. Cross_match removes vector contamination from sequences. The program requires a FASTA-formatted vector database to perform the screening. Cross_match provides FASTA-formatted sequences as output.
There you have it: 10 easy steps from your organism du jour to clean, usable sequence. Further processing occurs, of course, but for all intents and purposes, the raw data ends with clean FASTA-formatted sequence.
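To make the phred outputs from step 9 a bit more concrete, here is a minimal Python sketch that reads the two files (FASTA sequence and space-delimited quality) into dicts keyed by read name. The read name and values are invented for illustration; only the file shapes come from the recipe above.

```python
# Minimal sketch: parse phred's two outputs -- a FASTA file and a
# space-delimited quality file -- into dicts keyed by read name.
# The read name and values here are invented examples.

def _records(text, sep):
    """Yield (name, body) pairs from FASTA-style text, joining body lines with sep."""
    name, chunks = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if name is not None:
                yield name, sep.join(chunks)
            name, chunks = line[1:].split()[0], []
        elif line.strip():
            chunks.append(line.strip())
    if name is not None:
        yield name, sep.join(chunks)

def parse_fasta(text):
    """{read_name: sequence} -- sequence lines concatenated directly."""
    return dict(_records(text, ""))

def parse_quality(text):
    """{read_name: [int phred scores]} -- quality lines joined with spaces."""
    return {name: [int(q) for q in body.split()]
            for name, body in _records(text, " ")}

seqs = parse_fasta(">cLET1A1\nACTGACTG\n")
quals = parse_quality(">cLET1A1\n20 20 35 40 40 35 20 9\n")
```

The only subtlety is the join separator: sequence lines concatenate directly, while quality lines must keep a space between them or adjacent scores would run together.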
Obviously there's a lot of data that can and should be stored in order to track errors, make connections, etc.
Last time we discussed how organismal data will be stored -- each organism is given a taxon_id which uniquely identifies it. This id number can also be used (via its relationship with the parent_id) to understand the evolutionary context of an organism, i.e. how one organism is related to another. Today we're going to continue this top-down approach and break an organism into parts which will be represented by the database, until we get to an individual sequence. I'll be using the EST sequencing model for this discussion, but genomic sequence can also be stored in the manner I hope to describe.
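As a quick sanity check on the taxon_id/parent_id idea, here is a Python sketch that walks the parent_id chain to recover a lineage. The toy table uses NCBI-style ids for illustration only; the real taxa table will live in Oracle.

```python
# Sketch: recovering an organism's evolutionary context by walking the
# taxon_id -> parent_id chain. The toy table below is illustrative,
# not the real NCBI download.

taxa = {
    1: ("root", 1),                           # root points at itself
    33090: ("Viridiplantae", 1),
    4070: ("Solanaceae", 33090),
    4081: ("Lycopersicon esculentum", 4070),
}

def lineage(taxon_id):
    """Return names from the root down to the given taxon."""
    names = []
    while True:
        name, parent = taxa[taxon_id]
        names.append(name)
        if parent == taxon_id:                # reached the root
            break
        taxon_id = parent
    return names[::-1]
```

The same walk works as a self-join (or a loop in a stored procedure) once the table is in the database.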
[Flowchart: libraries, clones, and vectors feeding into phred and Cross_match]
Each library is made up of individual clones
Each library is constructed using a specific vector
Each vector has a sequence
Each clone has a sequence
Let's try to represent Libraries as a relational table:
A Library is:
A collection of all of the mRNA molecules present in a tissue, all converted into cDNA molecules and then inserted into vectors (other DNA molecules which can continue to replicate after the addition of foreign DNA).
Field_Name | Data_Type | Restrictions | Example
Library_id | Integer | Not null, primary key | 327
Tissue | Varchar2(256) | | Leaf
Accession | Integer | -> accession table | 9 - TA496
Organism | Integer | -> taxa table | 21 - L. esculentum
Developmental_Stage | Varchar2(256) | | 4 week old
Treatment | Varchar2(256) | | Pto inoculated
Date_created | Date | | May 11, 1999
Created_by | Integer | -> Lab table | 3 - Martin
Host | Varchar2(256) | | SOLR
Vector | Integer | -> Vector table | 7 - pBlueScript SK(-)
PFU | Integer | | 10^10
Average_Insert_Size | Integer | | 350
Availability | Boolean | | TRUE
Type | Varchar2(256) | | mRNA
Library_Name | Varchar2(256) | | Tomato susceptible, Cornell
Comments | Varchar2(256) | | Blah blah blah
Distributed | Integer | -> Lab |
Does that capture everything we need to represent a library?
Most of these data will come from GenBank records as well as records we set up as libraries are produced. There aren't that many libraries right now, so the data entry part of this shouldn't be too bad.
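To see the Libraries table live, here is a sketch in Python using SQLite as a stand-in for Oracle (so Varchar2, Date, and Boolean become TEXT and INTEGER); the column names follow the table above and the row values come from its Example column.

```python
import sqlite3

# Sketch of the Libraries table as runnable DDL. SQLite stands in for
# Oracle here; the "->" foreign-key targets are noted as comments.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE libraries (
    library_id          INTEGER PRIMARY KEY,
    tissue              TEXT,
    accession           INTEGER,  -- -> accession table
    organism            INTEGER,  -- -> taxa table
    developmental_stage TEXT,
    treatment           TEXT,
    date_created        TEXT,
    created_by          INTEGER,  -- -> lab table
    host                TEXT,
    vector              INTEGER,  -- -> vector table
    pfu                 INTEGER,
    average_insert_size INTEGER,
    availability        INTEGER,  -- boolean as 0/1
    type                TEXT,
    library_name        TEXT,
    comments            TEXT,
    distributed         INTEGER   -- -> lab
);
""")
con.execute("INSERT INTO libraries (library_id, tissue, organism, library_name)"
            " VALUES (327, 'Leaf', 21, 'Tomato susceptible, Cornell')")
row = con.execute("SELECT tissue, library_name FROM libraries"
                  " WHERE library_id = 327").fetchone()
```

The Oracle version would use Varchar2(256)/Date/Number and real FOREIGN KEY constraints; this is just to check that the column list hangs together.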
How does this change the simple ERM?
[ERD: Accessions, Taxa, Lab, Libraries, Vectors, Sequences, Clones]
Now let's work on the Vectors Table:
Field_Name | Data_Type | Restrictions | Example
Vector_id | Integer | Not null, primary key | 7
Vector_name | Varchar2(256) | | pBlueScript SK(-)
Antibiotic | Varchar2(256) | | Kanamycin
Restriction_Site_1 | Varchar2(256) | | EcoRI
Restriction_Site_2 | Varchar2(256) | | XhoI
Vector_sequence | Long | | ACTG…
Vector_Vendor | Varchar2(256) | | Stratagene
Comments | | |
No foreign keys….. no new tables (yet <grin>)
These data need to be entered manually for the time being. We'll need to create a CGI-like form for submission.
Now the Clones table
Field_Name | Data_Type | Restrictions | Example
Clone_id | Integer | Not null, primary key | 7
Library_id | Integer | -> library | 13
Plate_number | | |
Well_number | | |
Row_number | | |
CU_id | | |
Comments | | |
We'll need a clone synonym table that converts TIGR names into Cornell names.
Field_Name | Data_Type | Restrictions | Example
Other_Clone_Name | | |
Our_Clone_Name | | |
Clone_id | | |
The key here will be a multiple-field (composite) key, probably Other_Clone_Name plus one of the other fields.
Anything I missed?
The Clones table will be automatable; we'll probably have to come up with some simple scripts to look up a clone_id and get its library_id. I don't know how automatable plate, well, and row numbers are for these data, but I'm sure I'll find out soon. CU_id comes from Theresa's spreadsheet, so it is again scriptable input. The clone synonym table should be automatable, as we have translator files now anyway.
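A sketch of what the translator-file script might look like. The two-column layout (TIGR name, then Cornell name) and the names themselves are assumptions for illustration -- the real translator files may be formatted differently.

```python
# Sketch: filling the clone-synonym table from a translator file.
# Assumed format: one "other_name our_name" pair per line.

translator = """\
TC12345 cLET1A1
TC12346 cLET1B7
"""

def load_synonyms(text):
    """Return {other_clone_name: our_clone_name} from translator text."""
    pairs = {}
    for line in text.splitlines():
        if not line.strip():
            continue                      # skip blank lines
        other, ours = line.split()
        pairs[other] = ours
    return pairs

synonyms = load_synonyms(translator)
```

Each pair then becomes an INSERT into the clone-synonym table, with Clone_id looked up from Our_Clone_Name.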
Another look at the simple ERM: [ERD: Accessions, Taxa, Lab, Libraries, Vectors, Sequences, Clone Synonyms, Clones]
The library table has a Created_by field which I suggested be tied to a lab (or a group, or a company).
Here's the infamous People table . . .
Field_Name | Data_Type | Restrictions | Example
People_id | | Not null, primary key |
Last_name | | |
First_name | | |
Title | | |
Address1 | | |
Address2 | | |
Department | | |
University | | |
City | | |
State | | |
Zip | | |
Country | | |
Phone | | |
Fax | | |
Homepage | | |
Initials | | |
Project | | |
Group | | | Martin
Security | | |
Comments | | |
Is that the right way to handle labs? Or should labs be a separate table:
Field_Name | Data_Type | Restrictions | Example
Group_id | Integer | Not null, primary key |
Group_name | Varchar2(256) | | Martin
Group_leader | Integer | -> People_id |
Let's discuss….
These data need to be entered by hand. We'll need another web form. Need to discuss: Oracle datatype for a running counter….
There's a problem with the way I've drawn these tables up, though: multiple people can be the "author" of a library, a sequence, a reference, or any other kind of data…
Authorship Table:
Field_Name | Data_Type | Restrictions | Example
People_id | | |
Library_id | | |
Reference_id | | |
Sequence_id | | |
Probably more later….. but now our diagram looks a bit more complicated:
[ERD: Accessions, Taxa, People, Libraries, Vectors, Authorships, Sequences, Clone Synonyms, Clones]
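Here is a small sketch showing why the Authorship table fixes the multiple-author problem: two people attached to one library through the join table, with SQLite again standing in for Oracle. The names and ids are invented.

```python
import sqlite3

# Sketch: the Authorship table as a many-to-many join between
# People and Libraries. Rows below are invented examples.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE people (people_id INTEGER PRIMARY KEY, last_name TEXT);
CREATE TABLE libraries (library_id INTEGER PRIMARY KEY, library_name TEXT);
CREATE TABLE authorship (people_id INTEGER, library_id INTEGER);
INSERT INTO people VALUES (1, 'Martin'), (2, 'Smith');
INSERT INTO libraries VALUES (327, 'Tomato susceptible, Cornell');
INSERT INTO authorship VALUES (1, 327), (2, 327);  -- two authors, one library
""")
authors = [r[0] for r in con.execute("""
    SELECT p.last_name
    FROM people p
    JOIN authorship a ON a.people_id = p.people_id
    WHERE a.library_id = 327
    ORDER BY p.last_name
""")]
```

The same join works symmetrically for references and sequences, which is why Reference_id and Sequence_id sit in the same table.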
So now it looks like it's time to start making the sequence tables. As mentioned before (http://syntom.cit.cornell.edu/database/Jun1599.htm), sequences come from two primary sources:
1. Raw sequencing data (chromatographs), generally produced outside of the Cornell Theory Center
2. FASTA-formatted files - generally from NCBI or internal
How are these data retrieved?
As a "recipe":
1. Chromatographs are either downloaded via FTP or read off a CD-ROM
2. Once the files are stored locally, phred is run on the directory containing the files
3. Phred is the chromatograph reader; its output is twofold: a FASTA-formatted sequence for each chromatograph and a space-delimited quality file
4. Cross_match is then run on the phred-derived files. Cross_match removes vector contamination from sequences. The program requires a FASTA-formatted vector database to perform the screening. Cross_match provides FASTA-formatted sequences as output.
Some notes:
Phred takes no parameters other than those related to input and output naming and format.
I've already gone over the GenBank fields so I won't do that again. But, the GenBank info is important and will be kept.
With those notes in mind:
Field_Name | Data_Type | Restrictions | Example
Sequence_id | | Not null, primary key |
Organism | | -> taxa |
Accession | | -> accession |
Clone | | -> Clone_id | (Should I make a conversion table?)
Filename | | |
Conversion_program | | Default phred |
Conversion_version | | Default 0.980904.a |
Conversion_platform | | Default NT |
Conversion_person | | -> people_id |
Raw_sequence | | |
Quality | | |
Vector_program | | Default cross_match |
Vector_version | | Default 0.990319 |
Vector_platform | | Default NT |
Vector_person | | -> people_id |
Vector_database | | ??? |
How do I handle a vector database? Well, it's just a number of vectors in a FASTA file, and we already have the Vectors table with a sequence field, therefore…
Vector_database table
Field_Name | Data_Type | Restrictions | Example
Vector_database_no | Integer | Not null, primary key |
Vector_id | Integer | -> vector table |
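A sketch of the idea: regenerate the cross_match vector database (a FASTA file) from Vectors-table rows, so the Vector_database table only has to record which vector_ids went into it. The sequences below are made up; the real ones come from the Vector_sequence field.

```python
# Sketch: building the cross_match vector database (FASTA text) from
# rows of the Vectors table. The id -> (name, sequence) mapping below
# is an invented stand-in for a SELECT against the Vectors table.

vectors = {
    7: ("pBlueScript SK(-)", "ACTGACTGACTG"),
    8: ("pUC19", "GGCCGGCCGGCC"),
}

def vector_fasta(vector_ids):
    """Build FASTA text for the chosen vectors, in the given order."""
    chunks = []
    for vid in vector_ids:
        name, seq = vectors[vid]
        chunks.append(">%s\n%s\n" % (name, seq))
    return "".join(chunks)

db = vector_fasta([7, 8])
```

Given this, a Vector_database row is just (vector_database_no, vector_id), and the FASTA file used for any particular screening run can always be reconstructed.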
Stopping here for now, as more info will be presented next time.
Andreas
June 30, 1999