Dave Schneider presented the Taxonomy tables. Here they are, as I understand them, as an ERD.
It's hard to read like this, so I'll try to remember to attach it to this handout as well….
Please note:
1. I don't know what the DIVISIONS table links to
2. I made the ACCESSIONS table as a placeholder as per the meeting notes - we'll need more discussion on how to create and fill that table
3. The data source for all of this information (with the notable exception of the ACCESSIONS table and the inherited_chlorocode field in the TAXA table) is NCBI -- we can download a flatfile database and parse it into Oracle.
While constructing the sequence tables, I tried to keep two things in mind:
1. How do we get the data? and
2. What questions do we want to ask / What data do we need to keep track of?
As mentioned in earlier meetings, one of the problems we are facing in our design considerations is that our SynTom database is not only a plant genome database similar to SolGenes, RiceGenes, AtDB, etc., but it also needs to track laboratory information from a number of different labs. Secondarily, my design considerations include questions related to the bioinformatics of SynTom. I believe it is important to track not only the "processed" data that most users will interact with (cleaned sequence, contigs, BLAST results, etc.) but also the "raw" data (chromatographs, etc.), as well as the programs and parameters used to convert the raw data into processed data (phred, TIGR assembler, etc.).
So, where to begin?
It is important for the design of this database to understand the flow of data into the database. I'm going to try to show the process by which data enters the database through Data Flow Diagrams (DFDs) and through task lists (which require less drawing on my part). We've gone over this before in previous meetings, but in order to be clear, I'm going to summarize the data flow as I believe this information is important for the proper design of the tables.
This is a brief overview of how sequences come into being.
1. DNA from a specific organism is isolated. For expressed sequences, the tissue and the conditions under which the isolation takes place are very important
2. The DNA is then cloned into a vector - generally a bacterial vector, but fungal and viral vectors exist as well
3. The vector containing the DNA of interest is "injected" into a host organism, again generally a bacterium
4. This clone is then grown -- increasing the number of copies of our DNA of interest
5. The vector DNA (insert included) is isolated from the clone and then sequenced
6. The end result of the sequencing is an electropherogram sometimes called a chromatogram or a chromatograph.
7. Chromatographs are either downloaded via FTP or read off a CD-ROM
8. Once the files are stored locally, phred is run on the directory containing the files
9. Phred is the chromatograph reader; its output is twofold: a FASTA-formatted sequence for each chromatograph and a space-delimited quality file.
10. Cross_match is then run on the phred-derived files. Cross_match removes vector contamination from sequences. The program requires a FASTA-formatted vector database to perform the screening. Cross_match provides FASTA-formatted sequences as output.
There you have it: 10 easy steps from your organism du jour to clean, usable sequence. Further processing occurs, of course, but for all intents and purposes, the raw data ends with clean FASTA-formatted sequence.
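To make the phred outputs from step 9 a bit more concrete, here is a minimal Python sketch that reads the two files (FASTA sequence and space-delimited quality) into dicts keyed by read name. The read name and values are invented for illustration; only the file shapes come from the recipe above.

```python
# Minimal sketch: parse phred's two outputs -- a FASTA file and a
# space-delimited quality file -- into dicts keyed by read name.
# The read name and values here are invented examples.

def _records(text, sep):
    """Yield (name, body) pairs from FASTA-style text, joining body lines with sep."""
    name, chunks = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if name is not None:
                yield name, sep.join(chunks)
            name, chunks = line[1:].split()[0], []
        elif line.strip():
            chunks.append(line.strip())
    if name is not None:
        yield name, sep.join(chunks)

def parse_fasta(text):
    """{read_name: sequence} -- sequence lines concatenated directly."""
    return dict(_records(text, ""))

def parse_quality(text):
    """{read_name: [int phred scores]} -- quality lines joined with spaces."""
    return {name: [int(q) for q in body.split()]
            for name, body in _records(text, " ")}

seqs = parse_fasta(">cLET1A1\nACTGACTG\n")
quals = parse_quality(">cLET1A1\n20 20 35 40 40 35 20 9\n")
```

The only subtlety is the join separator: sequence lines concatenate directly, while quality lines must keep a space between them or adjacent scores would run together.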
Obviously there's a lot of data that can and should be stored in order to track errors, make connections, etc.
Last time we discussed how organismal data will be stored -- each organism is given a taxon_id which uniquely identifies it. This id number can also be used (via its relationship with the parent_id) to understand the evolutionary context of an organism, i.e. how one organism is related to another. Today we're going to continue this top-down approach and break an organism into parts which will be represented by the database, until we get to an individual sequence. I'll be using the EST sequencing model for this discussion, but genomic sequence can also be stored in the manner I hope to describe.
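As a quick sanity check on the taxon_id/parent_id idea, here is a Python sketch that walks the parent_id chain to recover a lineage. The toy table uses NCBI-style ids for illustration only; the real taxa table will live in Oracle.

```python
# Sketch: recovering an organism's evolutionary context by walking the
# taxon_id -> parent_id chain. The toy table below is illustrative,
# not the real NCBI download.

taxa = {
    1: ("root", 1),                           # root points at itself
    33090: ("Viridiplantae", 1),
    4070: ("Solanaceae", 33090),
    4081: ("Lycopersicon esculentum", 4070),
}

def lineage(taxon_id):
    """Return names from the root down to the given taxon."""
    names = []
    while True:
        name, parent = taxa[taxon_id]
        names.append(name)
        if parent == taxon_id:                # reached the root
            break
        taxon_id = parent
    return names[::-1]
```

The same walk works as a self-join (or a loop in a stored procedure) once the table is in the database.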
[Flowchart: libraries, clones, and vectors feeding into phred and Cross_match]
Each library is made up of individual clones
Each library is constructed using a specific vector
Each vector has a sequence
Each clone has a sequence
Let's try to represent Libraries as a relational table:
A Library is:
A collection of all of the mRNA molecules present in a tissue, all converted into cDNA molecules and then inserted into vectors (other DNA molecules which can continue to replicate after the addition of foreign DNA).
Field_Name | Data_Type | Restrictions | Example
Library_id | Integer | Not null, primary key | 327
Tissue | Varchar2(256) | | Leaf
Accession | Integer | -> accession table | 9 - TA496
Organism | Integer | -> taxa table | 21 - L. esculentum
Developmental_Stage | Varchar2(256) | | 4 week old
Treatment | Varchar2(256) | | Pto inoculated
Date_created | Date | | May 11, 1999
Created_by | Integer | -> Lab table | 3 - Martin
Host | Varchar2(256) | | SOLR
Vector | Integer | -> Vector table | 7 - pBlueScript SK(-)
PFU | Integer | | 10^10
Average_Insert_Size | Integer | | 350
Availability | Boolean | | TRUE
Type | Varchar2(256) | | mRNA
Library_Name | Varchar2(256) | | Tomato susceptible, Cornell
Comments | Varchar2(256) | | Blah blah blah
Distributed | Integer | -> Lab |
Does that capture everything we need to represent a library?
Most of these data will come from GenBank records as well as records we set up as libraries are produced. There aren't that many libraries right now, so the data entry part of this shouldn't be too bad.
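To see the Libraries table live, here is a sketch in Python using SQLite as a stand-in for Oracle (so Varchar2, Date, and Boolean become TEXT and INTEGER); the column names follow the table above and the row values come from its Example column.

```python
import sqlite3

# Sketch of the Libraries table as runnable DDL. SQLite stands in for
# Oracle here; the "->" foreign-key targets are noted as comments.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE libraries (
    library_id          INTEGER PRIMARY KEY,
    tissue              TEXT,
    accession           INTEGER,  -- -> accession table
    organism            INTEGER,  -- -> taxa table
    developmental_stage TEXT,
    treatment           TEXT,
    date_created        TEXT,
    created_by          INTEGER,  -- -> lab table
    host                TEXT,
    vector              INTEGER,  -- -> vector table
    pfu                 INTEGER,
    average_insert_size INTEGER,
    availability        INTEGER,  -- boolean as 0/1
    type                TEXT,
    library_name        TEXT,
    comments            TEXT,
    distributed         INTEGER   -- -> lab
);
""")
con.execute("INSERT INTO libraries (library_id, tissue, organism, library_name)"
            " VALUES (327, 'Leaf', 21, 'Tomato susceptible, Cornell')")
row = con.execute("SELECT tissue, library_name FROM libraries"
                  " WHERE library_id = 327").fetchone()
```

The Oracle version would use Varchar2(256)/Date/Number and real FOREIGN KEY constraints; this is just to check that the column list hangs together.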
How does this change the simple ERM?
[ERD: Accessions, Taxa, Lab, Libraries, Vectors, Sequences, Clones]
Now let's work on the Vectors Table:
Field_Name | Data_Type | Restrictions | Example
Vector_id | Integer | Not null, primary key | 7
Vector_name | Varchar2(256) | | pBlueScript SK(-)
Antibiotic | Varchar2(256) | | Kanamycin
Restriction_Site_1 | Varchar2(256) | | EcoRI
Restriction_Site_2 | Varchar2(256) | | XhoI
Vector_sequence | Long | | ACTG…
Vector_Vendor | Varchar2(256) | | Stratagene
Comments | | |
No foreign keys….. no new tables (yet <grin>)
These data need to be entered manually for the time being. We'll need to create a CGI-like form for submission.
Now the Clones table
Field_Name | Data_Type | Restrictions | Example
Clone_id | Integer | Not null, primary key | 7
Library_id | Integer | -> library | 13
Plate_number | | |
Well_number | | |
Row_number | | |
CU_id | | |
Comments | | |
We'll need a clone synonym table that converts TIGR names into Cornell names.
Field_Name | Data_Type | Restrictions | Example
Other_Clone_Name | | |
Our_Clone_Name | | |
Clone_id | | |
The key here will be a multiple-field (composite) key, probably Other_Clone_Name plus one of the other fields.
Anything I missed?
The Clones table will be automatable; we'll probably have to come up with some simple scripts to look up a clone_id and get its library_id. I don't know how automatable plate, well, and row numbers are for these data, but I'm sure I'll find out soon. CU_id comes from Theresa's spreadsheet, so it is again scriptable input. The clone synonym table should be automatable, as we have translator files now anyway.
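A sketch of what the translator-file script might look like. The two-column layout (TIGR name, then Cornell name) and the names themselves are assumptions for illustration -- the real translator files may be formatted differently.

```python
# Sketch: filling the clone-synonym table from a translator file.
# Assumed format: one "other_name our_name" pair per line.

translator = """\
TC12345 cLET1A1
TC12346 cLET1B7
"""

def load_synonyms(text):
    """Return {other_clone_name: our_clone_name} from translator text."""
    pairs = {}
    for line in text.splitlines():
        if not line.strip():
            continue                      # skip blank lines
        other, ours = line.split()
        pairs[other] = ours
    return pairs

synonyms = load_synonyms(translator)
```

Each pair then becomes an INSERT into the clone-synonym table, with Clone_id looked up from Our_Clone_Name.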
Another look at the simple ERM: [ERD: Accessions, Taxa, Lab, Libraries, Vectors, Sequences, Clone Synonyms, Clones]
The library table has a Created_by field which I suggested be tied to a lab (or a group, or a company).
Here's the infamous People table . . .
Field_Name | Data_Type | Restrictions | Example
People_id | | Not null, primary key |
Last_name | | |
First_name | | |
Title | | |
Address1 | | |
Address2 | | |
Department | | |
University | | |
City | | |
State | | |
Zip | | |
Country | | |
Phone | | |
Fax | | |
Homepage | | |
Initials | | |
Project | | |
Group | | | Martin
Security | | |
Comments | | |
Is that the right way to handle labs? Or should labs be a separate table:
Field_Name | Data_Type | Restrictions | Example
Group_id | Integer | Not null, primary key |
Group_name | Varchar2(256) | | Martin
Group_leader | Integer | -> People_id |
Let's discuss….
These data need to be entered by hand. We'll need another web form. Need to discuss: Oracle datatype for a running counter….
There's a problem with the way I've drawn these tables up, though: multiple people can be the "author" of a library, a sequence, a reference, or any other kind of data…
Authorship Table:
Field_Name | Data_Type | Restrictions | Example
People_id | | |
Library_id | | |
Reference_id | | |
Sequence_id | | |
Probably more later….. but now our diagram looks a bit more complicated:
[ERD: Accessions, Taxa, People, Libraries, Vectors, Authorships, Sequences, Clone Synonyms, Clones]
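Here is a small sketch showing why the Authorship table fixes the multiple-author problem: two people attached to one library through the join table, with SQLite again standing in for Oracle. The names and ids are invented.

```python
import sqlite3

# Sketch: the Authorship table as a many-to-many join between
# People and Libraries. Rows below are invented examples.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE people (people_id INTEGER PRIMARY KEY, last_name TEXT);
CREATE TABLE libraries (library_id INTEGER PRIMARY KEY, library_name TEXT);
CREATE TABLE authorship (people_id INTEGER, library_id INTEGER);
INSERT INTO people VALUES (1, 'Martin'), (2, 'Smith');
INSERT INTO libraries VALUES (327, 'Tomato susceptible, Cornell');
INSERT INTO authorship VALUES (1, 327), (2, 327);  -- two authors, one library
""")
authors = [r[0] for r in con.execute("""
    SELECT p.last_name
    FROM people p
    JOIN authorship a ON a.people_id = p.people_id
    WHERE a.library_id = 327
    ORDER BY p.last_name
""")]
```

The same join works symmetrically for references and sequences, which is why Reference_id and Sequence_id sit in the same table.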
So now it looks like it's time to start making the sequence tables. As mentioned before (http://syntom.cit.cornell.edu/database/Jun1599.htm), sequences come from two primary sources:
1. Raw sequencing data (chromatographs), generally produced outside of the Cornell Theory Center
2. FASTA-formatted files - generally from NCBI or internal
How are these data retrieved?
As a "recipe":
1. Chromatographs are either downloaded via FTP or read off a CD-ROM
2. Once the files are stored locally, phred is run on the directory containing the files
3. Phred is the chromatograph reader; its output is twofold: a FASTA-formatted sequence for each chromatograph and a space-delimited quality file
4. Cross_match is then run on the phred-derived files. Cross_match removes vector contamination from sequences. The program requires a FASTA-formatted vector database to perform the screening. Cross_match provides FASTA-formatted sequences as output.
Some notes:
Phred takes no parameters other than those related to input and output naming and format.
I've already gone over the GenBank fields so I won't do that again. But, the GenBank info is important and will be kept.
With those notes in mind:
Field_Name | Data_Type | Restrictions | Example
Sequence_id | | Not null, primary key |
Organism | | -> taxa |
Accession | | -> accession |
Clone | | -> Clone_id | (Should I make a conversion table?)
Filename | | |
Conversion_program | | Default phred |
Conversion_version | | Default 0.980904.a |
Conversion_platform | | Default NT |
Conversion_person | | -> people_id |
Raw_sequence | | |
Quality | | |
Vector_program | | Default cross_match |
Vector_version | | Default 0.990319 |
Vector_platform | | Default NT |
Vector_person | | -> people_id |
Vector_database | | ??? |
How do I handle a vector database? Well, it's just a number of vectors in a FASTA file, and we already have the Vectors table with a sequence field, therefore…
Vector_database table
Field_Name | Data_Type | Restrictions | Example
Vector_database_no | Integer | Not null, primary key |
Vector_id | Integer | -> vector table |
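A sketch of the idea: regenerate the cross_match vector database (a FASTA file) from Vectors-table rows, so the Vector_database table only has to record which vector_ids went into it. The sequences below are made up; the real ones come from the Vector_sequence field.

```python
# Sketch: building the cross_match vector database (FASTA text) from
# rows of the Vectors table. The id -> (name, sequence) mapping below
# is an invented stand-in for a SELECT against the Vectors table.

vectors = {
    7: ("pBlueScript SK(-)", "ACTGACTGACTG"),
    8: ("pUC19", "GGCCGGCCGGCC"),
}

def vector_fasta(vector_ids):
    """Build FASTA text for the chosen vectors, in the given order."""
    chunks = []
    for vid in vector_ids:
        name, seq = vectors[vid]
        chunks.append(">%s\n%s\n" % (name, seq))
    return "".join(chunks)

db = vector_fasta([7, 8])
```

Given this, a Vector_database row is just (vector_database_no, vector_id), and the FASTA file used for any particular screening run can always be reconstructed.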
Stopping here for now, as more info will be presented next time.
Andreas
June 30, 1999