SYNTOM Meeting Notes - ALM - June 30, 1999

 

Last Time:

 

Dave Schneider presented the Taxonomy tables.  Here they are, as I understand them, as an ERD.

 


 

 

 

 

 


It's hard to read like this, so I'll try to remember to attach it to this handout as well….

 

Please note:

 

1.        I don't know what the DIVISIONS table links to

2.        I made the ACCESSIONS table as a place holder as per the meeting notes - we'll need more discussion on how to create and fill that table

3.        The data source for all of this information (with the notable exception of the ACCESSIONS table and the inherited_chlorocode field in the TAXA table) is NCBI -- we can download a flatfile database and parse it into Oracle.

 

 

The Sequence Tables

 

 

While constructing the sequence tables, I tried to keep two things in mind:

 

1.        How do we get the data? and

2.        What questions do we want to ask / What data do we need to keep track of?

 

 

As mentioned in earlier meetings, one of the problems we are facing in our design considerations is that our SynTom database is not only a plant genome database similar to SolGenes, RiceGenes, AtDB, etc. but it also needs to track laboratory information from a number of different labs.  Secondarily, my design considerations include questions related to the bioinformatics of SynTom.  I believe it is important to track not only the "processed" data that most users will interact with (cleaned sequence, contigs, BLAST results, etc.) but also the "raw" data (chromatographs, etc.) as well as the programs and parameters used to convert the raw data into processed data (phred, TIGR assembler, etc.)

 

So, where to begin?

 

It is important for the design of this database to understand the flow of data into the database.  I'm going to try to show the process by which data enters the database through Data Flow Diagrams (DFDs) and through task lists (which require less drawing on my part).  We've gone over this before in previous meetings, but in order to be clear, I'm going to summarize the data flow as I believe this information is important for the proper design of the tables.

 

This is a brief overview of how sequences come into being. 

 

 

1.        DNA from a specific organism is isolated.  For expressed sequences, the tissue and conditions at which the isolation takes place is very important

2.        The DNA is then cloned into a vector - generally a bacterial vector, but fungal and viral vectors exist as well

3.        The vector containing the DNA of interest is "injected" into a host organism, again generally a bacterium

4.        This clone is then grown -- increasing the number of copies of our DNA of interest

5.        The clone has its vector DNA isolated and then sequenced

6.        The end result of the sequencing is an electropherogram sometimes called a chromatogram or a chromatograph.

7.        Chromatographs are either downloaded via FTP or read off a CD-ROM

8.        Once the files are stored locally, phred is run on the directory containing the files

9.        Phred is the chromatograph reader and its output is two-fold, a FASTA formatted sequence for each chromatograph as well as a space delimited quality file.

10.     Cross_match is then run on the phred-derived files.  Cross match removes vector contamination from sequences.  The program requires a vector database (which is FASTA formatted) to perform the screening.  Cross_match provides FASTA formatted sequences as output.

 

There you have it, 10 easy steps from your organism du jour to clean usable sequence.  Further processing occurs, of course, but for all extents and purposes, the raw data ends with clean FASTA formatted sequence.

 

Obviously there's a lot of data that can and should be stored in order to track errors, make connections, etc.

 

Last time we discussed how organismal data will be stored -- each organism is given a taxon_id which uniquely identifies it.  This id number can also be used (via its relationship with the parent_id ) to understand the evolutionary context of an organism i.e. how one organism is related to another.  Today we're going to continue this top-down approach and break an organism into parts which will be represented by the database, until we get to an individual sequence.  I'll be using the EST sequencing model for this discussion, but genomic sequence can also be stored in the manner I hope to describe.

 

 

 

 

Flowchart: Multidocument: Other organisms
 

 

 

 

 

 

 

 

 

 

 

 

Flowchart: Data: Hosts

Flowchart: Data: Chromatographs
 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 


 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

phred

 
 

 

 

 

 

 

 

 

 

Flowchart: Data: Raw sequence Flowchart: Data: Sequence quality
 

 

 

 

 

Cross

Match

 

 
Flowchart: Data: Vector database
 

 

 

 

 

 

 

 

 

 

 

Flowchart: Data: Cleaned sequence
 

 

 

 



An overly simplified ERM:

 

 

Each library is made up of individual clones

Each library is constructed using a specific vector

Each vector has a sequence

Each clone has a sequence

 

 

 

 

Libraries

 

 

Vectors

 

 

Sequences

 

 

Clones

 

 

Tissues, Vectors and Hosts = Libraries

 


Let's try to represent Libraries as a relational table:

 

A Library is:

 

A collection of all of the mRNA molecules present in a tissue, all converted into cDNA molecules then inserted into vectors (other DNA molecules which can continue to replicate after addition of foreign DNA).

 

 

Field_Name

Data_Type

Restrictions

Example

Library_id

Integer

Not null, primary key

327

Tissue

Varchar2(256)

 

Leaf

Accession

Integer

-> accession table

9 - TA496

Organism

Integer

-> taxa table

21 - L. esculentum

Developmental_Stage

Varchar2(256)

 

4 week old

Treatment

Varchar2(256)

 

Pto innoculated

Date_created

Date

 

May 11, 1999

Created_by

Integer

-> Lab table

3 - Martin

Host

Varchar2(256)

 

SOLR

Vector

Integer

-> Vector Table

7 - pBlueScript SK(-)

PFU

Integer

 

1010

Average_Insert_Size

Integer

 

350

Availability

Boolean

 

TRUE

Type

Varchar(256)

 

mRNA

Library_Name

Varchar2(256)

 

Tomato susceptible, Cornell

Comments

Varchar(256)

 

Blah blah blah

Distributed

Integer

-> Lab

 

 

 

Does that capture everything we need to represent a library?

Most of these data will come from GenBank records as well as records we set up as libraries are produced.  There aren't that many libraries right now, so the data entry part of this shouldn't be too bad.
How does this change the simple ERM?

 

Accessions               

 

Taxa

 

Lab

 

Libraries

 

 

Vectors

 

 

Sequences

 

 

Clones

 

 

Now let's work on the Vectors Table:

 


Field_Name

Data_Type

Restrictions

Example

Vector_id

Integer

Not null, primary key

7

Vector_name

Varchar2(256)

 

pBlueScript SK(-)

Antibiotic

Varchar2(256)

 

Kanomycin

Restriction_Site_1

VarChar(256)

 

EcoRI

Restriction_Site_2

VarChar(256)

 

XhoI

Vector_sequence

Long

 

ACTG….

Vector_Vendor

VarChar(256)

 

Stratagene

Comments

 

 

 

 

No foreign keys….. no new tables (yet <grin>)

These data need to be entered manually for the time being.  We'll need to create a CGI-like form for submission.

 

Now the Clones table

 

Field_Name

Data_Type

Restrictions

Example

Clone_id

Integer

Not null, primary key

7

Library_id

Integer

-> library

13

Plate_number

 

 

 

Well_number

 

 

 

Row_number

 

 

 

CU_id

 

 

 

Comments

 

 

 

 

We'll need a clone synonym table that converts TIGR names into Cornell names.

 

Field_Name

Data_Type

Restrictions

Example

Other_Clone_Name

 

 

 

Our_Clone_Name

 

 

 

Clone_id

 

 

 

 

 

 

 

 

The key here will be multiple field, probably Other_Clone_Name pl

 

Anything I missed?

The Clones table will be automatable, we'll probably have to come up with some simple scripts to lookup clone_id and get a library_id.  I don't know how automatable plate, well, and row number are for these data, but I'm sure I'll find out soon J CU_id comes from Theresa's spreadsheet, therefore it is again scriptable input.  Clone synonym should be automatable, as we have translator files now anyway

 

Another look at the simple ERM:

Accessions               

 

Taxa

 

Lab

 

Libraries

 

 

Vectors

 

 

Sequences

 

 

Clone Synonyms

 

 

Clones

 

 

The library table has a Created_by  field which I suggested be tied to a lab (or a group, or a company). 

 


Here's the infamous People table . . .

 

Field_Name

Data_Type

Restrictions

Example

People_id

 

Not null is key

 

Last_name

 

 

 

First_name

 

 

 

Title

 

 

 

Address1

 

 

 

Address2

 

 

 

Department

 

 

 

University

 

 

 

City

 

 

 

State

 

 

 

Zip

 

 

 

Country

 

 

 

Phone

 

 

 

Fax

 

 

 

e-mail

 

 

 

Homepage

 

 

 

Initials

 

 

 

Project

 

 

 

Group

 

 

Martin

Security

 

 

 

Comments

 

 

 

 

 

 

 

 

 

Is that the right way to handle labs?  OR should labs be a separate table:

 

Field_Name

Data_Type

Restrictions

Example

Group_id

Integer

Not null is key

 

Group_name

String

 

Martin

Group_leader

Integer

-> People_id

 

 

 

 

 

 

Let's discuss….

 

These data need to be entered by hand.  We'll need another web form.  Need to discuss: Oracle datatype for a running counter….

 

There's a problem with the way I've drawn these tables up, though.  Multiple people can be the "author" of a library or a sequence or a reference or any kind of data…..

 

Authorship Table:

 

Field_Name

Data_Type

Restrictions

Example

People_id

 

 

 

Library_id

 

 

 

Reference_id

 

 

 

Sequence_id

 

 

 

 

 

 

 

 

Probably more later….. but now our diagram looks a bit more complicated:

 

 

 

 

 

Accessions               

 

Taxa

 

People

 

Libraries

 

 

Vectors

 

 

Authorships               

 

Sequences

 

 

Clone Synonyms

 

 

Clones

 

 
 

 

 

 



So now it looks like it's time to start making the sequence tables.  As mentioned before (http://syntom.cit.cornell.edu/database/Jun1599.htm) sequences come from two primary sources:

 


1                     Raw sequencing data (chromatographs) generally produced outside of Cornell Theory Center

2                     FASTA formatted file - generally NCBI or internally

 

 

How are these data retrieved:

 

As a "recipe":

 

1.        Chromatographs are either downloaded via FTP or read off a CD-ROM

2.        Once the files are stored locally, phred is run on the directory containing the files

3.        Phred is the chromatograph reader and its output is two-fold, a FASTA formatted sequence for each chromatograph as well as a space delimited quality file.

4.        Cross_match is then run on the phred-derived files.  Cross match removes vector contamination from sequences.  The program requires a vector database (which is FASTA formatted) to perform the screening.  Cross_match provides FASTA formatted sequences as output.

 

 


 

 

 

 


Some notes:

 

Phred takes no parameters other than those related to input and output naming and format.

I've already gone over the GenBank fields so I won't do that again.  But, the GenBank info is important and will be kept.

 

With those notes in mind:


 

Field_Name

Data_Type

Restrictions

Example

Sequence_id

 

Is key not null

 

Organism

 

-> taxa

 

Accession

 

-> accession

 

Clone

 

-> Clone_id

 

Should I make a conversion table?

 
Chromatograph

Filename

 

 

Conversion_program

 

Default phred

 

Conversion_version

 

Default 0.980904.a

 

Conversion_platform

 

Default NT

 

Conversion_person

 

-> people_id

 

Raw_sequence

 

 

 

Quality

 

 

 

Vector_program

 

Default cross_match

 

Vector_version

 

Default 0.990319

 

Vector_platform

 

Default NT

 

Vector_person

 

-> people_id

 

Vector_database

 

???

 

 

 

 

 

 

 

How do I handle a vector database?  Well, it's just a number of vectors in a fasta file.  So, as we already have the vector table with a sequence field, therefore….

 

Vector_database table

 

Field_Name

Data_Type

Restrictions

Example

Vector_database_no

Integer

Is key not null

 

Vector_id

Integer

-> vector table

 

 

 

 

 

 

Stopping here for now.. as more info will be presented next time.

 

Andreas

June 30, 1999