SynTom Meeting Notes ALM Tuesday, June 15, 1999
Outline:
1a) Definitions and Simple Examples
Most of this comes from: The Practical SQL Handbook - Using Structured Query Language (Third Edition) by Judith S. Bowman, Sandra L. Emerson and Marcy Darnovksy.
The SynTom (Synteny in Tomato)database's goal is to be the global repository for sequence, mapping and synteny information for all of the species in the family solanaceae, with emphasis on tomato. (from
http://syntom.cit.cornell.edu/objectives.htmlIn order to achieve this goal we are Oracle 8 to build a relational database.
Oracle is a relational database management system or DBMS.
Relational databases represent all information in TABLES.
People
Name |
Address |
John Doe |
123 Spud Terrace |
Jane Smith |
867 Rockinghorse Blvd. |
Bill Clinton |
1600 Pennsylvania Avenue |
A set of related tables forms a database.
Each row describes one occurrence of an entity - a person, a sequence, a car, etc.
Each column describes one characteristic of the entity - a person's name, a sequence's GC content, a car's color, etc.
The intersection of a row and a column is a data element or value. To find the value you want, you need to know the name of the table, what column it's in and the value of the row's primary key or unique identifier.
So, to have Oracle find Bill Clinton's address, you need to tell it to look up Bill Clinton (the unique identifier) in the People table, and extract data from the column named address.
N.B. The names in my table are not in alphabetical order. The order of the rows is insignificant to the database…..
The language used to access information in relational databases is called SQL - Structured Query Language. It's officially pronounced "ess-cue-ell" but most people pronounce it "sequel". I won't talk much about SQL, but just to illustrate, here's how we'd perform the retrieval we did in the previous example.
SELECT Address
FROM People
WHERE Name = "Bill Clinton"
Oracle would then return "1600 Pennsylvania Avenue"
Here's how we'd actually make the table:
CREATE table People
(Name varchar2(40) NOT NULL PRIMARY KEY,
Address varchar2(100) NOT NULL)
TABLESPACE User_Data;
Let's suppose we had another table in our fictional database which keeps track of people's jobs:
jobs
Name |
Job |
John Doe |
Couch Potatoe |
Jane Smith |
Computer programmer |
Bill Clinton |
Tabloid fodder |
To find out what John Doe does and where he lives, we perform a JOIN (we put two tables together for the query).
SELECT Job, Address
FROM jobs, People
WHERE Name="John Doe"
The Result would be:
Job Address
-------------------------------------------------
Couch Potatoe 123 Spud Terrace
1b) Design Principles
Designing a database involves choosing:
Relational databases allow flexibilty - decisions we make now will not limit the questions we can ask later.
Steps in Database Design:
Enough introduction. . . . .
Syntom Tables: Overview
What kinds of information are going to be stored in SynTom?
People (researchers names, e-mail, lab affiliations, phone numbers, etc.)
Organisms (genus, species, cultivar, chromosome number, etc.)
Libraries (tissues, vectors, treatments, made by whom, # of clones, etc.)
Sequences (sequence, vector screened, organism, library, made by whom, BLAST results, expressed or genomic clone it came from, etc.)
Clones (where is it, what library did it come from, where is it, has it been sequenced, what kind of clone is it, etc.)
Maps (number of linkage groups, markers, bins, what markers/clones are in each bin, etc.)
Assemblies (for any library how many assemblies are there, how many sequences in each library, what method was used to create the assembly, BLAST results, etc.)
Taxonomy (how related are two organisms, how many sequences per organism, do they share sequences, etc.)
Expression ….
Similarity….
Synteny . . .
And probably a whole lot more…..
But we can begin to classify some "core" tables that we'll need. And that's how it begins. We'll start with a (hopefully) easily classifiable starting point:
Syntom Tables: people
First Name: Andreas
Last Name: Matern
Initials: ALM
E-mail:
alm13@cornell.eduAddress: 622 Rhodes Hall
City: Ithaca
State: NY
Zip: 14853
University: Cornell University
Department: Plant Breeding and Biometry
Phone: 607-254-7473
Fax: 607-255-6683
Title: Graduate Student
Group: Tanksley
Homepage:
http://syntom.cit.cornell.eduProject: SynTom
Last Updated: June 15, 1999
Comments: Limps, sleeps late….
Security: ????
Is that enough information?
What's the primary key?
Well, thanks to the Internet, we know that everyone needs a unique e-mail. So, perhaps e-mail is a good unique identifier…..
How will this be entered? Regretfully by hand…. But each user can enter their own information, so its not so bad.
How will this be useful? Well, everytime a user enters a sequence or a clone, or a map, or any other kind of data, we'd like to track who did what and have a way of contacting them…..
SynTom Tables: Libraries
This table came right out of the data we entered to GenBank…. Do we need more information?
\There's an obvious primary key: library name
There are some obvious foreign keys:
SynTom Tables: Sequences
Enough with the trivial examples, let's get down to the nitty gritty . . . sequences.
Sources of Data:
We get sequences for SynTom from a variety of sources, including:
Chromatographs which we have to process and store for later use
GenBank downloads
FASTA files (either from "old" sequencing projects for which we have no chromatographs, or "donated" sequences for which there are no chromatographs)
The best way (IMHO) to create the sequence tables, is to look at these data input files and examine what fields are there. Then we need to combine the different inputs into one (or maybe many) tables.
GENBANK:
Here's an example of a GenBank record for a tomato genomic sequence:
LOCUS AQ367761 650 bp DNA GSS 04-FEB-1999
DEFINITION toxb0002P24r CUGI Tomato BAC Library Lycopersicon esculentum
genomic clone toxb0002P24r, genomic survey sequence.
ACCESSION AQ367761
NID g4222151
VERSION AQ367761.1 GI:4222151
KEYWORDS GSS.
SOURCE tomato.
ORGANISM Lycopersicon esculentum
Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
euphyllophytes; Spermatophyta; Magnoliophyta; eudicotyledons;
Asteridae; Solananae; Solanales; Solanaceae; Solanum; Potatoe;
Lycopersicon.
REFERENCE 1 (bases 1 to 650)
AUTHORS Budiman,M.A. and Wing,R.A.
TITLE A Preliminary Analysis of Tomato BAC End Sequencing
JOURNAL Unpublished (1999)
COMMENT Tomato is a vegetable crop that ranks second only
to potatoes in value and importance. Among plant
geneticists and physiologists, tomato represents an ideal
dicot model beside Arabidopsis and monocot rice to derive
genomic information from. To facilitate the genome
analysis of tomato, we have constructed a tomato BAC
library that is suitable for positional cloning, physical
mapping, and genome sequencing. The library contains
129,000 clones and a random sampling of 498 clones
indicated an average insert size of 117.5 kb. With 15X
haploid genome equivalents (1C equals 953 Mb)
(Arumuganathan and Earle, 1991), the probability to
recover any particular sequence is greater than 99%. High
stability, large insert and ease in manipulation make BAC
libraries the choice for genome sequencing. Pre
characterization of a few hundred bases of insert ends
will make BAC clones extremely useful for rapid contig
assembly (Venter, Smith, and Hood, 1996). Here we present
the construction, characterization of the tomato BAC
library, and preliminary analysis of the 1536 tomato BAC
end sequences."
Contact: Wing RA
Clemson University Genomics Institute
Clemson University
100 Jordan Hall, Clemson, SC 29634, USA
Tel: 864 656 7288
Fax: 864 656 4293
Email: rwing@clemson.edu
Seq primer: GGAAACAGCTATGACCATG
Class: BAC ends
High quality sequence start: 29
High quality sequence stop: 526.
FEATURES Location/Qualifiers
source 1..650
/organism="Lycopersicon esculentum"
/cultivar="Heinz 1706"
/note="Vector: pBeloBAC 11; Site_1: HindIII; Site_2:
HindIII;
/db_xref="taxon:4081"
/clone="toxb0002P24r"
/clone_lib="CUGI Tomato BAC Library"
/tissue_type="Nuclei preparation from Leaf"
/lab_host="E. coli DH10B"
BASE COUNT 171 a 103 c 116 g 259 t 1 others
ORIGIN
1 attcgacacg caatctatac aggtcacact atataatact caagcttacg ttgttttagc
61 attccaactc gcataatcgt acattttgtg tacaaattct agtttgccca catcgtatca
121 tgatacaaat gtaggtaatg agaatcggca tccaatgcac tatggattga gttgagcact
181 ttagaatcag ttggtgaacc tccttatatt ctgaaggact tcttttattg tgtttttagt
241 atttttatta ttaggatgtt ctagtgtctg tcctaacatc catcttagtt ttagaagtct
301 acatatatag acagtcaaat tttagtagtt tagtggtctt tgcattttca ttcttatgtt
361 aaagacttga gtttccattt tggccaagtt gaatgtttaa atttttaaaa cattcaagtt
421 atattataat ttagttgagt tcacttcttt gatcattata gtattgattt ttttcttccg
481 ctatgtaaag ttagttagac caagggtccg ctcgaggcca acaatggtct tcgagtgtcg
541 gctatgctca gggtgctggc tcgggacgtg acattcattn ttttgtttat aattatgatg
601 ttgtgtttta caatttgtct atccatgatt atataatgtt tgaacgtttg
//
Now that's a lot of data! We obviously don't want to discard any of it, so. . . .
Let's take it apart!
I'll start by removing all those items that will be stored in other tables…. . .
Contact: Wing RA
Clemson University Genomics Institute
Clemson University
100 Jordan Hall, Clemson, SC 29634, USA
Tel: 864 656 7288
Fax: 864 656 4293
Email: rwing@clemson.edu
Well that all fits nicely into the People table I described before. . . .so we'll discard it all except for the e-mail address which is our primary key in the people table and therefore a foreign key in this table.
REFERENCE 1 (bases 1 to 650)
AUTHORS Budiman,M.A. and Wing,R.A.
TITLE A Preliminary Analysis of Tomato BAC End Sequencing
JOURNAL Unpublished (1999)
COMMENT Tomato is a vegetable crop that ranks second only
to potatoes in value and importance. Among plant
geneticists and physiologists, tomato represents an ideal
dicot model beside Arabidopsis and monocot rice to derive
genomic information from. To facilitate the genome
analysis of tomato, we have constructed a tomato BAC
library that is suitable for positional cloning, physical
mapping, and genome sequencing. The library contains
129,000 clones and a random sampling of 498 clones
indicated an average insert size of 117.5 kb. With 15X
haploid genome equivalents (1C equals 953 Mb)
(Arumuganathan and Earle, 1991), the probability to
recover any particular sequence is greater than 99%. High
stability, large insert and ease in manipulation make BAC
libraries the choice for genome sequencing. Pre
characterization of a few hundred bases of insert ends
will make BAC clones extremely useful for rapid contig
assembly (Venter, Smith, and Hood, 1996). Here we present
the construction, characterization of the tomato BAC
library, and preliminary analysis of the 1536 tomato BAC
end sequences."
There are going to be lots of references (and there are already lots of references in SolGenes) so let's make a Reference Table later. Perhaps the TITLE should be the key here….
SOURCE tomato.
ORGANISM Lycopersicon esculentum
Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
euphyllophytes; Spermatophyta; Magnoliophyta; eudicotyledons;
Asteridae; Solananae; Solanales; Solanaceae; Solanum; Potatoe;
Lycopersicon.
This looks like something for an Organism and a Taxonomy database. ORGANISM is an obvious foreign key…
/organism="Lycopersicon esculentum"
/cultivar="Heinz 1706"
/note="Vector: pBeloBAC 11; Site_1: HindIII; Site_2:
HindIII;
/db_xref="taxon:4081"
/clone="toxb0002P24r"
/clone_lib="CUGI Tomato BAC Library"
/tissue_type="Nuclei preparation from Leaf"
/lab_host="E. coli DH10B"
This goes right into the library table, it's even named (foreign key).
So what do we have left for the Sequence table (genbank edition)
LOCUS AQ367761 650 bp DNA GSS 04-FEB-1999
DEFINITION toxb0002P24r CUGI Tomato BAC Library Lycopersicon esculentum
genomic clone toxb0002P24r, genomic survey sequence.
ACCESSION AQ367761
NID g4222151
VERSION AQ367761.1 GI:4222151
KEYWORDS GSS.
Seq primer: GGAAACAGCTATGACCATG
Class: BAC ends
High quality sequence start: 29
High quality sequence stop: 526.
BASE COUNT 171 a 103 c 116 g 259 t 1 others
ORIGIN
1 attcgacacg caatctatac aggtcacact atataatact caagcttacg ttgttttagc
61 attccaactc gcataatcgt acattttgtg tacaaattct agtttgccca catcgtatca
121 tgatacaaat gtaggtaatg agaatcggca tccaatgcac tatggattga gttgagcact
181 ttagaatcag ttggtgaacc tccttatatt ctgaaggact tcttttattg tgtttttagt
241 atttttatta ttaggatgtt ctagtgtctg tcctaacatc catcttagtt ttagaagtct
301 acatatatag acagtcaaat tttagtagtt tagtggtctt tgcattttca ttcttatgtt
361 aaagacttga gtttccattt tggccaagtt gaatgtttaa atttttaaaa cattcaagtt
421 atattataat ttagttgagt tcacttcttt gatcattata gtattgattt ttttcttccg
481 ctatgtaaag ttagttagac caagggtccg ctcgaggcca acaatggtct tcgagtgtcg
541 gctatgctca gggtgctggc tcgggacgtg acattcattn ttttgtttat aattatgatg
601 ttgtgtttta caatttgtct atccatgatt atataatgtt tgaacgtttg
//
Not so bad…..
We have some fields which we can define and use as a starting point for our table:
LOCUS: the locus name (different from accession -- accessions get VERSION numbers, loci don't)
SEQUENCE_LENGTH: the length of the sequence in bp 650 bp
SEQUENCE_TYPE: DNA
SEQUENCE_SUBTYPE: GSS
DATE ENTERED: 04-FEB-1999
ACCESSION: AQ367761
NID: the only unique identifier! G4222151
VERSION: the *.1 means this is version one of this sequence, and it is "linked" to the UNIQUE GI # 4222151
KEYWORDS: a variable field of identifying keywords
Seq primer: haven't seen this much, we should talk about it later.
Class: haven't seen this much either
Sequence start: Although the submitted sequence is 650 bp, the high quality
Sequence stop: sequence is only from 29-526 bp
Base Count: obvious…. But I think we should include the IUPAC convention……
SEQUENCE: And then the FASTA formatted sequence…..
Before we overanalyze this, let's try another…. Maybe an EST. . . .
LOCUS AI637381 187 bp mRNA EST 27-APR-1999
DEFINITION DB# 320 Giant cell specific cDNA Lycopersicon esculentum cDNA clone
DB# 320 3' similar to nascent polypeptide associated complex, alpha
subunit, mRNA sequence.
ACCESSION AI637381
NID g4689615
VERSION AI637381.1 GI:4689615
KEYWORDS EST.
SOURCE tomato.
ORGANISM Lycopersicon esculentum
Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
euphyllophytes; Spermatophyta; Magnoliophyta; eudicotyledons; core
eudicots; Asteridae; euasterids I; Solanales; Solanaceae; Solanum;
Potatoe; Lycopersicon.
REFERENCE 1 (bases 1 to 187)
AUTHORS Greene,A.E., Thiery,M. and Bird,D.M.
TITLE Overlapping expression of ESTs induced by root-knot and cyst
nematodes in tomato
JOURNAL Unpublished (1999)
COMMENT
Contact: Bird,DM
Plant Nematode Genetics Group
North Carolina State University
Box 7616, Raleigh, NC 27695, USA
Tel: 919 515 6813
Fax: 919 515 9500
Email: david_bird@ncsu.edu
Seq primer: M13 forward
High quality sequence stop: 187.
FEATURES Location/Qualifiers
source 1..187
/organism="Lycopersicon esculentum"
/cultivar="Rutgers Large Red"
/note="Organ: Root; Vector: pGem 11Zf+; Site_1: Not I;
Site_2: Sfi I; Individual, root-knot induced, tomato giant
cells were dissected, and a cDNA bank constructed. This
bank was extensively subtracted against healthy root
tissue cDNA to generate the current library."
/db_xref="taxon:4081"
/clone="DB# 320"
/clone_lib="Giant cell specific cDNA"
/cell_type="Giant Cells"
/lab_host="E. coli DH12S"
BASE COUNT 58 a 36 c 53 g 40 t
ORIGIN
1 ggtgaagcaa agattgagga cttgagctca cagttgcagt ctcaagctgc agagcagttt
61 aaggctccta atctaagcaa cgttatatcg aagccagagc catcaactgt tgctcaggat
121 gacgaggacg ttgatgaaac tggcgtagaa ccaaaggata ttgagctggt gatgactcaa
181 gcaggag
//
Again, we can get rid of some stuff and be left with…
LOCUS AI637381 187 bp mRNA EST 27-APR-1999
DEFINITION DB# 320 Giant cell specific cDNA Lycopersicon esculentum cDNA clone
DB# 320 3' similar to nascent polypeptide associated complex, alpha
subunit, mRNA sequence.
ACCESSION AI637381
NID g4689615
VERSION AI637381.1 GI:4689615
KEYWORDS EST.
High quality sequence stop: 187.
BASE COUNT 58 a 36 c 53 g 40 t
ORIGIN
1 ggtgaagcaa agattgagga cttgagctca cagttgcagt ctcaagctgc agagcagttt
61 aaggctccta atctaagcaa cgttatatcg aagccagagc catcaactgt tgctcaggat
121 gacgaggacg ttgatgaaac tggcgtagaa ccaaaggata ttgagctggt gatgactcaa
181 gcaggag
That fits our previous model, some omissions of data, but that's OK.
LOCUS AF050497 5114 bp RNA PLN 04-JUN-1999
DEFINITION Lycopersicon esculentum Ca2+-ATPase precursor RNA sequence.
ACCESSION AF050497
NID g4206314
VERSION AF050497.1 GI:4206314
KEYWORDS .
SOURCE tomato.
ORGANISM Lycopersicon esculentum
Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
euphyllophytes; Spermatophyta; Magnoliophyta; eudicotyledons; core
eudicots; Asteridae; euasterids I; Solanales; Solanaceae; Solanum;
Potatoe; Lycopersicon.
REFERENCE 1 (bases 1 to 5114)
AUTHORS Navarro-Avino,J.P., Hentzen,A.E. and Bennett,A.B.
TITLE Alternate transcription initiation sites generate two LCA1
Ca2+-ATPase mRNA transcripts in tomato
JOURNAL Plant Mol. Biol. 40 (1), 133-140 (1999)
REFERENCE 2 (bases 1 to 5114)
AUTHORS Hentzen,A.E.
TITLE Direct Submission
JOURNAL Submitted (24-FEB-1998) Medical Technology, University of Wyoming,
1233 East Second St., Casper, WY 82601, USA
FEATURES Location/Qualifiers
source 1..5114
/organism="Lycopersicon esculentum"
/db_xref="taxon:4081"
precursor_RNA 1..5114
/gene="Ca2+-ATPase"
/note="alternative transcript"
gene 1..5114
/gene="Ca2+-ATPase"
/note="LCA1C"
intron 2612..3211
/gene="Ca2+-ATPase"
BASE COUNT 1401 a 868 c 1134 g 1711 t
ORIGIN
1 tgttgcttcc ttgcaattgc taaagtttca atcttttttc ttttattttt tcatgtgggt
61 actgtattaa tgacatgatc tgttttgatt cttctttatg tgaccgttat tcttgttgta
4981 tcctttctag gaggggtttc atgtattata agttctagtt tgtattattc atttattgaa
5041 tctgctatct taccgtagat tttgatgaac ttaaaacaca cctttttcta aaaaaaaaaa
5101 aaaaaaaaaa aaaa
//
After pushing data to other tables we get…
LOCUS AF050497 5114 bp RNA PLN 04-JUN-1999
DEFINITION Lycopersicon esculentum Ca2+-ATPase precursor RNA sequence.
ACCESSION AF050497
NID g4206314
VERSION AF050497.1 GI:4206314
KEYWORDS .
FEATURES Location/Qualifiers
source 1..5114
/organism="Lycopersicon esculentum"
/db_xref="taxon:4081"
precursor_RNA 1..5114
/gene="Ca2+-ATPase"
/note="alternative transcript"
gene 1..5114
/gene="Ca2+-ATPase"
/note="LCA1C"
intron 2612..3211
/gene="Ca2+-ATPase"
BASE COUNT 1401 a 868 c 1134 g 1711 t
ORIGIN
1 tgttgcttcc ttgcaattgc taaagtttca atcttttttc ttttattttt tcatgtgggt
61 actgtattaa tgacatgatc tgttttgatt cttctttatg tgaccgttat tcttgttgta
4981 tcctttctag gaggggtttc atgtattata agttctagtt tgtattattc atttattgaa
5041 tctgctatct taccgtagat tttgatgaac ttaaaacaca cctttttcta aaaaaaaaaa
5101 aaaaaaaaaa aaaa
Looks like there's some markup for genes and introns…..don't want to lose that….
In general the FEATURES subdivision of a GenBank record is a real pain. . . . maybe it should be its own table…..
If that is true, let's look at what the basic Sequence Table for SynTom: looks like, only incorporating GenBank….
LOCUS: the locus name (different from accession -- accessions get VERSION numbers, loci don't)
SEQUENCE_LENGTH: the length of the sequence in bp 650 bp
SEQUENCE_TYPE: DNA
SEQUENCE_SUBTYPE: GSS
DATE ENTERED: 04-FEB-1999
ACCESSION: AQ367761
NID: the only unique identifier! G4222151
VERSION: the *.1 means this is version one of this sequence, and it is "linked" to the UNIQUE GI # 4222151
KEYWORDS: a variable field of identifying keywords
Base Count: obvious…. But I think we should include the IUPAC convention……
SEQUENCE: And then the FASTA formatted sequence…..
We'll get back to Features later, for now this let's us import all the GenBank data into SynTom…
How else do sequences get into SynTom?
Sometimes I get sequences which are just FASTA formatted, with very little additional information…..
>|60_ta56/lv121f.Seq|Fri, Mar 6, 1998 |(Source TMF)|
AGTGGGGGGACACAAAGGCAAGGCAACATGCCCAATTGACACATTGAAAC
TTGGGGCTTGTGTGGATCTTTTAGGTGGGCTTGTTCATATTGGGCTTGGT
GACCCAGCTGTTAATGAATGTTGTCCAATACTTAGTGGGCTTGTTGAACT
TGAAGCTGCTGCTTGCCTTTGCACAACACTTAAAGTCAAATTACTTAACC
TATTCAAAAAAGAATAAGATACGAAATTTGTGGTAATCTATCGCTAAATG
AACATATCTAATATAATTTATTTTTTTTAAAAAATCCGTAGTAACCTAGG
ATTANCTACGAATTTTATGGTAANTAAACCTATTANTAANTTAAAATCTG
GATAAAANNGCAGGTTTCATTCTGAAATTTGAAAAAAATGCATGGGTTTA
CATGNAAAAGGAGGNCCAATTTATGAATTAATTAAATTCTTTCCNCCCNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNANNAAGNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNN
Not a lot of information that I can glean from the file itself, I have to go back to the source (in this case Theresa) and fill in as many of the table fields as possible. People generally don't include a lot of detail on the "title" line of a FASTA file…..But there's nothing new either, so everything that's in these types of files already exists in the table schema we have working…. However, we do need to come up with a GID of our own -- a unique key to seaparate one sequence from another. Since there are so few files that are in this format, the NAME of the file should be IMHO the NID.
Another way in which sequence data comes into SynTom is through chromatographs…..
Chromatographs are binary (raw) data which we then convert into sequence using a number of different programs….
Here's how I think we should handle these types of sequences…..
CHROMATOGRAPH: filename of the chromatograph
CONVERSION_PROGRAM: Phred
CONVERSION_DATE: May 4, 1999
CONVERSION_VERSION: 0.960108
CONVERSION_PERSON: Maura Hart (or maybe e-mail for a foreign key…)
CONVERSION_PLATFORM: NT
RAW_SEQUENCE: the sequence (output from phred)
RAW_QUALITY: the quality (output from phred)
SEQUENCE_LENTGH: length of the raw sequence in bp
VECTOR_PROGRAM: cross_match
VECTOR_VERSION: 0.990319
VECTOR_PERSON: Maura Hart
VECTOR_PLATFORM: NT
VECTOR_DATABASE: vector.seq (another foreign key - vector.seq is a collection of vector sequences which should be stored in the database as a screen library table)
CLEAN_SEQUENCE: exported from cross_match
SEQUENCE_START: The starting bp of the Raw sequence considered "good"
SEQUENCE_END: The last bp of the Raw sequence considered "good"
And there we have it. Let's put it all together and look at where we will have Keys, and where other tables need to be created:
LOCUS: the locus name if applicable
SEQUENCE_TYPE: DNA
SEQUENCE_SUBTYPE: GSS
DATE ENTERED: 04-FEB-1999
ACCESSION: AQ367761 if applicable
NID: G4222151 (the PRIMARY KEY for the Sequence Table)
VERSION: AQ367761.1
KEYWORDS: a variable field of identifying keywords
Base Count:
CONTACT: A FOREIGN KEY to the People Table
REFERENCE: A FOREIGN KEY to the References Table
Library: A FOREIGN KEY to the Library Table
Organism: Since not every seqeunce comes from a library, we'll have to state what organism some come from FOREIGN KEY to the Taxonomy Table
CHROMATOGRAPH: filename of the chromatograph
CONVERSION_PROGRAM: Phred
CONVERSION_VERSION: 0.960108
CONVERSION_PERSON: Maura Hart (or maybe e-mail for a foreign key…)
CONVERSION_PLATFORM: NT
RAW_SEQUENCE: the sequence (output from phred)
RAW_QUALITY: the quality (output from phred)
SEQUENCE_LENTGH: length of the raw sequence in bp
VECTOR_PROGRAM: cross_match
VECTOR_VERSION: 0.990319
VECTOR_PERSON: Maura Hart
VECTOR_PLATFORM: NT
VECTOR_DATABASE: vector.seq (another foreign key - vector.seq is a collection of vector sequences which should be stored in the database as a screen library table)
CLEAN_SEQUENCE: exported from cross_match
SEQUENCE_START: The starting bp of the Raw sequence considered "good"
SEQUENCE_END: The last bp of the Raw sequence considered "good"
Now some things I haven't discussed yet:
SECURITY: Here's where we'll have to deal with which seqeunces are avaialble to whom..
CONTIG: a FOREIGN KEY to the assembly# this sequence belongs to
CLONE: a FOREIGN KEY to the clone table, many of our sequences are partial sequences (ESTs) from various CLONES
BLAST: BLAST results will be handled differently. . . the BLAST results will be parsed into the database as BLAST tables each query and subject will be identified by its NID, so looking up what is similar to what is a matter of SQL querying the BLAST tables with pointers back to the sequence tables…..
Primers: Primers in the database needs more discussion. Not only do we need to keep track of the primer sequences, we need to keep track of the primer's name, who ordered it, where it is, etc. Looks like we need a Primer table.….
PRIMER_LEFT_START:
PRIMER_LEFT_END:
PRIMER_RIGHT_START:
PRIMER_RIGHT_END: