SynTom Meeting Notes ALM Tuesday, June 15, 1999

 

Outline:

  1. Introduction
    1. Definitions and Simple Examples
    2. Design Principles
  1. Syntom Tables:
    a) overview
    1. people
    2. libraries
    3. sequences
      1. sources
      2. examples
      3. definitions
  1. Wrap-Up

 

1a) Definitions and Simple Examples

Most of this comes from: The Practical SQL Handbook - Using Structured Query Language (Third Edition) by Judith S. Bowman, Sandra L. Emerson and Marcy Darnovksy.

The SynTom (Synteny in Tomato)database's goal is to be the global repository for sequence, mapping and synteny information for all of the species in the family solanaceae, with emphasis on tomato. (from http://syntom.cit.cornell.edu/objectives.html

In order to achieve this goal we are Oracle 8 to build a relational database.

Oracle is a relational database management system or DBMS.

Relational databases represent all information in TABLES.

 

 

People

Name

Address

John Doe

123 Spud Terrace

Jane Smith

867 Rockinghorse Blvd.

Bill Clinton

1600 Pennsylvania Avenue

 

 

 

 

 

 

 

A set of related tables forms a database.

Each row describes one occurrence of an entity - a person, a sequence, a car, etc.

Each column describes one characteristic of the entity - a person's name, a sequence's GC content, a car's color, etc.

The intersection of a row and a column is a data element or value. To find the value you want, you need to know the name of the table, what column it's in and the value of the row's primary key or unique identifier.

So, to have Oracle find Bill Clinton's address, you need to tell it to look up Bill Clinton (the unique identifier) in the People table, and extract data from the column named address.

N.B. The names in my table are not in alphabetical order. The order of the rows is insignificant to the database…..

 

The language used to access information in relational databases is called SQL - Structured Query Language. It's officially pronounced "ess-cue-ell" but most people pronounce it "sequel". I won't talk much about SQL, but just to illustrate, here's how we'd perform the retrieval we did in the previous example.

SELECT Address

FROM People

WHERE Name = "Bill Clinton"

Oracle would then return "1600 Pennsylvania Avenue"

Here's how we'd actually make the table:

CREATE table People

(Name varchar2(40) NOT NULL PRIMARY KEY,

Address varchar2(100) NOT NULL)

TABLESPACE User_Data;

Let's suppose we had another table in our fictional database which keeps track of people's jobs:

 

 

 

jobs

Name

Job

John Doe

Couch Potatoe

Jane Smith

Computer programmer

Bill Clinton

Tabloid fodder

 

To find out what John Doe does and where he lives, we perform a JOIN (we put two tables together for the query).

SELECT Job, Address

FROM jobs, People

WHERE Name="John Doe"

The Result would be:

 

Job Address

-------------------------------------------------

Couch Potatoe 123 Spud Terrace

 

 

 

1b) Design Principles

Designing a database involves choosing:

  1. The tables that belong in the database
  2. The columns that belong in each table
  3. How tables and columns interact

Relational databases allow flexibilty - decisions we make now will not limit the questions we can ask later.

Steps in Database Design:

  1. Think about the information in the database.
    1. Where does it come from?
    2. What format is it in?
    3. How will it be entered?
    4. How often will it change?
  1. List the "things" or entities with their properties and attributes
  2. Locate unique identifiers or primary keys for each entity
  3. Consider relationships between the entities
    1. are they one to many (each book has one publisher)
    2. or many to many (an author can write multiple books and a book can have multiple authors
    3. can the data in one proposed table be joined to data in related tables?
  1. Create the database, experiment with some reports and queries
  2. Start all over again

 

Enough introduction. . . . .

Syntom Tables: Overview

What kinds of information are going to be stored in SynTom?

People (researchers names, e-mail, lab affiliations, phone numbers, etc.)

Organisms (genus, species, cultivar, chromosome number, etc.)

Libraries (tissues, vectors, treatments, made by whom, # of clones, etc.)

Sequences (sequence, vector screened, organism, library, made by whom, BLAST results, expressed or genomic clone it came from, etc.)

Clones (where is it, what library did it come from, where is it, has it been sequenced, what kind of clone is it, etc.)

Maps (number of linkage groups, markers, bins, what markers/clones are in each bin, etc.)

Assemblies (for any library how many assemblies are there, how many sequences in each library, what method was used to create the assembly, BLAST results, etc.)

Taxonomy (how related are two organisms, how many sequences per organism, do they share sequences, etc.)

Expression ….

Similarity….

Synteny . . .

And probably a whole lot more…..

But we can begin to classify some "core" tables that we'll need. And that's how it begins. We'll start with a (hopefully) easily classifiable starting point:

Syntom Tables: people

First Name: Andreas

Last Name: Matern

Initials: ALM

E-mail: alm13@cornell.edu

Address: 622 Rhodes Hall

City: Ithaca

State: NY

Zip: 14853

University: Cornell University

Department: Plant Breeding and Biometry

Phone: 607-254-7473

Fax: 607-255-6683

Title: Graduate Student

Group: Tanksley

Homepage: http://syntom.cit.cornell.edu

Project: SynTom

Last Updated: June 15, 1999

Comments: Limps, sleeps late….

Security: ????

Is that enough information?

What's the primary key?

Well, thanks to the Internet, we know that everyone needs a unique e-mail. So, perhaps e-mail is a good unique identifier…..

How will this be entered? Regretfully by hand…. But each user can enter their own information, so its not so bad.

How will this be useful? Well, everytime a user enters a sequence or a clone, or a map, or any other kind of data, we'd like to track who did what and have a way of contacting them…..

 

SynTom Tables: Libraries

  1. Organism (Lycopersicon esculentum)
  2. Cultivar (TA496)
  3. Tissue (tomato ovary)
  4. who created the library (Alcala)
  5. when was the library created (January 8, 1999)
  6. who did the sequencing (TIGR)
  7. when was the sequencing done (March 15, 1999)
  8. vector (pBluescript SK(-))
  9. restriction site 1 (EcoR1)
  10. restriction site 2 (Xho1)
  11. developmental stage (5 days pre-anthesis to 5 days post-anthesis)
  12. host (XL1-Blue MRF')
  13. library name (tomato ovary, TAMU)

This table came right out of the data we entered to GenBank…. Do we need more information?

\There's an obvious primary key: library name

There are some obvious foreign keys:

  1. who created the library a link to the people table, although we'll probably have to change Alcala to Alcala's e-mail address….
  2. Who did the sequencing: looks like we need a Lab table. . . .
  3. Organism, a foreign key which will join this table with the Organism and the Taxonomy tables
  4. Cultivar, ibid

 

SynTom Tables: Sequences

Enough with the trivial examples, let's get down to the nitty gritty . . . sequences.

 

Sources of Data:

We get sequences for SynTom from a variety of sources, including:

Chromatographs which we have to process and store for later use

GenBank downloads

FASTA files (either from "old" sequencing projects for which we have no chromatographs, or "donated" sequences for which there are no chromatographs)

The best way (IMHO) to create the sequence tables, is to look at these data input files and examine what fields are there. Then we need to combine the different inputs into one (or maybe many) tables.

 

GENBANK:

Here's an example of a GenBank record for a tomato genomic sequence:

LOCUS AQ367761 650 bp DNA GSS 04-FEB-1999

DEFINITION toxb0002P24r CUGI Tomato BAC Library Lycopersicon esculentum

genomic clone toxb0002P24r, genomic survey sequence.

ACCESSION AQ367761

NID g4222151

VERSION AQ367761.1 GI:4222151

KEYWORDS GSS.

SOURCE tomato.

ORGANISM Lycopersicon esculentum

Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;

euphyllophytes; Spermatophyta; Magnoliophyta; eudicotyledons;

Asteridae; Solananae; Solanales; Solanaceae; Solanum; Potatoe;

Lycopersicon.

REFERENCE 1 (bases 1 to 650)

AUTHORS Budiman,M.A. and Wing,R.A.

TITLE A Preliminary Analysis of Tomato BAC End Sequencing

JOURNAL Unpublished (1999)

COMMENT Tomato is a vegetable crop that ranks second only

to potatoes in value and importance. Among plant

geneticists and physiologists, tomato represents an ideal

dicot model beside Arabidopsis and monocot rice to derive

genomic information from. To facilitate the genome

analysis of tomato, we have constructed a tomato BAC

library that is suitable for positional cloning, physical

mapping, and genome sequencing. The library contains

129,000 clones and a random sampling of 498 clones

indicated an average insert size of 117.5 kb. With 15X

haploid genome equivalents (1C equals 953 Mb)

(Arumuganathan and Earle, 1991), the probability to

recover any particular sequence is greater than 99%. High

stability, large insert and ease in manipulation make BAC

libraries the choice for genome sequencing. Pre

characterization of a few hundred bases of insert ends

will make BAC clones extremely useful for rapid contig

assembly (Venter, Smith, and Hood, 1996). Here we present

the construction, characterization of the tomato BAC

library, and preliminary analysis of the 1536 tomato BAC

end sequences."

Contact: Wing RA

Clemson University Genomics Institute

Clemson University

100 Jordan Hall, Clemson, SC 29634, USA

Tel: 864 656 7288

Fax: 864 656 4293

Email: rwing@clemson.edu

Seq primer: GGAAACAGCTATGACCATG

Class: BAC ends

High quality sequence start: 29

High quality sequence stop: 526.

FEATURES Location/Qualifiers

source 1..650

/organism="Lycopersicon esculentum"

/cultivar="Heinz 1706"

/note="Vector: pBeloBAC 11; Site_1: HindIII; Site_2:

HindIII;

/db_xref="taxon:4081"

/clone="toxb0002P24r"

/clone_lib="CUGI Tomato BAC Library"

/tissue_type="Nuclei preparation from Leaf"

/lab_host="E. coli DH10B"

BASE COUNT 171 a 103 c 116 g 259 t 1 others

ORIGIN

1 attcgacacg caatctatac aggtcacact atataatact caagcttacg ttgttttagc

61 attccaactc gcataatcgt acattttgtg tacaaattct agtttgccca catcgtatca

121 tgatacaaat gtaggtaatg agaatcggca tccaatgcac tatggattga gttgagcact

181 ttagaatcag ttggtgaacc tccttatatt ctgaaggact tcttttattg tgtttttagt

241 atttttatta ttaggatgtt ctagtgtctg tcctaacatc catcttagtt ttagaagtct

301 acatatatag acagtcaaat tttagtagtt tagtggtctt tgcattttca ttcttatgtt

361 aaagacttga gtttccattt tggccaagtt gaatgtttaa atttttaaaa cattcaagtt

421 atattataat ttagttgagt tcacttcttt gatcattata gtattgattt ttttcttccg

481 ctatgtaaag ttagttagac caagggtccg ctcgaggcca acaatggtct tcgagtgtcg

541 gctatgctca gggtgctggc tcgggacgtg acattcattn ttttgtttat aattatgatg

601 ttgtgtttta caatttgtct atccatgatt atataatgtt tgaacgtttg

//

Now that's a lot of data! We obviously don't want to discard any of it, so. . . .

Let's take it apart!

I'll start by removing all those items that will be stored in other tables…. . .

Contact: Wing RA

Clemson University Genomics Institute

Clemson University

100 Jordan Hall, Clemson, SC 29634, USA

Tel: 864 656 7288

Fax: 864 656 4293

Email: rwing@clemson.edu

Well that all fits nicely into the People table I described before. . . .so we'll discard it all except for the e-mail address which is our primary key in the people table and therefore a foreign key in this table.

REFERENCE 1 (bases 1 to 650)

AUTHORS Budiman,M.A. and Wing,R.A.

TITLE A Preliminary Analysis of Tomato BAC End Sequencing

JOURNAL Unpublished (1999)

COMMENT Tomato is a vegetable crop that ranks second only

to potatoes in value and importance. Among plant

geneticists and physiologists, tomato represents an ideal

dicot model beside Arabidopsis and monocot rice to derive

genomic information from. To facilitate the genome

analysis of tomato, we have constructed a tomato BAC

library that is suitable for positional cloning, physical

mapping, and genome sequencing. The library contains

129,000 clones and a random sampling of 498 clones

indicated an average insert size of 117.5 kb. With 15X

haploid genome equivalents (1C equals 953 Mb)

(Arumuganathan and Earle, 1991), the probability to

recover any particular sequence is greater than 99%. High

stability, large insert and ease in manipulation make BAC

libraries the choice for genome sequencing. Pre

characterization of a few hundred bases of insert ends

will make BAC clones extremely useful for rapid contig

assembly (Venter, Smith, and Hood, 1996). Here we present

the construction, characterization of the tomato BAC

library, and preliminary analysis of the 1536 tomato BAC

end sequences."

There are going to be lots of references (and there are already lots of references in SolGenes) so let's make a Reference Table later. Perhaps the TITLE should be the key here….

 

SOURCE tomato.

ORGANISM Lycopersicon esculentum

Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;

euphyllophytes; Spermatophyta; Magnoliophyta; eudicotyledons;

Asteridae; Solananae; Solanales; Solanaceae; Solanum; Potatoe;

Lycopersicon.

This looks like something for an Organism and a Taxonomy database. ORGANISM is an obvious foreign key…

/organism="Lycopersicon esculentum"

/cultivar="Heinz 1706"

/note="Vector: pBeloBAC 11; Site_1: HindIII; Site_2:

HindIII;

/db_xref="taxon:4081"

/clone="toxb0002P24r"

/clone_lib="CUGI Tomato BAC Library"

/tissue_type="Nuclei preparation from Leaf"

/lab_host="E. coli DH10B"

This goes right into the library table, it's even named (foreign key).

So what do we have left for the Sequence table (genbank edition)

LOCUS AQ367761 650 bp DNA GSS 04-FEB-1999

DEFINITION toxb0002P24r CUGI Tomato BAC Library Lycopersicon esculentum

genomic clone toxb0002P24r, genomic survey sequence.

ACCESSION AQ367761

NID g4222151

VERSION AQ367761.1 GI:4222151

KEYWORDS GSS.

Seq primer: GGAAACAGCTATGACCATG

Class: BAC ends

High quality sequence start: 29

High quality sequence stop: 526.

BASE COUNT 171 a 103 c 116 g 259 t 1 others

ORIGIN

1 attcgacacg caatctatac aggtcacact atataatact caagcttacg ttgttttagc

61 attccaactc gcataatcgt acattttgtg tacaaattct agtttgccca catcgtatca

121 tgatacaaat gtaggtaatg agaatcggca tccaatgcac tatggattga gttgagcact

181 ttagaatcag ttggtgaacc tccttatatt ctgaaggact tcttttattg tgtttttagt

241 atttttatta ttaggatgtt ctagtgtctg tcctaacatc catcttagtt ttagaagtct

301 acatatatag acagtcaaat tttagtagtt tagtggtctt tgcattttca ttcttatgtt

361 aaagacttga gtttccattt tggccaagtt gaatgtttaa atttttaaaa cattcaagtt

421 atattataat ttagttgagt tcacttcttt gatcattata gtattgattt ttttcttccg

481 ctatgtaaag ttagttagac caagggtccg ctcgaggcca acaatggtct tcgagtgtcg

541 gctatgctca gggtgctggc tcgggacgtg acattcattn ttttgtttat aattatgatg

601 ttgtgtttta caatttgtct atccatgatt atataatgtt tgaacgtttg

//

Not so bad…..

We have some fields which we can define and use as a starting point for our table:

LOCUS: the locus name (different from accession -- accessions get VERSION numbers, loci don't)

SEQUENCE_LENGTH: the length of the sequence in bp 650 bp

SEQUENCE_TYPE: DNA

SEQUENCE_SUBTYPE: GSS

DATE ENTERED: 04-FEB-1999

ACCESSION: AQ367761

NID: the only unique identifier! G4222151

VERSION: the *.1 means this is version one of this sequence, and it is "linked" to the UNIQUE GI # 4222151

KEYWORDS: a variable field of identifying keywords

Seq primer: haven't seen this much, we should talk about it later.

Class: haven't seen this much either

Sequence start: Although the submitted sequence is 650 bp, the high quality

Sequence stop: sequence is only from 29-526 bp

Base Count: obvious…. But I think we should include the IUPAC convention……

SEQUENCE: And then the FASTA formatted sequence…..

Before we overanalyze this, let's try another…. Maybe an EST. . . .

LOCUS AI637381 187 bp mRNA EST 27-APR-1999

DEFINITION DB# 320 Giant cell specific cDNA Lycopersicon esculentum cDNA clone

DB# 320 3' similar to nascent polypeptide associated complex, alpha

subunit, mRNA sequence.

ACCESSION AI637381

NID g4689615

VERSION AI637381.1 GI:4689615

KEYWORDS EST.

SOURCE tomato.

ORGANISM Lycopersicon esculentum

Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;

euphyllophytes; Spermatophyta; Magnoliophyta; eudicotyledons; core

eudicots; Asteridae; euasterids I; Solanales; Solanaceae; Solanum;

Potatoe; Lycopersicon.

REFERENCE 1 (bases 1 to 187)

AUTHORS Greene,A.E., Thiery,M. and Bird,D.M.

TITLE Overlapping expression of ESTs induced by root-knot and cyst

nematodes in tomato

JOURNAL Unpublished (1999)

COMMENT

Contact: Bird,DM

Plant Nematode Genetics Group

North Carolina State University

Box 7616, Raleigh, NC 27695, USA

Tel: 919 515 6813

Fax: 919 515 9500

Email: david_bird@ncsu.edu

Seq primer: M13 forward

High quality sequence stop: 187.

FEATURES Location/Qualifiers

source 1..187

/organism="Lycopersicon esculentum"

/cultivar="Rutgers Large Red"

/note="Organ: Root; Vector: pGem 11Zf+; Site_1: Not I;

Site_2: Sfi I; Individual, root-knot induced, tomato giant

cells were dissected, and a cDNA bank constructed. This

bank was extensively subtracted against healthy root

tissue cDNA to generate the current library."

/db_xref="taxon:4081"

/clone="DB# 320"

/clone_lib="Giant cell specific cDNA"

/cell_type="Giant Cells"

/lab_host="E. coli DH12S"

BASE COUNT 58 a 36 c 53 g 40 t

ORIGIN

1 ggtgaagcaa agattgagga cttgagctca cagttgcagt ctcaagctgc agagcagttt

61 aaggctccta atctaagcaa cgttatatcg aagccagagc catcaactgt tgctcaggat

121 gacgaggacg ttgatgaaac tggcgtagaa ccaaaggata ttgagctggt gatgactcaa

181 gcaggag

//

 

Again, we can get rid of some stuff and be left with…

LOCUS AI637381 187 bp mRNA EST 27-APR-1999

DEFINITION DB# 320 Giant cell specific cDNA Lycopersicon esculentum cDNA clone

DB# 320 3' similar to nascent polypeptide associated complex, alpha

subunit, mRNA sequence.

ACCESSION AI637381

NID g4689615

VERSION AI637381.1 GI:4689615

KEYWORDS EST.

High quality sequence stop: 187.

BASE COUNT 58 a 36 c 53 g 40 t

ORIGIN

1 ggtgaagcaa agattgagga cttgagctca cagttgcagt ctcaagctgc agagcagttt

61 aaggctccta atctaagcaa cgttatatcg aagccagagc catcaactgt tgctcaggat

121 gacgaggacg ttgatgaaac tggcgtagaa ccaaaggata ttgagctggt gatgactcaa

181 gcaggag

That fits our previous model, some omissions of data, but that's OK.

LOCUS AF050497 5114 bp RNA PLN 04-JUN-1999

DEFINITION Lycopersicon esculentum Ca2+-ATPase precursor RNA sequence.

ACCESSION AF050497

NID g4206314

VERSION AF050497.1 GI:4206314

KEYWORDS .

SOURCE tomato.

ORGANISM Lycopersicon esculentum

Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;

euphyllophytes; Spermatophyta; Magnoliophyta; eudicotyledons; core

eudicots; Asteridae; euasterids I; Solanales; Solanaceae; Solanum;

Potatoe; Lycopersicon.

REFERENCE 1 (bases 1 to 5114)

AUTHORS Navarro-Avino,J.P., Hentzen,A.E. and Bennett,A.B.

TITLE Alternate transcription initiation sites generate two LCA1

Ca2+-ATPase mRNA transcripts in tomato

JOURNAL Plant Mol. Biol. 40 (1), 133-140 (1999)

REFERENCE 2 (bases 1 to 5114)

AUTHORS Hentzen,A.E.

TITLE Direct Submission

JOURNAL Submitted (24-FEB-1998) Medical Technology, University of Wyoming,

1233 East Second St., Casper, WY 82601, USA

FEATURES Location/Qualifiers

source 1..5114

/organism="Lycopersicon esculentum"

/db_xref="taxon:4081"

precursor_RNA 1..5114

/gene="Ca2+-ATPase"

/note="alternative transcript"

gene 1..5114

/gene="Ca2+-ATPase"

/note="LCA1C"

intron 2612..3211

/gene="Ca2+-ATPase"

BASE COUNT 1401 a 868 c 1134 g 1711 t

ORIGIN

1 tgttgcttcc ttgcaattgc taaagtttca atcttttttc ttttattttt tcatgtgggt

61 actgtattaa tgacatgatc tgttttgatt cttctttatg tgaccgttat tcttgttgta

4981 tcctttctag gaggggtttc atgtattata agttctagtt tgtattattc atttattgaa

5041 tctgctatct taccgtagat tttgatgaac ttaaaacaca cctttttcta aaaaaaaaaa

5101 aaaaaaaaaa aaaa

//

 

After pushing data to other tables we get…

LOCUS AF050497 5114 bp RNA PLN 04-JUN-1999

DEFINITION Lycopersicon esculentum Ca2+-ATPase precursor RNA sequence.

ACCESSION AF050497

NID g4206314

VERSION AF050497.1 GI:4206314

KEYWORDS .

FEATURES Location/Qualifiers

source 1..5114

/organism="Lycopersicon esculentum"

/db_xref="taxon:4081"

precursor_RNA 1..5114

/gene="Ca2+-ATPase"

/note="alternative transcript"

gene 1..5114

/gene="Ca2+-ATPase"

/note="LCA1C"

intron 2612..3211

/gene="Ca2+-ATPase"

BASE COUNT 1401 a 868 c 1134 g 1711 t

ORIGIN

1 tgttgcttcc ttgcaattgc taaagtttca atcttttttc ttttattttt tcatgtgggt

61 actgtattaa tgacatgatc tgttttgatt cttctttatg tgaccgttat tcttgttgta

4981 tcctttctag gaggggtttc atgtattata agttctagtt tgtattattc atttattgaa

5041 tctgctatct taccgtagat tttgatgaac ttaaaacaca cctttttcta aaaaaaaaaa

5101 aaaaaaaaaa aaaa

Looks like there's some markup for genes and introns…..don't want to lose that….

In general the FEATURES subdivision of a GenBank record is a real pain. . . . maybe it should be its own table…..

If that is true, let's look at what the basic Sequence Table for SynTom: looks like, only incorporating GenBank….

LOCUS: the locus name (different from accession -- accessions get VERSION numbers, loci don't)

SEQUENCE_LENGTH: the length of the sequence in bp 650 bp

SEQUENCE_TYPE: DNA

SEQUENCE_SUBTYPE: GSS

DATE ENTERED: 04-FEB-1999

ACCESSION: AQ367761

NID: the only unique identifier! G4222151

VERSION: the *.1 means this is version one of this sequence, and it is "linked" to the UNIQUE GI # 4222151

KEYWORDS: a variable field of identifying keywords

Base Count: obvious…. But I think we should include the IUPAC convention……

SEQUENCE: And then the FASTA formatted sequence…..

We'll get back to Features later, for now this let's us import all the GenBank data into SynTom…

How else do sequences get into SynTom?

Sometimes I get sequences which are just FASTA formatted, with very little additional information…..

>|60_ta56/lv121f.Seq|Fri, Mar 6, 1998 |(Source TMF)|

AGTGGGGGGACACAAAGGCAAGGCAACATGCCCAATTGACACATTGAAAC

TTGGGGCTTGTGTGGATCTTTTAGGTGGGCTTGTTCATATTGGGCTTGGT

GACCCAGCTGTTAATGAATGTTGTCCAATACTTAGTGGGCTTGTTGAACT

TGAAGCTGCTGCTTGCCTTTGCACAACACTTAAAGTCAAATTACTTAACC

TATTCAAAAAAGAATAAGATACGAAATTTGTGGTAATCTATCGCTAAATG

AACATATCTAATATAATTTATTTTTTTTAAAAAATCCGTAGTAACCTAGG

ATTANCTACGAATTTTATGGTAANTAAACCTATTANTAANTTAAAATCTG

GATAAAANNGCAGGTTTCATTCTGAAATTTGAAAAAAATGCATGGGTTTA

CATGNAAAAGGAGGNCCAATTTATGAATTAATTAAATTCTTTCCNCCCNN

NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNANNAAGNNNNNNNNNNNNNN

NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

NNNNNNNNNNNN

Not a lot of information that I can glean from the file itself, I have to go back to the source (in this case Theresa) and fill in as many of the table fields as possible. People generally don't include a lot of detail on the "title" line of a FASTA file…..But there's nothing new either, so everything that's in these types of files already exists in the table schema we have working…. However, we do need to come up with a GID of our own -- a unique key to seaparate one sequence from another. Since there are so few files that are in this format, the NAME of the file should be IMHO the NID.

Another way in which sequence data comes into SynTom is through chromatographs…..

 

 

Chromatographs are binary (raw) data which we then convert into sequence using a number of different programs….

Here's how I think we should handle these types of sequences…..

CHROMATOGRAPH: filename of the chromatograph

CONVERSION_PROGRAM: Phred

CONVERSION_DATE: May 4, 1999

CONVERSION_VERSION: 0.960108

CONVERSION_PERSON: Maura Hart (or maybe e-mail for a foreign key…)

CONVERSION_PLATFORM: NT

RAW_SEQUENCE: the sequence (output from phred)

RAW_QUALITY: the quality (output from phred)

SEQUENCE_LENTGH: length of the raw sequence in bp

VECTOR_PROGRAM: cross_match

VECTOR_VERSION: 0.990319

VECTOR_PERSON: Maura Hart

VECTOR_PLATFORM: NT
VECTOR_DATABASE: vector.seq (another foreign key - vector.seq is a collection of vector sequences which should be stored in the database as a screen library table)

CLEAN_SEQUENCE: exported from cross_match

SEQUENCE_START: The starting bp of the Raw sequence considered "good"

SEQUENCE_END: The last bp of the Raw sequence considered "good"

 

And there we have it. Let's put it all together and look at where we will have Keys, and where other tables need to be created:

LOCUS: the locus name if applicable

SEQUENCE_TYPE: DNA

SEQUENCE_SUBTYPE: GSS

DATE ENTERED: 04-FEB-1999

ACCESSION: AQ367761 if applicable

NID: G4222151 (the PRIMARY KEY for the Sequence Table)

VERSION: AQ367761.1

KEYWORDS: a variable field of identifying keywords

Base Count:

CONTACT: A FOREIGN KEY to the People Table

REFERENCE: A FOREIGN KEY to the References Table

Library: A FOREIGN KEY to the Library Table

Organism: Since not every seqeunce comes from a library, we'll have to state what organism some come from FOREIGN KEY to the Taxonomy Table

CHROMATOGRAPH: filename of the chromatograph

CONVERSION_PROGRAM: Phred

CONVERSION_VERSION: 0.960108

CONVERSION_PERSON: Maura Hart (or maybe e-mail for a foreign key…)

CONVERSION_PLATFORM: NT

RAW_SEQUENCE: the sequence (output from phred)

RAW_QUALITY: the quality (output from phred)

SEQUENCE_LENTGH: length of the raw sequence in bp

VECTOR_PROGRAM: cross_match

VECTOR_VERSION: 0.990319

VECTOR_PERSON: Maura Hart

VECTOR_PLATFORM: NT
VECTOR_DATABASE: vector.seq (another foreign key - vector.seq is a collection of vector sequences which should be stored in the database as a screen library table)

CLEAN_SEQUENCE: exported from cross_match

SEQUENCE_START: The starting bp of the Raw sequence considered "good"

SEQUENCE_END: The last bp of the Raw sequence considered "good"

 

Now some things I haven't discussed yet:

SECURITY: Here's where we'll have to deal with which seqeunces are avaialble to whom..

 

CONTIG: a FOREIGN KEY to the assembly# this sequence belongs to

 

CLONE: a FOREIGN KEY to the clone table, many of our sequences are partial sequences (ESTs) from various CLONES

BLAST: BLAST results will be handled differently. . . the BLAST results will be parsed into the database as BLAST tables each query and subject will be identified by its NID, so looking up what is similar to what is a matter of SQL querying the BLAST tables with pointers back to the sequence tables…..

 

Primers: Primers in the database needs more discussion. Not only do we need to keep track of the primer sequences, we need to keep track of the primer's name, who ordered it, where it is, etc. Looks like we need a Primer table.….

PRIMER_LEFT_START:

PRIMER_LEFT_END:

PRIMER_RIGHT_START:

PRIMER_RIGHT_END: