
SynTom Meeting Notes ALM Tuesday, June 15, 1999

Outline:

1) Introduction
a) Definitions and Simple Examples
b) Design Principles

2) SynTom Tables:
a) overview
b) people
c) libraries
d) sequences
   sources
   examples
   definitions

3) Wrap-Up

1a) Definitions and Simple Examples

Most of this comes from: The Practical SQL Handbook - Using Structured Query Language (Third Edition) by Judith S. Bowman, Sandra L. Emerson and Marcy Darnovsky.

The goal of the SynTom (Synteny in Tomato) database is to be the global repository for sequence, mapping and synteny information for all of the species in the family Solanaceae, with emphasis on tomato (from http://syntom.cit.cornell.edu/objectives.html).

In order to achieve this goal we are using Oracle 8 to build a relational database.

Oracle is a relational database management system or DBMS.

Relational databases represent all information in TABLES.

People

Name	Address

John Doe	123 Spud Terrace
Jane Smith	867 Rockinghorse Blvd.
Bill Clinton	1600 Pennsylvania Avenue

A set of related tables forms a database.

Each row describes one occurrence of an entity - a person, a sequence, a car, etc.

Each column describes one characteristic of the entity - a person's name, a sequence's GC content, a car's color, etc.

The intersection of a row and a column is a data element or value. To find the value you want, you need to know the name of the table, what column it's in and the value of the row's primary key or unique identifier.

So, to have Oracle find Bill Clinton's address, you need to tell it to look up Bill Clinton (the unique identifier) in the People table, and extract data from the column named address.

N.B. The names in my table are not in alphabetical order. The order of the rows is insignificant to the database…..

The language used to access information in relational databases is called SQL - Structured Query Language. It's officially pronounced "ess-cue-ell" but most people pronounce it "sequel". I won't talk much about SQL, but just to illustrate, here's how we'd perform the retrieval we did in the previous example.

SELECT Address

FROM People

WHERE Name = 'Bill Clinton';

Oracle would then return "1600 Pennsylvania Avenue"

Here's how we'd actually make the table:

CREATE table People

(Name varchar2(40) NOT NULL PRIMARY KEY,

Address varchar2(100) NOT NULL)

TABLESPACE User_Data;
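
To try the table and the lookup without an Oracle installation, here is a minimal sketch using Python's built-in sqlite3 module. It is a stand-in under stated assumptions, not the SynTom setup: SQLite has no varchar2 type and no TABLESPACE clause, so the definition uses TEXT and drops the tablespace, and the function names build_people_db and find_address are made up for this example.

```python
import sqlite3


def build_people_db():
    """Create an in-memory database holding the example People table."""
    conn = sqlite3.connect(":memory:")
    # SQLite stand-in for the Oracle CREATE TABLE (no varchar2, no TABLESPACE).
    conn.execute(
        "CREATE TABLE People ("
        " Name TEXT NOT NULL PRIMARY KEY,"
        " Address TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO People (Name, Address) VALUES (?, ?)",
        [
            ("John Doe", "123 Spud Terrace"),
            ("Jane Smith", "867 Rockinghorse Blvd."),
            ("Bill Clinton", "1600 Pennsylvania Avenue"),
        ],
    )
    return conn


def find_address(conn, name):
    """Return the Address for one Name, or None if the name is not in the table."""
    row = conn.execute(
        "SELECT Address FROM People WHERE Name = ?", (name,)
    ).fetchone()
    return row[0] if row else None


conn = build_people_db()
print(find_address(conn, "Bill Clinton"))
```

The lookup needs exactly the three things listed earlier: the table name (People), the column (Address) and the value of the primary key (the Name).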

Let's suppose we had another table in our fictional database which keeps track of people's jobs:

jobs

Name	Job

John Doe	Couch Potatoe
Jane Smith	Computer programmer
Bill Clinton	Tabloid fodder

To find out what John Doe does and where he lives, we perform a JOIN: we put two tables together for the query, matching their rows on the shared Name column.

SELECT Job, Address

FROM jobs, People

WHERE jobs.Name = People.Name

AND People.Name = 'John Doe';

The Result would be:

Job	Address

-------------------------------------------------

Couch Potatoe	123 Spud Terrace
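
A join only gives a sensible answer when the query says which rows of the two tables belong together; in this example that is the shared Name column. The sketch below is a hedged illustration using Python's sqlite3 module rather than Oracle, with made-up helper names (build_db, job_and_address, unfiltered_pairs). It also shows what comes back when the join condition is left out.

```python
import sqlite3


def build_db():
    """Create the two example tables in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE People (Name TEXT NOT NULL PRIMARY KEY, Address TEXT NOT NULL);
        CREATE TABLE jobs (Name TEXT NOT NULL PRIMARY KEY, Job TEXT NOT NULL);
        INSERT INTO People VALUES ('John Doe', '123 Spud Terrace');
        INSERT INTO People VALUES ('Jane Smith', '867 Rockinghorse Blvd.');
        INSERT INTO People VALUES ('Bill Clinton', '1600 Pennsylvania Avenue');
        INSERT INTO jobs VALUES ('John Doe', 'Couch Potatoe');
        INSERT INTO jobs VALUES ('Jane Smith', 'Computer programmer');
        INSERT INTO jobs VALUES ('Bill Clinton', 'Tabloid fodder');
        """
    )
    return conn


def job_and_address(conn, name):
    """Join jobs to People on Name and return (Job, Address) rows for one person."""
    return conn.execute(
        "SELECT jobs.Job, People.Address"
        " FROM jobs, People"
        " WHERE jobs.Name = People.Name AND People.Name = ?",
        (name,),
    ).fetchall()


def unfiltered_pairs(conn):
    """Without a join condition every jobs row is paired with every People row."""
    return conn.execute(
        "SELECT jobs.Job, People.Address FROM jobs, People"
    ).fetchall()


conn = build_db()
print(job_and_address(conn, "John Doe"))
print(len(unfiltered_pairs(conn)))
```

With three rows in each table, the unfiltered query returns nine pairs, which is why the condition on Name matters.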

1b) Design Principles

Designing a database involves choosing:

The tables that belong in the database
The columns that belong in each table
How tables and columns interact

Relational databases allow flexibility - decisions we make now will not limit the questions we can ask later.

Steps in Database Design:

Think about the information in the database.

Where does it come from?
What format is it in?
How will it be entered?
How often will it change?

List the "things" or entities with their properties and attributes
Locate unique identifiers or primary keys for each entity
Consider relationships between the entities

are they one to many (each book has one publisher, but a publisher has many books)
or many to many (an author can write multiple books and a book can have multiple authors)
can the data in one proposed table be joined to data in related tables?

Create the database, experiment with some reports and queries
Start all over again
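
The two kinds of relationship in the steps above translate into tables differently. A one-to-many relationship becomes a foreign key column on the "many" side; a many-to-many relationship needs a third linking table. The following is a sketch only, written with Python's sqlite3 module: every table name, column name and row (publishers, books, authors, book_authors, 'Book One', 'Author A' and so on) is invented for illustration.

```python
import sqlite3

SCHEMA = """
CREATE TABLE publishers (pub_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- one to many: each book points at exactly one publisher
CREATE TABLE books (
    book_id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    pub_id INTEGER NOT NULL REFERENCES publishers(pub_id)
);
CREATE TABLE authors (author_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- many to many: one row per (book, author) pairing
CREATE TABLE book_authors (
    book_id INTEGER NOT NULL REFERENCES books(book_id),
    author_id INTEGER NOT NULL REFERENCES authors(author_id),
    PRIMARY KEY (book_id, author_id)
);
"""


def build_book_db():
    """Create the example schema and load a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executescript(
        """
        INSERT INTO publishers VALUES (1, 'Publisher P');
        INSERT INTO books VALUES (1, 'Book One', 1);
        INSERT INTO books VALUES (2, 'Book Two', 1);
        INSERT INTO authors VALUES (1, 'Author A');
        INSERT INTO authors VALUES (2, 'Author B');
        INSERT INTO book_authors VALUES (1, 1);
        INSERT INTO book_authors VALUES (1, 2);
        INSERT INTO book_authors VALUES (2, 1);
        """
    )
    return conn


def authors_of(conn, title):
    """Names of all authors of one book, found through the linking table."""
    rows = conn.execute(
        "SELECT authors.name FROM authors"
        " JOIN book_authors ON book_authors.author_id = authors.author_id"
        " JOIN books ON books.book_id = book_authors.book_id"
        " WHERE books.title = ? ORDER BY authors.name",
        (title,),
    ).fetchall()
    return [name for (name,) in rows]


def books_by(conn, author):
    """Titles of all books by one author, found through the same linking table."""
    rows = conn.execute(
        "SELECT books.title FROM books"
        " JOIN book_authors ON book_authors.book_id = books.book_id"
        " JOIN authors ON authors.author_id = book_authors.author_id"
        " WHERE authors.name = ? ORDER BY books.title",
        (author,),
    ).fetchall()
    return [title for (title,) in rows]
```

The same linking table answers the question from either direction, which is the kind of flexibility the design steps are after.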

Enough introduction. . . . .

SynTom Tables: Overview

What kinds of information are going to be stored in SynTom?

People (researchers' names, e-mail, lab affiliations, phone numbers, etc.)

Organisms (genus, species, cultivar, chromosome number, etc.)

Libraries (tissues, vectors, treatments, made by whom, # of clones, etc.)

Sequences (sequence, vector screened, organism, library, made by whom, BLAST results, expressed or genomic clone it came from, etc.)

Clones (where is it, what library did it come from, has it been sequenced, what kind of clone is it, etc.)

Maps (number of linkage groups, markers, bins, what markers/clones are in each bin, etc.)

Assemblies (for any library how many assemblies are there, how many sequences in each library, what method was used to create the assembly, BLAST results, etc.)

Taxonomy (how related are two organisms, how many sequences per organism, do they share sequences, etc.)

Expression ….

Similarity….

Synteny . . .

And probably a whole lot more…..

But we can begin to classify some "core" tables that we'll need. And that's how it begins. We'll start with a (hopefully) easily classifiable starting point:

SynTom Tables: People

First Name: Andreas

Last Name: Matern

Initials: ALM

E-mail: alm13@cornell.edu

Address: 622 Rhodes Hall

City: Ithaca

State: NY

Zip: 14853

University: Cornell University

Department: Plant Breeding and Biometry

Phone: 607-254-7473

Fax: 607-255-6683

Title: Graduate Student

Group: Tanksley

Homepage: http://syntom.cit.cornell.edu

Project: SynTom

Last Updated: June 15, 1999

Comments: Limps, sleeps late….

Security: ????

Is that enough information?

What's the primary key?

Well, thanks to the Internet, we know that everyone needs a unique e-mail. So, perhaps e-mail is a good unique identifier…..

How will this be entered? Regretfully by hand…. But each user can enter their own information, so it's not so bad.

How will this be useful? Well, every time a user enters a sequence or a clone, or a map, or any other kind of data, we'd like to track who did what and have a way of contacting them…..
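
Choosing e-mail as the primary key means the database itself refuses a second person with the same address. Here is a small sketch of that behaviour using Python's sqlite3 module instead of Oracle. The column names and the helpers make_people_table and add_person are assumptions for illustration, and only a handful of the fields listed above are included.

```python
import sqlite3


def make_people_table():
    """Create a cut-down people table keyed on e-mail (hypothetical column names)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE people (
            email TEXT NOT NULL PRIMARY KEY,
            first_name TEXT NOT NULL,
            last_name TEXT NOT NULL,
            university TEXT,
            phone TEXT
        )
        """
    )
    return conn


def add_person(conn, email, first_name, last_name, university=None, phone=None):
    """Insert one person; return False if that e-mail is already in the table."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO people VALUES (?, ?, ?, ?, ?)",
                (email, first_name, last_name, university, phone),
            )
    except sqlite3.IntegrityError:
        return False
    return True


conn = make_people_table()
print(add_person(conn, "alm13@cornell.edu", "Andreas", "Matern",
                 "Cornell University", "607-254-7473"))
# A second row with the same e-mail violates the primary key and is rejected.
print(add_person(conn, "alm13@cornell.edu", "Someone", "Else"))
```

One thing this sketch does not settle is what happens when a person changes e-mail address; every table that uses the address as a foreign key would need updating.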

SynTom Tables: Libraries

Organism (Lycopersicon esculentum)

Cultivar (TA496)

Tissue (tomato ovary)

who created the library (Alcala)

when was the library created (January 8, 1999)

who did the sequencing (TIGR)

when was the sequencing done (March 15, 1999)

vector (pBluescript SK(-))

restriction site 1 (EcoRI)

restriction site 2 (XhoI)

developmental stage (5 days pre-anthesis to 5 days post-anthesis)

host (XL1-Blue MRF')

library name (tomato ovary, TAMU)

This table came right out of the data we submitted to GenBank…. Do we need more information?

There's an obvious primary key: library name

There are some obvious foreign keys:

Who created the library: a link to the People table, although we'll probably have to change Alcala to Alcala's e-mail address….
Who did the sequencing: looks like we need a Lab table. . . .
Organism: a foreign key which will join this table with the Organism and the Taxonomy tables
Cultivar: likewise
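
To make the foreign key idea concrete, here is a sketch in which the library table points at the people table through the creator's e-mail address. It uses Python's sqlite3 module rather than Oracle (SQLite only enforces foreign keys after PRAGMA foreign_keys = ON). The column names, the helper names and the address alcala@example.org are placeholders invented for this example; the real address is not given in these notes.

```python
import sqlite3


def build_library_db():
    """Create people and library tables linked by a foreign key."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific switch
    conn.executescript(
        """
        CREATE TABLE people (
            email TEXT NOT NULL PRIMARY KEY,
            last_name TEXT NOT NULL
        );
        CREATE TABLE library (
            library_name TEXT NOT NULL PRIMARY KEY,
            organism TEXT NOT NULL,
            cultivar TEXT,
            tissue TEXT,
            created_by TEXT NOT NULL REFERENCES people(email)
        );
        -- placeholder e-mail address, not a real one
        INSERT INTO people VALUES ('alcala@example.org', 'Alcala');
        """
    )
    return conn


def add_library(conn, name, organism, cultivar, tissue, created_by):
    """Insert a library; return False if a key constraint rejects the row."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO library VALUES (?, ?, ?, ?, ?)",
                (name, organism, cultivar, tissue, created_by),
            )
    except sqlite3.IntegrityError:
        return False
    return True


conn = build_library_db()
print(add_library(conn, "tomato ovary, TAMU", "Lycopersicon esculentum",
                  "TA496", "tomato ovary", "alcala@example.org"))
# created_by must match an existing people.email, so this row is refused.
print(add_library(conn, "some other library", "Lycopersicon esculentum",
                  "TA496", "leaf", "nobody@example.org"))
```

Organism and cultivar would get the same treatment once the Organism table exists; they are left as plain text here.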

SynTom Tables: Sequences

Enough with the trivial examples, let's get down to the nitty gritty . . . sequences.

Sources of Data:

We get sequences for SynTom from a variety of sources, including:

Chromatograms which we have to process and store for later use

GenBank downloads

FASTA files (either from "old" sequencing projects for which we have no chromatograms, or "donated" sequences for which there are no chromatograms)

The best way (IMHO) to create the sequence tables is to look at these data input files and examine what fields are there. Then we need to combine the different inputs into one (or maybe many) tables.

GENBANK:

Here's an example of a GenBank record for a tomato genomic sequence:

LOCUS AQ367761 650 bp DNA GSS 04-FEB-1999

DEFINITION toxb0002P24r CUGI Tomato BAC Library Lycopersicon esculentum

genomic clone toxb0002P24r, genomic survey sequence.

ACCESSION AQ367761

NID g4222151

VERSION AQ367761.1 GI:4222151

KEYWORDS GSS.

SOURCE tomato.

ORGANISM Lycopersicon esculentum

Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;

euphyllophytes; Spermatophyta; Magnoliophyta; eudicotyledons;

Asteridae; Solananae; Solanales; Solanaceae; Solanum; Potatoe;

Lycopersicon.

REFERENCE 1 (bases 1 to 650)

AUTHORS Budiman,M.A. and Wing,R.A.

TITLE A Preliminary Analysis of Tomato BAC End Sequencing

JOURNAL Unpublished (1999)

COMMENT Tomato is a vegetable crop that ranks second only

to potatoes in value and importance. Among plant

geneticists and physiologists, tomato represents an ideal

dicot model beside Arabidopsis and monocot rice to derive

genomic information from. To facilitate the genome

analysis of tomato, we have constructed a tomato BAC

library that is suitable for positional cloning, physical

mapping, and genome sequencing. The library contains

129,000 clones and a random sampling of 498 clones

indicated an average insert size of 117.5 kb. With 15X

haploid genome equivalents (1C equals 953 Mb)

(Arumuganathan and Earle, 1991), the probability to

recover any particular sequence is greater than 99%. High

stability, large insert and ease in manipulation make BAC

libraries the choice for genome sequencing. Pre

characterization of a few hundred bases of insert ends

will make BAC clones extremely useful for rapid contig

assembly (Venter, Smith, and Hood, 1996). Here we present

the construction, characterization of the tomato BAC

library, and preliminary analysis of the 1536 tomato BAC

end sequences."

Contact: Wing RA

Clemson University Genomics Institute

Clemson University

100 Jordan Hall, Clemson, SC 29634, USA

Tel: 864 656 7288

Fax: 864 656 4293

Email: rwing@clemson.edu

Seq primer: GGAAACAGCTATGACCATG

Class: BAC ends

High quality sequence start: 29

High quality sequence stop: 526.

FEATURES Location/Qualifiers

source 1..650

/organism="Lycopersicon esculentum"

/cultivar="Heinz 1706"

/note="Vector: pBeloBAC 11; Site_1: HindIII; Site_2:

HindIII;

/db_xref="taxon:4081"

/clone="toxb0002P24r"

/clone_lib="CUGI Tomato BAC Library"

/tissue_type="Nuclei preparation from Leaf"

/lab_host="E. coli DH10B"

BASE COUNT 171 a 103 c 116 g 259 t 1 others

ORIGIN

1 attcgacacg caatctatac aggtcacact atataatact caagcttacg ttgttttagc

61 attccaactc gcataatcgt acattttgtg tacaaattct agtttgccca catcgtatca

121 tgatacaaat gtaggtaatg agaatcggca tccaatgcac tatggattga gttgagcact

181 ttagaatcag ttggtgaacc tccttatatt ctgaaggact tcttttattg tgtttttagt

241 atttttatta ttaggatgtt ctagtgtctg tcctaacatc catcttagtt ttagaagtct

301 acatatatag acagtcaaat tttagtagtt tagtggtctt tgcattttca ttcttatgtt

361 aaagacttga gtttccattt tggccaagtt gaatgtttaa atttttaaaa cattcaagtt

421 atattataat ttagttgagt tcacttcttt gatcattata gtattgattt ttttcttccg

481 ctatgtaaag ttagttagac caagggtccg ctcgaggcca acaatggtct tcgagtgtcg

541 gctatgctca gggtgctggc tcgggacgtg acattcattn ttttgtttat aattatgatg

601 ttgtgtttta caatttgtct atccatgatt atataatgtt tgaacgtttg

Now that's a lot of data! We obviously don't want to discard any of it, so. . . .

Let's take it apart!

I'll start by removing all those items that will be stored in other tables…. . .

Contact: Wing RA

Clemson University Genomics Institute

Clemson University

100 Jordan Hall, Clemson, SC 29634, USA

Tel: 864 656 7288

Fax: 864 656 4293

Email: rwing@clemson.edu

Well that all fits nicely into the People table I described before. . . .so we'll discard it all except for the e-mail address which is our primary key in the people table and therefore a foreign key in this table.
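
Pulling the e-mail address out of the contact block can be done mechanically, because the labelled lines all have the form "Label: value". The function below is a rough sketch under that assumption; parse_contact and CONTACT_LINES are names made up for this example, and lines without a label (institute, street address) are simply skipped.

```python
# The contact block from the GenBank COMMENT, one string per line.
CONTACT_LINES = [
    "Contact: Wing RA",
    "Clemson University Genomics Institute",
    "Clemson University",
    "100 Jordan Hall, Clemson, SC 29634, USA",
    "Tel: 864 656 7288",
    "Fax: 864 656 4293",
    "Email: rwing@clemson.edu",
]


def parse_contact(lines):
    """Collect 'Label: value' lines into a dict keyed by lower-cased label."""
    fields = {}
    for line in lines:
        key, sep, value = line.partition(":")
        # Lines with no colon carry no label, so there is nothing to key them on.
        if sep and value.strip():
            fields[key.strip().lower()] = value.strip()
    return fields


contact = parse_contact(CONTACT_LINES)
# Only the e-mail needs to stay with the sequence, as the foreign key.
print(contact["email"])
```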

REFERENCE 1 (bases 1 to 650)

AUTHORS Budiman,M.A. and Wing,R.A.

TITLE A Preliminary Analysis of Tomato BAC End Sequencing

JOURNAL Unpublished (1999)

COMMENT Tomato is a vegetable crop that ranks second only

to potatoes in value and importance. Among plant

geneticists and physiologists, tomato represents an ideal

dicot model beside Arabidopsis and monocot rice to derive

genomic information from. To facilitate the genome

analysis of tomato, we have constructed a tomato BAC

library that is suitable for positional cloning, physical

mapping, and genome sequencing. The library contains

129,000 clones and a random sampling of 498 clones

indicated an average insert size of 117.5 kb. With 15X

haploid genome equivalents (1C equals 953 Mb)

(Arumuganathan and Earle, 1991), the probability to

recover any particular sequence is greater than 99%. High

stability, large insert and ease in manipulation make BAC

libraries the choice for genome sequencing. Pre

characterization of a few hundred bases of insert ends

will make BAC clones extremely useful for rapid contig

assembly (Venter, Smith, and Hood, 1996). Here we present

the construction, characterization of the tomato BAC

library, and preliminary analysis of the 1536 tomato BAC

end sequences."

There are going to be lots of references (and there are already lots of references in SolGenes) so let's make a Reference Table later. Perhaps the TITLE should be the key here….

SOURCE tomato.

ORGANISM Lycopersicon esculentum

Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;

euphyllophytes; Spermatophyta; Magnoliophyta; eudicotyledons;

Asteridae; Solananae; Solanales; Solanaceae; Solanum; Potatoe;

Lycopersicon.

This looks like something for an Organism table and a Taxonomy table. ORGANISM is an obvious foreign key…
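
One possible way to get the lineage into a Taxonomy table is to split it on the semicolons and store each name together with its parent. This is only a sketch of that idea: parse_lineage and parent_pairs are hypothetical helpers, and the (child, parent) layout is an assumption, not a decided design.

```python
# The lineage lines from the ORGANISM block of the GenBank record.
LINEAGE_LINES = [
    "Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;",
    "euphyllophytes; Spermatophyta; Magnoliophyta; eudicotyledons;",
    "Asteridae; Solananae; Solanales; Solanaceae; Solanum; Potatoe;",
    "Lycopersicon.",
]


def parse_lineage(lines):
    """Join the wrapped lines and split them into an ordered list of taxon names."""
    joined = " ".join(line.strip() for line in lines).rstrip(".")
    return [name.strip() for name in joined.split(";") if name.strip()]


def parent_pairs(lineage):
    """Pair each taxon with the one directly above it: (child, parent)."""
    return list(zip(lineage[1:], lineage[:-1]))


lineage = parse_lineage(LINEAGE_LINES)
print(lineage[0], lineage[-1])
print(parent_pairs(lineage)[-1])
```

With parent links stored this way, "how related are two organisms" becomes a matter of walking up from each organism until the two paths meet.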

/organism="Lycopersicon esculentum"

/cultivar="Heinz 1706"

/note="Vector: pBeloBAC 11; Site_1: HindIII; Site_2:

HindIII;

/db_xref="taxon:4081"

/clone="toxb0002P24r"

/clone_lib="CUGI Tomato BAC Library"

/tissue_type="Nuclei preparation from Leaf"

/lab_host="E. coli DH10B"

This goes right into the Library table; the library is even named (/clone_lib), which gives us our foreign key.

So what do we have left for the Sequence table (GenBank edition)?
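
The qualifier lines are regular enough to read with a few lines of code: each starts with a slash, then a name, an equals sign and a quoted value, and a value may wrap onto the next line. The sketch below assumes exactly that layout; parse_qualifiers and FEATURE_LINES are names invented for this example, and it is not a full GenBank parser.

```python
# The source-feature qualifiers from the GenBank record, one string per line.
FEATURE_LINES = [
    '/organism="Lycopersicon esculentum"',
    '/cultivar="Heinz 1706"',
    '/note="Vector: pBeloBAC 11; Site_1: HindIII; Site_2:',
    'HindIII;',
    '/db_xref="taxon:4081"',
    '/clone="toxb0002P24r"',
    '/clone_lib="CUGI Tomato BAC Library"',
    '/tissue_type="Nuclei preparation from Leaf"',
    '/lab_host="E. coli DH10B"',
]


def parse_qualifiers(lines):
    """Turn '/name="value"' lines into a dict, folding wrapped lines into the value."""
    qualifiers = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("/"):
            key, _, value = line[1:].partition("=")
            current = key
            qualifiers[current] = value.strip('"')
        elif current is not None:
            # A line without a leading slash continues the previous value.
            joined = qualifiers[current] + " " + line.strip('"')
            qualifiers[current] = joined.strip()
    return qualifiers


q = parse_qualifiers(FEATURE_LINES)
# clone_lib is the value that would link a sequence to its library.
print(q["clone_lib"])
```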