An Introduction to Bioinformatics
Created for Biology 261 at
Frances E. Weaver; last revised November
17, 2008
Objectives: Develop some basic computer database skills that will permit future explorations of bio-molecular topics.
Learn how to access and use scientific databases.
Learn how to use the database Medline to access the cell and molecular biology literature.
Learn to distinguish between primary and secondary sources in the scientific literature (peer-reviewed research articles vs. reviews).
Learn how to use the program BLAST to search public databases for the identity of an "unknown" nucleotide sequence.
Improve your understanding of the relationship between genes and the proteins they encode.
Develop the skills needed to identify products encoded in the horseshoe crab cDNA inserts we may be able to send for sequencing.
What is due: Answers to the questions given on this sheet, due to be
reviewed by the instructor at the end of the lab.
Please bring your textbook to lab!
Identifying and Researching an "Unknown" Nucleotide or Protein Sequence
You say:
gcccaagaag cccatcctgg gaaggaaaat
gcattgggga accctgtgcg gattcttgtg gctttggccc tatcttttct atgtccaagc tgtgcccatc caaaaagtcc aagatgacac caaaaccctc atcaagacaa ttgtcaccag gatcaatgac atttcacaca cgcagtcagt ctcctccaaa cagaaagtca ccggtttgga cttcattcct gggctccacc ccatcctgac
And I say: Leptin
The sequence of nucleotides in DNA (the gene) is copied out in the mRNA form (the transcript) and that copy is decoded in the process of translation to produce the protein. The order of amino acids in a protein can be predicted if you know the gene or cDNA sequence and the genetic code. Once the primary sequence of the protein is known, we can use powerful bioinformatics techniques to deduce its function by comparing this protein's sequence to those of proteins with known functions.
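The decoding step described above can be tried on a small scale. The following is a minimal Python sketch, not part of BLAST and not required for the lab. It assumes the standard genetic code and reads a single frame starting at the first base given; the names CODON_TABLE and translate are made up for this illustration.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G order.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA coding sequence, reading codons from the first base."""
    dna = "".join(dna.split()).upper()  # drop spaces and line breaks
    protein = []
    for start in range(0, len(dna) - 2, 3):
        codon = dna[start:start + 3]
        protein.append(CODON_TABLE.get(codon, "X"))  # "X" = unrecognized codon
    return "".join(protein)


# A short stretch of the sequence shown above, beginning at its first atg.
print(translate("atgcattggggaaccctgtgc"))
```

Run on that 21-base stretch, the sketch prints MHWGTLC, one letter per amino acid. Any leftover bases that do not fill a complete codon are ignored.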
A. Using BLAST
BLAST (Basic Local Alignment Search Tool) uses an algorithm, a step-by-step
computational procedure, to compare an input sequence to a database of
nucleotide or protein sequences. Researchers use it to identify unknown
sequences they have generated in creating cDNA or genomic libraries, or to
verify the identity of a particular DNA sequence they have been attempting to
clone. BLAST is one of the most generally useful programs ever written for
bioinformatics.
It is possible to submit a nucleotide sequence, have the computer translate it in all six reading frames, and compare those translations to databases of protein sequences. That is what we will do today. It is practice for a real research situation: identifying the sequences that might be returned to us from our horseshoe crab cDNAs.
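To make "all six reading frames" concrete, here is a small Python sketch that splits a sequence into codons in each frame. It illustrates the idea only and is not how blastx is implemented; the function names and the frame labels (+1 to +3 for the strand as given, -1 to -3 for its reverse complement) are choices made for this example.

```python
# Map each base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read in its own 5' to 3' direction."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def six_frames(dna):
    """Return a dict mapping a frame label to the list of codons in that frame."""
    dna = "".join(dna.split()).upper()  # drop spaces and line breaks
    frames = {}
    for sign, strand in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            # Keep only complete codons starting at this offset.
            codons = [
                strand[i:i + 3]
                for i in range(offset, len(strand) - 2, 3)
            ]
            frames[f"{sign}{offset + 1}"] = codons
    return frames


for label, codons in six_frames("atgcattgg").items():
    print(label, " ".join(codons))
```

Each list of codons could then be decoded with the genetic code to give one candidate protein sequence per frame.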
Answer these questions before you begin (you may use any source; please reference your sources).
1. What is meant by translation?
2. What is a reading frame?
3. Why are there 6 reading frames, not three?
4. What is homology in the context of nucleotide or protein sequences?
Sources used:
5. Open the document provided at this link. We are going to look for protein sequences identical or homologous to the one encoded by these nucleotides.
6. Select the sequence you have been assigned, being certain to select the top line that includes the character >, and copy it.
7. Open Internet Explorer
8. Visit the BLAST home page http://www.ncbi.nlm.nih.gov/BLAST/ (may take some time to load)
9. Scroll down to and click on: Search protein database using a translated nucleotide query [blastx]
10. Paste your copied sequence into the search box (shown above). DO NOT CHANGE ANYTHING ELSE.
11. Click on the button labeled BLAST at the bottom of the screen.
You must wait for your results; this may take some time.
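A note on what you copied in step 6: a top line beginning with > followed by lines of sequence is the layout known as FASTA format, in which the > line names the sequence. The Python sketch below shows one simple way such text can be separated into a name and a sequence. It is an illustration written for this handout, not the parser BLAST uses, and the function name parse_fasta and the example label are hypothetical.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Sequence line: remove any spaces between blocks of bases.
            chunks.append("".join(line.split()))
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical label; the bases are the start of the sequence on this sheet.
example = """>unknown_1 hypothetical label for illustration
gcccaagaag cccatcctgg
gaaggaaaat
"""

for name, sequence in parse_fasta(example):
    print(name, len(sequence), "bases")
```

In this sketch, sequence lines that appear before any > line are ignored, which is one reason to copy the top line along with the bases.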
When you have the results:
Clicking on a score will show you an alignment (similar to what some of you might have made with CLUSTALW in Biology Workbench for Biology 161), but for only two sequences at a time.
What do the scores mean?
The higher the score (in bits), the better the alignment, and the more likely
it is that your sequence is identical or closely related to the one found in
the database.
The E value is the number of matches of this quality that are predicted to
occur by chance alone in a database the size of the one you just searched.
Very low E values, such as 4e-36, mean that it is extremely unlikely that you
would get a match that good by chance. Higher E values, such as 5 or 105, mean
that a match of this quality occurs often (very often, all the time) simply by
chance.
Notice that E can exceed 1, so it is not a probability value.
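For those who want to see the arithmetic, the usual statistical description of BLAST relates the two numbers as E = m x n x 2^(-S), where S is the bit score and m and n are the lengths of the query and of the database. The chance of seeing at least one such match is then 1 - e^(-E), which does stay between 0 and 1. The Python sketch below applies those two formulas. The lengths used are invented for illustration, and the real program uses adjusted "effective" lengths, so do not expect these numbers to reproduce your results page.

```python
import math


def expected_chance_matches(bit_score, query_length, database_length):
    """E value: expected number of chance matches with at least this bit score."""
    return query_length * database_length * 2.0 ** (-bit_score)


def chance_of_at_least_one(e_value):
    """Probability of one or more chance matches, given the E value."""
    return -math.expm1(-e_value)  # same as 1 - exp(-E), accurate for tiny E


query_length = 300                  # hypothetical query length
database_length = 1_000_000_000     # hypothetical database length

for bits in (20, 40, 150):
    e = expected_chance_matches(bits, query_length, database_length)
    print(bits, "bits -> E =", e, " P =", chance_of_at_least_one(e))
```

Because the score sits in the exponent, each additional bit halves the expected number of chance matches, and a larger database raises E for the same score.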
So what the heck is this thing?
Clicking on the link that begins with "gi"
will show you the database entry that your sequence matched.
Follow some of the links at the top of your list of sequences producing
significant alignments, scrolling down to gather information as each one comes
up. If the information is not what you need, try reading down the file a bit.
12. Based on the results of BLAST,
what protein have you been assigned?
13. Is this protein associated with a human disease or trait? If so, what disease or trait?
14. Visit the Genes and Disease site of the National Institutes of
Health
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?call=bv.View..ShowSection&rid=gnd.preface.91
and search for the protein's involvement in a disease or trait. The search
box is in the upper right corner.
15. Summarize what you find out about the disease or trait here in your own words.
II. Locating the scientific literature:
to access data bases of the scientific literature. Here we "pretend"
that we don't know which database to use.
First click Databases by title (on the right-hand side).
Then, in the left-hand panel, click on:
Find Articles
Research by Subject Area
Biology
Find Articles
Your best bets for peer-reviewed scientific articles are Basic Biosis and Medline. Medline (aka PubMed, but not PubMed Central) is the current first choice for anything that has medical relevance (insulin, for example) and for most cellular or molecular biology topics.
Click on PubMed
Search for articles related to your protein
Write in this space the reference information (authors, title, source,
date, pages, etc.; use abbreviated form) for any
three refereed research articles (aka primary sources) dealing in some way
with your protein. (How do I know it’s a primary
source? See below.*)
Abbreviate author lists and titles here, please
1.
2.
3.
Locate a review article on your topic and write the reference information in the space below. (How do I know it’s a review or secondary source? See below.*) Hint: you can search for articles by type.
4.
Is the full text available (for free) online?
Some articles are. To find some, limit your search to full text only
or use PubMed Central.
Once you have one, surf there to see what that is like.
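The same kinds of limits (by article type, or free full text only) can also be typed into the PubMed search box as tags, for example leptin AND review[pt]. As an optional aside, the Python sketch below builds the web address for such a search using NCBI's E-utilities search service. It only constructs the address and does not contact NCBI. The function name and its options are made up for this illustration, and the tags shown are the ones assumed here for reviews and for free full text.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(topic, reviews_only=False, free_full_text_only=False,
                      max_results=20):
    """Build (but do not send) a PubMed search request for the given topic."""
    terms = [topic]
    if reviews_only:
        terms.append("review[pt]")          # publication type: review
    if free_full_text_only:
        terms.append("free full text[sb]")  # subset: free full text available
    query = {
        "db": "pubmed",
        "term": " AND ".join(terms),
        "retmax": max_results,
    }
    return ESEARCH_URL + "?" + urlencode(query)


print(pubmed_search_url("leptin", reviews_only=True))
print(pubmed_search_url("leptin", free_full_text_only=True))
```

Substitute the name of your own protein for leptin. For this lab, using the PubMed web page directly is all that is required.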
2. List the reference information for a full text online article about
your topic.
3. For what would that article be useful? Is it useful for cellular or
molecular information?
IIc. Explore some other useful databases: go to databases listed by title http://www3.widener.edu/Academics/Libraries/Wolfgram_Memorial_Library/Find_Articles/657/
4. What sorts of articles can you find in
EBSCOhost? (Do not write down titles, just types, as
in "reviews, full text", etc.)
In Scirus?
In JSTOR?
In Science Direct?
* A peer reviewed or refereed article in a professional journal will read very much like a lab report. It will have an abstract, an introduction, a methods and materials section, a results section and a discussion section. Such articles are written by the people who did the research and are therefore primary sources for science. If you read through an abstract and the authors tell you what experiments they did, then you are likely to have found a primary source. Review articles are secondary sources, although such articles often have an abstract as well. In the First Search engine, the article type is given beneath each abstract.
Once you have located articles and explored other databases, go on to the next part, if time allows.
5. Visit Molecules To Go http://molbio.info.nih.gov/cgi-bin/pdb
Enter the name of your protein in the search box, and see if a Protein Data
Bank file is available.
Pick one from the list, select JMol PDB viewer, and use the navigation windows to explore the structure.
If Molecules To Go doesn't have a structure, try a general search engine, such as Google, and do an image search.
Summarize what you find here, including the name of the PDB file or web site with the image:
Include in your summary such information as what you see in the secondary or
tertiary structure, whether the protein binds other atoms, how many chains or
subunits there are, etc.