An Introduction to Bioinformatics

Created for Biology 261 at Widener University

Frances E. Weaver; last revised November 17, 2008

Objectives: Develop basic computer database skills that will permit future explorations of biomolecular topics. Learn how to access and use scientific databases. Learn how to use the Medline database to access the cell and molecular biology literature. Learn to distinguish between primary and secondary sources in the scientific literature (peer-reviewed research articles vs. reviews). Learn how to use the program BLAST to search public databases for the identity of an "unknown" nucleotide sequence. Improve your understanding of the relationship between genes and the proteins they encode. Develop the skills needed to identify products encoded in the horseshoe crab cDNA inserts we may be able to send for sequencing.

What is due: Answers to the questions on this sheet, to be reviewed by the instructor at the end of the lab.
Please bring your textbook to lab!


 

Identifying and Researching an "Unknown" Nucleotide or Protein Sequence

You say:
gcccaagaag cccatcctgg gaaggaaaat gcattgggga accctgtgcg gattcttgtg gctttggccc tatcttttct atgtccaagc tgtgcccatc caaaaagtcc aagatgacac caaaaccctc atcaagacaa ttgtcaccag gatcaatgac atttcacaca cgcagtcagt ctcctccaaa cagaaagtca ccggtttgga cttcattcct gggctccacc ccatcctgac

And I say:  Leptin
 

The sequence of nucleotides in DNA (the gene) is copied into mRNA (the transcript), and that copy is decoded in the process of translation to produce the protein. The order of amino acids in a protein can be predicted if you know the gene or cDNA sequence and the genetic code. Once the primary sequence of the protein is known, we can use powerful bioinformatics techniques to infer its likely function by comparing its sequence to those of proteins with known functions.
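As a minimal illustration of the paragraph above, the Python sketch below copies a DNA coding strand into mRNA and decodes it three bases at a time using the standard genetic code. The function names (transcribe, translate) are made up for this handout. The input is a short stretch of the sequence at the top of this sheet, beginning at an atg codon, and the sketch assumes the reading frame starts at the first base given.

```python
# Standard genetic code, written in the conventional U, C, A, G order.
BASES = "UCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))  # "*" marks a stop codon


def transcribe(dna):
    """Copy a DNA coding strand into mRNA: same order, with U in place of T."""
    return dna.replace(" ", "").upper().replace("T", "U")


def translate(mrna):
    """Decode an mRNA, codon by codon, into one-letter amino acid codes."""
    protein = []
    # Step through complete codons only; leftover bases at the end are ignored.
    for i in range(0, len(mrna) - 2, 3):
        protein.append(CODON_TABLE[mrna[i:i + 3]])
    return "".join(protein)


mrna = transcribe("atgcattgggga")
print(mrna)             # AUGCAUUGGGGA
print(translate(mrna))  # MHWG
```

The sketch only handles the letters A, C, G, and T (or U); real sequence files may contain other symbols, such as N for an unknown base.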

A. Using BLAST
    BLAST (Basic Local Alignment Search Tool) uses an algorithm (a step-by-step computational procedure) to compare an input sequence to a nucleotide or protein sequence database. Researchers use it to identify unknown sequences they have generated in creating cDNA or genomic libraries, or to verify the identity of a particular DNA sequence they have been attempting to clone. BLAST is one of the most generally useful programs ever written for bioinformatics.

    It is possible to submit a nucleotide sequence, have the computer translate it in all six reading frames, and compare those translations to databases of protein sequences. That is what we will do today. It is practice for a real research situation: identifying sequences that might be returned to us from our horseshoe crab cDNAs.
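The six-frame translation described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code BLAST itself runs; the function names and frame labels (+1 to +3, -1 to -3) are assumptions made for this handout, and the input must contain only the letters A, C, G, and T.

```python
# Standard genetic code for DNA codons, in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = dict(zip(
    [a + b + c for a in BASES for b in BASES for c in BASES], AMINO_ACIDS))
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read in its own 5' to 3' direction."""
    return dna.translate(COMPLEMENT)[::-1]


def translate_frame(dna, offset):
    """Translate starting at position offset (0, 1, or 2); '*' is a stop."""
    return "".join(CODON_TABLE[dna[i:i + 3]]
                   for i in range(offset, len(dna) - 2, 3))


def six_frames(dna):
    """Translate both strands in all three frames each."""
    dna = dna.replace(" ", "").upper()
    frames = {}
    for label, strand in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{label}{offset + 1}"] = translate_frame(strand, offset)
    return frames


for name, protein in six_frames("atgcattgggga").items():
    print(name, protein)
```

A translated search compares each of the six predicted protein sequences to the protein database, so you do not need to know in advance which strand or which frame actually encodes your protein.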

   Answer these questions before you begin (you may use any source; please reference your sources).

1. What is meant by translation?
 

2. What is a reading frame?
 

3. Why are there six reading frames, not three?

 

4. What is homology in the context of nucleotide or protein sequences?

 

 

Sources used:

 

5. Open the document provided at this link. We are going to look for protein sequences identical or homologous to the one encoded by these nucleotides.

6. Select the sequence you have been assigned, being certain to select the top line that includes the character >, and copy it.

 

7. Open Internet Explorer

8. Visit the BLAST home page http://www.ncbi.nlm.nih.gov/BLAST/ (may take some time to load)

 

9. Scroll down to and click on: Search protein database using a translated nucleotide query [blastx]
 

10. Paste your copied sequence into the search box (shown above). DO NOT CHANGE ANYTHING ELSE.

11. Click on the button labeled BLAST at the bottom of the screen.

You must wait to check your results; this may take some time.

When you have the results:

Clicking on a score will show you an alignment (similar to what some of you might have made with CLUSTALW in Biology Workbench for Biology 161), but for only two sequences at a time.

What do the scores mean?

The higher the score (in bits), the better the alignment, and the more likely it is that your sequence is identical or closely related to the one found in the database.
The E value is the number of matches of this quality that are predicted to occur by chance alone in a database the size of the one you just searched. Very low E values, such as 4e-36, mean that it is extremely unlikely that you would get a match that good by chance. Higher E values, such as 5 or 105, mean that a match of this quality occurs often simply by chance.
Notice that E can exceed 1, so it is not a probability value.
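To make the relationship between the two numbers concrete: for a bit score S', a query of length m, and a database of total length n, the standard BLAST statistics give E = m x n x 2^(-S'), and the chance of seeing at least one such match is P = 1 - e^(-E). The Python sketch below applies those two formulas. The function names and the example lengths are hypothetical, and the sketch uses raw lengths, whereas BLAST itself uses adjusted ("effective") lengths, so it will not reproduce the E values on your results page exactly.

```python
import math


def expect_value(bit_score, query_length, database_length):
    """E = m * n * 2^(-S'): expected number of chance matches this good."""
    return query_length * database_length * 2.0 ** (-bit_score)


def chance_probability(e_value):
    """P = 1 - e^(-E): probability of at least one chance match this good."""
    return -math.expm1(-e_value)


# Hypothetical search: an 80-letter query against a database of 10**9 letters.
for score in (20, 40, 150):
    e = expect_value(score, 80, 10**9)
    print(score, e, chance_probability(e))
```

Each additional bit of score cuts E in half, which is why high scores go with very small E values. For very small E, P is nearly equal to E; as E grows past 1, P levels off just below 1 while E keeps climbing.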

So what the heck is this thing?

Clicking on the link that begins with "gi" will show you the database record that your sequence matched.
Follow some of the links at the top of your list of sequences producing significant alignments, scrolling down to gather information as each one comes up. If the information is not what you need, try reading down the file a bit.

12.  Based on the results of BLAST, what protein have you been assigned?
 
 
 

13. Is this protein associated with a human disease or trait? If so, which disease or trait?

 

 


14.  Visit the Genes and Disease site of the National Institutes of Health

http://www.ncbi.nlm.nih.gov/books/bv.fcgi?call=bv.View..ShowSection&rid=gnd.preface.91
and search for the protein's involvement in a disease or trait.  The search box is in the upper right corner.

15. Summarize what you find out about the disease or trait here in your own words.

 

 

 


 
II. Locating the scientific literature:
Visit Widener University's Library Web site http://www3.widener.edu/Academics/Libraries/Wolfgram_Memorial_Library/480/
to access databases of the scientific literature. Here we "pretend" that we don't know which database to use.

First click Databases by title (on the right-hand side).

In the left-hand panel, click on the following, in order:
Find Articles
Research by Subject Area
Biology
Find Articles

Your best bets for peer-reviewed scientific articles are Basic Biosis and Medline. Medline (aka PubMed, but not PubMed Central) is the current first choice for anything that has medical relevance (insulin, for example) and for most cellular or molecular biology topics.

Click on PubMed

Search for articles related to your protein

Write in this space the reference information (authors, title, source, date, pages, etc.; use abbreviated form) for any three refereed (aka primary source) research articles dealing in some way with your protein. (How do I know it's a primary source? See below*.)
Abbreviate author lists and titles here, please

1.
 
 

2.
 
 

3.
 

Locate a review article on your topic and write the reference information in the space below. (How do I know it's a review or secondary source? See below*.) Hint: you can search for articles by type.

4.
 
 

Is the full text available (for free) online?
Some articles are. To find some, limit your search to full text only or use PubMed Central.
Once you have found one, follow the link to see what the full text looks like.
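The kinds of searches used in this part (by protein name, by article type, and limited to free full text) can also be written as PubMed query strings. The Python sketch below only builds search addresses for NCBI's E-utilities "esearch" service; it does not contact the server. The function name is hypothetical, leptin is used as a stand-in for your own protein, and the tags review[pt] and free full text[sb] reflect my understanding of PubMed's search syntax, so check them against the PubMed help pages before relying on them.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities search service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(term, max_results=20):
    """Build (but do not fetch) an esearch address for a PubMed query."""
    query = {"db": "pubmed", "term": term, "retmax": max_results}
    return ESEARCH + "?" + urlencode(query)


protein = "leptin"  # stand-in; replace with the protein you were assigned
searches = {
    "any article": protein,
    "review articles": protein + " AND review[pt]",
    "free full text": protein + " AND free full text[sb]",
}
for label, term in searches.items():
    print(label, pubmed_search_url(term))
```

The same query strings can be typed directly into the PubMed search box, which is all you need for this lab.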

2. List the reference information for a full text online article about your topic.
 




3. What would that article be useful for? Is it useful for cellular or molecular information?


 

 

 

IIc. Explore some other useful databases: go to databases listed by title http://www3.widener.edu/Academics/Libraries/Wolfgram_Memorial_Library/Find_Articles/657/

4.  What sorts of articles can you find in

EBSCOhost? (Do not write down titles, just the types, as in "reviews, full text," etc.)



In Scirus?

 
In JSTOR?

In Science Direct?

* A peer-reviewed or refereed research article in a professional journal will read very much like a lab report. It will have an abstract, an introduction, a materials and methods section, a results section, and a discussion section. Such articles are written by the people who did the research and are therefore primary sources for science. If you read through an abstract and the authors tell you what experiments they did, then you have likely found a primary source. Review articles are secondary sources, although such articles often have an abstract as well. In the First Search engine, the article type is given beneath each abstract.

Once you have located articles and explored other databases, go on to the next part, if time allows.

5.      Visit Molecules To Go  http://molbio.info.nih.gov/cgi-bin/pdb
Enter the name of your protein in the search box, and see if a Protein Data Bank file is available.  

Pick one from the list, select the Jmol PDB viewer, and use the navigation windows to explore the structure.

If Molecules To Go doesn't have a structure, try a general search engine, such as Google, and do an image search.

Summarize what you find here, including the name of the PDB file or web site with the image:

 

 

 

 

Include in your summary such information as what you see in the secondary or tertiary structure, whether the protein binds other atoms, and how many chains or subunits there are.
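Some of this information can also be read straight from the PDB file, which is plain text. In the PDB format, ATOM records describe the protein's own atoms and HETATM records describe other bound atoms and molecules; the residue name sits in columns 18-20 and the chain identifier in column 22. The Python sketch below counts chains and lists bound groups under those assumptions. The function names are hypothetical, and the four demo records are invented for illustration, not taken from any real structure.

```python
def make_record(kind, serial, atom, residue, chain, res_number):
    """Hypothetical helper: lay out the start of a PDB-style record."""
    return f"{kind:<6}{serial:>5} {atom:<4} {residue:>3} {chain}{res_number:>4}"


def summarize_pdb(pdb_text):
    """Return ({chain: atom count}, {names of bound non-water groups})."""
    chains = {}
    bound_groups = set()
    for line in pdb_text.splitlines():
        if len(line) < 22:
            continue  # too short to hold a chain identifier
        if line.startswith("ATOM"):
            chain = line[21]  # column 22
            chains[chain] = chains.get(chain, 0) + 1
        elif line.startswith("HETATM"):
            name = line[17:20].strip()  # columns 18-20
            if name != "HOH":  # skip water molecules
                bound_groups.add(name)
    return chains, bound_groups


# Invented records: two chains, one zinc ion, one water.
demo = "\n".join([
    make_record("ATOM", 1, "N", "MET", "A", 1),
    make_record("ATOM", 2, "CA", "MET", "A", 1),
    make_record("ATOM", 3, "N", "VAL", "B", 1),
    make_record("HETATM", 4, "ZN", "ZN", "A", 201),
    make_record("HETATM", 5, "O", "HOH", "A", 301),
])
print(summarize_pdb(demo))
```

To try it on a real structure, save the PDB file from the viewer page and pass its contents to summarize_pdb in place of the demo text.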