An Introduction to Bioinformatics
Created for Biology 261 at
Frances E. Weaver; last revised November 17, 2008
Objectives: Develop some basic computer database skills that will permit future explorations on
Learn how to access and use scientific databases. Learn how to use the database Medline to access the cell and molecular biology literature. Learn to distinguish between primary and secondary sources in the scientific literature (peer-reviewed research articles vs. reviews). Learn how to use the program BLAST to search public databases for the identity of an "unknown" nucleotide sequence. Improve your understanding of the relationship between genes and the proteins they encode. Develop the skills needed to identify the products encoded in the horseshoe crab cDNA inserts we may be able to send for sequencing.
What is due: Answers to the questions given on this sheet, due to be
reviewed by the instructor at the end of the lab.
Please bring your textbook to lab!
Identifying and Researching an "Unknown" Nucleotide or Protein Sequence
gcccaagaag cccatcctgg gaaggaaaat gcattgggga accctgtgcg gattcttgtg gctttggccc tatcttttct atgtccaagc tgtgcccatc caaaaagtcc aagatgacac caaaaccctc atcaagacaa ttgtcaccag gatcaatgac atttcacaca cgcagtcagt ctcctccaaa cagaaagtca ccggtttgga cttcattcct gggctccacc ccatcctgac
And I say: Leptin
The sequence of nucleotides in DNA (the gene) is copied out in the mRNA form (the transcript) and that copy is decoded in the process of translation to produce the protein. The order of amino acids in a protein can be predicted if you know the gene or cDNA sequence and the genetic code. Once the primary sequence of the protein is known, we can use powerful bioinformatics techniques to deduce its function by comparing this protein's sequence to those of proteins with known functions.
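The gene-to-protein relationship described above can be sketched in a few lines of Python. This is only an illustration, not part of the lab procedure: the function names `transcribe` and `translate` are made up for this example, the codon table is the standard genetic code, and the short test sequence is a stretch copied from the "unknown" sequence printed above.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """Return the mRNA form of a coding-strand DNA sequence (T becomes U)."""
    return "".join(dna.split()).upper().replace("T", "U")


def translate(dna):
    """Translate a DNA sequence codon by codon, stopping at the first stop codon.

    Reading starts at the first base, so the caller chooses the reading frame.
    Codons containing unexpected characters are reported as 'X'.
    """
    dna = "".join(dna.split()).upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


# A 21-base stretch taken from the sequence shown earlier in this handout.
print(transcribe("atgcattggggaaccctgtgc"))
print(translate("atgcattggggaaccctgtgc"))
```

The one-letter protein string printed by `translate` is the kind of sequence that is compared against protein databases.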
A. Using BLAST
BLAST (Basic Local Alignment Search Tool) uses an algorithm, a defined step-by-step computational procedure, to compare an input sequence to a nucleotide or protein sequence database. Researchers use it to identify unknown sequences they have generated in creating cDNA or genomic libraries, or to verify the identity of a particular DNA sequence they have been attempting to clone. BLAST is one of the most generally useful programs ever written for bioinformatics.
It is possible to submit a nucleotide sequence, have the computer translate it in all six reading frames, and compare those translations to databases of protein sequences. That is what we will do today. This is training for a real research situation: identifying the sequences that might be returned to us from our horseshoe crab cDNAs.
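As a rough picture of what "all six reading frames" means, the sketch below splits a sequence into codons starting at each of the three possible offsets on the given strand and on its reverse complement. The function names and the frame labels (+1 to +3, -1 to -3) are choices made for this example; BLAST's own implementation is not shown here.

```python
# Translation table that swaps each base for its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def six_frames(dna):
    """Return a dict mapping frame labels to lists of codons.

    Frames +1, +2, +3 start at the first, second, and third base of the
    given strand; frames -1, -2, -3 do the same on the reverse complement.
    Leftover bases at the end of a frame (fewer than three) are dropped.
    """
    dna = "".join(dna.split()).upper()
    frames = {}
    for sign, strand in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            codons = [
                strand[i:i + 3]
                for i in range(offset, len(strand) - 2, 3)
            ]
            frames[f"{sign}{offset + 1}"] = codons
    return frames


for label, codons in six_frames("ATGCATTGG").items():
    print(label, " ".join(codons))
```

Each list of codons could then be translated with the genetic code to give one candidate protein sequence per frame.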
Answer these questions before you begin (you may use any source; please reference your sources).
1. What is meant by translation?
2. What is a reading frame?
3. Why are there 6 reading frames, not three?
4. What is homology in the context of nucleotide or protein sequences?
We are going to look for protein sequences identical to or homologous to the one encoded by these nucleotides.
6. Select the sequence you have been assigned, being certain to select the top line that includes the character >, and copy it.
7. Open Internet Explorer
8. Visit the BLAST home page http://www.ncbi.nlm.nih.gov/BLAST/ (may take some time to load)
9. Scroll down to and click on: Search protein database using a translated nucleotide query [blastx]
10. Paste your copied sequence into the search box (shown above). DO NOT CHANGE ANYTHING ELSE.
11. Click on the button labeled BLAST at the bottom of the screen.
You must wait for your results; this may take some time.
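For the curious, the same kind of blastx search can in principle be submitted from a script instead of the web form. The sketch below only builds the request and does not send anything. It assumes NCBI's public BLAST URL API at `https://blast.ncbi.nlm.nih.gov/Blast.cgi` with the parameters `CMD`, `PROGRAM`, `DATABASE`, and `QUERY`; the database name `nr`, the function name, and the FASTA header `>unknown_1` are illustrative choices, not settings taken from this handout.

```python
from urllib.parse import urlencode

# Assumed endpoint of the NCBI BLAST URL API (not the home page used in lab).
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_blastx_request(fasta_text, database="nr"):
    """Return the URL-encoded form body for submitting a blastx search.

    fasta_text should include the header line that starts with '>',
    just as when copying the assigned sequence by hand.
    """
    params = {
        "CMD": "Put",            # submit a new search
        "PROGRAM": "blastx",     # translated nucleotide query vs. protein database
        "DATABASE": database,
        "QUERY": fasta_text,
    }
    return urlencode(params)


# Hypothetical FASTA record: a header line plus a short nucleotide sequence.
body = build_blastx_request(">unknown_1\natgcattggggaaccctgtgc")
print(BLAST_URL)
print(body)
```

As with the web form, a real submission would have to wait for the search to finish before the results could be retrieved.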
When you have the results:
Clicking on score will show you an alignment (similar to what some of you might have made with CLUSTALW in Biology Workbench for Biology 161), but for only two sequences at a time.
What do the scores mean?
The higher the score (in bits), the better the alignment, and the more likely it is that your sequence is identical or closely related to the one found in the database.
The E value is the number of matches of this quality that are predicted to occur by chance alone in a database the size of the one you just searched. Very low E values, such as 4e-36, mean that it is extremely unlikely that you would get a match that good by chance. Higher E values, such as 5 or 105, mean that a match of this quality occurs often (very often, all the time) simply by chance.
Notice that E can exceed 1, so it is not a probability value.
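To make the difference between E and a probability concrete, here is a small sketch based on the standard relationship between bit score and E value, E = m × n × 2^(−S′), where S′ is the bit score and m and n are the query and database lengths. The numbers below are invented for illustration, and real BLAST reports use adjusted ("effective") lengths, so do not expect these values to match your results page exactly.

```python
import math


def expected_hits(bit_score, query_length, database_length):
    """E value: expected number of chance matches scoring at least bit_score."""
    return query_length * database_length * 2.0 ** (-bit_score)


def probability_of_at_least_one(e_value):
    """Convert an E value to the probability of seeing one or more chance hits."""
    return -math.expm1(-e_value)  # same as 1 - exp(-E), but accurate for tiny E


# Made-up search sizes: a 250-residue query against 1 billion residues.
for score in (30, 40, 150):
    e = expected_hits(score, 250, 1_000_000_000)
    p = probability_of_at_least_one(e)
    print(f"bit score {score}: E = {e:.3g}, P = {p:.3g}")
```

For very small E the two numbers are nearly equal, which is why a tiny E value is often read as if it were a probability. For large E, the probability levels off just below 1 while E keeps growing.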
So what the heck is this thing?
Clicking on the link that begins with "gi" will show you the database record that your sequence matched.
Follow some of the links at the top of your list of sequences producing significant alignments, scrolling down to gather information as each one comes up. If the information is not what you need, try reading down the file a bit.
12. Based on the results of BLAST, what protein have you been assigned?
13. Is this protein associated with a human disease or trait? If so, what disease or trait?
14. Visit the Genes and Disease site of the National Institutes of Health
and search for the protein's involvement in a disease or trait. The search box is in the upper right corner.
15. Summarize what you find out about the disease or trait here in your own words.
B. Locating the scientific literature
to access databases of the scientific literature. Here we "pretend" that we don't know which database to use.
First click Databases by title (on the right hand side)
Click on in the left hand panel
Research by Subject Area
Your best bets for peer-reviewed scientific articles are in Basic Biosis and Medline. Medline (aka PubMed, but not PubMed Central) is the current first choice for anything that has medical relevance (insulin, for example), and for most cellular or molecular biology topics.
Click on PubMed
Search for articles related to your protein
Write in this space the reference information (authors, titles, source, date, pages, etc.; use abbreviated form) for any three refereed (aka primary source) research articles dealing in some way with your protein (how do I know it’s a primary source? see below*)
Abbreviate author lists and titles here, please
Locate a review article on your topic and write the reference information in the space below (how do I know it’s a review or secondary source? see below*). Hint: you can search for articles by type.
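Searching by article type can also be expressed as a search term. The sketch below builds a PubMed search address that restricts results to reviews using the publication-type tag `review[pt]`. It assumes NCBI's E-utilities `esearch` service; the function name, the example topic "leptin", and the result limit are illustrative only, and nothing is sent over the network.

```python
from urllib.parse import urlencode

# Assumed address of the NCBI E-utilities search service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(topic, reviews_only=False, max_results=20):
    """Build a PubMed search URL, optionally limited to review articles."""
    term = f"{topic} AND review[pt]" if reviews_only else topic
    query = urlencode({"db": "pubmed", "term": term, "retmax": max_results})
    return f"{ESEARCH_URL}?{query}"


# All articles on the topic, then reviews only.
print(pubmed_search_url("leptin"))
print(pubmed_search_url("leptin", reviews_only=True))
```

The same term, typed into the ordinary PubMed search box, should narrow the list in the same way.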
Is the full text available (for free) on
Some articles are. To find some, limit your search to full text only or use PubMed Central.
Once you have one, surf there to see what that is like.
2. List the reference information for a full text online article about
3. For what would that article be useful? Is it useful for cellular or molecular information?
Some other useful databases: go to the databases listed by title at http://www3.widener.edu/Academics/Libraries/Wolfgram_Memorial_Library/Find_Articles/657/
4. What sorts of articles can you find in EBSCOhost? (Do not write down titles, just types, as in "reviews, full text" etc.)
In Science Direct?
* A peer reviewed or refereed article in a professional journal will read very much like a lab report. It will have an abstract, an introduction, a methods and materials section, a results section and a discussion section. Such articles are written by the people who did the research and are therefore primary sources for science. If you read through an abstract and the authors tell you what experiments they did, then you are likely to have found a primary source. Review articles are secondary sources, although such articles often have an abstract as well. In the First Search engine, the article type is given beneath each abstract.
Once you have located articles and explored other databases, go on to the next part if time allows.
Visit Molecules To Go http://molbio.info.nih.gov/cgi-bin/pdb
Enter the name of your protein in the search box, and see if a Protein Data Bank file is available.
Pick one from the list, select the Jmol PDB viewer, and use the navigation windows to explore the structure.
If Molecules To Go doesn't have a structure, try a general search engine, such as Google, and do an image search.
Summarize what you find here, including the name of the PDB file or web site with the image:
Include in your summary such information as what you see in the secondary or tertiary structure, whether the protein binds other atoms or molecules, how many chains or subunits there are, etc.
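Some of the information asked for above (number of chains, bound atoms or molecules) is also written in the PDB file itself. The sketch below reads the fixed-column `ATOM` and `HETATM` records of a PDB-format file and lists the chain identifiers and the non-water bound groups. The four-line fragment is invented for this example (the coordinates are not from any real structure), and the function name is hypothetical.

```python
def summarize_pdb(pdb_text):
    """Return (chain IDs, bound non-water groups) found in PDB-format text.

    In the PDB format, column 22 holds the chain identifier and
    columns 18-20 hold the residue (or ligand) name.
    """
    chains = set()
    bound_groups = set()
    for line in pdb_text.splitlines():
        if len(line) < 22:
            continue  # too short to carry a chain identifier
        record = line[:6].strip()
        if record == "ATOM":
            chains.add(line[21])
        elif record == "HETATM":
            name = line[17:20].strip()
            if name != "HOH":  # skip water molecules
                bound_groups.add(name)
    return sorted(chains), sorted(bound_groups)


# Invented fragment: two protein chains, one zinc ion, one water molecule.
EXAMPLE_PDB = """\
ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N
ATOM      2  N   MET B   1      12.100   8.250  15.730  1.00 11.20           N
HETATM  100 ZN    ZN A 201      20.000  20.000  20.000  1.00 15.00          ZN
HETATM  101  O   HOH A 301      30.000  10.000   5.000  1.00 20.00           O
"""

chain_ids, groups = summarize_pdb(EXAMPLE_PDB)
print("Chains:", chain_ids)
print("Bound groups:", groups)
```

A count like this is only a starting point; what you see in the viewer should still be described in your own words.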