
An example: finding CpG islands

This example is taken from the excellent textbook Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids by Durbin, Eddy, Krogh and Mitchison. CpG islands are regions of the genome with a higher than normal percentage of C and G bases adjacent to each other. The usual percentage of adjacent CG bases in the genome is about 1%, but in CpG islands that percentage is over 6%. The reason that C followed by G is relatively rare in ... The "p" in "CpG" refers to the phosphodiester bond between the cytosine and the guanine, and serves to distinguish it from the C and G pairing across the two strands of the DNA double helix.

CpG islands are biologically interesting because they occur in or near 40% of mammalian gene promoters and 70% of human gene promoters. CpG islands vary in length between 300 and 3000 base pairs, so approaches based on a fixed-length consensus sequence do not work well for detecting them. Effective identification of CpG islands can aid in localizing genes in eukaryotes. CpG island detection also serves as an excellent problem to illustrate the power of Markov models.
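To make the 1% versus 6% figures concrete, the fraction of adjacent CG pairs in a sequence can be measured directly. The following is a minimal sketch only; the function name `cg_dinucleotide_fraction` is a hypothetical helper introduced for illustration and is not part of the textbook's method.

```python
def cg_dinucleotide_fraction(seq: str) -> float:
    """Return the fraction of adjacent base pairs in seq that are C followed by G."""
    seq = seq.upper()
    n_pairs = len(seq) - 1  # number of adjacent positions (i, i + 1)
    if n_pairs < 1:
        return 0.0  # a sequence shorter than 2 bases has no adjacent pairs
    cg_count = sum(
        1 for i in range(n_pairs) if seq[i] == "C" and seq[i + 1] == "G"
    )
    return cg_count / n_pairs
```

A value near 0.01 would match the genome-wide background quoted above, while a value above 0.06 would be in the range quoted for CpG islands.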

We will consider two problems.

  • Given a short DNA sequence, does it come from a CpG island or not?
  • Given a long DNA sequence, find all the CpG islands on it, if any.

Generative models of biological sequences

We will construct generative models of CpG islands. A generative model produces strings, and the model parameters are tuned to reflect the characteristics of CpG islands.

Generative models for CpG island detection

The simplest probabilistic generative DNA sequence model associates a probability with the occurrence of each base: P(A), P(C), P(G) and P(T), where the four probabilities sum to 1. For H. influenzae, these probabilities are P(A) = 0.3, P(C) = 0.2, P(G) = 0.2, and P(T) = 0.3. To generate a sequence from this model, we first choose the length L of the sequence that we wish to construct. Then we draw a base for each position from the discrete distribution above, as shown in the code fragment below.

i = 1
while i <= L do
    S[i] = a base drawn from the discrete probability distribution [0.3, 0.2, 0.2, 0.3] (for A, C, G, T)
    i = i + 1
end
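The same procedure can be sketched in runnable form. This is one possible rendering in Python, assuming the standard library's `random.choices` for the weighted draw; the names `BASES`, `BASE_PROBS` and `generate_independent` are hypothetical and chosen only for this illustration.

```python
import random

BASES = ("A", "C", "G", "T")
# Base probabilities quoted in the text for H. influenzae, in the order A, C, G, T.
BASE_PROBS = [0.3, 0.2, 0.2, 0.3]


def generate_independent(length, rng=None):
    """Generate a sequence in which every base is drawn independently."""
    rng = rng or random.Random()
    # choices() draws `length` bases with replacement, weighted by BASE_PROBS.
    return "".join(rng.choices(BASES, weights=BASE_PROBS, k=length))
```

Passing a seeded generator, for example `generate_independent(20, random.Random(0))`, makes the output reproducible.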

This model does not capture interdependencies between bases. It assumes that the choice of base in each position of the generated sequence is independent of the bases surrounding it. A more complex model of DNA sequences can be constructed using the theory of Markov chains. In Markov chains, the probability of observing a base at a given position in a sequence is conditioned on the bases preceding it. Thus, Markov chains can model local correlations among the nucleotides. A Markov chain of order 1 assumes that the probability of a base at position i is dependent only on the base at position i - 1. A first order Markov chain can be specified by a probability matrix as shown below.

A first-order Markov model for generating DNA sequences (rows: base at position i - 1; columns: base at position i; each row sums to 1)

       A     C     G     T
  A   0.6   0.2   0.1   0.1
  C   0.1   0.1   0.8   0.0
  G   0.2   0.2   0.3   0.3
  T   0.1   0.8   0.0   0.1
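A sequence can be generated from this matrix by drawing each base from the row selected by the previous base. The sketch below reads each row as the distribution over the next base, which is consistent with every row summing to 1. The text does not say how the first base is chosen, so drawing it uniformly is an assumption made here. All names (`TRANSITIONS`, `generate_markov`, `transition_probability`) are hypothetical.

```python
import random

BASES = ("A", "C", "G", "T")

# TRANSITIONS[prev][j] is the probability that BASES[j] follows prev.
TRANSITIONS = {
    "A": [0.6, 0.2, 0.1, 0.1],
    "C": [0.1, 0.1, 0.8, 0.0],
    "G": [0.2, 0.2, 0.3, 0.3],
    "T": [0.1, 0.8, 0.0, 0.1],
}


def generate_markov(length, rng=None):
    """Generate a sequence from the first-order Markov chain above."""
    rng = rng or random.Random()
    if length <= 0:
        return ""
    # Assumption: the first base is uniform; the source gives no initial distribution.
    seq = [rng.choice(BASES)]
    while len(seq) < length:
        # The next base depends only on the most recent base.
        row = TRANSITIONS[seq[-1]]
        seq.append(rng.choices(BASES, weights=row, k=1)[0])
    return "".join(seq)


def transition_probability(seq):
    """Product of the transition probabilities along seq (first base excluded)."""
    p = 1.0
    for prev, cur in zip(seq, seq[1:]):
        p *= TRANSITIONS[prev][BASES.index(cur)]
    return p
```

Because the C-to-T and T-to-G entries are 0, the pairs CT and TG never appear in a generated sequence, and any sequence containing them has transition probability 0 under this model.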

Source:  OpenStax, Statistical machine learning for computational biology. OpenStax CNX. Oct 14, 2007 Download for free at http://cnx.org/content/col10455/1.2