Friday, October 22, 2010

Stochastic Process and Markov Chain

Background
To understand markov chains you need to understand stochastic processes and to understand stochastic processes you need to understand random variables. I'm going to explain random variables briefly here in a very intuitive sense. For more details, refer to the earlier blog post at http://randomvars.blogspot.com/2010/10/fundamental-ideas-in-probability.html.

The most basic thing in probability is the experiment which is a set of trials. The experiment could be "two flips of a coin" consisting of two trials, each of which is a single flip of the coin. The possible outcomes of this experiment are { HH, HT, TH, TT }. This set is called the sample space.

A random variable is an interesting numerical property of the experiment. In our "two flips of a coin" experiment, an interesting numerical property could be the number of heads in any outcome. This would be the random variable:
X : number of heads

The random variable is described by two things: a range and a probability distribution. The range is the possible values that the random variable can take on and the probability distribution is the probability of each of those values.

In our case, the range of X = { 0, 1, 2 } :
0 when the outcome is TT
1 when the outcome is HT or TH and
2 when the outcome is HH

Each of these numbers is associated with a probability. The expression of the probability of each outcome is called the probability mass function. In our case the probability mass function is:

p(0) = 1/4
p(1) = 1/2 and
p(2) = 1/4

Sequence of Random Variables
Stochastic processes require the understanding of the concept of a sequence of random variables. For this, we need to consider a different experiment.

Consider a telephone switchboard that can handle up to 100 simultaneous calls. Define a random variable (an interesting numerical property of the experiment) X as the number of active calls at a given instant of time. The range of this random variable is { 0, 1, 2, ... 100 } which represents the specific number of possible active calls at any given time.

X has a probability mass function that describes the probability of each of the following:
0 active calls,
1 active call,
2 active calls and so on up to
100 active calls.

Now consider a related random variable X1 which is the number of active calls at exactly 10:00 am. And another random variable X2 which is the number of active calls at 10:05 am and so on till X100 which is the number of active calls around ten hours later.

X1 to X100 is a sequence of random variables. They all have the same probability mass function. And we could make the simplifying assumption that they are all independent (most calls dont last longer than 5 minutes). In that case, this sequence of random variables is said to be independent and identically distributed (abbreviated as "iid").

Stochastic Processes
Average:
Let us consider the average of the sequence of random variables:
A = (X1 + X2 + ... + Xn) / n

This means that to find the average number of active calls at any given time, we would add up the values obtained for all the measurements and divide by the total number which is the usual sense of average.

Stochastic Process:
Suppose we consider the outcome of the whole sequence [X1, X2, X3...Xn] as a list. For example, in one experiment we may get the result [32, 35, 40, ..., 22, 23] as the number of active calls at different times. This is one possible outcome of the joint sequence.

Such a joint sequence is called a stochastic process. It is described by the joint probability mass function which is the product of the individual probabilities of the Xi's.

For example, the probability of getting the vector [32, 35, 40, ..., 22, 23] will be given by the product of the probabilities of getting 32 and 35 and 40 and ... and 22 and 23:

p(32, 35, ...23) = p(32) x p(35) x ... p(23).

It gives a complete progression of a sequence of random variables and it has a probability associated with each of the possible outcomes.

Markov Chains
Consider a stochastic process defined by the unending sequence X1, X2,...Xn.... of random variables. Add to this stochastic process the property that the probability of the next random variable in the sequence depends on the value of the previous random variable. Such a stochastic process is called a markov chain. The term "chain" comes from the property (described in detail below) that gives the probability of going from one Xi to the next Xi+1.

Adding this property implies that the probability of Xn depends on the probability of Xn-1. But, Xn-1 could represent a hundred different values from 0 active calls to 100 active calls and Xn itself could represent a hundred different values. This means that there is a probability of Xn-1 being, say, 10 and Xn being 93; and any other possible combination.

This kind of thing can only be defined by a 100 X 100 matrix which we cant show here, but if we were to reduce the problem to a maximum of 3 active calls, we would get a matrix








0 1 2 3
0 1/2 1/4 1/8 1/8
1 1/4 1/4 2/6 1/6
2 1/8 1/4 1/2 1/8
3 0 1/4 1/2 1/4


The first row of the matrix says that when Xn-1 is 0 then the probability of Xn also being 0 is 1/2 or 50%; the probability of Xn being 1 is 1/4 or 25% and so on.

This matrix is called a transition probability matrix for the markov chain.

Steady State Probabilities
There is an interesting consequence of the markov chain. Since the probability of transition into the next element in the sequence depends on the previous element in the sequence, the probability mass function changes every time the next element in the sequence comes up.

This is because the probability mass function for a random variable (of Xn) defines the probability of the random variable (Xn) taking on different possible values in its range. And that is exactly what the row of the transition matrix (corresponding to Xn-1) defines. But it is not the whole story.

The probabilities of Xn taking on different values is the probability mass function which can be calculated as:
the probability of the outcome taking on the first value in the range (viz. 0 active calls) times the transition probability of going from the first value to the Xn value
And this is then summed over all the values in the range of X:

pmf (stochastic process) = q(s) = p(x1)Q(Xn|x1) + p(x2)Q(Xn|x2) +... for the range of X

If the condition that q(s) does not change occurs, then the markov chain is said to be in the steady state.

Tuesday, October 19, 2010

Fundamental Ideas in Probability through Examples

Counting

The first fundamental idea in probability is that of counting.

There is an idea of an experiment which when conducted will result in an outcome - for example, tossing a coin, throwing a die or dice, finding whether a disease exists in a population etc.

The total set of all possible outcomes for the experiment is called the sample space. For example, for the experiment of tossing a coin, the set of all possible outcomes is the set { H, T }. For the experiment of two flips of a coin, the sample set is { HH, HT, TH, TT }. And so on. Enumerating these sets of outcomes is one aspect of counting in probability.

An event is an expression of a set of possible outcomes. For example, getting exactly one H in two flips of a coin is an event; or getting at least one H in two flips is another event.

And the other aspect of counting is enumerating the outcomes for a given event. For example, for the event of getting exactly one H in two flips of a coin is {HT, TH}.

The definition of probability is:

P(E) = n(E) / n(S)

which says that the probability of the occurrence of the event E is the ratio of the number of outcomes corresponding to E to the total number of outcomes possible.

Conditional Probability

If there are two events defined on a sample space, it is possible that the probability of one event is changed if the outcome of the other event is known.

Consider the experiment of two flips of a coin: the sample space is { HH, HT, TH, TT }. Consider two events: A is the event of getting one H in two flips and B is the event of getting a T in the first flip.

The question is, what is the probability of A when the event B has occurred? To answer this question we have to get back to counting: If B has occurred, then the possible outcomes are { TH, TT }. So, A can occur only in the case TH. Therefore, the probability of A occurring is 1/2.

Without the knowledge of B having occurred, the probability of A occurring would be 3/4. The knowledge that one T has already occurred has changed the probability for the event A. This is known as conditional probability.

P(A|B) is the notation for the conditional probability of A when event B has occurred.

The following relationship holds for conditional probability:
P(A|B) = P(A and B) / P(A)

In our example, the probability of A by itself is 3/4 and that for B by itself is 1/2.

The calculation becomes: (3/4 x 1/2) / (3/4) = (3/8) / (3/4) = (4/3) x (3/8)

= 1/2 (same as what we got through counting).

Bayes Theorem

Certain relationships exist between simple and conditional probabilities that make it possible to infer additional probabilities. Bayes theorem provides these relationships.

As an example, consider the following situation:
A certain disease exists in 0.5% of a population (0.005).
A certain blood test is 99% accurate when the disease is present (0.99)
The blood test gives a positive result when the disease does not exist in 5% cases (0.05)

The first statement above expresses the probability of the event (A) of the disease existing in a randomly chosen member of the population. The notation for this probability is P(A).

A second event B of getting a positive result of the blood test is also implied. However, the statements regarding this event are expressed as conditional probabilities:

P(B|A) expresses that the blood test will give a positive result when the disease present. That is, when it is already known that the disease is present, what is the probability of getting a positive result? This answer is 0.99

The third known probability is P(B|~A) which expresses that the blood test will give a positive result when it is known that the disease is not present. This value is 0.05.

What is not known is the probability of having the disease when the result of the test is positive.

If this number is lower than say 0.8 then it would imply that two people in ten who tested positive dont actually have the disease. Which would make the test quite useless for prescribing medication with side effects since two people out of ten would needlessly have to suffer the side effects. But if the number worked out to 0.95 then it would make the test quite useful.

Using Bayes theorem, it is possible to answer this and other related questions. Bayes theorem itself and the related calculations are the subject of another blog post. Here, we are presenting the case for and context of Bayes theorem.

Random Variables

Random variables are used to express probabilities of a group of related events. For example, an event could express getting a sum of 5 on the roll of two dice. But, related to this event are events where the sum is 2, 3, ... 12. Random variables are used to express the groups of related events and their respective probabilities.

A random variable is a function that maps outcomes from the sample space into numbers. This statement implies the definition of an experiment without which the set of outcomes is meaningless. Further, this statement also implies that the random variable expresses something about the outcomes. Let's look at an example.

For example, the experiment is two flips of a coin. The sample set is { HH, HT, TH, TT }. One possibility for a random variable could be:

X: the number of heads

The values for X are { 0, 1, 2 }

So, X has taken the outcomes and mapped them into a set of numbers.

Now, each of the values of X has a probability associated with it:

P(X=0) = 1/4
P(X=1) = 1/2
P(X=2) = 1/4

The above can be expressed as a function F(X) whose individual values are f(x) where each f(x) is actually P(X=t). This means:

f(0) = P(X=0) = 1/4
f(1) = P(X=1) = 1/2
f(2) = P(X=2) = 1/4

The function F(X) is called the probability mass function.

There is one more commonly associated concept with random variables - the Expected Value

The Expected Value E(X) is the weighted sum of the values of X with their corresponding probabilities. For the above example, the expected value is:

E(X) = 0 x f(0) + 1 x f(1) + 2 x f(2)
= 0 + 1/2 + 2/4
= 1

What this implies is that if the experiment is carried out for a large enough number of times and the values of X obtained from the experiments are averaged, then the answer will be 1.

This is reasonable because we are likely to get the value 0 for a quatter of the time, 1 for half the time and 2 for another quarter of the time. Like a sequence like: 1,2,1,0,1,0,1,2,... and the average of this is going to be 1 (or close to it).

So the Expected Value of a random variable is the average of the numbers obtained from a large number of experiments.