Friday, October 22, 2010

Stochastic Process and Markov Chain

Background
To understand Markov chains you need to understand stochastic processes, and to understand stochastic processes you need to understand random variables. I'm going to explain random variables briefly here in an intuitive sense. For more details, refer to the earlier blog post at http://randomvars.blogspot.com/2010/10/fundamental-ideas-in-probability.html.

The most basic thing in probability is the experiment, which is a set of trials. The experiment could be "two flips of a coin", consisting of two trials, each of which is a single flip of the coin. The possible outcomes of this experiment are { HH, HT, TH, TT }. This set is called the sample space.

A random variable is an interesting numerical property of the experiment. In our "two flips of a coin" experiment, an interesting numerical property could be the number of heads in any outcome. This would be the random variable:
X : number of heads

The random variable is described by two things: a range and a probability distribution. The range is the possible values that the random variable can take on and the probability distribution is the probability of each of those values.

In our case, the range of X = { 0, 1, 2 } :
0 when the outcome is TT
1 when the outcome is HT or TH and
2 when the outcome is HH

Each of these numbers is associated with a probability. The expression of the probability of each outcome is called the probability mass function. In our case the probability mass function is:

p(0) = 1/4
p(1) = 1/2 and
p(2) = 1/4
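
As a quick illustration (this sketch is my addition, not part of the original post), the probability mass function can be computed by enumerating the sample space in Python:

from itertools import product
from collections import Counter

# Enumerate the sample space of "two flips of a coin": HH, HT, TH, TT.
sample_space = list(product("HT", repeat=2))

# X = number of heads; count how many outcomes give each value of X.
counts = Counter(outcome.count("H") for outcome in sample_space)

# The pmf divides each count by the size of the sample space.
pmf = {x: n / len(sample_space) for x, n in sorted(counts.items())}
print(pmf)  # {0: 0.25, 1: 0.5, 2: 0.25} -- matches p(0), p(1), p(2) above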

Sequence of Random Variables
Stochastic processes require the understanding of the concept of a sequence of random variables. For this, we need to consider a different experiment.

Consider a telephone switchboard that can handle up to 100 simultaneous calls. Define a random variable (an interesting numerical property of the experiment) X as the number of active calls at a given instant of time. The range of this random variable is { 0, 1, 2, ..., 100 }, which represents the possible numbers of active calls at any given instant.

X has a probability mass function that describes the probability of each of the following:
0 active calls,
1 active call,
2 active calls and so on up to
100 active calls.

Now consider a related random variable X1, which is the number of active calls at exactly 10:00 am; another random variable X2, which is the number of active calls at 10:05 am; and so on, five minutes apart, up to X100, which is the number of active calls a little over eight hours later.

X1 to X100 is a sequence of random variables. They all have the same probability mass function. And we could make the simplifying assumption that they are all independent (most calls don't last longer than 5 minutes). In that case, this sequence of random variables is said to be independent and identically distributed (abbreviated as "iid").

Stochastic Processes
Average:
Let us consider the average of the sequence of random variables:
A = (X1 + X2 + ... + Xn) / n

This means that to find the average number of active calls at any given time, we would add up the values obtained for all the measurements and divide by the total number of measurements, which is the usual sense of average.
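
Here is a small simulation sketch of this (my own illustration; the post doesn't specify the actual distribution of call counts, so the 0.3 per-line busy probability below is a purely made-up stand-in):

import random

# Simulate 100 iid observations X1 ... X100 of "number of active calls",
# modelling each of the 100 lines as busy with probability 0.3
# (an illustrative assumption, not a value from the post).
random.seed(42)
observations = [sum(random.random() < 0.3 for _ in range(100))
                for _ in range(100)]

# The average A = (X1 + X2 + ... + Xn) / n
A = sum(observations) / len(observations)
print(A)  # close to 30, the expected number of active calls under this model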

Stochastic Process:
Suppose we consider the outcome of the whole sequence [X1, X2, X3...Xn] as a list. For example, in one experiment we may get the result [32, 35, 40, ..., 22, 23] as the number of active calls at different times. This is one possible outcome of the joint sequence.

Such a joint sequence is called a stochastic process. It is described by the joint probability mass function which, because we assumed the Xi's to be independent, is simply the product of the individual probabilities of the Xi's.

For example, the probability of getting the vector [32, 35, 40, ..., 22, 23] will be given by the product of the probabilities of getting 32 and 35 and 40 and ... and 22 and 23:

p(32, 35, ..., 23) = p(32) x p(35) x ... x p(23)

It gives a complete progression of a sequence of random variables and it has a probability associated with each of the possible outcomes.
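
As an illustration, here is a small Python sketch (my own, using a made-up pmf over a tiny range rather than the 0-100 call counts) of how a joint probability factors for an iid sequence:

from math import prod

def joint_probability(vector, pmf):
    # For an iid sequence the joint pmf factors into the product
    # p(x1) * p(x2) * ... * p(xn).
    return prod(pmf[x] for x in vector)

# Illustrative pmf over the tiny range {0, 1, 2} (not from the post):
pmf = {0: 0.25, 1: 0.5, 2: 0.25}
print(joint_probability([1, 2, 1, 0], pmf))  # 0.5 * 0.25 * 0.5 * 0.25 = 0.015625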

Markov Chains
Consider a stochastic process defined by the unending sequence X1, X2, ..., Xn, ... of random variables. Add to this stochastic process the property that the probability distribution of the next random variable in the sequence depends only on the value of the previous random variable. Such a stochastic process is called a Markov chain. The term "chain" comes from the property (described in detail below) that gives the probability of going from one Xi to the next Xi+1.

Adding this property implies that the probability of Xn depends on the value of Xn-1. But Xn-1 could take on a hundred and one different values, from 0 active calls to 100 active calls, and Xn itself could take on any of the same values. This means that there is a probability of Xn-1 being, say, 10 and Xn being 93; and likewise for every other possible combination.

This kind of thing can only be defined by a 101 x 101 matrix, which we can't show here. But if we were to reduce the problem to a maximum of 3 active calls, we would get a 4 x 4 matrix like the following, where the row label is the value of Xn-1 and the column label is the value of Xn:

          0     1     2     3
   0     1/2   1/4   1/8   1/8
   1     1/4   1/4   2/6   1/6
   2     1/8   1/4   1/2   1/8
   3      0    1/4   1/2   1/4


The first row of the matrix says that when Xn-1 is 0 then the probability of Xn also being 0 is 1/2 or 50%; the probability of Xn being 1 is 1/4 or 25% and so on.

This matrix is called a transition probability matrix for the Markov chain.
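
To make the matrix concrete, here is a short Python sketch (my own illustration; the simulation logic is not from the post) that stores this 4-state transition matrix and uses it to generate one possible path of the chain:

import random

# The 4-state transition matrix from above: P[i][j] is the probability
# that Xn = j given Xn-1 = i (states are 0 to 3 active calls; 2/6 = 1/3).
P = [
    [1/2, 1/4, 1/8, 1/8],
    [1/4, 1/4, 1/3, 1/6],
    [1/8, 1/4, 1/2, 1/8],
    [0,   1/4, 1/2, 1/4],
]

def next_state(current):
    # Sample Xn using the row of the matrix selected by the value of Xn-1.
    return random.choices(range(4), weights=P[current])[0]

random.seed(0)
state, chain = 0, [0]
for _ in range(10):
    state = next_state(state)
    chain.append(state)
print(chain)  # one simulated path of the Markov chain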

Steady State Probabilities
There is an interesting consequence of the Markov property. Since the probability of transitioning into the next element in the sequence depends on the value of the previous element in the sequence, the probability mass function can change every time the next element in the sequence comes up.

This is because the probability mass function of a random variable Xn defines the probability of Xn taking on each of the possible values in its range. And that is exactly what the row of the transition matrix corresponding to the value of Xn-1 defines. But it is not the whole story.

The probability of Xn taking on a particular value s is its probability mass function q(s), which can be calculated as:
the probability of Xn-1 taking on the first value in the range (viz. 0 active calls) times the transition probability of going from that first value to the value s,
and this is then summed over all the values in the range of X:

q(s) = p(x1) Q(s|x1) + p(x2) Q(s|x2) + ... over all values x1, x2, ... in the range of X

If q(s) no longer changes from one element of the sequence to the next, the Markov chain is said to be in its steady state.
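
A minimal sketch of finding the steady state numerically (my own illustration, reusing the same 4-state matrix as above): keep applying the update q(s) = sum over x of q(x) Q(s|x) until q stops changing.

# Reusing the 4-state matrix from the sketch above.
P = [
    [1/2, 1/4, 1/8, 1/8],
    [1/4, 1/4, 1/3, 1/6],
    [1/8, 1/4, 1/2, 1/8],
    [0,   1/4, 1/2, 1/4],
]

q = [1.0, 0.0, 0.0, 0.0]  # arbitrary starting distribution
for _ in range(100):
    # q(s) = sum over x of q(x) * Q(s|x)
    q = [sum(q[x] * P[x][s] for x in range(4)) for s in range(4)]
print([round(p, 4) for p in q])  # q no longer changes: the steady state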

Tuesday, October 19, 2010

Fundamental Ideas in Probability through Examples

Counting

The first fundamental idea in probability is that of counting.

There is an idea of an experiment which, when conducted, will result in an outcome - for example, tossing a coin, throwing a die or dice, finding whether a disease exists in a population, etc.

The total set of all possible outcomes for the experiment is called the sample space. For example, for the experiment of tossing a coin, the set of all possible outcomes is the set { H, T }. For the experiment of two flips of a coin, the sample space is { HH, HT, TH, TT }. And so on. Enumerating these sets of outcomes is one aspect of counting in probability.

An event is an expression of a set of possible outcomes. For example, getting exactly one H in two flips of a coin is an event; or getting at least one H in two flips is another event.

And the other aspect of counting is enumerating the outcomes for a given event. For example, the set of outcomes for the event of getting exactly one H in two flips of a coin is { HT, TH }.

The definition of probability is:

P(E) = n(E) / n(S)

which says that the probability of the occurrence of the event E is the ratio of the number of outcomes corresponding to E to the total number of outcomes possible.
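
Here is a tiny Python sketch (my own illustration) that applies this definition by brute-force counting for the two-flips experiment:

from itertools import product

# P(E) = n(E) / n(S), computed by counting.
S = list(product("HT", repeat=2))          # sample space { HH, HT, TH, TT }
E = [o for o in S if o.count("H") == 1]    # event: exactly one H
print(len(E), len(S), len(E) / len(S))     # 2 4 0.5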

Conditional Probability

If there are two events defined on a sample space, it is possible that the probability of one event is changed if the outcome of the other event is known.

Consider the experiment of two flips of a coin: the sample space is { HH, HT, TH, TT }. Consider two events: A is the event of getting at least one H in two flips and B is the event of getting a T in the first flip.

The question is, what is the probability of A when the event B has occurred? To answer this question we have to get back to counting: If B has occurred, then the possible outcomes are { TH, TT }. So, A can occur only in the case TH. Therefore, the probability of A occurring is 1/2.

Without the knowledge of B having occurred, the probability of A occurring would be 3/4. The knowledge that one T has already occurred has changed the probability for the event A. This is known as conditional probability.

P(A|B) is the notation for the conditional probability of A when event B has occurred.

The following relationship holds for conditional probability:
P(A|B) = P(A and B) / P(B)

In our example, P(A and B) is the probability of the outcome TH (the only outcome in which both A and B occur), which is 1/4, and the probability of B by itself is 1/2.

The calculation becomes: (1/4) / (1/2) = 1/2 (same as what we got through counting).

Note that P(A and B) cannot be computed as P(A) x P(B) here, because A and B are not independent; it has to be read off the sample space directly.
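
The same answer can be checked in code. A small sketch (mine, not from the post) that does the counting and the formula at once:

from itertools import product

S = list(product("HT", repeat=2))        # { HH, HT, TH, TT }
A = [o for o in S if "H" in o]           # at least one H in two flips
B = [o for o in S if o[0] == "T"]        # first flip is T
A_and_B = [o for o in B if o in A]       # both events together: just TH

# P(A|B) = P(A and B) / P(B), computed with raw counts over S.
print((len(A_and_B) / len(S)) / (len(B) / len(S)))  # 0.5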

Bayes Theorem

Certain relationships exist between simple and conditional probabilities that make it possible to infer additional probabilities. Bayes theorem provides these relationships.

As an example, consider the following situation:
A certain disease exists in 0.5% of a population (0.005).
A certain blood test is 99% accurate when the disease is present (0.99)
The blood test gives a positive result when the disease does not exist in 5% of cases (0.05)

The first statement above expresses the probability of the event (A) of the disease existing in a randomly chosen member of the population. The notation for this probability is P(A).

A second event B of getting a positive result of the blood test is also implied. However, the statements regarding this event are expressed as conditional probabilities:

P(B|A) expresses that the blood test will give a positive result when the disease is present. That is, when it is already known that the disease is present, what is the probability of getting a positive result? The answer is 0.99.

The third known probability is P(B|~A) which expresses that the blood test will give a positive result when it is known that the disease is not present. This value is 0.05.

What is not known is the probability of having the disease when the result of the test is positive.

If this number were as low as, say, 0.8, it would imply that two people in ten who tested positive don't actually have the disease. That would make the test quite useless for prescribing medication with side effects, since two people out of ten would needlessly have to suffer the side effects. But if the number worked out to 0.95, it would make the test quite useful.

Using Bayes theorem, it is possible to answer this and other related questions. Bayes theorem itself and the related calculations are the subject of another blog post. Here, we are presenting the case for and context of Bayes theorem.
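
As a preview of that calculation (this sketch is my own; the original post deliberately defers the derivation), Bayes' theorem applied to the three numbers above gives:

# Bayes' theorem in the form
# P(A|B) = P(B|A) P(A) / (P(B|A) P(A) + P(B|~A) P(~A))
p_A = 0.005            # disease exists in 0.5% of the population
p_B_given_A = 0.99     # test is positive when the disease is present
p_B_given_notA = 0.05  # test is positive when the disease is absent

numerator = p_B_given_A * p_A
p_A_given_B = numerator / (numerator + p_B_given_notA * (1 - p_A))
print(round(p_A_given_B, 3))  # about 0.09: most positives are false positives

The result, roughly 0.09, is exactly the kind of number the paragraph above warns about: because the disease is rare, the great majority of positive results come from healthy people.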

Random Variables

Random variables are used to express probabilities of a group of related events. For example, an event could express getting a sum of 5 on the roll of two dice. But, related to this event are events where the sum is 2, 3, ... 12. Random variables are used to express the groups of related events and their respective probabilities.

A random variable is a function that maps outcomes from the sample space into numbers. This statement implies the definition of an experiment without which the set of outcomes is meaningless. Further, this statement also implies that the random variable expresses something about the outcomes. Let's look at an example.

For example, the experiment is two flips of a coin. The sample space is { HH, HT, TH, TT }. One possibility for a random variable could be:

X: the number of heads

The values for X are { 0, 1, 2 }

So, X has taken the outcomes and mapped them into a set of numbers.

Now, each of the values of X has a probability associated with it:

P(X=0) = 1/4
P(X=1) = 1/2
P(X=2) = 1/4

The above can be expressed as a function F(X) whose individual values are f(x), where each f(x) is actually P(X=x). This means:

f(0) = P(X=0) = 1/4
f(1) = P(X=1) = 1/2
f(2) = P(X=2) = 1/4

The function F(X) is called the probability mass function.

There is one more commonly associated concept with random variables - the Expected Value

The Expected Value E(X) is the weighted sum of the values of X with their corresponding probabilities. For the above example, the expected value is:

E(X) = 0 x f(0) + 1 x f(1) + 2 x f(2)
= 0 + 1/2 + 2/4
= 1

What this implies is that if the experiment is carried out for a large enough number of times and the values of X obtained from the experiments are averaged, then the answer will be 1.

This is reasonable because we are likely to get the value 0 for a quarter of the time, 1 for half the time and 2 for the remaining quarter of the time - a sequence like 1, 2, 1, 0, 1, 0, 1, 2, ... whose average is going to be 1 (or close to it).

So the Expected Value of a random variable is the average of the numbers obtained from a large number of experiments.
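
A quick simulation sketch (my own illustration) to check this claim:

import random

# Simulate "two flips of a coin" many times; the average number of heads
# should come out close to E(X) = 1.
random.seed(1)
trials = 100_000
total = sum(sum(flip == "H" for flip in random.choices("HT", k=2))
            for _ in range(trials))
print(total / trials)  # approximately 1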

Thursday, November 22, 2007

Intuitively Understanding Random Variables

Understanding random variables and probability distributions is not as intuitive as one would like it to be. And searching on the internet typically gets you articles that contain a lot of equations and a lot of math which doesn't give an intuitive feel for the subject. The following discussion is meant to give that intuitive feel for the subject.

Probability

The first thing to understand is probabilities. And the simplest situation for dealing with probabilities is that of tossing a coin, multiple times, and understanding the probabilities involved in this.

There is a very parallel situation that can be used for understanding this idea: binary numbers. Binary numbers are numbers that consist of ones and zeros only. Counting in binary numbers proceeds as:
0
1
10
11
100
101
110
111 ... where these numbers numerically correspond to 0 1 2 3 4 5 6 7 ...

Now consider a binary number (e.g. 1100011001). It could be used to represent a sequence of coin tosses where the sequence of outcomes was (H H T T T H H T T H) where the H and T have been replaced by 1 and 0 respectively.

Now we can use some properties of binary numbers which will help us in calculating probabilities.

Consider an experiment consisting of 10 tosses of a coin. This will lead to a ten bit binary number (bit = binary digit). Like 1100011001.

All combinations of 10 bit binary numbers can be enumerated as
00000 00000
00000 00001
00000 00010
...
11111 11101
11111 11110
11111 11111

From the theory of binary numbers we know that there will be 1024 (or 2 raised to 10) such numbers. This means that there will be 1024 possible outcomes for 10 tosses of a coin.

Now let us consider a possible outcome of the experiment - let us consider the outcome that there will be only one head, and all the others will be tails. To understand this, let us look at it in terms of the binary numbers. What we are looking for is those 10 bit binary numbers which have only one 1 in them. How many of these do we have? Let's enumerate:
00000 00001
00000 00010
00000 00100
00000 01000
...
00100 00000
01000 00000
10000 00000

There are ten of them - with a 1 in each of the 10 digit (bit) positions.
So what can one say about the probability of getting one of these patterns: this is expressed as 10 possibilities in 1024 outcomes (10/1024), which is just under one in a hundred (about 0.0098, or a 0.98% probability).

So, get it? The probability is the number of outcomes matching the criterion divided by the total number of outcomes possible.
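
For those who want to verify this, a short Python sketch (my addition) that does the enumeration by brute force:

# Count the 10-bit binary numbers containing exactly one 1.
matching = [n for n in range(2**10) if bin(n).count("1") == 1]
print(len(matching))           # 10
print(len(matching) / 2**10)   # 0.009765625, just under one in a hundred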

Random Variables and Probability Distributions
Above we considered the probability of a single kind of outcome - that of getting a single head in a series of 10 tosses of a coin - and got a value that is just under one in a hundred.

We could continue this analysis and find the probabilities for getting 2 heads in a series of 10 tosses. The analysis would be more complicated, but here is the general pattern that we would follow:

Let us fix the position of one 1, and see how many possibilities there are for the other 1.
XXXXX XXXX1

00000 00011
00000 00101
00000 01001
...
10000 00001
There are nine of these patterns since the 1 in the last position is fixed and the other 1 can be in any of the remaining nine positions. So these are nine outcomes.
Now we change the position of the fixed 1 to the next spot:

00000 00011
00000 00110
00000 01010
...
10000 00010
There are nine more of these patterns. However, notice that the pattern 00000 00011 has repeated.

And we can continue to shift the fixed 1 in all the 10 bit positions.
So we get 10 x 9 = 90 possible variations. But each pattern has occurred twice, so we get 90 / 2 = 45 distinct outcomes. So now the probability of getting 2 heads in a series of 10 tosses is 45 / 1024, which is a 4.5-fold increase over the probability of getting 1 head.
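
Again, the count can be verified by brute force (a sketch of my own, not part of the original argument):

# 10-bit patterns with exactly two 1s.
two_heads = [n for n in range(2**10) if bin(n).count("1") == 2]
print(len(two_heads))           # 45, i.e. (10 x 9) / 2
print(len(two_heads) / 2**10)   # about 0.044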

We continue to do this kind of counting and getting the possible outcomes for each condition - 3 heads, 4 heads, 5 heads ...

OK - so we are now in a position to define the random variable:
Let us represent by "X" the specific outcome like 4 heads or 7 heads etc. What this means is that X can take values from 0 to 10. So if X has the value 7, it means that we are talking about the outcome of getting 7 heads in a series of 10 tosses.

Because X can take on different values (all the way from 0 to 10), we call it a variable and it can be plotted on a graph. We plot X on the horizontal (or x-) axis.

Then what can we plot on the y-axis? The probability values.

For a series of 10 tosses, the values for the probabilities work out to:
X=0  P=0.001
X=1  P=0.010
X=2  P=0.044
X=3  P=0.117
X=4  P=0.205
X=5  P=0.246
X=6  P=0.205
X=7  P=0.117
X=8  P=0.044
X=9  P=0.010
X=10 P=0.001

If you plot these values on a graph on your notebook and join the dots with a smooth curve, you will get the famous "bell-shaped" curve.

This graph is called the probability distribution for the series of 10 tosses of a coin. The distribution is also referred to as the binomial distribution.
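
If you would rather compute these values than count patterns by hand, here is a short sketch (my addition) using the standard binomial coefficient, which is exactly what the pattern-counting above was building towards:

from math import comb

# P(X=k) is (number of 10-bit patterns with exactly k ones) / 1024,
# which is comb(10, k) / 2**10.
for k in range(11):
    print(k, round(comb(10, k) / 2**10, 3))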

A Technicality (that can be ignored)
Now, what we called X would properly be called X-10 (actually, X with subscript 10) since it relates to a series of 10 tosses. We could also consider X-11, which represents a series of 11 tosses. And of course, X-1 ... X-10 ... to infinity. We can represent all these X's by a single X which represents the outcomes of any series of coin tosses. This is properly called the random variable. However, there is a probability distribution associated with each X-i, and so what we have considered is technically correct.