
STAT 238 - Bayesian Statistics Lecture Six

Spring 2026, UC Berkeley

Interpretation of Probability

Because Bayesian statistics is simply probability theory applied to inference, understanding the meaning and interpretation of probability is essential.

There are broadly two ways of understanding probability: frequentist and Bayesian.

Frequentist/Objective Understanding of Probability

From the frequentist viewpoint, probability is applicable only in the context of “random experiments” (such as tossing coins and rolling dice). The probability $\P(A)$ of an event $A$ is defined as the relative frequency with which $A$ occurs in $N$ repeated trials of the experiment in the limit as $N \rightarrow \infty$:

$$\P(A) := \lim_{N \rightarrow \infty} \frac{N_A}{N}$$

where $N_A$ is the number of trials out of $N$ where $A$ occurs. This definition of probability is the basis of frequentist statistics.

Here are some examples:

  1. The statement $\P(H) = 0.5$ means that the proportion of heads in a large number of tosses of the coin approaches 0.5.

  2. The statement

    $$\epsilon_1, \dots, \epsilon_n \overset{\text{i.i.d}}{\sim} N(0, \sigma^2)$$

    means that if the experiment generating $\epsilon_1, \dots, \epsilon_n$ is repeated a large number of times, the proportion of times the values of $(\epsilon_1, \dots, \epsilon_n)$ lie in a set $A$ approaches

    $$\int_A \prod_{i=1}^n \frac{1}{\sqrt{2 \pi} \sigma} \exp \left(-\frac{x_i^2}{2 \sigma^2} \right) dx_1 \dots dx_n$$

    and this should be true for all (measurable) subsets $A$ of $\R^n$.
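The limiting-frequency definition lends itself to a quick simulation. The sketch below is purely illustrative (the function name, the choice of a fair coin, and the fixed seed are ours): it estimates $\P(H)$ by the relative frequency $N_H/N$ for increasing $N$.

```python
import random

def relative_frequency(p_heads, n_trials, seed=0):
    """Estimate P(H) by the relative frequency N_H / N over n_trials tosses."""
    rng = random.Random(seed)
    n_heads = sum(rng.random() < p_heads for _ in range(n_trials))
    return n_heads / n_trials

# As N grows, the relative frequency N_H / N settles near the true P(H) = 0.5.
for n in (100, 10_000, 1_000_000):
    print(n, relative_frequency(0.5, n))
```

Of course, any finite simulation only suggests the limit; the frequentist definition requires the unobservable limit $N \rightarrow \infty$.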

The following are some obvious problems with the frequentist definition:

  1. It is very restrictive and hardly ever applicable. In many simple situations where we would like to use probability, the frequency definition simply does not apply:

    1. Is the suspect X guilty?

    2. What is the chance of rain in Berkeley today?

    3. What is the chance that Y is cancer positive given that they tested positive?

    For an interesting anecdote about how this restrictive notion simply does not make sense in some important problems, see DeGrootBlackwell.

  2. Even in situations where the frequency definition is seemingly applicable, closer thought might reveal some issues. For example, the frequentist statement that the probability a coin comes up heads is 0.6 means that 60% of a large number of tosses of the coin should result in heads. But the mechanics of no two tosses are really identical, and if two tosses are done exactly identically, then we would expect the same outcome by the laws of physics. So the term “identical and independent repetitions of an experiment” is ambiguous.

In the frequentist definition, probability is considered an intrinsic property of the object under investigation which is only accessible by an experiment generating samples of infinite size. Thus frequentist probability is also referred to as “objective probability”. The implication is that we cannot assign it arbitrarily, because any probability assignment that does not agree with the frequency in infinite trials is wrong. Unfortunately, the actual frequentist probability is seldom known, because one cannot generally observe a large number of repetitions of an experiment, and so almost all probability assignments are wrong from the frequentist point of view. This is one way of understanding the statistics aphorism “All models are wrong” (usually attributed to George Box; see All models are wrong).

Here are some quotes by famous statisticians/probabilists illustrating how widespread frequentist thinking in probability is:

The numbers $p_r$ should, in fact, be regarded as physical constants of the particular die that we are using, and the question as to their numerical values cannot be answered by the axioms of probability, any more than the size and the weight of the die are determined by the geometrical and mechanical axioms. However, experience shows that in a well-made die the frequency of any event $r$ in any long series of throws usually approaches $1/6$, and accordingly we shall often assume that all the $p_r$ are equal to $1/6$... – Cramér.

Here is Jaynes’s response to the above quote (from page 317 of his book): To a physicist, this statement seems to show utter contempt for the known laws of mechanics. The results of tossing a die many times do not tell us any definite number characteristic only of the die. They tell us also something about how the die was tossed. If you toss ’loaded’ dice in different ways, you can easily alter the relative frequencies of the faces. With only slightly more difficulty, you can still do this if your dice are perfectly ’honest’.

Here is a quote by Feller (see page 322 of the Jaynes book) illustrating the thinking that bridge hands possess physical probabilities and that the uniform probability assignment is a convention whose correctness can only be verified by observed frequencies in a random experiment: The number of possible distributions of cards in bridge is almost $10^{30}$. Usually we agree to consider them as equally probable. For a check of this convention more than $10^{30}$ experiments would be required – a billion of billion of years if every living person played one game every second, day and night. – Feller.

In spite of these objections, one positive aspect of the frequentist meaning of probability is that the Rules of Probability follow easily from this definition. Recall that the rules of probability are:

  1. $\P(A)$ always lies between 0 and 1. The probability of an impossible event is 0 and the probability of a certain event is 1.

  2. Product rule: $\P(A \cap B) = \P(A) \P(B|A) = \P(B) \P(A|B)$.

  3. Sum rule: $\P(A \cup B) = \P(A) + \P(B)$ for disjoint events $A$ and $B$.
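These rules can be checked mechanically on a small finite sample space. The sketch below is our own throwaway example (the die, the events, and the function names are not from the lecture); it verifies the product and sum rules with exact rational arithmetic.

```python
from fractions import Fraction

OMEGA = frozenset(range(1, 7))  # sample space of a fair six-sided die

def P(event):
    """Probability of an event (a subset of OMEGA) under the uniform assignment."""
    return Fraction(len(event & OMEGA), len(OMEGA))

def P_cond(a, b):
    """Conditional probability P(a | b) = P(a ∩ b) / P(b)."""
    return P(a & b) / P(b)

A = frozenset({2, 4, 6})   # "the roll is even"
B = frozenset({1, 2, 3})   # "the roll is at most three"

# Product rule: P(A ∩ B) = P(A) P(B|A) = P(B) P(A|B)
assert P(A & B) == P(A) * P_cond(B, A) == P(B) * P_cond(A, B)

# Sum rule for the disjoint events A and {5}: P(A ∪ {5}) = P(A) + P({5})
assert P(A | frozenset({5})) == P(A) + P(frozenset({5}))
```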

Subjective or Bayesian Understanding of Probability

In Bayesian statistics, probability is considered a general method of reasoning under uncertainty. It is applicable to all situations involving uncertainty, and is not restricted to situations involving “random experiments”.

Further, probability is assumed to have nothing to do with frequency. This means there is no right or wrong probability model. Different analysts are welcome to use different models, and they can be assessed in terms of performance. One can also assess different models in a Bayesian model selection framework.

Meaning can be assigned to probability statements without relying on any connection to long-run frequencies. For example, suppose a doctor assigns a probability of 0.02 to a patient having cancer based on some background information. This can be quantified as (see Lindley): The doctor’s degree of belief in the uncertain event that the patient has cancer is the same as the degree of belief in the uncertain event of drawing a red ball from an urn containing 2 red balls and 98 green balls.

Lindley goes on to clearly say that there is no frequency interpretation here. There is no repetition in this definition. The ball is to be taken once, and once only, and the long-run frequency of red balls in repeated drawings is irrelevant. After its withdrawal, the urn and its contents can go up in smoke for all that it matters.

For a concrete example, consider the probability assignment $\P\{H \mid I\} = 0.5$ where $H$ denotes heads (in the context of tossing a coin) and $I$ denotes some background information. In the Bayesian context, this is not a statement about some long-run frequency of heads while tossing the coin. Instead, it is an assignment of probability by some individual for the next coin toss, based on some information. Since $\P\{H \mid I\} = \P\{T \mid I\}$, it implies that the background information is symmetric between $H$ and $T$. The actual information itself might vary; for example, consider the following two kinds of background information, both of which can justify the assignment $\P\{H \mid I\} = 0.5$:

  1. Information $I_1$: We don’t know anything at all about the coin. We don’t even know if it really has two sides $H$ and $T$ or both of its sides are of only one kind (either $H$ or $T$). In addition, we don’t know how exactly it will be tossed.

  2. Information $I_2$: We know that it is a “regular” coin and that it has two sides $H$ and $T$, and that it will be tossed in the “usual” way.

In the first case above, it might very well happen that the coin has both sides $H$, in which case repeated tossing of the coin will lead to HHHHH... and the long-run frequency of heads will be 1 (and not 0.5). But this does not invalidate the assignment $\P(H \mid I_1) = 0.5$, because it is not a statement about the long-run frequency of heads but a claim about the next toss.

Now consider a third kind of information: we know that this coin has been tossed a large number of times in the past and it landed heads 70% of the time. Is the assignment $\P\{H \mid I\} = 0.5$ still justifiable? In this case, the probability we need to calculate is:

$$\P \left\{X_{n+1} = 1 \mid \frac{X_1 + \dots + X_n}{n} = 0.7 \right\} \tag{1}$$

where $X_1, \dots, X_n$ represent the historical tosses and $X_{n+1}$ denotes the outcome of the next toss (1 here represents $H$ and 0 represents $T$). The probability (1) cannot be assigned arbitrarily; instead, it should be calculated based on some joint model for $X_1, \dots, X_{n+1}$. Consider the following two models:

$$X_1, \dots, X_{n+1} \overset{\text{i.i.d}}{\sim} \text{Bernoulli}(0.5) \tag{2}$$

and

$$X_1, \dots, X_{n+1} \mid \theta \overset{\text{i.i.d}}{\sim} \text{Bernoulli}(\theta) \quad \text{and} \quad \theta \sim \text{uniform}(0, 1). \tag{3}$$

For the first model, $X_{n+1}$ is independent of $X_1, \dots, X_n$ and we indeed get $\P(X_{n+1} = 1) = 0.5$. For the second model, the situation is more interesting, and we will get an answer to (1) that is close to 0.7 when $n$ is large. So here, probability still does not have anything to do with frequency, but in this case, the right kind of model will lead to the frequency assignment.
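The predictive probability (1) can be computed in closed form for both models. Under the first it is 0.5 regardless of the data; under the second, the standard Beta-Bernoulli conjugacy gives a $\text{Beta}(s+1, n-s+1)$ posterior after $s$ heads in $n$ tosses, so the predictive probability is $(s+1)/(n+2)$, Laplace's rule of succession. A minimal sketch (the function names are ours):

```python
from fractions import Fraction

def predictive_fair(n, s):
    """First model: tosses are i.i.d. Bernoulli(0.5), so the past is irrelevant."""
    return Fraction(1, 2)

def predictive_uniform_prior(n, s):
    """Second model: Bernoulli(theta) tosses with theta ~ uniform(0, 1).
    The posterior is Beta(s + 1, n - s + 1); its mean (s + 1)/(n + 2) is the
    predictive probability of heads on toss n + 1 (Laplace's rule of succession)."""
    return Fraction(s + 1, n + 2)

n, s = 1000, 700   # 70% heads observed
print(float(predictive_fair(n, s)))           # 0.5
print(float(predictive_uniform_prior(n, s)))  # 701/1002, close to 0.7
```

As $n$ grows with the observed fraction held at 0.7, $(0.7 n + 1)/(n + 2) \rightarrow 0.7$, matching the claim above.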

This flexibility with modeling is a positive feature of the Bayesian framework. However, there remains the question of justifying the rules of probability. If analysts are allowed to come up with arbitrary probability models, then why do they have to carry out calculations in accordance with the rules of probability? What is the justification for using the rules of probability? This question was raised by many people, including R. A. Fisher, who in Fisher1934 writes:

Keynes establishes the laws of addition and multiplication of probabilities, by stating these laws in the form of definitions of the processes of addition and multiplication. The important step of showing that, when these probabilities have numerical values, “addition” and “multiplication” are so defined, are equivalent to the arithmetical processes ordinarily known by these names, is omitted. The omission is an interesting one, since it shows the difficulty of establishing the laws of mathematical probability, without basing the notion of probability on the concept of frequency, for which these laws are really true, and from which they were originally derived.

It turns out that the rules of probability can be justified without using any connections between probability and frequency. The following arguments are due to the physicist R. T. Cox and are described in Chapters 1 and 2 of Jaynes. I will give a sketch of the argument, skipping some important technical details. For the full argument, please read Jaynes.

Justification of the Rules of Probability without using any connections to Long-run Frequencies

As just mentioned, the following argument is due to the physicist R. T. Cox, and can be read in the book rtcox. I will follow the treatment given in Jaynes.

Let us first remove all restrictions on probabilities and even allow them to take values outside the interval $[0, 1]$. To avoid confusion, let us use the term “plausibilities”. We are assigning plausibilities to various events (or propositions) conditional on other events. Let us denote the plausibility of event $A$ conditional on event $B$ by $(A|B)$. Let us first assume that plausibilities take values in the set of real numbers (with no restriction now to lie in the interval $[0, 1]$) and that a higher value of plausibility represents a greater degree of belief.

Product Rule

Let us first investigate why the product rule should be true. The product rule in terms of probabilities states that

$$\P(AB|C) = \P(B|C) \P(A|BC)$$

Here $AB$ denotes the event $A \cap B$. Should our plausibilities satisfy a similar relation? Let us first assume that the plausibility $(AB|C)$ should really be determined by the two plausibilities $(B|C)$ and $(A|BC)$. This is basically because the process of deciding that $AB$ is true can be broken down into first deciding whether $B$ is true and then, having accepted $B$ as true, deciding whether $A$ is true. We shall therefore assume that there should be a function $F$ such that

$$(AB|C) = F((B|C), (A|BC)).$$

We also assume that we should use the same function $F$ for all possible events $A, B, C$ (i.e., we are not using one function $F$ for some $A, B, C$ while calculating $(AB|C)$ from $(B|C)$ and $(A|BC)$ and a different function for other $A, B, C$). Since $AB = BA$, this means in particular that

$$(AB|C) = F((A|C), (B|AC)).$$

It is also reasonable to assume that $F(x, y)$ is monotone increasing in each of its arguments and that it is continuous. If it is not continuous, then a small change in $(B|C)$ (or $(A|BC)$) might lead to a large change in $(AB|C)$, which is undesirable.

Now if we have four events $A, B, C, D$, we can write

$$(ABC|D) = F((BC|D), (A|BCD)) = F(F((C|D), (B|CD)), (A|BCD)).$$

We can also write

$$(ABC|D) = F((C|D), (AB|CD)) = F((C|D), F((B|CD), (A|BCD))).$$

We shall now make the following important consistency assumption: if a plausibility can be calculated via two different methods, then both methods should give the same answer. Clearly, if this assumption were violated, our answer to a plausibility calculation would depend on the specific method chosen for the calculation, which would be highly undesirable. This assumption immediately implies that

$$F(F((C|D), (B|CD)), (A|BCD)) = F((C|D), F((B|CD), (A|BCD)))$$

for all $A, B, C, D$. If the individual plausibilities are arbitrary, we would get the following condition that the function $F$ should satisfy:

$$F(F(x, y), z) = F(x, F(y, z)) \quad \text{for all real numbers $x, y, z$}.$$

It turns out that the only functions $F$ satisfying the above equation are of the form

$$F(x, y) = w^{-1}(w(x) w(y))$$

for a positive continuous increasing function $w$. I will skip this derivation (see Section 2.1, Chapter 2 of Jaynes). We thus have
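While the derivation is skipped, it is easy to check numerically that any $F$ of this form solves the associativity equation. The sketch below (ours, not from Jaynes) picks $w(x) = e^x$, one admissible positive, continuous, increasing choice, under which $F(x, y)$ reduces to $x + y$:

```python
import math

def make_F(w, w_inv):
    """Build F(x, y) = w^{-1}(w(x) w(y)) from an increasing w and its inverse."""
    return lambda x, y: w_inv(w(x) * w(y))

F = make_F(math.exp, math.log)   # with w = exp, F(x, y) is just x + y

# Verify F(F(x, y), z) == F(x, F(y, z)) on a few arbitrary triples.
for x, y, z in [(0.3, -1.2, 2.5), (1.0, 1.0, 1.0), (-4.0, 0.5, 3.3)]:
    assert math.isclose(F(F(x, y), z), F(x, F(y, z)))
```

Any other admissible $w$ (e.g. $w(x) = x$ on positive plausibilities, giving $F(x, y) = xy$) passes the same check.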

$$(AB|C) = F((B|C), (A|BC)) = w^{-1} \left(w(B|C) w(A|BC) \right).$$

This is equivalent to

$$w(AB|C) = w(B|C) w(A|BC). \tag{4}$$

Now if we take $B = A$, we get

$$w(A|C) = w(A|C) w(A|AC)$$

The plausibility $(A|AC)$ can be seen as certainty ($A$ is certainly true given that $A$ and $C$ hold), so we get

$$w(A|C) = w(A|C) w(\text{certainty})$$

for all $A$ and $C$ (note here that $w(\text{certainty})$ refers to the function $w$ applied to the plausibility of the certain proposition). This can happen only if

$$w(\text{certainty}) = 1. \tag{5}$$

Also if we take $B = A^c$ in (4), we get

$$w(AA^c | C) = w(A^c|C) w(A|A^c C)$$

$AA^c|C$ and $A|A^cC$ can both be taken to represent impossibility, so we get

$$w(\text{impossible}) = w(A^c|C) w(\text{impossible})$$

for all $A$ and $C$, which gives

$$w(\text{impossible}) = 0. \tag{6}$$

(5) and (6), along with the monotonicity of $w$, imply

$$0 \leq w(A|B) \leq 1 \quad \text{for all $A$ and $B$}.$$

We have thus proved that $w(A|B)$ always lies between 0 and 1 (it is 0 for impossibility and 1 for certainty) and satisfies the product rule of probability:

$$w(AB|C) = w(A|C) w(B|AC) = w(B|C) w(A | BC).$$

In other words, if we apply this function $w$ to our plausibilities, then the resulting assignments satisfy the first two rules of probability.

Sum Rule

Next the goal is to derive the sum rule. We will first derive the sum rule in the simplified form $\P(A^c) = 1 - \P(A)$, and then prove it in the general case. We shall discuss this argument in the next lecture.