Maximum a posteriori estimation

Given data $D = (x_1, x_2, \ldots, x_n)$, $x_i \in \mathcal{R}^D$, and an assumed joint probability distribution $p(D, \theta)$, where $\theta$ is a random variable, our goal is to choose a good value of $\theta$ for $D$ such that
$$\theta_{MAP} = \arg\max_\theta p(\theta | D)$$
A maximum a posteriori (MAP) estimate maximises the "posterior distribution". This term comes from Bayes' theorem, which combines the likelihood used in MLE with a prior distribution over $\theta$:
$$p(\theta | D) = \frac{p(D | \theta)\, p(\theta)}{p(D)}$$
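Since $p(D)$ does not depend on $\theta$, the MAP estimate can be found by maximising the unnormalised posterior $p(D | \theta)\, p(\theta)$. Here is a minimal numerical sketch that does this by grid search, assuming the Gaussian model used in the example below (the specific numbers and the use of numpy/scipy are only for illustration):

```python
# Minimal sketch: MAP by grid search over the unnormalised log posterior.
# Assumes the Gaussian model from the example below: x_i ~ N(theta, sigma^2),
# prior theta ~ N(mu, 1). The evidence p(D) is constant in theta, so it is omitted.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu, sigma = 0.0, 2.0                    # prior mean, likelihood std (illustrative values)
x = rng.normal(1.5, sigma, size=50)     # synthetic data

thetas = np.linspace(-5, 5, 10_001)
log_post = np.array([
    norm.logpdf(x, loc=t, scale=sigma).sum()   # log p(D | theta)
    + norm.logpdf(t, loc=mu, scale=1.0)        # log p(theta)
    for t in thetas
])
theta_map = thetas[np.argmax(log_post)]
print(theta_map)
```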
Example
Compute the MAP estimate for a univariate Gaussian distribution with unknown mean.
Given data samples $D = (x_1, x_2, \ldots, x_n)$ with $x_i \in \mathcal{R}$.
Assumptions:
Let us assume the prior on $\theta$ is a univariate Gaussian, $\theta \sim N(\mu, 1)$. Also assume that, given $\theta$, the data samples are i.i.d. and normally distributed, $x_i \sim N(\theta, \sigma^2)$, so that
$$p(x_1, x_2, \ldots, x_n | \theta) = \prod_{i=1}^n p(x_i | \theta)$$
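In code, the i.i.d. factorisation means the log-likelihood is simply a sum of per-sample log densities. A minimal sketch, assuming the Gaussian likelihood used here (the toy data values are arbitrary):

```python
# Sketch: i.i.d. factorisation -> log-likelihood is a sum of per-sample terms.
import numpy as np
from scipy.stats import norm

x = np.array([1.2, 0.7, 2.1, 1.6])   # toy data (illustrative)
theta, sigma = 1.0, 1.0

log_lik_sum = norm.logpdf(x, loc=theta, scale=sigma).sum()
log_lik_prod = np.log(np.prod(norm.pdf(x, loc=theta, scale=sigma)))
print(np.isclose(log_lik_sum, log_lik_prod))   # True: log of a product = sum of logs
```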
Now,
$$\begin{align*}
\theta_{MAP} &= \arg\max_\theta p(\theta | D) \\
&= \arg\max_\theta \frac{p(D | \theta)\, p(\theta)}{p(D)} \\
&= \arg\max_\theta \log p(D | \theta) + \log p(\theta) - \log p(D)
\end{align*}$$
Since $\log p(D)$ does not depend on $\theta$, it can be dropped. Differentiating the remaining terms with respect to $\theta$ and setting the derivative to zero:
$$\begin{align*}
0 &= \frac{\partial}{\partial \theta} \left(\log p(D | \theta) + \log p(\theta)\right) \\
0 &= \frac{\partial}{\partial \theta} \log \prod_{i=1}^n p(x_i | \theta) + \frac{1}{p(\theta)} \frac{\partial}{\partial \theta} p(\theta) \\
0 &= \frac{\partial}{\partial \theta} \sum_{i=1}^n \log p(x_i | \theta) + \frac{1}{p(\theta)} \frac{\partial}{\partial \theta} p(\theta)
\end{align*}$$
Consider
$$\begin{align*}
\log p(x_i | \theta) &= \log \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x_i - \theta)^2}{2 \sigma^2}} \\
&= \log \frac{1}{\sqrt{2 \pi \sigma^2}} - \frac{(x_i - \theta)^2}{2 \sigma^2} \\
\frac{\partial}{\partial \theta} \log p(x_i | \theta) &= \frac{x_i - \theta}{\sigma^2}
\end{align*}$$
And, for the prior (with $\sigma = 1$, since $\theta \sim N(\mu, 1)$),
$$\begin{align*}
p(\theta) &= \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(\theta - \mu)^2}{2 \sigma^2}} \\
\frac{\partial}{\partial \theta} p(\theta) &= \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(\theta - \mu)^2}{2 \sigma^2}} \frac{- (\theta - \mu)}{\sigma^2} \\
\frac{1}{p(\theta)} \frac{\partial}{\partial \theta} p(\theta) &= \frac{\sqrt{2 \pi \sigma^2}}{e^{-\frac{(\theta - \mu)^2}{2 \sigma^2}}} \left( \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(\theta - \mu)^2}{2 \sigma^2}} \frac{- (\theta - \mu)}{\sigma^2} \right) \\
\frac{1}{p(\theta)} \frac{\partial}{\partial \theta} p(\theta) &= \frac{-(\theta - \mu)}{\sigma^2} = \mu - \theta
\end{align*}$$
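Both derivatives can be sanity-checked symbolically. Here is a minimal sketch using sympy; it is not part of the derivation itself, and the symbol names simply mirror the ones above:

```python
# Sketch: symbolic check of the two derivatives used above.
import sympy as sp

x_i, theta, mu = sp.symbols("x_i theta mu", real=True)
sigma = sp.symbols("sigma", positive=True)

# log-likelihood of a single sample x_i ~ N(theta, sigma^2)
log_lik = sp.log(1 / sp.sqrt(2 * sp.pi * sigma**2)
                 * sp.exp(-(x_i - theta)**2 / (2 * sigma**2)))
print(sp.simplify(sp.diff(log_lik, theta)))        # expect (x_i - theta)/sigma**2

# prior theta ~ N(mu, 1)
prior = 1 / sp.sqrt(2 * sp.pi) * sp.exp(-(theta - mu)**2 / 2)
print(sp.simplify(sp.diff(prior, theta) / prior))  # expect mu - theta
```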
Substituting these back
$$\begin{align*}
0 &= \frac{\partial}{\partial \theta} \sum_{i=1}^n \log p(x_i | \theta) + \frac{1}{p(\theta)} \frac{\partial}{\partial \theta} p(\theta) \\
0 &= \sum_{i=1}^n \frac{x_i - \theta}{\sigma^2} + (\mu - \theta) \\
0 &= \frac{\sum_{i=1}^n x_i - n \theta}{\sigma^2} + (\mu - \theta) \\
0 &= \sum_{i=1}^n x_i - n \theta + \sigma^2 \mu - \sigma^2 \theta \\
(n + \sigma^2)\, \theta &= \sum_{i=1}^n x_i + \sigma^2 \mu \\
\theta_{MAP} &= \frac{\sum_{i=1}^n x_i + \sigma^2 \mu}{n + \sigma^2} \\
\theta_{MAP} &= \frac{n \cdot \frac{1}{n} \sum_{i=1}^n x_i + \sigma^2 \mu}{n + \sigma^2} \\
\theta_{MAP} &= \frac{n \bar{x}}{n + \sigma^2} + \frac{\sigma^2 \mu}{n + \sigma^2}
\end{align*}$$
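The result is a weighted average of the sample mean $\bar{x}$ and the prior mean $\mu$. Below is a minimal numerical check of this closed form against direct maximisation of the log posterior, assuming illustrative parameter values and using scipy's scalar minimiser:

```python
# Sketch: check the closed-form MAP against direct numerical maximisation.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(1)
mu, sigma = 0.0, 2.0                      # prior mean, likelihood std (illustrative)
x = rng.normal(1.5, sigma, size=30)       # synthetic data
n, xbar = len(x), x.mean()

# Closed form: weighted average of sample mean and prior mean.
theta_closed = (n * xbar + sigma**2 * mu) / (n + sigma**2)

def neg_log_post(t):
    # negative unnormalised log posterior: -(log p(D | t) + log p(t))
    return -(norm.logpdf(x, loc=t, scale=sigma).sum()
             + norm.logpdf(t, loc=mu, scale=1.0))

theta_numeric = minimize_scalar(neg_log_post).x
print(theta_closed, theta_numeric)        # the two agree to numerical precision
```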
Pros and Cons
Pros
MAP estimates help avoid overfitting, because the prior acts as a regulariser.
MAP tends towards the MLE asymptotically: as the number of data points grows, the likelihood dominates the prior (illustrated by the sketch at the end of this section).
Cons
It is a point estimate: once we estimate the value of $\theta$, we do not know the uncertainty associated with it.
Not invariant under reparameterisation. For the MLE, if $\tau = g(\theta)$, where $g$ is some function that reparametrises $\theta$, then $\tau_{MLE} = g(\theta_{MLE})$. This is not true for MAP, because the change of variables introduces a Jacobian factor into the posterior density, which can move the mode.
We must assume a prior for $\theta$.
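To illustrate the asymptotic point from the list of pros: the prior's weight $\sigma^2 / (n + \sigma^2)$ vanishes as $n$ grows, so the MAP estimate approaches the sample mean (the MLE). A rough sketch under the same Gaussian model, with illustrative parameter values:

```python
# Sketch: with more data the MAP estimate approaches the MLE (the sample mean).
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, true_theta = 0.0, 2.0, 3.0     # illustrative values

for n in (5, 50, 5_000):
    x = rng.normal(true_theta, sigma, size=n)
    xbar = x.mean()                                   # MLE
    theta_map = (n * xbar + sigma**2 * mu) / (n + sigma**2)
    print(n, round(xbar, 3), round(theta_map, 3))     # gap shrinks as n grows
```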