Maximum a posteriori estimation

Given data $D = (x_1, x_2, \ldots, x_n)$, $x_i \in \mathcal{R}^D$, and an assumed joint probability distribution $p(D, \theta)$, where $\theta$ is a random variable, our goal is to choose a good value of $\theta$ for $D$ such that
$$\theta_{MAP} = \arg\max_\theta p(\theta | D)$$
A maximum a posteriori (MAP) estimate maximises the "posterior distribution". This term comes from Bayes' theorem, which combines the likelihood used in MLE with a prior distribution over $\theta$:
$$p(\theta | D) = \frac{p(D | \theta)\, p(\theta)}{p(D)}$$
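Since $p(D)$ does not depend on $\theta$, the MAP estimate can be found by maximising the unnormalised posterior $p(D | \theta)\, p(\theta)$. Here is a minimal numerical sketch that does this by grid search, assuming the Gaussian model used in the example below (the specific numbers and the use of numpy/scipy are only for illustration):

```python
# Minimal sketch: MAP by grid search over the unnormalised log posterior.
# Assumes the Gaussian model from the example below: x_i ~ N(theta, sigma^2),
# prior theta ~ N(mu, 1). The evidence p(D) is constant in theta, so it is omitted.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu, sigma = 0.0, 2.0                    # prior mean, likelihood std (illustrative values)
x = rng.normal(1.5, sigma, size=50)     # synthetic data

thetas = np.linspace(-5, 5, 10_001)
log_post = np.array([
    norm.logpdf(x, loc=t, scale=sigma).sum()   # log p(D | theta)
    + norm.logpdf(t, loc=mu, scale=1.0)        # log p(theta)
    for t in thetas
])
theta_map = thetas[np.argmax(log_post)]
print(theta_map)
```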
Example
Compute the MAP estimate for a univariate Gaussian distribution with unknown mean.
Given data samples $D = (x_1, x_2, \ldots, x_n)$ with $x_i \in \mathcal{R}$.
Assumptions:
Let us assume the prior on $\theta$ is a univariate Gaussian, $\theta \sim N(\mu, 1)$. Also assume that, given $\theta$, the data samples are i.i.d. and normally distributed, $x_i \sim N(\theta, \sigma^2)$, so that
$$p(x_1, x_2, \ldots, x_n | \theta) = \prod_{i=1}^n p(x_i | \theta)$$
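In code, the i.i.d. factorisation means the log-likelihood is simply a sum of per-sample log densities. A minimal sketch, assuming the Gaussian likelihood used here (the toy data values are arbitrary):

```python
# Sketch: i.i.d. factorisation -> log-likelihood is a sum of per-sample terms.
import numpy as np
from scipy.stats import norm

x = np.array([1.2, 0.7, 2.1, 1.6])   # toy data (illustrative)
theta, sigma = 1.0, 1.0

log_lik_sum = norm.logpdf(x, loc=theta, scale=sigma).sum()
log_lik_prod = np.log(np.prod(norm.pdf(x, loc=theta, scale=sigma)))
print(np.isclose(log_lik_sum, log_lik_prod))   # True: log of a product = sum of logs
```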
Now,
$$\begin{align*}
\theta_{MAP} &= \arg\max_\theta p(\theta | D) \\
&= \arg\max_\theta \frac{p(D | \theta)\, p(\theta)}{p(D)} \\
&= \arg\max_\theta \log p(D | \theta) + \log p(\theta) - \log p(D)
\end{align*}$$
Since $\log p(D)$ does not depend on $\theta$, it can be dropped. Differentiating the remaining terms with respect to $\theta$ and setting the derivative to zero:
$$\begin{align*}
0 &= \frac{\partial}{\partial \theta} \left(\log p(D | \theta) + \log p(\theta)\right) \\
0 &= \frac{\partial}{\partial \theta} \log \prod_{i=1}^n p(x_i | \theta) + \frac{1}{p(\theta)} \frac{\partial}{\partial \theta} p(\theta) \\
0 &= \frac{\partial}{\partial \theta} \sum_{i=1}^n \log p(x_i | \theta) + \frac{1}{p(\theta)} \frac{\partial}{\partial \theta} p(\theta)
\end{align*}$$
Consider
$$\begin{align*}
\log p(x_i | \theta) &= \log \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x_i - \theta)^2}{2 \sigma^2}} \\
&= \log \frac{1}{\sqrt{2 \pi \sigma^2}} - \frac{(x_i - \theta)^2}{2 \sigma^2} \\
\frac{\partial}{\partial \theta} \log p(x_i | \theta) &= \frac{x_i - \theta}{\sigma^2}
\end{align*}$$
And, for the prior (with $\sigma = 1$, since $\theta \sim N(\mu, 1)$),
$$\begin{align*}
p(\theta) &= \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(\theta - \mu)^2}{2 \sigma^2}} \\
\frac{\partial}{\partial \theta} p(\theta) &= \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(\theta - \mu)^2}{2 \sigma^2}} \frac{- (\theta - \mu)}{\sigma^2} \\
\frac{1}{p(\theta)} \frac{\partial}{\partial \theta} p(\theta) &= \frac{\sqrt{2 \pi \sigma^2}}{e^{-\frac{(\theta - \mu)^2}{2 \sigma^2}}} \left( \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(\theta - \mu)^2}{2 \sigma^2}} \frac{- (\theta - \mu)}{\sigma^2} \right) \\
\frac{1}{p(\theta)} \frac{\partial}{\partial \theta} p(\theta) &= \frac{-(\theta - \mu)}{\sigma^2} = \mu - \theta
\end{align*}$$
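Both derivatives can be sanity-checked symbolically. Here is a minimal sketch using sympy; it is not part of the derivation itself, and the symbol names simply mirror the ones above:

```python
# Sketch: symbolic check of the two derivatives used above.
import sympy as sp

x_i, theta, mu = sp.symbols("x_i theta mu", real=True)
sigma = sp.symbols("sigma", positive=True)

# log-likelihood of a single sample x_i ~ N(theta, sigma^2)
log_lik = sp.log(1 / sp.sqrt(2 * sp.pi * sigma**2)
                 * sp.exp(-(x_i - theta)**2 / (2 * sigma**2)))
print(sp.simplify(sp.diff(log_lik, theta)))        # expect (x_i - theta)/sigma**2

# prior theta ~ N(mu, 1)
prior = 1 / sp.sqrt(2 * sp.pi) * sp.exp(-(theta - mu)**2 / 2)
print(sp.simplify(sp.diff(prior, theta) / prior))  # expect mu - theta
```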
Substituting these back
$$\begin{align*}
0 &= \frac{\partial}{\partial \theta} \sum_{i=1}^n \log p(x_i | \theta) + \frac{1}{p(\theta)} \frac{\partial}{\partial \theta} p(\theta) \\
0 &= \sum_{i=1}^n \frac{x_i - \theta}{\sigma^2} + (\mu - \theta) \\
0 &= \frac{\sum_{i=1}^n x_i - n \theta}{\sigma^2} + (\mu - \theta) \\
0 &= \sum_{i=1}^n x_i - n \theta + \sigma^2 \mu - \sigma^2 \theta \\
(n + \sigma^2)\, \theta &= \sum_{i=1}^n x_i + \sigma^2 \mu \\
\theta_{MAP} &= \frac{\sum_{i=1}^n x_i + \sigma^2 \mu}{n + \sigma^2} \\
\theta_{MAP} &= \frac{n \cdot \frac{1}{n} \sum_{i=1}^n x_i + \sigma^2 \mu}{n + \sigma^2} \\
\theta_{MAP} &= \frac{n \bar{x}}{n + \sigma^2} + \frac{\sigma^2 \mu}{n + \sigma^2}
\end{align*}$$
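The result is a weighted average of the sample mean $\bar{x}$ and the prior mean $\mu$. Below is a minimal numerical check of this closed form against direct maximisation of the log posterior, assuming illustrative parameter values and using scipy's scalar minimiser:

```python
# Sketch: check the closed-form MAP against direct numerical maximisation.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(1)
mu, sigma = 0.0, 2.0                      # prior mean, likelihood std (illustrative)
x = rng.normal(1.5, sigma, size=30)       # synthetic data
n, xbar = len(x), x.mean()

# Closed form: weighted average of sample mean and prior mean.
theta_closed = (n * xbar + sigma**2 * mu) / (n + sigma**2)

def neg_log_post(t):
    # negative unnormalised log posterior: -(log p(D | t) + log p(t))
    return -(norm.logpdf(x, loc=t, scale=sigma).sum()
             + norm.logpdf(t, loc=mu, scale=1.0))

theta_numeric = minimize_scalar(neg_log_post).x
print(theta_closed, theta_numeric)        # the two agree to numerical precision
```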
Pros and Cons
Pros
MAP estimates help avoid overfitting, because the prior acts as a regulariser.
MAP tends towards the MLE asymptotically: as the number of data points grows, the likelihood dominates the prior (illustrated by the sketch at the end of this section).
Cons
It is a point estimate: once we estimate the value of $\theta$, we do not know the uncertainty associated with it.
Not invariant under reparameterisation. For the MLE, if $\tau = g(\theta)$, where $g$ is some function that reparametrises $\theta$, then $\tau_{MLE} = g(\theta_{MLE})$. This is not true for MAP, because the change of variables introduces a Jacobian factor into the posterior density, which can move the mode.
We must assume a prior for $\theta$.
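To illustrate the asymptotic point from the list of pros: the prior's weight $\sigma^2 / (n + \sigma^2)$ vanishes as $n$ grows, so the MAP estimate approaches the sample mean (the MLE). A rough sketch under the same Gaussian model, with illustrative parameter values:

```python
# Sketch: with more data the MAP estimate approaches the MLE (the sample mean).
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, true_theta = 0.0, 2.0, 3.0     # illustrative values

for n in (5, 50, 5_000):
    x = rng.normal(true_theta, sigma, size=n)
    xbar = x.mean()                                   # MLE
    theta_map = (n * xbar + sigma**2 * mu) / (n + sigma**2)
    print(n, round(xbar, 3), round(theta_map, 3))     # gap shrinks as n grows
```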