Maximum a posteriori estimation


Given data $D = \left(x_1, x_2, \ldots, x_n \right)$, $x_i \in \mathcal{R}^D$, and an assumed joint probability distribution $p(D, \theta)$, where $\theta$ is a random variable, our goal is to choose a good value of $\theta$ for $D$ such that

$$\theta_{MAP} = \arg\max_\theta p(\theta | D)$$

The maximum a posteriori (MAP) estimate maximises the posterior distribution. The term "posterior" comes from Bayes' theorem, which combines the likelihood used in MLE with a prior distribution over $\theta$:

$$p(\theta | D) = \frac{p(D | \theta)\, p(\theta)}{p(D)}$$
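
Concretely, since $p(D)$ does not depend on $\theta$, the MAP estimate can be found by maximising the unnormalised log posterior $\log p(D | \theta) + \log p(\theta)$. Below is a minimal numerical sketch, assuming a Gaussian likelihood and a Gaussian prior (the same model as the example that follows); the data values and the `negative_log_posterior` helper are made up for illustration.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

# Made-up data and assumed model: x_i ~ N(theta, sigma^2), prior theta ~ N(mu, 1).
x = np.array([2.1, 1.7, 2.4, 1.9, 2.2])
sigma, mu = 1.0, 0.0

def negative_log_posterior(theta):
    # log p(D) is constant in theta, so it can be ignored when maximising.
    log_likelihood = norm.logpdf(x, loc=theta, scale=sigma).sum()
    log_prior = norm.logpdf(theta, loc=mu, scale=1.0)
    return -(log_likelihood + log_prior)

theta_map = minimize_scalar(negative_log_posterior).x
print(theta_map)
```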

Example

Compute the MAP estimate for a univariate Gaussian distribution with unknown mean, given data samples $D = \left(x_1, x_2, \ldots, x_n \right)$ with $x_i \in \mathcal{R}$.

Assumptions:
Let us assume the prior on $\theta$ is a univariate Gaussian, $\theta \sim N(\mu, 1)$. Also assume that, given $\theta$, the data samples are i.i.d. and normally distributed, $x_i \sim N(\theta, \sigma^2)$, so that

$$p(x_1, x_2, \ldots, x_n | \theta) = \prod_{i=1}^n p(x_i | \theta)$$
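
In practice this product is evaluated on the log scale, because multiplying many small densities underflows quickly. A quick sketch with made-up numbers, checking that the product of densities matches the exponentiated sum of log densities for a small sample:

```python
import numpy as np
from scipy.stats import norm

x = np.array([0.3, -1.2, 0.8])   # made-up samples
theta, sigma = 0.5, 1.0          # assumed mean and standard deviation

product_of_densities = np.prod(norm.pdf(x, loc=theta, scale=sigma))
sum_of_log_densities = np.sum(norm.logpdf(x, loc=theta, scale=sigma))

# Identical up to floating point; for large n only the log form stays numerically stable.
assert np.isclose(product_of_densities, np.exp(sum_of_log_densities))
```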

Now,

$$\begin{align*} \theta_{MAP} &= \arg\max_\theta p(\theta | D) \\ &= \arg\max_\theta \frac{p(D | \theta)\, p(\theta)}{p(D)} \\ &= \arg\max_\theta \log p(D | \theta) + \log p(\theta) - \log p(D) \end{align*}$$

Setting the derivative with respect to $\theta$ to zero (the $\log p(D)$ term does not depend on $\theta$, so it drops out):

$$\begin{align*} 0 &= \frac{\partial}{\partial \theta} \left(\log p(D | \theta) + \log p(\theta)\right) \\ 0 &= \frac{\partial}{\partial \theta} \log \prod_{i=1}^n p(x_i | \theta) + \frac{1}{p(\theta)} \frac{\partial}{\partial \theta} p(\theta) \\ 0 &= \frac{\partial}{\partial \theta} \sum_{i=1}^n \log p(x_i | \theta) + \frac{1}{p(\theta)} \frac{\partial}{\partial \theta} p(\theta) \end{align*}$$

Consider the likelihood term first:

$$\begin{align*} \log p(x_i | \theta) &= \log \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x_i - \theta)^2}{2 \sigma^2}} \\ &= \log \frac{1}{\sqrt{2 \pi \sigma^2}} - \frac{(x_i - \theta)^2}{2 \sigma^2} \\ \frac{\partial}{\partial \theta} \log p(x_i | \theta) &= \frac{x_i - \theta}{\sigma^2} \end{align*}$$

And for the prior (which has unit variance, since $\theta \sim N(\mu, 1)$):

$$\begin{align*} p(\theta) &= \frac{1}{\sqrt{2 \pi}} e^{-\frac{(\theta - \mu)^2}{2}} \\ \frac{\partial}{\partial \theta} p(\theta) &= \frac{1}{\sqrt{2 \pi}} e^{-\frac{(\theta - \mu)^2}{2}} \left( -(\theta - \mu) \right) \\ \frac{1}{p(\theta)} \frac{\partial}{\partial \theta} p(\theta) &= -(\theta - \mu) \end{align*}$$
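
Both derivative formulas can be sanity-checked with finite differences. A small sketch, assuming the same Gaussian forms and using made-up values for $\theta$, $\mu$, $\sigma$ and $x_i$:

```python
import numpy as np
from scipy.stats import norm

theta, mu, sigma, x_i, h = 0.7, 0.0, 2.0, 1.5, 1e-6

# d/dtheta log p(x_i | theta) should equal (x_i - theta) / sigma^2
log_lik = lambda t: norm.logpdf(x_i, loc=t, scale=sigma)
numeric = (log_lik(theta + h) - log_lik(theta - h)) / (2 * h)
assert np.isclose(numeric, (x_i - theta) / sigma**2)

# d/dtheta log p(theta) should equal -(theta - mu) for the unit-variance prior
log_prior = lambda t: norm.logpdf(t, loc=mu, scale=1.0)
numeric = (log_prior(theta + h) - log_prior(theta - h)) / (2 * h)
assert np.isclose(numeric, -(theta - mu))
```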

Substituting these back and solving for $\theta$:

$$\begin{align*} 0 &= \frac{\partial}{\partial \theta} \sum_{i=1}^n \log p(x_i | \theta) + \frac{1}{p(\theta)} \frac{\partial}{\partial \theta} p(\theta) \\ 0 &= \sum_{i=1}^n \frac{x_i - \theta}{\sigma^2} - (\theta - \mu) \\ 0 &= \frac{\sum_{i=1}^n x_i - n \theta}{\sigma^2} - (\theta - \mu) \\ 0 &= \sum_{i=1}^n x_i - n \theta - \sigma^2 \theta + \sigma^2 \mu \\ (\sigma^2 + n)\, \theta &= \sum_{i=1}^n x_i + \sigma^2 \mu \\ \theta_{MAP} &= \frac{\sum_{i=1}^n x_i + \sigma^2 \mu}{\sigma^2 + n} \\ \theta_{MAP} &= \frac{n \left( \frac{1}{n} \sum_{i=1}^n x_i \right) + \sigma^2 \mu}{\sigma^2 + n} \\ \theta_{MAP} &= \frac{n \bar{x}}{\sigma^2 + n} + \frac{\sigma^2 \mu}{\sigma^2 + n} \end{align*}$$
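
So $\theta_{MAP}$ is a weighted average of the sample mean $\bar{x}$ and the prior mean $\mu$: the more data we have, the more weight goes to $\bar{x}$. A short sketch (with randomly generated, made-up data) checking the closed form against a brute-force maximisation of the log posterior:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(0)
mu, sigma, n = 0.0, 2.0, 20           # assumed prior mean, likelihood std-dev, sample size
x = rng.normal(1.5, sigma, size=n)    # made-up data centred around 1.5

# Closed form derived above: weighted average of the sample mean and the prior mean.
theta_closed = (n * x.mean() + sigma**2 * mu) / (sigma**2 + n)

# Brute force: maximise log p(D | theta) + log p(theta) numerically.
neg_log_post = lambda t: -(norm.logpdf(x, loc=t, scale=sigma).sum()
                           + norm.logpdf(t, loc=mu, scale=1.0))
theta_numeric = minimize_scalar(neg_log_post).x

assert np.isclose(theta_closed, theta_numeric, atol=1e-5)
print(theta_closed, x.mean())         # MAP is pulled from x-bar towards mu
```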

Pros and Cons

Pros

  1. MAP estimation helps avoid overfitting, because the prior acts as a regulariser.
  2. The MAP estimate tends towards the MLE asymptotically (as the number of data points grows), since the likelihood term dominates the prior; see the sketch after this list.
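
The second point can be seen numerically: the weight $\frac{n}{\sigma^2 + n}$ on $\bar{x}$ tends to 1 as $n$ grows, so the MAP estimate converges to the MLE $\bar{x}$. A small sketch under the same Gaussian model, with made-up data:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 0.0, 1.0                  # assumed prior mean and likelihood std-dev

for n in (5, 50, 500, 5000):
    x = rng.normal(2.0, sigma, size=n)                           # made-up data centred at 2.0
    theta_mle = x.mean()
    theta_map = (n * x.mean() + sigma**2 * mu) / (sigma**2 + n)
    print(n, abs(theta_map - theta_mle))                         # gap shrinks roughly like 1/n
```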

Cons

  1. It is a point estimate: once we have estimated the value of $\theta$, we do not know the uncertainty associated with it.
  2. It is not invariant under reparameterisation. For the MLE, if $\tau = g(\theta)$, where $g$ is some function that reparametrises $\theta$, then $\tau_{MLE} = g(\theta_{MLE})$; this does not hold for MAP in general, because the density of $\theta$ picks up a Jacobian factor under the transformation (a numerical sketch follows this list).
  3. We must assume a prior for $\theta$.
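
To illustrate the second con, take a Gaussian posterior over $\theta$ (as in the conjugate example above) and reparametrise with $\tau = e^\theta$: the density of $\tau$ picks up a Jacobian factor $1/\tau$, which moves the mode. A sketch with made-up values for the posterior mean and standard deviation:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical Gaussian posterior over theta (e.g. the conjugate posterior above):
# the mode of a Gaussian is its mean, so theta_MAP = m.
m, s = 1.0, 0.5
theta_map = m

# Reparametrise tau = exp(theta).  Change of variables adds a Jacobian 1/tau:
#   p_tau(tau) = p_theta(log tau) * (1 / tau)
tau = np.linspace(0.5, 6.0, 200_000)
log_p_tau = norm.logpdf(np.log(tau), loc=m, scale=s) - np.log(tau)
tau_map = tau[np.argmax(log_p_tau)]

print(np.exp(theta_map))   # g(theta_MAP)  ~ 2.718
print(tau_map)             # mode of p_tau ~ 2.117 = exp(m - s^2), not the same
```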