\section{Image Compression}
\label{sec:img}
As indicated in section~\ref{sec:intro}, we would like to use the wavelets for compression. We will do this as follows: the image is transformed to the wavelet domain and then we throw away the coefficients with a small absolute value. These coefficients influence the image only subtly, so the image will still be recognizable. We will briefly analyse the compression rate. We restrict ourselves to greyscale images.
\subsection{Compression}
In this subsection we consider an image $X$ to be a real-valued $n \times m$-matrix: $X = \{x_{i,j} \,|\, 0 \leq i < n,\, 0 \leq j < m\}$ (again $n$ and $m$ are powers of two). To determine the compression rate we have to make some assumptions on the sizes of certain types. We assume a real value to be 8 bytes and an index to be 4 bytes. This assumption restricts the size of the matrix ($0 \leq n < 2^{32}$ and $0 \leq m < 2^{32}$). Thus the original image $X$ has a size of $8mn$ bytes, as we can store the matrix as a contiguous array (we ignore the fact that we should also store the size). As discussed in section~\ref{sec:dau} we can apply the Daubechies wavelet transform in both directions, which results in a matrix $Y = \{y_{i,j} \,|\, 0 \leq i < n,\, 0 \leq j < m\}$ of the same size. The idea is to throw away all elements $y_{i,j}$ with $|y_{i,j}| \leq \tau$ for some threshold parameter $\tau$. The result is the compressed image:
\[ C = \{ (y_{i,j}, i, j) \,|\, 0 \leq i < n,\, 0 \leq j < m,\, |y_{i,j}| > \tau \}. \]
Note that we now also have to store the locations of the coefficients which we did not throw away. Assume that there are $c$ such elements. Then the compressed size is $(8+4+4) \times c = 16c$ bytes, so the compression becomes useful when $c \leq \frac{1}{2}nm$. Throwing away more than half of the coefficients might seem like a big loss in quality, but as we can see in figure~\ref{fig:compression}, the wavelets are capable of capturing the important shapes and edges of the image, even if we throw away a lot. As the wavelet transform is invertible, we can decompress $C$ to obtain an approximation of $X$.
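As an illustration, a minimal sketch of this thresholding step in Python (assuming the transformed coefficient matrix $Y$ is already available as a NumPy array; the actual implementation may differ in its details) could look as follows:
\begin{verbatim}
import numpy as np

def compress(Y, tau):
    """Keep only the coefficients with |y| > tau, together with
    their indices (i, j)."""
    idx = np.argwhere(np.abs(Y) > tau)
    return [(Y[i, j], i, j) for i, j in idx]

def compressed_size(c):
    """Size in bytes: 8 bytes per real value plus two 4-byte indices."""
    return 16 * c
\end{verbatim}
Comparing $16c$ with the original $8mn$ bytes immediately gives the break-even point $c \leq \frac{1}{2}nm$ mentioned above.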
\begin{figure}
\begin{subfigure}[b]{0.25\textwidth}
\centering
\includegraphics[width=\textwidth]{oostenrijk.png}
\caption{Original}
\end{subfigure}~
\begin{subfigure}[b]{0.25\textwidth}
\centering
\includegraphics[width=\textwidth]{10oostenrijk.png}
\caption{$\rho=\frac{1}{10}$}
\end{subfigure}~
\begin{subfigure}[b]{0.25\textwidth}
\centering
\includegraphics[width=\textwidth]{100oostenrijk.png}
\caption{$\rho=\frac{1}{100}$}
\end{subfigure}~
\begin{subfigure}[b]{0.25\textwidth}
\centering
\includegraphics[width=\textwidth]{1000oostenrijk.png}
\caption{$\rho=\frac{1}{1000}$}
\end{subfigure}
\caption{Test image with a fraction $\rho$ of the coefficients left.}
\label{fig:compression}
\end{figure}
\subsection{Quantization}
We made a big assumption in the previous subsection which does not hold at all in practice, namely that an image $X$ is a \emph{real-valued} $n \times m$-matrix. Most images are quantized to take values in $[0, 255]$, i.e. images consist of pixels of a single byte. This means that the size of an image $X$ is simply $nm$ bytes. Our algorithm only works for real-valued data, so we convert these bytes to reals and perform our algorithm to obtain $Y$. The values of $X$ are nicely distributed in $[0, 255]$, whereas $Y$ has a totally different distribution: most of it is concentrated around 0. To have a decent compression rate, we cannot store these values as reals (which take 8 bytes each). So we would like to quantize the values which we do not throw away. For a suitable quantization $f: \R \to [0, 255]$ the compressed image is now:
\[ C = \{ (f(y_{i,j}), i, j) \,|\, 0 \leq i < n,\, 0 \leq j < m,\, |y_{i,j}| > \tau \}, \]
with a size of $9c$ bytes instead of $nm$ bytes.
The only thing left is to specify the quantization function $f$. We want it to have three properties. First of all we would like it to be invertible in order to decompress the image. Of course exact inversion is impossible, but we can ask for a function $g$ such that $|y - g(f(y))|$ is small. Secondly, we want $f$ to make the distribution of the values of $Y$ more uniform. And lastly, we do not want $f$ to spend its range on the values we throw away. In figure~\ref{fig:quantization} three such functions are considered. For the exact details we refer to the implementation. Note that a linear quantization performs very badly.
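As a rough sketch of the logarithmic variant (the constants and the handling of the sign are assumptions here; the implementation is the authoritative source), a quantization pair $f$ and $g$ could look like this:
\begin{verbatim}
import numpy as np

def f(y, y_max):
    """Quantize a coefficient to one byte: a sign bit plus a
    logarithmically scaled magnitude in 0..127."""
    sign = 0 if y >= 0 else 128
    mag = np.log1p(abs(y)) / np.log1p(y_max)   # in [0, 1]
    return sign | int(round(mag * 127))

def g(q, y_max):
    """Approximate inverse of f."""
    sign = -1.0 if q & 128 else 1.0
    mag = (q & 127) / 127.0
    return sign * np.expm1(mag * np.log1p(y_max))
\end{verbatim}
Here \texttt{y\_max} denotes the largest absolute coefficient that is kept (a hypothetical parameter). A logarithmic scale spends more of the byte range near zero, where most of the values of $Y$ are concentrated, which is exactly the second property we asked for.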
\begin{figure}
\begin{subfigure}[b]{0.25\textwidth}
\centering
\includegraphics[width=\textwidth]{eindhoven.png}
\caption{Original}
\end{subfigure}~
\begin{subfigure}[b]{0.25\textwidth}
\centering
\includegraphics[width=\textwidth]{eindhoven_lin.png}
\caption{Linear}
\end{subfigure}~
\begin{subfigure}[b]{0.25\textwidth}
\centering
\includegraphics[width=\textwidth]{eindhoven_log.png}
\caption{Logarithmic}
\end{subfigure}~
\begin{subfigure}[b]{0.25\textwidth}
\centering
\includegraphics[width=\textwidth]{eindhoven_sqrt.png}
\caption{Square root}
\end{subfigure}
\caption{Three different methods to quantize the coefficients.}
\label{fig:quantization}
\end{figure}
\subsection{Parallel implementation}
We have already seen how to perform a two-dimensional wavelet transform in section~\ref{sec:dau}. By parallelizing the one-dimensional case we get the two-dimensional one for free. However, we can do a bit better. There was a slight problem in the parallelization, namely that there was a dedicated processor which had to finish the algorithm. With more data at hand we can now do this last step in parallel as well. Let $X$ be an image of width $w$ and height $h$ and let $p$ be the number of processors. We divide the image into blocks of size $\frac{w}{p}\times\frac{h}{p}$ and assign processors to these blocks as shown in figure~\ref{fig:par2d}. When doing the wavelet transform horizontally, we let processor 0 finish the first $\frac{h}{p}$ rows, processor 1 the second $\frac{h}{p}$ rows, and so on. Similarly for the columns. This assignment ensures that every processor finishes the algorithm for some rows and columns, hence no processor will be idling.
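A minimal sketch of one assignment with this property (a cyclic, Latin-square-like distribution; whether it matches figure~\ref{fig:par2d} exactly is an assumption) is:
\begin{verbatim}
def block_owner(r, c, p):
    """Hypothetical owner of the block in block-row r and
    block-column c. Every block-row and every block-column
    contains each of the p processors exactly once."""
    return (r + c) % p

# For p = 4 this prints a 4 x 4 Latin square of processor ids.
p = 4
for r in range(p):
    print([block_owner(r, c, p) for c in range(p)])
\end{verbatim}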
\begin{figure}
\centering
\includegraphics{2dpar}
\caption{Assignment of blocks to processors 0, 1, 2 and 3}
\label{fig:par2d}
\end{figure}
The sequential cost of the algorithm is $28wh$ flops, as we have to perform the wavelet transform in both directions. Most of the analysis of the previous section still holds. Let us consider the horizontal direction: instead of sending $C_m$ words in each step, we now have to send $\frac{h}{p}C_m$ words (one chunk for each of the $\frac{h}{p}$ rows a processor handles), and so on. Almost every term in the cost analysis is multiplied by $\frac{h}{p}$, except for the factors of $l$, as we do not have to synchronize more often (we only have to send bigger chunks of data). Doing this in both directions yields the cost of the parallel two-dimensional wavelet transform:
\begin{align*}
T &= 14\frac{h}{p}D_m^H + 14\frac{h}{p}p + \frac{1}{m}\logt(\frac{w}{p})(\frac{h}{p}C_m^Hg+l) + 2(p-1)\frac{h}{p}g + 2l \quad\text{ (horizontal part)}\\
&+ 14\frac{w}{p}D_m^V + 14\frac{w}{p}p + \frac{1}{m}\logt(\frac{h}{p})(\frac{w}{p}C_m^Vg+l) + 2(p-1)\frac{w}{p}g + 2l \quad\text{ (vertical part)}
\end{align*}
where the $(-)^H$ and $(-)^V$ decorations indicate the \emph{horizontal} and \emph{vertical} constants (recall that the constant $D_m$ depended on $n$; here the relevant length is $w$ for the horizontal and $h$ for the vertical direction). If $n=w=h$ and $m=1$, we can simplify this to:
\[ T = 28\frac{n^2}{p} + 28n + 2\logt(\frac{n}{p})(2\frac{n}{p}g+l) + 4(p-1)\frac{n}{p}g + 4l \text{ flops}, \]
compared to the $28n^2$ flops of the sequential algorithm.
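To get a feeling for these numbers, the following sketch evaluates both expressions for some hypothetical BSP parameters $g$ and $l$ (the real values have to be benchmarked on the target machine):
\begin{verbatim}
from math import log2

def t_par(n, p, g, l):
    """Parallel cost T for n = w = h and m = 1 (formula above)."""
    return (28 * n * n / p + 28 * n
            + 2 * log2(n / p) * (2 * n / p * g + l)
            + 4 * (p - 1) * n / p * g + 4 * l)

def t_seq(n):
    """Sequential cost of the two-dimensional transform."""
    return 28 * n * n

# Hypothetical parameters: a 1024 x 1024 image on 4 processors.
n, p, g, l = 1024, 4, 10.0, 1000.0
print("predicted speedup:", t_seq(n) / t_par(n, p, g, l))
\end{verbatim}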