We now have seen three different bases to represent signals: in the sample domain, in the Fourier domain and in the Haar wavelets domain. The all have different properties. We have reasoned that the Haar wavelets have nice properties regarding images; it is able to represent edges well and errors are local. However a little bit of smoothness is sometimes asked for (again in photography, think of a blue sky: it's white/blue on the bottom, darker on the top). This is exactly what the Daubechies wavelets of order four add.
We now have seen three different bases to represent signals: in the sample domain, in the Fourier domain and in the Haar wavelets domain. They all have different properties. We have reasoned that the Haar wavelets have nice properties regarding images; it is able to represent edges well and errors are local. However a little bit of smoothness is sometimes asked for (again in photography, think of a blue sky: it's white/blue on the bottom, darker on the top). This is exactly what the Daubechies wavelets of order four add.
Instead of explicitly defining or showing the basis elements, we will directly describe the wavelet transform $W$.\footnote{Note that we didn't describe the transforms described in section~\ref{sec:intro}, as this section was motivational only.}In fact we will describe it as an algorithm, as our intent is to implement it.
Instead of explicitly defining or showing the basis elements, we will directly describe the wavelet transform $W$.\footnote{Note that we didn't describe the actual transforms described in section~\ref{sec:intro}, as this section was motivational only.}We will describe the transform in terms of matrices, just as in \cite{numc} and \cite{biss}.
\subsection{The Daubechies wavelet transform}
\subsection{The Daubechies wavelet transform}
We will formulate the algorithm in terms of matrix multiplications \cite{numc}. Before we do so, we need the following constants:
Before we begin with the matrix, we define the following constants:
\begin{align*}
\begin{align*}
c_0 &= \frac{1 + \sqrt{3}}{4 \sqrt{2}}, &\quad
c_0 &= \frac{1 + \sqrt{3}}{4 \sqrt{2}}, &\quad
c_1 &= \frac{3 + \sqrt{3}}{4 \sqrt{2}}, \\
c_1 &= \frac{3 + \sqrt{3}}{4 \sqrt{2}}, \\
@ -42,28 +42,13 @@ In many cases we want to apply the $n \times (n+2)$-matrix $W_n$ to a vector of
\[ P_n \vec{x}=(x_0, \ldots, x_{n-1}, x_0, x_1)\]
\[ P_n \vec{x}=(x_0, \ldots, x_{n-1}, x_0, x_1)\]
Now applying $W_n$ to the periodic $\vec{x}$ is exactly $W_n P_n \vec{x}$.
Now applying $W_n$ to the periodic $\vec{x}$ is exactly $W_n P_n \vec{x}$.
The wavelet transform now consists of multiplying the above matrices in a recursive fashion. Given a vector $\vec{x}$ of length $n$, calculate $\vec{x}^{(1)}= S_n W_n P_n \vec{x}$, and recurse on the first halve of $\vec{x^1}$ using $S_\frac{n}{2}$, $W_\frac{n}{2}$ and $P_\frac{n}{2}$. Repeat this procedure and end with the muliplication of $S_4 W_4 P_4$. More formally the wavelet transform is given by:
The wavelet transform now consists of multiplying the above matrices in a recursive fashion. Given a vector $\vec{x}$ of length $n$, calculate $\vec{x}^{(1)}= S_n W_n P_n \vec{x}$, and recurse on the first halve of $\vec{x}^{(1)}$ using $S_\frac{n}{2}$, $W_\frac{n}{2}$ and $P_\frac{n}{2}$. Repeat this procedure and end with the muliplication of $S_4 W_4 P_4$. More formally the wavelet transform is given by:
\[ W \vec{x} :=\diag(S_4 W_4 P_4, I_4, \ldots, I_4)
\[ W \vec{x} :=\diag(S_4 W_4 P_4, I_4, \ldots, I_4)
When implementing this transform, we don't have to perform the even-odd sort. Instead, we can simply do all calculations in place and use a stride to do the recursion on the even part. This will permute the original result.
\todo{Tell a bit more?}
\todo{Add images to show the difference}
\subsection{Costs}
We will briefly analyze the cost of the transform by counting the number of \emph{flops}, that is muliplications and additions. Computing on element of $W_n \vec{x}$ costs $4$ multiplications and $3$ additions. So $W_n \vec{x}$ costs $7n$ flops. Applying $S_n$ and $P_n$ do not require any flops, as they are mere data manipulations. Consequently computing $W \vec{x}$ costs
\[7\times n +7\times\frac{n}{2}+\cdots+7\times8+7\times4\text{ flops }. \]
Using the geometric series $\sum_{i=0}^\infty2^{-i}=2$ we can bound the number of flops by $14n$.
Compared to the FFT this is a big improvement in terms of scalability, as this wavelet transform has a linear complexity $\BigO{n}$, but the FFT has a complexity of $\BigO{n \log n}$. This is however also a burden, as it leaves not much room for overhead induced by parallelizing the algorithm. We will see an precies analysis of communication costs in section~\ref{sec:par}.
\todo{Do a simple upperbound of communication here?}
\subsection{The inverse}
\subsection{The inverse}
The wavelet transform is invertible. We will proof this by first showing that $S_n$ and $W_n P_n$ are invertible. In fact, they are orthogonal, which means that the inverse is given by the transpose.
The wavelet transform is invertible. We will proof this by first showing that $S_n$ and $W_n P_n$ are invertible. In fact, they are orthogonal, which means that the inverse is given by the transpose.
@ -81,10 +66,34 @@ The wavelet transform is invertible. We will proof this by first showing that $S
The matrix $W$ is invertible with $W^{-1}= W^T$.
The matrix $W$ is invertible with $W^{-1}= W^T$.
\end{theorem}
\end{theorem}
\begin{proof}
\begin{proof}
First note that$\diag(S_m W_m P_m, I_m, \ldots, I_m)$ is orthogonal by the above lemma. Now using the fact that the multiplications of two orthogonal matrices is again orthogonal we see that$W$ is orthogonal.
Using the fact that the composition of two orthogonal maps is again orthogonal, we easily see that $S_m W_m P_m$ is orthogonal for all even $m$. As a consequence$\diag(S_m W_m P_m, I_m, \ldots, I_m)$ is also orthogonal, and again by multiplication$W$ is orthogonal.
\end{proof}
\end{proof}
\todo{Note that I didnt parallelize the inverse}
\subsection{In place}
\todo{Add images to show the difference}
When implementing this transform, we don't have to perform the even-odd sort. Instead, we can simply do all calculations in place and use a stride to skip the odd elements in further steps. The only difference with the actual transform $W \vec{x}$ is that the output is permuted. However, in our application of image compression, we are not interested in the index of a coefficient, so we don't need to rectify this. In the rest of this paper the Daubechies wavelet transform will refer to this (in-place) variant.
Assume we have a function \texttt{apply_wn_pn(x, n, s)} which computes $W_n P_n (x_0, x_s, \ldots, x_{s(n-1)})$ in place\footnote{Implementing this is not so hard, but it wouldn't make this section nicer.}. The whole algorithm then can nicely be expressed as
\begin{lstlistings}
wavelet(x, n) =
for i = 1 to n/4
apply_wn_pn(x, n/i, i)
i = i*2
\end{lstlistings}
For future reference we also define the following computation: \texttt{apply_wn(x, y_0, y_1, n, s)} which computes $W_n (x_0, \ldots, y_0, y_1)$. Note that \texttt{apply_wn_pn(x, n, s)} can now be expressed as \texttt{apply_wn(x, x_0, x_s, n, s)}.
\subsection{Costs}
We will briefly analyze the cost of the transform by counting the number of \emph{flops}, that is muliplications and additions. Computing one element of $W_n \vec{x}$ costs $4$ multiplications and $3$ additions. So $W_n \vec{x}$ costs $7n$ flops. Applying $P_n$ does not require any flops, as is is a mere data manipulation. Consequently computing $W \vec{x}$ costs
\[7\times n +7\times\frac{n}{2}+\cdots+7\times8+7\times4\text{ flops }. \]
Using the geometric series $\sum_{i=0}^\infty2^{-i}=2$ we can bound the number of flops by $14n$.
Compared to the FFT this is a big improvement in terms of scalability, as this wavelet transform has a linear complexity $\BigO{n}$, but the FFT has a complexity of $\BigO{n \log n}$. This is however also a burden, as it leaves not much room for overhead induced by parallelizing the algorithm. We will see an precies analysis of communication costs in section~\ref{sec:par}.
\todo{Do a simple upperbound of communication here?}
In this section we will motivate the need for wavelets. We will start with the well known Fourier transform and discuss things we can change. As an example we will be using a 1-dimensional signal of length $128$. This section will be a bit informal and will not focus on algorithms.
We start this paper by motivating the need for wavelets. As a starting point of signal processing we first consider the well known Fourier transform. As an example we will be using a 1-dimensional signal of length $128$. As this section is mainly for the motivations we will not be very precise or give concrete algorithms.
\subsection{Recalling the Fourier transform}
\subsection{Recalling the Fourier transform}
Recall the Fourier transform; given an input signal $x =\sum_{i=1}^{128} x_i e_i$ (written on the standard basis) we can compute Fourier coefficients $x'_i$ such that $x =\sum_{i=1}^{128} x'_i f_i$. As we're not interested in the mathematics behind this transform, we will not specify $f_i$. Conceptually the Fourier transform is a basis transformation:
Recall the Fourier transform; given an input signal $x =\sum_{i=1}^{128} x_i e_i$ (written on the standard basis$\{e_i\}_i$) we can compute Fourier coefficients $x'_i$ such that $x =\sum_{i=1}^{128} x'_i f_i$. As we're not interested in the mathematics behind this transform, we will not specify the basis $\{f_i\}_i$. Conceptually the Fourier transform is a basis transformation:
$$ SampleDomain \to FourierDomain. $$
$$ SampleDomain \to FourierDomain. $$
Furthermore this transformation has an inverse. Applications of this transform consist of going to the Fourier domain, applying some (easy to compute) function there and go back to sample domain again.
Furthermore this transformation has an inverse. Real world applications of this transform often consists of going to the Fourier domain, applying some (easy to compute) function and go back to sample domain.This happens often as measurements often happen at intervals and thus generate samples, but in research people are often interested in the global signal represented by the signals.
In figure~\ref{fig:fourier_concepts}we've written an input signal of length $128$ on the standard basis, and on the Fourier basis (simplified, for illustrational purposes). We see that this signal is better expressed in the Fourier domain, as we only need three coefficients instead of all $128$.
In figure~\ref{fig:fourier_concepts} an input signal of length $128$ is expressed on the standard basis, and on the Fourier basis (simplified, for illustrational purposes). We see that this signal is better expressed in the Fourier domain, as we only need three coefficients instead of all $128$.
\todo{
\todo{
fig:fourier\_concepts
fig:fourier\_concepts
spelling out a sum of basis elements in both domains
spelling out a sum of basis elements in both domains
}
}
We see that we might even do compression based on the Fourier coefficients. Instead of sending all samples, we just only a few coefficients from which we are able to approximate the original input. However there is a shortcoming to this. Consider the following scenario. A sensor on Mars detects a signal, transforms it and sends the coefficients to earth. During the transmission one of the coefficients is corrupted. This results in a wave across the whole signal. The error is \emph{non-local}. If, however, we decided to send the original samples, a corrupted sample would only affect a small part of the signal, i.e. the error is \emph{local}. This is illustrated in figure~\ref{fig:fourier_error}.
We see that we might even do compression based on these Fourier coefficients. Instead of sending all samples, we just send only a few coefficients from which we are able to approximate the original input. However there is a shortcoming to this. Consider the following scenario. A sensor on Mars detects a signal, transforms it and sends the coefficients to earth. During the transmission one of the coefficients is corrupted. This results in a wave across the whole signal. The error is \emph{non-local}. If, however, we decided to send the original samples, a corrupted sample would only affect a small part of the signal, i.e. the error is \emph{local}. This is illustrated in figure~\ref{fig:fourier_error}.
\todo{
\todo{
fig:fourier\_error
fig:fourier\_error
@ -26,7 +26,7 @@ We see that we might even do compression based on the Fourier coefficients. Inst
\subsection{The simplest wavelet transform}
\subsection{The simplest wavelet transform}
At the heart of the Fourier transform is the choice of the basis elements $f_i$. With a bit of creativity we can cook up different basis elements with different properties. To illustrate this we will have a quick look at the so-called ``Haar wavelets''. In our case where $n=128$ we can define the following $128$ elements:
At the heart of the Fourier transform is the choice of the basis elements $f_i$. With a bit of creativity we can cook up different basis elements with different properties. To illustrate this we will have a quick look at the so-called \emph{Haar wavelets}. In our case where $n=128$ we can define the following $128$ elements:
We will refer to these elements as \emph{Haar wavelets}. To give a better feeling of these wavelets, they are plotted in figure~\ref{fig:haar_waveleta} on the standard basis. There is also an effective way to write an element written in the standard basis on this new basis, this is the Haar wavelet transform. Again our example can be written on this new basis, and again we see that the first coefficient already approximates the signal and that the other coefficients refine it.
We will refer to these elements as \emph{Haar wavelets}. To give a better feeling of these wavelets, some of them are plotted in figure~\ref{fig:haar_waveleta} on the standard basis. There is also an effective way to express a signal represented on the standard basis on this new basis. Again our example can be written on this new basis, and again we see that the first coefficient already approximates the signal and that the other coefficients refine it.
To go back to our problem of noise, if we add $0.5*h_9$ (there is a shift of indices) to this signal, only a small part of the signal is disturbed as shown in figure~\ref{fig:haar_error}.
To go back to our problem of noise, if we add $0.5*h_10$ to this signal, only a small part of the signal is disturbed as shown in figure~\ref{fig:haar_error}.
Another important difference is the way these basis elements can represent signals. With the Fourier basis elements we can easily approximate smooth signals, but with the Haar basis elements this is much harder. However representing a piecewise constant signal is easier with the Haar wavelets. In photography the latter is preferred, as edges are very common (think of branches of a tree against a clear sky or hard edges of a building). So depending on the application this \emph{non-smoothness} is either good or bad.
Another important difference is the way these basis elements can represent signals. With the Fourier basis elements we can easily approximate smooth signals, but with the Haar basis elements this is much harder. However representing a piecewise constant signal is almost trivial with the Haar wavelets. In photography the latter is preferred, as edges are very common (think of black branches of a tree against a clear sky). So depending on the application this \emph{non-smoothness} is either good or bad.
In this section we will look at how we can parallelize the Daubechies wavelet transform. We will first discuss a naive, and simple solution in which we communicate at every step. Secondly, we look at a solution which only communicates once.
In this section we will look at how we can parallelize the Daubechies wavelet transform. We will first discuss a naive, and simple solution in which we communicate at every step. Secondly, we look at a solution which only communicates once. By analysing the BSP costs we see that, depending on the machine, both solutions might be more performant than the other. The get an optimal solution we will derive a hybrid solution, which can dynamically choose the best solution depending on the machine.
By analysing the BSP-costs we see that, depending on the machine, both solutions can be more performant than the other. At last we will derive a hybrid solution, which can dynamically choose the best solution depending on the machine.
We already assumed the input size $n$ to be a power of two. We now additionally assume the number of processors $p$ is a power of two and (much) less than $n$. The data will be block distributed as input and output.
We already assumed the input size $n$ to be a power of two. We now additionally assume the number of processors $p$ is a power of two and (much) less than $n$. In all the given solutions we use a block distribution, each block thus contains $b =\frac{n}{p}$ elements (and is also a power of two).
\subsection{Many communications steps}
\subsection{Many communications steps}
The data $\vec{x}= x_0, \ldots, x_{n-1}$ is distributed among the processors with a block distribution, so processor $\proc{s}$ has the elements $\vec{x'}= x_{sb}, \ldots, x_{sb+b-1}$. At the first step of the algorithm we want to compute $W_b x'$, but we need to more elements in order to do so.
The data $\vec{x}= x_0, \ldots, x_{n-1}$ is distributed among the processors with a block distribution, so processor $\proc{s}$ has the elements $\vec{x'}= x_{sb}, \ldots, x_{sb+b-1}$. The first step of the algorithm consists of computing $\vec{x}^{(1)}= S_n W_n P_n \vec{x}$. We can already locally compute the first $b-2$ elements $x^{(1)}_{sb}, \ldots, x^{(1)}_{sb+b-3}$. For the remaining two elements $x^{(1)}_{sb+b-2}$ and $x^{(1)}_{sb+b-1}$ we need the first two elements on processor $s+1$. In the consequent steps a similar reasoning holds, so we derive a stub for the algorithm:
\begin{lstlistings}
for i=1 to b/4
y_0 <- get x_{(s+1)b} from processor s+1
y_1 <- get x_{(s+1)b+2^i} from processor s+1
apply_wn(x, y_0, y_1, b/i, i)
\end{lstlistings}
The next step in the sequential algorithm would be applying $W_{2p}$. The easiest way to do this is to let one of the processors finish the job from here (we could also let the other processors compute it redundantly). If we decided to let processor 0 finish the job, the rest of the algorithm would look like: