We have discussed the Daubechies wavelet transform of order four to quite some extent. As for the application of image compression, we have seen that the wavelets capture both the global shades of the image as well as the sharp edges. This means that we can throw away a lot of subtle information, while retaining recognizable images.
\vspace{3cm}
\begin{figure}[H]
\centering
\includegraphics[width=0.3\textwidth]{parijs.png}
\caption{My girlfriend and I in Paris. Still romantic with only 200 coefficients.}
\end{figure}
\caption{The wavelet transform using a stride to skip the odd indices.}
\end{subfigure}
\end{tabular}
\caption{There are two ways to implement the wavelet transform.}
\label{fig:wavelet_stride}
\end{figure}
Assume we already have a function \texttt{apply\_wn\_pn(x, n, s)} which computes the expression $W_n P_n (x_0, x_s, \ldots, x_{s(n-1)})$ (note the striding) in place\footnote{Implementation not shown here, as it is straightforward.}. The whole algorithm can then be expressed nicely as:
\begin{lstlisting}
for i=1 to n/4
    apply_wn_pn(x, n/i, i)
    i = i*2
\end{lstlisting}
For future reference we also define the following computation: \texttt{apply\_wn(x, y0, y1, n, s)}, which computes $W_n (x_0, \ldots, y_0, y_1)$, again with stride \texttt{s}. Note that \texttt{apply\_wn\_pn(x, n, s)} can now be expressed as \texttt{apply\_wn(x, x0, xs, n, s)}.
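To make this concrete, the following C sketch shows one way such routines could look for the Daubechies-4 filter. It is only meant to illustrate the data access pattern: the coefficients are the standard DAU4 values, the smooth and detail results are interleaved on the even and odd strided positions (so that the stride-doubling loop above picks up exactly the smooth part in the next step), and a small temporary buffer is used rather than a genuinely in-place update. The exact $W_n P_n$ factorization used in the text may differ in these details.
\begin{lstlisting}
#include <stdlib.h>
#include <math.h>

/* One Daubechies-4 step on the n strided elements x[0], x[s], ...,
   x[(n-1)*s]. The two values that stick out past the end are passed in
   as y0 and y1 (they may even live on another processor). Smooth
   coefficients go to the even strided slots, details to the odd ones. */
void apply_wn(double *x, double y0, double y1, int n, int s)
{
    const double r3 = sqrt(3.0), nf = 4.0 * sqrt(2.0);
    const double c0 = (1+r3)/nf, c1 = (3+r3)/nf;
    const double c2 = (3-r3)/nf, c3 = (1-r3)/nf;

    double *t = malloc((n + 2) * sizeof(double)); /* gather the strided input */
    for (int i = 0; i < n; i++) t[i] = x[(long)i * s];
    t[n] = y0; t[n + 1] = y1;

    for (int i = 0; i < n / 2; i++) {
        double a = t[2*i], b = t[2*i+1], c = t[2*i+2], d = t[2*i+3];
        x[(long)(2*i)   * s] = c0*a + c1*b + c2*c + c3*d;  /* smooth */
        x[(long)(2*i+1) * s] = c3*a - c2*b + c1*c - c0*d;  /* detail */
    }
    free(t);
}

/* apply_wn_pn(x, n, s) = apply_wn(x, x0, xs, n, s): wrap around periodically. */
void apply_wn_pn(double *x, int n, int s)
{
    apply_wn(x, x[0], x[s], n, s);
}

/* The driver loop of the pseudocode above; n is a power of two, n >= 4. */
void wavelet(double *x, int n)
{
    for (int i = 1; i <= n / 4; i *= 2)
        apply_wn_pn(x, n / i, i);
}
\end{lstlisting}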
\subsection{Costs}
Compared to the FFT this is a big improvement in terms of scalability.
\subsection{Higher dimensional wavelet transform}
Our final goal is to apply the wavelet transform to images. Of course we could simply put all the pixels of an image in a row and apply $W$. But if we do this, we do not use the spatial information of the image at all! In order to use the spatial information we have to apply $W$ in both directions. To be precise: we will apply $W$ to every row and then apply $W$ to all of the resulting columns (the lemma below states that it does not matter which direction we take first).
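As a rough sketch (reusing the one-dimensional \texttt{wavelet} routine sketched earlier and assuming the image is stored row-major), this row-then-column pass could look as follows; for clarity every column is gathered into a temporary buffer instead of being transformed with a stride.
\begin{lstlisting}
#include <stdlib.h>

void wavelet(double *x, int n);  /* the one-dimensional sketch from above */

/* Two-dimensional transform of an h x w image stored row-major in im. */
void wavelet2d(double *im, int w, int h)
{
    for (int r = 0; r < h; r++)                /* transform every row */
        wavelet(im + (long)r * w, w);

    double *col = malloc(h * sizeof(double));  /* then every column */
    for (int c = 0; c < w; c++) {
        for (int r = 0; r < h; r++) col[r] = im[(long)r * w + c];
        wavelet(col, h);
        for (int r = 0; r < h; r++) im[(long)r * w + c] = col[r];
    }
    free(col);
}
\end{lstlisting}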
\begin{notation}
Given an $n \times n$-matrix $F$ and an $m \times n$-matrix $X$ (which should be thought of as an image), applying $F$ to each row of $X$ individually is denoted by $F^{H} X$.
As indicated in section~\ref{sec:intro} we would like to use the wavelets for compression. We will do this as follows: the image is transformed to the wavelet domain and then we throw away the coefficients with a small absolute value. These values influence the image only subtly, so the image will still be recognizable. We will briefly analyse the compression rate. We will restrict ourselves to grey images.
\subsection{Compression}
In this subsection we will consider an image $X$ to be a real valued $n \times m$-matrix: $X = \{x_{i,j} \,|\, 0 \leq i < n,\, 0 \leq j < m\}$ (again $n$ and $m$ are powers of two). To determine the compression rate we have to make some assumptions on the sizes of certain types. We assume a real value to be 8 bytes and an index to be 4 bytes. This assumption restricts the size of the matrix ($0 \leq n < 2^{32}$ and $0 \leq m < 2^{32}$). Thus the original image $X$ has a size of $8mn$ bytes, as we can store the matrix as a contiguous array (we ignore the fact that we should also store the size). As discussed in section~\ref{sec:dau} we can apply the Daubechies wavelet transform in both directions, which results in a matrix $Y = \{y_{i,j} \,|\, 0 \leq i < n,\, 0 \leq j < m\}$ of the same size. The idea is to throw away all elements $y_{i,j}$ with $|y_{i,j}| \leq \tau$ for some threshold parameter $\tau$. The result is the compressed image:
\[ C = \{(y_{i,j}, i, j) \,|\, 0 \leq i < n,\, 0 \leq j < m,\, |y_{i,j}| > \tau\}. \]
Note that we now have to store the locations of the coefficients which we did not throw away. Assume that there are $c$ such elements. Then the compressed size is $(8+4+4)c = 16c$ bytes. So this compression becomes useful when $c \leq \frac{1}{2}nm$. This might seem like a big loss in quality, but as we can see in figure~\ref{fig:compression}, the wavelets are capable of capturing the important shapes and edges of the image, even if we throw away a lot. As the wavelet transform is invertible, we can decompress $C$ to obtain an approximation of $X$.
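For illustration, the thresholding step itself is a simple scan (a sketch only; the field sizes match the assumptions above and \texttt{out} is assumed to have room for all $nm$ entries):
\begin{lstlisting}
#include <stdlib.h>
#include <stdint.h>
#include <math.h>

/* One kept coefficient: 8 + 4 + 4 = 16 bytes, as in the size analysis. */
typedef struct { double value; uint32_t i, j; } coeff;

/* Keep only the coefficients of the n x m row-major matrix Y with
   |y| > tau; returns the number c of kept coefficients. */
size_t compress(const double *Y, uint32_t n, uint32_t m, double tau, coeff *out)
{
    size_t c = 0;
    for (uint32_t i = 0; i < n; i++)
        for (uint32_t j = 0; j < m; j++) {
            double y = Y[(size_t)i * m + j];
            if (fabs(y) > tau)
                out[c++] = (coeff){ y, i, j };
        }
    return c;  /* compressed size: 16*c bytes, useful when c <= n*m/2 */
}
\end{lstlisting}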
\begin{figure}
\begin{subfigure}[b]{0.25\textwidth}
\label{fig:compression}
\end{figure}
\subsection{Quantization}
We made a big assumption in the previous subsection which does not hold at all in practice: namely, that an image $X$ is a \emph{real valued} $n \times m$-matrix. Most images are quantized to take values in $[0, 255]$, i.e. images consist of pixels of a single byte. This means that the size of an image $X$ is simply $nm$ bytes. Our algorithm only works for real valued data, so we can convert these bytes to reals and perform our algorithm to obtain $Y$. The values of $X$ are nicely distributed in $[0, 255]$, whereas $Y$ has a totally different distribution: most of it is concentrated around 0. To have a decent compression rate, we cannot store the real values (which take 8 bytes each). So we would like to quantize the values which we do not throw away. For a suitable quantization $f: \R \to [0, 255]$ the compressed image is now:
\[ C = \{(f(y_{i,j}), i, j) \,|\, 0 \leq i < n,\, 0 \leq j < m,\, |y_{i,j}| > \tau\}, \]
with a size of $9c$ bytes instead of $mn$.
The only thing left is to specify the quantization function $f$. We want it to have three properties. First of all we would like it to be invertible in order to decompress the image. Of course this is impossible, as $f$ maps a whole interval onto only 256 values, but we can ask for a function $g$ such that $|y - g(f(y))|$ is small. Secondly, we want $f$ to make the distribution of $Y$ more uniform. And finally we do not want $f$ to waste its range on the values we throw away anyway. In figure~\ref{fig:quantization} three such functions are considered; for the exact details we refer to the implementation. Note that a linear quantization performs very badly.
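Purely as an illustration of what such a pair $f, g$ could look like (the functions actually used may differ; see the implementation), a logarithmic quantization that spends its 256 levels only on $\tau < |y| \leq y_{\max}$ and stores the sign in the top bit is sketched below.
\begin{lstlisting}
#include <math.h>

/* Hypothetical logarithmic quantization f: the magnitude in (tau, ymax]
   is mapped logarithmically onto the levels 0..127, because the wavelet
   coefficients are concentrated around zero; the sign goes into the top bit. */
unsigned char quantize(double y, double tau, double ymax)
{
    double t = log(fabs(y) / tau) / log(ymax / tau);  /* in (0, 1] */
    int level = (int)(t * 127.0 + 0.5);
    if (level > 127) level = 127;
    return (unsigned char)(y < 0 ? level : level + 128);
}

/* Approximate inverse g, chosen so that |y - g(f(y))| stays small. */
double dequantize(unsigned char q, double tau, double ymax)
{
    int negative = q < 128;
    double t = (negative ? q : q - 128) / 127.0;
    double magnitude = tau * pow(ymax / tau, t);
    return negative ? -magnitude : magnitude;
}
\end{lstlisting}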
\begin{figure}
\begin{subfigure}[b]{0.25\textwidth}
\subsection{Parallel implementation}
We have already seen how to perform a two-dimensional wavelet transform in section~\ref{sec:dau}. By parallelizing the one-dimensional case we get a two-dimensional one for free. However, we can do a bit better. There was a slight problem in the parallelization, namely that there was a dedicated processor which had to end the algorithm. With more data at hand we can now do this last step in parallel too. Let $X$ be an image of width $w$ and height $h$ and let $p$ be the number of processors. We divide the image into blocks of size $\frac{w}{p}\times\frac{h}{p}$. We will assign processors to these blocks as shown in figure~\ref{fig:par2d}. When doing the wavelet transform horizontally we will let processor 0 finish the first $\frac{w}{p}$ rows, processor 1 the second $\frac{w}{p}$ rows, and so on; similarly for the columns. This assignment ensures that all processors will end the algorithm for some rows and columns, hence no processor will be idling.
\begin{figure}
\centering
The sequential cost of the algorithm is $28wh$ flops, as we have to perform the wavelet transform in both directions.
where the $(-)^H$ and $(-)^V$ decorations indicate the \emph{horizontal} and \emph{vertical} constants (recall that the constant $D_m$ depended on $n$). If $n=w=h$ and $m=1$, we can simplify this to:
\[ T = 28\frac{n^2}{p} + 28n + 2\logt\left(\frac{n}{p}\right)\left(2\frac{n}{p}g + l\right) + 4(p-1)\frac{n}{p}g + 4l \text{ flops}, \]
compared to the $28n^2$ flops of the sequential algorithm.
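For reference, this prediction is easy to evaluate for concrete machine parameters; a small helper such as the following (an illustration, not part of the measured implementation) suffices.
\begin{lstlisting}
#include <math.h>

/* Predicted BSP cost (in flops) of the parallel two-dimensional transform
   for the special case n = w = h and m = 1, i.e. the formula above. */
double predicted_cost(double n, double p, double g, double l)
{
    return 28.0*n*n/p + 28.0*n
         + 2.0*log2(n/p) * (2.0*(n/p)*g + l)
         + 4.0*(p - 1.0)*(n/p)*g + 4.0*l;
}
\end{lstlisting}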
We start this paper by motivating the need for wavelets. As a starting point of signal processing we first consider the well known Fourier transform. As this section is mainly meant as motivation, we will not be very precise or give concrete algorithms.
\subsection{Recalling the Fourier transform}
Recall the Fourier transform (of length 128): given an input signal $x = \sum_{i=0}^{127} x_i e_i$ (represented on the standard basis $\{e_i\}_i$) we can compute Fourier coefficients $x'_i$ such that $x = \sum_{i=0}^{127} x'_i f_i$. As we are not interested in the mathematics behind this transform, we will not specify the basis $\{f_i\}_i$. Conceptually the Fourier transform is a basis transformation:
$$ SampleDomain \to FourierDomain. $$
At the heart of the Fourier transform is the choice of the basis elements $f_i$.
Another important difference is the way these basis elements can represent signals. With the Fourier basis elements we can easily approximate smooth signals, but with the Haar basis elements this is much harder. However, representing a piecewise constant signal is almost trivial with the Haar wavelets. In photography the latter is preferred, as edges are very common (think of the black branches of a tree against a clear sky). So depending on the application this \emph{non-smoothness} is either good or bad.
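For example, the step signal $(1,1,1,1,0,0,0,0)$ is exactly half the constant signal plus half the coarsest Haar wavelet (which is $+1$ on the first half and $-1$ on the second half), so only two of its eight Haar coefficients are nonzero, whereas its discrete Fourier expansion has a nonzero coefficient at the zero frequency and at every odd frequency.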
We already assumed the input size $n$ to be a power of two.
The BSP cost model as defined in \cite{biss} depends on three variables $r, g, l$, where $r$ is the overall speed (in flops/s), $g$ is the number of flops it costs to communicate one word and $l$ the number of flops it costs to synchronize. For the analysis the variable $r$ is not important.
\subsection{Many communication steps}
The data $\vec{x} = x_0, \ldots, x_{n-1}$ is distributed among the processors with a block distribution, so processor $\proc{s}$ has the elements $x_{sb}, \ldots, x_{sb+b-1}$, where $b = n/p$. The first step of the algorithm consists of computing $\vec{x}^{(1)} = S_n W_n P_n \vec{x}$. We can already locally compute the first $b-2$ elements $x^{(1)}_{sb}, \ldots, x^{(1)}_{sb+b-3}$. For the remaining two elements $x^{(1)}_{sb+b-2}$ and $x^{(1)}_{sb+b-1}$ we need the first two elements on processor $s+1$. In the subsequent steps a similar reasoning holds, so we derive a stub for the algorithm:
\begin{lstlisting}
for i=1 to b/2
We stop after $i=\frac{b}{2}=\frac{n}{2p}$ because for $i=b$ we would need elements from three different processors.
\begin{lstlisting}
// fan in
put x_{sb} in processor 0 in array y
if s = 0
    wavelet(y, p)
x_{sb} <- get y_{2s} from processor 0
\end{lstlisting}
Let us analyse the cost of the first part of the algorithm. There are $\logt{b}$ steps, where in each step two elements are sent and two elements are received, which amounts to a $2$-relation; furthermore $b/i$ elements are computed. So the communication part costs $\logt{b}\times(2g+l)$ flops and the computational part in total $14b$ flops (see section~\ref{sec:dau}). The final part consists of two $(p-1)$-relations and a computation of $14p$ flops. So in total we have:
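\[ T \approx 14b + \logt(b)(2g + l) + 14p + 2(p-1)g + 2l \text{ flops}, \]
which is simply the sum of the contributions listed above.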
Depending on $l$ it might be very costly to have so many synchronizations. We can reduce their number by requesting all the data of the next processor and calculating all values redundantly. Let us investigate when this is fruitful. The algorithm is as follows:
\begin{lstlisting}
for j=1 to b-1
We did indeed get rid of some $l$s, but we have to compute a lot more and send a lot more data.
\subsection{An in between algorithm}
We can give an algorithm which combines the two methods. Instead of calculating everything redundantly, we can do so on smaller pieces, yet avoid communication in every step. Before we give an exact algorithm we should consider how much data we need exactly for computing the elements residing on each processor. The plan is to do $m$ steps of the wavelet transform with a single communication. As all elements come in pairs (i.e. if we have the data to calculate $x^{(1)}_0$, we can also calculate $x^{(1)}_1$), we will only consider the first element of each pair.
To calculate $x^{(m)}_0$ we need $N_m = 3\times2^m-2$ elements (and we get $x^{(m)}_{2^{m-1}}$ for free). So in order to calculate all elements $x^{(m)}_{sb}, \ldots, x^{(m)}_{sb+b-1}$ on one processor we need $D_m = b - 2^m + N_m = b + 2^{m+1}-2$ elements to start with ($b-2^m$ is the last index we are computing at stage $m$ and the index $b - 2^{m-1}$ comes for free). So in order to compute $m$ steps with a single communication, we need to get $C_m = D_m - b = 2^{m+1}-2$ elements from the next processor. The algorithm for a fixed $m$ becomes (given in pseudocode):
apply_wn on x and the data we received
\end{lstlisting}
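To give a feeling for the sizes involved (a made-up example, not one of the measured configurations): with a block size of $b = 1024$ and $m = 3$ we get $N_3 = 3 \cdot 2^3 - 2 = 22$, $C_3 = 2^4 - 2 = 14$ and $D_3 = 1024 + 14 = 1038$, so each processor fetches only $14$ elements from its right neighbour and can then perform three wavelet steps without any further communication.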
Again we end in the same style as above by letting a dedicated processor do a wavelet transform of length $p$. Note that for $m=1$ we get our first algorithm back and for $m = \logt(b)-1$ we get more or less our second algorithm back. An upper bound for the cost is given by:
In order to pick the right $m$ we should know the parameters of the machine. These running times are compared in section~\ref{sec:res}, where the theoretical bounds are plotted together with real measurements.
In this paper we will derive a parallel algorithm to perform a Daubechies wavelet transform of order four (DAU4). To conceptualize this transform we will first look into the Fourier transform, both to motivate why we want such a transform at all and to point out one of the shortcomings of the Fourier transform. After this introduction we will describe the Daubechies wavelet transform mathematically. This description will give us a simple sequential algorithm. By looking at which data is needed in which step of the algorithm, we can give a parallel algorithm. As an application we will look into image compression using this wavelet transform. The efficiency of the parallel algorithm is investigated and shows that it scales well with more processors.
\end{abstract}
\maketitle
The first step towards measuring the gain of parallelization is to make sure the implementation is correct and to have a sequential baseline. Two sequential implementations, together with an implementation of the inverse, were made to assure correctness. Then the parallel algorithm was implemented. It gives exactly the same outcome as the sequential version and is hence considered correct.
We analysed the theoretical BSP cost, but this does not guarantee much about the running time of the real program. By also estimating the BSP variables $r, g$ and $l$ we can see how well the theoretical analysis matches the practical running time. To estimate these variables we used the general benchmarking tool \texttt{bench} from the BSP Edupack\footnote{See \url{http://www.staff.science.uu.nl/~bisse101/Software/software.html}}. The estimated BSP variables are listed in table~\ref{tab:variables}.
There are two machines on which the computation was run. First of all, a MacBook Pro (MBP 13'' early 2011) with two physical cores. Due to hyperthreading it actually has four \emph{virtual} cores, but for a pure computation (where the pipeline should always be filled) we cannot expect a speedup of more than two. Secondly, the supercomputer Cartesius (with many more cores).
When we measure time, we only measure the time of the actual algorithm. So we ignore start-up time, allocation time and the initial data distribution. Time is measured with the \texttt{bsp\_time()} primitive, which is a wall clock. For a better measurement we iterated the algorithm at least 100 times and divided the total time by the number of iterations.
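The measurement loop is essentially the following (a sketch; \texttt{NITERS} is a placeholder for the iteration count and \texttt{wavelet} stands for whichever variant is being timed):
\begin{lstlisting}
#include <bsp.h>      /* BSPlib primitives, as used by the BSP Edupack */

#define NITERS 100    /* at least 100 iterations, as described above */

void wavelet(double *x, int n);  /* the algorithm under test */

/* Average wall-clock time per iteration; start-up, allocation and the
   initial data distribution have already happened at this point. */
double time_wavelet(double *x, int n)
{
    double t0 = bsp_time();
    for (int it = 0; it < NITERS; it++)
        wavelet(x, n);
    return (bsp_time() - t0) / NITERS;
}
\end{lstlisting}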
\subsection{Results}
In this subsection we will plot the actual running time of the algorithm. We will take $n$ as a variable to see how the parallel algorithm scales. As we only allow powers of two for $n$, we will often plot on a $\log$-$\log$ scale. In all cases we took $n=2^6$ as a minimum and $n=2^{27}$ as a maximum.
Figure~\ref{fig:basicplot} shows the running time for two test cases with varying $n$. Note that the actual running time closely matches the shape of the theoretical prediction. This assures us that the BSP cost model is sufficient to predict the impact of parallelization. However, the theoretical prediction is too optimistic.
\tikzstyle{measured}=[mark=+]
\tikzstyle{predicted}=[very thin]
\label{fig:basicplot}
\end{figure}
In figure~\ref{fig:gain} we plotted the speedup of the parallel program. We let $n$ vary and divide the sequential run time by the parallel run time. Values above 1 indicate that the parallel version outperformed the sequential one. This test was done on the MBP; note that at and beyond $2^{18}$ the speedup decreases. If we look closer at figure~\ref{fig:basicplot} we see that \emph{both} the sequential and the parallel version decrease in efficiency there; it is suspected that this is a cache issue.
\caption{The speedup of the parallel algorithm on a MBP with $p =2$ and $m =1$. The thin line is the analysed cost.}
\label{fig:speedup}
\end{figure}
In figure~\ref{fig:speedup} we plotted the speedup of the parallel program. Here we divided the sequential run time by the parallel run time and let $p$ vary. We see that it scales very well up to $p =16$.
In all the results on Cartesius we used $m=5$ without any argument or proof of optimality. Figure~\ref{fig:optm} shows the running time for various $m$ and we see that indeed $m=5$ gives a local minimum, both in the measurements and in the theoretical cost model.