jcmp: My image compression format (w/ wavelets)

\section{Results}
\label{sec:res}
\subsection{Methodology}
The first step in measuring the gain of parallelization is to make sure the implementation is correct and to have a sequential baseline. To ensure correctness, we wrote two sequential implementations together with an implementation of the inverse. Then the parallel algorithm was implemented. It gives exactly the same outcome as the sequential version and is hence considered correct.
We analysed the theoretical BSP cost, but this alone guarantees little about the behaviour of the real program. By also estimating the BSP parameters $r$, $g$ and $l$ we can see how well the theoretical analysis matches the practical running time. To estimate these parameters we used the general benchmarking tool \texttt{bench} from the BSP Edupack\footnote{See \url{http://www.staff.science.uu.nl/~bisse101/Software/software.html}}. The estimated BSP parameters are listed in table~\ref{tab:variables}.
The computation was run on two machines. The first is a MacBook Pro (MBP 13'' early 2011) with two physical cores. Due to hyperthreading it actually has four \emph{virtual} cores, but for a pure computation (where the pipeline should always be filled) we cannot expect a speedup of more than two. The second is the supercomputer Cartesius, which has many more cores.
\begin{table}
\begin{tabular}{c|r|r|r|r}
& \thead{MBP} & \thead{MBP} & \thead{Cartesius} & \thead{Cartesius} \\
\hline
p & 2 & 4 & 4 & 16 \\
r & 5993 & 2849 & 6771 & 6771 \\
g & 284 & 248 & 219 & 340 \\
l & 1300 & 2161 & 46455 & 162761\\
\end{tabular}
\caption{The estimated BSP parameters for the two machines, for different numbers of processors $p$.}
\label{tab:variables}
\end{table}
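As a rough guide to reading table~\ref{tab:variables}, assume that $r$ is given in Mflop/s and that $g$ and $l$ are expressed in flop units; this is the usual convention for such benchmarks, but it is an assumption here. A BSP cost of $a + b\,g + c\,l$ flops (with $a$, $b$ and $c$ placeholder coefficients) then corresponds to a wall-clock time of roughly
\[
T \approx \frac{a + b\,g + c\,l}{r \cdot 10^{6}}\ \mathrm{s}.
\]
For example, under this assumption a single synchronisation ($a = b = 0$, $c = 1$) on Cartesius with $p = 4$ would take about $46455 / (6771 \cdot 10^{6})\,\mathrm{s} \approx 6.9\,\mu\mathrm{s}$, against about $1300 / (5993 \cdot 10^{6})\,\mathrm{s} \approx 0.22\,\mu\mathrm{s}$ on the MBP with $p = 2$.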
When we measure time, we only measure the time of the actual algorithm, so start-up time, allocation time and the initial data distribution are excluded. Time is measured with the \texttt{bsp\_time()} primitive, which is a wall clock. For a more reliable measurement we iterated the algorithm at least 100 times and divided the total time by the number of iterations.
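In symbols, if $t_0$ and $t_1$ denote the values returned by \texttt{bsp\_time()} just before and just after the loop, and $k \geq 100$ is the number of iterations (the names $t_0$, $t_1$ and $k$ are introduced here only for illustration), the reported time per run is
\[
\bar{T} = \frac{t_1 - t_0}{k}.
\]
Averaging in this way spreads the resolution of the clock over $k$ runs, at the price of hiding run-to-run variation.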
\subsection{Results}
In this subsection we plot the actual running time of the algorithm. We take $n$ as a variable to see how the parallel algorithm scales. As we only allow powers of two for $n$, we will often plot on a log-log scale. In all cases we took $n=2^6$ as a minimum and $n=2^{27}$ as a maximum.
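A log-log scale also makes growth rates easy to read off. As a sketch, suppose the running time behaves like a power law $T(n) \approx c\,n^{\alpha}$ for some constants $c$ and $\alpha$ (an assumption for illustration, not a measured fit). Taking logarithms gives
\[
\log T(n) \approx \log c + \alpha \log n,
\]
so such a running time appears as a straight line with slope $\alpha$, and a constant-factor speedup appears as a vertical shift between two parallel lines.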
Figure~\ref{fig:basicplot} shows the running time for two test cases with varying $n$. The measured running time closely follows the shape of the theoretical prediction, which suggests that the BSP cost model is adequate for predicting the impact of parallelization. However, the theoretical prediction is too optimistic.
\tikzstyle{measured}=[mark=+]
\tikzstyle{predicted}=[very thin]
\tikzstyle{sequential}=[color=red]
\tikzstyle{parallel}=[color=blue]
\begin{figure}
\centering
\begin{subfigure}[b]{0.5\textwidth}
\begin{tikzpicture}
\begin{loglogaxis}[xlabel={$n$}, ylabel={Time (s)}, width=\textwidth]
\addplot[predicted, sequential] table[x=n, y=SeqP] {results/mbp_p2_m1_basic};
\addplot[predicted, parallel] table[x=n, y=ParP] {results/mbp_p2_m1_basic};
\addplot[measured, sequential] table[x=n, y=Seq] {results/mbp_p2_m1_basic}; \addlegendentry{sequential}
\addplot[measured, parallel] table[x=n, y=Par] {results/mbp_p2_m1_basic}; \addlegendentry{parallel}
\end{loglogaxis}
\end{tikzpicture}
\caption{MBP with $p=2$ and $m=1$}
\end{subfigure}~
\begin{subfigure}[b]{0.5\textwidth}
\begin{tikzpicture}
\begin{loglogaxis}[xlabel={$n$}, width=\textwidth]
\addplot[predicted, sequential] table[x=n, y=SeqP] {results/cart_p4_m5_basic};
\addplot[predicted, parallel] table[x=n, y=ParP] {results/cart_p4_m5_basic};
\addplot[measured, sequential] table[x=n, y=Seq] {results/cart_p4_m5_basic}; \addlegendentry{sequential}
\addplot[measured, parallel] table[x=n, y=Par] {results/cart_p4_m5_basic}; \addlegendentry{parallel}
\end{loglogaxis}
\end{tikzpicture}
\caption{Cartesius with $p=4$ and $m=5$}
\end{subfigure}
\caption{Running time versus number of elements $n$. The thin line shows the theoretical prediction.}
\label{fig:basicplot}
\end{figure}
In figure~\ref{fig:gain} we plot the speedup of the parallel program on the MBP. We let $n$ vary and divide the sequential running time by the parallel running time; values above 1 indicate that the parallel version outperformed the sequential one. Note that at and beyond $n = 2^{18}$ the speedup decreases. A closer look at figure~\ref{fig:basicplot} shows that \emph{both} the sequential and the parallel version become less efficient there; we suspect that this is a cache effect.
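In symbols, writing $T_{\mathrm{seq}}(n)$ and $T_{\mathrm{par}}(n)$ for the measured sequential and parallel running times (notation introduced here for illustration), the plotted quantity is
\[
S(n) = \frac{T_{\mathrm{seq}}(n)}{T_{\mathrm{par}}(n)}, \qquad E(n) = \frac{S(n)}{p},
\]
where $E(n)$ is the corresponding parallel efficiency. With $p = 2$ the ideal values are $S(n) = 2$ and $E(n) = 1$.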
\begin{figure}
\centering
\begin{tikzpicture}
\begin{axis}[xmode=log, xlabel={$n$}, ylabel={Speedup}, width=0.6\textwidth, height=5cm]
\addplot[measured, very thick] table[x=n, y=Gain] {results/mbp_p2_m1_gain};
\addplot[predicted] table[x=n, y=GainP] {results/mbp_p2_m1_gain};
\end{axis}
\end{tikzpicture}
\caption{The speedup of the parallel algorithm on an MBP with $p = 2$ and $m = 1$. The thin line is the analysed cost.}
\label{fig:gain}
\end{figure}
In figure~\ref{fig:speedup} we plot the speedup of the parallel program on Cartesius. Here we again divide the sequential running time by the parallel running time, but now let $p$ vary. We see that it scales very well up to $p = 16$.
In all the results on Cartesius we used $m=5$ without any argument or proof of optimality. Figure~\ref{fig:optm} shows the running time for various $m$, and we see that $m=5$ indeed gives a local minimum, both in the measurements and in the theoretical cost model.
\begin{figure}
\centering
\begin{tikzpicture}
\begin{axis}[xlabel={$p$}, ylabel={Speedup}, width=0.6\textwidth, height=5cm, ymax = 12]
\addplot[measured, very thick] table[x=p, y=Speedup] {results/cart_m5_speedup};
\addplot[predicted] table[x=p, y=p] {results/cart_m5_speedup};
\end{axis}
\end{tikzpicture}
\caption{The speedup of the parallel algorithm on Cartesius. $n = 1048576$ and $m = 5$. The thin line is the theoretical maximum speedup.}
\label{fig:speedup}
\end{figure}
\begin{figure}
\centering
\begin{tikzpicture}
\begin{axis}[xlabel={$m$}, ylabel={Time (s)}, width=0.6\textwidth, height=5cm]
\addplot[measured, very thick] table[x=m, y=Time] {results/cart_p4_optm};
\addplot[predicted] table[x=m, y=Pred] {results/cart_p4_optm};
\end{axis}
\end{tikzpicture}
\caption{The running time on Cartesius of the parallel algorithm for $n = 32768$, $p = 4$ and varying $m$. The thin line is the prediction.}
\label{fig:optm}
\end{figure}