
[Report] Started to generate some results

master
Joshua Moerman 11 years ago
parent
commit
20b9b15f7a
  1. filter.sh (3 lines changed)
  2. wavelet_report/preamble.tex (10 lines changed)
  3. wavelet_report/report.tex (3 lines changed)
  4. wavelet_report/res.tex (67 lines changed)
  5. wavelet_report/results/cart_p4_m1_basic (24 lines changed)
  6. wavelet_report/results/mbp_p2_m1_basic (23 lines changed)
  7. wavelet_speed.sh (24 lines changed)

filter.sh (3 lines changed)

@@ -0,0 +1,3 @@
#!/bin/bash
cat "$@" | grep "seq\|par" | sed 's/[a-zA-Z]//g' | sed 's/[[:blank:]]//g' | sed 'N;s/\n/ /'
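The pipeline above appears intended to extract the numeric seq/par timings from raw benchmark logs and join each sequential/parallel pair onto one line. A quick sanity check on made-up input (the sample log lines are hypothetical):

```shell
# Keep only lines mentioning "seq" or "par", strip letters and blanks,
# then join each seq/par pair of lines onto a single line.
printf 'seq time 0.5\npar time 0.3\n' \
  | grep "seq\|par" \
  | sed 's/[a-zA-Z]//g' \
  | sed 's/[[:blank:]]//g' \
  | sed 'N;s/\n/ /'
# → 0.5 0.3
```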

wavelet_report/preamble.tex (10 lines changed)

@@ -5,6 +5,14 @@
% floating figures
\usepackage{float}
\usepackage{tikz}
\usepackage{pgfplots}
\pgfplotsset{compat=newest}
\usepackage{graphicx}
\usepackage{caption}
\usepackage{subcaption}
% Matrices have a upper bound for its size
\setcounter{MaxMatrixCols}{20}
@@ -24,3 +32,5 @@
\theoremstyle{plain}
\newtheorem{theorem}{Theorem}[section]
\newtheorem{lemma}[theorem]{Lemma}
\newcommand*{\thead}[1]{\multicolumn{1}{c}{\bfseries #1}}

wavelet_report/report.tex (3 lines changed)

@@ -6,7 +6,7 @@
\title{Parallel wavelet transform}
\author{Joshua Moerman}
-\includeonly{dau}
+%\includeonly{dau}
\begin{document}
@@ -20,6 +20,7 @@ In this paper we will derive a parallel algorithm to perform a Daubechies wavele
\include{intro}
\include{dau}
\include{par}
\include{res}
\nocite{*}

wavelet_report/res.tex (67 lines changed)

@@ -0,0 +1,67 @@
\section{Results}
\label{sec:res}
\subsection{Methodology}
The first step toward measuring the gain of parallelization is to make sure the implementation is correct and to have a sequential baseline. The first implementation was a very naive, sequential one, which did a lot of data copying and shuffling. Then a more performant sequential implementation was made, which produced the same output as the naive one. By applying the inverse transform we verified that the implementation is correct. Only then was the parallel version made. It gives exactly the same output as the sequential version and is hence considered correct.
We analysed the theoretical BSP cost, but this alone does not tell us much about the running time of the real program. By also estimating the BSP parameters $r, g$ and $l$ we can see how well the theoretical analysis matches the practical running time. To estimate these parameters we used the general benchmarking tool \texttt{bench} from the BSP Edupack\footnote{See \url{http://www.staff.science.uu.nl/~bisse101/Software/software.html}}.
The computation was run on two machines. First, a MacBook Pro (MBP, 13'' early 2011) with two physical cores. Due to hyperthreading it presents four \emph{virtual} cores, but for a pure computation (where the pipeline should always be filled) we cannot expect a speedup of more than two. Secondly, the supercomputer Cartesius (with many more cores). We should note that the MBP has shared memory, which the BSP model does not exploit at all. The estimated BSP variables are listed in table~\ref{tab:variables}.
\begin{table}
\begin{tabular}{c|r|r|r|r}
& \thead{MBP} & \thead{MBP} & \thead{Cartesius} & \thead{Cartesius} \\
\hline
p & 2 & 4 & 4 & 16 \\
r & 5993 & 2849 & 6771 & 6771 \\
g & 284 & 248 & 219 & 340 \\
l & 1300 & 2161 & 46455 & 162761\\
\end{tabular}
\caption{The estimated BSP variables for the two machines, estimated separately for each number of processors $p$.}
\label{tab:variables}
\end{table}
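For reference, the parameters in the table price a superstep via the standard BSP cost formula (textbook BSP background, not stated elsewhere in this commit): with $w$ the maximum local work, $h$ the maximum number of words sent or received, and $g$, $l$ as above, in flop units:

```latex
% Standard BSP cost of one superstep (flop units):
% w = max local computation, h = h-relation, g = gap, l = sync cost
T_{\text{superstep}} = w + h \cdot g + l
```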
When we measure time, we only measure the time of the actual algorithm; start-up time, allocation time and the initial data distribution are ignored. Time is measured with the \texttt{bsp\_time()} primitive, which is a wall clock. For a more reliable measurement we iterated the algorithm at least 100 times and divided the total time by the number of iterations.
\subsection{Results}
In this subsection we plot the actual running time of the algorithm. We take $n$ as a variable to see how the parallel algorithm scales. As we only allow powers of two for $n$, we often plot in a log--log fashion. In all cases we took $n=2^6$ as a minimum and $n=2^{27}$ as a maximum. Unless stated otherwise we use blue for the parallel running time and red for the sequential running time. The thin lines show the theoretical prediction, for which we used the variables in table~\ref{tab:variables}.
In figure~\ref{fig:basicplot} the running time is plotted for the case $m=1$. There are multiple things to note. First of all, the actual running time closely matches the shape of the theoretical prediction. This assures us that the BSP cost model is adequate for predicting the impact of parallelization. On both machines there is a point from which on the parallel algorithm is, and stays, faster. However, on the MBP both the sequential and the parallel algorithm show a bump around $n=10^6$.
\tikzstyle{measured}=[mark=+]
\tikzstyle{predicted}=[very thin, dashed]
\tikzstyle{sequential}=[color=red]
\tikzstyle{parallel}=[color=blue]
\begin{figure}
\centering
\begin{subfigure}[b]{0.5\textwidth}
\begin{tikzpicture}
\begin{loglogaxis}[xlabel={$n$}, ylabel={Time (s)}, width=\textwidth]
\addplot[predicted, sequential] table[x=n, y=SeqP] {results/mbp_p2_m1_basic};
\addplot[predicted, parallel] table[x=n, y=ParP] {results/mbp_p2_m1_basic};
\addplot[measured, sequential] table[x=n, y=Seq] {results/mbp_p2_m1_basic}; \addlegendentry{sequential}
\addplot[measured, parallel] table[x=n, y=Par] {results/mbp_p2_m1_basic}; \addlegendentry{parallel}
\end{loglogaxis}
\end{tikzpicture}
\caption{Running time on an MBP with $p=2$}
\end{subfigure}~
\begin{subfigure}[b]{0.5\textwidth}
\begin{tikzpicture}
\begin{loglogaxis}[xlabel={$n$}, width=\textwidth]
\addplot[predicted, sequential] table[x=n, y=SeqP] {results/cart_p4_m1_basic};
\addplot[predicted, parallel] table[x=n, y=ParP] {results/cart_p4_m1_basic};
\addplot[measured, sequential] table[x=n, y=Seq] {results/cart_p4_m1_basic}; \addlegendentry{sequential}
\addplot[measured, parallel] table[x=n, y=Par] {results/cart_p4_m1_basic}; \addlegendentry{parallel}
\end{loglogaxis}
\end{tikzpicture}
\caption{Running time on Cartesius with $p=4$}
\end{subfigure}
\caption{Running time vs. number of elements $n$. The thin line shows the theoretical prediction.}
\label{fig:basicplot}
\end{figure}

wavelet_report/results/cart_p4_m1_basic (24 lines changed)

@@ -0,0 +1,24 @@
n Seq Par SeqP ParP
64 0.00000027 0.00002848 0.00000013232905 0.000024750184611
128 0.00000054 0.00002891 0.000000264658101 0.000028310736966
256 0.00000105 0.00003613 0.000000529316201 0.000031904371585
512 0.0000025 0.0000375 0.000001058632403 0.000035564170728
1024 0.00000417 0.00004423 0.000002117264806 0.000039356298922
2048 0.00000829 0.0000438 0.000004234529612 0.000043413085216
4096 0.000017 0.00005336 0.000008469059223 0.000047999187712
8192 0.0000352 0.00005818 0.000016938118446 0.000053643922611
16384 0.00007036 0.00007659 0.000033876236893 0.000061405922316
32768 0.00014554 0.00009578 0.000067752473785 0.000073402451632
65536 0.00033355 0.00014556 0.000135504947571 0.000093868040171
131072 0.00066892 0.00022974 0.000271009895141 0.000131271747157
262144 0.00133726 0.00041677 0.000542019790282 0.000202551691035
524288 0.00271766 0.00075303 0.001084039580564 0.000341584108699
1048576 0.00540122 0.00143222 0.002168079161128 0.000616121473933
2097152 0.01085754 0.00280032 0.004336158322257 0.001161668734308
4194304 0.02789499 0.00554204 0.008672316644513 0.002249235784965
8388608 0.06382695 0.01450042 0.017344633289027 0.004420842416187
16777216 0.1277917 0.0350954 0.034689266578054 0.008760528208536
33554432 0.2550964 0.06986389 0.069378533156107 0.017436372323143
67108864 0.50946262 0.1389675 0.138757066312214 0.034784533082263
134217728 1.01779021 0.27657479 0.277514132624428 0.069477327130409
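Given the column layout above (n Seq Par SeqP ParP), the measured speedup Seq/Par per problem size can be extracted with a small awk one-liner (an illustrative helper, not part of this commit; shown here on two sample rows from the table):

```shell
# Print the measured speedup Seq/Par for each n; NR > 1 skips the header.
awk 'NR > 1 { printf "%s %.2f\n", $1, $2 / $3 }' <<'EOF'
n Seq Par SeqP ParP
64 0.00000027 0.00002848 0.00000013232905 0.000024750184611
134217728 1.01779021 0.27657479 0.277514132624428 0.069477327130409
EOF
# → 64 0.01
# → 134217728 3.68
```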

wavelet_report/results/mbp_p2_m1_basic (23 lines changed)

@@ -0,0 +1,23 @@
n Seq Par SeqP ParP
64 0.0000002 0.00000231 0.000000149507759 0.000001719339229
128 0.000000485 0.000003695 0.000000299015518 0.000002044718839
256 0.00000088 0.000003345 0.000000598031036 0.000002444852328
512 0.00000167 0.00000448 0.000001196062072 0.000002994493576
1024 0.000003305 0.00000637 0.000002392124145 0.000003843150342
2048 0.000006615 0.000010065 0.00000478424829 0.000005289838145
4096 0.000013705 0.000015575 0.000009568496579 0.000007932588019
8192 0.00002897 0.000020225 0.000019136993159 0.000012967462039
16384 0.00007088 0.00004938 0.000038273986317 0.000022786584348
32768 0.000146195 0.00008591 0.000076547972635 0.000042174203237
65536 0.000313405 0.000130035 0.000153095945269 0.000080698815284
131072 0.000534205 0.0004309 0.000306191890539 0.000157497413649
262144 0.001042505 0.000626695 0.000612383781078 0.000310843984649
524288 0.002813735 0.002353855 0.001224767562156 0.000617286500918
1048576 0.00747013 0.006598305 0.002449535124312 0.001229920907726
2097152 0.01472155 0.01315899 0.004899070248623 0.002454939095612
4194304 0.02943272 0.02627802 0.009798140497247 0.004904724845653
8388608 0.058599735 0.055608795 0.019596280994494 0.009804045720007
16777216 0.123395235 0.106270255 0.039192561988987 0.019602436842984
33554432 0.238386745 0.21387985 0.078385123977974 0.039198968463207
67108864 0.474814405 0.428788495 0.156770247955949 0.078391781077924
134217728 0.953750485 0.856867835 0.313540495911897 0.156777155681629

wavelet_speed.sh (24 lines changed)

@@ -0,0 +1,24 @@
#!/bin/bash
#SBATCH -t 0:30:00
#SBATCH -n 4

p=2
start=6
end=27
iters=200

if [[ `whoami` == "bissstud" ]]; then
	cd $HOME/Students13/JoshuaMoerman/assignments
	echo "Running on Cartesius $@"
	RUNCOMMAND="srun"
else
	echo "Running locally $@"
	RUNCOMMAND=""
fi

for i in `seq $start $end`; do
	echo -e "\n\033[1;34mtime\t`date`\033[0;39m"
	let "n=2**$i"
	$RUNCOMMAND ./build-Release/wavelet/wavelet_parallel_mockup --m 1 --n $n --p $p --show-input --iterations $iters
done

echo -e "\n\033[1;31mtime\t`date`\033[0;39m"