\project thesis
|
|
\startcomponent introduction
|
|
|
|
\startchapter
|
|
[title={Introduction},
|
|
reference=chap:introduction]
|
|
|
|
When I was younger, I often learned how to play with new toys by messing around with them, i.e., by pressing buttons at random, observing their behaviour, pressing more buttons, and so on.
|
|
I only resorted to the manual -- or asked \quotation{experts} -- to confirm my beliefs about how the toy worked.
|
|
Now that I am older, I do mostly the same with new devices, new tools, and new software.
|
|
However, now I know that this is an established computer science technique, called \emph{model learning}.
|
|
|
|
Model learning
|
|
\footnote{There are many names for this type of learning, such as \emph{active automata learning}.
|
|
The generic name \quotation{model learning} is chosen as a counterpoint to model checking.}
|
|
is an automated technique to construct a model -- often a type of \emph{automaton} -- from a black box system.
|
|
The goal of this technique can be manifold:
|
|
It can be used to reverse-engineer a system, to find bugs in it, to verify properties of the system, or to understand the system in one way or another.
|
|
It is \emph{not} just random testing: the information learned during the interaction with the system is actively used to guide following interactions.
|
|
Additionally, the information learned can be inspected and analysed.
|
|
In this sense, model learning opens up the black box.
|
|
|
|
This thesis is about model learning and related techniques.
|
|
In the first part, I present results concerning \emph{black box testing} of automata.
|
|
Testing is a crucial part in learning software behaviour and often remains a bottleneck in applications of model learning.
|
|
In the second part, I show how \emph{nominal techniques} can be used to learn automata over structured infinite alphabets.
|
|
This was directly motivated by work on learning network protocols which rely on identifiers or sequence numbers.
|
|
|
|
But before we get ahead of ourselves, we should first understand what we mean by learning, as learning means very different things to different people.
|
|
In educational science, learning may involve concepts such as teaching, blended learning, and interdisciplinarity.
|
|
Data scientists may think of data compression, feature extraction, and neural networks.
|
|
In this thesis we are mostly concerned with software verification.
|
|
But even in the field of verification several types of learning are relevant.
|
|
|
|
|
|
\startsection
|
|
[title={Model Learning}]
|
|
|
|
In the context of software verification, we often look at stateful computation with inputs and outputs.
|
|
For this reason, it makes sense to look at \emph{words}, or \emph{traces}.
|
|
For an alphabet $\Sigma$, we denote the set of words by $\Sigma^{*}$.
|
|
|
|
The learning problem is defined as follows.
|
|
There is some fixed, but unknown, language $\lang \subseteq \Sigma^{*}$.
|
|
This language may define the behaviour of a software component, a property in model checking, a set of traces from a protocol, etc.
|
|
We wish to infer a description of $\lang$ by only using finitely many words.
|
|
Note that this is not an easy task, as $\lang$ is often infinite.
|
|
|
|
Such a learning problem can be stated and solved in a variety of ways.
|
|
In the applications carried out in our research group, we often try to infer a model of a software component.
|
|
(\in{Chapter}[chap:applying-automata-learning] describes such an application.)
|
|
In such cases, a learning algorithm can interact with the software.
|
|
So it makes sense to study a learning paradigm which allows for \emph{queries}, and not just a data set of samples.
|
|
\footnote{Instead of query learning, people also use the term \emph{active learning}.}
|
|
|
|
A typical query learning framework was established by \citet[Angluin87].
|
|
In her framework, the learning algorithm may pose two types of queries to a \emph{teacher}, or \emph{oracle}:
|
|
|
|
\description{Membership queries (MQs)}
|
|
The learner poses such a query by providing a word $w \in \Sigma^{*}$ to the teacher.
|
|
The teacher will then reply whether $w \in \lang$ or not.
|
|
This type of query is often generalised to richer outputs, so the teacher replies with $\lang(w)$.
|
|
In some papers, such a query is then called an \emph{output query}.
|
|
|
|
\description{Equivalence queries (EQs)}
|
|
The learner can provide a hypothesised description of $\lang$ to the teacher.
|
|
If the hypothesis is correct, the teacher replies with \kw{yes}.
|
|
If, however, the hypothesis is incorrect, the teacher replies with \kw{no} together with a counterexample, i.e., a word which is in $\lang$ but not in the hypothesis or vice versa.
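
To make these two queries concrete, here is a minimal sketch of a teacher, written in Python (the names \kw{Teacher}, \kw{mq}, and \kw{eq} are illustrative; the bounded enumeration in \kw{eq} is only an approximation of a true equivalence query).

\starttyping
from itertools import product

class Teacher:
    """Illustrative teacher for a known language L over a finite alphabet.
    The equivalence query is approximated by trying all words up to a
    fixed length; a true equivalence query cannot enumerate Sigma^*."""

    def __init__(self, language, alphabet, max_len=8):
        self.language = language    # membership predicate for L
        self.alphabet = alphabet
        self.max_len = max_len

    def mq(self, word):
        # membership query: is the word in the language?
        return self.language(word)

    def eq(self, hypothesis):
        # equivalence query: None if correct, otherwise a counterexample
        for n in range(self.max_len + 1):
            for letters in product(self.alphabet, repeat=n):
                w = "".join(letters)
                if hypothesis(w) != self.language(w):
                    return w
        return None

# Example: L = words over {a, b} with an even number of a's.
teacher = Teacher(lambda w: w.count("a") % 2 == 0, "ab")
assert teacher.mq("abab")
assert teacher.eq(lambda w: True) == "a"
\stoptyping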
|
|
|
|
By posing many such queries, the learning algorithm is supposed to converge to a correct model.
|
|
This type of learning is hence called \emph{exact learning}.
|
|
\citet[Angluin87] showed that one can do this efficiently for deterministic finite automata (DFAs), that is, when $\lang$ is in the class of regular languages.
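
To give an impression of how such an algorithm works, below is a compact sketch of an \LStar{}-style learner. It uses the Maler-Pnueli variant (all suffixes of a counterexample are added as columns), which avoids the consistency check of the original algorithm; it is an illustration, not the exact algorithm of \citet[Angluin87].

\starttyping
def lstar(alphabet, mq, eq):
    """Sketch of an L*-style learner (Maler-Pnueli variant).
    mq(word) -> bool; eq(hypothesis) -> a counterexample word, or None."""
    S, E = [""], [""]    # access words (rows) and experiments (columns)
    table = {}           # observation table: word -> bool

    def fill():
        for s in S + [s + a for s in S for a in alphabet]:
            for e in E:
                if s + e not in table:
                    table[s + e] = mq(s + e)

    def row(s):
        return tuple(table[s + e] for e in E)

    while True:
        # close the table: every one-letter extension matches a row of S
        while True:
            fill()
            rows = {row(s) for s in S}
            new = next((s + a for s in S for a in alphabet
                        if row(s + a) not in rows), None)
            if new is None:
                break
            S.append(new)

        # the hypothesis DFA: states are the distinct rows of S
        state, exps = {row(s): s for s in S}, tuple(E)

        def hypothesis(word, state=state, exps=exps):
            s = ""
            for a in word:
                s = state[tuple(table[s + a + e] for e in exps)]
            return table[s]    # acceptance of the reached state

        cex = eq(hypothesis)
        if cex is None:
            return hypothesis
        # add all suffixes of the counterexample as new columns
        E.extend(cex[i:] for i in range(len(cex) + 1) if cex[i:] not in E)

# With the teacher above, this learns the even-number-of-a's DFA:
# learned = lstar("ab", teacher.mq, teacher.eq)
\stoptyping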
|
|
|
|
It should be clear why this is called \emph{query learning} or \emph{active learning}.
|
|
The learning algorithm initiates the interaction with the teacher by posing queries; it may construct its own data points and ask for their corresponding labels.
|
|
This is in contrast with \emph{passive learning}, where the algorithm only has access to a given data set of samples.
|
|
|
|
Another paradigm which is relevant for our type of applications is \emph{PAC-learning with membership queries}.
|
|
Here, the algorithm can again use MQs as before, but the EQs are replaced by random sampling.
|
|
So the allowed query is:
|
|
|
|
\description{Random sample queries (EXs)}
|
|
If the learner poses this query (there are no parameters), the teacher responds with a random word $w$, together with its label, i.e., whether $w \in \lang$ or not.
|
|
|
|
Instead of requiring that the learner exactly learns the model, we only require the following.
|
|
The learner should \emph{probably} return a model which \emph{approximates} the target.
|
|
This gives the name \emph{probably approximately correct} (PAC).
|
|
Note that there are two uncertainties: the probable and the approximate part.
|
|
Both parts are bounded by parameters, so one can control the confidence.
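
In symbols, the requirement reads as follows (a standard formulation; $H$ is the returned hypothesis, $D$ the unknown distribution from which \kw{EX} samples, $\epsilon$ the approximation parameter, and $\delta$ the confidence parameter): with probability at least $1 - \delta$, the learner must output a hypothesis $H$ such that

\startformula
P_{w \sim D} \left[ H(w) \neq \lang(w) \right] \leq \epsilon .
\stopformula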
|
|
|
|
As with many problems in computer science, we are also interested in the \emph{efficiency} of learning algorithms.
|
|
Instead of measuring time or space, we often analyse the number of queries posed by an algorithm.
|
|
Efficiency often means that we require a polynomial number of queries.
|
|
But polynomial in what?
|
|
The learner has no input, other than access to the teacher.
|
|
We require the algorithm to be polynomial in the \emph{size of the target} (i.e., the size of the description which has yet to be learned).
|
|
In the case of PAC learning we also require it to be polynomial in the two parameters for confidence.
|
|
|
|
Deterministic automata can be efficiently learned in the PAC model.
|
|
In fact, any efficient exact learning algorithm with MQs and EQs can be transformed into an efficient PAC algorithm with MQs (see \citenp[KearnsV94], exercise 8.1).
|
|
For this reason, we mostly focus on the former type of learning in this thesis.
|
|
The transformation from exact learning to PAC learning is implemented by simply \emph{testing} the hypothesis with random samples.
|
|
This can be postponed until we actually implement a learning algorithm and apply it.
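
A sketch of this transformation follows (the helper names are mine; the sample bound for the $i$-th equivalence query is one standard choice, growing with $i$ so that the total failure probability stays below $\delta$).

\starttyping
import math

def eq_via_sampling(hypothesis, mq, sample_word, i, epsilon, delta):
    """Approximate the i-th equivalence query (i = 1, 2, ...) by testing
    the hypothesis on random samples, labelled via membership queries."""
    m = math.ceil((math.log(1 / delta) + i * math.log(2)) / epsilon)
    for _ in range(m):
        w = sample_word()            # draw a word from the distribution D
        if hypothesis(w) != mq(w):   # disagreement: a counterexample
            return w
    return None                      # accept the hypothesis
\stoptyping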
|
|
|
|
When using only EQs, only MQs, or only EXs, there are hardness results for learning DFAs.
|
|
So the combinations MQs + EQs and MQs + EXs have been carefully picked.
|
|
They provide a minimal basis for efficient learning.
|
|
See the book of \citet[KearnsV94] for such hardness results and more information on PAC learning.
|
|
|
|
So far, all the queries are assumed to be \emph{just there}.
|
|
Somehow, these are existing procedures which we can invoke with \kw{MQ}$(w)$, \kw{EQ}$(H)$, or \kw{EX}$()$.
|
|
This is a useful abstraction when designing a learning algorithm.
|
|
One can analyse the complexity (in terms of number of queries) independently of how these queries are resolved.
|
|
Nevertheless, at some point in time one has to implement them.
|
|
In our case of learning software behaviour, membership queries are easily implemented:
|
|
Simply provide the word $w$ to a running instance of the software and observe the output.
|
|
\footnote{In reality, it is a bit harder than this.
|
|
There are plenty of challenges to solve, such as timing, choosing the alphabet, choosing the kind of observations to make, and being able to faithfully reset the software.}
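
As an illustration, a membership (output) query for a system reachable over a socket could look roughly as follows. The line-based protocol with a \kw{RESET} command is made up for this sketch; real systems each need their own adapter.

\starttyping
import socket

def make_output_query(host, port):
    """Hypothetical adapter: feed a word to a running system symbol by
    symbol and collect the outputs. The RESET command and the
    one-line-per-symbol protocol are assumptions of this sketch."""
    sock = socket.create_connection((host, port))
    replies = sock.makefile("r")

    def output_query(word):
        sock.sendall(b"RESET\n")    # bring the system to its initial state
        replies.readline()          # acknowledgement of the reset
        outputs = []
        for symbol in word:
            sock.sendall(symbol.encode() + b"\n")
            outputs.append(replies.readline().strip())
        return outputs              # the observed output word

    return output_query
\stoptyping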
|
|
Equivalence queries, however, cannot be implemented in general.
|
|
Even if we have the (machine) code, it is often far too complicated to check equivalence.
|
|
That is why we resort to testing with EX queries.
|
|
The EX query from PAC learning normally assumes a fixed, unknown distribution.
|
|
In our case, we implement a distribution to test against.
|
|
This cuts both ways:
|
|
On the one hand, it allows us to test only the behaviour we really care about; on the other hand, the results are only as good as our choice of distribution.
|
|
We deviate even further from the PAC-model as we sometimes change our distribution while learning.
|
|
Yet, as applications show, this is a useful way of learning software behaviour.
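
For example, a concrete \kw{sample\_word} for the sketch above could draw words with geometrically distributed lengths and uniformly random letters. This is just one possible choice, and exactly this choice determines which behaviour gets tested.

\starttyping
import random

def make_sampler(alphabet, p_stop=0.2):
    """A chosen test distribution: geometric length, uniform letters."""
    def sample_word():
        w = ""
        while random.random() > p_stop:   # expected length (1-p)/p
            w += random.choice(alphabet)
        return w
    return sample_word
\stoptyping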
|
|
|
|
|
|
|
|
\startsubsection
|
|
[title={Research challenges}]
|
|
|
|
Model learning is far from a solved problem.
|
|
In this thesis, we will mostly see learning of DFAs or Mealy machines.
|
|
Although this is restrictive, as many pieces of software require richer models, the technique has been successfully applied in many different areas.
|
|
The restrictions include the following.
|
|
\startitemize[after]
|
|
\item The system behaves deterministically.
|
|
\item One can reliably reset the system.
|
|
\item The system can be modelled with a finite state space.
|
|
\item The set of input actions is finite.
|
|
\item One knows when the target is reached.
|
|
\stopitemize
|
|
|
|
\startdescription[title={Research challenge 1: Confidence in the hypothesis.}]
|
|
Having confidence in a learned model is difficult.
|
|
We have PAC guarantees (as discussed before), but sometimes we may require other guarantees.
|
|
For example, we may require the hypothesis to be correct, provided that the real system is implemented with a certain number of states.
|
|
Efficiency is important here:
|
|
We want to obtain those guarantees quickly, and we want to find counterexamples quickly when the hypothesis is wrong.
|
|
Test generation methods are the topic of the first part of this thesis.
|
|
We will review existing algorithms and discuss new algorithms for test generation.
|
|
|
|
\stopdescription
\startdescription[title={Research challenge 2: Generalisation to infinite alphabets.}]
|
|
Automata over infinite alphabets are very useful for modelling protocols which involve identifiers or timestamps.
|
|
Not only is the alphabet infinite in these cases, but so is the state space, since the data values have to be remembered.
|
|
In the second part of this thesis, we will see how nominal techniques can be used to tackle this challenge.
|
|
|
|
Being able to learn automata over an infinite alphabet is not new.
|
|
It has been tackled in the theses of \cite[Aarts14, Fiterau-Brostean18, Cassel15].
|
|
In this thesis, we build on these ideas using nominal techniques.
|
|
\stopdescription
|
|
|
|
|
|
\stopsubsection
|
|
\startsubsection
|
|
[title={A few applications of learning}]
|
|
|
|
Since this thesis only contains one \quote{real-world} application of learning in \in{Chapter}[chap:applying-automata-learning], I think it is good to mention a few others.
|
|
Although we remain in the context of learning software behaviour, the applications are quite different.
|
|
|
|
Finite state machine conformance testing is a core topic in testing literature that is relevant for communication protocols and other reactive systems.
|
|
The combination of conformance testing and automata learning is applied in practice and it proves to be an effective way of finding bugs \cite[Joeri], replacing legacy software \cite[Philips], or inferring models to be used in model checking \cite[MeinkeSindhu].
|
|
|
|
Several other applications of model learning include:
learning embedded controller software \cite[DBLP:conf/icfem/SmeenkMVJ15], learning the TCP protocol \cite[Fiterau-Brostean16], and learning the MQTT protocol \cite[TapplerAB17].
|
|
|
|
|
|
\stopsubsection
|
|
\stopsection
|
|
\startsection
|
|
[title={Testing Techniques}]
|
|
|
|
|
|
|
|
An important step in automata learning is equivalence checking.
|
|
Normally, this is abstracted away and done by an oracle, but we intend to implement such an oracle ourselves for our applications.
|
|
Concretely, the problem we need to solve is that of \emph{conformance checking}\footnote{It is also known as \emph{machine verification} or \emph{fault detection}.} and this was first described by \citet[Moore56].
|
|
|
|
The problem is as follows:
|
|
Given the description of a finite state machine and a black box system, does the system behave exactly as the description specifies?
|
|
We wish to determine this by running experiments on the system (as it is a black box).
|
|
It should be clear that this is a hopelessly difficult task, as an error can be hidden arbitrarily deep in the system.
|
|
That is why we often assume some knowledge of the system.
|
|
In this thesis we often assume a \emph{bound on the number of states} of the system.
|
|
Under these conditions, \citet[Moore56] already solved the problem.
|
|
Unfortunately, his experiment is exponential in size, or in his own words: \quotation{fantastically large.}
|
|
|
|
Years later, \citet[Chow78, Vasilevskii73] independently designed efficient experiments.
|
|
In particular, the set of experiments is polynomial in size.
|
|
These techniques will be discussed in detail in \in{Chapter}[chap:test-methods].
|
|
More background and other related problems, as well as their complexity results, are well exposed in a survey of \citet[LeeY94].
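
To already give a flavour of these experiments: in one common formulation, the W-method takes a state cover $P$, a characterisation set $W$, and a bound $k$ on the number of extra states, and runs every test in $(P \cup P \cdot \Sigma) \cdot \Sigma^{\leq k} \cdot W$. A sketch, assuming $P$ and $W$ are given as sets of words:

\starttyping
from itertools import product

def w_method(state_cover, alphabet, char_set, k):
    """One common formulation of the W-method test suite:
    (P + P.A) . A^{<=k} . W, with k the number of extra states allowed."""
    infixes = ["".join(t)
               for n in range(k + 1)
               for t in product(alphabet, repeat=n)]
    transition_cover = set(state_cover) | {p + a
                                           for p in state_cover
                                           for a in alphabet}
    return {p + m + w
            for p in transition_cover
            for m in infixes
            for w in char_set}

# Hypothetical two-state example: P = {""}, W = {"a"}, one extra state.
tests = w_method({""}, "ab", {"a"}, k=1)
\stoptyping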
|
|
|
|
|
|
\stopsection
|
|
\startsection
|
|
[title={Nominal Techniques}]
|
|
|
|
In the second part of this thesis, I will present results related to \emph{nominal automata}.
|
|
Usually, nominal techniques are introduced in order to solve problems involving name binding in topics like the lambda calculus.
|
|
However, we use them to deal with \emph{register automata}.
|
|
These are automata which have an infinite alphabet, often thought of as input actions with \emph{data}.
|
|
The control flow of the automaton may actually depend on the data.
|
|
However, the data cannot be used in an arbitrary way as this would lead to many problems being undecidable.
|
|
\footnote{The class of automata with arbitrary data operations is sometimes called \emph{extended finite state machines}.}
|
|
A crucial concept in nominal automata is that of \emph{symmetries}.
|
|
|
|
To motivate the use of symmetries, we will look at an example of a register automaton.
|
|
In the following automaton we model a (not-so-realistic) login system for a single person.
|
|
The alphabet consists of the following actions:
|
|
\startformula\startmathmatrix[n=2, distance=1cm]
|
|
\NC \kw{register}(p) \NC \kw{logout}() \NR
|
|
\NC \kw{login}(p) \NC \kw{view}() \NR
|
|
\stopmathmatrix\stopformula
|
|
The \kw{register} action allows one to set a password $p$.
|
|
This can only be done when the system is initialised.
|
|
The \kw{login} and \kw{logout} actions speak for themselves and the \kw{view} action allows one to see the secret data (we abstract away from what the user actually gets to see here).
|
|
A simple automaton with roughly this behaviour is given in \in{Figure}[fig:login-system].
|
|
We will only informally discuss its semantics for now.
|
|
|
|
\startplacefigure
|
|
[title={A simple register automaton. The symbol $\ast$ denotes any input otherwise not specified. The $r$ in states $q_1$ and $q_2$ is a register.},
|
|
list={A simple register automaton.},
|
|
reference=fig:login-system]
|
|
\hbox{\starttikzpicture[node distance=4cm]
|
|
\node[state,initial] (0) {$q_0$};
|
|
\node[state with output,right of=0] (1) {$q_1$ \nodepart{lower} $r$};
|
|
\node[state with output,right of=1] (2) {$q_2$ \nodepart{lower} $r$};
|
|
|
|
\path[->]
|
|
(0) edge [loop below] node [below] {$\ast$ / \checknok}
|
|
(0) edge node [above, align=center] {$\kw{register}(p)$ / \checkok \\ set $r := p$} (1)
|
|
(1) edge [loop below] node [below] {$\ast$ / \checknok}
|
|
(1) edge [bend left] node [above, align=center] {$\kw{login}(p)$ / \checkok \\ if $r = p$} (2)
|
|
(2) edge [loop below] node [right, align=left] {$\kw{view}()$ / \checkok \\ $\ast$ / \checknok}
|
|
(2) edge [bend left] node [below] {$\kw{logout}()$ / \checkok} (1);
|
|
\stoptikzpicture}
|
|
\stopplacefigure
|
|
|
|
To model the behaviour nicely, we want the domain of passwords to be infinite.
|
|
After all, one should allow arbitrarily long passwords to be secure.
|
|
This means that a register automaton is actually an automaton over an infinite alphabet.
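
A minimal sketch (in Python; the representation of actions as a name plus an optional data value is mine) that simulates the automaton of \in{Figure}[fig:login-system] may make this concrete. Note that the passwords really range over an infinite set:

\starttyping
def login_system():
    """Simulates the register automaton of the login example.
    States: q0 (uninitialised), q1 (logged out), q2 (logged in)."""
    state, r = "q0", None    # r is the single register
    def step(action, data=None):
        nonlocal state, r
        if state == "q0" and action == "register":
            state, r = "q1", data    # store the password in the register
            return True
        if state == "q1" and action == "login" and data == r:
            state = "q2"
            return True
        if state == "q2" and action == "view":
            return True
        if state == "q2" and action == "logout":
            state = "q1"
            return True
        return False    # any other input: output "no", state unchanged
    return step

step = login_system()
assert step("register", "hello") and step("login", "hello") and step("view")
assert not step("login", "bye")    # already logged in, so login is rejected
\stoptyping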
|
|
|
|
Common algorithms for automata, such as learning, will not work with an infinite alphabet.
|
|
Any loop which iterates over the alphabet will diverge.
|
|
In order to cope with this, we will use the \emph{symmetries} present in the alphabet.
|
|
|
|
Let us continue with the example and look at its symmetries.
|
|
If a person registers with a password \quotation{\kw{hello}} and subsequently logs in with \quotation{\kw{hello}}, then this is not distinguishable from a person registering and logging in with \quotation{\kw{bye}}.
|
|
This is an example of symmetry:
|
|
the values \quotation{\kw{hello}} and \quotation{\kw{bye}} can be permuted, or interchanged.
|
|
Note, however, that the trace $\kw{register}(\kw{hello})\, \kw{login}(\kw{bye})$ is different from the two before.
|
|
So we see that we cannot simply collapse the whole alphabet into one equivalence class.
|
|
For this reason, we keep the alphabet infinite and explicitly mention its symmetries.
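
This symmetry can be made precise: two traces of data values are related by a permutation of the domain exactly when their values follow the same pattern of equalities. A small sketch (the helper name is mine):

\starttyping
def same_orbit(trace1, trace2):
    """Two data words can be mapped onto each other by a bijection of
    the domain iff they have the same equality pattern of positions."""
    if len(trace1) != len(trace2):
        return False
    n = len(trace1)
    return all((trace1[i] == trace1[j]) == (trace2[i] == trace2[j])
               for i in range(n) for j in range(n))

assert same_orbit(["hello", "hello"], ["bye", "bye"])
assert not same_orbit(["hello", "hello"], ["hello", "bye"])
\stoptyping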
|
|
|
|
Before formalising symmetries, let me remark that the use of symmetries in automata theory is not new.
|
|
One of the first to use symmetries was \citet[Sakamoto97].
|
|
He devised an \LStar{} learning algorithm for register automata, much like the one presented in \in{Chapter}[chap:learning-nominal-automata].
|
|
The symmetries are crucial to reduce the problem to a finite alphabet and use the regular \LStar{} algorithm.
|
|
(\in{Chapter}[chap:learning-nominal-automata] shows how to do it with more general symmetries.)
|
|
Around the same time, \citet[FerrariMT05] worked on automata theoretic algorithms for the $\pi$-calculus.
|
|
Their approach was based on the same symmetries and they developed a theory of \emph{named sets} to implement their algorithms.
|
|
Named sets are equivalent to nominal sets.
|
|
However, nominal sets are defined in a more elementary way.
|
|
The nominal sets we will soon see were introduced by \citet[GabbayP02] to solve certain problems of name binding in abstract syntax.
|
|
Although this is not really related to automata theory, it was picked up by \citet[BojanczykKL14].
|
|
They provide an equivalence between register automata and nominal automata.
|
|
Additionally, they generalise the work on nominal sets to other symmetries.
|
|
|
|
The symmetries we encounter in this thesis are the following, but other symmetries can be found in the literature.
|
|
\startitemize[after]
|
|
\item
|
|
The \quotation{equality symmetry}.
|
|
Here the domain can be any countably infinite set.
|
|
We can take, for example, the set of strings we used before as the domain from which we take passwords.
|
|
No further structure is used on this domain, meaning that any value is just as good as any other.
|
|
The symmetries therefore consist of all bijections on this domain.
|
|
\item
|
|
The \quotation{total order symmetry}.
|
|
In this case, we take a countably infinite set with a dense total order.
|
|
Typically, this means we use the rational numbers, $\Q$, as data values and symmetries which respect the ordering (see the sketch after this list).
|
|
\stopitemize
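
Analogously to the sketch for the equality symmetry above, equivalence under the total order symmetry only depends on the relative order of the values. Here I assume rational values; density (and the absence of endpoints) ensures that any order-preserving correspondence between finitely many values extends to the whole domain.

\starttyping
def same_order_orbit(trace1, trace2):
    """Under the total order symmetry, two data words are equivalent
    iff their values compare the same way at all pairs of positions."""
    if len(trace1) != len(trace2):
        return False
    n = len(trace1)
    sign = lambda x, y: (x < y) - (x > y)
    return all(sign(trace1[i], trace1[j]) == sign(trace2[i], trace2[j])
               for i in range(n) for j in range(n))

assert same_order_orbit([0.5, 2.0], [1.0, 7.5])     # both increasing
assert not same_order_orbit([0.5, 2.0], [1.0, 1.0]) # order differs
\stoptyping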
|
|
|
|
|
|
\startsubsection
|
|
[title={What is a nominal set?}]
|
|
|
|
So what exactly is a nominal set?
|
|
I will not define it here and leave the formalities to the corresponding chapters.
|
|
It suffices, for now, to think of nominal sets as abstract sets (often infinite) on which a group of symmetries acts.
|
|
The group of symmetries is not just any group: we fix a group of bijections on some fixed data domain.
|
|
|
|
In order to implement such sets algorithmically, we impose two finiteness requirements:
|
|
\startitemize
|
|
\item
|
|
\emph{Finite support}: each element of the set depends on only finitely many data values.
|
|
\item
|
|
\emph{Orbit-finiteness}: the set consists of only finitely many orbits, i.e., equivalence classes under the group action. For example, under the equality symmetry, the set of pairs of data values has exactly two orbits: the pairs $(a, a)$ and the pairs $(a, b)$ with $a \neq b$.
|
|
\stopitemize
|
|
|
|
|
|
|
|
\stopsubsection
|
|
\startsubsection
|
|
[title={Contributions}]
|
|
|
|
The following chapters are split into two parts.
|
|
\in{Part}[part:testing] contains material about testing techniques, while \in{Part}[part:nominal] is about nominal techniques.
|
|
Each chapter could be read in isolation.
|
|
However, the chapters do get more technical and mathematical -- especially in \in{Part}[part:nominal].
|
|
|
|
\description{\in{Chapter}[chap:test-methods].}
|
|
This chapter reviews existing test generation methods for finite state machines and discusses new algorithms for test generation.
|
|
|
|
|
|
\stopsubsection
|
|
\stopsection
|
|
\referencesifcomponent
|
|
\stopchapter
|
|
\stopcomponent
|