\project thesis
\startcomponent introduction
\startchapter
[title={Introduction},
reference=chap:introduction]
When I was younger, I often learned how to play with new toys by messing around with them, i.e., by pressing buttons at random, observing their behaviour, pressing more buttons, and so on.
I resorted to the manual -- or asked \quotation{experts} -- only to confirm my beliefs about how the toy worked.
Now that I am older, I do mostly the same with new devices, new tools, or new software.
However, now I know that this is an established computer science technique, called \emph{model learning}.
Model learning
\footnote{There are many names for this type of learning, such as \emph{active automata learning}.
The generic name \quotation{model learning} is chosen as a counterpoint to model checking.}
is an automated technique to construct a model -- often a type of \emph{automaton} -- from a black box system.
The goals of this technique are manifold:
It can be used to reverse-engineer a system, to find bugs in it, to verify properties of the system, or to understand the system in one way or another.
It is \emph{not} just random testing: the information learned during the interaction with the system is actively used to guide following interactions.
Additionally, the information learned can be inspected and analysed.
This thesis is about model learning and related techniques.
In the first part, I present results concerning \emph{black box testing} of automata.
Testing is a crucial part of learning software behaviour and often remains a bottleneck in applications of model learning.
In the second part, I show how \emph{nominal techniques} can be used to learn automata over structured infinite alphabets.
This was directly motivated by work on learning network protocols which rely on identifiers or sequence numbers.
But before we get ahead of ourselves, we should first understand what we mean by learning, as learning means very different things to different people.
In educational science, learning may involve concepts such as teaching, blended learning, and interdisciplinarity.
Data scientists may think of data compression, feature extraction, and neural networks.
In this thesis we are mostly concerned with software verification.
But even in the field of verification several types of learning are relevant.
\startsection
[title={Model Learning}]
In the context of software verification, we often look at stateful computation with inputs and outputs.
For this reason, it makes sense to look at \emph{words}, or \emph{traces}.
For an alphabet $\Sigma$, we denote the set of words by $\Sigma^{*}$.
The learning problem is defined as follows.
There is some fixed, but unknown to us, language $\lang \subseteq \Sigma^{*}$.
This language may define the behaviour of a software component, a property in model checking, a set of traces from a protocol, etc.
We wish to infer a description of $\lang$ by only using finitely many words.
Note that this is not an easy task, as $\lang$ is often infinite.
Such a learning problem can be stated and solved in a variety of ways.
In the applications pursued in our research group, we often try to infer a model of a software component.
(\in{Chapter}[chap:applying-automata-learning] describes such an application.)
In such cases, a learning algorithm can interact with the software.
So it makes sense to study a learning paradigm which allows for \emph{queries}, and not just a data set of samples.
\footnote{Instead of query learning, people also use the term \emph{active learning}.}
A typical query learning framework was established by \citet[Angluin87].
In her framework, the learning algorithm may pose two types of queries to a \emph{teacher}, or \emph{oracle}:
\description{Membership queries (MQs)}
The learner poses such a query by providing a word $w \in \Sigma^{*}$ to the teacher.
The teacher will then reply whether $w \in \lang$ or not.
This type of query is often generalised to richer outputs, so that the teacher replies with $\lang(w)$.
In some papers, such a query is then called an \emph{output query}.
\description{Equivalence queries (EQs)}
The learner can provide a hypothesised description of $\lang$ to the teacher.
If the hypothesis is correct, the teacher replies with \kw{yes}.
If, however, the hypothesis is incorrect, the teacher replies with \kw{no} together with a counterexample, i.e., a word which is in $\lang$ but not in the hypothesis or vice versa.
By posing many such queries, the learning algorithm is supposed to converge to a correct model.
This type of learning is hence called \emph{exact learning}.
\citet[Angluin87] showed that one can do this efficiently for deterministic finite automata (DFAs), when $\lang$ is in the class of regular languages.
It should be clear why this is called \emph{query learning} or \emph{active learning}.
The learning algorithm initiates interaction with the teacher by posing queries; it may construct its own data points and ask for their corresponding labels.
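To make the interplay concrete, the following is a minimal sketch, in Python, of a teacher answering both types of queries.
The class and method names are illustrative, and the bounded exhaustive search merely stands in for the abstract equivalence oracle.
\starttyping
from itertools import product

class Teacher:
    # A sketch of a teacher for exact learning (hypothetical names).
    def __init__(self, alphabet, target):
        self.alphabet = alphabet   # finite input alphabet
        self.target = target       # secret language: word -> bool

    def membership_query(self, word):
        # MQ: reply whether the word is in the language.
        return self.target(word)

    def equivalence_query(self, hypothesis, max_length=6):
        # EQ: return None for "yes", or a counterexample otherwise.
        for n in range(max_length + 1):
            for word in product(self.alphabet, repeat=n):
                if self.target(word) != hypothesis(word):
                    return word
        return None

# Example: the language of words with an even number of a's.
teacher = Teacher("ab", lambda w: w.count("a") % 2 == 0)
print(teacher.membership_query(("a", "b", "a")))  # True
print(teacher.equivalence_query(lambda w: True))  # ('a',)
\stoptyping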
Another paradigm which is relevant for our type of applications is \emph{PAC-learning with membership queries}.
Here, the algorithm can again use MQs as before, but the EQs are replaced by random sampling.
So the allowed query is:
\description{Random sample queries (EXs)}
If the learner poses this query (there are no parameters), the teacher responds with a random word $w$, together with its label, i.e., whether $w \in \lang$ or not.
Instead of requiring that the learner exactly learns the model, we only require the following.
The learner should \emph{probably} return a model which is \emph{approximately} correct with respect to the target.
This gives the name \emph{probably approximately correct} (PAC).
Note that there are two uncertainties: the probable and the approximate part.
Both parts are bounded by parameters, so one can determine the confidence.
As with many problems in computer science, we are also interested in the \emph{efficiency} of learning algorithms.
Instead of measuring time or space, we often analyse the number of queries posed by an algorithm.
Efficiency often means that we require a polynomial number of queries.
But polynomial in what?
The learner has no input, other than access to a teacher.
We ask the algorithms to be polynomial in the \emph{size of the target} (i.e., the size of the description which has yet to be learned).
In the case of PAC learning we also require it to be polynomial in the two parameters for confidence.
Deterministic automata can be efficiently learned in the PAC model.
In fact, any efficient exact learning algorithm with MQs and EQs can be transformed into an efficient PAC algorithm with MQs (see \citenp[KearnsV94], exercise 8.1).
For this reason, we mostly focus on the former type of learning in this thesis.
The transformation from exact learning to PAC learning is implemented by simply \emph{testing} the hypothesis with random samples.
This can be postponed until we actually implement a learning algorithm and apply it.
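As a minimal sketch (assuming we can sample words from some distribution), such a randomised equivalence check could look as follows in Python; the function names are illustrative.
\starttyping
import math
import random

def pac_equivalence_query(hypothesis, membership_query, sample_word,
                          epsilon=0.05, delta=0.05):
    # Approximate an EQ by random testing.  If all samples agree, the
    # hypothesis is epsilon-close to the target with probability at
    # least 1 - delta (the bound is for a single, fixed hypothesis).
    num_samples = math.ceil(math.log(1 / delta) / epsilon)
    for _ in range(num_samples):
        w = sample_word()                         # an EX query
        if membership_query(w) != hypothesis(w):
            return w                              # counterexample
    return None                                   # probably approximately correct

# One possible distribution: geometric length, uniform symbols.
def sample_word(alphabet="ab", stop=0.2):
    w = []
    while random.random() > stop:
        w.append(random.choice(alphabet))
    return tuple(w)
\stoptyping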
When using only EQs, only MQs, or only EXs, there are hardness results for learning DFAs.
So the combinations MQs + EQs and MQs + EXs have been carefully picked.
They provide a minimal basis for efficient learning.
See the book of \citet[KearnsV94] for such hardness results and more information on PAC learning.
So far, all the queries are assumed to be \emph{just there}.
Somehow, these are existing procedures which we can invoke with \kw{MQ}$(w)$, \kw{EQ}$(H)$, or \kw{EX}$()$.
This is a useful abstraction when designing a learning algorithm.
One can analyse the complexity (in terms of number of queries) independently of how these queries are resolved.
Nevertheless, at some point in time one has to implement them.
In our case of learning software behaviour, membership queries are easily implemented:
Simply provide the word $w$ to a running instance of the software and observe the output.
\footnote{In reality, it is a bit harder than this.
There are plenty of challenges to solve, such as timing, choosing your alphabet, choosing the kind of observations to make, and being able to faithfully reset the software.}
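As a sketch, such an output query could look as follows; \type{reset} and \type{step} are hypothetical names for whatever interface the software under test actually offers.
\starttyping
def output_query(system, word):
    # Provide the word to a running instance and observe the outputs.
    system.reset()   # faithfully reset the software first
    return [system.step(symbol) for symbol in word]
\stoptyping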
Equivalence queries, however, are in general not doable.
Even if we have the (machine) code, it is often way too complicated to check equivalence.
That is why we resort to testing with EX queries.
The EX query from PAC learning normally assumes a fixed, unknown, distribution.
In our case, we implement a distribution to test against.
This cuts both ways:
On the one hand, it allows us to only test behaviour we really care about; on the other hand, the results are only as good as our choice of distribution.
We deviate even further from the PAC-model as we sometimes change our distribution while learning.
Yet, as applications show, this is a useful way of learning software behaviour.
\stopsection
\startsection
[title={Research challenges}]
Model learning is far from a solved problem.
In this thesis, we will mostly see learning of DFAs or Mealy machines.
Although this is restrictive as many pieces of software require richer models, it has been successfully applied in many different areas.
The restrictions include the following.
\startitemize[after]
\item The system behaves deterministically.
\item One can reliably reset the system.
\item The system can be modelled with a finite state space.
This also means that the model cannot cope with time or data.
\item The set of input actions is finite.
\item One knows when the target is reached.
\stopitemize
\description{Research challenge 1: Confidence in the hypothesis.}
Having confidence in a learned model is difficult.
We have PAC guarantees (as discussed before), but sometimes we may require other guarantees.
For example, we may require the hypothesis to be correct, provided that the real system is implemented with a certain number of states.
Efficiency is important here:
We want to obtain those guarantees fast, and we want to quickly find counterexamples when the hypothesis is wrong.
Test generation methods are the topic of the first part of this thesis.
We will review existing algorithms and discuss new algorithms for test generation.
\startdescription[title={Research challenge 2: Generalisation to infinite alphabets.}]
Automata over infinite alphabets are very useful for modelling protocols which involve identifiers or timestamps.
Not only is the alphabet infinite in these cases, the state space is as well, since the values have to be remembered.
In the second part of this thesis, we will see how nominal techniques can be used to tackle this challenge.
Being able to learn automata over an infinite alphabet is not new.
It has been tackled in the theses of \citet[Aarts14], \citet[Cassel15], and \citet[Fiterau-Brostean18].
In the first, the problem is solved by considering abstractions, which reduce the alphabet to a finite one.
These abstractions are automatically refined when a counterexample is presented to the algorithm.
\citet[Fiterau-Brostean18] extends this approach to cope with \quotation{fresh values}, crucial for protocols such as TCP.
In the thesis by \citet[Cassel15], another approach is taken.
The queries are changed to tree queries.
The approach in my thesis will be based on symmetries, which gives yet another insight into the problem of learning such automata.
\stopdescription
\stopsection
\startsection
[title={Applications of Model Learning}]
Since this thesis contains only one real-world application of learning, in \in{Chapter}[chap:applying-automata-learning], I think it is good to mention a few others.
Although we remain in the context of learning software behaviour, the applications are quite different.
This is by no means a complete list.
\description{Bug finding in protocols.}
A prominent example is by \citet[Fiterau-Brostean16].
They learn models of TCP implementations -- both client and server sides.
Interestingly, they found bugs in the (closed source) Windows implementation.
Later, \citet[Fiterau-BrosteanH17] also found a bug in the sliding window of the Linux implementation.
Other protocols have been learned as well, such as the MQTT protocol by \citet[TapplerAB17], TLS by \citet[RuiterP15], and SSH by \citet[Fiterau-BrosteanLPRVV17].
Many of these applications reveal bugs by learning a model and subsequently applying model checking.
The combination of learning and model checking was first described by \citet[PeledVY02].
\description{Bug finding in smart cards.}
\citet[AartsRP13] learn the software on smart cards of several Dutch and German banks.
These cards use the EMV protocol, which is run on the card itself.
So this is an example of a real black box system, where no other monitoring is possible and no code is available.
No vulnerabilities were found, although each card had a slightly different protocol.
Card readers have been learned by \citet[ChaluparPPR14].
They built a Lego machine which could automatically press buttons, and they found a security flaw.
\description{Regression testing.}
\citet[HungarNS03] describe the potential of automata learning in regression testing.
The aim is not to find bugs, but to monitor the development process of a system.
By considering the differences between models at different stages, one can generate regression tests.
\description{Refactoring legacy software.}
Model learning can also be used in order to verify refactored software.
\citet[SchutsHV16] have applied this at a project within Philips.
They learn both an old version and a new version of the same component.
By comparing the learned models, some differences could be seen.
This gave developers opportunities to solve problems before replacing the old component with the new one.
\stopsection
\startsection
[title={Testing Techniques}]
An important step in automata learning is equivalence checking.
Normally, this is abstracted away and done by an oracle, but we intend to implement such an oracle ourselves for our applications.
Concretely, the problem we need to solve is that of \emph{conformance checking}\footnote{It is also known as \emph{machine verification} or \emph{fault detection}.} and this was first described by \citet[Moore56].
The problem is as follows:
We are given the description of a finite state machine and a black box system, does the system behave exactly as the description?
We wish to determine this by running experiments on the system (as it is a black box).
It should be clear that this is a hopelessly difficult task, as an error can be hidden arbitrarily deep in the system.
That is why we often assume some knowledge of the system.
In this thesis we often assume a \emph{bound on the number of states} of the system.
Under these conditions, \citet[Moore56] already solved the problem.
Unfortunately, his experiment is exponential in size, or in his own words: \quotation{fantastically large.}
Years later, \citet[Chow78, Vasilevskii73] independently designed efficient experiments.
In particular, the set of experiments is polynomial in size.
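To give a flavour of such an experiment: the W-method prescribes, roughly, a test suite of the shape
\startformula
T = P \cdot \Sigma^{\leq k} \cdot W,
\stopformula
where $P$ is a set of words reaching every transition of the specification, $W$ is a set of words separating every pair of states, and $k$ accounts for potential extra states of the implementation.
Both $P$ and $W$ can be chosen of polynomial size.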
These techniques will be discussed in detail in \in{Chapter}[chap:test-methods].
More background and other related problems, as well as their complexity results, are well covered in a survey by \citet[LeeY94].
\todo{I want to give an example (by giving a state machine) of why black box testing is hard.}
\todo{I want to move some related work from the second chapter to here}.
\stopsection
\startsection
[title={Nominal Techniques}]
In the second part of this thesis, I will present results related to \emph{nominal automata}.
Usually, nominal techniques are introduced in order to solve problems involving name binding in topics like lambda calculus.
However, we use them to deal with \emph{register automata}.
These are automata which have an infinite alphabet, often thought of as input actions with \emph{data}.
The control flow of the automaton may actually depend on the data.
However, the data cannot be used in an arbitrary way as this would lead to many problems being undecidable.
\footnote{The class of automata with arbitrary data operations is sometimes called \emph{extended finite state machines}.}
A crucial concept in nominal automata is that of \emph{symmetries}.
To motivate the use of symmetries, we will look at an example of a register automaton.
In the following automaton we model a (not-so-realistic) login system for a single person.
The alphabet consists of the following actions:
\startformula\startmathmatrix[n=2, distance=1cm]
\NC \kw{register}(p) \NC \kw{logout}() \NR
\NC \kw{login}(p) \NC \kw{view}() \NR
\stopmathmatrix\stopformula
The \kw{register} action allows one to set a password $p$.
This can only be done when the system is initialised.
The \kw{login} and \kw{logout} actions speak for themselves and the \kw{view} action allows one to see the secret data (we abstract away from what the user actually gets to see here).
A simple automaton with roughly this behaviour is given in \in{Figure}[fig:login-system].
We will only informally discuss its semantics for now.
\startplacefigure
[title={A simple register automaton. The symbol $\ast$ denotes any input otherwise not specified. The $r$ in states $q_1$ and $q_2$ is a register.},
list={A simple register automaton.},
reference=fig:login-system]
\hbox{\starttikzpicture[node distance=4cm]
\node[state,initial] (0) {$q_0$};
\node[state with output,right of=0] (1) {$q_1$ \nodepart{lower} $r$};
\node[state with output,right of=1] (2) {$q_2$ \nodepart{lower} $r$};
\path[->]
(0) edge [loop below] node [below] {$\ast$ / \checknok}
(0) edge node [above, align=center] {$\kw{register}(p)$ / \checkok \\ set $r := p$} (1)
(1) edge [loop below] node [below] {$\ast$ / \checknok}
(1) edge [bend left, align=center] node [above] {$\kw{login}(p)$ / \checkok \\ if $r = p$} (2)
(2) edge [loop below] node [right, align=left] {$\kw{view}()$ / \checkok \\ $\ast$ / \checknok}
(2) edge [bend left] node [below] {$\kw{logout}()$ / \checkok} (1);
\stoptikzpicture}
\stopplacefigure
To model the behaviour nicely, we want the domain of passwords to be infinite.
After all, one should allow arbitrarily long passwords to be secure.
This means that a register automaton is actually an automaton over an infinite alphabet.
Common algorithms for automata, such as learning, will not work with an infinite alphabet.
Any loop which iterates over the alphabet will diverge.
In order to cope with this, we will use the \emph{symmetries} present in the alphabet.
Let us continue with the example and look at its symmetries.
If a person registers with a password \quotation{\kw{hello}} and subsequently logs in with \quotation{\kw{hello}}, then this is not distinguishable from a person registering and logging in with \quotation{\kw{bye}}.
This is an example of symmetry:
the values \quotation{\kw{hello}} and \quotation{\kw{bye}} can be permuted, or interchanged.
Note, however, that the trace $\kw{register}(\kw{hello})\, \kw{login}(\kw{bye})$ is different from the two before.
So we see that we cannot simply collapse the whole alphabet into one equivalence class.
For this reason, we keep the alphabet infinite and explicitly mention its symmetries.
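As a sketch of how such symmetries can be exploited algorithmically: over the equality symmetry, two traces are related by a permutation of values precisely when renaming their values in order of first occurrence yields the same canonical trace.
A minimal illustration in Python (assuming, for simplicity, that every action carries exactly one data value):
\starttyping
def canonical_trace(trace):
    # Rename data values by order of first occurrence.
    renaming, result = {}, []
    for action, value in trace:
        if value not in renaming:
            renaming[value] = len(renaming)  # next fresh name: 0, 1, ...
        result.append((action, renaming[value]))
    return result

# register(hello) login(hello) and register(bye) login(bye) coincide:
assert canonical_trace([("register", "hello"), ("login", "hello")]) \
    == canonical_trace([("register", "bye"), ("login", "bye")])
# ... but register(hello) login(bye) is genuinely different:
assert canonical_trace([("register", "hello"), ("login", "bye")]) \
    != canonical_trace([("register", "hello"), ("login", "hello")])
\stoptyping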
Before formalising symmetries, let me remark that the use of symmetries in automata theory is not new.
One of the first to use symmetries was \citet[Sakamoto97].
He devised an \LStar{} learning algorithm for register automata, much like the one presented in \in{Chapter}[chap:learning-nominal-automata].
The symmetries are crucial to reduce the problem to a finite alphabet and use the regular \LStar{} algorithm.
(\in{Chapter}[chap:learning-nominal-automata] shows how to do it with more general symmetries.)
Around the same time, \citet[FerrariMT05] worked on automata-theoretic algorithms for the $\pi$-calculus.
Their approach was based on the same symmetries and they developed a theory of \emph{named sets} to implement their algorithms.
Named sets are equivalent to nominal sets.
However, nominal sets are defined in a more elementary way.
The nominal sets we will soon see were introduced by \citet[GabbayP02] to solve certain problems of name binding in abstract syntax.
Although this is not really related to automata theory, it was picked up by \citet[BojanczykKL14].
They provide an equivalence between register automata and nominal automata.
Additionally, they generalise the work on nominal sets to other symmetries.
The symmetries we encounter in this thesis are the following, but other symmetries can be found in the literature.
The choice of symmetry directly corresponds to the data values (and operations) used in an automaton.
The data values are often called \emph{atoms}.
\startitemize[after]
\item
The \quotation{equality symmetry}.
Here the domain can be any countably infinite set.
We can take, for example, the set of string we used before as the domain from which we take passwords.
No further structure is used on this domain, meaning that any value is just as good as any other.
The symmetries therefore consist of all bijections on this domain.
\item
The \quotation{total order symmetry}.
In this case, we take a countably infinite set with a dense total order.
Typically, this means we use the rational numbers, $\Q$, as data values and symmetries which respect the ordering.
\stopitemize
\startsubsection
[title={What is a nominal set?}]
So what exactly is a nominal set?
I will not define it here and leave the formalities to the corresponding chapters.
It suffices, for now, to think of nominal sets as abstract sets (often infinite) on which a group of symmetries acts.
This action makes it possible to interpret the symmetries of the data values in the abstract set.
For automata, this allows us to talk about symmetries on the state space, the set of transitions, and the alphabet.
In order to implement these sets algorithmically, we impose two finiteness requirements.
Both properties can be expressed using only the group action.
\startitemize[after]
\item
Each element is \emph{finitely supported}.
A way to think of this requirement is that each element is \quotation{constructed} out of finitely many data values.
\item
The set is \emph{orbit-finite}.
This means that we can choose finitely many elements such that any other element is a permuted version of one of those elements.
\stopitemize
If we wish to model the automaton from \in{Figure}[fig:login-system] as a nominal automaton, then we can simply define the state space as
\startformula
Q = \{ q_0 \} \cup \{ q_{1, a} \mid a \in \atoms \} \cup \{ q_{2, a} \mid a \in \atoms \},
\stopformula
where $\atoms$ is the set of atoms.
In this example, $\atoms$ is the set of all possible passwords.
The set $Q$ is infinite, but satisfies the two finiteness requirements.
The upshot of doing this is that the set $Q$ (and transition structure) corresponds directly to the semantics of the automaton.
We do not have to \emph{encode} how values relate or how they interact.
Instead, the set (and transition structure) defines all we need to know.
Algorithms, such as reachability, minimisation, and learning, can be run on such automata, despite the sets being infinite.
These algorithms can be implemented rather easily, by using a library which implements nominal sets.
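To give an impression of such an implementation: the state space $Q$ above need not be stored element by element; it suffices to store finitely many orbits, each with the size of its support.
A small sketch in Python, for the equality symmetry only (the representation is hypothetical and much simplified):
\starttyping
from itertools import permutations

# Each orbit is a pair (name, support size); an element of the orbit
# is the name together with a tuple of distinct atoms.
orbits_Q = [("q0", 0), ("q1", 1), ("q2", 1)]

def elements_with_support(orbits, atoms):
    # Enumerate all elements supported by the given finite set of atoms.
    for name, size in orbits:
        for support in permutations(atoms, size):
            yield (name, support)

# With atoms hello and bye: q0, q1_hello, q1_bye, q2_hello, q2_bye.
print(list(elements_with_support(orbits_Q, ["hello", "bye"])))
\stoptyping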
One has to be careful as not all results from automata theory transfer to nominal automata.
A notable example is the powerset construction, which converts a non-deterministic automaton into a deterministic one.
The problem here is that the powerset of a set is generally \emph{not} orbit-finite and so the finiteness requirement is not met.
Consequently, languages accepted by nominal DFAs are \emph{not} closed under Kleene star, or even concatenation.
\stopsubsection
\stopsection
\startsection
[title={Contributions}]
The following chapters are split into two parts.
\in{Part}[part:testing] contains material about testing techniques, while \in{Part}[part:nominal] is about nominal techniques.
Each chapter can be read in isolation.
However, the chapters do get more technical and mathematical -- especially in \in{Part}[part:nominal].
\todo{Should I be more clear about my contribution?}
\startdescription[title={\in{Chapter}[chap:test-methods]: FSM-based test methods.}]
This chapter introduces test generation methods which can be used for learning.
The methods are presented in a uniform way, which allows us to give a single proof of completeness for all these methods.
Moreover, the uniform presentation gives room to develop new test generation methods.
One particular contribution (besides the theoretical framework) is the new \emph{hybrid ADS} test generation method.
\stopdescription
\startdescription[title={\in{Chapter}[chap:applying-automata-learning]: Applying automata learning to embedded control software.}]
In this chapter we will apply model learning to an industrial case study.
It is a unique benchmark as it is much bigger than any of the applications seen before (3410 states and 77 inputs).
This makes it very challenging to learn a model and the main obstacle is finding counterexamples.
Here, my main contribution is the new test generation method from \in{Chapter}[chap:test-methods].
This is based on the following publication:
\cite[entry][DBLP:conf/icfem/SmeenkMVJ15].
\stopdescription
\startdescription[title={\in{Chapter}[chap:minimal-separating-sequences]: Minimal separating sequences for all pairs of states.}]
Continuing on test generation methods, this chapter presents an efficient algorithm to construct separating sequences.
Not only is the algorithm efficient -- it runs in $\bigO(n \log n)$ time -- it also constructs minimal length sequences.
The algorithm is based on a minimisation algorithm by \citet[Hopcroft71], but extending it to construct witnesses is non-trivial.
This is based on the following publication:
\cite[entry][DBLP:conf/lata/SmetsersMJ16].
\stopdescription
\startdescription[title={\in{Chapter}[chap:learning-nominal-automata]: Learning Nominal Automata.}]
In this chapter, we show how to learn automata over infinite alphabets.
We do this by translating the \LStar{} algorithm directly to a nominal version, \nLStar{}.
The correctness proofs mimic the original proofs by \citet[Angluin87].
Since our new algorithm is close to the original, we are able to translate variants of the \LStar{} algorithm as well.
In particular, we provide a learning algorithm for nominal non-deterministic automata.
This is based on the following publication:
\cite[entry][MoermanS0KS17].
\stopdescription
\startdescription[title={\in{Chapter}[chap:ordered-nominal-sets]: Fast computations on ordered nominal sets.}]
In this chapter, we provide a library to compute with nominal sets.
We restrict our attention to nominal sets over the total order symmetry.
This symmetry allows for a rather easy characterisation of orbits, and hence an easy implementation.
We experimentally show that it is competitive with existing tools, which are based on SMT solvers.
This is based on the following publication:
\cite[entry][VenhoekMR18].
\stopdescription
\startdescription[title={\in{Chapter}[chap:separated-nominal-automata]: Separated Nominal Automata.}]
We investigate how to reduce the size of certain nominal automata.
This is based on the observation that some languages (with outputs) are not just invariant under symmetries, but invariant under arbitrary \emph{transformations}, or \emph{renamings}.
We define a new type of automaton, the \emph{separated nominal automaton}, and show that they exactly accept those languages which are closed under renamings.
All of this is shown by using a theoretical framework: we establish a strong relationship between nominal sets on one hand, and nominal renaming sets on the other.
This is based on a paper under submission:
\cite[entry][MoermanR19].
\stopdescription
Besides these chapters, many of which are based on publications, the author has published the following papers.
These are not included in this thesis, but a short summary of those papers is presented below.
\startdescription[title={Complementing Model Learning with Mutation-Based Fuzzing.}]
Our group at the Radboud University participated in the RERS challenge.
This is a challenge where reactive software is provided and researchers have to assess the validity of certain properties (given as LTL specifications).
We approached this with model learning:
Instead of analysing the source code, we simply learned the external behaviour, and then used model checking on the learned model.
This has worked remarkably well, as the models of the external behaviour are not too big.
Our results were presented at the RERS workshop (ISOLA 2016).
The report can be found on ar$\Chi$iv:
\cite[entry][SmetsersMJV16].
\stopdescription
\startdescription[title={$n$-Complete test suites for IOCO.}]
In this paper, we investigate complete test suites for labelled transition systems (LTSs), instead of deterministic Mealy machines.
This is a much harder problem than conformance testing of deterministic systems.
The system may adversarially avoid certain states the tester wishes to test.
We provide a test suite which is $n$-complete (provided the implementation is a suspension automaton).
My personal contribution here is the proof of completeness, which resembles the proof presented in \in{Chapter}[chap:test-methods] closely.
The conference paper was presented at ICTSS:
\cite[entry][BosJM17].
An extended version will appear soon:
\cite[entry][BosJM18].
\stopdescription
\startdescription[title={Learning Product Automata.}]
In this article we consider Moore machines with multiple outputs.
These machines can be decomposed by projecting on each output,
resulting in smaller components that can be learned with fewer queries.
We give experimental evidence that this is a useful technique which can reduce the number of queries substantially.
This is all motivated by the idea that compositional methods are widely used throughout engineering and that we should use this in model learning.
This was presented by Vaandrager at ICGI:
\cite[entry][Moerman18].
\stopdescription
\stopsection
\referencesifcomponent
\stopchapter
\stopcomponent