\project thesis
|
|
\startcomponent introduction
|
|
|
|
\startchapter
|
|
[title={Introduction},
|
|
reference=chap:introduction]
|
|
|
|
When I was younger, I often learned how to play with new toys by messing around with them, i.e., by pressing buttons at random, observing their behaviour, pressing more buttons, and so on.
|
|
I only resorted to the manual -- or asked \quotation{experts} -- to confirm my beliefs about how the toy worked.
|
|
Now that I am older, I do mostly the same with new devices, new tools, and new software.
|
|
However, now I know that this is an established computer science technique, called \emph{model learning}.
|
|
|
|
Model learning
|
|
\footnote{There are many names for this type of learning, such as \emph{active automata learning}.
|
|
The generic name \quotation{model learning} is chosen as a counterpoint to model checking.}
|
|
is an automated technique to construct a model -- often a type of \emph{automaton} -- from a black box system.
|
|
The goal of this technique can be manifold:
|
|
It can be used to reverse-engineer a system, to find bugs in it, to verify properties of the system, or to understand the system in one way or another.
|
|
It is \emph{not} just random testing: the information learned during the interaction with the system is actively used to guide following interactions.
|
|
Additionally, the information learned can be inspected and analysed.
|
|
In this sense, model learning opens up the black box.
|
|
|
|
This thesis is about model learning and related techniques.
|
|
In the first part, I present results concerning \emph{black box testing} of automata.
|
|
Testing is a crucial part in learning software behaviour and often remains a bottleneck in applications of model learning.
|
|
In the second part, I show how \emph{nominal techniques} can be used to learn automata over structured infinite alphabets.
|
|
This was directly motivated by work on learning network protocols which rely on identifiers or sequence numbers.
|
|
|
|
But before we get ahead of ourselves, we should first understand what we mean by learning, as learning means very different things to different people.
|
|
In educational science, learning may involve concepts such as teaching, blended learning, and interdisciplinarity.
|
|
Data scientists may think of data compression, feature extraction, and neural networks.
|
|
In this thesis we are mostly concerned with software verification.
|
|
But even in the field of verification several types of learning are relevant.
|
|
|
|
|
|
\startsection
|
|
[title={Model Learning}]
|
|
|
|
In the context of software verification, we often look at stateful computation with inputs and outputs.
|
|
For this reason, it makes sense to look at \emph{words}, or \emph{traces}.
|
|
For an alphabet $\Sigma$, we denote the set of words by $\Sigma^{*}$.
|
|
|
|
The learning problem is defined as follows.
|
|
There is some fixed, but unknown, language $\lang \subseteq \Sigma^{*}$.
|
|
This language may define the behaviour of a software component, a property in model checking, a set of traces from a protocol, etc.
|
|
We wish to infer a description of $\lang$ by only using finitely many words.
|
|
Note that this is not an easy task, as $\lang$ is often infinite.
|
|
|
|
Such a learning problem can be stated and solved in a variety of ways.
|
|
In the applications carried out in our research group, we often try to infer a model of a software component.
|
|
(\in{Chapter}[chap:applying-automata-learning] describes such an application.)
|
|
In such cases, a learning algorithm can interact with the software.
|
|
So it makes sense to study a learning paradigm which allows for \emph{queries}, and not just a data set of samples.
|
|
\footnote{Instead of query learning, people also use the term \emph{active learning}.}
|
|
|
|
A typical query learning framework was established by \citet[Angluin87].
|
|
In her framework, the learning algorithm may pose two types of queries to a \emph{teacher}, or \emph{oracle}:
|
|
|
|
\description{Membership queries (MQs)}
|
|
The learner poses such a query by providing a word $w \in \Sigma^{*}$ to the teacher.
|
|
The teacher will then reply whether $w \in \lang$ or not.
|
|
This type of query is often generalised to richer outputs, so the teacher replies with $\lang(w)$.
|
|
In some papers, such a query is then called an \emph{output query}.
|
|
|
|
\description{Equivalence queries (EQs)}
|
|
The learner can provide a hypothesised description of $\lang$ to the teacher.
|
|
If the hypothesis is correct, the teacher replies with \kw{yes}.
|
|
If, however, the hypothesis is incorrect, the teacher replies with \kw{no} together with a counterexample, i.e., a word which is in $\lang$ but not in the hypothesis or vice versa.
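
To make these two queries concrete, here is a minimal sketch of a teacher, written in Python (the names \kw{Teacher}, \kw{mq}, and \kw{eq} are illustrative; the bounded enumeration in \kw{eq} is only an approximation of a true equivalence query).

\starttyping
from itertools import product

class Teacher:
    """Illustrative teacher for a known language L over a finite alphabet.
    The equivalence query is approximated by trying all words up to a
    fixed length; a true equivalence query cannot enumerate Sigma^*."""

    def __init__(self, language, alphabet, max_len=8):
        self.language = language    # membership predicate for L
        self.alphabet = alphabet
        self.max_len = max_len

    def mq(self, word):
        # membership query: is the word in the language?
        return self.language(word)

    def eq(self, hypothesis):
        # equivalence query: None if correct, otherwise a counterexample
        for n in range(self.max_len + 1):
            for letters in product(self.alphabet, repeat=n):
                w = "".join(letters)
                if hypothesis(w) != self.language(w):
                    return w
        return None

# Example: L = words over {a, b} with an even number of a's.
teacher = Teacher(lambda w: w.count("a") % 2 == 0, "ab")
assert teacher.mq("abab")
assert teacher.eq(lambda w: True) == "a"
\stoptyping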
|
|
|
|
By posing many such queries, the learning algorithm is supposed to converge to a correct model.
|
|
This type of learning is hence called \emph{exact learning}.
|
|
\citet[Angluin87] showed that one can do this efficiently for deterministic finite automata (DFAs), that is, when $\lang$ is in the class of regular languages.
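
To give an impression of how such an algorithm works, below is a compact sketch of an \LStar{}-style learner. It uses the Maler-Pnueli variant (all suffixes of a counterexample are added as columns), which avoids the consistency check of the original algorithm; it is an illustration, not the exact algorithm of \citet[Angluin87].

\starttyping
def lstar(alphabet, mq, eq):
    """Sketch of an L*-style learner (Maler-Pnueli variant).
    mq(word) -> bool; eq(hypothesis) -> a counterexample word, or None."""
    S, E = [""], [""]    # access words (rows) and experiments (columns)
    table = {}           # observation table: word -> bool

    def fill():
        for s in S + [s + a for s in S for a in alphabet]:
            for e in E:
                if s + e not in table:
                    table[s + e] = mq(s + e)

    def row(s):
        return tuple(table[s + e] for e in E)

    while True:
        # close the table: every one-letter extension matches a row of S
        while True:
            fill()
            rows = {row(s) for s in S}
            new = next((s + a for s in S for a in alphabet
                        if row(s + a) not in rows), None)
            if new is None:
                break
            S.append(new)

        # the hypothesis DFA: states are the distinct rows of S
        state, exps = {row(s): s for s in S}, tuple(E)

        def hypothesis(word, state=state, exps=exps):
            s = ""
            for a in word:
                s = state[tuple(table[s + a + e] for e in exps)]
            return table[s]    # acceptance of the reached state

        cex = eq(hypothesis)
        if cex is None:
            return hypothesis
        # add all suffixes of the counterexample as new columns
        E.extend(cex[i:] for i in range(len(cex) + 1) if cex[i:] not in E)

# With the teacher above, this learns the even-number-of-a's DFA:
# learned = lstar("ab", teacher.mq, teacher.eq)
\stoptyping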
|
|
|
|
It should be clear why this is called \emph{query learning} or \emph{active learning}.
|
|
The learning algorithm initiates the interaction with the teacher by posing queries; it may construct its own data points and ask for their corresponding labels.
|
|
This is in contrast with \emph{passive learning}, where the algorithm only has access to a given data set of samples.
|
|
|
|
Another paradigm which is relevant for our type of applications is \emph{PAC-learning with membership queries}.
|
|
Here, the algorithm can again use MQs as before, but the EQs are replaced by random sampling.
|
|
So the allowed query is:
|
|
|
|
\description{Random sample queries (EXs)}
|
|
If the learner poses this query (there are no parameters), the teacher responds with a random word $w$, together with its label, i.e., whether $w \in \lang$ or not.
|
|
|
|
Instead of requiring that the learner exactly learns the model, we only require the following.
|
|
The learner should \emph{probably} return a model which \emph{approximates} the target.
|
|
This gives the name \emph{probably approximately correct} (PAC).
|
|
Note that there are two uncertainties: the probable and the approximate part.
|
|
Both parts are bounded by parameters, so one can control the confidence.
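
In symbols, the requirement reads as follows (a standard formulation; $H$ is the returned hypothesis, $D$ the unknown distribution from which \kw{EX} samples, $\epsilon$ the approximation parameter, and $\delta$ the confidence parameter): with probability at least $1 - \delta$, the learner must output a hypothesis $H$ such that

\startformula
P_{w \sim D} \left[ H(w) \neq \lang(w) \right] \leq \epsilon .
\stopformula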
|
|
|
|
As with many problems in computer science, we are also interested in the \emph{efficiency} of learning algorithms.
|
|
Instead of measuring time or space, we often analyse the number of queries posed by an algorithm.
|
|
Efficiency often means that we require a polynomial number of queries.
|
|
But polynomial in what?
|
|
The learner has no input, other than access to the teacher.
|
|
We require the algorithm to be polynomial in the \emph{size of the target} (i.e., the size of the description which has yet to be learned).
|
|
In the case of PAC learning we also require it to be polynomial in the two parameters for confidence.
|
|
|
|
Deterministic automata can be efficiently learned in the PAC model.
|
|
In fact, any efficient exact learning algorithm with MQs and EQs can be transformed into an efficient PAC algorithm with MQs (see \citenp[KearnsV94], exercise 8.1).
|
|
For this reason, we mostly focus on the former type of learning in this thesis.
|
|
The transformation from exact learning to PAC learning is implemented by simply \emph{testing} the hypothesis with random samples.
|
|
This can be postponed until we actually implement a learning algorithm and apply it.
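
A sketch of this transformation follows (the helper names are mine; the sample bound for the $i$-th equivalence query is one standard choice, growing with $i$ so that the total failure probability stays below $\delta$).

\starttyping
import math

def eq_via_sampling(hypothesis, mq, sample_word, i, epsilon, delta):
    """Approximate the i-th equivalence query (i = 1, 2, ...) by testing
    the hypothesis on random samples, labelled via membership queries."""
    m = math.ceil((math.log(1 / delta) + i * math.log(2)) / epsilon)
    for _ in range(m):
        w = sample_word()            # draw a word from the distribution D
        if hypothesis(w) != mq(w):   # disagreement: a counterexample
            return w
    return None                      # accept the hypothesis
\stoptyping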
|
|
|
|
When using only EQs, only MQs, or only EXs, there are hardness results for learning DFAs.
|
|
So the combinations MQs + EQs and MQs + EXs have been carefully picked.
|
|
They provide a minimal basis for efficient learning.
|
|
See the book of \citet[KearnsV94] for such hardness results and more information on PAC learning.
|
|
|
|
So far, all the queries are assumed to be \emph{just there}.
|
|
Somehow, these are existing procedures which we can invoke with \kw{MQ}$(w)$, \kw{EQ}$(H)$, or \kw{EX}$()$.
|
|
This is a useful abstraction when designing a learning algorithm.
|
|
One can analyse the complexity (in terms of number of queries) independently of how these queries are resolved.
|
|
Nevertheless, at some point in time one has to implement them.
|
|
In our case of learning software behaviour, membership queries are easily implemented:
|
|
Simply provide the word $w$ to a running instance of the software and observe the output.
|
|
\footnote{In reality, it is a bit harder than this.
|
|
There are plenty of challenges to solve, such as timing, choosing the alphabet, choosing the kind of observations to make, and being able to faithfully reset the software.}
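
As an illustration, a membership (output) query for a system reachable over a socket could look roughly as follows. The line-based protocol with a \kw{RESET} command is made up for this sketch; real systems each need their own adapter.

\starttyping
import socket

def make_output_query(host, port):
    """Hypothetical adapter: feed a word to a running system symbol by
    symbol and collect the outputs. The RESET command and the
    one-line-per-symbol protocol are assumptions of this sketch."""
    sock = socket.create_connection((host, port))
    replies = sock.makefile("r")

    def output_query(word):
        sock.sendall(b"RESET\n")    # bring the system to its initial state
        replies.readline()          # acknowledgement of the reset
        outputs = []
        for symbol in word:
            sock.sendall(symbol.encode() + b"\n")
            outputs.append(replies.readline().strip())
        return outputs              # the observed output word

    return output_query
\stoptyping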
|
|
Equivalence queries, however, cannot be implemented in general.
|
|
Even if we have the (machine) code, it is often far too complicated to check equivalence.
|
|
That is why we resort to testing with EX queries.
|
|
The EX query from PAC learning normally assumes a fixed, unknown distribution.
|
|
In our case, we implement a distribution to test against.
|
|
This cuts both ways:
|
|
On the one hand, it allows us to test only the behaviour we really care about; on the other hand, the results are only as good as our choice of distribution.
|
|
We deviate even further from the PAC-model as we sometimes change our distribution while learning.
|
|
Yet, as applications show, this is a useful way of learning software behaviour.
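
For example, a concrete \kw{sample\_word} for the sketch above could draw words with geometrically distributed lengths and uniformly random letters. This is just one possible choice, and exactly this choice determines which behaviour gets tested.

\starttyping
import random

def make_sampler(alphabet, p_stop=0.2):
    """A chosen test distribution: geometric length, uniform letters."""
    def sample_word():
        w = ""
        while random.random() > p_stop:   # expected length (1-p)/p
            w += random.choice(alphabet)
        return w
    return sample_word
\stoptyping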
|
|
|
|
|
|
|
|
\startsubsection
|
|
[title={Research challenges}]
|
|
|
|
Model learning is far from a solved problem.
|
|
In this thesis, we will mostly see learning of DFAs or Mealy machines.
|
|
Although this is restrictive, as many pieces of software require richer models, the technique has been successfully applied in many different areas.
|
|
The restrictions include the following.
|
|
\startitemize[after]
|
|
\item The system behaves deterministically.
|
|
\item One can reliably reset the system.
|
|
\item The system can be modelled with a finite state space.
|
|
\item The set of input actions is finite.
|
|
\item One knows when the target is reached.
|
|
\stopitemize
|
|
|
|
\startdescription[title={Research challenge 1: Confidence in the hypothesis.}]
|
|
Having confidence in a learned model is difficult.
|
|
We have PAC guarantees (as discussed before), but sometimes we may require other guarantees.
|
|
For example, we may require the hypothesis to be correct, provided that the real system is implemented with a certain number of states.
|
|
Efficiency is important here:
|
|
We want to obtain those guarantees quickly, and we want to find counterexamples quickly when the hypothesis is wrong.
|
|
Test generation methods are the topic of the first part of this thesis.
|
|
We will review existing algorithms and discuss new algorithms for test generation.
|
|
|
|
\stopdescription
\startdescription[title={Research challenge 2: Generalisation to infinite alphabets.}]
|
|
Automata over infinite alphabets are very useful for modelling protocols which involve identifiers or timestamps.
|
|
Not only is the alphabet infinite in these cases, but so is the state space, since the data values have to be remembered.
|
|
In the second part of this thesis, we will see how nominal techniques can be used to tackle this challenge.
|
|
|
|
Being able to learn automata over an infinite alphabet is not new.
|
|
It has been tackled in the theses of \cite[Aarts14, Fiterau-Brostean18, Cassel15].
|
|
In this thesis, we build on these ideas using nominal techniques.
|
|
\stopdescription
|
|
|
|
|
|
\stopsubsection
|
|
\startsubsection
|
|
[title={A few applications of learning}]
|
|
|
|
Since this thesis only contains one \quote{real-world} application of learning in \in{Chapter}[chap:applying-automata-learning], I think it is good to mention a few others.
|
|
Although we remain in the context of learning software behaviour, the applications are quite different.
|
|
|
|
Finite state machine conformance testing is a core topic in testing literature that is relevant for communication protocols and other reactive systems.
|
|
The combination of conformance testing and automata learning is applied in practice and it proves to be an effective way of finding bugs \cite[Joeri], replacing legacy software \cite[Philips], or inferring models to be used in model checking \cite[MeinkeSindhu].
|
|
|
|
Several other applications of model learning include:
learning embedded controller software \cite[DBLP:conf/icfem/SmeenkMVJ15], learning the TCP protocol \cite[Fiterau-Brostean16], and learning the MQTT protocol \cite[TapplerAB17].
|
|
|
|
|
|
\stopsubsection
|
|
\stopsection
|
|
\startsection
|
|
[title={Testing Techniques}]
|
|
|
|
|
|
|
|
An important step in automata learning is equivalence checking.
|
|
Normally, this is abstracted away and done by an oracle, but we intend to implement such an oracle ourselves for our applications.
|
|
Concretely, the problem we need to solve is that of \emph{conformance checking}\footnote{It is also known as \emph{machine verification} or \emph{fault detection}.} and this was first described by \citet[Moore56].
|
|
|
|
The problem is as follows:
|
|
Given the description of a finite state machine and a black box system, does the system behave exactly as the description specifies?
|
|
We wish to determine this by running experiments on the system (as it is a black box).
|
|
It should be clear that this is a hopelessly difficult task, as an error can be hidden arbitrarily deep in the system.
|
|
That is why we often assume some knowledge of the system.
|
|
In this thesis we often assume a \emph{bound on the number of states} of the system.
|
|
Under these conditions, \citet[Moore56] already solved the problem.
|
|
Unfortunately, his experiment is exponential in size, or in his own words: \quotation{fantastically large.}
|
|
|
|
Years later, \citet[Chow78, Vasilevskii73] independently designed efficient experiments.
|
|
In particular, the set of experiments is polynomial in size.
|
|
These techniques will be discussed in detail in \in{Chapter}[chap:test-methods].
|
|
More background and other related problems, as well as their complexity results, are well exposed in a survey of \citet[LeeY94].
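
To already give a flavour of these experiments: in one common formulation, the W-method takes a state cover $P$, a characterisation set $W$, and a bound $k$ on the number of extra states, and runs every test in $(P \cup P \cdot \Sigma) \cdot \Sigma^{\leq k} \cdot W$. A sketch, assuming $P$ and $W$ are given as sets of words:

\starttyping
from itertools import product

def w_method(state_cover, alphabet, char_set, k):
    """One common formulation of the W-method test suite:
    (P + P.A) . A^{<=k} . W, with k the number of extra states allowed."""
    infixes = ["".join(t)
               for n in range(k + 1)
               for t in product(alphabet, repeat=n)]
    transition_cover = set(state_cover) | {p + a
                                           for p in state_cover
                                           for a in alphabet}
    return {p + m + w
            for p in transition_cover
            for m in infixes
            for w in char_set}

# Hypothetical two-state example: P = {""}, W = {"a"}, one extra state.
tests = w_method({""}, "ab", {"a"}, k=1)
\stoptyping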
|
|
|
|
|
|
\stopsection
|
|
\startsection
|
|
[title={Nominal Techniques}]
|
|
|
|
In the second part of this thesis, I will present results related to \emph{nominal automata}.
|
|
Usually, nominal techniques are introduced in order to solve problems involving name binding in topics like the lambda calculus.
|
|
However, we use them to deal with \emph{register automata}.
|
|
These are automata which have an infinite alphabet, often thought of as input actions with \emph{data}.
|
|
The control flow of the automaton may actually depend on the data.
|
|
However, the data cannot be used in an arbitrary way as this would lead to many problems being undecidable.
|
|
\footnote{The class of automata with arbitrary data operations is sometimes called \emph{extended finite state machines}.}
|
|
A crucial concept in nominal automata is that of \emph{symmetries}.
|
|
|
|
To motivate the use of symmetries, we will look at an example of a register automaton.
|
|
In the following automaton we model a (not-so-realistic) login system for a single person.
|
|
The alphabet consists of the following actions:
|
|
\startformula\startmathmatrix[n=2, distance=1cm]
|
|
\NC \kw{register}(p) \NC \kw{logout}() \NR
|
|
\NC \kw{login}(p) \NC \kw{view}() \NR
|
|
\stopmathmatrix\stopformula
|
|
The \kw{register} action allows one to set a password $p$.
|
|
This can only be done when the system is initialised.
|
|
The \kw{login} and \kw{logout} actions speak for themselves and the \kw{view} action allows one to see the secret data (we abstract away from what the user actually gets to see here).
|
|
A simple automaton with roughly this behaviour is given in \in{Figure}[fig:login-system].
|
|
We will only informally discuss its semantics for now.
|
|
|
|
\startplacefigure
|
|
[title={A simple register automaton. The symbol $\ast$ denotes any input otherwise not specified. The $r$ in states $q_1$ and $q_2$ is a register.},
|
|
list={A simple register automaton.},
|
|
reference=fig:login-system]
|
|
\hbox{\starttikzpicture[node distance=4cm]
|
|
\node[state,initial] (0) {$q_0$};
|
|
\node[state with output,right of=0] (1) {$q_1$ \nodepart{lower} $r$};
|
|
\node[state with output,right of=1] (2) {$q_2$ \nodepart{lower} $r$};
|
|
|
|
\path[->]
|
|
(0) edge [loop below] node [below] {$\ast$ / \checknok}
|
|
(0) edge node [above, align=center] {$\kw{register}(p)$ / \checkok \\ set $r := p$} (1)
|
|
(1) edge [loop below] node [below] {$\ast$ / \checknok}
|
|
(1) edge [bend left] node [above, align=center] {$\kw{login}(p)$ / \checkok \\ if $r = p$} (2)
|
|
(2) edge [loop below] node [right, align=left] {$\kw{view}()$ / \checkok \\ $\ast$ / \checknok}
|
|
(2) edge [bend left] node [below] {$\kw{logout}()$ / \checkok} (1);
|
|
\stoptikzpicture}
|
|
\stopplacefigure
|
|
|
|
To model the behaviour nicely, we want the domain of passwords to be infinite.
|
|
After all, one should allow arbitrarily long passwords to be secure.
|
|
This means that a register automaton is actually an automaton over an infinite alphabet.
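
A minimal sketch (in Python; the representation of actions as a name plus an optional data value is mine) that simulates the automaton of \in{Figure}[fig:login-system] may make this concrete. Note that the passwords really range over an infinite set:

\starttyping
def login_system():
    """Simulates the register automaton of the login example.
    States: q0 (uninitialised), q1 (logged out), q2 (logged in)."""
    state, r = "q0", None    # r is the single register
    def step(action, data=None):
        nonlocal state, r
        if state == "q0" and action == "register":
            state, r = "q1", data    # store the password in the register
            return True
        if state == "q1" and action == "login" and data == r:
            state = "q2"
            return True
        if state == "q2" and action == "view":
            return True
        if state == "q2" and action == "logout":
            state = "q1"
            return True
        return False    # any other input: output "no", state unchanged
    return step

step = login_system()
assert step("register", "hello") and step("login", "hello") and step("view")
assert not step("login", "bye")    # already logged in, so login is rejected
\stoptyping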
|
|
|
|
Common algorithms for automata, such as learning, will not work with an infinite alphabet.
|
|
Any loop which iterates over the alphabet will diverge.
|
|
In order to cope with this, we will use the \emph{symmetries} present in the alphabet.
|
|
|
|
Let us continue with the example and look at its symmetries.
|
|
If a person registers with a password \quotation{\kw{hello}} and subsequently logs in with \quotation{\kw{hello}}, then this is not distinguishable from a person registering and logging in with \quotation{\kw{bye}}.
|
|
This is an example of symmetry:
|
|
the values \quotation{\kw{hello}} and \quotation{\kw{bye}} can be permuted, or interchanged.
|
|
Note, however, that the trace $\kw{register}(\kw{hello})\, \kw{login}(\kw{bye})$ is different from the two before.
|
|
So we see that we cannot simply collapse the whole alphabet into one equivalence class.
|
|
For this reason, we keep the alphabet infinite and explicitly mention its symmetries.
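
This symmetry can be made precise: two traces of data values are related by a permutation of the domain exactly when their values follow the same pattern of equalities. A small sketch (the helper name is mine):

\starttyping
def same_orbit(trace1, trace2):
    """Two data words can be mapped onto each other by a bijection of
    the domain iff they have the same equality pattern of positions."""
    if len(trace1) != len(trace2):
        return False
    n = len(trace1)
    return all((trace1[i] == trace1[j]) == (trace2[i] == trace2[j])
               for i in range(n) for j in range(n))

assert same_orbit(["hello", "hello"], ["bye", "bye"])
assert not same_orbit(["hello", "hello"], ["hello", "bye"])
\stoptyping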
|
|
|
|
Before formalising symmetries, let me remark that the use of symmetries in automata theory is not new.
|
|
One of the first to use symmetries was \citet[Sakamoto97].
|
|
He devised an \LStar{} learning algorithm for register automata, much like the one presented in \in{Chapter}[chap:learning-nominal-automata].
|
|
The symmetries are crucial to reduce the problem to a finite alphabet and use the regular \LStar{} algorithm.
|
|
(\in{Chapter}[chap:learning-nominal-automata] shows how to do it with more general symmetries.)
|
|
Around the same time, \citet[FerrariMT05] worked on automata theoretic algorithms for the $\pi$-calculus.
|
|
Their approach was based on the same symmetries and they developed a theory of \emph{named sets} to implement their algorithms.
|
|
Named sets are equivalent to nominal sets.
|
|
However, nominal sets are defined in a more elementary way.
|
|
The nominal sets we will soon see were introduced by \citet[GabbayP02] to solve certain problems of name binding in abstract syntax.
|
|
Although this is not really related to automata theory, it was picked up by \citet[BojanczykKL14].
|
|
They provide an equivalence between register automata and nominal automata.
|
|
Additionally, they generalise the work on nominal sets to other symmetries.
|
|
|
|
The symmetries we encounter in this thesis are the following, but other symmetries can be found in the literature.
|
|
\startitemize[after]
|
|
\item
|
|
The \quotation{equality symmetry}.
|
|
Here the domain can be any countably infinite set.
|
|
We can take, for example, the set of strings we used before as the domain from which we take passwords.
|
|
No further structure is used on this domain, meaning that any value is just as good as any other.
|
|
The symmetries therefore consist of all bijections on this domain.
|
|
\item
|
|
The \quotation{total order symmetry}.
|
|
In this case, we take a countably infinite set with a dense total order.
|
|
Typically, this means we use the rational numbers, $\Q$, as data values and symmetries which respect the ordering (see the sketch after this list).
|
|
\stopitemize
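
Analogously to the sketch for the equality symmetry above, equivalence under the total order symmetry only depends on the relative order of the values. Here I assume rational values; density (and the absence of endpoints) ensures that any order-preserving correspondence between finitely many values extends to the whole domain.

\starttyping
def same_order_orbit(trace1, trace2):
    """Under the total order symmetry, two data words are equivalent
    iff their values compare the same way at all pairs of positions."""
    if len(trace1) != len(trace2):
        return False
    n = len(trace1)
    sign = lambda x, y: (x < y) - (x > y)
    return all(sign(trace1[i], trace1[j]) == sign(trace2[i], trace2[j])
               for i in range(n) for j in range(n))

assert same_order_orbit([0.5, 2.0], [1.0, 7.5])     # both increasing
assert not same_order_orbit([0.5, 2.0], [1.0, 1.0]) # order differs
\stoptyping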
|
|
|
|
|
|
\startsubsection
|
|
[title={What is a nominal set?}]
|
|
|
|
So what exactly is a nominal set?
|
|
I will not define it here and leave the formalities to the corresponding chapters.
|
|
It suffices, for now, to think of nominal sets as abstract sets (often infinite) on which a group of symmetries acts.
|
|
The group of symmetries is not just any group: we fix a group of bijections on some fixed data domain.
|
|
|
|
In order to implement such sets algorithmically, we impose two finiteness requirements:
|
|
\startitemize
|
|
\item
|
|
\emph{Finite support}: each element of the set depends on only finitely many data values.
|
|
\item
|
|
\emph{Orbit-finiteness}: the set consists of only finitely many orbits, i.e., equivalence classes under the group action. For example, under the equality symmetry, the set of pairs of data values has exactly two orbits: the pairs $(a, a)$ and the pairs $(a, b)$ with $a \neq b$.
|
|
\stopitemize
|
|
|
|
|
|
|
|
\stopsubsection
|
|
\startsubsection
|
|
[title={Contributions}]
|
|
|
|
The following chapters are split into two parts.
|
|
\in{Part}[part:testing] contains material about testing techniques, while \in{Part}[part:nominal] is about nominal techniques.
|
|
Each chapter could be read in isolation.
|
|
However, the chapters do get more technical and mathematical -- especially in \in{Part}[part:nominal].
|
|
|
|
\description{\in{Chapter}[chap:test-methods].}
|
|
This chapter reviews existing test generation methods for finite state machines and discusses new algorithms for test generation.
|
|
|
|
|
|
\stopsubsection
|
|
\stopsection
|
|
\referencesifcomponent
|
|
\stopchapter
|
|
\stopcomponent
|