Proceedings of Machine Learning Research 93:54–66, 2019
International Conference on Grammatical Inference

Learning Product Automata

Joshua Moerman                                        joshua.moerman@cs.ru.nl
Institute for Computing and Information Sciences
Radboud University
Nijmegen, the Netherlands

Editors: Olgierd Unold, Witold Dyrka, and Wojciech Wieczorek

© 2019 J. Moerman.

Abstract

We give an optimisation for active learning algorithms, applicable to learning Moore machines with decomposable outputs. Such machines can themselves be decomposed by projecting on each output, resulting in smaller components that can then be learnt with fewer queries. We give experimental evidence that this is a useful technique which can reduce the number of queries substantially; only in some cases is the performance worsened by the slight overhead. Compositional methods are widely used throughout engineering, and the decomposition presented in this article promises to be particularly interesting for learning hardware systems.

Keywords: query learning, product automata, composition
1. Introduction

Query learning (or active learning) is becoming a valuable tool in the engineering of both hardware and software systems (Vaandrager, 2017). Applications can be found in a broad range of areas: finding bugs in network protocols as shown by Fiterău-Broştean et al. (2016, 2017), assisting with refactoring legacy software as shown by Schuts et al. (2016), and reverse engineering bank cards by Chalupar et al. (2014).

These learning techniques originate from the field of grammatical inference. One of the crucial steps for applying them to black box systems was to move from deterministic finite automata to deterministic Moore or Mealy machines, capturing reactive systems with any kind of output. With few adaptations, the algorithms work well, as shown by the many applications. This is remarkable, since little specific knowledge is used besides the input alphabet of actions.

Realising that composition techniques are ubiquitous in engineering, we aim to use more structure of the system during learning. In the present paper we use the simplest type of composition: we learn product automata, where the outputs of several components are simply paired. Other types of composition, such as sequential composition of Mealy machines, are discussed in Section 5.

To the best of the author's knowledge, this has not been done before explicitly. Furthermore, libraries such as LearnLib (see Isberner et al., 2015) and libalf (see Bollig et al., 2010b) do not include such functionality out of the box. Implicitly, however, it has been done before. Rivest and Schapire (1994) use two tricks to reduce the size of some automata in their paper "Diversity-based inference of finite automata". The first trick is to look at the reversed automaton (in their terminology, the diversity-based automaton). The second trick (which is not explicitly mentioned, unfortunately) is to have a different automaton for each observable (i.e., output). In one of their examples the two tricks combined give a reduction from ±10^19 states to just 54 states.

Figure 1: A Moore machine with two outputs (left) can be equivalently seen as two (potentially smaller) Moore machines with a single output each (right).

In this paper, we isolate this trick and use it in query learning. We give an extension of L* which handles products directly, and we give a second algorithm which simply runs two learners simultaneously. Furthermore, we argue that this is particularly interesting in the context of model learning of hardware, as systems are commonly engineered in a compositional way. We give preliminary experimental evidence that the technique works and improves the learning process. As benchmarks, we learn (simulated) circuits which provide several output bits.
2. Preliminaries

We use the formalism of Moore machines to describe our algorithms. Nonetheless, the results can also be phrased in terms of Mealy machines.

Definition 1 A Moore machine is a tuple M = (Q, I, O, δ, o, q0) where Q, I and O are finite sets of states, inputs and outputs respectively, δ : Q × I → Q is the transition function, o : Q → O is the output function, and q0 ∈ Q is the initial state. We define the size of the machine, |M|, to be the cardinality of Q.

We extend the transition function to words as δ : Q × I* → Q. The behaviour of a state q is the map ⟦q⟧ : I* → O defined by ⟦q⟧(w) = o(δ(q, w)). Two states q, q′ are equivalent if ⟦q⟧ = ⟦q′⟧. A machine is minimal if all states have different behaviour and all states are reachable. We will often write ⟦M⟧ to mean ⟦q0⟧ and say that machines are equivalent if their initial states are equivalent.
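To make the definitions concrete, here is a minimal Python sketch of a Moore machine. This is our own illustration, not code from the paper's implementation; the names Moore, run and behaviour are hypothetical.

```python
# A minimal sketch of Definition 1; names are illustrative only.
from dataclasses import dataclass
from typing import Dict, FrozenSet, Hashable, Tuple

@dataclass
class Moore:
    states: FrozenSet[Hashable]                       # Q
    inputs: FrozenSet[Hashable]                       # I
    outputs: FrozenSet[Hashable]                      # O
    delta: Dict[Tuple[Hashable, Hashable], Hashable]  # Q x I -> Q
    out: Dict[Hashable, Hashable]                     # o : Q -> O
    init: Hashable                                    # q0

    def run(self, word) -> Hashable:
        """The extended transition function delta(q0, w)."""
        q = self.init
        for a in word:
            q = self.delta[(q, a)]
        return q

    def behaviour(self, word) -> Hashable:
        """The map [[q0]](w) = o(delta(q0, w))."""
        return self.out[self.run(word)]
```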
Definition 2 Given two Moore machines with equal input sets, M1 = (Q1, I, O1, δ1, o1, q01) and M2 = (Q2, I, O2, δ2, o2, q02), we define their product M1 × M2 by

    M1 × M2 = (Q1 × Q2, I, O1 × O2, δ, o, (q01, q02)),

where δ((q1, q2), a) = (δ1(q1, a), δ2(q2, a)) and o((q1, q2)) = (o1(q1), o2(q2)).
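Continuing the sketch above, the product of Definition 2 pairs states and outputs componentwise:

```python
# Product of two Moore machines with equal input sets (Definition 2), sketch.
def product(m1: Moore, m2: Moore) -> Moore:
    states = frozenset((q1, q2) for q1 in m1.states for q2 in m2.states)
    delta = {((q1, q2), a): (m1.delta[(q1, a)], m2.delta[(q2, a)])
             for (q1, q2) in states for a in m1.inputs}
    out = {(q1, q2): (m1.out[q1], m2.out[q2]) for (q1, q2) in states}
    outputs = frozenset((o1, o2) for o1 in m1.outputs for o2 in m2.outputs)
    return Moore(states, m1.inputs, outputs, delta, out, (m1.init, m2.init))
```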
Figure 2: A state of the 8-bit register machine.
The product is formed by running both machines in parallel and letting I act on both machines synchronously; the outputs of both machines are observed. Note that the product Moore machine might have unreachable states, even if the components are reachable. The product of more than two machines is defined by induction.

Let M be a machine with outputs in O1 × O2. By post-composing the output function with the projection functions we get two machines, called components, M1 and M2, with outputs in O1 and O2 respectively. This is depicted in Figure 1. Note that M is equivalent to M1 × M2. If M and its components Mi are taken to be minimal, then we have |M| ≤ |M1| · |M2| and |Mi| ≤ |M|. In the best case we have |Mi| = √|M|, and so the behaviour of M can be described using only 2√|M| states, which is less than |M| (if |M| > 4).

With iterated products the reduction can be more substantial, as shown in the following example. This reduction in state space is beneficial for learning algorithms.
We introduce basic notation: πi : A1 × A2 → Ai are the usual projection functions. On a function f : X → A1 × A2 we use the shorthand πi f to denote πi ◦ f. As usual, uv denotes the concatenation of strings u and v, and this is lifted to sets of strings: UV = {uv | u ∈ U, v ∈ V}. We define the set [n] = {1, ..., n} and the set of Boolean values B = {0, 1}.
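In the Python sketch above, taking a component amounts to post-composing the output function with πi, as in Figure 1. The function component and its index argument i are our illustrative names; minimisation of the result is left out.

```python
# Component M_i of a machine with paired outputs: keep only the i-th
# coordinate of every output (i = 0 or 1). A sketch.
def component(m: Moore, i: int) -> Moore:
    out = {q: m.out[q][i] for q in m.states}
    outputs = frozenset(o[i] for o in m.outputs)
    return Moore(m.states, m.inputs, outputs, m.delta, out, m.init)
```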
2.1. Example

We take the n-bit register machine example from Rivest and Schapire (1994). The state space of the n-bit register machine Mn is given by n bits and a position of the reading/writing head, see Figure 2. The inputs are commands to control the position of the head and to flip the current bit. The output is the current bit vector. Formally, it is defined as Mn = (B^n × [n], {L, R, F}, B^n, δ, o, i), where the initial state is i = ((0, ..., 0), 1) and the output is o(((b1, ..., bn), k)) = (b1, ..., bn). The transition function is defined such that L moves the head to the left, R moves the head to the right (both wrap around at either end), and F flips the current bit. Formally,

    δ(((b1, ..., bn), k), L) = ((b1, ..., bn), k − 1) if k > 1, and ((b1, ..., bn), n) if k = 1,
    δ(((b1, ..., bn), k), R) = ((b1, ..., bn), k + 1) if k < n, and ((b1, ..., bn), 1) if k = n,
    δ(((b1, ..., bn), k), F) = ((b1, ..., ¬bk, ..., bn), k).

The machine Mn is minimal and has n · 2^n states. So although this machine has very simple behaviour, learning it will require many queries because of its size. Luckily, the machine can be decomposed into smaller components. For each bit l, we define a component Mn^l = (B × [n], {L, R, F}, B, δ^l, π1, (0, 1)) which only stores one bit and the head position. The transition function δ^l is defined as before on L and R, but F only flips the bit if the head is at position l (i.e., δ^l((b, l), F) = (¬b, l) and δ^l((b, k), F) = (b, k) if k ≠ l).
The product Mn^1 × ··· × Mn^n is equivalent to Mn. Each of the components Mn^l is minimal and has only 2n states. So by this decomposition, we only need 2n · n = 2n² states to describe the whole behaviour of Mn. Note, however, that the product Mn^1 × ··· × Mn^n is not minimal; many states are unreachable.
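The register machine translates directly into the sketch above. The function name register_machine is ours; under this sketch, component(register_machine(n), l) followed by minimisation should recover the 2n-state machine Mn^l.

```python
# The n-bit register machine M_n (a sketch following the definition above).
from itertools import product as bitvectors

def register_machine(n: int) -> Moore:
    states = frozenset((bits, k)
                       for bits in bitvectors((0, 1), repeat=n)
                       for k in range(1, n + 1))

    def step(state, a):
        bits, k = state
        if a == 'L':                       # move head left, wrapping around
            return (bits, k - 1 if k > 1 else n)
        if a == 'R':                       # move head right, wrapping around
            return (bits, k + 1 if k < n else 1)
        flipped = bits[:k - 1] + (1 - bits[k - 1],) + bits[k:]
        return (flipped, k)                # F: flip the bit under the head

    delta = {(q, a): step(q, a) for q in states for a in 'LRF'}
    out = {q: q[0] for q in states}        # output the whole bit vector
    outputs = frozenset(bitvectors((0, 1), repeat=n))
    return Moore(states, frozenset('LRF'), outputs, delta, out, ((0,) * n, 1))
```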
3. Learning

We describe two approaches for active learning of product machines. One is a direct extension of the well-known L* algorithm. The other reduces the problem to any active learning algorithm, so that one can use more optimised algorithms.

We fix an unknown target machine M with a known input alphabet I and output alphabet O = O1 × O2. The goal of the learning algorithm is to infer a machine equivalent to M, given access to a minimally adequate teacher as introduced by Angluin (1987). The teacher answers the following two types of queries; a simple simulated teacher is sketched after the list.

• Membership queries (MQs): The query consists of a word w ∈ I* and the teacher answers with the output ⟦M⟧(w) ∈ O.

• Equivalence queries (EQs): The query consists of a Moore machine H, the hypothesis. The teacher answers YES if M and H are equivalent, and otherwise with a word w such that ⟦M⟧(w) ≠ ⟦H⟧(w).
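For experiments on simulated targets, both kinds of queries can be answered by a simple teacher object. The following sketch (class and method names are our own) answers equivalence queries from a fixed pool of test words, anticipating the sampling-based setup of Section 4:

```python
# A simulation-based minimally adequate teacher (sketch).
from typing import Hashable, Iterable, Optional, Tuple

class Teacher:
    def __init__(self, target: Moore, test_words: Iterable[Tuple]):
        self.target = target
        self.test_words = list(test_words)   # e.g. randomly sampled words

    def mq(self, word) -> Hashable:
        """Membership query: the output [[M]](w)."""
        return self.target.behaviour(word)

    def eq(self, hypothesis: Moore) -> Optional[Tuple]:
        """Equivalence query: None means YES, otherwise a counterexample."""
        for w in self.test_words:
            if self.target.behaviour(w) != hypothesis.behaviour(w):
                return w
        return None
```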
3.1. Learning product automata with an L* extension

We can use the general framework for automata learning as set up by van Heerdt et al. (2017). The general account does not directly give concrete algorithms, but it does give generalised definitions of closedness and consistency. The main data structure for the algorithm is an observation table.

Definition 3 An observation table is a triple (S, E, T) where S, E ⊆ I* are finite sets of words and T : S ∪ SI → O^E is defined by T(s)(e) = ⟦M⟧(se).

During the L* algorithm the sets S and E grow, and T encodes the knowledge of ⟦M⟧ acquired so far.

Definition 4 Let (S, E, T) be an observation table.

• The table is product-closed if for all t ∈ SI there exist s1, s2 ∈ S such that πi T(t) = πi T(si) for i = 1, 2.

• The table is product-consistent if for i = 1, 2 and for all s, s′ ∈ S, πi T(s) = πi T(s′) implies πi T(sa) = πi T(s′a) for all a ∈ I.
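Definition 4 can be checked directly on a table. The sketch below (our own rendering) represents words as tuples and the table as a dictionary T[(row, e)] holding pairs of outputs, with rows drawn from S ∪ SI:

```python
# Checking product-closedness and product-consistency (Definition 4), sketch.
def is_product_closed(S, I, E, T) -> bool:
    for s in S:
        for a in I:
            t = s + (a,)
            # per component i, the row of t must match the row of some s' in S
            if not all(any(all(T[(r, e)][i] == T[(t, e)][i] for e in E)
                           for r in S)
                       for i in (0, 1)):
                return False
    return True

def is_product_consistent(S, I, E, T) -> bool:
    for i in (0, 1):
        for s1 in S:
            for s2 in S:
                if all(T[(s1, e)][i] == T[(s2, e)][i] for e in E):
                    # rows equal in component i must stay equal after any a
                    if any(T[(s1 + (a,), e)][i] != T[(s2 + (a,), e)][i]
                           for a in I for e in E):
                        return False
    return True
```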
These definitions are related to the classical definitions of closedness and consistency, as shown in the following lemma. The converses of the first two points do not necessarily hold. We also prove that if an observation table is product-closed and product-consistent, then a well-defined product machine can be constructed which is consistent with the table.
Algorithm 1 The product-L* algorithm.
 1: Initialise S and E to {ε}
 2: Initialise T with MQs
 3: repeat
 4:   while (S, E, T) is not product-closed or not product-consistent do
 5:     if (S, E, T) not product-closed then
 6:       find t ∈ SI such that there is no s ∈ S with πi T(t) = πi T(s) for some i
 7:       add t to S and fill the new row using MQs
 8:     if (S, E, T) not product-consistent then
 9:       find s, s′ ∈ S, a ∈ I and e ∈ E such that πi T(s) = πi T(s′) but πi T(sa)(e) ≠ πi T(s′a)(e) for some i
10:       add ae to E and fill the new column using MQs
11:   Construct H (by Lemma 6)
12:   if EQ(H) gives a counterexample w then
13:     add w and all its prefixes to S
14:     fill the new rows with MQs
15: until EQ(H) = YES
16: return H
Lemma 5 Let OT = (S, E, T) be an observation table and let πi OT = (S, E, πi T) be a component. The following implications hold.

1. OT is closed =⇒ OT is product-closed.
2. OT is consistent ⇐= OT is product-consistent.
3. OT is product-closed ⇐⇒ πi OT is closed for each i.
4. OT is product-consistent ⇐⇒ πi OT is consistent for each i.

Proof (1) If OT is closed, then each t ∈ SI has an s ∈ S such that T(t) = T(s). This implies in particular that πi T(t) = πi T(s), as required. (In terms of the definition, this means we can take s1 = s2 = s.)

(2) Let OT be product-consistent and s, s′ ∈ S such that T(s) = T(s′). We then know that πi T(s) = πi T(s′) for each i and hence πi T(sa) = πi T(s′a) for each i and a. This means that T(sa) = T(s′a), as required.

Statements (3) and (4) just rephrase the definitions.
Lemma 6 Given a product-closed and -consistent table, we can define a product Moore machine consistent with the table, where each component is minimal.

Proof If the table OT is product-closed and -consistent, then by the previous lemma the tables πi OT are closed and consistent in the usual way. For these tables we can use the construction of Angluin (1987). As a result we get a minimal machine Hi which is consistent with the table πi OT. Taking the product of these gives a machine which is consistent with OT. (Beware that this product is not necessarily the minimal machine consistent with OT.)
Algorithm 2 Learning product machines with other learners.
 1: Initialise two learners L1 and L2
 2: repeat
 3:   while Li queries MQ(w) do
 4:     forward MQ(w) to the teacher and get output o
 5:     return πi o to Li
      {at this point both learners constructed a hypothesis}
 6:   Let Hi be the hypothesis of Li
 7:   Construct H = H1 × H2
 8:   if EQ(H) returns a counterexample w then
 9:     if ⟦H1⟧(w) ≠ π1 ⟦M⟧(w) then
10:       return w to L1
11:     if ⟦H2⟧(w) ≠ π2 ⟦M⟧(w) then
12:       return w to L2
13: until EQ(H) = YES
14: return YES to both learners
15: return H
The product-L* algorithm (Algorithm 1) resembles the original L* algorithm, but uses the new notions of closedness and consistency. Its termination follows from the fact that L* terminates on both components.

By Lemma 5 (1) we note that the algorithm does not need more rows than we would need by running L* on M. By point (4) of the same lemma, we find that it does not need more columns than L* would need on the components combined. This means that, in the worst case, the table is twice as big as the one the original L* would construct. However, in good cases (such as the running example), the table is much smaller, as the number of rows is smaller for each component and the columns needed for each component may be similar.
3.2. Learning product automata via a reduction

The previous algorithm constructs two machines from a single table. This suggests that we can also run two learning algorithms to construct the two machines. We lose the fact that the data structure is shared between the learners, but we gain that we can use more efficient algorithms than L* without any effort.

Algorithm 2 learns product automata via this reduction. It runs two learning algorithms at the same time. All membership queries are passed directly to the teacher and only the relevant output is passed back to the learner. (In the implementation, the query is cached, so that if the other learner poses the same query, it can be answered immediately.) If both learners are done posing membership queries, they will pose an equivalence query, at which point the algorithm constructs the product automaton. If the equivalence query returns a counterexample, the algorithm forwards it to the learners.

The crucial observation is that a counterexample is necessarily a counterexample for at least one of the two learners. (If at a certain stage only one learner makes an error, we keep the other learner suspended, as we may obtain a counterexample for it later on.)
This observation means that at least one of the learners makes progress and will eventually converge. Hence, the whole algorithm converges.
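A sketch of Algorithm 2 in Python, assuming a generic learner interface with hypothesis() and refine(w). These names are illustrative; the paper's implementation wraps LearnLib's learners instead.

```python
# Running two learners against a shared teacher (sketch of Algorithm 2).
def learn_product(teacher: Teacher, make_learner):
    cache = {}                    # share answered MQs between the learners

    def mq_for(i):
        def mq(w):
            if w not in cache:
                cache[w] = teacher.mq(w)   # one query serves both learners
            return cache[w][i]
        return mq

    learners = [make_learner(mq_for(0)), make_learner(mq_for(1))]
    while True:
        h1, h2 = learners[0].hypothesis(), learners[1].hypothesis()
        h = product(h1, h2)
        w = teacher.eq(h)
        if w is None:
            return h              # YES: both learners are done
        # w is a counterexample for at least one of the two components
        for i, hi in enumerate((h1, h2)):
            if hi.behaviour(w) != teacher.mq(w)[i]:
                learners[i].refine(w)
```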
In the worst case, twice as many queries will be posed compared to learning the whole machine at once. (This is because learning the full machine also learns its components.) In good cases, such as the running example, it requires far fewer queries. Typical learning algorithms require roughly O(n²) membership queries, where n is the number of states in the minimal machine. For the example Mn this bound gives O((n · 2^n)²) = O(n² · 2^{2n}) queries. When learning the components Mn^l with the above algorithm, the bound gives just O((2n)² + ··· + (2n)²) = O(n · (2n)²) = O(n³) queries, since there are n components of 2n states each.
4. Experiments

We have implemented the algorithm via reduction in LearnLib.¹ As we expect the reduction algorithm to be the more efficient and simpler of the two, we leave an implementation of the direct extension of L* as future work. The implementation handles products of any size (as opposed to only products of two machines). Additionally, the implementation also works on Mealy machines, and this is used for some of the benchmarks.

In this section, we compare the product learner with a regular learning algorithm. We use the TTT algorithm by Isberner et al. (2014) for the comparison and also as the learners used in Algorithm 2. We measure the number of membership and equivalence queries. The results can be found in Table 1.

The equivalence queries are implemented by random sampling so as to imitate the intended application of learning black-box systems. This way, an exact learning algorithm turns into a PAC (probably approximately correct) algorithm. Efficiency is typically measured by the total number of input actions, which also accounts for the length of the membership queries (including the resets). This is a natural measure in the context of learning black box systems, as each action requires some amount of time to perform.
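A random-sampling equivalence oracle along these lines could look as follows; the sample count and length distribution here are arbitrary choices for illustration, not the values used in the paper.

```python
# PAC-style equivalence query by random sampling (sketch).
import random

def random_eq(target: Moore, hypothesis: Moore,
              samples: int = 1000, max_len: int = 30):
    alphabet = list(target.inputs)
    for _ in range(samples):
        w = tuple(random.choice(alphabet)
                  for _ in range(random.randrange(max_len + 1)))
        if target.behaviour(w) != hypothesis.behaviour(w):
            return w             # counterexample found
    return None                  # probably approximately equivalent
```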
We evaluated the product learning algorithm on the following two classes of machines.

n-bit register machine The machines Mn are as described before. We note that the product learner is much more efficient, as expected.

Circuits In addition to the (somewhat artificial) examples Mn, we use circuits which appeared in the logic synthesis workshops (LGSynth89/91/93), part of the ACM/SIGDA benchmarks.² These models have been used as benchmarks before for FSM-based testing methods by Hierons and Türker (2015) and describe the behaviour of real-world circuits. The circuits have bit vectors as outputs, and can hence naturally be decomposed by taking each bit individually. As an example, Figure 3 depicts one of the circuits (bbara). The behaviour of this particular circuit can be modelled with seven states, but when restricting to each individual output bit, we obtain two machines of just four states. For the circuits bbsse and mark1, we additionally regrouped bits together in order to see how the performance changes when we decompose differently.

1. The implementation and models can be found on-line at https://gitlab.science.ru.nl/moerman/learning-product-automata.
2. The original files describing these circuits can be found at https://people.engr.ncsu.edu/brglez/CBL/benchmarks/.
Figure 3: The bbara circuit (left) has two output bits. This can be decomposed into two smaller circuits with a single output bit (middle and right).
For some circuits the number of membership queries is reduced compared to a regular learner. Unfortunately, the results are not as impressive as for the n-bit register machine. An interesting case is ex3, where the number of queries is slightly increased but the total number of actions performed is substantially reduced. The number of actions needed in total is actually reduced in all cases except bbsse. This exception can be explained by the fact that the biggest component of bbsse still has 25 states, which is close to the original 31 states. We also note that the choice of decomposition matters: for both mark1 and bbsse it was beneficial to regroup components.
In Figure 4, we look at the size of each hypothesis generated during the learning process. We note that, although each component grows monotonically, the number of reachable states in the product does not grow monotonically. In this particular instance, where we learn mark1, there was a hypothesis of 58 128 states, much bigger than the target machine of 202 states. This is not an issue, as the teacher will allow it and answer the query regardless. Even in the PAC model with membership queries, this poses no problem, as we can still efficiently determine membership. However, in some applications the equivalence queries are implemented with a model checker (e.g., in the work by Fiterău-Broştean et al., 2016) or a sophisticated test generation tool. In these cases, the increased size of intermediate hypotheses may be undesirable.
Table 1: Comparison of the product learner with an ordinary learner.

                                    Product learner            TTT learner
Machine   States  Components    EQs     MQs   Actions     EQs     MQs   Actions
M2             8           2      3     100       621       5     115       869
M3            24           3      3     252     1 855       5     347     2 946
M4            64           4      8     456     3 025       6   1 058    13 824
M5           160           5      6     869     7 665      17   2 723    34 657
M6           384           6     11   1 383    12 870      25   6 250    90 370
M7           896           7     11   2 087    24 156      52  14 627   226 114
M8          2048           8     13   3 289    41 732     160  34 024   651 678
bbara          7           2      3     167     1 049       3     216     1 535
keyb          41           2     25  12 464   153 809      24   6 024   265 805
ex3           28           2     24   1 133     9 042      18     878    91 494
bbsse         31           7     20  14 239   111 791       8   4 872    35 469
mark1        202          16     30  16 712   145 656      67  15 192   252 874
bbsse*        31           4     19  11 648    89 935       8   4 872    35 469
mark1*       202           8     22  13 027   117 735      67  15 192   252 874
Figure 4: The number of states for each hypothesis while learning mark1 (hypothesis number on the horizontal axis; number of states, on a logarithmic scale, on the vertical axis).
5. Discussion

We have shown two query learning algorithms which exploit a decomposable output. If the output can be split, then the machine itself can also be decomposed into components. As the preliminary experiments show, this can be a very effective optimisation for learning black box reactive systems. It should be stressed that the improvement of the optimisation depends on the independence of the components. For example, the n-bit register machine has nearly independent components and the reduction in the number of queries is big. The more realistic circuits did not show such drastic improvements in terms of queries. When taking the length of the queries into account as well (i.e., counting all actions performed on the system), we see an improvement for most of the test cases.

In the remainder of this section we discuss related ideas and future work.
5.1. Measuring independence

As the results show, the proposed technique is often beneficial, but not always. It would be interesting to know beforehand when decomposition pays off, and how to (quantitatively) measure the independence of the components. Such a measure could then be used by the learning algorithm to decide whether or not to decompose.
5.2. Generalisation to subsets of products

In some cases, we might know even more about our output alphabet. The output set O may be a proper subset of O1 × O2, indicating that some outputs can only occur "synchronised". For example, we might have O = {(0, 0)} ∪ {(a, b) | a, b ∈ [3]}, that is, the output 0 for either component can only occur if the other component is also 0.

In such cases we can still use the above algorithm, but we may insist that the teacher only accepts machines with outputs in O for the equivalence queries (as opposed to outputs in {0, 1, 2, 3}²). When constructing H = H1 × H2 in line 7 of Algorithm 2, we can do a reachability analysis on H to check for non-allowed outputs. If such traces exist, we know each is a counterexample for at least one of the two learners. With such traces we can fix the defect ourselves, without having to rely on the teacher.
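This reachability analysis is a plain breadth-first search over the product; a sketch, reusing the Moore class from above:

```python
# Find a trace of H whose output lies outside the allowed set O (sketch).
from collections import deque

def find_forbidden_trace(h: Moore, allowed_outputs):
    seen, queue = {h.init}, deque([(h.init, ())])
    while queue:
        q, trace = queue.popleft()
        if h.out[q] not in allowed_outputs:
            return trace         # counterexample for at least one learner
        for a in h.inputs:
            q2 = h.delta[(q, a)]
            if q2 not in seen:
                seen.add(q2)
                queue.append((q2, trace + (a,)))
    return None                  # all reachable outputs lie in O
```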
5.3. Product DFAs

For two DFAs (Q1, δ1, F1, q01) and (Q2, δ2, F2, q02), a state in the product automaton is accepting if both components are accepting. In the formalism of Moore machines, the final states are determined by their characteristic function, which means that the output is given by o(q1, q2) = o1(q1) ∧ o2(q2). Again, the components may be much smaller than the product, and this motivated Heinz and Rogers (2013) to learn (a subclass of) product DFAs. This type of product is more difficult to learn, as the two components are not directly observable. Such automata are also relevant in model checking, and some of the (open) problems are discussed by Kupferman and Mosheiff (2015).
5.4. Learning automata in reverse

The main result of Rivest and Schapire (1994) was to exploit the structure of the so-called "diversity-based" automaton. This automaton may also be called the reversed Moore machine. Reversing provides a duality between reachability and equivalence. This duality is theoretically explored by Rot (2016) and Bonchi et al. (2014) in the context of Brzozowski's minimisation algorithm.

Let M^R denote the reverse of M; then we have ⟦M^R⟧(w) = ⟦M⟧(w^R). This allows us to give an L* algorithm which learns M^R by posing membership queries with the words reversed. We computed M^R for the circuit models and all but one of them was much larger than the original. This suggests that reversal might not be useful as an optimisation when learning hardware or software systems. However, a more thorough investigation is desired.
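The wrapper needed for this is tiny: every membership query is reversed before being forwarded. A sketch, with reversed_mq as our illustrative name:

```python
# Learn M^R with an unmodified learner, using [[M^R]](w) = [[M]](w^R).
def reversed_mq(teacher: Teacher):
    return lambda w: teacher.mq(tuple(reversed(w)))
```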
Figure 5: The sequential composition A; B of two Mealy machines A and B.
5.5. Other types of composition

The case of learning a sequential composition is investigated by Abel and Reineke (2016). In their work, there are two Mealy machines, A and B, and the output of A is fed into B, see Figure 5. The goal is to learn a machine for B, assuming that A is known (i.e., white box). The oracle only answers queries for the sequential composition, which is defined formally as ⟦A; B⟧(w) = ⟦B⟧(⟦A⟧(w)), as sketched below. Since B can only be interacted with through A, we cannot use L* directly. The authors show how to learn B using a combination of L* and SAT solvers. Moreover, they give evidence that this is more efficient than learning A; B as a whole.
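As a sketch of this composed oracle, with Mealy behaviours represented abstractly as functions from input words to output words (a modelling choice of ours, not the formalisation of Abel and Reineke):

```python
# Sequential composition of behaviours: [[A;B]](w) = [[B]]([[A]](w)) (sketch).
def sequential(behaviour_a, behaviour_b):
    # behaviour_a : I* -> O_A*  (one output symbol of A per input symbol)
    # behaviour_b : O_A* -> O*  (B consumes A's output word)
    return lambda w: behaviour_b(behaviour_a(w))
```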
An interesting generalisation of the above is to consider A as an unknown as well. The goal is then to learn A and B simultaneously, while observing the outputs of B and the communication between the components. The authors conjecture that this would indeed be possible and would result in a learning algorithm which is more efficient than learning A; B as a whole (private communication).
Another type of composition is used by Bollig et al. (2010a). Here, several automata are put in parallel and communicate with each other. The goal is not to learn a black box system, but to use learning when designing such a system. Instead of words, the teacher (i.e., the designer in this case) receives message sequence charts which encode the processes and actions. Furthermore, they exploit partial order reduction in the learning algorithm.

We believe that a combination of our compositional technique and those above can improve the scalability of learning black box systems. Especially in the domain of software and hardware we expect such techniques to be important, since the systems themselves are often designed in a modular way.
Acknowledgments

We would like to thank Nathanaël Fijalkow, Ramon Janssen, Gerco van Heerdt, Harco Kuppens, Alexis Linard, Alexandra Silva, Rick Smetsers, and Frits Vaandrager for proofreading this paper and providing useful feedback. Thanks to Andreas Abel for discussing the case of learning a sequential composition of two black box systems. Also thanks to the anonymous reviewers for interesting references and comments.