mirror of
https://git.cs.ou.nl/joshua.moerman/utf8-learner.git
synced 2025-07-01 22:27:46 +02:00
UTF-8 Automaton Learner
=======================

See [my blog post](https://joshuamoerman.nl/2025/6/The-UTF-8-Automaton.html).

This project uses LearnLib to learn a model of *UTF-8* validators (or decoders).
It only learns the acceptance behaviour, not the transduction to Unicode code
points.

UTF-8 implementations tested:
* JDK decoder in `java.nio.charset.CharsetDecoder` (depends on the Java platform)
* Guava validator `com.google.common.base.Utf8`
* Apache decoder `org.apache.commons.codec.binary.StringUtils`
* ICU4J has a charset detector; this gives a very different result
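Since only acceptance is learned, each implementation above only has to answer one question: is this byte sequence accepted or not? As a minimal sketch of such a membership check for the JDK decoder (the class and method names are hypothetical and not taken from this repository), one could configure the decoder to report errors instead of replacing them:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not the code used in this repository.
final class JdkUtf8Membership {
    /** Returns true iff the strict JDK decoder accepts the whole byte sequence. */
    static boolean accepts(byte[] word) {
        // REPORT makes the decoder throw instead of inserting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(word));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] euroSign = {(byte) 0xE2, (byte) 0x82, (byte) 0xAC};
        byte[] overlongNul = {(byte) 0xC0, (byte) 0x80};
        System.out.println(accepts(euroSign));
        System.out.println(accepts(overlongNul));
    }
}
```

A fresh decoder is created per query so that no state leaks between words, which is the simplest way to get a reset between membership queries.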

For the equivalence oracle, I use a chain of several testers:
1. First, a small but precise test suite is tried
2. Then some random testing based on the Wp method
3. Then exhaustive testing based on the W method
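The chaining idea in the list above can be sketched as follows. This is a simplified, self-contained illustration: the `Tester` interface, `chain`, and `suite` are hypothetical names, LearnLib's own oracle types are not used, and plain word lists stand in for the Wp and W method test generators.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical types for illustration; LearnLib's own oracle interfaces differ.
final class OracleChainSketch {
    /** An equivalence tester: returns a counterexample for the hypothesis, if it finds one. */
    interface Tester<H, W> {
        Optional<W> findCounterExample(H hypothesis);
    }

    /** Runs the testers in order and returns the first counterexample found. */
    static <H, W> Tester<H, W> chain(List<Tester<H, W>> testers) {
        return hypothesis -> {
            for (Tester<H, W> tester : testers) {
                Optional<W> counterExample = tester.findCounterExample(hypothesis);
                if (counterExample.isPresent()) {
                    return counterExample; // later (more expensive) testers are skipped
                }
            }
            return Optional.empty();
        };
    }

    /** A tester that compares hypothesis and target on a fixed list of words. */
    static Tester<Predicate<String>, String> suite(Predicate<String> target, List<String> words) {
        return hypothesis -> words.stream()
                .filter(w -> hypothesis.test(w) != target.test(w))
                .findFirst();
    }

    public static void main(String[] args) {
        Predicate<String> target = w -> w.length() % 2 == 0;
        // A deliberately wrong hypothesis: it accepts every word of length >= 2.
        Predicate<String> hypothesis = w -> w.isEmpty() || w.length() >= 2;
        Tester<Predicate<String>, String> chained = chain(List.of(
                suite(target, List.of("", "a")),            // cheap, targeted suite
                suite(target, List.of("aa", "aaa", "aaaa")) // stand-in for a larger suite
        ));
        System.out.println(chained.findCounterExample(hypothesis));
    }
}
```

Ordering the testers from cheap to expensive means that the exhaustive stage only runs once the earlier stages can no longer distinguish the hypothesis from the implementation.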

All implementations tested result in the same DFA, except for ICU4J: it is a
detector rather than a validator, and it accepts many more byte sequences.
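For comparison, here is a hand-written validator that follows the table of well-formed UTF-8 byte sequences from the Unicode Standard (see also RFC 3629). It is a sketch, not the learned model from this repository: the state names and the number of states are my own choice, and a learned DFA for a conforming validator would only be expected to accept the same language.

```java
// Hypothetical hand-written validator for illustration; this is not the
// learned model, and the state names are assumptions.
final class Utf8DfaSketch {
    static final int START = 0;  // between code points (the only accepting state)
    static final int ONE = 1;    // one continuation byte 80..BF expected
    static final int TWO = 2;    // two continuation bytes expected
    static final int THREE = 3;  // three continuation bytes expected
    static final int E0 = 4;     // after E0: next must be A0..BF (no overlong forms)
    static final int ED = 5;     // after ED: next must be 80..9F (no surrogates)
    static final int F0 = 6;     // after F0: next must be 90..BF (no overlong forms)
    static final int F4 = 7;     // after F4: next must be 80..8F (at most U+10FFFF)
    static final int REJECT = 8; // sink state

    private static boolean in(int b, int lo, int hi) {
        return lo <= b && b <= hi;
    }

    /** One transition; b is an unsigned byte value in 0..255. */
    static int step(int state, int b) {
        switch (state) {
            case START:
                if (b <= 0x7F) return START;
                if (in(b, 0xC2, 0xDF)) return ONE;
                if (b == 0xE0) return E0;
                if (b == 0xED) return ED;
                if (in(b, 0xE1, 0xEF)) return TWO; // E0 and ED were handled above
                if (b == 0xF0) return F0;
                if (in(b, 0xF1, 0xF3)) return THREE;
                if (b == 0xF4) return F4;
                return REJECT; // 80..C1 and F5..FF never start a sequence
            case ONE:   return in(b, 0x80, 0xBF) ? START : REJECT;
            case TWO:   return in(b, 0x80, 0xBF) ? ONE : REJECT;
            case THREE: return in(b, 0x80, 0xBF) ? TWO : REJECT;
            case E0:    return in(b, 0xA0, 0xBF) ? ONE : REJECT;
            case ED:    return in(b, 0x80, 0x9F) ? ONE : REJECT;
            case F0:    return in(b, 0x90, 0xBF) ? TWO : REJECT;
            case F4:    return in(b, 0x80, 0x8F) ? TWO : REJECT;
            default:    return REJECT;
        }
    }

    /** Accepts iff the word ends exactly on a code point boundary. */
    static boolean accepts(byte[] word) {
        int state = START;
        for (byte raw : word) {
            state = step(state, raw & 0xFF);
        }
        return state == START;
    }

    public static void main(String[] args) {
        byte[] euroSign = {(byte) 0xE2, (byte) 0x82, (byte) 0xAC};
        byte[] surrogate = {(byte) 0xED, (byte) 0xA0, (byte) 0x80};
        System.out.println(accepts(euroSign));
        System.out.println(accepts(surrogate));
    }
}
```

Comparing `accepts` with an implementation on all short byte sequences is one simple way to see whether that implementation behaves like a strict validator or, like a detector, accepts more.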

How to build and run (should run in a couple of seconds):
```bash
./run.sh
```


## Decomposition

See the subdirectory `dfa-decompose`.


## Dependencies

I currently use the development version of `LearnLib` (and `automatalib`),
which I build as follows:
```bash
mvn clean package -Pbundles -DskipTests
```

Other dependencies can be installed with Maven. Note that I have very limited
experience in Java development, and that my Maven set-up may be less than
ideal.


## Copyright notice

(c) 2025 Joshua Moerman, Open Universiteit, licensed under the EUPL (European
Union Public License). If you want to use this code and find the license not
suitable for you, then please do get in touch.

```
SPDX-License-Identifier: EUPL-1.2
```