mirror of https://git.cs.ou.nl/joshua.moerman/utf8-learner.git synced 2025-07-01 14:17:45 +02:00

initial commit

Joshua Moerman 2025-06-13 13:20:11 +02:00
commit 9a09a24df3
2 changed files with 59 additions and 0 deletions

.gitignore vendored Normal file

@@ -0,0 +1,4 @@
target
dependency-reduced-pom.xml
.vscode

README.md Normal file

@@ -0,0 +1,55 @@
UTF-8 Automaton Learner
=======================

See [my blog post](https://joshuamoerman.nl/2025/6/The-UTF-8-Automaton.html).

Using LearnLib to learn a model of *UTF-8* validators (or decoders). It only
learns the acceptance behaviour, not the transduction to Unicode code points.

UTF-8 implementations tested:

* JDK decoder in `java.nio.charset.CharsetDecoder` (depends on the Java platform)
* Guava validator `com.google.common.base.Utf8`
* Apache decoder `org.apache.commons.codec.binary.StringUtils`
* ICU4J has a charset detector; this gives a very different result
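
As an illustration of what "learning only the acceptance behaviour" means for
the first item, here is a minimal sketch of how a JDK decoder can be turned
into a yes/no acceptor for byte strings. The class and method names
(`JdkUtf8Acceptor`, `accepts`) are hypothetical and not taken from this
repository; only the `java.nio.charset` calls are real JDK API. The actual
wiring into LearnLib may well look different.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: reduces the JDK decoder to an accept/reject answer.
final class JdkUtf8Acceptor {
    static boolean accepts(byte[] input) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            // Decodes the whole buffer as one complete input; the decoded
            // characters are discarded, only success or failure matters.
            decoder.decode(ByteBuffer.wrap(input));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Two-byte encoding of U+00E9, followed by a lone lead byte (truncated).
        System.out.println(accepts(new byte[] {(byte) 0xC3, (byte) 0xA9}));
        System.out.println(accepts(new byte[] {(byte) 0xC3}));
    }
}
```

A fresh decoder is created per query so that no state leaks from one
membership query to the next.
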

For the equivalence oracle, I use a chain of several testers:

1. First, a small but precise test suite is tried.
2. Then, some random testing based on the Wp method.
3. Finally, exhaustive testing based on the W method.
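
The idea of chaining testers can be sketched in plain Java as follows. This
deliberately does not use LearnLib's API: `OracleChainSketch`, `Stage` and
`firstCounterexample` are hypothetical names for illustration, and the stages
in `main` are placeholders rather than real test suites. Each stage is asked
in order, and the first counterexample found is returned, so the cheap stages
run before the expensive ones.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of an equivalence-oracle chain (not LearnLib code).
final class OracleChainSketch {
    /** One tester: returns a counterexample for the hypothesis, if it finds one. */
    interface Stage<H, W> {
        Optional<W> findCounterexample(H hypothesis);
    }

    /** Asks the stages in order and stops at the first counterexample. */
    static <H, W> Optional<W> firstCounterexample(List<Stage<H, W>> stages, H hypothesis) {
        for (Stage<H, W> stage : stages) {
            Optional<W> counterexample = stage.findCounterexample(hypothesis);
            if (counterexample.isPresent()) {
                return counterexample; // later (more expensive) stages are skipped
            }
        }
        return Optional.empty(); // no stage could refute the hypothesis
    }

    public static void main(String[] args) {
        // Placeholder stages standing in for: small suite, random Wp, exhaustive W.
        Stage<String, String> smallSuite = h -> Optional.empty();
        Stage<String, String> randomWp = h -> Optional.of("some word refuting " + h);
        Stage<String, String> exhaustiveW = h -> Optional.empty();
        System.out.println(
                firstCounterexample(List.of(smallSuite, randomWp, exhaustiveW), "hypothesis-1"));
    }
}
```
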

All tested implementations result in the same DFA, except for ICU4J: it is
not a validator but a detector, and it accepts many more byte sequences.
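
For readers who want a concrete picture of a UTF-8 acceptor as a DFA, below is
a hand-written sketch that follows the table of well-formed UTF-8 byte
sequences in the Unicode Standard. It is not the learned model from this
repository, and whether it coincides state-for-state with the learned DFA is
an assumption; the class name `Utf8Dfa` and the state names are made up for
this example.

```java
// Hand-written UTF-8 acceptor (illustration only, not the learned model).
final class Utf8Dfa {
    // TAILn: n more continuation bytes (80..BF) are expected.
    // E0, ED, F0, F4: the next byte has a restricted range (this excludes
    // overlong forms, surrogates, and code points above U+10FFFF).
    enum State { START, TAIL1, TAIL2, TAIL3, E0, ED, F0, F4, REJECT }

    static State step(State s, int b) { // b is an unsigned byte, 0..255
        boolean cont = b >= 0x80 && b <= 0xBF;
        switch (s) {
            case START:
                if (b <= 0x7F) return State.START;
                if (b >= 0xC2 && b <= 0xDF) return State.TAIL1;
                if (b == 0xE0) return State.E0;
                if (b == 0xED) return State.ED;
                if (b >= 0xE1 && b <= 0xEF) return State.TAIL2; // E1..EC and EE..EF
                if (b == 0xF0) return State.F0;
                if (b >= 0xF1 && b <= 0xF3) return State.TAIL3;
                if (b == 0xF4) return State.F4;
                return State.REJECT; // 80..C1 and F5..FF never start a sequence
            case TAIL1: return cont ? State.START : State.REJECT;
            case TAIL2: return cont ? State.TAIL1 : State.REJECT;
            case TAIL3: return cont ? State.TAIL2 : State.REJECT;
            case E0: return (b >= 0xA0 && b <= 0xBF) ? State.TAIL1 : State.REJECT;
            case ED: return (b >= 0x80 && b <= 0x9F) ? State.TAIL1 : State.REJECT;
            case F0: return (b >= 0x90 && b <= 0xBF) ? State.TAIL2 : State.REJECT;
            case F4: return (b >= 0x80 && b <= 0x8F) ? State.TAIL2 : State.REJECT;
            default: return State.REJECT; // REJECT is a sink
        }
    }

    static boolean accepts(byte[] input) {
        State s = State.START;
        for (byte raw : input) {
            s = step(s, raw & 0xFF); // convert the signed Java byte to 0..255
        }
        return s == State.START; // accept only at a code point boundary
    }
}
```
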

How to build and run (this should take a couple of seconds):

```bash
./run.sh
```

## Decomposition

See the subdirectory `dfa-decompose`.

## Dependencies

I currently use the development version of `LearnLib` (and `automatalib`),
which I build as follows:

```bash
mvn clean package -Pbundles -DskipTests
```

Other dependencies can be installed with Maven. Note that I have very limited
experience in Java development, and that my Maven set-up may be less than
ideal.

## Copyright notice

(c) 2025 Joshua Moerman, Open Universiteit, licensed under the EUPL (European
Union Public License). If you want to use this code and find the license not
suitable for you, then please do get in touch.

```
SPDX-License-Identifier: EUPL-1.2
```