commit 9a09a24df376655c878e2ec650a6b7df19dd602b
Author: Joshua Moerman
Date:   Fri Jun 13 13:20:11 2025 +0200

    initial commit

diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..514086d
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,4 @@
+target
+dependency-reduced-pom.xml
+.vscode
+
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..0f93251
--- /dev/null
+++ b/README.md
@@ -0,0 +1,55 @@
+UTF-8 Automaton Learner
+=======================
+
+See [my blog post](https://joshuamoerman.nl/2025/6/The-UTF-8-Automaton.html).
+
+Using LearnLib to learn a model of *UTF-8* validators (or decoders). It only
+learns the acceptance behaviour, not the transduction to unicode code points.
+
+UTF-8 implementations tested:
+* JDK decoder in `java.nio.charset.CharsetDecoder` (depends on java platform)
+* Guava validator `com.google.common.base.Utf8`
+* Apache decoder `org.apache.commons.codec.binary.StringUtils`
+* ICU4J has a charset detector; this gives a very different result
+
+For the equivalence oracle, I have a chain of several testers:
+1. First a small but precise test suite is tried
+2. Then some random testing based on the Wp method
+3. Then exhaustive testing based on the W method
+
+All implementations tested result in the same DFA (except for the ICU4J,
+because it is not a validator, but a detector and accepts much more).
+
+How to build and run (should run in a couple of seconds):
+```bash
+./run.sh
+```
+
+
+## Decomposition
+
+See the subdirectory `dfa-decompose`.
+
+
+## Dependencies
+
+I currently use the development version of `LearnLib` (and `automatalib`).
+And I build them as follows:
+```bash
+mvn clean package -Pbundles -DskipTests
+```
+
+Other dependencies can be installed with maven. Note that I have very limited
+experience in java development, and that my maven set-up may be less than
+ideal.
+
+
+## Copyright notice
+
+(c) 2025 Joshua Moerman, Open Universiteit, licensed under the EUPL (European
+Union Public License). If you want to use this code and find the license not
+suitable for you, then please do get in touch.
+
+```
+SPDX-License-Identifier: EUPL-1.2
+```
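
Note (not part of the commit): the README says the learner only observes acceptance behaviour, and lists the JDK decoder in `java.nio.charset.CharsetDecoder` as one of the implementations tested. A minimal sketch of what such an acceptance query could look like is given below. The class name `Utf8Acceptance` and the method `accepts` are hypothetical and do not come from the repository; how the actual code wires this into a LearnLib membership oracle is not shown in this diff.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not taken from the repository.
final class Utf8Acceptance {

    // Returns true if the JDK decoder decodes the whole byte sequence
    // without reporting an error, and false otherwise.
    static boolean accepts(byte[] word) {
        // REPORT makes the decoder throw on bad input instead of
        // silently substituting a replacement character.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            // decode(ByteBuffer) treats the buffer as the complete input,
            // so a truncated multi-byte sequence is also reported.
            decoder.decode(ByteBuffer.wrap(word));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] ascii = {0x41};                           // "A"
        byte[] twoByte = {(byte) 0xC3, (byte) 0xA9};     // U+00E9
        byte[] overlong = {(byte) 0xC0, (byte) 0x80};    // overlong encoding of U+0000
        byte[] truncated = {(byte) 0xC3};                // lead byte without continuation
        System.out.println(accepts(ascii));
        System.out.println(accepts(twoByte));
        System.out.println(accepts(overlong));
        System.out.println(accepts(truncated));
    }
}
```

A fresh decoder is created per query so that no state carries over between words, which is what a learner needs from an acceptance oracle.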