# UTF-8 Automaton Learner
See my blog post.
This project uses LearnLib to learn a model of UTF-8 validators (or decoders). It learns only the acceptance behaviour, not the transduction to Unicode code points.
UTF-8 implementations tested:
- JDK decoder in `java.nio.charset.CharsetDecoder` (depends on the Java platform)
- Guava validator `com.google.common.base.Utf8`
- Apache decoder `org.apache.commons.codec.binary.StringUtils`
- ICU4J has a charset detector; this gives a very different result
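As an illustration, a membership query against the JDK decoder can be posed with stdlib classes only. This is a minimal sketch, not the project's actual oracle code; the class and method names here are hypothetical:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Membership {
    // Membership query: does the JDK decoder accept this byte sequence as valid UTF-8?
    static boolean isValidUtf8(byte[] input) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)      // reject malformed bytes
                    .onUnmappableCharacter(CodingErrorAction.REPORT) // instead of replacing them
                    .decode(ByteBuffer.wrap(input));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Valid three-byte encoding of U+20AC (the euro sign)
        System.out.println(isValidUtf8(new byte[] {(byte) 0xE2, (byte) 0x82, (byte) 0xAC}));
        // Overlong two-byte encoding of '/': rejected by a strict decoder
        System.out.println(isValidUtf8(new byte[] {(byte) 0xC0, (byte) 0xAF}));
    }
}
```

Note that `CodingErrorAction.REPORT` is essential: the default action silently replaces malformed input, which would make every query return "accept".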
For the equivalence oracle, I have a chain of several testers:
- First a small but precise test suite is tried
- Then some random testing based on the Wp method
- Then exhaustive testing based on the W method
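The idea of chaining testers can be sketched in plain Java as follows. This is a simplified illustration with hypothetical names, not LearnLib's actual equivalence-oracle API, and the random tester stands in for the much more involved Wp/W-method generators:

```java
import java.util.List;
import java.util.Optional;
import java.util.Random;
import java.util.function.Predicate;

public class ChainedEquivalence {
    // A tester searches for a word on which hypothesis and system under learning disagree.
    interface Tester {
        Optional<byte[]> findCounterexample(Predicate<byte[]> hyp, Predicate<byte[]> sul);
    }

    // Stage 1: a small, fixed test suite of hand-picked words.
    static Tester fixedSuite(List<byte[]> suite) {
        return (hyp, sul) -> suite.stream()
                .filter(w -> hyp.test(w) != sul.test(w))
                .findFirst();
    }

    // Stage 2: random words (a stand-in for randomised Wp-method testing).
    static Tester randomWords(int tries, int maxLen, long seed) {
        return (hyp, sul) -> {
            Random rnd = new Random(seed);
            for (int i = 0; i < tries; i++) {
                byte[] w = new byte[rnd.nextInt(maxLen + 1)];
                rnd.nextBytes(w);
                if (hyp.test(w) != sul.test(w)) return Optional.of(w);
            }
            return Optional.empty();
        };
    }

    // Try each tester in order; stop at the first counterexample found.
    static Optional<byte[]> chain(List<Tester> testers,
                                  Predicate<byte[]> hyp, Predicate<byte[]> sul) {
        for (Tester t : testers) {
            Optional<byte[]> ce = t.findCounterexample(hyp, sul);
            if (ce.isPresent()) return ce;
        }
        return Optional.empty(); // no tester found a counterexample
    }

    public static void main(String[] args) {
        // Toy example: the SUL accepts only ASCII, the hypothesis accepts everything.
        Predicate<byte[]> sul = w -> { for (byte b : w) if (b < 0) return false; return true; };
        Predicate<byte[]> hyp = w -> true;
        Optional<byte[]> ce = chain(
                List.of(fixedSuite(List.of(new byte[] {(byte) 0x80})),
                        randomWords(100, 8, 42)),
                hyp, sul);
        System.out.println(ce.isPresent()); // the fixed suite already distinguishes them
    }
}
```

Cheap testers run first, so the expensive exhaustive stage is only reached when the earlier stages find no counterexample.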
All implementations tested result in the same DFA (except ICU4J, because it is not a validator but a detector, and accepts much more).
How to build and run (should run in a couple of seconds):

```sh
./run.sh
```
## Decomposition

See the subdirectory `dfa-decompose`.
## Dependencies

I currently use the development version of LearnLib (and AutomataLib), and I build them as follows:

```sh
mvn clean package -Pbundles -DskipTests
```
Other dependencies can be installed with Maven. Note that I have very limited experience in Java development, and my Maven set-up may be less than ideal.
## Copyright notice
(c) 2025 Joshua Moerman, Open Universiteit, licensed under the EUPL (European Union Public License). If you want to use this code and find the license not suitable for you, then please do get in touch.
SPDX-License-Identifier: EUPL-1.2