Mirror of https://git.cs.ou.nl/joshua.moerman/utf8-learner.git (synced 2025-07-01 14:17:45 +02:00)
initial commit
commit 9a09a24df3
2 changed files with 59 additions and 0 deletions
.gitignore (vendored, normal file, 4 lines)
@@ -0,0 +1,4 @@
target
dependency-reduced-pom.xml
.vscode

README.md (normal file, 55 lines)
@@ -0,0 +1,55 @@
UTF-8 Automaton Learner
=======================

See [my blog post](https://joshuamoerman.nl/2025/6/The-UTF-8-Automaton.html).

This project uses LearnLib to learn a model of *UTF-8* validators (or decoders).
It only learns the acceptance behaviour, not the transduction to Unicode code
points.

UTF-8 implementations tested:
* JDK decoder in `java.nio.charset.CharsetDecoder` (depends on the Java platform)
* Guava validator `com.google.common.base.Utf8`
* Apache decoder `org.apache.commons.codec.binary.StringUtils`
* ICU4J, which has a charset detector; this gives a very different result

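To make "acceptance behaviour" concrete, here is a minimal sketch of how a byte sequence could be turned into an accept/reject answer with the JDK decoder. It is not taken from this repository: the class name `JdkUtf8Acceptor` and the method `accepts` are hypothetical, and the choice to treat truncated input as rejected is an assumption.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the repository.
final class JdkUtf8Acceptor {

    /** Returns true iff the JDK decoder decodes the whole input without error. */
    static boolean accepts(byte[] input) {
        // A fresh decoder per query keeps queries independent of each other.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            // decode(ByteBuffer) treats the buffer as the complete input,
            // so a truncated multi-byte sequence is reported as malformed.
            decoder.decode(ByteBuffer.wrap(input));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] eAcute = {(byte) 0xC3, (byte) 0xA9};   // U+00E9, two bytes
        byte[] truncated = {(byte) 0xC3};             // lead byte without continuation
        byte[] loneContinuation = {(byte) 0x80};      // continuation byte on its own
        System.out.println("C3 A9 accepted: " + accepts(eAcute));
        System.out.println("C3    accepted: " + accepts(truncated));
        System.out.println("80    accepted: " + accepts(loneContinuation));
    }
}
```

A learner only needs such a yes/no answer per byte sequence, which is why the decoded code points play no role in the learned model.
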
For the equivalence oracle, I use a chain of several testers:
1. First, a small but precise test suite is tried.
2. Then some random testing based on the Wp method.
3. Finally, exhaustive testing based on the W method.

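The chaining idea can be sketched in plain Java as follows. This is an illustration under assumptions, not the code in this repository: it does not use LearnLib's own oracle classes, the names `EqOracle`, `chain` and `fixedSuite` are hypothetical, and the Wp and W methods are replaced by fixed word lists as stand-ins.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical sketch of an oracle chain, not part of the repository.
final class OracleChainSketch {

    /** Returns a word on which hypothesis and target disagree, if one is found. */
    interface EqOracle<W> {
        Optional<W> findCounterexample(Predicate<W> hypothesis);
    }

    /** Tries each oracle in order and returns the first counterexample found. */
    static <W> EqOracle<W> chain(List<EqOracle<W>> oracles) {
        return hypothesis -> {
            for (EqOracle<W> oracle : oracles) {
                Optional<W> cex = oracle.findCounterexample(hypothesis);
                if (cex.isPresent()) {
                    return cex; // later (more expensive) oracles are skipped
                }
            }
            return Optional.empty();
        };
    }

    /** Stand-in tester: checks a fixed list of words against the target. */
    static <W> EqOracle<W> fixedSuite(List<W> words, Predicate<W> target) {
        return hypothesis -> words.stream()
                .filter(w -> hypothesis.test(w) != target.test(w))
                .findFirst();
    }

    public static void main(String[] args) {
        Predicate<String> target = w -> w.length() % 2 == 0; // toy target language
        Predicate<String> hypothesis = w -> true;            // wrong first guess
        EqOracle<String> cheap = fixedSuite(List.of("", "ab"), target);
        EqOracle<String> thorough = fixedSuite(List.of("abc", "a"), target);
        EqOracle<String> chained = chain(List.of(cheap, thorough));
        System.out.println(chained.findCounterexample(hypothesis));
    }
}
```

Ordering the testers from cheap to expensive means the exhaustive tester only runs on hypotheses that already survived the earlier ones.
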
All tested implementations result in the same DFA, except for ICU4J: it is a
detector rather than a validator, and accepts much more.

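For comparison, a UTF-8 acceptor can also be written by hand as a small DFA. The sketch below follows the table of well-formed UTF-8 byte sequences in the Unicode Standard; it is not the learned model and has not been checked against it. The class name `Utf8Dfa`, the state names, and the decision to accept only complete sequences are assumptions made for this illustration.

```java
// Hand-written reference sketch, not the model learned by this project.
final class Utf8Dfa {
    // START is the only accepting state; REJECT is a sink.
    static final int START = 0, T1 = 1, T2 = 2, T3 = 3;
    static final int E0 = 4, ED = 5, F0 = 6, F4 = 7, REJECT = 8;

    private static boolean in(int b, int lo, int hi) {
        return b >= lo && b <= hi;
    }

    /** One transition; b is an unsigned byte value in 0..255. */
    static int step(int state, int b) {
        switch (state) {
            case START:
                if (b <= 0x7F) return START;        // ASCII
                if (in(b, 0xC2, 0xDF)) return T1;   // 2-byte lead
                if (b == 0xE0) return E0;           // 3-byte lead, excludes overlongs
                if (b == 0xED) return ED;           // 3-byte lead, excludes surrogates
                if (in(b, 0xE1, 0xEF)) return T2;   // other 3-byte leads
                if (b == 0xF0) return F0;           // 4-byte lead, excludes overlongs
                if (in(b, 0xF1, 0xF3)) return T3;   // other 4-byte leads
                if (b == 0xF4) return F4;           // 4-byte lead, caps at U+10FFFF
                return REJECT;                      // 80..C1 and F5..FF
            case T1: return in(b, 0x80, 0xBF) ? START : REJECT;
            case T2: return in(b, 0x80, 0xBF) ? T1 : REJECT;
            case T3: return in(b, 0x80, 0xBF) ? T2 : REJECT;
            case E0: return in(b, 0xA0, 0xBF) ? T1 : REJECT;
            case ED: return in(b, 0x80, 0x9F) ? T1 : REJECT;
            case F0: return in(b, 0x90, 0xBF) ? T2 : REJECT;
            case F4: return in(b, 0x80, 0x8F) ? T2 : REJECT;
            default: return REJECT;
        }
    }

    /** Accepts iff the input is a sequence of complete, well-formed encodings. */
    static boolean accepts(byte[] input) {
        int state = START;
        for (byte raw : input) {
            state = step(state, raw & 0xFF); // convert signed byte to 0..255
        }
        return state == START;
    }

    public static void main(String[] args) {
        byte[] euro = {(byte) 0xE2, (byte) 0x82, (byte) 0xAC};      // U+20AC
        byte[] surrogate = {(byte) 0xED, (byte) 0xA0, (byte) 0x80}; // U+D800
        System.out.println("E2 82 AC accepted: " + accepts(euro));
        System.out.println("ED A0 80 accepted: " + accepts(surrogate));
    }
}
```

The four special states `E0`, `ED`, `F0` and `F4` exist only because the first continuation byte has a restricted range after those lead bytes.
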
How to build and run (should run in a couple of seconds):
```bash
./run.sh
```


## Decomposition

See the subdirectory `dfa-decompose`.


## Dependencies

I currently use the development versions of `LearnLib` and `automatalib`,
which I build as follows:
```bash
mvn clean package -Pbundles -DskipTests
```

Other dependencies can be installed with Maven. Note that I have very limited
experience in Java development, and that my Maven set-up may be less than
ideal.


## Copyright notice

(c) 2025 Joshua Moerman, Open Universiteit, licensed under the EUPL (European
Union Public License). If you want to use this code and find the license not
suitable for you, then please do get in touch.

```
SPDX-License-Identifier: EUPL-1.2
```