mirror of
https://git.cs.ou.nl/joshua.moerman/utf8-learner.git
synced 2025-07-01 14:17:45 +02:00
initial commit
commit
9a09a24df3
2 changed files with 59 additions and 0 deletions
4
.gitignore
vendored
Normal file
@@ -0,0 +1,4 @@
target
dependency-reduced-pom.xml
.vscode

55
README.md
Normal file
@@ -0,0 +1,55 @@
UTF-8 Automaton Learner
=======================

See [my blog post](https://joshuamoerman.nl/2025/6/The-UTF-8-Automaton.html).

This project uses LearnLib to learn a model of *UTF-8* validators (or
decoders). It only learns the acceptance behaviour, not the transduction to
Unicode code points.

UTF-8 implementations tested:
* JDK decoder in `java.nio.charset.CharsetDecoder` (depends on the Java platform)
* Guava validator `com.google.common.base.Utf8`
* Apache Commons Codec decoder `org.apache.commons.codec.binary.StringUtils`
* ICU4J has a charset detector; this gives a very different result
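
To make "acceptance behaviour" concrete for the JDK entry in the list above: a membership query can be answered by asking the decoder whether it decodes a byte sequence without error. The following is a minimal sketch, not code from this repository. The class name `JdkUtf8Acceptance` and the method `accepts` are made up for illustration, and the sketch assumes a JDK whose UTF-8 decoder is strict once it is configured to report malformed input.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not from this repository): treats the JDK decoder as a yes/no validator. */
final class JdkUtf8Acceptance {

    /** Returns true iff the JDK decodes the complete input as UTF-8 without reporting an error. */
    static boolean accepts(byte[] input) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            // The one-shot decode treats the buffer as the entire input,
            // so a truncated multi-byte sequence at the end is also an error.
            decoder.decode(ByteBuffer.wrap(input));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(accepts(new byte[] {(byte) 0xC3, (byte) 0xA9})); // two-byte encoding of U+00E9
        System.out.println(accepts(new byte[] {(byte) 0xC0, (byte) 0x80})); // overlong encoding of U+0000
        System.out.println(accepts(new byte[] {(byte) 0xE2, (byte) 0x82})); // truncated three-byte sequence
    }
}
```

Only the boolean result is exposed here, which matches the scope stated above: the decoded characters are thrown away, so nothing about the transduction to code points is learned.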

For the equivalence oracle, I use a chain of several testers, tried in order:
1. First a small but precise test suite is tried
2. Then some random testing based on the Wp method
3. Then exhaustive testing based on the W method
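
The chaining idea in the list above can be sketched without any library: each tester either returns a counterexample or nothing, and the chain stops at the first tester that finds one. This is an illustration only. The names `OracleChainSketch`, `EquivalenceTester` and `chain` are hypothetical and are not LearnLib's API, and the three testers in `main` are stand-ins rather than real Wp- or W-method implementations.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Hypothetical sketch of an oracle chain; the names are not LearnLib's. */
final class OracleChainSketch {

    /** A tester looks for an input word on which the hypothesis and the system disagree. */
    interface EquivalenceTester<H, W> {
        Optional<W> findCounterexample(H hypothesis);
    }

    /** Combines testers: they are tried in order, and the first counterexample found wins. */
    static <H, W> EquivalenceTester<H, W> chain(List<EquivalenceTester<H, W>> testers) {
        return hypothesis -> {
            for (EquivalenceTester<H, W> tester : testers) {
                Optional<W> counterexample = tester.findCounterexample(hypothesis);
                if (counterexample.isPresent()) {
                    return counterexample; // later (more expensive) testers are skipped
                }
            }
            return Optional.empty(); // no tester could tell hypothesis and system apart
        };
    }

    public static void main(String[] args) {
        // Stand-ins for the three stages; hypotheses and words are plain strings here.
        EquivalenceTester<String, String> fixedSuite = h -> Optional.empty();
        EquivalenceTester<String, String> randomStage = h -> Optional.of("word found by random testing");
        EquivalenceTester<String, String> exhaustiveStage = h -> {
            throw new IllegalStateException("not reached in this demo");
        };

        EquivalenceTester<String, String> oracle =
                chain(Arrays.asList(fixedSuite, randomStage, exhaustiveStage));
        System.out.println(oracle.findCounterexample("hypothesis"));
    }
}
```

Ordering the cheap testers first means the exhaustive stage only runs once the earlier stages no longer find anything.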

All tested implementations result in the same DFA, except for ICU4J: it is not
a validator but a detector, and it accepts much more.
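
For readers who want to see what such a DFA looks like, here is a hand-written acceptor that follows the table of well-formed UTF-8 byte sequences in the Unicode Standard. It is a sketch under assumptions: it is not the output of the learner, the class and state names are invented, and whether it coincides exactly with the learned model is something I assume rather than something this README states.

```java
/** Hand-written UTF-8 acceptor (illustrative; not generated by the learner). */
final class Utf8ReferenceDfa {

    /** START is the only accepting state; REJECT is a sink. */
    enum State { START, TAIL1, TAIL2, TAIL3, AFTER_E0, AFTER_ED, AFTER_F0, AFTER_F4, REJECT }

    private static boolean in(int b, int lo, int hi) {
        return b >= lo && b <= hi;
    }

    /** One transition; b is an unsigned byte value in 0..255. */
    static State step(State s, int b) {
        switch (s) {
            case START:
                if (b <= 0x7F) return State.START;            // ASCII
                if (in(b, 0xC2, 0xDF)) return State.TAIL1;    // two-byte lead (C0, C1 would be overlong)
                if (b == 0xE0) return State.AFTER_E0;         // next byte restricted: no overlongs
                if (b == 0xED) return State.AFTER_ED;         // next byte restricted: no surrogates
                if (in(b, 0xE1, 0xEF)) return State.TAIL2;    // other three-byte leads
                if (b == 0xF0) return State.AFTER_F0;         // next byte restricted: no overlongs
                if (in(b, 0xF1, 0xF3)) return State.TAIL3;    // four-byte leads
                if (b == 0xF4) return State.AFTER_F4;         // next byte restricted: at most U+10FFFF
                return State.REJECT;
            case TAIL1:    return in(b, 0x80, 0xBF) ? State.START : State.REJECT;
            case TAIL2:    return in(b, 0x80, 0xBF) ? State.TAIL1 : State.REJECT;
            case TAIL3:    return in(b, 0x80, 0xBF) ? State.TAIL2 : State.REJECT;
            case AFTER_E0: return in(b, 0xA0, 0xBF) ? State.TAIL1 : State.REJECT;
            case AFTER_ED: return in(b, 0x80, 0x9F) ? State.TAIL1 : State.REJECT;
            case AFTER_F0: return in(b, 0x90, 0xBF) ? State.TAIL2 : State.REJECT;
            case AFTER_F4: return in(b, 0x80, 0x8F) ? State.TAIL2 : State.REJECT;
            default:       return State.REJECT;
        }
    }

    /** Accepts iff the run ends in START, i.e. every code point was completed. */
    static boolean accepts(byte[] input) {
        State s = State.START;
        for (byte raw : input) {
            s = step(s, raw & 0xFF); // Java bytes are signed, so mask to 0..255
        }
        return s == State.START;
    }

    public static void main(String[] args) {
        System.out.println(accepts(new byte[] {(byte) 0xE2, (byte) 0x82, (byte) 0xAC})); // U+20AC
        System.out.println(accepts(new byte[] {(byte) 0xED, (byte) 0xA0, (byte) 0x80})); // encoded surrogate
    }
}
```

The four `AFTER_*` states exist only because the byte following `E0`, `ED`, `F0` or `F4` has a narrower allowed range than an ordinary continuation byte.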

How to build and run (should run in a couple of seconds):
```bash
./run.sh
```


## Decomposition

See the subdirectory `dfa-decompose`.


## Dependencies

I currently use the development versions of `LearnLib` and `automatalib`,
which I build as follows:
```bash
mvn clean package -Pbundles -DskipTests
```

Other dependencies can be installed with Maven. Note that I have very limited
experience in Java development, and that my Maven set-up may be less than
ideal.


## Copyright notice

(c) 2025 Joshua Moerman, Open Universiteit, licensed under the EUPL (European
Union Public License). If you want to use this code and find the license not
suitable for you, then please do get in touch.

```
SPDX-License-Identifier: EUPL-1.2
```