Recently, I’ve been investigating some test failures that I only experienced on my own machine, which happens to run some flavor of Linux. Investigating those failures, I ran down a rabbit hole that involves Unix, Unicode, Java, filesystems, internationalization and normalization. Here is the story of what I found down at the very bottom.

A story about Unix internationalization

One test that was failing is testAccessUniCodeFile, with the following exception:

java.nio.file.InvalidPathException: Malformed input or input contains unmappable characters: swedish-å.txt
	at java.base/sun.nio.fs.UnixPath.encode(UnixPath.java:145)
	at java.base/sun.nio.fs.UnixPath.(UnixPath.java:69)
	at java.base/sun.nio.fs.UnixFileSystem.getPath(UnixFileSystem.java:279)
	at java.base/java.nio.file.Path.resolve(Path.java:515)
	at org.eclipse.jetty.util.resource.FileSystemResourceTest.testAccessUniCodeFile(FileSystemResourceTest.java:335)
	...

This test asserts that Jetty can read files with non-ASCII characters in their names. But the failure happens in Path.resolve, when trying to create the file, before any Jetty code is executed. But why?

When accessing a file, the JVM has to deal with Unix system calls. The Unix system call typically used to create a new file or open an existing one is int open(const char *path, int oflag, …); which accepts the file name as its first argument.

In this test, the file name is "swedish-å.txt" which is a Java String. But that String isn’t necessarily encoded in memory in a way that the Unix system call expects. After all, a Java String is not the same as a C const char * so some conversion needs to happen before the C function can be called.

We know the Java String is represented by a UTF-8-encoded byte[] internally. But how is the C const char * actually is supposed to be represented? Well, that depends. The Unix spec specifies that internationalization depends on environment variables. So the encoding of the C const char * depends on the LANG, LC_CTYPE and LC_ALL environment variables and the JVM has to transform the Java String to a format determined by these environment variables.

Let’s have a look at those in a terminal:

$ echo "LANG=\"$LANG\" LC_CTYPE=\"$LC_CTYPE\" LC_ALL=\"$LC_ALL\""
LANG="C" LC_CTYPE="" LC_ALL=""
$

C is interpreted as a synonym of ANSI_X3.4-1968 by the JVM which itself is a synonym of US-ASCII.

I’ve explicitly set this variable in my environment as some commands use it for internationalization and I appreciate that all the command line tools I use strictly stick to English. For instance:

$ sudo LANG=C apt-get remove calc
...
Do you want to continue? [Y/n] n
Abort.
$ sudo LANG=fr_BE.UTF-8 apt-get remove calc
Do you want to continue? [O/n] n
Abort.
$

Notice the prompt to the question Do you want to continue? that is either [Y/n] (C locale) or [O/n] (Belgian-French locale) depending on the contents of this variable. Up until now, I didn’t know that it also impacted what files the JVM could create or open!

Knowing that, it is now obvious why the file cannot be created: it is not possible to convert the "swedish-å.txt" Java String to an ASCII C const char * simply because there is no way to represent the å character in ASCII.

Changing the LANG environment variable to en_US.UTF-8 allowed the JVM to successfully make that Java-to-C string conversion which allowed that test to pass.

Our build has now been changed to force the LC_ALL environment variable (as it is the one that overrides the other ones) to en_US.UTF-8 before running our tests to make sure this test passes even on environments with non-unicode locales.

A story about filesystem Unicode normalization

There was an extra pair of failing tests, that reported the following error:

java.lang.AssertionError: 
Expected: is <404>
     but: was <200>
Expected :is <404>
Actual   :<200>

For the context, those tests are about creating a file with a non-ASCII name encoded in some way and trying to serve it over HTTP with a request to the same non-ASCII name encoded in a different way. This is needed because Unicode supports different forms of encoding, notably Normalization Form Canonical Composition (NFC) and Normalization Form Canonical Decomposition (NFD). For our example string “swedish-å.txt”, this means there are two ways to encode the letter “å”: either U+00e5 LATIN SMALL LETTER A WITH RING ABOVE (NFC) or U+0061 LATIN SMALL LETTER A followed by U+030a COMBINING RING ABOVE (NFD).

Both are canonically equivalent, meaning that a unicode string with the letter “å” encoded either as NFC or NFD should be considered the same. Is that true in practice?

The failing tests are about creating a file whose name is NFC-encoded then trying to serve it over HTTP with the file name encoded in the URL as NFD and vice-versa.

When running those tests on MacOS on APFS, the encoding never matters and MacOS will find the file with a NFC-encoded filename when you try to open it with a NFD-encoded canonically equivalent filename and vice-versa.

When running those tests on Linux on ext4 or Windows on NTFS, the encoding always matters and Linux/Windows will not find the file with a NFC-encoded filename when you try to open it with a NFD-encoded canonically equivalent filename and vice-versa.

And this is exactly what the tests expect:

if (OS.MAC.isCurrentOs())
  assertThat(response.getStatus(), is(HttpStatus.OK_200));
else
  assertThat(response.getStatus(), is(HttpStatus.NOT_FOUND_404));

What I discovered is that when running those tests on Linux on ZFS, the encoding sometimes matters and Linux may find the file with a NFC-encoded filename when you try to open it with a NFD-encoded canonically equivalent filename and vice-versa, depending upon the ZFS normalization property; quoting the manual:

normalization = none | formC | formD | formKC | formKD
    Indicates whether the file system should perform a unicode normalization of file names whenever two file names are compared, and which normalization algorithm should be used. File names are always stored unmodified, names are normalized as part of any comparison process. If this property is set to a legal value other than none, and the utf8only property was left unspecified, the utf8only property is automatically set to on. The default value of the normalization property is none. This property cannot be changed after the file system is created. 

So if we check the normalization of the filesystem upon which the test is executed:

$ zfs get normalization /
NAME                      PROPERTY       VALUE          SOURCE
rpool/ROOT/nabo5t         normalization  formD          -
$

we can understand why the tests fail: due to the normalization done by ZFS, Linux can open the file given canonically equivalent filenames, so the test mistakenly assumes that Linux cannot serve this file. But if we create a new filesystem with no normalization property:

$ zfs get normalization /unnormalized/test/directory
NAME                      PROPERTY       VALUE          SOURCE
rpool/unnormalized        normalization  none           -
$

and run a copy of the tests from it, the tests succeed.

So we’ve adapted both tests to make them detect if the filesystem supports canonical equivalence and basing the assertion on that detection instead of hardcoding which OS behaves in which way.

A story about Unix, Unicode, Java, filesystems, internationalization and normalization

One thought on “A story about Unix, Unicode, Java, filesystems, internationalization and normalization

  • March 24, 2021 at 9:21 am
    Permalink

    Ludovic: thanks for taking the trouble to write this up, it is helpful.

    Reply

Leave a Reply

Your email address will not be published. Required fields are marked *