Debugging APFS Unicode Normalization on Linux

An emoji directory that ls could see but nothing could open:

$ ls -lah
total 0
d????????? ? ?  ?     ?            ?  🫣
-rwx------ 1 99 99  11K Oct 18  2025  .DS_Store
drwxr-xr-x 1 99 99 2.1K Aug 25  2025  some-other-dir

Context: encrypted APFS USB stick, no Mac around, mounted on Linux with apfs-fuse. lstat, rsync, tar, and Python’s os.lstat all failed with ENOENT on 🫣. The directory was simultaneously there and not there.

Why This Happens

APFS stores directory entries in a B-tree. Each entry’s key contains the parent inode ID, the filename, and a precomputed hash of the filename. That hash exists for fast lookups: instead of scanning every entry, the filesystem driver can jump straight to the right B-tree node.

This creates a split between two operations:

readdir (what ls uses): walks B-tree entries sequentially, no hashing needed
lookup (what happens when you actually access a path): searches by hash

If apfs-fuse computes a different hash than what macOS stored, lookup returns ENOENT even though the entry is sitting right there. readdir still works because it never uses the hash. That’s the paradox.

[Diagram: APFS B-tree lookup. apfs-fuse computes a different hash than the one stored, so the B-tree traversal misses the entry and returns ENOENT.]
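
The split between enumeration and lookup can be checked from userspace. The sketch below lists a directory and then tries to stat each name it got back; any name that enumerates but fails with ENOENT is a candidate for this kind of hash mismatch. find_ghost_entries is a hypothetical helper written for illustration, not part of apfs-fuse or any library.

```python
import os

def find_ghost_entries(directory):
    """Return names that readdir reports but a path lookup cannot resolve."""
    ghosts = []
    for name in os.listdir(directory):                # readdir path
        try:
            os.lstat(os.path.join(directory, name))   # lookup path
        except FileNotFoundError:                     # ENOENT
            ghosts.append(name)
    return ghosts
```

On a healthy directory this returns an empty list. Pointed at the parent of 🫣 on the mounted volume, it would be expected to return that one name.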

Finding the Inode

apfs-dump-quick reads raw APFS structures, bypassing FUSE entirely:

DirRec   A6 84EBA405 '🫣' => D0  [DT_DIR]
Inode    D0 => A6 D0 [TS] 8000 [INODE_NO_RSRC_FORK] 3 0 1C 0 [] 99 99 40755

The directory is inode 0xD0 (208 in decimal). Its contents:

DirRec   D0 ... 'tapes' => B5  [DT_DIR]
DirRec   D0 ... 'imgs'  => A7  [DT_DIR]

The data is intact. The inodes are valid. The problem is purely in the lookup path.
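
The second column of the DirRec line rewards a closer look. Assuming apfs-dump-quick prints the raw 32-bit name_len_and_hash field from the directory record key (an assumption about the tool's output, not something verified here), Apple's File System Reference places the name length, including the terminating NUL, in the low 10 bits and the hash in the upper 22 bits. A minimal sketch of that split; split_len_and_hash is a hypothetical helper:

```python
def split_len_and_hash(field):
    """Split a 32-bit name_len_and_hash value into (name_len, hash22)."""
    name_len = field & 0x3FF   # low 10 bits: UTF-8 length including the NUL
    hash22 = field >> 10       # upper 22 bits: the filename hash
    return name_len, hash22

name_len, hash22 = split_len_and_hash(0x84EBA405)
print(name_len, hex(hash22))   # 5 0x213ae9
print(int("D0", 16))           # 208
```

A length of 5 is consistent with 🫣: four UTF-8 bytes plus the NUL terminator. That suggests the name itself was stored and read correctly, and only the hash bits are in dispute.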

The Root Cause

Unicode normalization. The same visual character can have multiple valid byte representations, and the Unicode standard defines rules for picking a canonical one.

Take é. You can encode it two ways:

NFC (composed): single codepoint U+00E9, 2 bytes in UTF-8
NFD (decomposed): e (U+0065) followed by a combining acute accent (U+0301), 3 bytes in UTF-8

They render identically. They are byte-for-byte different. A naive string comparison between them fails. NFC is what the web and Linux mostly use; NFD is what macOS historically used for HFS+.
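
The two encodings of é are easy to compare with Python's standard unicodedata module. This is only an illustration of the NFC/NFD difference, not of anything APFS-specific:

```python
import unicodedata

nfc = "\u00e9"     # é as one precomposed codepoint
nfd = "e\u0301"    # e followed by COMBINING ACUTE ACCENT

print(nfc == nfd)                                            # False
print(len(nfc.encode("utf-8")), len(nfd.encode("utf-8")))    # 2 3
print(unicodedata.normalize("NFD", nfc) == nfd)              # True
print(unicodedata.normalize("NFC", nfd) == nfc)              # True
```

Any hash computed over the raw bytes differs between the two forms, which is why a filesystem that wants them to resolve to the same file has to normalize before hashing.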

macOS switched to APFS in 2017. APFS does not normalize filenames on disk: a name is stored in whatever form it was written. But the hash used for B-tree lookups is computed on an internally normalized form. That normalization is a variant of NFD, with additional rules such as case folding on case-insensitive volumes.

apfs-fuse’s HashFilename() is reverse-engineered from the APFS spec and binary analysis. It handles the common cases correctly but gets some codepoints wrong, particularly newer emoji. Emoji introduced in recent Unicode versions did not exist when most of the reverse-engineering happened, so they fall into edge cases of the hash function.

🫣 specifically (U+1FAE3, added in Unicode 14.0) has no decomposition, so its NFD and NFC forms are identical. The hash mismatch is therefore more likely a bug in how HashFilename() handles codepoints in that range than a normalization issue in the strict sense. The end result is the same, though:

stored   (macOS):     0x84EBA405
computed (apfs-fuse): 0x91FF...

Wrong hash, missed B-tree lookup, ENOENT.
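
The claim that 🫣 is unaffected by normalization can be checked with Python's unicodedata module. The module's tables depend on the Python version, but a codepoint they do not know about is also passed through unchanged, so the result should not vary:

```python
import unicodedata

name = "\U0001FAE3"   # 🫣

same_under_nfd = unicodedata.normalize("NFD", name) == name
same_under_nfc = unicodedata.normalize("NFC", name) == name
utf8_len = len(name.encode("utf-8"))

print(same_under_nfd, same_under_nfc)           # True True
print(repr(unicodedata.decomposition(name)))    # '' (no decomposition mapping)
print(utf8_len)                                 # 4
```

Since normalization is a no-op for this name, a correct and an incorrect hash implementation are being fed the same logical input, which points at the handling of the codepoint itself rather than at NFC versus NFD.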

The Fix

In LookupName inside ApfsLib/ApfsDir.cpp, the B-tree lookup failure path used to just return false. Fall back to a linear scan instead:

 if (!rc)
 {
     if (g_debug & Dbg_Dir)
-        std::cout << "Lookup failed!" << std::endl;
+        std::cout << "Lookup failed! Trying linear scan..." << std::endl;
+
+    std::vector<DirRec> entries;
+    if (ListDirectory(entries, parent_id))
+    {
+        for (const auto &de : entries)
+        {
+            if (de.name == name)
+            {
+                res = de;
+                return true;
+            }
+        }
+    }
     return false;
 }

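ListDirectory uses the sequential iterator, the same path as readdir. The scan is $O(n)$ instead of $O(\log n)$, but it only runs after the hash lookup has already failed, so lookups that succeed through the hash are unaffected. The cost is that a lookup for a name that genuinely does not exist now pays for a full directory scan before returning ENOENT.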
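
The behaviour of the patch can be modelled in a few lines of Python. This is a toy, not the APFS on-disk format or the apfs-fuse code: ToyDirectory, writer_hash, and reader_hash are all hypothetical, and reader_hash is deliberately wrong for non-ASCII names to mimic the mismatch.

```python
from bisect import bisect_left

def writer_hash(name):
    # Stand-in for the hash stored when the entry was created.
    return sum(name.encode("utf-8")) & 0xFFFF

def reader_hash(name):
    # Stand-in for a reimplementation that is wrong for some names.
    h = writer_hash(name)
    return h if name.isascii() else h ^ 0x1

class ToyDirectory:
    def __init__(self, names):
        # Entries sorted by (hash, name), like keys in a sorted index.
        self.entries = sorted((writer_hash(n), n) for n in names)

    def list_names(self):
        # Sequential walk: never consults the hash.
        return [name for _, name in self.entries]

    def lookup(self, name, fallback=True):
        key = (reader_hash(name), name)
        i = bisect_left(self.entries, key)        # O(log n) search by hash
        if i < len(self.entries) and self.entries[i] == key:
            return name
        if fallback:
            for _, candidate in self.entries:     # O(n) scan, only on a miss
                if candidate == name:
                    return candidate
        return None

d = ToyDirectory(["tapes", "imgs", "\U0001FAE3"])
print(d.lookup("tapes", fallback=False))          # tapes
print(d.lookup("\U0001FAE3", fallback=False))     # None
print(d.lookup("\U0001FAE3") == "\U0001FAE3")     # True
```

Without the fallback, the emoji name shows up in list_names() but cannot be looked up, which is the same paradox as on the real volume. With it, the miss is caught by the scan.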

After rebuilding and remounting, the directory was fully accessible.

Upstreamed as apfs-fuse#218.

2026-05-06
