Debugging APFS Unicode Normalization on Linux
An emoji directory that ls could see but nothing could open:
$ ls -lah
total 0
d????????? ? ? ? ? ? 🫣
-rwx------ 1 99 99 11K Oct 18 2025 .DS_Store
drwxr-xr-x 1 99 99 2.1K Aug 25 2025 some-other-dir
Context: encrypted APFS USB stick, no Mac around, mounted on Linux with apfs-fuse. lstat, rsync, tar, and Python’s os.lstat all returned ENOENT on 🫣. The directory was simultaneously there and not there.
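The symptom is easy to check for mechanically. A minimal sketch using only the Python standard library; `find_unreachable_entries` is a hypothetical helper name of mine, not something apfs-fuse ships:

```python
import errno
import os


def find_unreachable_entries(directory):
    """Return names that a directory listing reports but lstat cannot resolve."""
    unreachable = []
    # os.listdir only needs the directory listing, which still worked here.
    for name in os.listdir(directory):
        try:
            # os.lstat has to resolve the name, which is the step that failed.
            os.lstat(os.path.join(directory, name))
        except OSError as exc:
            if exc.errno == errno.ENOENT:
                unreachable.append(name)
            else:
                raise
    return unreachable
```

Calling it on the mount point should return every entry in that directory that is listed but cannot be opened. On a healthy filesystem it returns an empty list.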
Why This Happens
APFS stores directory entries in a B-tree. Each entry’s key contains the parent inode ID, the filename, and a precomputed hash of the filename. That hash exists for fast lookups: instead of scanning every entry, the filesystem driver (here apfs-fuse, not the kernel) can jump straight to the right B-tree node.
This creates a split between two operations:
- readdir (what ls uses): walks B-tree entries sequentially, no hashing needed
- lookup (what happens when you actually access a path): searches by hash
If apfs-fuse computes a different hash than what macOS stored, lookup returns ENOENT even though the entry is sitting right there. readdir still works because it never uses the hash. That’s the paradox.
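The split can be reproduced in a toy model. This is only a sketch: `zlib.crc32` stands in for the real APFS hash, `buggy_hash` is an invented driver bug, and `ToyDirectory` is a dict plus a list rather than a B-tree.

```python
import zlib


def stored_hash(name):
    # Stand-in for the hash macOS wrote into the key (NOT the real APFS hash).
    return zlib.crc32(name.encode("utf-8"))


def buggy_hash(name):
    # Stand-in for a driver that miscomputes the hash for some names:
    # here, any name containing a codepoint above U+FFFF.
    h = zlib.crc32(name.encode("utf-8"))
    return h ^ 1 if any(ord(c) > 0xFFFF for c in name) else h


class ToyDirectory:
    def __init__(self, names):
        # Each entry keeps the hash that was computed when it was created.
        self.entries = [(stored_hash(n), n) for n in names]
        self.index = {h: n for h, n in self.entries}

    def readdir(self):
        # Sequential walk: never recomputes a hash.
        return [n for _, n in self.entries]

    def lookup(self, name, hash_fn):
        # Keyed search: only succeeds if hash_fn agrees with the stored hash.
        found = self.index.get(hash_fn(name))
        return found if found == name else None


d = ToyDirectory(["some-other-dir", "\U0001FAE3"])
print(len(d.readdir()))                                 # 2: both entries are listed
print(d.lookup("some-other-dir", buggy_hash) is None)   # False: found
print(d.lookup("\U0001FAE3", buggy_hash) is None)       # True: the ENOENT case
```

The emoji entry shows up in `readdir()` and is still unreachable through `lookup()`, which is the same shape as the real failure.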

Finding the Inode
apfs-dump-quick reads raw APFS structures, bypassing FUSE entirely:
DirRec A6 84EBA405 '🫣' => D0 [DT_DIR]
Inode D0 => A6 D0 [TS] 8000 [INODE_NO_RSRC_FORK] 3 0 1C 0 [] 99 99 40755
Inode D0 (208). Contents of that directory:
DirRec D0 ... 'tapes' => B5 [DT_DIR]
DirRec D0 ... 'imgs' => A7 [DT_DIR]
The data is intact. The inodes are valid. The problem is purely in the lookup path.
The Root Cause
Unicode normalization. The same visual character can have multiple valid byte representations, and the Unicode standard defines rules for picking a canonical one.
Take é. You can encode it two ways:
- NFC (composed): single codepoint U+00E9, 2 bytes
- NFD (decomposed): e (U+0065) followed by a combining acute accent (U+0301), 3 bytes
They render identically. They are byte-for-byte different. A naive string comparison between them fails. NFC is what the web and Linux mostly use; NFD is what macOS historically used for HFS+.
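You can see the difference from Python with the standard `unicodedata` module. The variable names are mine. The byte counts are for UTF-8.

```python
import unicodedata

composed = "\u00e9"      # é as a single codepoint (NFC form)
decomposed = "e\u0301"   # e followed by a combining acute accent (NFD form)

print(len(composed.encode("utf-8")))                          # 2
print(len(decomposed.encode("utf-8")))                        # 3
print(composed == decomposed)                                 # False
print(unicodedata.normalize("NFD", composed) == decomposed)   # True
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
```

Any code that compares or hashes the raw bytes treats these as two different names unless it normalizes both sides first.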
macOS switched to APFS in 2017. APFS stores filenames as given, with no normalization enforced on the stored name, but the hash used for B-tree lookups is computed on an internally normalized form. That form is a variant of NFD, with additional rules for things like case folding on case-insensitive volumes.
apfs-fuse’s HashFilename() is reverse-engineered from the APFS spec and binary analysis. It handles the common cases correctly, but gets it wrong for some codepoints, particularly newer emoji. Emoji above a certain Unicode version weren’t around when most of the reverse-engineering happened, so they fall through edge cases in the hash function.
🫣 specifically (U+1FAE3, added in Unicode 14.0) has no decomposition, so its NFD and NFC forms are identical. The hash mismatch here is therefore more likely a bug in how HashFilename() handles codepoints in that range than a normalization issue, strictly speaking. The end result is the same though:
stored (macOS): 0x84EBA405
computed (apfs-fuse): 0x91FF...
Wrong hash, missed B-tree lookup, ENOENT.
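Two quick sanity checks, as a sketch. The first confirms that normalization leaves 🫣 unchanged. The second assumes that the 0x84EBA405 printed by apfs-dump-quick is the combined length-and-hash field, with the name length in the low 10 bits and the hash above it. That layout is my reading of Apple's File System Reference, and the constant names below are taken from it as I remember them, so treat both as assumptions.

```python
import unicodedata

peeking = "\U0001FAE3"  # the emoji from the directory name

# No decomposition: every normalization form leaves the string unchanged.
unchanged = all(
    unicodedata.normalize(form, peeking) == peeking
    for form in ("NFC", "NFD", "NFKC", "NFKD")
)

# Assumed layout of the dumped field: low 10 bits = length, rest = hash.
J_DREC_LEN_MASK = 0x000003FF
J_DREC_HASH_SHIFT = 10

field = 0x84EBA405
name_len = field & J_DREC_LEN_MASK
name_hash = field >> J_DREC_HASH_SHIFT

print(unchanged)                                    # True
print(name_len, len(peeking.encode("utf-8")) + 1)   # 5 5
print(hex(name_hash))                               # 0x213ae9
```

The low bits decode to 5, which matches the 4 UTF-8 bytes of the name plus a terminating NUL. That is at least consistent with the assumed layout.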
The Fix
In LookupName inside ApfsLib/ApfsDir.cpp, the B-tree lookup failure path used to just return false. Fall back to a linear scan instead:
 if (!rc)
 {
     if (g_debug & Dbg_Dir)
-        std::cout << "Lookup failed!" << std::endl;
+        std::cout << "Lookup failed! Trying linear scan..." << std::endl;
+
+    std::vector<DirRec> entries;
+    if (ListDirectory(entries, parent_id))
+    {
+        for (const auto &de : entries)
+        {
+            if (de.name == name)
+            {
+                res = de;
+                return true;
+            }
+        }
+    }
     return false;
 }
ListDirectory uses the sequential iterator, the same path as readdir. It’s $O(n)$ instead of $O(\log n)$, but it only runs when the hash lookup has already failed, so normal performance is untouched.
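The control flow of the patch, modeled in a few lines of Python. This is a sketch, not the apfs-fuse code: `ToyDir`, the `scans` counter, and the hash values are all made up for illustration.

```python
class ToyDir:
    def __init__(self):
        self.by_hash = {}    # fast path: stored hash -> name
        self.in_order = []   # slow path: every name, in listing order
        self.scans = 0       # how many times the fallback ran

    def add(self, name, stored_hash):
        self.by_hash[stored_hash] = name
        self.in_order.append(name)

    def lookup(self, name, computed_hash):
        # Fast path, analogous to the B-tree search by hash.
        if self.by_hash.get(computed_hash) == name:
            return True
        # Fallback, analogous to the patched failure path:
        # an exact-name linear scan over the sequential listing.
        self.scans += 1
        return any(entry == name for entry in self.in_order)


d = ToyDir()
d.add("imgs", 0x1111)
d.add("tapes", 0x2222)

print(d.lookup("imgs", 0x1111), d.scans)      # True 0: hash matched, no scan
print(d.lookup("tapes", 0x9999), d.scans)     # True 1: wrong hash, scan rescued it
print(d.lookup("missing", 0x3333), d.scans)   # False 2: genuinely absent
```

The scan also runs for names that really do not exist, so the cost lands on failed lookups, not on successful ones.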
After rebuilding and remounting, the directory was fully accessible.
Upstreamed as apfs-fuse#218.