Views of a Coder: May 2019

11 May 2019

Cloning APFS Volumes & Containers ("APFS inverter failed")

(This is an older article that I hadn't published back then because it might not be fully accurate, i.e. the steps may not be applicable. Yet, it contains some valuable information so I decided to publish it now. Read with a grain of salt. If you have corrections, don't hesitate to comment or email me.)

TL;DR

If Disk Utility fails to clone an APFS container, giving "APFS inverter failed" as error message, a work-around solution is to copy the partition with the "dd" command into a same-sized partition, then fix the cloned container's UUIDs.

About cloning (copying) entire Mac volumes

Usually, when you want to clone a Mac volume, you'd use Disk Utility's Restore operation. It performs a sector-by-sector copy (while skipping unused sector . The alternative would be to perform a file-by-file copy, as it's done by 3rd party tools like Carbon Copy Cloner.

A clone operation by copying individual files has several disadvantages over a sector copy:

It's much slower.
If you're using Finder Alias files, these may not work any more afterwards (that's because they rely on each file's unique ID, and those IDs change when copying files over to the destination).
More programs may request re-activation (re-registration).

However, when I recently tried to make an identical copy of my macOS Mojave system from my MacBook Air (2015), copying it to an external SSD, I ran into problems:

After the copy and verification operation was already apparently finished, an additional inverter process needs to be run on cloned APFS volumes. While I do understand that that's necessary when I copy only a single volume out of an APFS container, or copy the volume into a target APFS container without replacing it entirely, but even if I try to clone the entire container, with erasing the target, it still wants to run an inversion process - and that makes no sense to me.

Now, the error that I kept seeing is: APFS inverter failed to invert the volume

And I'm not the only one, see here and here.

I tried many things, including First Aid and removal of all snapshots, and running the cmdline tool "asr", which showed me more detailed error messages. Still, no success. I keep getting variations of the same issue.

What I'm showing now is a way to clone a complete APFS container (with all contained volumes) the way it should work.

How to clone an APFS container

Note: This may not work with encrypted volumes. Or it might. I have not tried.

We simply copy the entire partition (which contains the APFS container) sector by sector. (Small disadvantage over Disk Utility's Restore operation: This will also copy unused sectors, so it'll take a bit longer.)

Afterwards, we need to change the UUIDs of the cloned container and its volumes, or the Mac (and especially Disk Utility) may get confused when both the original and the cloned volumes are present on the same Mac.

Perform the sector copy operation

In case you want to copy your bootable macOS system, you will have to first start up from a different macOS system. If you have no other external disk or partition with another macOS system, simply start up from the Recover system.

When ready, connect the target disk, then start Terminal.app.

Get an overview of our disk names by entering (and always pressing Return afterwards):

diskutil list

Here's a sample output of my Mac that has four partitions, two of which use HFS+ ("AirElCap", "AirData") and the other two use APFS ("AirMojave", "AirHighSierra"):

/dev/disk0 (internal):

#: TYPE NAME SIZE IDENTIFIER

0: GUID_partition_scheme 121.3 GB disk0

1: EFI EFI 314.6 MB disk0s1

2: Apple_APFS Container disk1 40.9 GB disk0s2

3: Apple_HFS AirElCap 20.1 GB disk0s3

4: Apple_Boot Boot OS X 134.2 MB disk0s4

5: Apple_APFS Container disk2 39.8 GB disk0s5

   6: Apple_HFS AirData 19.9 GB disk0s6

/dev/disk1 (synthesized):

   #: TYPE NAME SIZE IDENTIFIER

   0: APFS Container Scheme - +40.9 GB disk1

   Physical Store disk0s2

   1: APFS Volume AirMojave 20.7 GB disk1s1

   2: APFS Volume Preboot 47.1 MB disk1s2

   3: APFS Volume Recovery 512.7 MB disk1s3

   4: APFS Volume VM 2.1 GB disk1s4

/dev/disk2 (synthesized):

   #: TYPE NAME SIZE IDENTIFIER

   0: APFS Container Scheme - +39.8 GB disk2

   Physical Store disk0s5

   1: APFS Volume AirHighSierra 23.2 GB disk2s1

   2: APFS Volume Preboot 20.9 MB disk2s2

   3: APFS Volume Recovery 519.0 MB disk2s3

   4: APFS Volume VM 3.2 GB disk2s4

I plan to clone "AirMojave", so the disk identifier would be disk1, because that is the container for that volume. In your case it may be a different identifier. I will write the commands below using diskS for the source container and diskTsP for the target partition.

Now unmount all volumes of our source container. In Terminal, enter:

diskutil unmountDisk diskS

If this does not succeed, then you either have programs or files open on one of the volumes or you're trying to unmount the boot volume (in that case, you should have restarted into a different system as explained above). Do not continue if the mount was not successful or you're likely to end up with a corrupted clone.

diskutil apfs list diskS

+-- Container disk2 5099518D-36C0-47CC-B034-0CDA52C51CE8

====================================================

APFS Container Reference: disk2

Size (Capacity Ceiling): 39838416896 B (39.8 GB)

Capacity In Use By Volumes: 27042205696 B (27.0 GB)

Capacity Not Allocated: 12796211200 B (12.8 GB)

sudo diskutil resizeVolume diskTsP size

In place of size, use the number that's show after "Size (Capacity Ceiling):".

diskutil unmount diskTsP

This is the command that will perform the copying:

sudo dd bs=1m if=/dev/diskS of=/dev/diskTsP

(See also Adrian's comment below, suggesting to use "rdisk" instead of "disk" for higher speed.)

While dd is doing its work, it won't show any progress on its own. And it can take hours, even days. To check where it's at, type ctrl-T in the Terminal window. That will print a line showing its current progress.

Fix the UUIDs

Enter in Terminal:

sudo /System/Library/Filesystems/apfs.fs/Contents/Resources/apfs.util -s /dev/diskT

You'll likely not get any response from this command, telling you success or failure.

To check whether the UUIDs have been changed successfully, enter in Terminal:

diskutil apfs list

In the output, find your source and target disks, and compare the UUIDs of their container and container volumes. They should all be different. If the UUIDs of source and target volumes show the same code, then the fix did not work. Try again.

Make the system volume bootable

The last step is to make the cloned volume bootable again (in case it was before). For that, you need to mount the Preboot volume, then rename the single folder inside, which contains the old UUID, into the new UUID of the main volume.

To mount it:

diskutil mount diskTsX

(X stands for the partition number containing the "Preboot" volume)

To show it in Finder:

open /Volumes/Preboot

Now you should see a Finder window with a folder named after the UUID of the original volume you copied. Rename that folder into the new UUID. To verify that you used the correct UUID, open System Preferences, Startup Disk, select the cloned bootable volume and click the Restart button. If it says that it can't bless it, you got it wrong. Otherwise, the system should now restart, meaning the UUID was correctly set to a bootable volume.

Understanding bugs in Xojo, or not getting them fixed

A former employee of Xojo Inc. once met with an Apple Developer Support (DTS) engineer, looking over some code. The Apple engineer saw a note mentioning my name and told the Xojo employee: You know Thomas Tempelmann? I know him, too: "He's the best type of user, the one that debugs a problem so far so that he tells you exactly what you're doing wrong, and how to fix it."

I'm not infallible, but I believe I can claim that I have quite some experience and understanding of a lot of things under a computer's hood. After all, I've been doing this for nearly 40 years. I'm not so great with abstract algorithms, but when it comes to writing efficient code or debugging it, I'm surely not the best, but have skills that are well above average.

And I've proven that a lot of times with a development system I really love to use: Xojo, formerly known as REALbasic.

I've been using Xojo (or RB) for about 20 years now. I've been one of the first to write plugins for it, and they were quite popular (the plugins eventually became part of the MBS plugins).

Here are a few examples of bugs and solutions I found in Xojo:

Back when we still used 68k CPUs, there was a serious issue that only a few customers had: Their apps crashed when they got large. I had some suspicions, looked at the compiled code and soon found the issue - an overflow with a 16 bit offset in a jmp instruction. Worth noting here was that previously the Xojo engineers were not able to find the bug. Yet I, without even having the source code, found it within an hour or so. Because I wrote a 68k compiler once myself and knew what could go wrong. I also like to point out that Xojo's current EULA prohibits us from looking at the compiled code the way I did, in order to learn how it works. I later argued against that, even mentioning this example where I had to do that because no one else was able to do find the bug, to no avail. This was not the last time I had to dig into Xojo's code in order to work around a bug when Xojo refused to look into it, and solved the issue for myself, but I can't publish them any more to the benefit of others because, well, Xojo's EULA and the threat it emcompasses.
In 2008, I ran into a very rare case where I'd lose some data when I had 10,000s of objects in a particular data structure (WeakRefs, IIRC). Turned out that the internal dictionary code did not handle collisions correctly. After proposing a code change, the reproducible issue was gone. I had to personally pursue this issue down into the code because, again, the resposible engineer (who was, admittedly, not its original author) did not even believe in the problem I described. Regardless, I was able to get the fix into the framework due to some lucky circumstances, for all of us.
The Xojo IDE's Back (History) button does not work reliably, since 2013 when this IDE was introduced. More often than not, the Back button simply does not go back to previously visited locations. I have then, as a proof-of-concept, written an external program that talks to the IDE via the IDE communication socket, regularly requesting the current location, and offering a list with the history. You can then click any history item and the IDE actually jumps back to it. Surprisingly, this works more reliably than the IDE's own back button. Yet, when Xojo CEO Geoff Perlman was recently asked at the MBS conference in Munich about this shortcoming, he insisted that this is a very complex matter that is not easy to solve. I find that hard to believe if even I can do better with an external program.
Xojo code can use Threads, but can run them only cooperatively, not concurrently. That's because the runtime functions are not using locking to protect the sensitive operations such as object creation against interruption by another thread. Thus, Xojo's runtime has its own thread scheduler that uses semaphores to make sure only one Xojo thread runs at any time. Now, there are cases where we users need to use Declare statements to invoke OS-provided functions. Some of them may even call back into our own Xojo functions. But if those callbacks happen on threads that Xojo does not control, this can lead to crashes when our Xojo code then accesses Xojo objects. I've come up with a proposal to make this safe, effectively by using locks that suspect the callback task until Xojo is in a safe state. Apart from the possibility of creating a deadlock (which is under the control of the programmer), I was able to supply a demo project that yet has to be proven not to be stable. Xojo, however, ignores all my explanations and demonstrations and simply keeps telling their users that this can't ever be safe.
Related to callbacks, there's also a long-known issue with passing function addresses to the OS. This is done by creating a so-called delegate object via the AddressOf operator. The delegate object can be used inside Xojo like a function variable, i.e. one can store the address of a function in a property and call it later. This even works for object instance methods, i.e. methods that are part of an object and have a "self" reference. This self reference is simply a pointer ot the object that gets passed to it when it's invoked from an object. A delegate stores this object reference and passes it to the function if necessary. However, when passing such as delegate to a OS function (via Declare), then Xojo does not pass a pointer into the stub function that sets up the self reference but instead passes the target function address. That means that if the callback is invoked, the self reference is not properly set up, leading to a crash. The issue is known for long, and in the past bug reports of this kind have been closed as "works as designed". Since this could be fixed, the official answer sounds like an excuse for "we don't like to deal with it" or "we don't really care" to me.

All this shows that there are a lot of things in Xojo, mostly low level, that could be fixed - but they don't - because of a lack of comprehension. Sadly, in many cases where I offered solutions, even proofs, I hit a wall. I don't understand how a company that specifically caters to developers can be so ignorant to the needs and offerings of their willing customers.