Backing Up Dataless Files

What dataless files are

A dataless file is a file whose metadata (name, size, modification date, etc.) exists on local disk, but whose content lives only in the cloud. On macOS, dataless files are created by Apple’s FileProvider framework, which is what backs iCloud Drive and third-party cloud storage providers like Dropbox, Google Drive, OneDrive, and Box.

When disk space is tight, or when “Optimize Mac Storage” is enabled, macOS evicts the contents of files you haven’t used recently. The file’s icon shows a small cloud badge in Finder. Reading the file (for example, opening it in an app) tells FileProvider to download the content again — a process called materialization.
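As an illustration of how a tool might recognize a dataless file without triggering a download, the following Python sketch checks the SF_DATALESS bit in a file's BSD flags. The constant's value is taken from the macOS <sys/stat.h> header; the function names are made up for this example, and nothing here is taken from Arq's actual implementation.

```python
import os

# Value of SF_DATALESS in the macOS <sys/stat.h> header.
SF_DATALESS = 0x40000000


def flags_mark_dataless(st_flags: int) -> bool:
    """Return True if the BSD file flags include SF_DATALESS."""
    return bool(st_flags & SF_DATALESS)


def is_dataless(path: str) -> bool:
    """Inspect only the file's metadata, so no content is downloaded."""
    st = os.lstat(path)
    # st_flags exists on macOS and the BSDs; treat it as 0 elsewhere.
    return flags_mark_dataless(getattr(st, "st_flags", 0))
```

Because the check uses only metadata, it can run on every file in a scan without causing any materialization.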

Why dataless files can’t be backed up via APFS snapshot

By default, Arq takes an APFS snapshot of the volume at the start of a backup, then reads files from the snapshot. This gives Arq a consistent point-in-time view of the filesystem and avoids problems caused by files changing during the backup.

Dataless files don’t work with this approach. The snapshot captures the placeholder (the metadata stub) but not the file’s contents, because the contents aren’t on the local disk at the moment the snapshot is taken. Worse, reading a dataless file from inside a snapshot does not trigger materialization — FileProvider only materializes files in the live filesystem, not in snapshots.

To back up a dataless file, Arq has to bypass the snapshot and read the file live from its real path. Reading it live triggers FileProvider to download the content, and Arq backs up the materialized content. Because the read happens outside the snapshot, the file could in principle be modified during the backup; Arq accepts this trade-off because it’s the only way to capture the file’s contents at all.
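The choice between the two read paths described above can be sketched roughly as follows. This is a simplified illustration, not Arq's code: the function name, its parameters, and the idea that the snapshot is mounted at a separate directory are all assumptions made for the example.

```python
from pathlib import Path


def path_to_read(live_path: Path, volume_root: Path,
                 snapshot_mount: Path, dataless: bool) -> Path:
    """Pick where to read a file's content from during a backup."""
    if dataless:
        # Only a read on the live filesystem triggers materialization.
        return live_path
    # Regular files are read from the point-in-time snapshot instead.
    return snapshot_mount / live_path.relative_to(volume_root)
```

For example, with a hypothetical snapshot mounted at /tmp/snap, a regular file at /Users/a/f.txt would be read from /tmp/snap/Users/a/f.txt, while a dataless file would be read from /Users/a/f.txt directly.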

Why “Operation timed out” errors can appear

Materialization is performed by macOS daemons (bird for iCloud Drive, fileproviderd for third-party providers) that run inside the target user’s login session. They will only download content on behalf of a process that’s also running in that session. If the user account that owns the dataless file isn’t currently logged in, FileProvider has no session to act in, the download never starts, and the read eventually fails with “Operation timed out”.

This is why Arq can only back up a user’s dataless files while that user is logged in. Backups that run when no user is logged in (for example, scheduled overnight backups on a Mac that’s been left at the login window) will time out on every dataless file that needs to be backed up.
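On macOS, “Operation timed out” is the message text for the ETIMEDOUT error code. A backup tool could treat that specific error as “this file could not be materialized” and carry on with the rest of the backup. The sketch below shows one way to do that; the function, its parameters, and the skip list are hypothetical and do not describe how Arq reports the error.

```python
import errno
from typing import Callable, List, Optional


def read_dataless(path: str, read_file: Callable[[str], bytes],
                  skipped: List[str]) -> Optional[bytes]:
    """Read a dataless file live; record it as skipped if the read times out."""
    try:
        return read_file(path)
    except OSError as e:
        if e.errno == errno.ETIMEDOUT:
            # The download never started, e.g. the file's owner is not logged in.
            skipped.append(path)
            return None
        raise  # any other error is not specific to dataless files
```

Passing the reader in as a callable keeps the sketch independent of how the file is actually opened and read.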

How Arq evicts files it materialized

Materializing a dataless file writes its content to local disk. If Arq left every materialized file in that state, a single backup could consume gigabytes of disk space that the user had deliberately freed up by letting macOS evict the content.

To avoid this, Arq tracks every dataless file it materializes during a backup and evicts each one after backing it up — telling FileProvider to discard the local content and return the file to its dataless state.

If eviction fails (for example, because FileProvider is unresponsive), the file simply stays materialized on local disk. The cloud provider’s own background cleanup will eventually re-evict it.
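The track-then-evict behavior described in this section could look something like the following sketch. The class name and the evict callable are hypothetical; the real eviction request goes through FileProvider, which this example deliberately leaves abstract.

```python
from typing import Callable, List, Set


class EvictionTracker:
    """Remember which files this backup materialized and evict them afterwards."""

    def __init__(self, evict: Callable[[str], None]):
        self._evict = evict  # stand-in for the real eviction request
        self._materialized_by_us: Set[str] = set()
        self.left_materialized: List[str] = []

    def note_materialized(self, path: str) -> None:
        self._materialized_by_us.add(path)

    def after_backup(self, path: str) -> None:
        # Files that were already local before the backup are left alone.
        if path not in self._materialized_by_us:
            return
        self._materialized_by_us.discard(path)
        try:
            self._evict(path)
        except OSError:
            # Eviction failure is tolerated: the file just stays on disk.
            self.left_materialized.append(path)
```

Tracking only the files the backup itself materialized means content the user chose to keep local is never evicted as a side effect.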

Arq only materializes changed files

Materializing a dataless file is expensive because its content has to be downloaded from the cloud. To minimize that cost, Arq compares each dataless file’s metadata (size, modification date, etc.) against the previous backup record before deciding to materialize it. If the file is unchanged since the last backup, Arq reuses the previously backed-up content and skips materialization entirely.

This means the first backup of a folder full of dataless files is slow (every file has to be downloaded), but subsequent backups touch only the files the user has actually modified.
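A minimal sketch of that metadata comparison is shown below. The record type and its two fields are assumptions chosen for the example; the actual backup record may compare more attributes than size and modification time.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FileRecord:
    size: int       # logical size in bytes, available even while dataless
    mtime_ns: int   # modification time in nanoseconds


def needs_materialization(current: FileRecord,
                          previous: Optional[FileRecord]) -> bool:
    """Download the content only if the file is new or its metadata changed."""
    return previous is None or current != previous
```

On a first backup there is no previous record, so every dataless file is materialized; on later backups only files whose size or modification time differs are downloaded.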