Crate File Deduplication - Comprehensive Report
Target
OpenUSD v25.08 (Crate format 0.8.0)
Overview
The USD Crate file format implements a sophisticated multi-level deduplication system to minimize file size and optimize memory usage. Deduplication occurs during the write phase when packing data into the binary format.
Source File: pxr/usd/sdf/crateFile.cpp
Key Principle: Write each unique value exactly once, then reference it by offset or index.
Deduplication Levels
1. Structural Deduplication (Global)
Implemented in _PackingContext (lines 896-1033), these tables deduplicate fundamental structural elements across the entire file:
| Table | Type | Purpose | Location |
|---|---|---|---|
| tokenToTokenIndex | unordered_map<TfToken, TokenIndex> | Dedup all tokens | Line 1013 |
| stringToStringIndex | unordered_map<string, StringIndex> | Dedup all strings | Line 1014 |
| pathToPathIndex | unordered_map<SdfPath, PathIndex> | Dedup all paths | Line 1015 |
| fieldToFieldIndex | unordered_map<Field, FieldIndex> | Dedup all fields | Line 1016 |
| fieldsToFieldSetIndex | unordered_map<vector<FieldIndex>, FieldSetIndex> | Dedup field sets | Lines 1019-1020 |
Initialization: These tables are populated in parallel during _PackingContext construction (lines 917-973).
Persistence: Structural elements are written to dedicated sections:
- TOKENS section (line 227)
- STRINGS section (line 228)
- FIELDS section (line 229)
- FIELDSETS section (line 230)
- PATHS section (line 231)
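The index-table pattern behind these maps can be sketched as follows. This is a minimal illustration rather than the _PackingContext code: InternTable and Intern are hypothetical names, and std::string stands in for TfToken, SdfPath, and the other key types.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for one structural dedup table
// (for example tokenToTokenIndex); not the actual OpenUSD type.
class InternTable {
public:
    // Returns the index of `value`, appending it only on first sight.
    uint32_t Intern(const std::string &value) {
        auto result = _index.emplace(
            value, static_cast<uint32_t>(_values.size()));
        if (result.second) {
            _values.push_back(value);  // first occurrence: store once
        }
        return result.first->second;
    }

    // The unique values in index order, as they would be laid out
    // in a dedicated section of the file.
    const std::vector<std::string> &Values() const { return _values; }

private:
    std::unordered_map<std::string, uint32_t> _index;
    std::vector<std::string> _values;
};
```

Every later reference to the same value costs only an index, which is why repeated property and type names add very little to the file.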
2. Value-Level Deduplication (Per-Type)
Implemented in _ValueHandler<T> template (lines 1593-1737), this deduplicates actual data values:
template <class T>
struct _ValueHandler : _ValueHandlerBase {
// Dedup map for scalar values
std::unique_ptr<std::unordered_map<T, ValueRep, _Hasher>> _valueDedup;
// Dedup map for array values
std::unique_ptr<std::unordered_map<VtArray<T>, ValueRep, _Hasher>> _arrayDedup;
};
Key Characteristics:
- One dedup map per concrete type (e.g., separate maps for int, float, GfVec3f)
- Lazy allocation: maps are created on first use (lines 1612-1615, 1658-1661)
- Cleared after write via the Clear() method (lines 1724-1731)
Value Classification
Category 1: Always Inlined (No Dedup Needed)
Definition (lines 239-246):
template <class T>
struct _IsAlwaysInlined : std::integral_constant<
bool, sizeof(T) <= sizeof(uint32_t) && _IsBitwiseReadWrite<T>::value> {};
// Special cases always inlined:
template <> struct _IsAlwaysInlined<string> : std::true_type {};
template <> struct _IsAlwaysInlined<TfToken> : std::true_type {};
template <> struct _IsAlwaysInlined<SdfPath> : std::true_type {};
template <> struct _IsAlwaysInlined<SdfAssetPath> : std::true_type {};
Examples:
- bool, uint8_t, int32_t, float (≤4 bytes and bitwise read/write)
- string, TfToken, SdfPath, SdfAssetPath (via index lookup)
Storage: Value stored directly in ValueRep.payload (32 bits for small types, index for strings/tokens/paths)
Structural Dedup: While inlined in ValueReps, the underlying strings/tokens/paths are still deduplicated in their respective tables.
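The classification can be checked at compile time. The sketch below mirrors the trait under one assumption: std::is_trivially_copyable stands in for _IsBitwiseReadWrite, whose definition is not shown in this report. IsAlwaysInlinedSketch is a hypothetical name.

```cpp
#include <cstdint>
#include <string>
#include <type_traits>

// Assumption: std::is_trivially_copyable approximates the library's
// _IsBitwiseReadWrite trait.
template <class T>
struct IsAlwaysInlinedSketch : std::integral_constant<
    bool, sizeof(T) <= sizeof(uint32_t) &&
          std::is_trivially_copyable<T>::value> {};

// Index-backed types are inlined regardless of their size, because the
// payload holds a table index rather than the value itself.
template <>
struct IsAlwaysInlinedSketch<std::string> : std::true_type {};

static_assert(IsAlwaysInlinedSketch<bool>::value, "1 byte fits");
static_assert(IsAlwaysInlinedSketch<float>::value, "4 bytes fit");
static_assert(!IsAlwaysInlinedSketch<double>::value, "8 bytes do not fit");
static_assert(IsAlwaysInlinedSketch<std::string>::value, "stored as index");
```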
Category 2: Conditionally Inlined
Some values of a type might fit in 4 bytes even if the type is larger.
Implementation (lines 1602-1609):
// Try to encode value in 4 bytes
uint32_t ival = 0;
if (_EncodeInline(val, &ival)) {
auto ret = ValueRepFor<T>(ival);
ret.SetIsInlined();
return ret; // No dedup needed
}
Use Case: Optimizes storage for values that happen to be small, even if the type allows larger values.
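What an inline encoder might look like is sketched below. The report does not show the body of _EncodeInline, so both rules here are assumptions chosen for illustration: a 64-bit integer is inlined when it fits in int32_t, and a double is inlined when it survives a round trip through float. TryEncodeInline is a hypothetical name.

```cpp
#include <cstdint>
#include <cstring>
#include <limits>

// Hypothetical encoder: an int64_t is inlined only when it fits in int32_t.
inline bool TryEncodeInline(int64_t value, uint32_t *out) {
    if (value < std::numeric_limits<int32_t>::min() ||
        value > std::numeric_limits<int32_t>::max()) {
        return false;  // too wide: caller falls back to dedup plus offset
    }
    const int32_t narrow = static_cast<int32_t>(value);
    std::memcpy(out, &narrow, sizeof(narrow));
    return true;
}

// Hypothetical encoder: a double is inlined only when converting it to
// float and back reproduces exactly the same value.
inline bool TryEncodeInline(double value, uint32_t *out) {
    // Range check first, so the narrowing conversion is well defined.
    // NaN and infinities fail this test and are never inlined here.
    if (!(value >= -std::numeric_limits<float>::max() &&
          value <= std::numeric_limits<float>::max())) {
        return false;
    }
    const float narrow = static_cast<float>(value);
    if (static_cast<double>(narrow) != value) {
        return false;  // precision would be lost
    }
    std::memcpy(out, &narrow, sizeof(narrow));
    return true;
}
```

Under these rules 0.5 is inlined while 0.1 is not, because 0.1 has no exact float representation.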
Category 3: Value-Deduplicated
Values too large to inline are deduplicated.
Pack Algorithm (lines 1611-1625):
// Lazy allocate dedup map
if (!_valueDedup) {
_valueDedup.reset(new typename decltype(_valueDedup)::element_type);
}
// Try to insert value
auto iresult = _valueDedup->emplace(val, ValueRep());
ValueRep &target = iresult.first->second;
if (iresult.second) {
// First occurrence - write to file
target = ValueRepFor<T>(writer.Tell());
writer.Write(val);
}
return target; // Existing or new offset
How It Works:
- Hash the value and check map
- If new: Write to file, store offset in map
- If duplicate: Return existing offset
- All duplicates reference same file location
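The steps above can be reproduced in a self-contained form. In this sketch ByteSink is a hypothetical in-memory stand-in for the crate writer, and DoubleDeduper plays the role of a single _ValueHandler<double> with a plain file offset in place of a ValueRep.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical in-memory stand-in for the output file.
struct ByteSink {
    std::vector<unsigned char> bytes;

    uint64_t Tell() const { return bytes.size(); }

    void Write(const void *data, std::size_t n) {
        const unsigned char *p = static_cast<const unsigned char *>(data);
        bytes.insert(bytes.end(), p, p + n);
    }
};

// Writes each distinct double once and returns its offset. Repeated
// values get the offset recorded for their first occurrence.
class DoubleDeduper {
public:
    uint64_t Pack(ByteSink &sink, double value) {
        auto result = _offsets.emplace(value, uint64_t{0});
        if (result.second) {
            // First occurrence: remember where it lands, then write it.
            result.first->second = sink.Tell();
            sink.Write(&value, sizeof(value));
        }
        return result.first->second;
    }

private:
    std::unordered_map<double, uint64_t> _offsets;
};
```

Packing 0.1, 0.2, and 0.1 again grows the sink by two doubles only; the third call returns the first offset.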
Category 4: Array Deduplication
Arrays use a separate dedup map.
Pack Algorithm (lines 1651-1680):
ValueRep PackArray(_Writer w, VtArray<T> const &array) {
auto result = ValueRepForArray<T>(0);
// Empty arrays always inlined (payload = 0)
if (array.empty())
return result;
// Check array dedup map
if (!_arrayDedup) {
_arrayDedup.reset(new typename decltype(_arrayDedup)::element_type);
}
auto iresult = _arrayDedup->emplace(array, result);
ValueRep &target = iresult.first->second;
if (iresult.second) {
// First occurrence - write array
if (writeVersion < Version(0,5,0)) {
// Old format
} else {
// Possibly compressed
target = _WritePossiblyCompressedArray(w, array, writeVersion, 0);
}
}
return target;
}
Special Cases:
- Empty arrays: Always inlined with payload=0 (lines 1654-1656)
- Compressed arrays: Deduped at compressed representation level (lines 1675-1676)
Implementation Details
ValueRep Structure
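The ValueRep is the core structure storing value references. Conceptually it carries the following fields (the actual type packs them into a single 64-bit word):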
struct ValueRep {
uint64_t payload; // Offset in file OR inlined value
TypeEnum type;
bool isInlined;
bool isArray;
bool isCompressed;
};
Usage:
- Inlined: payload contains the value directly (or an index)
- Not inlined: payload contains a file offset
- Dedup benefit: Multiple ValueReps can share the same offset
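One way to fit these fields into 64 bits is sketched below. The bit positions (three flag bits at the top, one type byte, a 48-bit payload) are assumptions made for illustration, and PackedRep is a hypothetical name; consult crateFile.h for the authoritative layout.

```cpp
#include <cstdint>

// Assumed bit layout, for illustration only.
constexpr uint64_t kIsArrayBit      = 1ull << 63;
constexpr uint64_t kIsInlinedBit    = 1ull << 62;
constexpr uint64_t kIsCompressedBit = 1ull << 61;
constexpr uint64_t kPayloadMask     = (1ull << 48) - 1;

// Hypothetical packed value reference.
class PackedRep {
public:
    PackedRep(uint8_t type, bool inlined, bool array, uint64_t payload)
        : _data((array ? kIsArrayBit : 0ull) |
                (inlined ? kIsInlinedBit : 0ull) |
                (static_cast<uint64_t>(type) << 48) |
                (payload & kPayloadMask)) {}

    bool IsArray() const { return (_data & kIsArrayBit) != 0; }
    bool IsInlined() const { return (_data & kIsInlinedBit) != 0; }
    bool IsCompressed() const { return (_data & kIsCompressedBit) != 0; }
    void SetIsCompressed() { _data |= kIsCompressedBit; }

    uint8_t Type() const {
        return static_cast<uint8_t>((_data >> 48) & 0xFF);
    }
    uint64_t Payload() const { return _data & kPayloadMask; }

    // Two reps are equal when every field matches, so references to a
    // deduplicated value compare equal as plain integers.
    bool operator==(const PackedRep &other) const {
        return _data == other._data;
    }

private:
    uint64_t _data;
};
```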
Hashing Strategy
Dedup maps use _Hasher (line 1013-1014, 1733):
std::unordered_map<T, ValueRep, _Hasher>
Requirements for Type T:
- Must be hashable via _Hasher
- Must have equality comparison
- Must be copyable (for map storage)
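To make the three requirements concrete, here is a hasher for float arrays that satisfies them. It is a sketch, not the library's _Hasher: std::vector<float> stands in for VtArray<float>, FloatArrayHasher is a hypothetical name, and the mixing step is the common hash-combine idiom.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <vector>

// Hypothetical hasher for float arrays; stands in for _Hasher.
struct FloatArrayHasher {
    std::size_t operator()(const std::vector<float> &values) const {
        std::size_t seed = values.size();
        for (float v : values) {
            // Hash each element, then fold it into the running seed.
            seed ^= std::hash<float>()(v) + 0x9e3779b9u +
                    (seed << 6) + (seed >> 2);
        }
        return seed;
    }
};

// Key: array contents (copied into the map, compared with operator==).
// Mapped value: file offset of the first copy written.
using FloatArrayDedupMap =
    std::unordered_map<std::vector<float>, uint64_t, FloatArrayHasher>;
```

Because the key is a full copy of the array, both hashing time and map memory grow with array length.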
Memory Management
Lazy Allocation (lines 1612-1615):
if (!_valueDedup) {
_valueDedup.reset(new typename decltype(_valueDedup)::element_type);
}
- Maps only created when first non-inlined value encountered
- Reduces memory for files with only inlined values
Cleanup (lines 1724-1731):
void Clear() {
if constexpr (!_IsAlwaysInlined<T>::value) {
_valueDedup.reset();
}
if constexpr (_SupportsArray<T>::value) {
_arrayDedup.reset();
}
}
Deduplication Workflow
Write Phase
1. CrateFile::_PackValue(VtValue)
   ↓
2. Determine type T from VtValue
   ↓
3. Get _ValueHandler<T> for this type
   ↓
4. Check if value can be inlined
   ├─ YES → Store in ValueRep.payload (4 bytes)
   └─ NO → Check _valueDedup map
        ├─ EXISTS → Return existing ValueRep with offset
        └─ NEW → Write value, store offset in map, return new ValueRep
Read Phase
1. CrateFile::UnpackValue(ValueRep)
   ↓
2. Check ValueRep.isInlined
   ├─ YES → Extract value from payload
   └─ NO → Seek to payload offset, then read value from file
Key Insight: Dedup is transparent to readers - they just follow offsets.
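A reader following this flow needs no knowledge of the dedup maps. The sketch below uses a simplified, hypothetical SimpleRep, and assumes for illustration that an inlined double is stored as float bits in the payload.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical simplified rep: an inlined 32-bit payload or a byte offset.
struct SimpleRep {
    bool isInlined;
    uint64_t payload;
};

// Reads a double. Inlined reps are decoded from the payload (assumed to
// hold float bits); all others are read from the file at the offset.
// Returns false if the offset does not leave room for a full double.
inline bool UnpackDouble(const std::vector<unsigned char> &file,
                         SimpleRep rep, double *out) {
    if (rep.isInlined) {
        const uint32_t bits = static_cast<uint32_t>(rep.payload);
        float narrow = 0.0f;
        std::memcpy(&narrow, &bits, sizeof(narrow));
        *out = narrow;
        return true;
    }
    if (rep.payload > file.size() ||
        file.size() - rep.payload < sizeof(double)) {
        return false;  // corrupt or truncated input
    }
    std::memcpy(out, file.data() + rep.payload, sizeof(double));
    return true;
}
```

Two reps holding the same offset simply read the same bytes twice.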
Array Compression Integration
Combined Optimization (Version 0.5.0+)
Arrays can be both deduplicated and compressed (lines 1673-1677):
if (writeVersion >= Version(0,5,0)) {
target = _WritePossiblyCompressedArray(w, array, writeVersion, 0);
}
Compression Types (lines 1786-1893):
- Integer Compression (int, uint, int64, uint64)
  - Uses Sdf_IntegerCompression / Sdf_IntegerCompression64
  - Minimum array size: 16 elements (line 1740)
- Float Compression (GfHalf, float, double)
  - As Integers: If all values are exactly representable as int32 (lines 1828-1848)
  - Lookup Table: If there are few distinct values (<1024, ≤25% of size) (lines 1850-1886)
  - Uncompressed: Otherwise
- Other Types: Uncompressed
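The float decision can be read as a small classifier. The thresholds below come from the list above; the function and enum names are hypothetical, the 16-element minimum is assumed to apply to floats as well, and distinct values are counted by bit pattern to keep the sketch simple.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <set>
#include <vector>

enum class FloatEncoding { Uncompressed, AsIntegers, LookupTable };

// Hypothetical classifier for a float array.
inline FloatEncoding ChooseFloatEncoding(const std::vector<float> &values) {
    const std::size_t kMinCompressedSize = 16;  // assumed to cover floats
    const std::size_t kMaxTableEntries = 1024;

    if (values.size() < kMinCompressedSize) {
        return FloatEncoding::Uncompressed;
    }

    // Strategy 1: every value is a whole number within int32 range.
    bool allWholeInt32 = true;
    for (float v : values) {
        const bool inRange = v >= -2147483648.0f && v < 2147483648.0f;
        if (!inRange || std::trunc(v) != v) {
            allWholeInt32 = false;
            break;
        }
    }
    if (allWholeInt32) {
        return FloatEncoding::AsIntegers;
    }

    // Strategy 2: fewer than 1024 distinct values, and no more than
    // 25% of the element count.
    std::set<uint32_t> distinct;
    for (float v : values) {
        uint32_t bits = 0;
        std::memcpy(&bits, &v, sizeof(bits));
        distinct.insert(bits);
        if (distinct.size() >= kMaxTableEntries) {
            return FloatEncoding::Uncompressed;
        }
    }
    if (distinct.size() * 4 <= values.size()) {
        return FloatEncoding::LookupTable;
    }
    return FloatEncoding::Uncompressed;
}
```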
Dedup + Compression:
- Arrays deduplicated at compressed representation level
- Two identical arrays compressed the same way → same offset
- Different compression of same logical array → different entries (rare)
Performance Characteristics
Time Complexity
| Operation | Complexity | Notes |
|---|---|---|
| Lookup in dedup map | O(1) average | Hash map lookup |
| Insert in dedup map | O(1) average | Hash map insert |
| Hash computation | O(n) | n = value size (array length, etc.) |
| Write value | O(n) | Only on first occurrence |
Space Complexity
Memory Overhead:
- Per type: sizeof(unordered_map) + entries * (sizeof(T) + sizeof(ValueRep))
- For large arrays: Can be significant
- Mitigated by: Lazy allocation, cleared after write
File Size Savings:
- Highly data-dependent
- Best case: Many duplicates → linear reduction
- Worst case: All unique → small overhead (map structure)
Real-World Benefits
High Dedup Scenarios:
- Tokens/Strings: USD uses many repeated property names, type names
- Paths: Hierarchical paths share prefixes (deduplicated)
- Default Values: Many attributes share defaults (e.g., GfVec3f(0,0,0))
- Time Samples: Common time arrays across multiple attributes
- Metadata: Repeated dictionary entries
Low Dedup Scenarios:
- Unique geometry data (positions, normals)
- Random/noise values
- Unique identifiers
Code Examples
Example 1: String Deduplication
// Writing three properties with the same string value (illustrative pseudocode)
crate->Set(path1, "documentation", VtValue("Hello")); // Added to STRINGS as StringIndex(42)
crate->Set(path2, "documentation", VtValue("Hello")); // Reuses StringIndex(42)
crate->Set(path3, "comment", VtValue("Hello")); // Reuses StringIndex(42)
// File contains "Hello" exactly once
Process:
- First "Hello" → Added to stringToStringIndex → StringIndex(42)
- Second and third "Hello" → Found in map → StringIndex(42)
- String written once to the STRINGS section
Example 2: Array Deduplication
VtArray<float> zeros(1000, 0.0f);
crate->Set(path1, "data", VtValue(zeros)); // Compressed as integers, offset 5000
crate->Set(path2, "data", VtValue(zeros)); // Reuses offset 5000
crate->Set(path3, "data", VtValue(zeros)); // Reuses offset 5000
// Array written and compressed exactly once
Process:
- First array → Compressed via integer encoding → Written at offset 5000
- Insert into _arrayDedup[zeros] → ValueRep(offset=5000, compressed=true)
- Subsequent arrays → Map lookup → Same ValueRep
Example 3: VtValue Recursion
For nested VtValues (e.g., VtValue containing VtDictionary containing VtValues):
// Prevent infinite recursion (lines 1239-1253)
auto &recursionGuard = _LocalUnpackRecursionGuard::Get();
if (!recursionGuard.insert(rep).second) {
TF_RUNTIME_ERROR("Recursive VtValue detected");
return VtValue();
}
result = crate->UnpackValue(rep);
recursionGuard.erase(rep);
Protection: Thread-local set prevents circular references in corrupt files.
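The same protection can be written as an RAII guard, so the entry is removed on every exit path, including early returns and exceptions. This is a sketch of the idea rather than the library's _LocalUnpackRecursionGuard: UnpackGuard and ActiveUnpacks are hypothetical names, and a plain file offset stands in for a ValueRep.

```cpp
#include <cstdint>
#include <unordered_set>

// Hypothetical per-thread set of offsets currently being unpacked.
inline std::unordered_set<uint64_t> &ActiveUnpacks() {
    thread_local std::unordered_set<uint64_t> active;
    return active;
}

// Registers an offset on construction and removes it on destruction.
class UnpackGuard {
public:
    explicit UnpackGuard(uint64_t offset)
        : _offset(offset),
          _entered(ActiveUnpacks().insert(offset).second) {}

    ~UnpackGuard() {
        // Only the guard that actually inserted the offset removes it.
        if (_entered) {
            ActiveUnpacks().erase(_offset);
        }
    }

    UnpackGuard(const UnpackGuard &) = delete;
    UnpackGuard &operator=(const UnpackGuard &) = delete;

    // False when this offset is already being unpacked on this thread,
    // which indicates a cycle.
    bool Entered() const { return _entered; }

private:
    uint64_t _offset;
    bool _entered;
};
```

A caller would construct the guard, bail out with an error if Entered() is false, and otherwise unpack as usual.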
Version History Impact
Version 0.0.1 → 0.5.0
- Basic deduplication
- Arrays stored uncompressed
- 32-bit array sizes
Version 0.5.0
- Integer array compression (lines 1786-1809)
- Dedup maps store compressed representations
- No rank storage for arrays (always 1D)
Version 0.6.0
- Float array compression (lines 1811-1893)
- Lookup table encoding
- Integer encoding for floats
Version 0.7.0
- 64-bit array sizes (lines 1799-1801, 1837-1839)
- Enables larger arrays
- Dedup still works with larger arrays
Version 0.8.0+
- SdfPayloadListOp deduplication (lines 1485-1491)
- Layer offset support in payloads
Thread Safety
Write Path
Not Thread-Safe - Single-threaded packing:
- _PackingContext tables are initialized in parallel at construction (lines 917-973), then updated serially during packing
- Dedup maps are modified during sequential value packing
- _BufferedOutput uses WorkDispatcher for async writes
Read Path
Thread-Safe with Caveats:
- Immutable structures after file open
- _sharedTimes dedup uses tbb::spin_rw_mutex (lines 1267-1288)
- Zero-copy arrays: Concurrent reads are safe, but mapping destruction requires synchronization
Zero-Copy Integration
Zero-Copy Deduplication (lines 1912-1963)
Arrays can be deduplicated at the ValueRep level while still supporting zero-copy reads:
if (zeroCopyEnabled &&
numBytes >= MinZeroCopyArrayBytes && // ≥2048 bytes
/* properly aligned */) {
void const *addr = reader.src.TellMemoryAddress();
*out = VtArray<T>(
foreignSrc,
static_cast<T *>(const_cast<void *>(addr)),
size, /*addRef=*/false);
}
Key Points:
- Multiple ValueReps can point to the same mmap region
- _FileMapping::_Impl::ZeroCopySource tracks outstanding references (lines 460-471)
- On mapping destruction, copy-on-write detachment (lines 490-523)
Dedup Benefit: Multiple attributes with same large array share:
- Single file offset (dedup)
- Single mmap region (zero-copy)
- Minimal memory overhead
Environment Variables
USDC_ENABLE_ZERO_COPY_ARRAYS (lines 127-132)
TF_DEFINE_ENV_SETTING(
USDC_ENABLE_ZERO_COPY_ARRAYS, true,
"Enable the zero-copy optimization for numeric array values...");
Impact on Dedup: Disabled zero-copy still benefits from dedup (reads from same offset).
USD_WRITE_NEW_USDC_FILES_AS_VERSION (lines 111-117)
Impact on Dedup: Older versions have fewer compression options, affecting array dedup effectiveness.
Limitations and Edge Cases
1. Type Granularity
Dedup is per-type: VtArray<int> and VtArray<float> use separate maps even if values numerically identical.
2. Floating Point Precision
IEEE floats: 0.0 and -0.0 have different bit patterns but compare equal, so whether they deduplicate to a single entry depends on how _Hasher and equality treat them for the type in question.
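The signed-zero case is easy to demonstrate in isolation. The helpers below are hypothetical and say nothing about what _Hasher does; they only show that two doubles can be equal under operator== while differing in the bytes that would be written.

```cpp
#include <cstdint>
#include <cstring>

static_assert(sizeof(double) == sizeof(uint64_t),
              "sketch assumes a 64-bit double");

// Returns the raw bit pattern of a double.
inline uint64_t BitsOf(double v) {
    uint64_t bits = 0;
    std::memcpy(&bits, &v, sizeof(bits));
    return bits;
}

// True when two doubles compare equal but have different byte
// representations, as 0.0 and -0.0 do.
inline bool EqualButDifferentBytes(double a, double b) {
    return a == b && BitsOf(a) != BitsOf(b);
}
```

A dedup map keyed on operator== would treat such a pair as one key and reuse the bytes of whichever value arrived first, whereas one keyed on raw bytes would keep both.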
3. Compression Variance
Same array might compress differently based on:
- Version flags
- Size thresholds
- Content patterns
This can prevent dedup of logically identical arrays.
4. Map Memory Growth
For files with many unique large arrays, dedup maps can consume significant RAM during write.
5. No Inter-File Dedup
Each file write creates fresh dedup maps. Common values across files stored separately.
Best Practices
For USD Authors
- Reuse Values: Prefer referencing same value objects rather than creating duplicates
- Common Defaults: Use standard default values (0, 1, identity matrices) that dedup well
- Shared Time Samples: Reuse time arrays across attributes when possible
- Token Interning: Use TfToken for repeated strings
For Implementation
- Monitor Memory: Large dedup maps can OOM on huge files
- Version Selection: Use latest version for best compression+dedup
- Profiling: Check dedup effectiveness with file size metrics
- Clear Maps: Ensure Clear() is called after write to free memory
Debugging and Diagnostics
Checking Dedup Effectiveness
- Compare File Size: Measure size with/without likely duplicates
- Section Sizes: Inspect TOKENS/STRINGS sections for redundancy
- Memory Profiling: Monitor _valueDedup / _arrayDedup sizes during write
Common Issues
Symptom: File larger than expected
- Cause: Values not hashing/comparing correctly
- Solution: Verify the _Hasher implementation for the type
Symptom: High memory during write
- Cause: Too many unique large arrays
- Solution: Write in chunks, or accept lack of dedup
Symptom: Slow writes
- Cause: Hash computation expensive for large arrays
- Solution: Profile hash function, consider size limits
Summary
The Crate deduplication system provides:
- ✅ Multi-level dedup: Structural (global) + Value (per-type)
- ✅ Automatic: Transparent to API users
- ✅ Efficient: O(1) lookup, lazy allocation
- ✅ Integrated: Works with compression and zero-copy
- ✅ Versioned: Evolves with format capabilities
Result: Significant file size reduction for typical USD data with shared tokens, paths, defaults, and time samples, while maintaining fast read/write performance.
References
- Source: pxr/usd/sdf/crateFile.cpp
- Key Types: _PackingContext, _ValueHandler<T>, ValueRep
- Key Methods: Pack(), PackArray(), _PackValue()
- Sections: TOKENS, STRINGS, FIELDS, FIELDSETS, PATHS, SPECS