Files
Nikola Pajkovsky 6255955d08 method store performance improvements
The proposed architectural change focuses on improving concurrency and
reducing contention within the method store. The fundamental concept
involves moving away from a monolithic synchronisation
mechanism—specifically, a single read-write lock (rwlock)—that
currently guards the entire method store.

Instead of this single point of contention, the strategy is to
introduce per-shard synchronisation. This means the method store will
be partitioned, or sharded, into several independent segments. Each of
these segments, or shards, will be protected by its own dedicated
read-write lock.

The data in the table below was generated by running evp_fetch twenty times per thread.

|---------+----------+---------+---------+---------+---------+---------+---+--------+--------+--------+--------+--------|
|                    | Shards (u/sec)                                  |   | Improvements %
|---------+----------+---------+---------+---------+---------+---------+---+--------+--------+--------+--------+--------|
| Threads | Base     |       2 |       4 |       8 |      16 |      32 |   |      2 |      4 |      8 |     16 |     32 |
|---------+----------+---------+---------+---------+---------+---------+---+--------+--------+--------+--------+--------|
|       1 |  0.18282 | 0.18497 | 0.18306 | 0.18314 | 0.18485 | 0.18352 |   |   1.17 |   0.13 |   0.18 |   1.11 |   0.39 |
|       2 |  0.43588 | 0.35560 | 0.34131 | 0.32516 | 0.33948 | 0.35076 |   | -18.42 | -21.70 | -25.40 | -22.12 | -19.53 |
|       4 |  1.58185 | 1.06459 | 1.06258 | 0.98698 | 0.98700 | 1.06689 |   | -32.70 | -32.83 | -37.61 | -37.60 | -32.55 |
|       8 |  3.15686 | 1.75061 | 1.67458 | 1.50241 | 1.62453 | 1.74750 |   | -44.55 | -46.95 | -52.41 | -48.54 | -44.64 |
|      16 |  5.53647 | 2.83137 | 2.58007 | 2.65972 | 2.64882 | 2.82755 |   | -48.86 | -53.40 | -51.96 | -52.16 | -48.93 |
|      32 | 10.72727 | 4.97483 | 4.43692 | 4.52524 | 4.68358 | 4.84840 |   | -53.62 | -58.64 | -57.82 | -56.34 | -54.80 |
|      64 | 21.12103 | 9.43241 | 7.79981 | 7.91148 | 8.33305 | 8.34230 |   | -55.34 | -63.07 | -62.54 | -60.55 | -60.50 |

Perf tests were running on the system:
  Architecture: x86_64
  CPU op-mode(s): 32-bit, 64-bit
  Address sizes: 46 bits physical, 48 bits virtual
  Byte Order: Little Endian
  CPU(s): 96
  On-line CPU(s) list: 0-95
  Vendor ID: GenuineIntel
  Model name: Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz
  CPU family: 6
  Model: 85
  Thread(s) per core: 2
  Core(s) per socket: 24
  Socket(s): 2

The most performant option is a configuration with 512 cache entries with
4 shards. There are two new defines NUM_SHARDS, and CACHE_SIZE which
can be tweaked at will.

Signed-off-by: Nikola Pajkovsky <nikolap@openssl.org>

Reviewed-by: Saša Nedvědický <sashan@openssl.org>
Reviewed-by: Norbert Pocs <norbertp@openssl.org>
(Merged from https://github.com/openssl/openssl/pull/29205)
2025-12-17 12:29:17 +01:00
..
2025-12-09 00:28:19 -07:00
2025-12-09 00:28:19 -07:00
2025-12-09 00:28:19 -07:00
2025-12-09 00:28:19 -07:00
2025-12-09 00:28:19 -07:00
2025-12-09 00:28:19 -07:00

Selecting algorithm implementations by properties

Properties are associated with algorithms and are used to select between different implementations dynamically.

This implementation is based on a number of assumptions:

  • Property definition is uncommon. I.e. providers will be loaded and unloaded relatively infrequently, if at all.

  • The number of distinct property names will be small.

  • Providers will often give the same implementation properties to most or all of their implemented algorithms. E.g. the FIPS property would be set across an entire provider. Likewise for, hardware, accelerated, software, HSM and, perhaps, constant_time.

  • There are a lot of algorithm implementations, therefore property definitions should be space efficient. However...

  • ... property queries are very common. These must be fast.

  • Property queries come from a small set and are reused many times typically. I.e. an application tends to use the same set of queries over and over, rather than spanning a wide variety of queries.

  • Property queries can never add new property definitions.

Some consequences of these assumptions are:

  • That definition is uncommon and queries are very common, we can treat the property definitions as almost immutable. Specifically, a query can never change the state of the definitions.

  • That definition is uncommon and needs to be space efficient, it will be feasible to use a hash table to contain the names (and possibly also values) of all properties and to reference these instead of duplicating strings. Moreover, such a data structure need not be garbage collected. By converting strings to integers using a structure such as this, string comparison degenerates to integer comparison. Additionally, lists of properties can be sorted by the string index which makes comparisons linear time rather than quadratic time - the O(n log n) sort cost being amortised.

  • A cache for property definitions is also viable, if only implementation properties are used and not algorithm properties, or at least these are maintained separately. This cache would be a hash table, indexed by the property definition string, and algorithms with the same properties would share their definition structure. Again, reducing space use.

  • A query cache is desirable. This would be a hash table keyed by the algorithm identifier and the entire query string and it would map to the chosen algorithm. When a provider is loaded or unloaded, this cache must be invalidated. The cache will also be invalidated when the global properties are changed as doing so removes the need to index on both the global and requested property strings.

The implementation:

  • property_lock.c contains some wrapper functions to handle the global lock more easily. The global lock is held for short periods of time with per algorithm locking being used for longer intervals.

  • property_string.c contains the string cache which converts property names and values to small integer indices. Names and values are stored in separate hash tables. The two Boolean values, the strings "yes" and "no", are populated as the first two members of the value table. All property names reserved by OpenSSL are also populated here. No functions are provided to convert from an index back to the original string (this can be done by maintaining parallel stacks of strings if required).

  • property_parse.c contains the property definition and query parsers. These convert ASCII strings into lists of properties. The resulting lists are sorted by the name index. Some additional utility functions for dealing with property lists are also included: comparison of a query against a definition and merging two queries into a single larger query.

  • property.c contains the main APIs for defining and using properties. Algorithms are discovered from their NID and a query string. The results are cached.

    The caching of query results has to be efficient but it must also be robust against a denial of service attack. The cache cannot be permitted to grow without bounds and must garbage collect under-used entries. The garbage collection does not have to be exact.

  • defn_cache.c contains a cache that maps property definition strings to parsed properties. It is used by property.c to improve performance when the same definition appears multiple times.