Pooling#2345
Conversation
Added connection pool options for SQL Alchemy engine.
Change pool_recycle to -1 to preserve current behavior.
Added SQLAlchemy connection-pool tuning options to configuration.
test_sql_pool_options.py exercises `store_db_parameters()` directly, requires no database, and runs in standard CI. It asserts the zero-behaviour-change defaults, override + typing, no DBAPI leakage, the existing dict-filtering, hashable/deterministic cache keys, and coexistence with search_path.
|
Is there a reason to not pop the attributes from the connect_args inside of get_engine? This would consolidate a bit of the complications noted in the PR between hashing and the manager using get_engine. Maybe I am missing something |
ricardogsilva
left a comment
There was a problem hiding this comment.
Just leaving my two cents here - I'm not a core committer so take these with a grain of salt.
Overall I agree with the PR, as adding these connection-related options seems relevant - thanks for your work and I look forward to having it merged!
Personally, I would simplify the implementation a bit, by relying on pygeoapi's JSON Schema document for the validation of the config.
And I would not include most of these tests, which I see as not being relevant.
| # Defaults keep SQLAlchemy's QueuePool sizing but, unlike SQLAlchemy's | ||
| # default of -1, recycle connections after an hour so that pooled | ||
| # connections cannot sit IDLE on the server indefinitely. |
There was a problem hiding this comment.
This part of the comment seems to be outdated, as you end up setting the default value of pool_recycle to -1
| # SQLAlchemy connection-pool tuning (optional). Defaults match | ||
| # SQLAlchemy's QueuePool and preserve previous behaviour. | ||
| # Persistent connections held open per worker process. | ||
| pool_size: 5 | ||
| # Extra short-lived connections allowed above pool_size. | ||
| max_overflow: 10 | ||
| # Recreate connections older than this many seconds. -1 (the | ||
| # default) never recycles; set a finite value (e.g. 300) so | ||
| # pooled connections cannot sit IDLE on the server indefinitely. | ||
| pool_recycle: -1 | ||
| # Seconds to wait for a connection from the pool before erroring. | ||
| pool_timeout: 30 | ||
| # Test connections with a lightweight ping before use. | ||
| pool_pre_ping: true |
There was a problem hiding this comment.
All of these new parameters need to be added to the config schema at
pygeoapi/resources/schemas/config/pygeoapi-config-0.x.yml
This will make it possible to test a pygeoapi configuration for correctness even before starting up the server.
| (key, type(default)(options.pop(key, default))) | ||
| for key, default in pool_defaults.items() | ||
| )) | ||
|
|
There was a problem hiding this comment.
In my opinion this could be made easier to read and less complex by:
- Storing
self.db_pool_optionsas adictinstead of atuple, and defer tuple creation to whenget_engineis called; - Relying on the types of passed
optionsalready being correct. Adding these new parameters to the config JSON Schema (as I mentioned in my other comment) would mean that the type of each parameter would already be documented and would be enforceable by doing a validation of the config.
Also, note that your implementation contains a subtle bug when trying to parse pool_pre_ping. If the original value had been:
{'pool_pre_ping': 'False'} # I'm passing a string with the value of "False"then the outcome would be:
# type(True)("False")
TrueIn other words, bool("False") is actually True because non-empty strings are truthy.
-pool_defaults = {
- 'pool_size': 5,
- 'max_overflow': 10,
- 'pool_recycle': -1, # SQLAlchemy default; preserves current behaviour
- 'pool_timeout': 30,
- 'pool_pre_ping': True,
-}
-self.db_pool_options = tuple(sorted(
- (key, type(default)(options.pop(key, default)))
- for key, default in pool_defaults.items()
-))
+self.pool_defaults = {
+ 'pool_size': options.pop('pool_size', 5),
+ 'max_overflow': options.pop('max_overflow', 10),
+ 'pool_recycle': options.pop('pool_recycle', -1), # SQLALchemy default - never release connections
+ 'pool_timeout': options.pop('pool_timeout', 30),
+ 'pool_pre_ping': options.pop('pool_pre_ping', True),
+}| self.db_user, | ||
| self._db_password, | ||
| self.db_conn, | ||
| self.db_pool_options, |
There was a problem hiding this comment.
as per my other comment, in my opinion it would be clearer if the tuple would be generated here, perhaps also accompanied with a comment mentioning that this is made as a way to enable making use of functools.cache.
Also, in modern Python, a dict's insertion ordering is preserved, so I don't think sorting the tuple would be needed.
| self.db_pool_options, | |
| tuple(self.db_pool_options.items()), # convert to hashable type, for using with functools.cache |
| def test_pool_options_defaults_preserve_current_behaviour(): | ||
| obj = _Dummy() | ||
| store_db_parameters(obj, dict(CONN), {}) | ||
| pool = dict(obj.db_pool_options) | ||
| # Defaults must match pre-existing effective behaviour: | ||
| # pool_pre_ping was hardcoded True; pool_recycle was unset (-1). | ||
| assert pool['pool_size'] == 5 | ||
| assert pool['max_overflow'] == 10 | ||
| assert pool['pool_timeout'] == 30 | ||
| assert pool['pool_pre_ping'] is True | ||
| assert pool['pool_recycle'] == -1 |
There was a problem hiding this comment.
This test seems unnecessary to me - when this PR gets merged, the behavior it implements will become the current behavior, so the test looses its relevancy.
| def test_pool_options_are_overridable_and_typed(): | ||
| obj = _Dummy() | ||
| store_db_parameters( | ||
| obj, dict(CONN), | ||
| {'pool_size': 2, 'max_overflow': 3, 'pool_recycle': 300}, | ||
| ) | ||
| pool = dict(obj.db_pool_options) | ||
| assert pool['pool_size'] == 2 and isinstance(pool['pool_size'], int) | ||
| assert pool['max_overflow'] == 3 | ||
| assert pool['pool_recycle'] == 300 | ||
| # untouched keys keep defaults | ||
| assert pool['pool_timeout'] == 30 | ||
| assert pool['pool_pre_ping'] is True |
There was a problem hiding this comment.
This test would be unnecessary if you'd go with my suggestion above, of storing db_pool_options as a dict instead of a tuple and you'd rely on the configuration being valid after having added the JSON schema bits that are missing.
| def test_dict_valued_options_still_filtered(): | ||
| obj = _Dummy() | ||
| store_db_parameters( | ||
| obj, dict(CONN), | ||
| {'pool_size': 2, 'zoom': {'min': 0, 'max': 22}}, | ||
| ) | ||
| assert 'zoom' not in obj.db_options | ||
| assert dict(obj.db_pool_options)['pool_size'] == 2 | ||
|
|
There was a problem hiding this comment.
This test seems to be unnecessary, as it is not testing the changes you made in this PR. It verifies that the contents of obj.db_options are correct.
IMO this PR does not make any changes that would warrant this verification, unless you would not trust the behavior of dict.pop, which is a Python builtin.
| def test_pool_options_hashable_and_deterministic(): | ||
| a, b = _Dummy(), _Dummy() | ||
| store_db_parameters(a, dict(CONN), {'pool_size': 2}) | ||
| store_db_parameters(b, dict(CONN), {'pool_size': 2}) | ||
| # identical config -> identical key -> shared engine via functools.cache | ||
| assert a.db_pool_options == b.db_pool_options | ||
| assert hash(a.db_pool_options) == hash(b.db_pool_options) | ||
|
|
||
| c = _Dummy() | ||
| store_db_parameters(c, dict(CONN), {'pool_size': 9}) | ||
| # differing pool config -> distinct key (separate engine, by design) | ||
| assert c.db_pool_options != a.db_pool_options | ||
|
|
There was a problem hiding this comment.
This is testing Python's own implementation of how tuples are hashed, so I don't think it is relevant to include in pygeoapi.
| def test_pool_options_coexist_with_search_path(): | ||
| obj = _Dummy() | ||
| store_db_parameters( | ||
| obj, dict(CONN), | ||
| {'search_path': ['published', 'public'], 'pool_size': 4}, | ||
| ) | ||
| assert obj.db_search_path == ('published', 'public') | ||
| assert dict(obj.db_pool_options)['pool_size'] == 4 | ||
|
|
There was a problem hiding this comment.
This test seems unnecessary, as it does not test the functionality introduced in this PR
Overview
Makes the SQLAlchemy connection pool of the SQL provider configurable per provider via the existing
options:block, exposingpool_size,max_overflow,pool_recycle,pool_timeoutandpool_pre_ping.Previously
get_engine()calledcreate_engine(conn_str, connect_args=connect_args, pool_pre_ping=True)with no pool sizing or recycle, so the defaultQueuePoolheldpool_sizeconnections open for the life of each worker process and never recycled them. In multi-process deployments this produces a large number of permanently-IDLE server-side connections (we saw connections idle for days, eventually exhaustingmax_connections). There was no way to bound or recycle the pool from configuration.Changes:
store_db_parameters()now extracts the five pool keys fromoptions, coerces them to their declared types, and stores them as a sorted, hashabletuple(self.db_pool_options). They are popped out ofoptionsso they are not forwarded to the DBAPI asconnect_args.get_engine()takes apool_optionstuple parameter and applies**dict(pool_options)tocreate_engine(). It stays@functools.cache-able because the parameter is a hashable tuple, so engine sharing per process is preserved; providers with differing pool config correctly get distinct engines.pygeoapi/process/manager/postgresql.pyalso callsget_engine(); its call site is updated to passself.db_pool_optionsso the manager does not losepool_pre_pingor skip recycling.Backward compatibility: defaults preserve current behaviour exactly —
pool_size=5,max_overflow=10,pool_pre_ping=True, andpool_recycle=-1(SQLAlchemy's default, i.e. the current effective behaviour).This PR is therefore a pure, opt-in feature add with no behaviour change for existing users. (See the issue for discussion of whether a finite default
pool_recycleshould be adopted as a separate follow-up.)New tests and documentation are included.
Related Issue / discussion
Closes #2344.
Additional information
Example configuration:
Note (documented): because
get_engine()is@functools.cache-d on its full argument set, providers that share a database must use identical pool options to continue sharing a single engine per worker; differing options intentionally yield separate engines.Dependency policy (RFC2)
No new dependencies are introduced; only the standard library and the already-required SQLAlchemy are used.
Updates to public demo
local.config.ymldoes not need to change.Contributions and licensing