63 Commits

Author SHA1 Message Date
Ryan Petrello
baad765179 refactor some callback receiver code
the bigint migration removed the foreign key constraints for:

- host_id
- job_id (and projectupdate_id, etc...)

because of this, we don't really need to check explicitly for a host_id
IntegrityError anymore (because it won't occur)

additionally, while it's possible to insert an event with a mismatched
job_id now (for example, you can totally start a long-running job, and
delete the job record in the background using the ORM or psql), doing
so results in DoesNotExist errors in the code that handles the
playbook_on_stats events
2020-09-25 13:12:42 -04:00
Ryan Petrello
cd0b9de7b9 remove multiprocessing.Queue usage from the callback receiver
instead, just have each worker connect directly to redis
this has a few benefits:

- it's simpler to explain and debug
- back pressure on the queue keeps messages around in redis (which is
  observable, and survives the restart of Python processes)
- it's likely notably more performant at high loads
2020-09-24 13:53:58 -04:00
Ryan Petrello
57f8e48894 make --status more robust for dispatcher, and add support for receiver
make the --status flag work by fetching a periodically recorded snapshot
of internal process state; additionally, update the callback receiver to
*also* record these statistics so we can gain more insight into any
performance issues
2020-09-17 15:33:37 -04:00
Ryan Petrello
0df6409244 remove task state tracking from the callback receiver
we don't have support for displaying these stats anyways, so there's
no point in using resources tracking them, especially for high-volume
installs
2020-09-16 13:40:42 -04:00
Ryan Petrello
a0e5e74cab fix a typo in an f-string 2020-07-31 12:48:45 -04:00
Rebeccah
118e1b8df1 removing memcached mentions in comments
remove memcached folder as it is no longer needed, also address a couple grammatical errors
2020-06-18 15:52:59 -04:00
Jeff Bradberry
ced8f42835 Force worker processes to have a different signal handler from the parent
Situations have come up where the 5+ minute kill signal for
run_task_manager is emitted to the worker process running it, but
since the worker improperly inherited the AWXConsumerBase().stop()
handler a deadlock ultimately was triggered on the database
connection.
2020-06-04 15:41:28 -04:00
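As an illustration of the fix described above, here is a minimal sketch of a worker replacing the signal handler it inherited from its parent. The function names (parent_stop_handler, worker_exit_handler, install_worker_signal_handlers) are hypothetical and are not taken from the AWX code; the real handler is AWXConsumerBase().stop(), which is only stood in for here.

```python
import signal
import sys


def parent_stop_handler(signum, frame):
    """Hypothetical stand-in for the parent's stop() handler.

    In the parent this would shut down the whole worker pool; running it
    inside a worker is the situation the commit message describes.
    """
    raise RuntimeError("parent-only shutdown logic ran in a worker")


def worker_exit_handler(signum, frame):
    """Hypothetical worker-side handler: exit this process only."""
    sys.exit(0)


def install_worker_signal_handlers():
    """Replace inherited handlers; call this first thing in the worker."""
    for signum in (signal.SIGINT, signal.SIGTERM):
        signal.signal(signum, worker_exit_handler)
```

A forked child starts with a copy of the parent's handlers, so the worker would call install_worker_signal_handlers() before doing any work. Note that signal.signal() can only be called from the main thread of a process.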
Ryan Petrello
b4b261b918 fix busted flake8 2020-05-01 13:51:37 -04:00
chris meyers
a8f52c1639 actually do exponential calc rather than *2
* Log the time until the next reconnect attempt in the log message, rather
than the attempt number
2020-04-28 15:24:08 -04:00
chris meyers
2ecd055d1e sleep backoff on cb receiver reconnect
* Sleep before trying to reconnect
Most common reason for entering this reconnect loop is when Redis
service stops before the callback receiver when stopping tower services.
2020-04-28 12:47:40 -04:00
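The two reconnect commits above (sleep before reconnecting, then a true exponential calculation) can be sketched roughly as follows. This is an illustration under assumptions, not the AWX implementation: the names backoff_delays and connect_with_backoff, the base/factor/cap values, and the use of ConnectionError are all hypothetical.

```python
import time


def backoff_delays(base=1.0, factor=2.0, cap=60.0):
    """Yield delays of base * factor**attempt seconds, never above cap."""
    attempt = 0
    while True:
        yield min(cap, base * factor ** attempt)
        attempt += 1


def connect_with_backoff(connect, max_attempts=5, sleep=time.sleep, log=print):
    """Call connect() until it succeeds, sleeping between failed attempts.

    The log message reports the time until the next attempt rather than
    the attempt number. The last failure is re-raised to the caller.
    """
    delays = backoff_delays()
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts:
                raise
            delay = next(delays)
            log(f"connection failed; retrying in {delay:.1f}s")
            sleep(delay)
```

The cap keeps the wait bounded when the remote service (Redis, in the commits above) stays down for a long time.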
Christian Adams
a899a147e1 Fix new flake8 from pyflakes 2.2.0 release 2020-04-20 09:50:50 -04:00
Ryan Petrello
80147acc1c work around redis connection failures in the callback receiver
if redis stops/starts, sometimes the callback receiver doesn't recover
without a restart; this fixes that
2020-04-09 15:38:03 -04:00
Ryan Petrello
c8044b4755 migrate event table primary keys from integer to bigint
see: https://github.com/ansible/awx/issues/6010
2020-03-26 15:54:38 -04:00
softwarefactory-project-zuul[bot]
0fb800f5d0 Merge pull request #6344 from chrismeyersfsu/redis-cleanup1
Redis cleanup1

Reviewed-by: https://github.com/apps/softwarefactory-project-zuul
2020-03-20 13:07:40 +00:00
Ryan Petrello
d40a5dec8f change when we send job notifications to avoid a race condition
success/failure notifications for *playbooks* include summary data about
the hosts based on the contents of the playbook_on_stats event

the current implementation suffers from a number of race conditions that
sometimes can cause that data to be missing or incomplete; this change
makes it so that for *playbooks* we build (and send) the notification in
response to the playbook_on_stats event, not the EOF event
2020-03-19 10:01:52 -04:00
chris meyers
5e481341bc flake8 2020-03-19 10:01:20 -04:00
chris meyers
c7de3b0528 fix spelling 2020-03-19 10:01:20 -04:00
chris meyers
7f2e1d46bc replace janky unique channel name w/ uuid
* postgres notify/listen channel names have size limitations as well as
character limitations. Respect those limitations while still
generating a unique channel name.
2020-03-19 08:59:15 -04:00
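One way to build a channel name that respects the limits mentioned above is sketched below. It assumes a default PostgreSQL build, where identifiers longer than 63 bytes are truncated, and it restricts the name to characters that are valid in an unquoted identifier. The function name unique_channel_name and the "reply_to" prefix are hypothetical choices for this example.

```python
import re
import uuid

# NAMEDATALEN - 1 in a default PostgreSQL build (assumption: not recompiled).
PG_MAX_IDENTIFIER_BYTES = 63


def unique_channel_name(prefix="reply_to"):
    """Return a unique name usable as a LISTEN/NOTIFY channel.

    uuid4().hex is 32 lowercase hex characters with no hyphens, so the
    result stays short and needs no quoting as long as the prefix starts
    with a letter or underscore.
    """
    name = f"{prefix}_{uuid.uuid4().hex}"
    if len(name.encode("utf-8")) > PG_MAX_IDENTIFIER_BYTES:
        raise ValueError(f"channel name too long: {name!r}")
    if not re.fullmatch(r"[a-z_][a-z0-9_]*", name):
        raise ValueError(f"channel name has invalid characters: {name!r}")
    return name
```

Validating up front avoids silent truncation, which could otherwise make two distinct long names collide on the same channel.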
chris meyers
12158bdcba remove dead code 2020-03-19 08:57:05 -04:00
Egor Margineanu
f858eda6b1 Made OPTIONS optional 2020-03-19 13:43:06 +01:00
Egor Margineanu
3a208a0be2 Added support for PG port and options. related #6340 2020-03-19 13:29:06 +01:00
chris meyers
093d204d19 fix flake8 2020-03-18 16:10:19 -04:00
chris meyers
be58906aed remove kombu 2020-03-18 16:10:17 -04:00
chris meyers
dc6c353ecd remove support for multi-reader dispatch queue
* Under the new postgres backed notify/listen message queue, this never
actually worked. Without using the database to store state, we cannot
provide an at-most-once delivery mechanism w/ multi-readers.
* With this change, work is done ONLY on the node that requested for the
work to be done. Under rabbitmq, the node that was first to get the
message off the queue would do the work; presumably the least busy node.
2020-03-18 16:10:16 -04:00
chris meyers
2a2c34f567 combine all the broker replacement pieces
* local redis for event processing
* postgres for message broker
* redis for websockets
2020-03-18 16:10:15 -04:00
chris meyers
558e92806b POC postgres broker 2020-03-18 16:10:15 -04:00
chris meyers
355fb125cb redis events 2020-03-18 16:10:15 -04:00
chris meyers
c8eeacacca POC channels 2 2020-03-18 16:10:12 -04:00
Ryan Petrello
5364e78397 switch the periodic scheduler to a child process (instead of a thread)
I have a hunch that our usage of a daemon thread is causing import lock
contention related to https://github.com/ansible/awx/issues/5617
We've encountered similar issues before with threads across dispatcher
processes at fork time, and cpython has had bugs like this in recent
history:

https://bugs.python.org/issue38884

My gut tells me this might be related.

The prior implementation - based on celerybeat - ran its code in
a process (not a thread), and the timing of that merge matches the
period of time we started noticing issues.

Currently testing it to see if it resolves some of the issues we're
seeing.
2020-02-27 12:15:15 -05:00
Ryan Petrello
8b1806d4ca add code for detecting (and killing) a hung task manager task 2020-02-26 07:53:04 -05:00
AlanCoding
e59cb07064 Add wording for control message log 2020-02-11 10:01:25 -05:00
Ryan Petrello
38a08d163c get rid of celery/celerybeat
alternative to https://github.com/ansible/awx/pull/2530 which makes use
of https://pypi.org/project/schedule/

this doesn't have support for any persistence (like how celery beat uses
a shelve file), because all of our periodic jobs run at most every few
minutes
2020-02-10 17:32:02 -05:00
Ryan Petrello
3c31e0ed16 some more minor callback cleanup and development tweaks 2020-01-27 17:18:09 -05:00
Ryan Petrello
78b00652bd add the ability to enable profiling for the callback receiver workers 2020-01-27 12:03:53 -05:00
Ryan Petrello
8f33f1a6c2 remove another expensive logging lookup in the parent callback process 2020-01-24 16:46:32 -05:00
Bill Nottingham
4e46d5d7cd Fix some lint 2020-01-20 17:15:27 -05:00
Ryan Petrello
8bd9233d2c remove some unnecessary callback receiver debugging code 2020-01-14 14:21:53 -05:00
Ryan Petrello
306f504fb7 optimize the callback receiver to buffer writes on high throughput
additionally, optimize away several per-event host lookups and
changed/failed propagation lookups

we've always performed these (fairly expensive) queries *on every event
save* - if you're processing tens of thousands of events in short
bursts, this is way too slow

this commit also introduces a new command for profiling the insertion
rate of events, `awx-manage callback_stats`

see: https://github.com/ansible/awx/issues/5514
2020-01-14 12:04:26 -05:00
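The write buffering described above can be illustrated with a small sketch: events accumulate in memory and are written in one batch once enough have arrived or enough time has passed. The class name EventBuffer and the max_size and max_age parameters are hypothetical; this is not the AWX implementation. In a Django project the write_batch callable might wrap something like Model.objects.bulk_create, but that pairing is an assumption for this example.

```python
import time


class EventBuffer:
    """Collect events and hand them to write_batch in groups."""

    def __init__(self, write_batch, max_size=1000, max_age=1.0,
                 clock=time.monotonic):
        self.write_batch = write_batch  # callable taking a list of events
        self.max_size = max_size        # flush once this many are pending
        self.max_age = max_age          # or once this many seconds passed
        self.clock = clock
        self.pending = []
        self.last_flush = clock()

    def add(self, event):
        """Queue one event, flushing if a size or age limit is reached."""
        self.pending.append(event)
        too_many = len(self.pending) >= self.max_size
        too_old = self.clock() - self.last_flush >= self.max_age
        if too_many or too_old:
            self.flush()

    def flush(self):
        """Write everything pending as a single batch."""
        if self.pending:
            batch, self.pending = self.pending, []
            self.write_batch(batch)
        self.last_flush = self.clock()
```

The age check only runs when a new event arrives, so a caller would also invoke flush() on shutdown or when the input goes idle to avoid leaving events unwritten.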
AlanCoding
eec08fdcca Log case of duplicate UUIDs 2020-01-09 07:31:32 -05:00
Ryan Petrello
83550eeba0 make the callback receiver more robust to duplicate UUIDs from ansible 2019-11-01 09:24:52 -04:00
Ryan Petrello
3094b67664 work around a bug in the k8s client that leaves trash in /tmp 2019-10-29 11:24:17 -04:00
Ryan Petrello
d01088d33e Revert "add support for awx-manage run_callback_receiver --status" 2019-10-18 09:49:02 -04:00
Ryan Petrello
ffb1707e74 add support for awx-manage run_callback_receiver --status 2019-10-17 11:10:27 -04:00
Buymov Ivan
f2676064fd Fix error with rejoining node to cluster after lost connection to postgres 2019-09-27 01:17:27 -04:00
Ryan Petrello
40b1e89b67 add the ability to disable RabbitMQ queue durability 2019-05-28 15:49:32 -04:00
Ryan Petrello
17a803f49c remove the old callback plugin import paths and callback-specific tests 2019-04-12 16:11:23 -04:00
Ryan Petrello
32ee9838af use the correct logger for the callback receiver
the callback receiver and dispatcher share several modules, so add logic
to use the correct logger
2019-03-15 08:09:47 -04:00
Ryan Petrello
daeeaf413a clean up unnecessary usage of the six library (awx only supports py3) 2019-01-25 00:19:48 -05:00
Ryan Petrello
4707dc2a05 clean up some unnecessary dispatcher reaping code 2019-01-24 11:11:05 -05:00
Ryan Petrello
b2442d42a3 detect dead DB connections in the dispatcher when reaping jobs 2019-01-22 08:40:26 -05:00