the bigint migration removed the foreign key constraints for:
- host_id
- job_id (and projectupdate_id, etc.)
because of this, we no longer need to check explicitly for a host_id
IntegrityError (it can no longer occur)
additionally, while it's now possible to insert an event with a
mismatched job_id (for example, you can start a long-running job and
delete the job record in the background using the ORM or psql), doing
so results in DoesNotExist errors in the code that handles
playbook_on_stats events
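a minimal sketch of what tolerating a deleted job could look like in
the stats handler (the handler and logging here are illustrative, not
the actual AWX code):

```python
import logging

from django.core.exceptions import ObjectDoesNotExist

logger = logging.getLogger(__name__)


def handle_playbook_on_stats(event):
    try:
        # with the FK constraint gone, the parent row may have been
        # deleted out from under a long-running job
        job = event.job
    except ObjectDoesNotExist:
        logger.warning('job %s is gone; dropping stats event', event.job_id)
        return
    # ... summarize per-host results against job ...
```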
instead, just have each worker connect directly to redis
this has a few benefits:
- it's simpler to explain and debug
- back pressure on the queue keeps messages around in redis (where
they're observable and survive restarts of the Python processes)
- it's likely notably more performant at high loads
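a rough sketch of the direct-to-redis worker loop using redis-py (the
queue name, payload shape, and process() are assumptions for
illustration):

```python
import json

import redis

conn = redis.Redis()  # every worker holds its own connection


def worker_loop(queue='callback_events'):
    while True:
        # BLPOP blocks until a message arrives; anything not yet
        # consumed stays in redis, which is what makes back pressure
        # observable and restart-safe
        _key, raw = conn.blpop(queue)
        process(json.loads(raw))  # process() is an illustrative stand-in
```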
make the --status flag work by fetching a periodically recorded snapshot
of internal process state; additionally, update the callback receiver to
*also* record these statistics so we can gain more insight into any
performance issues
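one possible shape for the snapshotting side (the path, interval, and
worker attributes are assumptions, not the real implementation):

```python
import json
import time

STATS_PATH = '/var/run/awx/callback_receiver_stats.json'  # hypothetical


def record_stats_forever(workers, interval=5):
    while True:
        snapshot = {
            'recorded_at': time.time(),
            'workers': [
                {'pid': w.pid, 'queue_depth': w.queue_depth}
                for w in workers
            ],
        }
        with open(STATS_PATH, 'w') as f:
            json.dump(snapshot, f)
        time.sleep(interval)
```

--status then only has to read and print the most recent snapshot
instead of interrogating live processes.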
Situations have come up where the 5+ minute kill signal for
run_task_manager is emitted to the worker process running it, but
because the worker improperly inherited the AWXConsumerBase().stop()
handler, a deadlock was ultimately triggered on the database
connection.
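the shape of the fix, sketched (illustrative, not the actual handler):
register a SIGTERM handler for that worker which does no ORM work at
all

```python
import signal
import sys


def stop_without_db_work(signum, frame):
    # deliberately avoid the database connection here; a stop handler
    # that grabs a possibly mid-query connection is how the deadlock
    # was triggered
    sys.exit(1)


signal.signal(signal.SIGTERM, stop_without_db_work)
```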
* Sleep before trying to reconnect
The most common reason for entering this reconnect loop is that the
Redis service stops before the callback receiver does when Tower
services are shut down.
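roughly, assuming a redis-py connection:

```python
import time

import redis


def consume(conn, queue='callback_events'):
    while True:
        try:
            conn.blpop(queue)
        except redis.exceptions.ConnectionError:
            # without this pause, a stopped redis turns the reconnect
            # loop into a busy spin
            time.sleep(5)
```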
success/failure notifications for *playbooks* include summary data
about the hosts involved, based on the contents of the
playbook_on_stats event
the current implementation suffers from a number of race conditions
that can sometimes cause that data to be missing or incomplete; this
change
makes it so that for *playbooks* we build (and send) the notification in
response to the playbook_on_stats event, not the EOF event
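in sketch form (summarize_hosts and the notification helper are
illustrative names, not the actual AWX functions):

```python
def process_event(event):
    if event['event'] == 'playbook_on_stats':
        # the stats payload is complete by definition at this point,
        # so there is no race against late-arriving host data
        summary = summarize_hosts(event['event_data'])
        send_playbook_notification(event['job_id'], summary)
```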
* postgres notify/listen channel names have size limitations as well as
character limitations. Respect those limitations while still
generating a unique channel name (see the first sketch after this
list).
* Under the new postgres-backed notify/listen message queue, this never
actually worked. Without using the database to store state, we cannot
provide an at-most-once delivery mechanism with multiple readers.
* With this change, work is done ONLY on the node that requested the
work (see the LISTEN sketch after this list). Under rabbitmq, the node
that was first to get the message off the queue would do the work;
presumably the least busy node.
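one way to satisfy the channel-name constraints (an illustrative
sketch; postgres identifiers are capped at 63 bytes):

```python
import hashlib


def pg_channel_name(hostname):
    # the digest guarantees uniqueness; the prefix keeps it readable
    digest = hashlib.sha1(hostname.encode('utf-8')).hexdigest()[:10]
    safe = ''.join(c if c.isalnum() else '_' for c in hostname.lower())[:32]
    return f'awx_{safe}_{digest}'  # comfortably under the 63-byte cap
```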
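and in psycopg2 terms, each node LISTENs only on its own channel, so
replies come back to the node that asked for the work (the DSN,
channel name, and handler are assumptions):

```python
import select

import psycopg2
import psycopg2.extensions

conn = psycopg2.connect('dbname=awx')
conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)

with conn.cursor() as cur:
    cur.execute('LISTEN awx_node_1;')  # this node's private channel

while True:
    # wait up to 5s for a notification, then drain whatever arrived
    if select.select([conn], [], [], 5) != ([], [], []):
        conn.poll()
        while conn.notifies:
            notify = conn.notifies.pop(0)
            handle_reply(notify.payload)  # illustrative stand-in
```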
I have a hunch that our usage of a daemon thread is causing import lock
contention related to https://github.com/ansible/awx/issues/5617
We've encountered similar issues before with threads across dispatcher
processes at fork time, and cpython has had bugs like this in recent
history:
https://bugs.python.org/issue38884
My gut tells me this might be related.
The prior implementation - based on celerybeat - ran its code in
a process (not a thread), and the timing of that merge matches the
window in which we started noticing issues.
Currently testing it to see if it resolves some of the issues we're
seeing.
additionally, optimize away several per-event host lookups and
changed/failed propagation lookups
we've always performed these (fairly expensive) queries *on every event
save* - if you're processing tens of thousands of events in short
bursts, this is way too slow
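the shape of the optimization, sketched with illustrative model and
field names: resolve each job's hosts once and reuse the mapping for
every subsequent event

```python
_host_cache = {}


def host_id_for(job, host_name):
    # one query per job instead of one per saved event
    if job.id not in _host_cache:
        _host_cache[job.id] = dict(
            job.inventory.hosts.values_list('name', 'id')
        )
    return _host_cache[job.id].get(host_name)
```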
this commit also introduces a new command for profiling the insertion
rate of events, `awx-manage callback_stats`
see: https://github.com/ansible/awx/issues/5514