Stalled jobs

In BullMQ, every time a job is picked up by a worker, a lock is held for the duration of the job's processing. The lock is represented by a special key in Redis™ and expresses that the worker is alive and the job is being processed normally. If the worker fails to keep this lock from expiring, the system (basically the other running workers) will detect that the job has lost its lock and will mark the job as "stalled".

So a stalled job simply means a job that has been moved from the active status to either "wait" or "failed" because its lock went missing.

The most common scenario where this can happen with BullMQ Proxy is when the proxy is shut down or restarted abruptly, without waiting for the jobs currently being processed to complete. The next time the proxy starts and new workers are instantiated, they will detect which jobs have stalled and move them back to wait or to failed.

Note that the time it takes to detect a stalled job depends on the stalled checker interval, which is 30 seconds by default.
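
As a rough sketch of the underlying mechanism, these are the plain BullMQ worker options involved in lock renewal and stalled-job detection (the values shown are the BullMQ defaults; how the proxy configures its internal workers is an assumption here):

import { Worker } from "bullmq";

// Minimal sketch: the lock and stalled-checker settings on a plain BullMQ worker.
const worker = new Worker(
  "my-queue",
  async (job) => {
    // process the job
  },
  {
    connection: { host: "localhost", port: 6379 },
    lockDuration: 30000,    // how long a job's lock is valid before it must be renewed (ms)
    stalledInterval: 30000, // how often the stalled checker looks for jobs with missing locks (ms)
  },
);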

To prevent a worker malfunction from keeping a stalled job looping between the active and wait statuses, there is an option (maxStalledCount) that restricts the number of times a given job can stall. It defaults to 1, on the assumption that stalling is a very rare occurrence for any given job.

If a job stalls more than maxStalledCount times, it will instead be moved to failed with the failedReason "job stalled more than allowable limit", indicating that something else may be preventing the workers from keeping the locks alive for the duration of the job. This can happen, for example, if jobs are long running and the CPU where the proxy runs approaches 100%, so that the process responsible for renewing locks never gets a chance to run.
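
If you want to notice when this happens, one hedged approach is to listen for failed events and check the failedReason reported by BullMQ (the queue name and connection details below are placeholders):

import { QueueEvents } from "bullmq";

// Sketch: watch for jobs that exceeded the stall limit by inspecting failedReason.
const queueEvents = new QueueEvents("my-queue", {
  connection: { host: "localhost", port: 6379 },
});

queueEvents.on("failed", ({ jobId, failedReason }) => {
  if (failedReason === "job stalled more than allowable limit") {
    console.warn(`Job ${jobId} stalled too many times; check CPU load and lock renewal.`);
  }
});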

With this in mind we can expand the WorkerMetadata interface to reflect this option:

interface WorkerMetadata {
  opts?: {
    maxStalledCount?: number;
    // ... more options
  };
  // ... more options
}
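
As a usage sketch, worker metadata that tolerates a few more stalls before failing a job could look like this (the value 3 is purely illustrative):

// Hypothetical example: only opts.maxStalledCount is taken from the interface above.
const workerMetadata: WorkerMetadata = {
  opts: {
    maxStalledCount: 3, // allow up to 3 stalls before the job is moved to failed
  },
};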
