# So you want to build a job processing system?

*Sharing lessons learned during a migration from Faktory Enterprise*

By [Shane da Silva](https://paragraph.com/@sds.eth) · 2025-01-08

---


We're building [Farcaster](https://www.farcaster.xyz/), a decentralized social network. This post discusses the job system we now use to power [Warpcast](https://warpcast.com/), the most popular client in the Farcaster ecosystem.

We migrated from [Faktory Enterprise](https://contribsys.com/faktory/) to our own background job processing system in August 2024. Our two primary motivations were:

*   Faktory was beginning to exhibit instability at the load we were subjecting it to, resulting in brief periods of downtime.
    
*   As a [single point of failure](https://en.wikipedia.org/wiki/Single_point_of_failure), any issue would have a user-visible impact (delayed feeds, etc.), and Faktory [cannot (yet) run in a highly available configuration](https://github.com/contribsys/faktory/issues/485#issuecomment-2302328570).
    

![](https://storage.googleapis.com/papyrus_images/88feee625ef27f24a6143279c34e197d.png)

Our Faktory dashboard right before decommission. Over 9 billion jobs processed!

Faktory had served us incredibly well since our adoption of it in January 2022. We're grateful to [Mike Perham](https://www.mikeperham.com/) who has been pioneering this space since the launch of [Sidekiq](https://sidekiq.org/) in 2012. We highly recommend you use a tool like Faktory before building your own job system.

With that said, suppose you've come to the conclusion that you need to build your own job processing solution. The rest of this post discusses factors we considered and lessons learned along the way.

### Alternatives Considered

Before making this decision, we surveyed the background job system landscape circa H1 2024. Available options fell into roughly three categories:

*   **SaaS providers**: AWS SQS, Google Cloud Tasks, Cloudflare Worker Queues, etc.  
    _Rejected because we wanted to minimize vendor lock-in._
    
*   **Ready-to-go open source frameworks**: BullMQ, Celery, etc.  
    _Almost went with BullMQ, but ultimately rejected because it requires_ [_different queue names per shard_](https://docs.bullmq.io/bull/patterns/redis-cluster) _rather than transparently managing shards on your behalf. Maintaining your own sharding system doesn't seem like the right level of abstraction to provide for a job processing framework._
    
*   **Build your own framework** on top of open source tools such as Redis, RabbitMQ, Kafka, etc.
    

We opted to build our own solution in Node.js on top of Redis: it gave us the most control and flexibility going forward, and it integrates tightly with our existing TypeScript codebase. It also let us lean on very mature libraries such as [ioredis](https://github.com/redis/ioredis) for talking to Redis from Node.js.

### Features

At a high level, our new system supports all the Faktory features that we made heavy use of:

*   Delayed jobs (schedule a job to start at a specific time)
    
*   Periodic jobs defined via cron syntax
    
*   Per-job retry configuration
    
*   Dead letter queue
    
*   Queue throttling / rate-limiting
    
*   Job uniqueness enforcement
    
*   Job expiration
    
*   Job execution timeouts
    

We also added support for a few features specific to our use case:

*   Replayable/skippable queues: fast forward or rewind a queue to a specific point in time and re-process or skip jobs as needed
    
*   Per queue dead letter queues: Faktory had a global DLQ, which could sometimes make it difficult to sift through and clean up dead jobs if you didn't want to clear the whole DLQ.
    
*   Detailed metrics not available in Faktory Enterprise (e.g. segfaults, OOMKills, etc.)
    

All of these features were implemented using a combination of built-in Redis data structures including:

| **Data structure** | **Features** |
| --- | --- |
| [Sorted sets](https://redis.io/docs/latest/develop/data-types/sorted-sets/) | Delayed jobs, job uniqueness enforcement, dead letter queues, distributed semaphore |
| [Streams](https://redis.io/docs/latest/develop/data-types/streams/) | Replayable/skippable queues |
| [Hashes](https://redis.io/docs/latest/develop/data-types/hashes/) | Job metadata |
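As a concrete example of the sorted-set patterns above, here is a minimal in-memory sketch of delayed jobs. The class and method names are illustrative, not our production API; the comments note the Redis commands a real implementation would use:

```typescript
// Sketch of the delayed-jobs pattern. In Redis this is roughly:
// ZADD delayed <runAtMs> <jobId> to schedule, then
// ZRANGEBYSCORE delayed 0 <nowMs> plus ZREM (atomically, e.g. via a
// Lua script) to pop everything that has come due.
class DelayedJobSet {
  private scores = new Map<string, number>(); // jobId -> runAt (epoch ms)

  schedule(jobId: string, runAtMs: number): void {
    this.scores.set(jobId, runAtMs); // ZADD
  }

  popDue(nowMs: number): string[] {
    const due = [...this.scores.entries()]
      .filter(([, runAt]) => runAt <= nowMs)
      .sort((a, b) => a[1] - b[1]) // sorted-set ordering: by score
      .map(([jobId]) => jobId);
    for (const jobId of due) this.scores.delete(jobId); // ZREM
    return due;
  }
}
```

A worker loop would call `popDue(Date.now())` on an interval and move each returned job onto its destination queue.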

These features ultimately powered a job system where each job had the following possible state transitions:

![](https://storage.googleapis.com/papyrus_images/7cbb55c83d766c822b0ea6d9b918ecff.png)

Possible states for any given job

The difference between _retries exhausted_ and _dead_ is that dead jobs are saved for potential debugging and re-enqueuing (resurrection). **The vast majority of exhausted/dead jobs don't need to be preserved**, so the default is to simply delete a job once its retries are exhausted. This is a key difference from Faktory, which always kept a dead job unless the number of retries was explicitly set to 0.
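The retries-exhausted fork can be captured in a few lines. This is a hedged sketch with illustrative names (the real per-job configuration surface is richer):

```typescript
type FailureOutcome = 'retry' | 'delete' | 'dead';

interface RetryPolicy {
  maxRetries: number;
  // Opt in to preserving the job in the DLQ; the default is deletion.
  preserveOnExhaustion?: boolean;
}

// Decide what happens after a failed attempt: retry again, or take the
// retries-exhausted edge and either delete the job or move it to the DLQ.
function onFailure(attempt: number, policy: RetryPolicy): FailureOutcome {
  if (attempt < policy.maxRetries) return 'retry';
  return policy.preserveOnExhaustion ? 'dead' : 'delete';
}
```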

### Worth it?

Our current job system is fundamentally different from Faktory in that it allows for horizontal scaling and is highly available in cases of single node failures. While you could run multiple Faktory servers and partition your job queues among them, this is painful to manage and you still suffer from any one of those Faktory servers being a SPOF for their respective queues. Our current system handles all of this for us transparently, and we've resized our cluster multiple times as the size of our workload has increased.

![](https://storage.googleapis.com/papyrus_images/e7c74e19bf32e319fd81076a41d7aaf9.png)

A subset of metrics we track in the new job system

Where we previously had to limit the number of concurrent jobs the system processed, we can now scale out the cluster to support larger workloads by increasing the number of shards in our Redis Cluster: a single button click in the AWS console. This was not possible with Faktory, since it didn't support clustered Redis.

In short—for our specific use cases and what we needed from our job framework—the effort was worth it. That being said, we learned plenty during the course of implementation.

### Lessons Learned

[**redis-semaphore**](https://github.com/swarthy/redis-semaphore/commits/master/) **requires independent/standalone Redis servers, not a Redis Cluster**  
One of the most important tools we needed was a distributed semaphore to limit concurrent access for features like queue throttling. `redis-semaphore` is an excellent library that has a variety of working implementations, but we spent a long time debugging a strange issue only to realize that the Redlock algorithm it implements requires _independent_ nodes, which is captured in [this comment](https://github.com/swarthy/redis-semaphore/commit/019b15c7e5e40c48e36a381af1dfd27150d169d9). Once we switched from running 3 Redis nodes in clustered mode to 3 standalone Redis servers for our distributed semaphore, our mysterious race conditions were resolved.
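For reference, the working topology looks roughly like this. Hostnames and the semaphore key are illustrative, and this assumes the library's Redlock-style `RedlockSemaphore` API, which takes a list of clients pointing at independent servers:

```typescript
import Redis from 'ioredis';
import { RedlockSemaphore } from 'redis-semaphore';

// Three *standalone* Redis servers -- not three members of one Redis
// Cluster. The Redlock algorithm assumes the nodes are independent.
const nodes = [
  new Redis({ host: 'semaphore-1.internal', port: 6379 }),
  new Redis({ host: 'semaphore-2.internal', port: 6379 }),
  new Redis({ host: 'semaphore-3.internal', port: 6379 }),
];

// Allow at most 10 concurrent holders across all workers.
const semaphore = new RedlockSemaphore(nodes, 'throttle:example-queue', 10);

async function withThrottle<T>(fn: () => Promise<T>): Promise<T> {
  await semaphore.acquire();
  try {
    return await fn();
  } finally {
    await semaphore.release();
  }
}
```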

**Pipelining commands in Redis is a necessity if you want to increase throughput**  
Our earliest implementation skipped [pipelining](https://redis.io/docs/latest/develop/use/pipelining/) so we could get something working quickly. However, even with sub-millisecond latency between our workers and the Redis cluster, the round-trip time of executing each command serially rather than in a pipeline (batch) quickly becomes noticeable at high volume. Fortunately the `ioredis` library has an [ergonomic pattern](https://github.com/redis/ioredis?tab=readme-ov-file#pipelining) for expressing pipelines that we made heavy use of.
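The arithmetic makes the case: n serial commands pay n round trips, while a pipelined batch pays roughly one round trip per batch. A small illustration (the ioredis usage in the trailing comment follows the pipelining pattern linked above; the hash/field names are made up):

```typescript
// Network time spent issuing commands one at a time vs. in batches.
function serialMs(commands: number, rttMs: number): number {
  return commands * rttMs; // one round trip per command
}

function pipelinedMs(commands: number, rttMs: number, batchSize: number): number {
  return Math.ceil(commands / batchSize) * rttMs; // one round trip per batch
}

// Even at 0.5 ms RTT, 10,000 serial commands spend ~5 seconds on the
// network alone; batches of 100 cut that to ~50 ms. With ioredis:
//
//   const pipeline = redis.pipeline();
//   for (const job of jobs) pipeline.hset(`job:${job.id}`, 'state', 'enqueued');
//   await pipeline.exec(); // the whole batch in one round trip
```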

**Weighted round robin queueing is a simple way to avoid queue starvation**  
Faktory's queueing system made it possible to configure queues such that one queue could starve the others. For example, with two queues, `priority` and `default`, a single worker will always pull from the `priority` queue before the `default` queue, so a busy `priority` queue can starve `default` indefinitely. With weighted round robin queueing, each queue is assigned a weight and no queue is ever fully starved: you pull jobs from higher-weight queues more often, but jobs in lower-weight queues are guaranteed to eventually be processed.
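One way to implement this is the "smooth" weighted round robin scheme popularized by nginx; a self-contained sketch, with illustrative queue names and weights:

```typescript
interface WeightedQueue {
  name: string;
  weight: number;
}

// Smooth weighted round robin: every queue is eventually selected, and
// selections are interleaved in proportion to the configured weights.
class WeightedRoundRobin {
  private current = new Map<string, number>();
  private totalWeight: number;

  constructor(private queues: WeightedQueue[]) {
    this.totalWeight = queues.reduce((sum, q) => sum + q.weight, 0);
    for (const q of queues) this.current.set(q.name, 0);
  }

  next(): string {
    let best = this.queues[0];
    for (const q of this.queues) {
      const score = this.current.get(q.name)! + q.weight;
      this.current.set(q.name, score);
      if (score > this.current.get(best.name)!) best = q;
    }
    // The winner pays the total weight, so it cannot win forever.
    this.current.set(best.name, this.current.get(best.name)! - this.totalWeight);
    return best.name;
  }
}
```

With weights 3 and 1, any window of four picks selects `priority` three times and `default` once, so the lower-weight queue never starves.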

**Redis Streams are incredibly powerful, but require significant up-front thought**  
Redis gives you a high degree of control over how to process a stream of events. With that said, you need to think about what kind of delivery semantics you want to implement (_at least once_, _at most once_, etc.) so that you aren't surprised during exceptional scenarios (such as a segfault or OOMKill) which can lead to lost/stuck jobs. Account for failure at every step, and you won't be surprised.
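The distinction matters most when a worker dies mid-job. This toy in-memory model of a consumer group (names are illustrative; comments map each method to the Redis Streams command it stands in for) shows why acking *after* processing gives at-least-once delivery while acking *before* gives at-most-once:

```typescript
interface StreamEntry {
  id: string;
  payload: string;
}

// Toy in-memory model of a Redis Streams consumer group.
class FakeConsumerGroup {
  private entries: StreamEntry[] = [];
  private pending = new Map<string, StreamEntry>(); // the PEL
  private nextId = 0;

  add(payload: string): void {
    this.entries.push({ id: `${++this.nextId}-0`, payload }); // XADD
  }

  read(): StreamEntry | undefined {
    const entry = this.entries.shift(); // XREADGROUP: deliver and track in PEL
    if (entry) this.pending.set(entry.id, entry);
    return entry;
  }

  ack(id: string): void {
    this.pending.delete(id); // XACK
  }

  reclaimable(): StreamEntry[] {
    return [...this.pending.values()]; // XAUTOCLAIM: redeliver stuck entries
  }
}

// Process one entry, acking either before (at-most-once) or after
// (at-least-once) the handler runs. A thrown error stands in for a
// segfault or OOMKill mid-job.
function processOne(
  group: FakeConsumerGroup,
  handler: (entry: StreamEntry) => void,
  ackFirst: boolean,
): void {
  const entry = group.read();
  if (!entry) return;
  if (ackFirst) group.ack(entry.id); // a crash after this point loses the job
  try {
    handler(entry);
    if (!ackFirst) group.ack(entry.id); // ack only once the work is done
  } catch {
    // Worker "died"; with ack-after, the entry stays pending for reclaim.
  }
}
```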

---

*Originally published on [Shane da Silva](https://paragraph.com/@sds.eth/building-a-job-processing-system)*
