# So you want to build a job processing system?

> Sharing lessons learned during a migration from Faktory Enterprise

**Published by:** [Shane da Silva](https://paragraph.com/@sds.eth/)
**Published on:** 2025-01-08
**URL:** https://paragraph.com/@sds.eth/building-a-job-processing-system

## Content

We're building Farcaster, a decentralized social network. This post discusses the job system we now use to power Warpcast, the most popular client in the Farcaster ecosystem.

We migrated from Faktory Enterprise to our own background job processing system in August 2024. Our two primary motivations were:

1. Faktory was beginning to exhibit instability, resulting in brief periods of downtime at the loads we were subjecting it to.
2. As a single point of failure, any issue would have a user-visible impact (delayed feeds, etc.), and Faktory cannot (yet) run in a highly available configuration.

*Our Faktory dashboard right before decommission. Over 9 billion jobs processed!*

Faktory had served us incredibly well since our adoption of it in January 2022. We're grateful to Mike Perham, who has been pioneering this space since the launch of Sidekiq in 2012. We highly recommend you use a tool like Faktory before building your own job system.

With that said, suppose you've come to the conclusion that you need to build your own job processing solution. The rest of this post discusses the factors we considered and the lessons we learned along the way.

## Alternatives Considered

Before making this decision, we surveyed the background job system landscape circa H1 2024. The available options fell into roughly three categories:

**SaaS providers:** AWS SQS, Google Cloud Tasks, Cloudflare Worker Queues, etc. Rejected because we wanted to minimize vendor lock-in.

**Ready-to-go open source frameworks:** BullMQ, Celery, etc. We almost went with BullMQ, but ultimately rejected it because it requires different queue names per shard rather than transparently managing shards on your behalf.
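To make the distinction concrete: "transparent" sharding means the framework, not the caller, decides which physical shard a job lands on, while callers only ever see logical queue names. A toy sketch of that idea (all names here are hypothetical, and this is an illustration rather than any library's actual routing code):

```typescript
// Toy illustration of "transparent" shard routing: callers enqueue to a
// logical queue name, and the framework maps each job to a physical shard.
function hashString(s: string): number {
  // Simple FNV-1a hash; a real system might use CRC16 (as Redis Cluster does).
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// The physical key a job lands on, derived from its id so load spreads
// across shards while callers only ever see the logical queue name.
function shardKeyFor(queue: string, jobId: string, shardCount: number): string {
  const shard = hashString(jobId) % shardCount;
  return `queue:{${shard}}:${queue}`;
}
```

The hash-tag braces in the key mirror the Redis Cluster convention for pinning a key to a slot; the point is simply that this mapping lives inside the framework instead of being each user's problem.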
Maintaining your own sharding system doesn't seem like the right level of abstraction for a job processing framework to impose on its users.

**Build your own framework** on top of open source tools such as Redis, RabbitMQ, Kafka, etc.

We opted to build our own solution with Node.js on top of Redis because we felt it would give us the most control and the greatest flexibility going forward, and it would integrate tightly with our existing TypeScript codebase. We were also able to use very mature libraries like ioredis for interacting with Redis from Node.js.

## Features

At a high level, our new system supports all the Faktory features that we made heavy use of:

- Delayed jobs (schedule a job to start at a specific time)
- Periodic jobs defined via cron syntax
- Per-job retry configuration logic
- Dead letter queue
- Queue throttling / rate-limiting
- Job uniqueness enforcement
- Job expiration
- Job execution timeouts

We also added support for a few features specific to our use case:

- **Replayable/skippable queues:** fast forward or rewind a queue to a specific point in time and re-process or skip jobs as needed
- **Per-queue dead letter queues:** Faktory had a global DLQ, which could sometimes make it difficult to sift through and clean up dead jobs if you didn't want to clear the whole DLQ.
- **Detailed metrics** not available in Faktory Enterprise (e.g. segfaults, OOMKills, etc.)

All of these features were implemented using a combination of built-in Redis data structures:

| Data structure | Features |
| --- | --- |
| Sorted sets | Delayed jobs, job uniqueness enforcement, dead letter queues, distributed semaphore |
| Streams | Replayable/skippable queues |
| Hashes | Job metadata |

These features ultimately powered a job system where each job had the following possible state transitions:

*Possible states for any given job*

The difference between retries exhausted and dead is that dead jobs are saved for potential debugging and re-enqueuing (resurrection).
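A per-queue DLQ maps naturally onto one sorted set per queue (e.g. a key like `dlq:{queueName}`), scored by time of death so entries can be inspected in order. A minimal in-memory sketch of the idea, with a plain array standing in for the sorted set (hypothetical names, not our actual implementation):

```typescript
// Minimal in-memory model of per-queue dead letter queues. In the real
// system each DLQ would be a Redis sorted set scored by time of death.
type DeadJob = { id: string; payload: string; diedAt: number };

class DeadLetterQueues {
  private dlqs = new Map<string, DeadJob[]>();

  // Record a job whose retries are exhausted in its own queue's DLQ.
  kill(queue: string, id: string, payload: string, diedAt: number): void {
    const dlq = this.dlqs.get(queue) ?? [];
    dlq.push({ id, payload, diedAt });
    dlq.sort((a, b) => a.diedAt - b.diedAt); // sorted-set ordering by score
    this.dlqs.set(queue, dlq);
  }

  // Inspect one queue's dead jobs without touching any other queue's DLQ.
  list(queue: string): DeadJob[] {
    return this.dlqs.get(queue) ?? [];
  }

  // Resurrect a dead job: remove it from the DLQ and hand it back to the
  // caller for re-enqueueing.
  resurrect(queue: string, id: string): DeadJob | undefined {
    const dlq = this.dlqs.get(queue) ?? [];
    const idx = dlq.findIndex((j) => j.id === id);
    if (idx === -1) return undefined;
    return dlq.splice(idx, 1)[0];
  }
}
```

Because each queue owns its DLQ, you can sift through or clear one queue's dead jobs without wading through every other queue's failures, which was the pain point with a single global DLQ.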
The vast majority of exhausted/dead jobs don't need to be preserved, so the default is to simply delete a job once its retries have been exhausted. This is a key difference between Faktory and the new system—Faktory always kept a dead job unless the number of retries was explicitly set to 0.

## Worth it?

Our current job system is fundamentally different from Faktory in that it allows for horizontal scaling and is highly available in the face of single-node failures. While you could run multiple Faktory servers and partition your job queues among them, this is painful to manage, and you still suffer from any one of those Faktory servers being a SPOF for its respective queues. Our current system handles all of this for us transparently, and we've resized our cluster multiple times as our workload has grown.

*A subset of metrics we track in the new job system*

Where we previously had to limit the number of concurrent jobs we processed with the system, now we can continue to scale out our cluster to support larger workloads by simply increasing the number of shards in our Redis Cluster: a simple button click in the AWS console. This was not possible with Faktory, since it didn't support clustered Redis.

In short—for our specific use cases and what we needed from our job framework—the effort was worth it. That said, we learned plenty during the course of implementation.

## Lessons Learned

### redis-semaphore requires independent/standalone Redis servers, not a Redis Cluster

One of the most important tools we needed was a distributed semaphore to limit concurrent access for features like queue throttling. redis-semaphore is an excellent library with a variety of working implementations, but we spent a long time debugging a strange issue only to realize that the Redlock algorithm it implements requires independent nodes, which is captured in this comment.
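The underlying reason is that Redlock attempts the lock on each of N servers independently and only considers it held once a strict majority grants it; a Redis Cluster replicates keys between nodes, which breaks that independence assumption. A toy sketch of the majority-acquire step, with stub in-memory stores standing in for standalone Redis servers (this is not the library's actual code):

```typescript
// Toy model of Redlock's majority-acquire step: the lock is held only if
// a strict majority of *independent* servers granted it.
type LockStore = Map<string, string>;

// Attempt the equivalent of "SET key owner NX" against one stub server.
function trySetNx(store: LockStore, key: string, owner: string): boolean {
  if (store.has(key)) return false;
  store.set(key, owner);
  return true;
}

// Acquire the lock only if a majority of servers grant it; otherwise roll
// back any partial grants, as Redlock does on a failed attempt.
function acquireRedlock(servers: LockStore[], key: string, owner: string): boolean {
  const granted = servers.filter((s) => trySetNx(s, key, owner));
  if (granted.length > servers.length / 2) return true;
  for (const s of granted) s.delete(key); // release partial grants
  return false;
}
```

If the "servers" were really cluster nodes replicating one another, a single write could count as multiple grants, and two clients could each believe they held a majority—which matches the kind of race condition we were seeing.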
Once we switched from running 3 Redis nodes in clustered mode to 3 standalone Redis servers for our distributed semaphore, our mysterious race conditions were resolved.

### Pipelining commands in Redis is a necessity if you want to increase throughput

Our earlier implementations did not use pipelining so that we could get a working implementation quickly. However, even with sub-millisecond latency between our workers and the Redis cluster, the round-trip time of executing each command serially rather than in a pipeline (batch) quickly becomes noticeable at high volume. Fortunately, the ioredis library has an ergonomic pattern for expressing pipelines that we made heavy use of.

### Weighted round robin queueing is a simple way to avoid queue starvation

Faktory's queueing system made it possible to express queue priorities such that one queue would starve other queues. For example, with two queues, priority and default, a single worker will always pull from the priority queue before the default queue. With weighted round robin queueing, each queue is assigned a weight such that no queue is ever fully starved: you pull jobs from higher-weight queues more often than from lower-weight queues, but you are guaranteed to eventually process jobs from the lower-weight queues.

### Redis Streams are incredibly powerful, but require significant up-front thought

Redis gives you a high degree of control over how to process a stream of events. With that said, you need to think about what kind of delivery semantics you want to implement (at least once, at most once, etc.) so that you aren't surprised by exceptional scenarios (such as a segfault or OOMKill) that can lead to lost or stuck jobs. Account for failure at every step, and you won't be surprised.
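The weighted round robin selection described above can be sketched as a fixed rotation expanded from the queue weights (a simplified illustration with hypothetical names, not our production scheduler):

```typescript
// Weighted round robin queue selection: expand each queue name into a
// rotation according to its weight, then cycle through it. A queue with
// weight 3 is polled three times as often as a queue with weight 1, but
// every queue appears in the rotation, so none is ever starved.
class WeightedRoundRobin {
  private rotation: string[] = [];
  private index = 0;

  constructor(weights: Record<string, number>) {
    for (const [queue, weight] of Object.entries(weights)) {
      for (let i = 0; i < weight; i++) this.rotation.push(queue);
    }
  }

  // Returns the next queue a worker should poll for jobs.
  next(): string {
    const queue = this.rotation[this.index];
    this.index = (this.index + 1) % this.rotation.length;
    return queue;
  }
}
```

Contrast this with strict priority: here a worker polling `{ priority: 3, default: 1 }` still reaches the default queue once per cycle no matter how full the priority queue is, which is exactly the no-starvation guarantee.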