Should you write your own database?

April 09, 2026

We run a data pipeline that ingests very large product snapshots, turns them into downstream feed artifacts, and then keeps those feeds fresh with smaller incremental updates. The pipeline has a weekly "start from a new baseline" phase and an ongoing incremental phase layered on top of it.

At a high level, the workflow looks like this:

download very large source snapshots
process them into a normalized internal representation
persist the latest known state for each product
build output partitions and feed fragments from that state
continue applying smaller incremental updates until the next full weekly restart

For a long time, the persistence layer in the middle looked like the boring part of the system. It was "just" a key-value store holding the latest state for a product, and the interesting work seemed to be in downloading, parsing, feed partitioning, and publishing.

That assumption turned out to be wrong.

As the system grew, the persistence layer became the dominant bottleneck during weekly catch-up. We reached the point where a straightforward question mattered more than anything else:

Can the storage engine absorb a huge write-heavy bootstrap, continue serving point reads, and finish the whole cycle inside 24 hours at the scale we need?

The answer, after multiple iterations, was no. That is what led us to build seqs.

This post explains:

the workload that forced the issue
why our early Postgres-based approach was not enough
why RocksDB still did not fit the shape of the problem
what operational changes we tried first
why we ultimately chose to build a custom store
what seqs is designed to do
and, equally important, what it is explicitly not designed to do

This is a detailed engineering write-up, but it is intentionally not a step-by-step implementation guide. The goal is to explain the reasoning and architecture clearly enough that an experienced systems engineer can understand the design space, without publishing an exact blueprint. Why? Because you are unlikely to have exactly the same requirements we do. We designed this specifically to solve our needs, not to promote it as the way you should build yours. That said, if you find yourself facing a similar issue, I'd love to hear about it; maybe the pattern is more common than I know.

The Actual Workload

The easiest way to reason about a storage engine is to start with the real access pattern, not the API names.

Our pipeline has a very specific shape. Once per week, we effectively restart from a new baseline.

That means:

downloading large source artifacts
processing them into product records
updating the latest known state for each product
generating large downstream outputs from that state

This phase is dominated by:

sustained bulk writes
very high sequential input throughput
very large total byte volume
backfills that may continue for many hours

Now, when I say very large, I'm talking about 6+ TB of raw source input data.
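
As a rough back-of-envelope check, the input volume and the 24-hour window together put a floor under the average throughput the pipeline must sustain. The sketch below assumes exactly 6 TB (decimal, 10^12 bytes per TB) and a single pass over the data. Both are simplifying assumptions, and the function name is purely illustrative.

```python
def required_mb_per_second(total_bytes: float, window_hours: float) -> float:
    """Average rate in MB/s (1 MB = 10**6 bytes) needed to move total_bytes within the window."""
    return total_bytes / (window_hours * 3600) / 1e6


# Assumed figures: 6 TB of input (10**12 bytes per TB) and a 24-hour window.
floor_rate = required_mb_per_second(6e12, 24)
print(f"{floor_rate:.1f} MB/s")  # prints "69.4 MB/s"
```

That figure only covers touching each input byte once. Every additional time a byte is rewritten further down the pipeline raises the sustained rate the disks have to deliver.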

Once the weekly baseline is sufficiently complete, smaller incremental updates begin flowing again. Those incremental updates:

modify a subset of records
often touch the same keys repeatedly
still need point lookups into the latest state
still need to feed downstream output generation

Crucially, we do not need broad query support in this storage layer. We do not need:

range scans
secondary indexes
joins
ad hoc querying
analytical access
historical time travel

What we need is much simpler:

upsert a record by key
fetch a record by key
reset the entire market or epoch cheaply
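
Expressed as an interface, that contract is very small. The sketch below is hypothetical: the names `SnapshotStore`, `upsert`, `get`, and `reset` are illustrative and not the actual API, and the dict-backed class exists only to pin down the semantics (last write wins, reset drops everything).

```python
from typing import Dict, Optional, Protocol


class SnapshotStore(Protocol):
    """Hypothetical interface: the only three operations this layer needs."""

    def upsert(self, key: bytes, value: bytes) -> None: ...

    def get(self, key: bytes) -> Optional[bytes]: ...

    def reset(self) -> None: ...


class InMemorySnapshotStore:
    """Toy dict-backed implementation, useful only to illustrate the semantics."""

    def __init__(self) -> None:
        self._latest: Dict[bytes, bytes] = {}

    def upsert(self, key: bytes, value: bytes) -> None:
        # Last write wins: only the latest value per key is retained.
        self._latest[key] = value

    def get(self, key: bytes) -> Optional[bytes]:
        return self._latest.get(key)

    def reset(self) -> None:
        # Epoch reset: drop the whole state at once instead of deleting key by key.
        self._latest = {}


store = InMemorySnapshotStore()
store.upsert(b"sku-1", b"v1")
store.upsert(b"sku-1", b"v2")
latest_before_reset = store.get(b"sku-1")
store.reset()
latest_after_reset = store.get(b"sku-1")
```

Note what is absent: no iteration, no range query, no versions. Anything not in this interface is something the storage layer does not have to pay for.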

That is a very different problem from "build a general-purpose embedded database". The weekly baseline changes the lifecycle model completely.

This is not an unbounded forever-growing data store where everything must be retained and compacted indefinitely. The system has a natural epoch boundary:

an old epoch becomes irrelevant
the new epoch becomes authoritative
large parts of the old state can be dropped aggressively

That single fact matters enormously. Many general-purpose database tradeoffs assume long-lived continuity. Our system does not always need that.

The First Attempt: Postgres

The first serious version of the control plane was built around Postgres. That was the right choice for coordination. Postgres is excellent for:

job state
leases
checkpoints
retries
auditability
run tracking
operational introspection

In other words, Postgres was and remains a very good control-plane database. What it was not good at, for this workload, was serving as the hot-path per-product state store during the weekly bootstrap.

The problem was not that Postgres is "slow". The problem was mismatch. We were asking a relational database to behave like a high-throughput, point-oriented, write-heavy snapshot store in the middle of a pipeline that:

rewrites the latest value for keys frequently
has enormous ingest bursts
needs low-overhead point lookups
and has destructive weekly reset semantics

Postgres can absolutely store key-value-like data, but when the hot path becomes "rewrite huge volumes of latest state as fast as the disks allow", it stops being the right abstraction. We needed Postgres for orchestration. We did not want Postgres to become the write amplification engine for the entire feed pipeline. So we separated concerns:

Postgres for coordination and durability of workflow state
a dedicated product snapshot store for the hot path

That led to the next iteration.

The Second Attempt: RocksDB

RocksDB was the obvious next candidate. On paper, it matched a lot of what we needed:

embedded
good write throughput
good point lookups
battle-tested
operationally familiar to systems engineers

And compared to pushing all hot-path state through Postgres, it was absolutely an improvement. We got:

lower latency point reads
much better locality for state access
more control over the write path
a storage engine designed for high-throughput key-value workloads

It was a rational choice. It still wasn't the right one.

RocksDB is a general-purpose LSM engine. That brings real strengths, but it also brings a cost model. That cost model did not align with our actual workload. We did not need ordered iteration over keys. We did not need broad scan performance. We did not need multi-version semantics inside the store.

We needed:

latest value by key
high sustained write throughput
point reads
periodic destructive reset

An LSM tree still maintains sorted structures because that is what it is built to do. We were paying for those properties even when they gave us little value.

Write amplification was the more painful issue. At the application layer, we were already doing expensive updates:

loading the current record
merging changes
rewriting the entire logical payload

Then the storage layer added its own amplification:

WAL writes
memtable flushes
SST creation
compaction
rewrite of data multiple times across levels
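
To see how the two layers compound, here is a rough arithmetic sketch. All numbers are hypothetical and chosen only for illustration: a 200-byte change to a 4 KiB record, and an assumed storage-level factor of 10 physical bytes per logical byte. Neither figure is a measurement from our system.

```python
def combined_amplification(changed_bytes: int, payload_bytes: int, storage_factor: float) -> float:
    """Physical bytes written per byte of actual change.

    The application factor is payload_bytes / changed_bytes, because the whole
    payload is rewritten on every update. storage_factor is the number of
    physical bytes the engine writes per logical byte (an assumed input here).
    """
    app_factor = payload_bytes / changed_bytes
    return app_factor * storage_factor


# Hypothetical numbers, for illustration only.
total = combined_amplification(changed_bytes=200, payload_bytes=4096, storage_factor=10.0)
print(f"{total:.1f}x")  # prints "204.8x"
```

The point is that the two factors multiply rather than add, so a modest cost at each layer becomes a large one end to end.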

That combination was tolerable at smaller scale. It became deeply painful at larger scale. LSM systems are often at their best when they can smooth a write-heavy workload over time and compact continuously in the background. Our weekly bootstrap did the opposite.

It delivered:

huge bursts of sustained writes
immediate pressure on downstream processing
continued point reads while write load was high
a need to drain the entire backlog inside a deadline window

In that environment, the hidden costs of compaction stop being background details and start becoming the dominant problem. The failure mode was not theoretical. We saw it in production runs:

growing compaction debt
widening write queues
increasing backpressure
higher read latencies during heavy load
WAL growth becoming operationally important
storage pressure shifting from one layer to another as we tuned around bottlenecks

At that point, we stopped thinking about RocksDB as a tuning problem and started treating it as a workload fit problem.

We Did Try To Engineer Around It

Before deciding to build something custom, we tried to make the system more survivable operationally. One of the biggest changes was physical disk separation. When a pipeline mixes:

large sequential downloads
a hot embedded database
write-ahead logging
lots of small churn-heavy files
generated outputs waiting for upload

the storage subsystem quickly becomes a noisy-neighbour problem. So we spread different classes of I/O across different devices:

a large volume for spool and input artifacts
a separate large volume for the snapshot database itself
a hot-churn volume for small create/delete-heavy paths
local disk for the write-ahead log

This helped. It definitely helped. It reduced interference between:

long sequential reads and writes
compaction-related random I/O
small-file churn
and WAL writes

It also made the system easier to reason about operationally. When one class of storage pressure rose, we could see where it was happening.

But it did not solve the root problem. All we had really done was give an LSM engine a cleaner battlefield. We had not changed the fact that we were still asking it to do the wrong job.

We also spent time tuning:

write batch sizes
flush intervals
compaction-related knobs
worker parallelism
backlog control
queue capacities

Again, those changes mattered. They made the system less fragile and increased the ceiling somewhat. But they still left us fighting the same structural issue: the storage engine was doing too much work relative to what our application semantics required.

Once we stopped thinking in terms of "which database should we use?" and started thinking in terms of "what does this layer actually have to do?", the answer became much clearer. The product snapshot layer did not need to be a database in the broad sense. It needed to be a fast, durable-enough, point-oriented state cache with epoch reset semantics. That is a much smaller thing. The real question became:

If our writes are naturally sequential, and our reads are point lookups, why are we forcing the storage engine to constantly reshape data into sorted structures and compact it?

Once framed that way, the architecture for seqs became obvious.

What `seqs` Is

seqs stands for Sequential Store.

The name is intentionally plain because the idea is plain:

write large amounts of data sequentially
keep a small index for point reads
avoid doing any work we do not absolutely need

seqs is not meant to be a general-purpose embedded database.

It is a deliberately narrow storage component built around our exact hot path. At a conceptual level, it has three parts:

an append-oriented value log
a compact key-to-location index
a small amount of checkpointing metadata for restart and recovery

That is the essence of it. The pipeline already receives data in a way that wants to be written sequentially. The expensive part is not discovering some arbitrarily located page to update in place. The expensive part is absorbing a firehose of records and getting them durably into a current-state representation fast enough. That strongly favours:

append-oriented storage
large sequential writes
minimal mutation in the storage layer
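
A minimal sketch of that general shape follows, assuming a single file, an in-memory dict as the index, and a made-up record layout (two little-endian 32-bit lengths, then key, then value). This is not the seqs file format or algorithm, only the generic append-log-plus-hash-index pattern. It deliberately omits fsync, checksums, sharding, and recovery.

```python
import os
import struct
import tempfile
from typing import Dict, Optional, Tuple

# Hypothetical record layout: key length, value length, key bytes, value bytes.
_HEADER = struct.Struct("<II")


class AppendLogStore:
    """Toy append-only value log with an in-memory key-to-location index."""

    def __init__(self, path: str) -> None:
        # "w+b" truncates: this sketch always starts from an empty log.
        self._file = open(path, "w+b")
        self._end = 0  # offset where the next record will be appended
        self._index: Dict[bytes, Tuple[int, int]] = {}  # key -> (value offset, value length)

    def upsert(self, key: bytes, value: bytes) -> None:
        # Writes only ever go to the end of the log; nothing is updated in place.
        self._file.seek(self._end)
        self._file.write(_HEADER.pack(len(key), len(value)) + key + value)
        value_offset = self._end + _HEADER.size + len(key)
        self._end = value_offset + len(value)
        # The index keeps only the latest location; older versions become dead bytes.
        self._index[key] = (value_offset, len(value))

    def get(self, key: bytes) -> Optional[bytes]:
        location = self._index.get(key)
        if location is None:
            return None
        offset, length = location
        self._file.seek(offset)
        return self._file.read(length)

    def close(self) -> None:
        self._file.close()


with tempfile.TemporaryDirectory() as tmp:
    store = AppendLogStore(os.path.join(tmp, "values.log"))
    store.upsert(b"sku-1", b"v1")
    store.upsert(b"sku-1", b"v2")  # same key again: appended, not overwritten
    latest = store.get(b"sku-1")
    missing = store.get(b"sku-2")
    store.close()
```

Dead bytes from superseded versions are never cleaned up in this sketch. In a store with a bounded epoch lifetime, that can be an acceptable trade, because the whole log is discarded at the next reset.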

When a record is needed later, we do not ask:

"give me the next thousand keys"
"scan a range"
"order by this field"

We ask:

"give me the latest record for this key"

That is a hash index problem, not a B-tree problem and not an LSM compaction problem.

One of the biggest advantages of a narrow store is that reset stops being a delicate operation. In a general-purpose engine, resets are often really a kind of managed mutation:

drop this
preserve that
compact around the change
clean up background state

In our case, weekly reset is conceptually closer to:

stop
discard
start fresh
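
One way to picture this, assuming each epoch lives in its own directory, is the sketch below. The directory naming (`epoch-<id>`) and both function names are hypothetical, not a description of how seqs lays out its files.

```python
import shutil
import tempfile
from pathlib import Path


def start_epoch(root: Path, epoch_id: str) -> Path:
    """Create a fresh, empty directory for the new epoch."""
    epoch_dir = root / f"epoch-{epoch_id}"
    epoch_dir.mkdir(parents=True, exist_ok=False)
    return epoch_dir


def discard_old_epochs(root: Path, current: Path) -> None:
    """Delete every epoch directory except the current one."""
    for child in root.iterdir():
        if child.is_dir() and child.name.startswith("epoch-") and child != current:
            shutil.rmtree(child)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    old = start_epoch(root, "week-14")
    (old / "values.log").write_bytes(b"stale state")
    new = start_epoch(root, "week-15")
    discard_old_epochs(root, current=new)
    remaining = sorted(p.name for p in root.iterdir())
```

Reset here is a directory removal, not a sequence of per-key deletes followed by cleanup.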

That becomes much easier when the store layout matches the epoch model.

What `seqs` Is Not

This matters as much as what it is.

seqs is not:

a transactional database
an analytics store
a query engine
a long-term history store
a multi-index document store
a system designed to answer arbitrary future questions

That is deliberate. Many custom storage projects go wrong because they start as "one narrow thing" and gradually absorb features until they become a weaker version of an existing database. We do not want that. The strength of seqs depends on saying no to almost everything.

Without describing exact file formats or algorithms, the design follows a few simple principles.

1. Treat the latest value as the only value that matters

Within a given epoch, old versions of a record are only useful insofar as they help us recover the latest value after a crash. They do not need to remain part of the active query model. That removes a large class of complexity.

2. Keep writes sequential

The write path should look as much like "append to a log" as possible. That means:

fewer small random writes
less page churn
simpler durability reasoning
better fit for the actual storage hardware

3. Keep the random-read structure small

The random-read structure should be tiny relative to the value volume. The index should answer one question efficiently:

where is the latest payload for this key?

Nothing more.
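
A quick way to sanity-check "tiny relative to the value volume" is to compare the two sizes directly. The numbers below are hypothetical (1 billion keys, 16 bytes per index entry, 6 TB of values) and are not figures from our system; the function name is illustrative.

```python
def index_share(num_keys: int, bytes_per_entry: int, value_bytes: int) -> float:
    """Index size as a fraction of total value volume."""
    return (num_keys * bytes_per_entry) / value_bytes


# Hypothetical: 1 billion keys, 16 bytes per index entry, 6 TB of values.
share = index_share(num_keys=1_000_000_000, bytes_per_entry=16, value_bytes=6 * 10**12)
print(f"{share:.2%}")  # prints "0.27%"
```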

4. Separate durability bookkeeping from query behavior

Recovery and query do not have to be the same thing. A checkpoint can exist to make restart cheap without forcing the active write path to behave like a general-purpose page database.
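
To illustrate the idea in general terms: if a checkpoint records an index snapshot plus the log offset it covers, then recovery only has to replay the log from that offset. The sketch below uses an assumed record layout (two little-endian 32-bit lengths, then key, then value) and an in-memory byte string in place of a file. It is not the seqs format, and a real implementation would also want checksums.

```python
import struct
from typing import Dict, Tuple

_HEADER = struct.Struct("<II")  # hypothetical layout: key length, value length
Index = Dict[bytes, Tuple[int, int]]  # key -> (value offset, value length)


def encode_record(key: bytes, value: bytes) -> bytes:
    return _HEADER.pack(len(key), len(value)) + key + value


def replay(log: bytes, index: Index, start: int = 0) -> int:
    """Apply complete records in log[start:] to index; return the offset reached."""
    pos = start
    while pos + _HEADER.size <= len(log):
        key_len, value_len = _HEADER.unpack_from(log, pos)
        key_start = pos + _HEADER.size
        value_start = key_start + key_len
        end = value_start + value_len
        if end > len(log):
            break  # incomplete trailing record (e.g. crash mid-write): stop here
        index[log[key_start:value_start]] = (value_start, value_len)  # last write wins
        pos = end
    return pos


# Write two records, take a checkpoint, then write one more plus a torn tail.
log = encode_record(b"a", b"1") + encode_record(b"b", b"2")
checkpoint_index: Index = {}
checkpoint_offset = replay(log, checkpoint_index)
log += encode_record(b"a", b"3") + encode_record(b"c", b"4")[:5]

# Recovery: start from the checkpoint and replay only the tail of the log.
recovered = dict(checkpoint_index)
recovered_end = replay(log, recovered, start=checkpoint_offset)

# Reference: a full replay from offset 0 produces the same index.
full: Index = {}
replay(log, full)
```

The checkpoint changes how much log must be replayed on restart, not how the write path or the point-read path behaves.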

5. Let the epoch boundary do real work

The weekly epoch boundary is not just a logical concept. It is an opportunity to simplify the storage engine:

clean reset
bounded lifetime
bounded recovery horizon
fewer forever-growing structures

That is one of the most important reasons a narrow custom design is viable here.

Why We Did Not Want More "Intermediate Steps"

There is a perfectly reasonable conservative roadmap for problems like this:

tune the current engine more
try a different engine in the same family
add another abstraction layer
maybe separate blobs from indexes
maybe do another benchmarking round

That kind of incrementalism is often correct. It was not correct for us. We had already learned enough from operating the current system:

where the throughput collapsed
what the scaling limits looked like
how much compaction and amplification cost us
how much operational complexity we were taking on just to keep the database alive

At some point the question stops being "can we squeeze more out of this?" and becomes "are we solving the wrong storage problem?". We believed the answer was yes.

So instead of spending more time finding the least bad general-purpose solution, we chose to build the right narrow one.

Instrumentation Was a Requirement, Not a Nice-to-Have

One of the lessons from the RocksDB phase was that vague intuition is not enough. Storage performance problems hide behind each other. If you are not careful, you can easily confuse:

serialization cost
queueing cost
disk bandwidth
checkpoint cost
index lookup cost
downstream pipeline pressure

So with seqs, we made observability part of the design from the start.

The new store needs to make it easy to answer questions like:

how many rows per second are we really writing?
how many bytes are we putting on disk relative to input bytes?
how much time is spent waiting to enqueue versus actually writing?
what are point-read latencies during heavy ingest?
how much of the cost is storage versus payload decoding?
how evenly is the workload distributed across shards?
how long does recovery take after an abrupt restart?
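
Several of these questions reduce to a handful of counters and ratios. The sketch below is a hypothetical illustration: the class, field, and function names are assumptions, and the sample values are invented. It shows how rows per second, disk-to-input byte ratio, enqueue-wait share, and shard skew could be derived from per-batch measurements.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class IngestStats:
    rows: int = 0
    input_bytes: int = 0
    disk_bytes: int = 0
    enqueue_wait_s: float = 0.0
    write_s: float = 0.0

    def record_batch(self, rows: int, input_bytes: int, disk_bytes: int,
                     enqueue_wait_s: float, write_s: float) -> None:
        self.rows += rows
        self.input_bytes += input_bytes
        self.disk_bytes += disk_bytes
        self.enqueue_wait_s += enqueue_wait_s
        self.write_s += write_s

    def rows_per_second(self, elapsed_s: float) -> float:
        return self.rows / elapsed_s if elapsed_s > 0 else 0.0

    def disk_to_input_ratio(self) -> float:
        # Bytes put on disk per input byte; 1.0 means no extra physical writes.
        return self.disk_bytes / self.input_bytes if self.input_bytes else 0.0

    def enqueue_wait_fraction(self) -> float:
        # Share of time spent waiting to enqueue rather than actually writing.
        busy = self.enqueue_wait_s + self.write_s
        return self.enqueue_wait_s / busy if busy else 0.0


def shard_skew(rows_per_shard: Sequence[int]) -> float:
    """Busiest shard's row count divided by the mean; 1.0 means perfectly even."""
    mean = sum(rows_per_shard) / len(rows_per_shard)
    return max(rows_per_shard) / mean if mean else 0.0


# Invented sample values, for illustration only.
stats = IngestStats()
stats.record_batch(rows=1000, input_bytes=4_000_000, disk_bytes=4_400_000,
                   enqueue_wait_s=0.5, write_s=1.5)
stats.record_batch(rows=1000, input_bytes=4_000_000, disk_bytes=4_400_000,
                   enqueue_wait_s=0.5, write_s=1.5)
```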

That level of instrumentation is not optional. If the store cannot explain itself under load, it is not production-ready.

Operational Lessons That Shaped `seqs`

Even before seqs, the earlier versions of the pipeline taught us several important lessons.

1. Disk layout matters more than most people expect

At this scale, "the disk" is not one thing.

Different I/O classes compete badly:

long sequential artifact reads
embedded store writes
WAL traffic
churn-heavy temp and writer-state paths
upload staging

Moving those onto separate devices did not solve the core architectural mismatch, but it made the failure modes legible. That mattered, and seqs inherits that lesson. Even a much better-shaped store still needs an intentional disk layout.

2. Weekly reset semantics should be embraced, not worked around

Once we accepted that the pipeline really does restart from a new baseline every week, a lot of "database" assumptions stopped helping. Reset is not a special edge case. It is a first-class lifecycle event. That changes how you think about:

retention
compaction
recovery
cleanup
directory layout

3. The simplest possible storage semantics are often the best

If the caller only needs:

latest value by key
point lookup
hard reset

then the storage engine should probably only do that. Every extra capability comes with an implementation cost and an operational cost, even if it is rarely used.

What We Expect `seqs` To Fix

The design is intended to address a specific set of problems we hit with RocksDB.

Reduce write amplification

This is the biggest one.

By aligning the storage layout with append-heavy semantics, we expect to reduce the amount of physical work required per logical update.

Reduce compaction-style background pressure

By avoiding an LSM design entirely in this layer, we expect to remove one of the main sources of long-tail write cost and unpredictable background amplification.

Make reset cheap and explicit

Weekly reset should feel natural in the storage model, not like a forced administrative trick.

Improve disk predictability

Append-oriented growth and explicit reset are much easier to reason about than a system whose background maintenance behavior can dominate steady-state performance.

Preserve fast point reads

The store still has to serve the rest of the pipeline while ingest is happening. Fast point reads remain a hard requirement.

What We Are Deliberately Leaving For Later

We are intentionally not trying to solve every future optimization at once. One example is metadata churn. In our current pipeline, some updates are really "state metadata" updates rather than full product payload changes. That is an important optimization opportunity, and we may separate that kind of metadata later. But we are not starting there. The first job of seqs is to win the main throughput battle:

absorb writes fast
keep point reads cheap
recover safely
reset cleanly

If it does that, there are obvious second-order optimizations we can add later. If it does not do that, no amount of refinement around the edges will matter.

The Risks

A custom store is not free. The risks are real:

we lose the maturity of a battle-tested embedded engine
correctness bugs become our responsibility
crash recovery bugs become our responsibility
we have to be disciplined enough not to let the scope expand

Those are serious costs, but the right comparison is not "custom engine versus ideal off-the-shelf engine". The right comparison is:

custom narrow engine that matches the workload
versus continuing to fight a structural mismatch in a general engine under a strict processing deadline

For us, the narrow custom engine was the better bet.

A Broader Lesson

There is a common pattern in system design:

start with a familiar general-purpose tool
scale successfully for a while
hit a wall
spend time tuning around the wall
eventually realize the wall is telling you the abstraction is wrong

That does not mean the earlier choices were mistakes. Postgres was the right control-plane choice. RocksDB was a reasonable snapshot-store choice for an earlier stage of the system. But workload shape matters more than generic reputation. When your actual needs become:

append-heavy
point-read-only
reset-friendly
deadline-constrained

then a narrower design can be both simpler and faster.

Conclusion

We built seqs because our snapshot store stopped being a background detail and became the central scaling constraint in a weekly batch-plus-incremental pipeline. Postgres was the right place for coordination, not for the hot path. RocksDB was a meaningful improvement, but it still forced us into an LSM cost model that did not match our workload. We tuned it, isolated it onto separate disks, and learned a lot from operating it. That work was valuable, but it also made the mismatch impossible to ignore.

seqs is our answer to that mismatch.

It is not a general-purpose database. It is a storage component designed around a narrow, explicit contract:

write the latest value for a key quickly
read it back by key quickly
reset cheaply at epoch boundaries
and expose enough instrumentation that we can prove where the time and bytes are going

Sometimes the right scaling move is not more tuning. Sometimes it is admitting that the workload is simpler than the tool you are using, and then building something that matches the problem instead of the category.

zcourts