Looks interesting for something like local development. I don't intend to run pr...

lxpz · 2025-12-19T19:42:00 1766173320

If you know of an embedded key-value store that supports transactions, is fast, has good Rust bindings, and does checksumming/integrity verification by default such that it almost never corrupts upon power loss (or at least, is always able to recover to a valid state), please tell me, and we will integrate it into Garage immediately.

agavra · 2025-12-19T20:13:55 1766175235

Sounds like a perfect fit for https://slatedb.io/ -- it's just that (an embedded, rust, KV store that supports transactions).

It's built specifically to run on object storage. It currently relies on the `object_store` crate, but we're considering OpenDAL instead, so if Garage works with those crates (I assume it does if it's S3-compatible) it should just work OOTB.

evil-olive · 2025-12-20T19:42:32 1766259752

for Garage's particular use case I think SlateDB's "backed by object storage" would be an anti-feature. their usage of LMDB/SQLite is for the metadata of the object store itself - trying to host that metadata within the object store runs into a circular dependency problem.

johncolanduoni · 2025-12-20T05:43:14 1766209394

I’ve used RocksDB for this kind of thing in the past with good results. It’s very thorough from a data corruption detection/rollback perspective (this is naturally much easier to get right with LSMs than B+ trees). The Rust bindings are fine.

It’s worth noting too that B+ tree databases are not a fantastic match for ZFS - they usually require extra tuning (block sizes, other stuff like how WAL commits work) to get performance comparable to XFS/ext4. LSMs on the other hand naturally fit ZFS’s CoW internals like a glove.

fabian2k · 2025-12-19T20:50:06 1766177406

I don't really know enough about the specifics here. But my main point isn't about checksums, but more something like WAL in Postgres. For an embedded KV store this is probably not the solution, but my understanding is that there are data structures like LSM that would result in similar robustness. But I don't actually understand this topic well enough.

Checksumming detects corruption after it happened. A database like Postgres will simply notice it was not cleanly shut down and put the DB into a consistent state by replaying the write ahead log on startup. So that is kind of my default expectation for any DB that handles data that isn't ephemeral or easily regenerated.
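To make the distinction concrete, here is a toy sketch of the replay idea in Python. It is not how Postgres (or any store mentioned in this thread) implements its log; the record layout and the names `append_record` and `replay` are made up for illustration. Each record carries a length and a CRC32, and on startup the log is read back until the first incomplete or corrupt record, which is treated as a torn write from the crash and dropped.

```python
import os
import struct
import zlib

# Hypothetical record layout: 4-byte payload length, 4-byte CRC32, then the payload.
HEADER = struct.Struct("<II")


def append_record(path, payload: bytes) -> None:
    """Append one checksummed record and ask the OS to flush it to disk."""
    with open(path, "ab") as f:
        f.write(HEADER.pack(len(payload), zlib.crc32(payload)) + payload)
        f.flush()
        os.fsync(f.fileno())


def replay(path):
    """Return every intact record, stopping at the first torn or corrupt one."""
    records = []
    if not os.path.exists(path):
        return records
    with open(path, "rb") as f:
        data = f.read()
    offset = 0
    while offset + HEADER.size <= len(data):
        length, crc = HEADER.unpack_from(data, offset)
        start = offset + HEADER.size
        payload = data[start:start + length]
        if len(payload) < length or zlib.crc32(payload) != crc:
            break  # incomplete or corrupt tail: assume the crash interrupted this write
        records.append(payload)
        offset = start + length
    return records
```

The checksum only detects the bad tail; it is the replay step that brings the store back to the last consistent state.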

But I also likely have the wrong mental model of what Garage does with the metadata, as I wouldn't have expected that to be ever limited by Sqlite.

lxpz · 2025-12-19T21:04:52 1766178292

So the thing is, different KV stores have different trade-offs, and for now we haven't yet found one that has the best of all worlds.

We do recommend SQLite in our quick-start guide to set up a single-node deployment for small/moderate workloads, and it works fine. The "real world deployment" guide recommends LMDB because it gives much better performance (with the current status of Garage, not to say that this couldn't be improved), and the risk of critical data loss is mitigated by the fact that such a deployment would use multi-node replication, meaning that the data can always be recovered from another replica if one node is corrupted and no snapshot is available. Maybe this should be worded better, I can see that the alarmist wording of the deployment guide is creating quite a debate so we probably need to make these facts clearer.

We are also experimenting with Fjall as an alternate KV engine based on LSM, as it theoretically has good speed and crash resilience, which would make it the best option. We are just not recommending it by default yet, as we don't have much data to confirm that it lives up to these expectations.

BeefySwain · 2025-12-19T19:47:39 1766173659

(genuinely asking) why not SQLite by default?

lxpz · 2025-12-19T19:54:36 1766174076

We were not able to get good enough performance compared to LMDB. We will work on this more though, there are probably many ways performance can be increased by reducing load on the KV store.

srcreigh · 2025-12-19T21:49:44 1766180984

Did you try WITHOUT ROWID? Your sqlite implementation[1] uses a BLOB primary key. In SQLite, this means each operation requires 2 b-tree traversals: The BLOB->rowid tree and the rowid->data tree.

If you use WITHOUT ROWID, you traverse only the BLOB->data tree.

Looking up lexicographically similar keys gets a huge performance boost since sqlite can scan a B-Tree node and the data is contiguous. Your current implementation is chasing pointers to random locations in a different b-tree.

I'm not sure exactly whether on disk size would get smaller or larger. It probably depends on the key size and value size compared to the 64 bit rowids. This is probably a well studied question you could find the answer to.

[1]: https://git.deuxfleurs.fr/Deuxfleurs/garage/src/commit/4efc8...
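A small sketch of the difference, using Python's built-in `sqlite3` module. The table names `kv_rowid` and `kv_clustered` are made up for this example and are not Garage's actual schema. `EXPLAIN QUERY PLAN` is also one way to see which b-tree a lookup goes through without knowing the rule beforehand.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Ordinary rowid table: the BLOB primary key lives in a separate automatic index
# that maps key -> rowid, and the row itself is stored in the rowid b-tree.
conn.execute("CREATE TABLE kv_rowid (k BLOB PRIMARY KEY, v BLOB)")

# WITHOUT ROWID table: rows are stored directly in the primary-key b-tree.
conn.execute(
    "CREATE TABLE kv_clustered (k BLOB PRIMARY KEY, v BLOB) WITHOUT ROWID"
)


def plan(table):
    """Return the human-readable plan lines for a point lookup on `table`."""
    rows = conn.execute(
        f"EXPLAIN QUERY PLAN SELECT v FROM {table} WHERE k = ?", (b"key",)
    ).fetchall()
    return [row[-1] for row in rows]  # the last column holds the detail text


for table in ("kv_rowid", "kv_clustered"):
    print(table, plan(table))
```

On the rowid table the plan should mention an automatic index (a name starting with `sqlite_autoindex_`), while the `WITHOUT ROWID` table should report a search using the primary key. The exact wording varies between SQLite versions.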

lxpz · 2025-12-19T22:09:59 1766182199

Very interesting, thank you. It would probably make sense for most tables but not all of them because some are holding large CRDT values.

asa400 · 2025-12-20T22:19:36 1766269176

Other than knowing this about SQLite beforehand, is there any way one could discover that this is happening through tracing?

rapnie · 2025-12-20T00:10:51 1766189451

I learned that Turso apparently have plans for a rewrite of libsql [0] in Rust, and create a more 'hackable' SQLite alternative altogether. It was apparently discussed in this Developer Voices [1] video, which I haven't yet watched.

[0] https://github.com/tursodatabase/libsql

[1] https://www.youtube.com/watch?v=1JHOY0zqNBY

tensor · 2025-12-19T20:46:58 1766177218

Keep in mind that write safety comes with performance penalties. You can turn off write protections and many databases will be super fast, but easily corrupt.
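SQLite is one place where this trade-off is an explicit setting. The sketch below uses Python's built-in `sqlite3` module. The `open_db` helper is made up for illustration, and nothing here reflects how Garage configures its stores.

```python
import sqlite3

# Levels accepted by SQLite's `PRAGMA synchronous`, from fastest to safest.
LEVELS = ("OFF", "NORMAL", "FULL", "EXTRA")


def open_db(path, synchronous="FULL"):
    """Open a database and choose how aggressively SQLite syncs to disk."""
    if synchronous not in LEVELS:
        raise ValueError(f"unknown synchronous level: {synchronous}")
    conn = sqlite3.connect(path)
    # OFF hands writes to the OS without waiting for them to reach the disk:
    # fast, but an OS crash or power loss can corrupt the database.
    # FULL waits for the sync at the critical moments: slower, but safer.
    conn.execute(f"PRAGMA synchronous={synchronous}")
    return conn
```

Reading the setting back with `conn.execute("PRAGMA synchronous").fetchone()[0]` returns the numeric level, 0 for OFF up to 3 for EXTRA.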

skrtskrt · 2025-12-19T20:22:35 1766175755

Could you use something like Fly's Corrosion to shard and distribute the SQLite data? It uses a CRDT reconciliation, which is familiar for Garage.

lxpz · 2025-12-19T20:44:12 1766177052

Garage already shards data by itself if you add more nodes, and it is indeed a viable path to increasing throughput.

__padding · 2025-12-21T00:29:32 1766276972

I’ve not looked at it in a while but sled/rio were interesting up and coming options https://github.com/spacejam/sled

ndyg · 2025-12-21T05:20:48 1766294448

Fjall

https://github.com/fjall-rs/fjall

__turbobrew__ · 2025-12-19T21:38:20 1766180300

RocksDB possibly. Used in high throughput systems like Ceph OSDs.

patmorgan23 · 2025-12-19T21:11:43 1766178703

Valkey?

VerifiedReports · 2025-12-20T06:06:06 1766210766

It's "key/value store", FYI

kqr · 2025-12-20T06:26:37 1766211997

It's not a store of "keys or values", no. It's a store of key-value pairs.

VerifiedReports · 2025-12-20T17:59:14 1766253554

A key-value store would be a store of one thing: key values. A hyphen combines two words to make an adjective, which describes the word that follows:

  A used-car lot

  A value-added tax

  A key-based access system

When you have two exclusive options, two sides to a situation, or separate things; you separate them with a slash:

  An on/off switch

  A win/win situation

  A master/slave arrangement

Therefore a key-value store and a key/value store are quite different.

kqr · 2025-12-20T19:02:15 1766257335

All of your slash examples represent either–or situations. A switch turns it on or off, the situation is a win in the first outcome or a win in the second outcome, etc.

It's true that key–value store shouldn't be written with a hyphen. It should be written with an en dash, which is used "to contrast values or illustrate a relationship between two things [... e.g.] Mother–daughter relationship"

https://en.wikipedia.org/wiki/Dash#En_dash

I just didn't want to bother with typography at that level of pedanticism.

VerifiedReports · 2025-12-20T20:01:08 1766260868

No, they don't. A master/slave configuration (of hard drives, for example) involves two things. I specifically included it to head off the exact objection you're raising.

"...the slash is now used to represent division and fractions, as a date separator, in between multiple alternative or related terms"

-Wikipedia

And what is a key/value store? A store of related terms.

And if you had a system that only allowed a finite collection of key values, where might you put them? A key-value store.

kqr · 2025-12-20T20:03:38 1766261018

The hard drives are either master or slave. A hard drive is not a master-and-slave.

VerifiedReports · 2025-12-21T05:43:31 1766295811

Exactly. And an entry in a key/value store is either a key or a value. Not both.

kqr · 2025-12-21T06:38:10 1766299090

No, an entry is a key-and-value pair. Are you seriously suggesting it is possible to add only keys without corresponding values, or vice versa?

abustamam · 2025-12-20T06:16:34 1766211394

Wikipedia seems to find "key-value store" an appropriate term.

https://en.wikipedia.org/wiki/Key%E2%80%93value_database

VerifiedReports · 2025-12-20T17:59:33 1766253573

See above.

abustamam · 2025-12-22T16:35:41 1766421341

Still not sure what point you're trying to make. You attempted to correct GP's usage of "key-value store" and I merely pointed out that it is the widely accepted term for what is being discussed.

Whether or not it's semantically "correct" because of usage of hyphen vs slash is irrelevant to that point.

DonHopkins · 2025-12-20T15:48:00 1766245680

Which is infinite if value is zero.

yupyupyups · 2025-12-20T03:16:33 1766200593

Depending on the underlying storage being reliable is far from unique to garage. This is what most other services do too, unless we're talking about something like Ceph which manages the physical storage itself.

Standard filesystems such as ext4 and xfs don't have data checksumming, so you'll have to rely on another layer to provide integrity. Regardless, that's not garage's job imo. It's good that they're keeping their design simple and focusing their resources on implementing the S3 spec.

moffkalast · 2025-12-19T17:36:53 1766165813

That's not something you can do reliably in software, datacenter grade NVMe drives come with power loss protection and additional capacitors to handle that gracefully. If power is cut at the wrong moment the partition may not be mountable afterwards otherwise.

If you really live somewhere with frequent outages, buy an industrial drive that has a PLP rating. Or get a UPS, they tend to be cheaper.

crote · 2025-12-19T18:02:18 1766167338

Isn't that the entire point of write-ahead logs, journaling file systems, and fsync in general? A roll-back or roll-forward due to a power loss causing a partial write is completely expected, but surely consumer SSDs wouldn't just completely ignore fsync and blatantly lie that the data has been persisted?

As I understood it, the capacitors on datacenter-grade drives are to give it more flexibility, as it allows the drive to issue a successful write response for cached data: the capacitor guarantees that even with a power loss the write will still finish, so for all intents and purposes it has been persisted, so an fsync can return without having to wait on the actual flash itself, which greatly increases performance. Have I just completely misunderstood this?
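For reference, this is roughly the contract application code relies on, sketched in Python for a POSIX system. The helper name `durable_write` is made up. The guarantee only holds if the kernel and the drive actually honor the flush, which is exactly what is being questioned here.

```python
import os


def durable_write(path: str, data: bytes) -> None:
    """Replace `path` with `data` so that a crash leaves either the old or the new file."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()              # push Python's buffer to the OS
        os.fsync(f.fileno())   # ask the OS to push the file data to the device
    os.replace(tmp, path)      # atomic rename on POSIX filesystems
    # Sync the containing directory so the rename itself is persisted (POSIX only).
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

If the power fails before the rename is persisted, the old file is still there, and the leftover temporary file can be discarded on the next start. A drive that acknowledges the flush without persisting the data breaks this pattern, and software cannot detect that on its own.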

unsnap_biceps · 2025-12-19T18:42:42 1766169762

you actually don't need capacitors for rotating media, Western Digital has a feature called "ArmorCache" that uses the rotational energy in the platters to power the drive long enough to sync the volatile cache to a non volatile storage.

https://documents.westerndigital.com/content/dam/doc-library...

toomuchtodo · 2025-12-19T18:51:28 1766170288

Very cool, like the ram air turbine that deploys on aircraft in the event of a power loss.

patmorgan23 · 2025-12-19T21:12:39 1766178759

Good I love engineers

Aerolfos · 2025-12-19T23:38:14 1766187494

> but surely consumer SSDs wouldn't just completely ignore fsync and blatantly lie that the data has been persisted?

That doesn't even help if fsync() doesn't do what developers expect: https://danluu.com/fsyncgate/

I think this was the blog post that had a bunch more stuff that can go wrong too: https://danluu.com/deconstruct-files/

But basically fsync itself (sometimes) has dubious behaviour, then OS on top of kernel handles it dubiously, and then even on top of that most databases can ignore fsync erroring (and lie that the data was written properly)

So... yes.

Nextgrid · 2025-12-19T18:16:04 1766168164

> ignore fsync and blatantly lie that the data has been persisted

Unfortunately they do: https://news.ycombinator.com/item?id=38371307

btown · 2025-12-19T18:26:28 1766168788

If the drives continue to have power, but the OS has crashed, will the drives persist the data once a certain amount of time has passed? Are datacenters set up to take advantage of this?

Nextgrid · 2025-12-19T18:39:01 1766169541

> will the drives persist the data once a certain amount of time has passed

Yes, otherwise those drives wouldn't work at all and would have a 100% warranty return rate. The reason they get away with it is that the misbehavior is only a problem in a specific edge-case (forgetting data written shortly before a power loss).

unsnap_biceps · 2025-12-19T18:39:56 1766169596

Yes, the drives are unaware of the OS state.

igor47 · 2025-12-19T17:30:15 1766165415

I've been using minio for local dev but that version is unmaintained now. However, I was put off by the minimum requirements for garage listed on the page -- does it really need a gig of RAM?

dsvf · 2025-12-19T20:35:38 1766176538

I always understood this requirement as "garage will run fine on hardware with 1GB RAM total" - meaning the 1GB includes the RAM used by the OS and other processes. I think that most current consumer hardware that is a potential garage host, even on the low end, has at least 1GB total RAM.

archon810 · 2025-12-19T17:32:59 1766165579

The current latest Minio release that is working for us for local development is now almost a year old and soon enough we will have to upgrade. Curious what others have replaced it with that is as easy to set up and has a management UI.

mbreese · 2025-12-19T20:04:58 1766174698

I think that's part of the pitch here... swapping out Minio for Garage. Both scale a lot more than for just local development, but local dev certainly seems like a good use-case here.

lxpz · 2025-12-19T19:43:40 1766173420

It does not, at least not for a small local dev server. I believe RAM usage should be around 50-100MB, increasing if you have many requests with large objects.

nijave · 2025-12-20T03:42:37 1766202157

The assumption is nodes are in different fault domains so it'd be highly unlikely to ruin the whole cluster.

LMDB mode also runs with flush/syncing disabled