Possible to improve performance by sharding group DBs?

Make suggestions for new features.

Moderator: Quade


Postby spotter » Wed Aug 06, 2025 9:41 am

I think it's well known that people find header ingestion performance on big groups to be problematic (though someone can correct me if I'm wrong, and thereby discount this entire thread).

If that is the case, I wonder if we could improve performance by sharding the DBs over time (where "time" might not be physical time, but relative to the number of headers).

This would limit the size of the SQLite DBs, and thereby the size of the B-trees, reducing I/O when inserting headers.

Now, I can imagine a major problem with this: what happens when files or file sets are split across these sharded DBs? While it's possible for the UI to have complex logic to try to handle it, I think there's an easier solution: overlapping shards.

I.e., when we reach the point of creating a new shard, we don't stop inserting into the old DB for a period of time. In this way, a certain number of records will be inserted into both the previously "active" shard and the newly created shard. Presumably, if we overlap by a sufficient amount, one of the DBs should have a complete record of the file set (or possibly both will).
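To make the double-write idea concrete, here's a minimal sketch (not Newsbin's actual code; the class, schema, and column names are all made up for illustration) of a writer that keeps inserting into the old shard for a fixed number of headers after the new shard opens:

```python
import sqlite3


class OverlappingShardWriter:
    """Sketch: write each header to the active shard, and during an
    overlap window also to the previous shard (all names hypothetical)."""

    def __init__(self, prev_path, active_path, overlap=1_000_000):
        # overlap = number of headers double-written into the old shard
        self.prev = sqlite3.connect(prev_path) if prev_path else None
        self.active = sqlite3.connect(active_path)
        self.overlap_remaining = overlap if self.prev else 0
        for db in (self.prev, self.active):
            if db is not None:
                db.execute("CREATE TABLE IF NOT EXISTS headers "
                           "(id TEXT PRIMARY KEY, subject TEXT, posted INTEGER)")

    def insert(self, header_id, subject, posted):
        self.active.execute(
            "INSERT OR IGNORE INTO headers VALUES (?, ?, ?)",
            (header_id, subject, posted))
        if self.overlap_remaining > 0:
            # Double-write into the old shard during the overlap window.
            self.prev.execute(
                "INSERT OR IGNORE INTO headers VALUES (?, ?, ?)",
                (header_id, subject, posted))
            self.overlap_remaining -= 1
            if self.overlap_remaining == 0:
                # Overlap finished: the old shard is now frozen.
                self.prev.commit()
                self.prev.close()
                self.prev = None
```

The cost is exactly the wasted disk the post describes: the overlap's worth of headers stored twice, in exchange for never splitting a recent file set across a shard boundary.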

If the records have a common id between them, it should be easy for the UI to filter out the duplicate records appropriately (i.e. either the older DB will have the complete set, the newer DB will have the complete set, or both will; we just display whichever copy of the common id is "better"), which doesn't seem to be as complex an operation as trying to merge records between shards.
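The "pick the better copy" filter could be as simple as the following sketch, assuming a hypothetical schema where headers carry a file-set id and a part number (nothing here reflects Newsbin's real schema):

```python
import sqlite3


def best_copy(shard_paths, set_id):
    """Return the rows for one file set from whichever shard holds the
    most complete copy, hiding the duplicates in the other shard.
    Schema (set_id, part_no, subject) is a hypothetical sketch."""
    best = []
    for path in shard_paths:
        with sqlite3.connect(path) as db:
            rows = db.execute(
                "SELECT part_no, subject FROM headers "
                "WHERE set_id = ? ORDER BY part_no",
                (set_id,)).fetchall()
        if len(rows) > len(best):  # "better" = more parts present
            best = rows
    return best
```

Note there's no merging here at all: each shard's copy is taken or discarded wholesale, which is the whole appeal over cross-shard record merging.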

This is just a thought: throw a bit more disk space at the problem to try to improve performance.
spotter
Seasoned User
 
Posts: 176
Joined: Tue Feb 12, 2002 12:00 pm

Registered Newsbin User since: 05/05/06

Re: Possible to improve performance by sharding group DBs?

Postby Quade » Wed Aug 06, 2025 11:27 am

In my tests, the latest beta, which is 6.91B4, imports headers at least 10 times faster than older versions.

It might even be fast enough to bypass writing the headers to disk, but I haven't tried that yet. Your other post made me think of it. The whole reason it writes the header files is that header download speeds in the past far exceeded the ability to feed them into the DB, but that might not be true anymore. Newsbin downloads 50 meg blocks of headers from the servers (that's if it's downloading in bulk; typical header downloads from non-dump groups aren't that big). That's not that much RAM considering modern machines.

The problem with any sharding is performance. I've toyed with the idea of splitting header DBs by year: 2022.db, 2023.db, and the like. The dates Usenet reports aren't that reliable, in that the poster can make the dates anything they want.
Quade
Eternal n00b
 
Posts: 45071
Joined: Sat May 19, 2001 12:41 am
Location: Virginia, US

Registered Newsbin User since: 10/24/97

Re: Possible to improve performance by sharding group DBs?

Postby spotter » Wed Aug 06, 2025 12:00 pm

Quade wrote:In my tests, the latest beta, which is 6.91B4, imports headers at least 10 times faster than older versions.

It might even be fast enough to bypass writing the headers to disk, but I haven't tried that yet. Your other post made me think of it. The whole reason it writes the header files is that header download speeds in the past far exceeded the ability to feed them into the DB, but that might not be true anymore. Newsbin downloads 50 meg blocks of headers from the servers (that's if it's downloading in bulk; typical header downloads from non-dump groups aren't that big). That's not that much RAM considering modern machines.

The problem with any sharding is performance. I've toyed with the idea of splitting header DBs by year: 2022.db, 2023.db, and the like. The dates Usenet reports aren't that reliable, in that the poster can make the dates anything they want.


Right, don't trust the dates on the posts; shard based on the number of headers in a DB instead (so the shards would be viewed as one DB as a whole, and all queries for a group would run against all shards in the group). Then the question is how much you want the shards to overlap, and here is where physical time can come into play. I.e., if I shard every billion headers, I might say: overlap 1 million headers (1/10 of 1% wasted), or until one week (or one day) has passed in the real world AND I'm up to date with header retrieval (i.e. not just a week in which you haven't tried to download anything).

The argument here is that anything started in shard A will have been finished by the end of a real-world 7 days, and while those headers will be in shard B as well, it doesn't matter, per my description above. Similarly, when downloading in bulk from long ago, I'd hazard to say (perhaps incorrectly, but this is what a bit of experimentation is for) that anything started in shard A will be finished within the next 1 million headers, though I'd guess this also comes down to your experience of how posts get combined or not.
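The "count OR time AND caught-up" roll-over rule described above can be sketched in a few lines (constants and function names are invented for illustration, not anything Newsbin actually does):

```python
import time

SHARD_LIMIT = 1_000_000_000      # headers per shard (illustrative)
OVERLAP_HEADERS = 1_000_000      # ~0.1% of a shard double-written
OVERLAP_SECONDS = 7 * 24 * 3600  # or one real-world week


def should_close_overlap(headers_double_written, overlap_started_at,
                         caught_up, now=None):
    """Stop double-writing into the old shard once the header-count
    overlap is met, OR a real-world week has passed AND header
    retrieval is up to date (an idle week doesn't count)."""
    now = time.time() if now is None else now
    if headers_double_written >= OVERLAP_HEADERS:
        return True
    return caught_up and (now - overlap_started_at) >= OVERLAP_SECONDS
```

The `caught_up` flag is what encodes the "not just a week where you haven't tried to download anything" caveat: elapsed wall-clock time alone never closes the overlap.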
spotter
Seasoned User
 
Posts: 176
Joined: Tue Feb 12, 2002 12:00 pm

Registered Newsbin User since: 05/05/06

