Db Gz to Db Failed debug messages...

Technical support and discussion of Newsbin Version 6 series.


Postby SnakeByte » Thu Nov 14, 2013 5:10 am

Why would this be happening?
(debug output, filtering for "failed")
Code: Select all
[22:08:07]  DEBUG DbWorker: Db Gz to Db  Failed: d:\newsbin\Import\alt.binaries.ces-Supernews-335972-335982.gz
[22:09:37]  DEBUG DbWorker: Db Gz to Db  Failed: d:\newsbin\Import\alt.binaries.cartoons.french.anime-fansub-Supernews-633866-633876.gz
[22:27:13]  DEBUG DbWorker: Db Gz to Db  Failed: d:\newsbin\Import\alt.binaries.multimedia.deadzone-Supernews-21726-21736.gz
[22:30:45]  DEBUG DbWorker: Db Gz to Db  Failed: d:\newsbin\Import\alt.binaries.tv.german-Supernews-31023407-31023417.gz
[22:32:33]  DEBUG DbWorker: Db Gz to Db  Failed: d:\newsbin\Import\alt.binaries.x2l.nzb-Supernews-124309-124319.gz
[22:33:05]  DEBUG DbWorker: Db Gz to Db  Failed: d:\newsbin\Import\hr.rec.stripovi.binaries-Supernews-835719-835729.gz
[22:34:00]  DEBUG DbWorker: Db Gz to Db  Failed: d:\newsbin\Import\alt.nl.ftp.binaries-Supernews-42298-42308.gz
[22:35:31]  DEBUG DbWorker: Db Gz to Db  Failed: d:\newsbin\Import\alt.binaries.x2l-Supernews-37674874-37674884.gz


Here's the same debug output, but filtering for only one of those failed files:
Code: Select all
[22:32:38]  DEBUG Total XOVER Size:hr.rec.stripovi.binaries - 835,719:835,729
[22:32:38]  DEBUG XOVER: hr.rec.stripovi.binaries 835719:835729
[22:33:05]  DEBUG DbWorker: Db Update Started5: d:\newsbin\Import\hr.rec.stripovi.binaries-Supernews-835719-835729.gz
[22:33:05]  DEBUG DbWorker: Db Gz to Db  Failed: d:\newsbin\Import\hr.rec.stripovi.binaries-Supernews-835719-835729.gz
[22:46:19]  DEBUG Total XOVER Size:hr.rec.stripovi.binaries - 835,719:835,729
[22:46:19]  DEBUG XOVER: hr.rec.stripovi.binaries 835719:835729
[22:47:17]  DEBUG DbWorker: Db Update Started5: d:\newsbin\Import\hr.rec.stripovi.binaries-Supernews-835719-835729.gz
[22:47:17]  DEBUG Checking Autodownload: hr.rec.stripovi.binaries
[23:05:58]  DEBUG Total XOVER Size:hr.rec.stripovi.binaries - 835,719:835,729
[23:05:58]  DEBUG XOVER: hr.rec.stripovi.binaries 835719:835729
[23:07:07]  DEBUG DbWorker: Db Update Started5: d:\newsbin\Import\hr.rec.stripovi.binaries-Supernews-835719-835729.gz
[23:07:07]  DEBUG DbWorker: Db Gz to Db  Failed: d:\newsbin\Import\hr.rec.stripovi.binaries-Supernews-835719-835729.gz
[23:26:25]  DEBUG Total XOVER Size:hr.rec.stripovi.binaries - 835,719:835,729
[23:26:25]  DEBUG XOVER: hr.rec.stripovi.binaries 835719:835729
[23:29:53]  DEBUG DbWorker: Db Update Started5: d:\newsbin\Import\hr.rec.stripovi.binaries-Supernews-835719-835729.gz
[23:29:53]  DEBUG Checking Autodownload: hr.rec.stripovi.binaries
[23:52:16]  DEBUG Total XOVER Size:hr.rec.stripovi.binaries - 835,719:835,729
[23:52:17]  DEBUG XOVER: hr.rec.stripovi.binaries 835719:835729
[00:05:39]  DEBUG DbWorker: Db Update Started5: d:\newsbin\Import\hr.rec.stripovi.binaries-Supernews-835719-835729.gz
[00:05:39]  DEBUG DbWorker: Db Gz to Db  Failed: d:\newsbin\Import\hr.rec.stripovi.binaries-Supernews-835719-835729.gz


It's repeatedly failing?
SnakeByte
Active Participant
Posts: 73
Joined: Thu Oct 03, 2013 11:52 pm

Registered Newsbin User since: 05/30/11

Re: Db Gz to Db Failed debug messages...

Postby Quade » Thu Nov 14, 2013 11:01 am

[22:33:05] DEBUG DbWorker: Db Update Started5: d:\newsbin\Import\hr.rec.stripovi.binaries-Supernews-835719-835729.gz
[22:33:05] DEBUG DbWorker: Db Gz to Db Failed: d:\newsbin\Import\hr.rec.stripovi.binaries-Supernews-835719-835729.gz


These are groups that have had no traffic since the last time you downloaded headers. The 10 posts it downloaded are just "overlap" posts. Meaning Newsbin reads 10 posts it's already read. It does this because some servers can get out of sync and their post numbers aren't exactly the same. More than likely the data is already in the DB and the failure is that it's not actually adding any data to the database.
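The overlap behavior Quade describes can be sketched roughly like this (a hypothetical sketch with made-up function and parameter names; Newsbin's actual implementation isn't public):

```python
def header_request_range(server_low, server_high, last_seen_high, overlap=10):
    """Hypothetical sketch of the overlap logic: always re-request the
    last `overlap` posts at and below the previously seen high watermark,
    clamped to the server's reported low article number."""
    start = max(server_low, last_seen_high - overlap)
    return start, server_high

# Dead group from the log above: nothing new since the last update,
# so the same 835719:835729 range gets requested on every cycle.
print(header_request_range(319957, 835729, 835729))
```

On a dead group the computed range never advances, which is why the same .gz files keep reappearing in the log.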
Quade
Eternal n00b
Posts: 44951
Joined: Sat May 19, 2001 12:41 am
Location: Virginia, US

Registered Newsbin User since: 10/24/97

Re: Db Gz to Db Failed debug messages...

Postby SnakeByte » Thu Nov 14, 2013 1:21 pm

Quade wrote:More than likely the data is already in the DB and the failure is that it's not actually adding any data to the database.


Quade,

Is there any way to get the app to be more verbose with these debug failures so that we know for certain that is the case and not a "more than likely" type of situation?

Also, what's up with the 5 in "Db Update Started5:"? I thought maybe that represented the processing thread, but it's always a 5. Is it just a format string goof?

Re: Db Gz to Db Failed debug messages...

Postby Quade » Thu Nov 14, 2013 2:23 pm

If you're missing data, let me know and I'll look into it.

Re: Db Gz to Db Failed debug messages...

Postby SnakeByte » Fri Nov 15, 2013 5:34 am

Quade wrote:If you're missing data, let me know and I'll look into it.


Quade,

If newsbin is the only program I'm using to download data, and it may not be importing that data correctly, how can I tell that data is missing? What I can tell you is that one of those groups (hr.rec.stripovi.binaries) shows as empty in newsbin, yet the gz file clearly contains data:

835719.(by req) Comic Strip - "Comic_Strip_35.cbr" (57/61) 28.4 MBytes yEnc.tomislav.cukelj4@zg.t-com.hr.Tue, 13 Nov 2012 08:02:38 +0100.<e4s3a8tufm3lulj2m5a17cspecchmu0e5p@4ax.com>..508785.3910.Xref: number.nntp.dca.giganews.com hr.rec.stripovi.binaries:835719

However, even if this isn't a data missing issue, another problem with these failures is that they repeat themselves. Since I've noticed this, the same files keep re-appearing after each header download and each time, they fail to get imported. This seems wasteful, if nothing else.

If the news server changes their posts so that the data is out of sync with newsbin, shouldn't newsbin update itself so that these failures stop?

Also, why is it that bogus errors are logged, yet not these? Both seem to stem from the same problem - newsbin is out of sync with the usenet server.

Re: Db Gz to Db Failed debug messages...

Postby Quade » Fri Nov 15, 2013 10:48 am

However, even if this isn't a data missing issue, another problem with these failures is that they repeat themselves. Since I've noticed this, the same files keep re-appearing after each header download and each time, they fail to get imported. This seems wasteful, if nothing else.


I already explained this. You're downloading headers from some (probably many) dead groups so, each time you run your update cycle you're downloading the SAME posts, over and over and over again. Every 18 minutes.


It's not that hard to check if you know what's in the GZ file: just load the group and see if that data is already there.

Re: Db Gz to Db Failed debug messages...

Postby SnakeByte » Sat Nov 16, 2013 3:23 am

Quade wrote:
I already explained this. You're downloading headers from some (probably many) dead groups so, each time you run your update cycle you're downloading the SAME posts, over and over and over again. Every 18 minutes.

Quade,

But why is this occurring?

When I issue the group command on that one example group, I get back this response from the supernews server:
Code: Select all
group hr.rec.stripovi.binaries
211 515773 319957 835729 hr.rec.stripovi.binaries

As you know, the last number the server gives back is the reported high watermark: the last article # (835729). If newsbin has downloaded from this group previously, it should already have this number recorded in the GroupRange table and know that there is no need to request any new articles from the server. Yet in the logs, I see that newsbin still asks the server for the last 10 articles up to that maximum value:

Code: Select all
[22:32:38]  DEBUG Total XOVER Size:hr.rec.stripovi.binaries - 835,719:835,729

Why is it doing this if the articles are already in the database?

Code: Select all
sqlite> select * from GroupRange;
Supernews|hr.rec.stripovi.binaries|0|835729

Of course it would fail if the data already exists (as you say it should). But that's not the point. Newsbin shouldn't even be creating the importable gz file in the first place.
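The check SnakeByte is proposing could look roughly like this: parse the 211 response (per RFC 3977 the fields are estimated count, low, high, group name) and compare the reported high watermark against the stored one. This is a sketch of the proposal, not current Newsbin behavior:

```python
def parse_group_response(line):
    """Parse an NNTP GROUP response of the form
    '211 <count> <low> <high> <name>' into usable fields."""
    status, count, low, high, name = line.split()
    if status != "211":
        raise ValueError("unexpected GROUP response: " + line)
    return int(count), int(low), int(high), name

count, low, high, name = parse_group_response(
    "211 515773 319957 835729 hr.rec.stripovi.binaries")

# Proposed skip: if the stored high watermark (from GroupRange) already
# matches the server's, there is nothing new -- skip the XOVER entirely.
stored_high = 835729
needs_update = high != stored_high
```

Under this scheme, a dead group would cost one GROUP round trip per update cycle and produce no .gz file at all.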

Re: Db Gz to Db Failed debug messages...

Postby Quade » Sat Nov 16, 2013 4:52 am

These are groups that have had no traffic since the last time you downloaded headers. The 10 posts it downloaded are just "overlap" posts. Meaning Newsbin reads 10 posts it's already read. It does this because some servers can get out of sync and their post numbers aren't exactly the same. More than likely the data is already in the DB and the failure is that it's not actually adding any data to the database.


Here's what I wrote in my first reply. Perhaps you didn't understand. Newsbin has a small overlap between the last header download and the next. When you have a dead group, it means it's downloading the same 10 posts over and over.

Re: Db Gz to Db Failed debug messages...

Postby SnakeByte » Sun Nov 17, 2013 2:19 am

Quade wrote:
These are groups that have had no traffic since the last time you downloaded headers. The 10 posts it downloaded are just "overlap" posts. Meaning Newsbin reads 10 posts it's already read. It does this because some servers can get out of sync and their post numbers aren't exactly the same. More than likely the data is already in the DB and the failure is that it's not actually adding any data to the database.


Here's what I wrote in my first reply. Perhaps you didn't understand. Newsbin has a small overlap between the last header download and the next. When you have a dead group, it means it's downloading the same 10 posts over and over.

If the articles ever get out of sync, then the group command will report back a different "high watermark" number. Then, and only then, would it make sense to sample the posts to get a bearing on a future sync. Polling 10 of the latest posts from each group "just in case" seems wasteful and unwarranted.

Re: Db Gz to Db Failed debug messages...

Postby Quade » Sun Nov 17, 2013 9:31 am

If the articles ever get out of sync, then the group command will report back a different "high watermark" number. Then, and only then, would it make sense to sample the posts to get a bearing on a future sync. Polling 10 of the latest posts from each group "just in case" seems wasteful and unwarranted.


So you'd think, but in fact you're wrong. Some servers have multiple farms of header servers, and they're not always synced 100%, so when you download headers and try to be precise down to a post, you end up with missing posts. Off-by-one or off-by-some errors. "Overlap" is the result of hard-earned experience over many different servers. Some servers "backfill" too, where new headers show up in ranges you've already pulled headers from.

Re: Db Gz to Db Failed debug messages...

Postby SnakeByte » Sun Nov 17, 2013 8:39 pm

Quade wrote:
If the articles ever get out of sync, then the group command will report back a different "high watermark" number. Then, and only then, would it make sense to sample the posts to get a bearing on a future sync. Polling 10 of the latest posts from each group "just in case" seems wasteful and unwarranted.


So you'd think, but in fact you're wrong. Some servers have multiple farms of header servers, and they're not always synced 100%, so when you download headers and try to be precise down to a post, you end up with missing posts. Off-by-one or off-by-some errors. "Overlap" is the result of hard-earned experience over many different servers. Some servers "backfill" too, where new headers show up in ranges you've already pulled headers from.

So there are two situations to worry about:

One is you believe that there are servers that report the same article # but with different content.

Two, servers will backfill, so even though an article # has been reached, previous article #'s will be filled in.

Is there any way to detect #1 without creating a gz file? I.e., query the content of the last article # from newsbin's db and string-compare it with the result of an "XOVER [lastarticle#]-"?

Shouldn't #2 only be a concern if the news server previously returned results with gaps in the article # count? I.e., it returns posts for 1, 2, 3, and 5, but not 4? If there is a complete set, there are no backfill worries? And wouldn't it make more sense to ask for the missing article #'s instead of a blanket 10-post request, which wouldn't likely cover the range of backfilled articles anyway?
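The targeted re-request SnakeByte suggests could be sketched like this (a hypothetical helper; it presumes the article numbers were stored, which the replies below explain Newsbin does not do):

```python
def missing_articles(requested_low, requested_high, received):
    """Return the article numbers in [requested_low, requested_high]
    that the server never sent, i.e. the gaps a backfilling server
    might later fill in."""
    got = set(received)
    return [n for n in range(requested_low, requested_high + 1)
            if n not in got]

# Server answered a 1-5 XOVER with posts 1, 2, 3, and 5:
# only article 4 would need to be asked for again.
print(missing_articles(1, 5, [1, 2, 3, 5]))
```

A complete set would yield an empty list, in which case no follow-up request would be needed at all.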

Re: Db Gz to Db Failed debug messages...

Postby Quade » Sun Nov 17, 2013 11:56 pm

One is you believe that there are servers that report the same article # but with different content.


No. You request a range of posts, say 1-1000, and you only get 990 posts because of a server sync issue. The next time you download headers, you think you have up to record 1000, so you download 1000-2000, but the 10 missing from before remain missing even though they've shown up since the last time you downloaded. You can't know they've shown up since the last header download.

#2 is accurate. Gaps in articles aren't unheard of. Posts get removed. I don't care about how or why the servers might do this. All I care about is making Newsbin work. If that means 10 overlap or 1000, that's what I'll implement.

Is there any way to detect #1 without creating a gz file? I.e., query the content of the last article # from newsbin's db and string-compare it with the result of an "XOVER [lastarticle#]-"?


As for detecting it, I don't see the point. Overlap takes care of it. Newsbin isn't tracking record numbers in the DBs. Once the records are imported, the record number is discarded. It's not used during download. All the headers from all servers are combined together in the DB.


How about collecting up a bunch of 10-post overlap GZ files in some other folder. You should have a crapload. Collect up 1000. Then exit Newsbin, rename the import folder, start Newsbin, then dump these 1000 gz files into the empty import folder and see how long it takes to process them.
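That collection step could be scripted along these lines (a sketch; the folder paths are whatever your own Newsbin setup uses, and you'd still do the Import-folder rename and the timing by hand as Quade describes):

```python
import glob
import os
import shutil

def stage_gz_files(source_dir, staging_dir, limit=1000):
    """Copy up to `limit` .gz header files into a staging folder so
    they can later be dropped into the emptied Import folder at once.
    Returns the number of files actually staged."""
    os.makedirs(staging_dir, exist_ok=True)
    files = sorted(glob.glob(os.path.join(source_dir, "*.gz")))[:limit]
    for path in files:
        shutil.copy(path, staging_dir)
    return len(files)
```

Running it against the folder where the overlap files were collected, then timing how long Newsbin takes to drain the staging batch, gives the measurement Quade is asking for.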

Re: Db Gz to Db Failed debug messages...

Postby SnakeByte » Tue Nov 19, 2013 4:38 am

Quade wrote:
One is you believe that there are servers that report the same article # but with different content.


No. You request a range of posts, say 1-1000, and you only get 990 posts because of a server sync issue. The next time you download headers, you think you have up to record 1000, so you download 1000-2000, but the 10 missing from before remain missing even though they've shown up since the last time you downloaded. You can't know they've shown up since the last header download.

Okay, so newsbin asks for article #'s 1-1000, but only gets back 990 articles, even though the last article it got was numbered 1000. This means some of the articles between 1 and 1000 were not sent. This could be because the articles were removed, or because of a server glitch of some type.

What does newsbin do at this point?

I don't think you're saying that newsbin has asked for 1-1000 but got back a complete but truncated 990, right? Because then all it would need to do is record 990 into the Range.db3 and take it from there at the next header update.

What about truncating the import at the first detected gap? So, if a server returns 1-200, 300-500, and 900-1000, stop at 200, record that in the Range.db3, and then continue from there at the next header update. If the gap still exists, assume the posts were deleted and move on to the next gap.

What if newsbin asks for 1-1000 and actually gets all of the articles? Should it still be necessary to ask for the last 10, save them to disk, and attempt to import them? What about adding a "complete" flag in Range.db3 that is only set when this situation occurs?
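The truncate-at-first-gap idea could look roughly like this (a hypothetical sketch of the proposal, not of anything Newsbin actually does):

```python
def high_water_before_gap(requested_low, received):
    """Advance the recorded high-water mark only through the first
    contiguous run of received article numbers, so the next header
    update retries the gap before moving on."""
    mark = requested_low - 1
    for n in sorted(set(received)):
        if n == mark + 1:
            mark = n
        else:
            break
    return mark

# Server returned 1-200, 300-500, and 900-1000 for a 1-1000 request:
# the recorded mark stops at 200, just before the first gap.
received = list(range(1, 201)) + list(range(300, 501)) + list(range(900, 1001))
print(high_water_before_gap(1, received))
```

If the same gap persists on the next pass, the posts could be assumed deleted and the mark advanced past them, as the paragraph above suggests.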

Quade wrote:Posts get removed. I don't care about how or why the servers might do this. All I care about is making Newsbin work. If that means 10 overlap or 1000, that's what I'll implement.

If posts get removed, then asking for them again isn't going to ever get you that data. It's gone.

I don't think we're in disagreement about wanting newsbin to work :) I'm just exploring other possible avenues. For instance, what about, in addition to that mentioned above, only back-filling when one actually selects to download something, and then only when there aren't enough par articles to fill those gaps? I mean, if newsbin were to start storing the article #'s in the db, it would know the first article # of the first file and the last article # of the last par file. That would be a great time to refresh the newsbin DB by asking for that range of headers again, especially considering all the DMCA takedowns that may have happened since the original header download.

Re: Db Gz to Db Failed debug messages...

Postby Quade » Tue Nov 19, 2013 12:17 pm

What does newsbin do at this point?


Nothing; typically the overlap catches the missing posts. If you have a really screwed-up server, you can increase the overlap.

If posts get removed, then asking for them again isn't going to ever get you that data. It's gone.


You're sort of missing the point. What I was trying to explain is that a backfilling server is indistinguishable from normal missing posts.

only back-filling when one actually selects to download something, and then only when there aren't enough par articles to fill those gaps? I mean, if newsbin were to start storing the article #'s in the db, it would know the first article # of the first file and the last article # of the last par file. That would be a great time to refresh the newsbin DB by asking for that range of headers again, especially considering all the DMCA takedowns that may have happened since the original header download.


Why bother, when something simple like overlapping the header downloads, which is essentially free, solves the problem?

You don't really understand the ins and outs of usenet. Posts can come in any order, and they come interleaved with other posts. You might have 20 file sets interleaved in a 100,000-post range of headers. You can't trust anything about the record numbers. You can't trust anything about the date order. You might assume that a newer post has a higher record number than an older post, but sometimes it doesn't. You can't trust that the server is ever telling you the truth. That's why Newsbin uses "storage age" instead of depending on the server to tell Newsbin what the group post range is.

You're trying to solve a solved problem. Optimize something that probably doesn't need optimizing. You haven't really thought through storing record numbers either. What happens when someone has 10 servers they download headers from? The whole reason Newsbin doesn't track them is that there might be N header servers.

Re: Db Gz to Db Failed debug messages...

Postby SnakeByte » Thu Nov 21, 2013 4:08 am

Quade wrote:You're trying to solve a solved problem. Optimize something that probably doesn't need optimizing. You haven't really thought through storing record numbers either. What happens when someone has 10 servers they download headers from? The whole reason Newsbin doesn't track them is that there might be N header servers.

The true problem I'm trying to solve is how to work around the slow gz import problem. That, and some concern about the increasing number of ISPs that are capping their data plans.

I still think there is one place where an optimization would seriously help here.

When the group command is issued to a usenet server, three numbers (besides the 211 status) are returned: the estimated article count, the low article #, and the high article #. If none of these values change, downloading the last 10 headers should be unnecessary. Since the vast majority of newsgroups see little traffic, this should fix the slow import for those of us who don't want to pick and choose which newsgroups to subscribe to, because newsbin would no longer attempt to import data it already has.

In an age where you think subscribing to 10 usenet servers is a legitimate argument, and where newsbin will blindly download the last 10 articles from each group FROM EACH SERVER, things could get out of hand with far fewer groups subscribed to.

There is one other place where importing could possibly be sped up: import more than one gz at a time, or kill off the wasteful separation of download from import and combine them into one process, with no gz file creation at all. Why write to and read from the disk twice?

Re: Db Gz to Db Failed debug messages...

Postby Quade » Thu Nov 21, 2013 8:57 am

In an age where you think subscribing to 10 usenet servers is a legitimate argument, and where newsbin will blindly download the last 10 articles from each group FROM EACH SERVER, things could get out of hand with far fewer groups subscribed to.


I'll bet you I have 500 users with at least 10 servers, and thousands with 5 servers. Hell, I have 4 servers I typically use when testing.

It's funny that you scoff at my comment while at the same time subscribing to 2300 groups, which is an equally out-there thing.

You're a "canary in a coal mine", meaning the issues you're running into are outlier issues. Don't get me wrong, I like knowing about them because, if I can make your import faster and fix your crashes, everyone benefits. Having someone running it on the edge like you do is good for me and good for Newsbin. You're just on the extreme end of the usage curve, just like the people who have 10 servers and 200 connections. I expect odd results in both cases because you're both testing things I've never tested.

