Basic header retrieval

Posted: Tue Jun 01, 2021 11:45 pm
by rh
I've been using Watch Topics with internet search for so long, I forgot the basics of newsgroups.

Watch topics seem not to retrieve all the posted items I usually see after a few weeks' absence. So I wanted to just download 30 days of headers from a group. How do I do that? I've tried download special 500,000 and it downloads for a bit, but afterwards there are no posts when I load the group - even with all the filters OFF.

Re: Basic header retrieval

Posted: Wed Jun 02, 2021 11:16 am
by Quade
- Set the download age to "30".
- Right click the group and select "Post Storage/Use Download Age"
- Download headers normally.

This will download 30 days without deleting the existing headers. Auto-mode will only download "new" posts so this might not get you what you want.

You can re-scan the group without downloading headers which will re-feed the watch list too.

You can always load a watch list and add anything you don't think was added automatically.

Re: Basic header retrieval

Posted: Wed Jun 02, 2021 12:53 pm
by rh
Ok, re-scan sounds nice but it didn't work. That is, it reported 5xx found but no new posts were added to the watch topic after reloading all from disk.

This has been a problem before which I had just given up on. Away for a few weeks and there's a distinct hole in posts when reviewing the topics.

So, as a test, I shift-del all but the last 60 days from a watch topic. Sorted by date there's a spot where the date instantly goes from 17d to 3d. Hence 14 days of missing posts.

I'm trying Post Storage/Use Download Age now on a random group in the GOG that the watch topic is using. What does that do? Set the last date pointer back N days, so that load latest starts from that point? Also, by doing this, pulling headers from a group that the watch topic also 'watches', should that cause the watch topic to SEE the headers too and add them to its list as well?

Re: Basic header retrieval

Posted: Wed Jun 02, 2021 3:17 pm
by Quade
rh wrote: Ok, re-scan sounds nice but it didn't work. That is, it reported 5xx found but no new posts were added to the watch topic after reloading all from disk.


I've specifically tested for this in the latest beta and it works for both headers and search.

If you use "wild card" mode with a search based watch list, you won't get more than one to two days' worth of hits. Wildcard simply returns too many results. If you're using an actual search term, the results from a rescan depend on how many matching results you get from the search. If you don't run Newsbin for 2 weeks and you're using a wildcard search term, getting a gap is to be expected. With a wildcard search term, there might be 20 million results in a 2-week gap.

rh wrote: I'm trying Post Storage/Use Download Age now on a random group in the GOG that the watch topic is using. What does that do? Set the last date pointer back N days, so that load latest starts from that point? Also, by doing this, pulling headers from a group that the watch topic also 'watches', should that cause the watch topic to SEE the headers too and add them to its list as well?


It clears the tracking DB for current header position. Then when you download headers again, it'll start downloading from "Download age" in the past to today. Usenet doesn't support downloading headers by date. Instead Newsbin probes the group and figures out what record number corresponds to N days ago.
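
As a rough illustration of that probing (not Newsbin's actual code), a binary search over a group's record-number range can find the first record posted at or after a cutoff date. The `date_of` callback is a hypothetical stand-in for asking the server for one record's date, and the sketch assumes dates never decrease as record numbers increase, which real groups only approximate.

```python
from datetime import datetime, timedelta
from typing import Callable


def first_article_since(low: int, high: int, cutoff: datetime,
                        date_of: Callable[[int], datetime]) -> int:
    """Return the lowest record number in [low, high] dated at or after cutoff.

    Returns high + 1 if every record is older than the cutoff.
    Assumes date_of(n) is non-decreasing in n (an idealization).
    """
    lo, hi = low, high + 1
    while lo < hi:
        mid = (lo + hi) // 2
        if date_of(mid) < cutoff:
            lo = mid + 1   # mid is too old, so look at newer records
        else:
            hi = mid       # mid qualifies, but an earlier record might too
    return lo


# Simulated group: record 1000 + i was posted i hours after START.
START = datetime(2021, 5, 1)


def fake_date_of(number: int) -> datetime:
    return START + timedelta(hours=number - 1000)


now = fake_date_of(2440)             # newest record, 60 days after START
cutoff = now - timedelta(days=30)    # a "Download age" of 30
start_at = first_article_since(1000, 2440, cutoff, fake_date_of)
print(start_at)
```

Each probe halves the remaining range, so even a group with hundreds of millions of records needs only a few dozen lookups to locate the starting point.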

Re: Basic header retrieval

Posted: Wed Jun 02, 2021 4:30 pm
by rh
Bingo. Wildcard search - so that makes sense now. So, given the volume, it's just not practical or possible to "catch-up" I guess.

The test download for a single group in the GOG took maybe 20 min and I think the header count was around 11M indicated during the download. But there are NO posts indicated in the New Files column or when I load that group. Again, filters completely OFF, Display Settings - None. Even tried setting Display Age to Show All Files.

If that should have worked, maybe I should try the beta. I'm still on 6.82 5142.

Re: Basic header retrieval

Posted: Wed Jun 02, 2021 4:45 pm
by rh
Installed B10, download newest from that same group (a couple hours later) and progress showed about 127k headers but again nothing shows when I load the group.

Wiped the DB3's in that group's SPOOL_V6 folder, set the date to 16 days, now downloading 5.8M headers.....

done. Storage.db3 is just stuck at 52,348 bytes but time stamped NOW. No posts exist. I must have some setting somewhere that just says scan the downloaded posts but don't save them.

Re: Basic header retrieval

Posted: Wed Jun 02, 2021 5:04 pm
by rh
Hmmm, is it possible that if a group is in a GOG that is part of a Watch Topic - then manually loading headers for that group is not supposed to store the posts in the db3 for THAT group? I guess I could see that, but it's a little confusing.

I pose that question because I just picked another random group that worked as expected, actually stored the downloaded headers. This group is NOT in a GOG that is used in a Watch Topic.

Re: Basic header retrieval

Posted: Thu Jun 03, 2021 3:08 pm
by Quade
Watch topics simply parallel the main header database. So the main header database gets fed first, then any matching results get sent to the watch list.

1 - "Show all Posts" on the group.

2 - Look down where it says "Cache: X/Y (N)". N is the number of header blocks awaiting import. Until that number hits zero, you won't see all the newest headers in your groups. If you continually bang on the group, re-downloading headers multiple times, you may flood the import folder with header blocks. It'll eventually clear up but the more you fill it, the slower the import of headers will be.

3 - "Display Age". If the group is a dead group and the posts you're downloading are older than your display age, only "Show all" will display them.

4 - "Storage Age". If you have a short "Storage Age", say 1-2 days, any downloaded headers older than that are deleted before they get written to the header database. The default is 3000-4000 days depending on version.

5 - Filtering of the groups. If you notice a progress bar during "Show all Posts" then headers are loading. If they don't then show up, it typically means they're filtered out. B10 shows loaded and displayed file counts in the rightmost field of the toolbar specifically for cases like this.
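
The two age settings in items 3 and 4 act at different stages, which a small sketch can make concrete. This only models the behaviour as described in this post, not Newsbin's internals; the function names and the sample ages are invented for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional


def keep_on_import(posted: datetime, now: datetime, storage_age_days: int) -> bool:
    """Storage Age: headers older than this are dropped before being written."""
    return now - posted <= timedelta(days=storage_age_days)


def visible_on_load(posted: datetime, now: datetime,
                    display_age_days: Optional[int]) -> bool:
    """Display Age: None models "Show all", which ignores the age limit."""
    if display_age_days is None:
        return True
    return now - posted <= timedelta(days=display_age_days)


now = datetime(2021, 6, 3)
ages_in_days = [1, 20, 45, 75, 120]          # made-up post ages
posts = [now - timedelta(days=d) for d in ages_in_days]

stored = [p for p in posts if keep_on_import(p, now, 90)]        # Storage Age 90
shown = [p for p in stored if visible_on_load(p, now, 60)]       # Display Age 60
shown_all = [p for p in stored if visible_on_load(p, now, None)] # "Show all"
print(len(stored), len(shown), len(shown_all))
```

A header removed by the storage check is gone for good, while one hidden by the display check is still on disk and reappears under "Show all".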

Re: Basic header retrieval

Posted: Fri Jun 04, 2021 1:28 am
by rh
So "main" header database would be a database for a specific group? Each group having its own "main" database?

Show all posts - check
Cache 400/400 (0) - Been so long since I needed to, I'd forgotten to keep an eye on this.
Download age - 30
Display age - 60
Storage age - 90

Filters completely off. No progress bar on load and rightmost dropdown field shows 0/0. DB3 file size is a good indicator of an empty DB.

Again, so far I'm only seeing this behavior when I try to load headers for a group that's already part of a GOG that is also used in a Watch List.

<hours pass>

That's IT. What I had asked about earlier.

1. Create a GOG called TEST and add a.b.misc to it
2. Create a Watch Topic AND set Look In Topic to TEST
3. Go to TEST and try ANY method of downloading headers for a.b.misc. The download will happen but no posts will be stored in a.b.misc.

I had also asked this earlier - since the download DOES take place, I was hoping the Watch Topic engine would also see them. But that doesn't seem to be the case since my 14-day hole is still present in the Watch Topic.

Re: Basic header retrieval

Posted: Fri Jun 04, 2021 1:08 pm
by Quade
To make it clearer, records landing in the per group database have nothing to do with records that might eventually end up on the watch lists.

10 days of a.b.misc is 105 million headers. There's very little usable content in Misc too. Usable by headers I mean.

I'll do a test then. I'll create a GOG, move misc there, and create a header based watch for image files associated with that GOG. Then I'll re-download the last 10 days of misc after a purge of the group.

I'm skeptical of your conclusion but only testing will tell. I'm skeptical because I have several header based watches that don't exhibit the symptoms you're suggesting here.


--> Header Download ---> Import into Per group database ---> Of the new records imported see if any match any watches ---> Feed into each matching watch list.
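
The flow above can be sketched in a few lines. This is a toy model under stated assumptions, not Newsbin's code: a header is reduced to its subject string, the per-group database is a plain set, and each watch is a hypothetical predicate function.

```python
from typing import Callable, Dict, List, Set

Header = str  # a subject line stands in for a full header record


def import_headers(downloaded: List[Header], group_db: Set[Header],
                   watches: Dict[str, Callable[[Header], bool]],
                   watch_lists: Dict[str, List[Header]]) -> int:
    """Import downloaded headers, then feed new matches to each watch list."""
    new_records: List[Header] = []
    for header in downloaded:
        if header not in group_db:       # import into the per-group database
            group_db.add(header)
            new_records.append(header)
    for name, matches in watches.items():  # only NEW records are checked
        hits = [h for h in new_records if matches(h)]
        watch_lists.setdefault(name, []).extend(hits)
    return len(new_records)


group_db = {"old.jpg"}
watches = {"images": lambda h: h.endswith(".jpg")}
watch_lists: Dict[str, List[Header]] = {}

added = import_headers(["old.jpg", "new.jpg", "song.mp3"],
                       group_db, watches, watch_lists)
print(added, watch_lists)
```

In this model the watch matching runs only on records that were actually imported, so anything that never reaches the per-group database never reaches a watch list either.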

Re: Basic header retrieval

Posted: Fri Jun 04, 2021 2:42 pm
by rh
Ok, thanks for the clear summary of header processing flow. In that case, I really would have expected my 14-day hole to be filled.

Also, a.b.misc was just an easy test group that I just pulled OUT of my existing GOG. I also used the tip you provided earlier to reset the post storage and set the download age to ONE day since I didn't need to download that much in order to test my theory.

Re: Basic header retrieval

Posted: Fri Jun 04, 2021 4:25 pm
by Quade
rh wrote: In that case, I really would have expected my 14-day hole to be filled.


Our search engine will often tag an obscured file set with meaningful data. Meaning just because you see it in search, you won't necessarily see it in headers.

You're already downloading headers. I'd suggest actually loading the group and see if you see anything. You mentioned some low post counts. 11 million isn't particularly many posts. For you to cover the last month's worth of a.b.misc for example, you'd need to download at least 300 million headers from that group.
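
For scale, the arithmetic can be checked with the figure quoted earlier in the thread (10 days of a.b.misc is 105 million headers). This is a back-of-the-envelope sketch that assumes the posting rate is constant; the helper names are made up.

```python
def headers_per_day(headers: int, days: int) -> float:
    """Average daily header volume over a period."""
    return headers / days


def days_covered(downloaded: int, per_day: float) -> float:
    """How many days a given download represents at that daily rate."""
    return downloaded / per_day


per_day = headers_per_day(105_000_000, 10)   # figure from earlier in the thread
print(f"{per_day:,.0f} headers/day")
print(f"{per_day * 30:,.0f} headers for 30 days")
print(f"{days_covered(11_000_000, per_day):.2f} days covered by 11M headers")
```

At that rate a month is a little over 300 million headers, and 11 million headers is only about one day of a.b.misc.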

Re: Basic header retrieval

Posted: Fri Jun 04, 2021 4:53 pm
by rh
29,146 total & hidden - that's what I ended up with for a.b.misc when I chose get latest with Download Age enabled and set to 1. I think the download tab showed about 21M though.

I could see the count instantly begin to appear in the Group List column during the download. That does not happen when downloading headers for a group in a GOG that's used by a Watch Topic.

<edit>

After 15 hours had elapsed since the 1-day pull from a.b.misc, I unchecked Use Header Download Age and initiated a get latest. It's pulling from 7,831,299 posts and the New Files column is being updated during this download. In a few hours, I'll MOVE a.b.misc back into my primary GOG that a Watch Topic uses and get latest. I'm certain the download count will be near 1M if not a little above but New Files will remain 0 and no new posts will be loaded to the group's DB.

Re: Basic header retrieval

Posted: Fri Jun 04, 2021 10:50 pm
by rh
5 hours 27 minutes later, MOVE a.b.misc to a GOG that is referenced by a Watch Topic.

Get latest headers. Total Size reported in progress 3,068,032

7 minutes later, New Files = 0

Load a.b.misc, 44288/44228 indicated and loaded. Most recent post 5 hours 54 minutes ago.

Re: Basic header retrieval

Posted: Sat Jun 05, 2021 12:39 am
by Quade
When you load a.b.misc, do you actually see any files that match your watch filters? I'm seeing a couple image files and some music files and a whole bunch of obscured stuff with no identifiable filenames.

Normally I won't bother to download headers from Misc because it's all obscured.

I'm starting to think nothing goes into your watch because nothing matches the watch filtering.

Re: Basic header retrieval

Posted: Sat Jun 05, 2021 1:20 pm
by rh
I'm not sure why I ever had a.b.misc in the GOG (from 10+ years ago) and during this testing I never really paid any attention to the posts. Scrolling through the 44k that did come down after I removed it from the GOG and did the header pull, 99% are obscured and I'm seeing the same things you saw. But that isn't the point.

I think I demonstrated that if a group is in a GOG AND that GOG is referenced by an existing Watch Topic - you are unable to manually pull headers for that individual group in the traditional manner.

I could use a.b.mp3 and show the same result.

So to be clear, for this particular test, I was not expecting anything to match a Watch Topic. I was just demonstrating that headers (though downloaded) are not stored and thus I 'presume' not seen by any Watch Topic - IF - the group being processed is in a GOG and that GOG is used by a Watch Topic.

Re: Basic header retrieval

Posted: Sat Jun 05, 2021 2:02 pm
by rh
Today I'll use a.b.mp3. It is currently in a GOG named SOUND.

I just reset the pointer to 1 day. I forgot to note the download "size" but I have 1028 posts loaded now.

Tomorrow, I'll create a Watch Topic to look for "mp3" AND more importantly set Look In Topic to SOUND.

Tomorrow night I'll download latest from the individual group a.b.mp3. It will show header data being downloaded but NO posts will be stored in the DB for a.b.mp3. This will clearly demonstrate the issue using a group with fewer obfuscated posts.

Re: Basic header retrieval

Posted: Tue Jun 08, 2021 12:49 am
by rh
Well, this might be impossible to ultimately chase down. I might end up taking the time to start with a fresh config file some day.

My MP3 test failed in that Download Latest downloaded headers AND DID store them in the group DB, despite that group being IN a GOG that was referenced by an active Watch Topic that I let run its course before trying the download headers.

The other group that would download but always failed to store the downloaded headers still exhibits the problem. I assume it exists for other groups in the same GOG that this one is in. BUT, I really only wanted to get 14 days of headers for THAT group to "catch up" since I was pretty confident the usual activity was present. I found a way to accomplish that during this series of tests. Using the tip you provided earlier about resetting Post Storage, I did that. THEN - I MOVED the group OUT of its GOG into another group I had for Sounds that had never been used in any Watch Topics. THAT WORKED. It downloaded the 14 days of headers AND saved them in the group DB. I loaded the group, saw what I knew was probably posted during those 14 days, and achieved my goal.

So maybe it's just some weird - perhaps legacy - stuff in my config that causes the problem. Moral of the story, never shut down that computer or NB.

Re: Basic header retrieval

Posted: Tue Jun 08, 2021 4:24 pm
by Quade
rh wrote: THEN - I MOVED the group OUT of its GOG into another group I had for Sounds that had never been used in any Watch Topics.


This sort of implies you might have a header filter set for some of your groups or GOGs. There's a little-documented feature that allows you to apply a filter to the stream of headers being imported into the DB3. I could imagine you having a filter set on a GOG that tosses out all the newly downloaded headers.

You can look inside your NBI with a text editor and see if you have a "DownloadFilter" set on any groups or topics.

"DownloadFilter=HeaderFilter"

I use this on one of my groups to remove obscured postings.
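
If the NBI is large, the same check can be scripted instead of eyeballed. The sketch below is an assumption-heavy illustration: it treats the file as INI-like text with `[section]` lines, which this thread does not confirm, and the section and group names in the sample are invented. Only the `DownloadFilter=HeaderFilter` line comes from the post above.

```python
from typing import List, Tuple


def find_download_filters(nbi_text: str) -> List[Tuple[str, str]]:
    """Return (section, filter_name) for every DownloadFilter line found."""
    section = "(no section)"
    found: List[Tuple[str, str]] = []
    for raw in nbi_text.splitlines():
        line = raw.strip()
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]          # remember the enclosing section
        elif line.lower().startswith("downloadfilter="):
            found.append((section, line.split("=", 1)[1].strip()))
    return found


# Invented sample text; a real NBI may be laid out differently.
sample = """\
[Group.Example]
Name=alt.binaries.example
DownloadFilter=HeaderFilter

[Group.Other]
Name=alt.binaries.other
"""
print(find_download_filters(sample))
```

To run it against a real configuration, read the file's text (for example with `pathlib.Path(...).read_text(errors="replace")`) and pass it in. Any section it reports is a candidate for headers being discarded on import.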

Re: Basic header retrieval

Posted: Tue Jun 08, 2021 8:04 pm
by rh
BOOM :shock: - mystery solved.

I indeed had a Header Filter on THE GOG that the problem group was in. Back in the day I was experimenting with header filters after following many posts here on the topic and forgot that was in place. I'm not going to bother dissecting all the filters in SpamFilter but I know I had added entries there to catch spam. My last update might have been so inclusive that it prevented ALL headers from being saved.

Thanks for hanging on with me! Now I can be away and use the Post Storage trick to catch-up if needed.