Friday, April 22, 2022

Searching my own blog

From time to time I check something I wrote before. I've been relying on Google--logically they'd be the best since Blogger is a Google product--but Google search is no longer returning very many references. I don't know why--probably most blogs are just noise for ordinary searches. They've cluttered some of my searches in the past...

Since Google is no longer providing the info, I have to do this myself. Using takeout.google.com you can back up the whole thing and get an emailed link to a .zip file that contains the lot. When you unzip it you'll find a "feed.atom" file that looks like a fairly simple XML file.

I figured BFI (brute force and ignorance) was the way to go, so I settled on making a copy of the text contents, one line per post. The simple approach is something like

grep -i horse searchable | grep -i boy | grep -i lewis

which gives me the full text, along with a lot of clutter like

The Last Battle</U>.</P> <P>I think

that is trivial to clean up.
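
For what it's worth, here is a minimal sketch of that clean-up, assuming the clutter is only literal HTML tags like the ones above. The function name strip_tags is made up for this example, and the sed expression is a blunt instrument: it replaces anything between < and > with a space, so it would also eat text such as "a < b and c > d".

```shell
# strip_tags (hypothetical name): read text on stdin, replace anything that
# looks like an HTML tag with a space, then squeeze runs of spaces into one.
strip_tags() {
    sed -e 's/<[^>]*>/ /g' | tr -s ' '
}

# Possible use with the search above:
#   grep -i horse searchable | grep -i boy | grep -i lewis | strip_tags
```

Replacing tags with a space rather than deleting them keeps words on either side of a tag from running together, at the cost of an occasional stray space before punctuation.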

If you're interested in trying it too, I used the awk script below (no error handling or sanity checking, sorry), saved as makesearch.awk and run as

awk -f makesearch.awk feed.atom > searchable

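# makesearch.awk: join each post's <id> and <content> into one output line.
BEGIN { ready = 0 }     # ready is 1 while we are inside a <content> element
{
        if (index($0, "<id>") > 0) printf("%s ", $0)    # post id leads the line
        if (index($0, "<content") > 0) ready = 1        # content starts on this line
        if (ready == 1) {
                # The closing tag may sit on the same line as the opening one.
                if (index($0, "</content>") > 0) { printf("%s\n", $0); ready = 0 }
                else printf("%s ", $0)                  # still inside: keep joining
        }
}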

Then I put the file somewhere I'll remember it and put a reference to it in .bashrc, and I should be good. At least WRT searching old items in my blog.
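
In case it helps, the .bashrc reference could look something like the sketch below. The variable BLOG_SEARCHABLE, the default path under $HOME, and the function name blogsearch are all placeholders made up for this example. It ANDs its arguments together the same way the chain of greps above does.

```shell
# Assumed location of the file produced by makesearch.awk; adjust to taste.
BLOG_SEARCHABLE="${BLOG_SEARCHABLE:-$HOME/blog/searchable}"

# blogsearch WORD [WORD...] (hypothetical helper): print the posts, one per
# line, that contain every WORD, ignoring case, like one "grep -i" per word.
blogsearch() {
    if [ "$#" -eq 0 ]; then
        echo "usage: blogsearch WORD [WORD...]" >&2
        return 1
    fi
    local hits word
    hits=$(cat "$BLOG_SEARCHABLE") || return 1
    for word in "$@"; do
        # Keep only the lines that also match this word.
        hits=$(printf '%s\n' "$hits" | grep -i -- "$word")
    done
    # Print nothing (and return non-zero) when no post matched every word.
    [ -n "$hits" ] && printf '%s\n' "$hits"
}
```

With that in place, "blogsearch horse boy lewis" should behave like the three-grep pipeline above. It holds the whole file in a shell variable, which is lazy but fine for a file of this size.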

UPDATE: Yes, I know about the search feature in the Blogger home page, but I am not always logged into my Gmail account when composing and searching--nor should I be. Why should random cookies for random sites have any extra information about me?

2 comments:

Douglas2 said...

Early on in my experience of Google, I found that I could get to the information I wanted much more quickly by using Boolean search operators, many of which work on other search engines as well.

So, for example, if I wanted to search something I remember you writing about Liberia, I would start with a search with the string: liberia site:idontknowbut.blogspot.com

It sure looks like that's broken, however, as I'm only getting 8 results -- at least they are all really from your blog.

james said...

Yep, that's why I decided to do this. Even when I specify that I want only to look at the one site, Google seems to have instituted a limit on the number of items to return.
Or perhaps I should say "number of items to keep track of," since even a very specific search for a phrase I know to be present returns nothing.