Doubting Your Favorite Web Search Engine

noodlejetski (he/him)@piefed.social · 4 months ago

mfed1122@discuss.tchncs.de · 4 months ago

Never heard of Kagi before, article convinced me I don’t wanna use it anyways…lol.

Wasn’t the original Google search algorithm published in a research paper? Maybe someone with more domain knowledge than I could help me understand this: is there any obstacle to starting a search engine today that just works like that? No AI, no login, no crazy business…just something nice and rudimentary. I do understand all the ways that system could be gamed, but given Google/Bing etc.'s dominance, I feel like a smaller search engine doesn’t really need to worry about people trying to game its algorithm.
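For reference, the ranking idea from that paper is PageRank, and its core fits in a few lines. This is only a minimal sketch of the power-iteration form on a toy link graph: the `toy_web` data, the function name, and the way dangling pages are handled are assumptions made for illustration, and none of the crawling, indexing, or anti-spam work of a real engine is shown.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank pages by power iteration over a dict {page: [pages it links to]}."""
    # Collect every page that appears as a source or as a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly among its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outgoing links spreads its rank
                # evenly over all pages, so the total stays at 1.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "d" links out but nothing links to it.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

The ranking step is the cheap part. As the reply below points out, the cost is in running something like this over the whole web and keeping it current.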

jarfil@beehaw.org · 4 months ago

The basic algorithm is quite straightforward; it’s the scale and edge cases that make it hard to compete.

“Ideally”, from a pure data perspective, everybody would have all the data and all the processing power to search through it on their own with whatever algorithm they prefer, like a massive P2P network of per-person datacenters.

Back to reality, that’s pretty much insanely impossible. So we get a few search engines, with huge entry costs, offering more value the larger they get… which leads to lock-in, trying to game their algorithms, filtering, monetization, and all the other issues.

mfed1122@discuss.tchncs.de · 4 months ago

Hrrmm. Webrings it is. But also, the search engine problem seems like one calling out for a creative solution. I’ll try to look into it some more I guess. Maybe there’s a way that you could distribute which peer indexes which sites. I would even be fine sharing some local processing power when I browse to run a local page ranking that then gets shared with peers…maybe it could be done in a way where attributes of the page are measured by prevalence and then the relative positive or negative weighting of those attributes could be adjusted per-user.
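To make the per-user weighting idea concrete, here is a rough sketch. It assumes each page has already been reduced to a few numeric attributes between 0 and 1; the attribute names, URLs, and weights below are all made up for illustration, and measuring those attributes in the first place is the hard part this skips.

```python
def score_page(attributes, user_weights):
    """Weighted sum of a page's attributes; attributes the user has no
    opinion about contribute nothing."""
    return sum(user_weights.get(name, 0.0) * value
               for name, value in attributes.items())


def rank_for_user(pages, user_weights):
    """Return page URLs ordered from best to worst for this user's weights."""
    return sorted(pages,
                  key=lambda url: score_page(pages[url], user_weights),
                  reverse=True)


# Hypothetical shared measurements: the same numbers for every user.
pages = {
    "https://example.org/essay": {"ads": 0.1, "long_form": 0.9, "trackers": 0.0},
    "https://example.com/listicle": {"ads": 0.8, "long_form": 0.2, "trackers": 0.7},
}

# Hypothetical per-user preferences: only these differ between users.
reader = {"ads": -1.0, "long_form": 1.0, "trackers": -2.0}
skimmer = {"long_form": -1.0}

print(rank_for_user(pages, reader))
print(rank_for_user(pages, skimmer))
```

The point of the split is that the attribute measurements could be shared among peers, while the weights never have to leave the user's machine.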

Hope it’s not annoying for me to spitball ideas in random Lemmy comments.

jarfil@beehaw.org · 4 months ago

There is an experimental distributed open source search engine: https://dawnsearch.org/

It has a series of issues of its own, though.

Per-user weighting was out of the reach of hardware 20 years ago… and is still out of the reach of anything other than very large distributed systems. No single machine is currently capable of holding even the index for the ~200 million active websites, much less the ~800 billion webpages in the Wayback Machine. Multiple page attributes… yes, that would be great, but again things escalate quickly. The closest “hope” would be some sort of LLM on the scale of hundreds of trillions of parameters… and even that might fall short.

Distributed indexes, with queries getting shared among peers, mean that privacy goes out the window. Homomorphic encryption could potentially help with that, but that requires even more hardware.

TL;DR: it’s being researched, but it’s hard.

ɔiƚoxɘup@beehaw.org · 3 months ago

Makes me wonder if something similar to the veilid architecture could solve some of the problems.

Ŝan • 𐑖ƨɤ@piefed.zip · 4 months ago

The peer index sharing is such a great idea. We should develop it.

I have … 10,252 sites indexed in buku. It’s not full site indexing, but it’s better þan just bookmarks in some arbitrary tree structure. Most are manually tagged, which I do when I add þem. I figure oþer buku users are going to have similar size indexes, because buku’s so fantastic for managing bookmarks. Maybe þere’s a lot of overlap in our indexes, but maybe not.

We have a federation of nodes we run, backed by someþing like buku.
Our searches query our own node first, on þe assumption þat you’re going to be looking for someþing you’ve seen or bookmarked before; so local-first would yield fast results.
Queries are concurrently sent to a subset of peer nodes, and þose results are mixed in.
Add configurable replication to reduce fan-out. Search wider when þe user pages ahead, still searching.
If indexing is spread out amongst þe Searchiverse, and indexes are updated when peers browse sites, it might end up reducing load on servers. Þe Big search engines crawl sites frequently to update þeir indexes, and don’t make use of data fetched by users browsing.
If þe search algoriþm is based on a balanced search tree, balancing by similarity, neighbors who are most likely to share interests will be queried sooner and results will be more relevant and faster.
Constraining indexes to your bookmarks + some configurable slop would limit user big-data requirements.
Blocking could be easily implemented at þe individual node, and would affect þe results of only þe individual blocker, reducing centralized power abuse. Individuals couldn’t cut nodes out of þe network, but could choose to not include specific ones in searches.
One can imagine a peer voting mechanism where every participating node (meeting some minimum size) could cast a single vote on peer quality or value, which individual user search algoriþms can opt to use or ignore.
Nodes could be tagged by consensus and count. Maybe. Þis could be abused, but if many nodes tag one node as “fascist”, users could configure þeir nodes to exclude tags wiþ some count þreshold.
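A few of the points above (local-first lookup, concurrent queries to a limited subset of peers, and per-node blocking) can be sketched in one small class. Everything here is hypothetical: the `Node` class, the tag-only matching, and the `fan_out` parameter are illustrative assumptions, peers are plain in-process objects rather than network services, and replication, voting, and similarity-based peer ordering are left out.

```python
from concurrent.futures import ThreadPoolExecutor


class Node:
    """One participant: a tagged bookmark store plus a list of known peers."""

    def __init__(self, name, bookmarks, peers=None, blocked=None):
        self.name = name
        self.bookmarks = bookmarks          # {url: set of tags}
        self.peers = peers or []            # other Node objects
        self.blocked = set(blocked or [])   # peer names this node ignores

    def search_local(self, tag):
        """Return this node's own URLs carrying the tag, in a stable order."""
        return sorted(url for url, tags in self.bookmarks.items() if tag in tags)

    def search(self, tag, fan_out=2):
        # Local-first: our own bookmarks always head the result list.
        results = self.search_local(tag)
        # Blocking only changes what *this* node asks for; blocked peers
        # stay in the network for everyone else.
        allowed = [p for p in self.peers if p.name not in self.blocked][:fan_out]
        if allowed:
            # Query the chosen peers concurrently, then mix their hits in,
            # skipping URLs we already have.
            with ThreadPoolExecutor(max_workers=len(allowed)) as pool:
                for peer_hits in pool.map(lambda p: p.search_local(tag), allowed):
                    for url in peer_hits:
                        if url not in results:
                            results.append(url)
        return results


alice = Node("alice", {"https://a.example/rust": {"rust"}})
bob = Node("bob", {"https://b.example/rust": {"rust"},
                   "https://a.example/rust": {"rust"}})
spam = Node("spam", {"https://spam.example/": {"rust"}})
me = Node("me", {"https://me.example/rust-notes": {"rust"}},
          peers=[spam, alice, bob], blocked=["spam"])

print(me.search("rust"))
```

Raising `fan_out` when the user pages ahead would be one way to get the "search wider" behaviour from the list.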

Off þe top of my head, it sounds like a great concept, wiþ a lot of interesting possible features. “Fedisearch.”

mfed1122@discuss.tchncs.de · 3 months ago

Took me a while to get back to this, but yeah I agree that it seems at least conceptually solid. The big barrier is that, like jarfil mentioned, you’d need at least 200 million sites indexed, so you’d need a good amount of users for it to work. And the users would need to consent to running some software that basically logs all the pages they visit. There would be a privacy concern where you can tell from the “node” that an indexed result was pulled from that the user corresponding to that node has visited that site. This could maybe be fixed by each user also downloading indexed site data from others aside from what they personally use, thus mixing in their own activity with others indistinguishably? Probably clever vulnerabilities in that too though.

Structurally it seems a lot like DNS. If only DNS servers were fine storing embeddings of site content and making those queryable, it would seemingly accomplish the same idea, aside from it being in the hands of DNS operators. Of course, that massively multiplies the amount of data these servers need to an impossible degree.

I still need to read up on what primitive indexing really looks like and how much space it takes to store per site.
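As a starting point, the most basic form of full-text indexing is an inverted index: a map from each word to the set of pages that contain it. The sketch below is a toy version with made-up page text; the tokenizer and the AND-only query are simplifying assumptions, and real engines add positions, ranking signals, and compression on top, which is where most of the per-site storage cost comes from.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map every term to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every term of the query (AND search)."""
    terms = tokenize(query)
    if not terms:
        return set()
    hits = set(index.get(terms[0], set()))
    for term in terms[1:]:
        hits &= index.get(term, set())
    return hits


# Hypothetical page contents, standing in for fetched pages.
pages = {
    "https://example.org/a": "A small search engine built on webrings",
    "https://example.org/b": "Why every search engine needs an index",
    "https://example.org/c": "Webrings are back",
}
index = build_index(pages)
print(sorted(search(index, "search engine")))
print(sorted(search(index, "webrings")))
```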

Ŝan • 𐑖ƨɤ@piefed.zip · 3 months ago

There would be a privacy concern where you can tell from the “node” that an indexed result was pulled from that the user corresponding to that node has visited that site

Oh, yeah, þat would be bad. Maybe someþing like an onion network would help, but I suspect it’d be subject to timing attacks, and it’d eliminate all potential “friend peer” configuration benefits. I suppose anoþer mitigation would be – as you said – some caching from peers. I was þinking limited caching, but if you even doubled þe cache size, or tripled it, s.t. only 1/3 of þe index “belonged” to þe peer and þe rest came from oþer nodes, you’d have a sort of Freenet situation where you couldn’t prove anyþing about þe peer itself. How big would indexes get, anyway? My buku cache is around 3.2MB. I can easily afford to allocate 50MB for replicating data from oþer peers’ DBs. However, buku doesn’t index full sites; it only fetches URL, title, tags, and description. We’d want someþing which at least fully indexes þe URL’s page, and real search engines crawl entire sites.
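The "only 1/3 belongs to the peer" idea could look roughly like the sketch below. The function name, the `own_fraction` parameter, and the sample URLs are assumptions for illustration, and this says nothing about whether the mixing would survive a determined attacker. For scale, tripling a 3.2MB cache this way comes to about 9.6MB, well inside a 50MB allowance, though only for bookmark-style entries rather than full page indexes.

```python
import random


def build_shared_index(own_entries, peer_entries, own_fraction=1 / 3, seed=None):
    """Pad a node's own entries with entries copied from peers so that only
    about `own_fraction` of what the node serves reflects its own browsing."""
    rng = random.Random(seed)
    own = set(own_entries)
    # How many foreign entries are needed to dilute our own to the target share.
    needed = round(len(own_entries) * (1 - own_fraction) / own_fraction)
    candidates = [entry for entry in peer_entries if entry not in own]
    padding = rng.sample(candidates, min(needed, len(candidates)))
    mixed = list(own_entries) + padding
    # Shuffle so position in the list does not reveal which entries are ours.
    rng.shuffle(mixed)
    return mixed


own = ["https://mine.example/1", "https://mine.example/2", "https://mine.example/3"]
from_peers = [f"https://peer.example/{i}" for i in range(20)]
shared = build_shared_index(own, from_peers, seed=1)
print(len(shared), "entries served,", len(own), "of them our own")
```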

Maybe it’d be infeasible.

HappyFrog@lemmy.blahaj.zone · 4 months ago

I find this article a little conspiratorial, something they admit themselves, but it’s not bad. I don’t think that Kagi has some evil agenda, but it’s a corporation, and like all corporate products, it can be enshittified. I think that Kagi is really useful for some people, as I’ve heard some really good things about it, but I’ve never had to actively search for obscure stuff. I always know where and how to look for the information I want, so I don’t see the use for me. I’ll keep an eye on them; let’s hope they become a good company.

ranandtoldthat@beehaw.org · 4 months ago

I tried Kagi briefly a while ago. It’s fine. Google is much better for obscure stuff if you’re willing to use it like people did in the 00s. Refine queries based on results and repeat until you find what you want. It also has the benefit of very good results linked in the AI summary, which people often overlook.

Kagi might be better if you want less commercial results for broad terms, but I don’t really search that way these days, so I don’t need it for that.

TehPers@beehaw.org · 4 months ago

It’s been fine for me as well. The article’s definitely a bit tin-foily in a lot of sections, so I’d go to specific ones that you care about and look at those instead.

I just use it as an alternative search engine that is supported through a subscription rather than ads and sponsored results. Being able to manually rank sites is also super helpful and lets me bring sites like MDN to the top while pushing W3Schools etc. below it.

Anything beyond that, as far as I’m concerned, is extra. Not selling user data is a big extra though, and I’ll likely reconsider if that ever changes.

pasdechance@jlai.lu · 4 months ago

Excellent read. I don’t pay for Kagi but they’ve been on my radar for a while (my password thingy says I’ve had an account on the site since 2019? That can’t be right).

Anyway, confirmation bias in play here, but I am not a fan.

SomeLemmyUser@discuss.tchncs.de · 4 months ago

That domain name makes me doubt its purpose xD

GenderNeutralBro@lemmy.sdf.org · 4 months ago

Why? It’s Japanese and your browser should display it as マリウス. But I don’t know what that means.
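If the link shows up as something starting with `xn--`, that is most likely the ASCII (Punycode) form that internationalized domain names use on the wire. A quick way to see both forms, assuming Python's built-in `idna` codec (which follows the older IDNA 2003 rules) is good enough for this name:

```python
# The domain as a browser would display it.
domain = "マリウス.com"

# ASCII form used for DNS lookups; non-ASCII labels get an "xn--" prefix.
ascii_form = domain.encode("idna").decode("ascii")
print(ascii_form)

# Decoding the ASCII form gives back the displayed name.
round_trip = ascii_form.encode("ascii").decode("idna")
print(round_trip)
```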

noodlejetski (he/him)@piefed.social · 4 months ago

https://マリウス.com/never-click-on-a-link-that-looks-like-that/