I’m trying to set up a periodic crawl for terms of interest to see if communities pop up that discuss them. Some are easy, like “Mordhau” because no other words contain it. On the other hand, “sword” and “HEMA” are problematic because things like “password” and “mathematical” contain them.

I tried putting quotes around the string, but that returned nothing, and padding a space on either side didn’t change the results.

  • heartlessevil@lemmy.one
    2 years ago

    There doesn’t seem to be a way. The search is fairly rudimentary, you can see how it works here:

    https://github.com/LemmyNet/lemmy/blob/50efb1d519c63a7007a07f11cc8a11487703c70d/crates/db_schema/src/utils.rs#L59-L62

    It replaces spaces in your search query with wildcards and puts wildcards at the beginning and end. It also uses case-insensitive search. Here’s an example with regexes for the query “foo bar baz”:

    https://rubular.com/r/EiMum5gV9jWOaL
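
    For anyone who would rather poke at this locally, here is a rough Python sketch of the behaviour described above. It is not Lemmy's actual code (that is the Rust at the GitHub link); `fuzzy_pattern` and `ilike` are made-up names, and `ilike` only approximates SQL's case-insensitive LIKE with a regex. Assuming the description is accurate, it would also explain why quotes return nothing (they are matched literally) and why padding with spaces changes nothing (the spaces just become more wildcards).

```python
import re


def fuzzy_pattern(query: str) -> str:
    """Turn a search query into a LIKE-style pattern as described above:
    every space becomes a '%' wildcard, and '%' is added at both ends."""
    return "%" + query.replace(" ", "%") + "%"


def ilike(text: str, pattern: str) -> bool:
    """Approximate a case-insensitive SQL LIKE match (illustration only).
    '%' matches any run of characters, '_' matches exactly one character."""
    regex = "".join(
        ".*" if ch == "%" else "." if ch == "_" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, text, flags=re.IGNORECASE | re.DOTALL) is not None


if __name__ == "__main__":
    print(fuzzy_pattern("foo bar baz"))
    # A substring hit: "sword" is found inside "password".
    print(ilike("Forgot my password", fuzzy_pattern("sword")))
    # Padding with spaces only adds more wildcards.
    print(ilike("Forgot my password", fuzzy_pattern(" sword ")))
    # Quotes are treated as literal characters that must appear in the text.
    print(ilike("Forgot my password", fuzzy_pattern('"sword"')))
```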

    I think it’s worth mentioning that this search algorithm is extremely slow. It uses several wildcard matches (multiple consecutive spaces also create multiple wildcards), it is case-insensitive, and it doesn’t have indices. Even the built-in full-text search capabilities of PostgreSQL would be miles more scalable.

    • Ken Oh@lemm.ee (OP)
      2 years ago

      Ah thanks, that does look very much on the crawl side of crawl-walk-run. Hopefully the devs understand the urgency of discoverability.

    • kevincox@lemmy.ml
      2 years ago

      It’s amazing how far a table scan will scale. But yes, I think replacing this with PostgreSQL Full Text Search will probably need to be done sooner rather than later.

      This will give some benefits, such as stemming, but will have some tradeoffs, such as only allowing searches by full words (but for Lemmy this is probably what people want 99% of the time).
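
      In the meantime, a crawler could post-filter whatever the search returns on its own side. Below is a minimal Python sketch of that idea using regex word boundaries. To be clear, this is not PostgreSQL Full Text Search and nothing Lemmy provides; `whole_word_matcher` and `filter_hits` are hypothetical helper names, and the input is assumed to be a plain list of post titles or bodies the crawler already fetched. It also shows the tradeoff mentioned above: without stemming, a full-word match on “sword” skips “Swords”.

```python
import re
from typing import Iterable, List


def whole_word_matcher(term: str) -> "re.Pattern[str]":
    """Compile a case-insensitive pattern that matches `term` only as a
    whole word. \\b is a word boundary, so 'sword' will not match inside
    'password'. The term is escaped so it is treated literally."""
    return re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)


def filter_hits(term: str, texts: Iterable[str]) -> List[str]:
    """Keep only the texts that contain `term` as a standalone word."""
    matcher = whole_word_matcher(term)
    return [text for text in texts if matcher.search(text)]


if __name__ == "__main__":
    # Assumed sample data standing in for crawled search results.
    results = [
        "Forgot my password",
        "Best sword for beginners",
        "Swords and shields",
    ]
    for hit in filter_hits("sword", results):
        print(hit)
```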