Does Lemmy really benefit from Rust? Is code execution speed the bottleneck?

Buttons@programming.dev · edit-2 3 years ago

Does Lemmy really benefit from Rust? Is code execution speed the bottleneck?

dudeami0@lemmy.dudeami.win · 3 years ago

The numbers are a little higher than you mention (currently ~3.2k active users). The server isn’t very powerful either; it’s now running on a dedicated server with 6 cores/12 threads and 32 GB of RAM. Other public instances are using larger servers, such as lemmy.world running on an AMD EPYC 7502P 32 Cores “Rome” CPU and 128GB RAM or sh.itjust.works running on 24 cores and 64GB of RAM. Without running one of these larger instances, I cannot tell what the bottleneck is.

The issues I’ve heard with federation are currently how ActivityPub is implemented, and possibly the fact all upvotes are federated individually. This means every upvote causes a federation queue to be built, and with a ton of users this would pile up fast. Multiply this by all the instances an instance is connected to and the number of requests grows multiplicatively. ActivityPub is the same protocol used by other federated servers, including Mastodon, which had growing pains but appears to be running large instances smoothly now.
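To put rough numbers on the fan-out described above, here is a minimal sketch, assuming each vote is delivered once to every linked instance. The figures and the function name are made up for illustration and are not measurements from any real instance.

```python
def outbound_deliveries(votes: int, linked_instances: int) -> int:
    """Deliveries needed if every vote is sent separately to every linked instance."""
    return votes * linked_instances


# Hypothetical load: 10,000 votes per hour, 200 linked instances.
print(outbound_deliveries(10_000, 200))  # 2000000
```

Doubling either the vote rate or the number of linked instances doubles the outbound traffic, which is why a sudden influx of users shows up in the queues so quickly.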

Other than that, websockets seem to be a big issue, but that is being resolved in 0.18. It also appears every connected user gets all the information being federated, which is the cause of the spam of posts being prepended to the top of the feed. I wouldn’t be surprised if people are already botting content scrapers/posters as well, which might cause a flood of new content that has to get federated, which in turn causes queues to back up; this is mostly speculation though.

As it goes with development, generally you focus on feature sets first. Optimization comes once you reach a point where a code freeze makes sense; then you can work on speeding things up without new features breaking stuff. This might be an issue for new users temporarily, but this project wasn’t expecting a sudden increase in demand. This is a great way to show where the inefficiencies are and to improve performance, though. I have no doubt these will be resolved in a timely manner.

My personal node seems to use minimal resources, not having even registered compared to my other services. Looking at the process manager the postgres/lemmy backend/frontend use ~250MB of RAM.

For now, staying off lemmy.ml and moving communities to other instances is probably best. The use case of large instances anywhere near the scale of reddit wasn’t the goal of the project until reddit users sought alternatives. We can’t expect to show up here and demand it work how we want without a little patience and contributing.

glorbo@lemmy.one · 3 years ago

Yup was just typing a comment to basically this effect. Federation adds a ton of overhead – you can still do things fairly efficiently, but every interaction having to fan out to (and fan in from!) many servers instead of like a single RDBMS is gonna cost you.

In all likelihood the code is not as efficient as it could be, but usually you get time to work those out gradually. A giant influx of users quickly turns “TODO: fix in the next six months” into “Oh god the servers are melting fuck fuck.”

That said, assuming the devs can get over this hump, I suspect using a compiled language will pay off long-term. Sure things will still be primarily IO-bound, but making things less CPU-bound is usually a good thing.

For some illustrative examples: Mastodon is in Ruby and hits dumb scaling limitations far more often than other fedi microblogs. Pleroma/Akkoma are Elixir (and BEAM is super well optimized for fast message passing/scaling/IO), Calckey (primarily Typescript) is moving some code to Rust, GoToSocial (Golang) is able to run in a fraction of the resources of Mastodon. The admins of one of the bigger tech instances recently announced they’re basically giving up on administrating Mastodon and are instead going to write a new server from scratch in a compiled language because it’s easier for them than scaling a Rails monolith.

TL;DR everything is IO-bound til it’s not.

AggressivelyPassive@feddit.de · 3 years ago

I’m pretty sure the fediverse needs a new kind of node at some point. If we assume that almost every larger instance is connected to almost every other larger instance directly, then there’s a ton of duplicated and very small messages.

There needs to be some kind of hub in-between to aggregate and route this avalanche. Especially if, like you wrote, every upvote is a message, the overhead (I/O, unmarshalling, etc) is huge.

topbroken@programming.dev · 3 years ago

This is kinda how Usenet worked (well, still does). Rather than n*n federated connections, smaller providers tend to federate with central hubs that form backbones.

I think it makes sense for the fediverse as well.

chris@l.roofo.cc · edit-2 3 years ago

You mean like centralizing the fediverse? Who hosts the hub? Who maintains it? In which country? Who pays for it?

topbroken@programming.dev · edit-2 3 years ago

O(n*n) isn’t really scalable, so you either

a - have a small number of nodes total

b - have a small number of hubs with a larger number of leaf nodes.

Either way, there’s going to be some nodes that become more influential than others.
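A rough sketch of the difference, assuming a simplified model where hubs form a full mesh among themselves and every leaf node talks to exactly one hub. The function names and the choice of 5 hubs are illustrative only.

```python
def full_mesh_links(n: int) -> int:
    """Every node federates directly with every other node."""
    return n * (n - 1) // 2


def hub_links(n: int, hubs: int) -> int:
    """Hubs are fully meshed with each other; each leaf connects to one hub."""
    leaves = n - hubs
    return full_mesh_links(hubs) + leaves


for n in (10, 100, 1000):
    print(n, full_mesh_links(n), hub_links(n, hubs=5))
# 10 45 15
# 100 4950 105
# 1000 499500 1005
```

The mesh grows quadratically while the hub layout grows roughly linearly, at the cost of concentrating traffic (and influence) in the hubs.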

snowe@programming.dev · 3 years ago

Hi, programming.dev owner here. From what I’ve been seeing it’s a lot of memory issues. We were hitting swap which was causing massive disk io. You can see what happened with the disk io immediately after the upgrade to more memory. I know at least one reason is being resolved in this PR

We were also having issues with the nginx config. There were some really weird settings that I don’t think were necessary. Finally, the federation is quite busy. So if someone subscribes to events from 10 different servers, we pull in every single event, even upvotes. There’s currently a lot of work being done around this stuff.

I don’t think Rust is the problem. I think it’s just a growth thing. Every platform has growth challenges, things grow in ways that you never expect. You might have thought that it was going to be IO constrained due to the federation, but in reality it’s memory constrained because memory is actually the most expensive thing to have on a server. etc.

argv_minus_one@beehaw.org · 3 years ago

So if someone subscribes to events from 10 different servers, we pull in every single event, even upvotes. There’s currently a lot of work being done around this stuff.

You mean like coalescing multiple events into a single message, or…? (I don’t know anything about ActivityPub, so apologies if this is a stupid question!)

snowe@programming.dev · 3 years ago

Correct. I’ve been looking for the thread to try and find it for you, but haven’t been having any luck. People have been discussing exactly that, though it seems like it could cause some problems with vote faking. Anyway, it is being worked on!
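For illustration only, coalescing could look something like the sketch below. This is not how Lemmy or ActivityPub does it; the event shape (post id plus a +1/-1 delta) and the function name are assumptions made up for this example.

```python
from collections import Counter


def coalesce_votes(events):
    """Collapse individual (post_id, delta) vote events into one net delta per post."""
    totals = Counter()
    for post_id, delta in events:
        totals[post_id] += delta
    # One entry per post instead of one message per vote; zero net change is dropped.
    return {post_id: delta for post_id, delta in totals.items() if delta != 0}


events = [
    ("post-1", +1), ("post-1", +1), ("post-2", -1),
    ("post-1", -1), ("post-3", +1), ("post-3", -1),
]
print(coalesce_votes(events))  # {'post-1': 1, 'post-2': -1}
```

Six events become two entries here. The trade-off is visible in the output: the batch carries only totals, so the receiving instance can no longer see who cast which vote, which is presumably where the vote-faking concern comes from.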

binwiederhier@discuss.ntfy.sh · 3 years ago

Thank you for the insight. Fascinating. Also insane that every upvote causes a flood of messages being distributed…

Espi@kbin.social · 3 years ago

I would say that it’s extremely unlikely.

Websites in general are never limited by raw code execution; they are mostly limited by IO. Be that disk IO as files are read and written, database IO as you need to execute complex queries to gather all the data to build the user timeline, or network IO to transfer data to and from the user. For decentralized social media like Kbin or Lemmy it’s even more IO limited, as each instance needs to go back and forth to other instances to keep up-to-date data.

Websites usually benefit much more from caching and in-memory databases to keep frequently used data in fast storage.

This is why simple, high level, object oriented, garbage collected languages have become so common. All the CPU performance penalties they incur don’t actually affect the website performance.

TortoiseWrath@tortoisewrath.com · edit-2 3 years ago

Not relevant to lemmy (yet), but this does break down a bit at very large scales. (Source: am infra eng at YouTube.)

System architecture (particularly storage) is certainly by far the largest contributor to web performance, but the language of choice and its execution environment can matter. It’s not so important when it’s the difference between using 51% and 50% of some server’s CPU or serving requests in 101 vs 100 ms, but when it’s the difference between running 5100 and 5000 servers or blocking threads for 101 vs 100 CPU-hours per second, you’ll feel it.

Languages also build up cultures and ecosystems surrounding them that can lend themselves to certain architectural decisions that might not be beneficial. I think one of the major reasons they migrated the YouTube backend from Python to C++ isn’t really anything to do with the core languages themselves, but the fact that existing C++ libraries tend to be way more optimized than their Python equivalents, so we wouldn’t have to invest as much in developing more efficient libraries.

terebat@programming.dev · 3 years ago

It is fairly relevant to lemmy as is. Quite a few instances have ram constraints and are hitting swap. Consider how much worse it would be in python.

Currently most of the issues are architectural and can be fixed by tweaking how certain things are done (e.g., image hosting on an object store instead of locally).

Baldr@programming.dev · edit-2 3 years ago

That’s correct. I wonder if YouTube still uses Python to this day (seems like they migrated to C++?)

Not saying there isn’t a difference in language performance, but for most real-world problems the architecture and algorithms matter more than the language for performance, unless you’re in a very constrained environment such as lower-end smartphones or embedded systems.

TortoiseWrath@tortoisewrath.com · 3 years ago

I wonder if YouTube still uses Python to this day

We do not.

Baldr@programming.dev · 3 years ago

Also this makes you think (assuming it’s true lol): https://www.reddit.com/r/Lemmy/comments/14h965f/comment/jpdemet

th3raid0r@tucson.social · edit-2 3 years ago

In lemmy’s case, my perusal of the DB didn’t really suggest that the queries would be that complex and I suspect that moving it to a higher performance NoSQL DB might be possible, but I’d have to take a look at a few more queries to be sure.

I wonder if this could be made to work with Aerospike Community Edition…

Obviously it could be more effort than it’s worth though.

hungrybread@lemmygrad.ml · 3 years ago

There’s no need to migrate the database, that shouldn’t be an issue at this size. Caching should be implemented as another comment suggested.

TortoiseWrath@tortoisewrath.com · 3 years ago

Oh shit does lemmy not have response caching? Yeah, that’s gonna be an issue pretty soon.

Baldr@programming.dev · 3 years ago

https://www.reddit.com/r/Lemmy/comments/14h965f/comment/jpdemet

alertsleeper@programming.dev · 3 years ago

Would you be so kind as to recommend some resources about caching? I’ve read the basics, but have yet to dive deep on it

hungrybread@lemmygrad.ml · 3 years ago

The basic idea is to keep data as close to the processor as possible, so with a database that means storing the result of commonly used queries in memory.
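A minimal sketch of that idea, assuming a simple time-based expiry. The class, the key name, and the stand-in query function are all hypothetical, and a real deployment would more likely use an existing cache (Redis, memcached, or an HTTP cache) than hand-rolled code.

```python
import time


class QueryCache:
    """Tiny in-memory cache that keeps each result for ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = self.clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # still fresh: served from memory
        value = compute()  # missing or expired: take the slow path
        self._entries[key] = (now + self.ttl, value)
        return value


calls = {"n": 0}


def slow_front_page_query():
    # Stand-in for an expensive database query (hypothetical).
    calls["n"] += 1
    return ["post-1", "post-2"]


cache = QueryCache(ttl_seconds=30)
cache.get_or_compute("front_page", slow_front_page_query)
cache.get_or_compute("front_page", slow_front_page_query)
print(calls["n"])  # 1
```

The second request never reaches the "database". The hard part in practice is choosing the expiry and deciding when to invalidate early, since a cached front page is by definition slightly stale.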

Baldr@programming.dev · 3 years ago

Good resources.

KindaABigDyl@programming.dev · 3 years ago

It could be the devs just like programming in Rust. It’s a nice language lol

dcormier@beehaw.org · 3 years ago

I know I do. ¯\_(ツ)_/¯

zygo_histo_morpheus@programming.dev · 3 years ago

One benefit of using rust for webservers in general is that it’s possible to have consistently lower latency compared to GC’d languages. Discord mentions this in their article about migrating to rust from go: https://discord.com/blog/why-discord-is-switching-from-go-to-rust

Another difference between rust and e.g. python is that rust expects you to invest more time to get code that’s runnable in the first place, but likely more well optimized and correct.

In my experience from writing rust, the language pushes you to write more efficient code compared to python because it makes things like copying visible and also because it’s easier to reason about memory usage compared to garbage collection which means that you’re more likely to have a useful model of the performance cost of various things while you’re programming.

It’s possible that a hypothetical lemmy written in python would have allowed the devs to do some big picture optimizations that they haven’t had the time to do yet in the rust version, so for the time being it might be slower than a python alternative.

Rust is likely to catch up though: eventually the rust version will probably also have this optimization, while the python version has to resort to making smaller optimizations that the rust version already had in the first version of the code or that you get for free from the language.

sznio@lemmy.world · 3 years ago

Pulling this out of my ass, but I think the problem might be in Lemmy using websockets.

I feel like supporting 1500 simultaneous users making a request every 10-20 seconds is easier than keeping 1500 websockets alive.

Regardless, Lemmy does feel very snappy compared to other websites I’ve had the displeasure of using. Main problem is low robustness in the RPC layer.

binwiederhier@discuss.ntfy.sh · 3 years ago

I maintain and host ntfy.sh, an open source push notification service. I have a constant 9-12k WebSocket and HTTP stream connections going, and I host it on a two-core machine with a load average of less than 1. So I can happily tell you that it’s not WebSockets. Hehe.

My money would be on the federation. Having to blast/copy every single comment to every single connected instance seems like a lot.

OrangeSlice@lemmy.ml · 3 years ago

They’re gonna move away from websockets within a couple of weeks, from what I hear

binwiederhier@discuss.ntfy.sh · 3 years ago

That’s a good move IMHO. Honestly I don’t want my UI to randomly shift down when new messages come in from syncing with another instance.

The right move would be to make a page that renders once and then only updates when you refresh the page. And then use web push for message notifications.

AnonymousLlama@kbin.social · 3 years ago

Whatever the performance bottlenecks are, I hope they can get them sorted so that any exponential growth caused by Reddit at the end of the month can be capitalized on.

I’ve got a feeling that other redditors who want to flee / branch out will look at these sites, and so long as they see posts and comments, a decent UI, and pages that load well, they will stick around and build communities here.

knoland@kbin.social · 3 years ago

The real benefit as I see it for using rust for backends is memory safety.

loren@sh.itjust.works · 3 years ago

All the major languages for web backends are memory safe. Java, C#, etc

C8H10N4O2@kbin.social · 3 years ago

These are garbage collected languages and come with the overhead of such a process. Rust has no GC process and instead relies on ownership and borrowing rules checked at compile time to track live memory, with reference counting (Rc/Arc) available as an opt-in.

eddythompson@kbin.social · 3 years ago

“GC overhead” only matters for extreme realtime applications, like emulators, games, drivers, simulators, etc. A 10 ms (or even a 100 ms) pause in request processing isn’t gonna even be noticed when your network, database and disk IO are literally orders of magnitude higher. Use Rust for web services if you like the language, are comfortable with it, etc. Don’t use it because you think it’ll give you “more performance” or “reduce GC overhead”.

Java, C#, Python, Node, or even PHP as languages will never be your web backend bottleneck. Large-scale web service performance tuning is entirely architectural: what caches you keep, how you organize your data, how many network operations one user interaction translates to, stateful vs stateless components, etc.

clawlor@programming.dev · 3 years ago

+1, exactly this.

As an aside, “stop the world” GC pauses can affect web server performance in interesting ways. Some web application servers have a perf profile where throughput drops off a cliff as the server approaches max memory load. This is fine, so long as you know what’s happening, and can tune your auto scaling to spin up new servers before you start to hit that threshold. This likely wouldn’t be a reason to not use a particular lang / server, except at the most massive scales.

dragontamer@lemmy.world · 3 years ago

Meta: Hmmm… replying to kbin.social users appears to be bugged from my instance (lemmy.world).

I’m replying to you instead. It doesn’t change the meaning of my post at least, but we’re definitely experiencing some bugs / growing pains with regards to Lemmy (and particularly lemmy.world).

GC overhead is mostly memory-based too, not CPU-based.

Because modern C++ (and Rust) is almost entirely based around refcount++ and refcount-- (and if refcount==0 then call destructor), the CPU-usage of such calls is surprisingly high in a multithreaded environment. That refcount++ and refcount-- needs to be synchronized between threads (atomics + memory barriers, or lock/unlock), which is slower than people expect.

Even then, C malloc/free isn’t really cheap either. It’s just that in C we can do tricks like struct Foo{ char endOfStructTrick[0]; } and allocate with malloc((sizeof(struct Foo)) + 255); or whatever the size of the end-of-struct string is, to collate mallocs / frees together and otherwise abuse memory layouts for faster code.

If you don’t use such tricks, I don’t think that C’s malloc/free is much faster than GC.

Furthermore, Fragmentation is worse in C’s malloc/free land (many GCs can compact and fix fragmentation issues). Once we take into account fragmentation issues, the memory advantage diminishes.

Still, C and C++ almost always seem to use less memory than Java and other GC languages. So the memory savings are substantial. But CPU-power savings? I don’t think that’s a major concern. Maybe it’s just that CPUs are so much faster today than before that it’s memory that we practically care about.

valpackett@lemmy.blahaj.zone · 3 years ago

That refcount++ and refcount-- needs to be synchronized between threads

Only for things that you specifically want shared between threads – namely this (synchronized refcount) is an std::sync::Arc. What you want to share really depends on the app; in database-backed web services it’s quite common to have pretty much zero state shared across threads. Multithreaded environment doesn’t imply sharing!

dragontamer@lemmy.world · 3 years ago

The refcount absolutely is shared state across threads.

If Thread#1 thinks the refcount is 5, but Thread#2 thinks the refcount is 0, you’ve got problems.

DiagnosedADHD@kbin.social · 3 years ago

I’m not convinced using rust is all that useful for web development. There are already plenty of other mature, well-optimized solutions for backend web development that include a lot of security and QoL features out of the box, like mature ORMs with optimized database access. Regardless of how fast your code is, database access can bottleneck your site much sooner if it isn’t done smartly.

Yeah, theoretically rust can be faster than ruby/python/node, but it’s harder to optimize it enough to get to that point, and you will have a much harder time finding devs to work on such a project because the number of backend devs that have enough rust experience is so small; they’re like on opposite sides of the Venn diagram of languages for web developers.

On top of all that, languages which are heavily used for web development often get low level optimizations baked into the frameworks/languages so you can get pretty amazing performance uplifts over time to bring it to the level of lower level languages.

24Vindustrialdildo@sh.itjust.works · 3 years ago

I think the devs openly stated they aren’t backend bods and asked for help optimising the database as a priority. There’s a bit of work going on on GitHub to sort that out, I think. Anyone reading this who can optimise PostgreSQL or contribute to a database-agnostic retool should probably speak to the devs, as I imagine you’d be welcome.

I wish I could help so much but I doubt they’re going to retool into .net haha.

Buttons@programming.dev · edit-2 3 years ago

Which is fine. If they wanted to learn Rust and wrote inefficient code, good for them. I appreciate their efforts. Rust can certainly be beaten into shape and perform well enough in the end.

21trillionsats@infosec.pub · 3 years ago

Rust itself or the way the Rust logic is implemented is not the bottleneck. As in most decent web applications, the bottleneck is the database and how the decentralized protocols themselves are reconciled there.

Scaling to massive numbers of records, as Lemmy has been forced to, is almost always IO-bound at the database level even when a web service is centralized; this is much more difficult in federated architectures. This is why “NoSQL” databases have increased in popularity, but they are also not a magic bullet, as there are major ACID trade-offs one needs to consider.