As an SRE, what do I do about Alerts caused almost entirely by poor customer communication or misuse of a product?

th3raid0r@tucson.social · 5 months ago

I normally would, but my wife has the same problem and she’s done that 3 times in the last 6 months. In fact, her problem became MUCH worse because the “clean slate” was far more impressionable. She’d search up beauty routines, only to find that YouTube thinks she now wants to see “popping” videos, even though she’s now searching for dinner recipes.

So yeah, I saw her experience and decided “no thanks”.

To be fair, MOST of the YouTube I watched can be found on Nebula and Floatplane, both of which will likely not have this issue since neither is a user-content platform. Not to mention, the creators likely make more from those platforms anyway.

YouTube is basically unavoidable though, so now I just view everything through a piped instance if I absolutely need something that can only be found there.

th3raid0r@tucson.social · 5 months ago

I subscribed to YouTube Premium until just a few days ago. Even without the ads, there was something very seriously wrong with the suggestion algorithm.

I was getting cartel violence videos and dead animal videos. I’ve never watched one before in my life, yet YouTube seems to think that I should want to watch this crock of shit. This started coming up about 6 months ago. Until now I’ve been reporting each video as it comes up, but that doesn’t seem to help at all.

At this point I think YouTube is a danger to society - if it’s recommending cartel violence videos to me unsolicited, what are they suggesting to my nieces?

I have completely nuked it from my life. Almost all of the YouTubers I like are on Nebula or Floatplane so it doesn’t feel like I’m missing much.

th3raid0r@tucson.social · 7 months ago

Paywalled. Can anyone paste the text here?

th3raid0r@tucson.social · 7 months ago

I’ve tried it before. It’s fine, but it had issues running on Wayland last I tried. Did they fix the Wayland issues? Looking at the issue tracker, it seems like there are still a few open Wayland issues.

kitty, by contrast, has had Wayland support for about as long as I’ve used it.

th3raid0r@tucson.social · edit-2 8 months ago

He did this thing where he unified his shell history across thousands of hosts - it was super handy given our extensive use of Ansible playbooks and database management commands. He could then use a couple of hotkeys to query this history from within a newly opened document. Super handy for writing out shell command steps or wrapping things in a bash script you’re working on. Unfortunately I don’t really have a link to HOW to do this, I just remember thinking “Oh my god, that would save me SO much time”.

Nowadays, I just have this giant document with hundreds of our runbook commands and enable Github Copilot to make it SUPER easy to do the same thing without establishing an SSH session in the backend.

th3raid0r@tucson.social · 8 months ago

Eeeehhhh, I was kinda jealous of one of my coworker’s Doom Emacs setup. He had automated like 80% of his own job with it. Still haven’t bothered to try to learn it myself. One of these days…

th3raid0r@tucson.social · 8 months ago

No kidding. One of the YouTubers I followed was really shilling Zed editor. He didn’t seem to mention that it was Mac only.

Well, I guess it’s back to neovim on the kitty terminal for me.

Sometimes I swear Mac-based developers think the world revolves around them.

th3raid0r@tucson.social · 11 months ago

“Your application” - the customers’, you mean. Our DB definitely does its own rate limiting, and it emits rate limit warnings and errors as well. I didn’t say we advertised infinite IOPS; that would be silly. We are totally aware of the scaling factors there, and to date IOPS-based scaling is rarely a Sev1 because of it. (Oh no, p99 breached 8ms. Time to talk to Mr. Customer about scaling up soon.)

The problem is that the resulting cluster is so performant that you could load in 100x the amount of data and not notice until the disk fills up. And since these are NVMe drives on cloud infrastructure, they are $$$.

So usually what happens is that the customer fills up the disk arrays so fast that we can’t scale the volumes/cluster fast enough to avoid stop-writes let alone get feedback from the customer in time. And now that’s like the primary reason to get paged these days.

We generally catch gradual disk space increases from normal customer app usage. Those give us hours to respond and our alerts are well tuned. It’s the “Mr. Customer launched a new app and didn’t tell us, and now they’ve filled up the disks in 1 hour flat” case that I’m complaining about.
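
To make the gradual-versus-sudden distinction concrete, here is a minimal sketch that estimates time-to-full from recent disk usage samples with a least-squares slope and decides whether to page. It is not how the product described here alerts; the names `Sample`, `fill_rate`, `seconds_until_full`, and `classify`, and the 2-hour paging cutoff, are assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Sample:
    t_seconds: float   # when the sample was taken
    used_bytes: float  # disk space in use at that time


def fill_rate(samples: Sequence[Sample]) -> float:
    """Least-squares slope of used bytes over time, in bytes per second."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(s.t_seconds for s in samples) / n
    mean_u = sum(s.used_bytes for s in samples) / n
    num = sum((s.t_seconds - mean_t) * (s.used_bytes - mean_u) for s in samples)
    den = sum((s.t_seconds - mean_t) ** 2 for s in samples)
    if den == 0:
        raise ValueError("samples must span more than one timestamp")
    return num / den


def seconds_until_full(samples: Sequence[Sample], capacity_bytes: float) -> Optional[float]:
    """Projected seconds until the disk is full, or None if usage is not growing."""
    rate = fill_rate(samples)
    if rate <= 0:
        return None
    remaining = max(capacity_bytes - samples[-1].used_bytes, 0.0)
    return remaining / rate


def classify(samples: Sequence[Sample], capacity_bytes: float,
             page_below_seconds: float = 2 * 3600) -> str:
    """Return 'page' for a fast fill, 'ticket' for a gradual one, 'ok' otherwise."""
    eta = seconds_until_full(samples, capacity_bytes)
    if eta is None:
        return "ok"
    # Hypothetical cutoff: page only when the projection leaves under 2 hours.
    return "page" if eta < page_below_seconds else "ticket"
```

A projection like this only helps if it is evaluated often enough to see the spike early. With samples a minute apart, a cluster growing at 1 GB/s would show up as a fast fill within a couple of samples, while normal growth stays in the slower path.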

th3raid0r@tucson.social · 11 months ago

It is definitely an under-provisioning problem. But that under-provisioning problem is caused by the customers usually being very, very stingy about what they are willing to spend. Also, to be clear, it isn’t buckling. It is doing exactly the thing it was designed to do, which is to stop writes to the DB since there is no disk space left. And before that point, it’s constantly throwing warnings to the end user. These customers tend to ignore those warnings until they reach the stop-writes state.

In fact, we just had to give an RCA to the c-suite detailing why we had not scaled a customer when we should have, but we have a paper trail of them refusing the pricing and refusing to engage.

We get the same errors, and we usually reach out via email to each of these customers to help project where their data is going and scale appropriately. More frequently though, they are adding data at such a fast clip that them not responding for 2 hours would lead them directly into the stop writes status.

This has led us to guessing where our customers are going to end up. Oftentimes we are completely wrong and end up having to scale multiple times.

Workload spikes are the entire reason why our database technology exists. That’s the main thing we market ourselves as being able to handle (provided you gave the DB enough disk and the workload isn’t sustained for long enough to fill the disks).

There is definitely an automation problem. Unfortunately, this particular line of our managed services will not be able to be automated. We work with special customers with special requirements, usually Fortune 100 companies that have extensive change control processes, custom security implementations, and sometimes even no access to their environment unless they flip a switch.

To me it just seems to all go back to management/c-suite trying to sell a fantasy version of our product and setting us up for failure.

th3raid0r@tucson.social · 11 months ago

That is exactly what we do. The problem is that, as a managed service offering, it is on us to scale in response to these alerts.

I think people are misunderstanding my original post. When I say that a customer cluster will go into stop writes, that does not mean it is not functional. It is an entirely intended function of the database so that no important data is lost or overwritten.

The problem is more organizational. It’s that we have a 5 minute SLA to respond to these types of events and that they can happen at any random customer impulse.

I don’t have a problem with customers that can correctly project their load and let us know in advance. Those are my favorite customers. But they’re not most of our customers.

As for automation: as I exhaustively detailed in another response, we do have another product that does this a lot better, and it’s the one that we are mass marketing a lot more. The one where I’m feeling all the pain is actually our enterprise-level managed service offering, which goes to customers that have “special requirements”. That usually means they will never get automation as robust as the other product line’s.

th3raid0r@tucson.social · 11 months ago

Our database is actually pretty graceful. It just goes into stop writes status. You can still read any data and resolving the situation is as easy as scaling the cluster or removing old records. By no means is the database down or inoperable.
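
As a toy illustration of the behaviour described above (writes refused near capacity, reads still served, and recovery by scaling or deleting), here is a minimal in-memory sketch. It is not the actual database; `ToyStore`, `StopWritesError`, and the 90% threshold are all hypothetical.

```python
from typing import Dict


class StopWritesError(Exception):
    """Raised when a write would push usage past the stop-writes threshold."""


class ToyStore:
    """In-memory stand-in for a store with a stop-writes threshold."""

    def __init__(self, capacity_bytes: int, stop_writes_pct: int = 90) -> None:
        self.capacity_bytes = capacity_bytes
        self.stop_writes_pct = stop_writes_pct  # hypothetical threshold
        self._records: Dict[str, bytes] = {}

    @property
    def used_bytes(self) -> int:
        return sum(len(v) for v in self._records.values())

    @property
    def limit_bytes(self) -> int:
        return self.capacity_bytes * self.stop_writes_pct // 100

    def put(self, key: str, value: bytes) -> None:
        # Overwriting an existing record only costs the size difference.
        old_size = len(self._records.get(key, b""))
        projected = self.used_bytes - old_size + len(value)
        if projected > self.limit_bytes:
            raise StopWritesError("write refused: stop-writes threshold reached")
        self._records[key] = value

    def get(self, key: str) -> bytes:
        # Reads are never blocked by the threshold.
        return self._records[key]

    def delete(self, key: str) -> None:
        self._records.pop(key, None)

    def scale_to(self, capacity_bytes: int) -> None:
        self.capacity_bytes = capacity_bytes
```

In this sketch, either `scale_to` or `delete` is enough to make writes succeed again, which mirrors the two remedies mentioned above.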

Essentially our database is working as designed. If we rate limited it further then we have less of a product to sell. The main feature we sell of our database technology is its IOPS and resiliency.

Further, this is just for a specific customer, it has no impact to any other customers or any sort of central orchestration. Generally speaking the stop writes status only ever impacts a single customer and their associated applications.

Also, customers can be very stingy with the clusters they are willing to buy. We actually are on poor terms with a couple of our customers who just refuse to scale and just expect us to magic their cluster into accepting more data than it’s sized for.

th3raid0r@tucson.social · 11 months ago

Probably not feasible in our case. We sell our DB tech based on the sheer IOPS it’s capable of. It already alerts the user if the write-cache is full or the replication cache is backing up too.

The problem is, at full tilt, a 9 node cluster can take on over 1GB/s in new data. This is fine if the customer is writing over old records and doesn’t require any new space. It’s just that it’s more common that Mr. Customer added a new microservice and didn’t think through how much data it requires, thus causing a rapid increase in DB disk space or IOPS that the cluster wasn’t sized for.
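
To put rough numbers on that, here is a small back-of-the-envelope sketch. It assumes a deliberately simple model in which a fixed fraction of incoming bytes overwrites existing records and takes no new space. The function names and the free-space figures in the usage note are hypothetical; only the 1GB/s rate comes from the comment above.

```python
from typing import Optional

GB = 10**9  # decimal gigabyte, in bytes


def net_growth_bytes_per_s(ingest_bytes_per_s: float, overwrite_fraction: float) -> float:
    """Bytes of new disk space consumed per second.

    overwrite_fraction is the share of writes that replace existing
    records (assumed, in this sketch, to need no additional space).
    """
    if not 0.0 <= overwrite_fraction <= 1.0:
        raise ValueError("overwrite_fraction must be between 0 and 1")
    return ingest_bytes_per_s * (1.0 - overwrite_fraction)


def seconds_of_headroom(free_bytes: float, ingest_bytes_per_s: float,
                        overwrite_fraction: float) -> Optional[float]:
    """Seconds until free space runs out, or None if usage is not growing."""
    growth = net_growth_bytes_per_s(ingest_bytes_per_s, overwrite_fraction)
    if growth == 0:
        return None
    return free_bytes / growth
```

For example, with a hypothetical 3,600 GB of free space, `seconds_of_headroom(3600 * GB, 1 * GB, 0.0)` gives 3600 seconds, i.e. one hour of all-new data at 1GB/s, while an overwrite fraction of 1.0 returns `None` because nothing grows.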

We do have another product line in the works (we call it DBaaS) and that can autoscale because it’s based on clearly defined service levels and cluster specifications. I don’t think that product will have this problem.

It’s just that these super mega special (read: big, important, Fortune 100) companies have requirements that mean they need something more hand-crafted. Otherwise we’d have automated the toil by now.

th3raid0r@tucson.social · 11 months ago

As an SRE, what do I do about Alerts caused almost entirely by poor customer communication or misuse of a product?

th3raid0r@tucson.social · 1 year ago

As somebody with autism, I find this take lacking nuance. You see, for me these tools represent a huge leap in accessibility. I can turn a stream-of-consciousness wall of text into something that is digestible and still represents me.

I find myself constantly exhausted with the societal expectation that I review, edit, and adjust my own speech constantly. And these tools go a long way to helping me actually communicate.

I mean, after all nothing changes for me. People thought of me as a robot before. And I guess they can continue to think I’m still a robot. I’ve stopped giving a crap about neurotypical expectations.

th3raid0r@tucson.social · 1 year ago

I mean, I take a less extreme take, but I definitely resonate. As somebody with autism, it’s really nice to have an impartial chat assistant to turn my stream-of-consciousness wall of text into something far more digestible. Trying to do so myself often takes hours to construct a message a couple of paragraphs long, where I check and double-check and triple-check for anything that might offend somebody or come across strange or not flow well. Etc etc etc.

A lot of these articles don’t really investigate the accessibility aspect of these tools. And I really wish they did. I know if one of my friends used chatgpt to help with their messages, I would be completely fine with it.

th3raid0r@tucson.social · 1 year ago

Lived there for 7 years - I think I got it.

Step one, do not be in downtown, inner SE, inner NE, Gateway, or anywhere near a Max line or bus station after dark. Step two, carry mace and a stun gun. Step three, leave Portland for good and only return if I must << We are here.

We got a lot of hate from certain left leaning folks in Portland for leaving “because of the homeless”. It’s like, “No, dude, I’m leaving because my wife was assaulted by homeless no less than 3 times (twice physically, once was almost a rape), and that’s even when she was ‘safely’ on TriMet. You can ‘but not ALL homeless’ all you want. My wife is traumatized and we want nothing to do with this shithole of a city.”

Yeah, after the 3rd one we left, and we can say with certainty that we’ll never ever come back to live in PDX.

th3raid0r@tucson.social · 1 year ago

Dang it!

I have a business trip to Portland next week and I was looking forward to cooler temperatures than here in Tucson, but alas, it’ll only be 5-10 degrees cooler.

Gee thanks climate change!

th3raid0r@tucson.social · edit-2 1 year ago

In lemmy’s case, my perusal of the DB didn’t really suggest that the queries would be that complex and I suspect that moving it to a higher performance NoSQL DB might be possible, but I’d have to take a look at a few more queries to be sure.

I wonder if this could be made to work with Aerospike Community Edition…

Obviously it could be more effort than it’s worth though.
