We had a single storage node in Vegas burp, causing the whole block storage cluster to hang. When it hangs, users' VMs will usually "pause" and then resume once things are moving again.
Vegas runs a fairly old version of things, with NY/LU/MIA being newer builds. Our current setup doesn't give us any sort of node-level redundancy, though, so if a node locks up/reboots/whatever, it's going to crash whatever VMs are feeding from it.
In the next few days we'll begin live trials on our Ceph cluster. Our own tests look pretty solid and will give us the option to offer Object Storage (S3) if we wanted to.
It'll take quite some time to migrate users into Ceph, but I'm fairly sure I can do the entire thing while users are still running and without a single disruption or byte lost.
Mine doesn't seem to have been down at all. I had an ssh session open last night and it is still connected. Storage is still mounted too. I'm used to having to remount it when anything happens.
@willie said:
Mine doesn't seem to have been down at all. I had an ssh session open last night and it is still connected. Storage is still mounted too. I'm used to having to remount it when anything happens.
That happens if a storage node you're attached to reboots; that actually kills active connections, so you'll go read-only.
I like the setup we have, it's pretty easy to maintain, but the lack of wider redundancy is annoying.
@Francisco said: I can do the entire thing while users are still running and without a single disruption or byte lost.
Just copying for the comp claim later.
Thankfully it just uses libvirt's live migrations. We literally rebuilt all of the LUX slabs... twice... last year due to XFS chewing its face off. Users were unaware it was happening, minus the lack of stock.
@Francisco said:
In the next few days we'll begin live trials on our Ceph cluster. Our own tests look pretty solid and will give us the option to offer Object Storage (S3) if we wanted to.
It'll take quite some time to migrate users into Ceph, but I'm fairly sure I can do the entire thing while users are still running and without a single disruption or byte lost.
That's exciting! I assume you have enough nodes that you can lose a few without going HEALTH_WARN; the stress of heavy scrubbing on a live cluster can quickly cause cascading issues. Some of us still remember ZXHost....
Shouldn't be a problem. I suspect ZX was flying by the seat of his pants and barely had enough capacity to cover what he was offering, never mind spare. There's a real chance he had min_size == 1, basically R0.
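For readers unfamiliar with why min_size == 1 is "basically R0": here's a rough toy sketch of the rule Ceph applies (illustrative only; the real logic lives in the OSD/monitor code, and the values below are hypothetical, not any provider's actual config):

```python
# Toy model of Ceph's replicated-pool size/min_size settings.

def pg_accepts_io(active_replicas: int, min_size: int) -> bool:
    """A placement group keeps serving I/O only while at least
    min_size replicas are up in the acting set."""
    return active_replicas >= min_size

# size=3, min_size=2: lose one OSD and writes continue, still with
# two copies on disk.
print(pg_accepts_io(active_replicas=2, min_size=2))  # True

# min_size=2 with only one replica left: I/O blocks until recovery,
# which is annoying but safe.
print(pg_accepts_io(active_replicas=1, min_size=2))  # False

# min_size=1: a single surviving copy still takes writes, so one more
# failure means data loss rather than just downtime -- the RAID 0 analogy.
print(pg_accepts_io(active_replicas=1, min_size=1))  # True
```

The RAID 0 comparison is about the last case: with min_size=1 the cluster happily accepts writes that exist on exactly one disk.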
For what it’s worth, I/ZX was using erasure coding, 6+2 if I remember correctly. Was in the process of adding a new node (storage) and hit a nasty bug that caused some extents in the OSD journal to be mis-set during the rebalance.
Was fine till an OSD needed to restart and play back the journal; worked with the Ceph devs to fix the issue at the time.
However, by that point enough OSD/PG shards were corrupt that pretty much every RBD was impacted, hence the toasted FS.
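For context on the "6+2" profile mentioned above, a quick back-of-the-envelope sketch of erasure-coding math (generic k+m arithmetic, not ZXHost's actual configuration):

```python
# Erasure-coded pool arithmetic for a k=6, m=2 profile: each object is
# split into k data chunks plus m coding chunks spread across OSDs.

def ec_profile(k: int, m: int):
    overhead = (k + m) / k   # raw bytes stored per logical byte
    max_failures = m         # chunk losses the pool can rebuild from
    return overhead, max_failures

overhead, max_failures = ec_profile(6, 2)
print(overhead)      # ~1.33x raw usage, vs 3.0x for triple replication
print(max_failures)  # 2 -- losing a third chunk of the same object is fatal
```

That cheap ~1.33x overhead is why EC is attractive for storage hosts, but it also means that once corruption touches more than m shards of a placement group, as described above, there is nothing left to rebuild from.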
Oh, hi there! Good to see you alive and standing... I still miss my storage boxes, though it all could have ended better, I guess ;-)
How's everything going? Any plans to come back to hosting? People will probably get their pitchforks out quickly, so be careful... all the best!
Hey!
A couple of months after, I actually had a few people reaching out to me asking if I was going to restart / offer something, as they had a need for XXTB and couldn’t find it anywhere else.
So for the past year or so I've run a small operation, selling purely by word of mouth.
Zero advertising or anything. After that I realised that with hundreds of clients paying $.$$, when something goes wrong it’s a huge headache, and I don't have much more hair to lose…
Feel free to drop me a message if you ever need any advice / anything; I don’t want to prop up this post any more.
But yeah, I don’t think I’ll be doing anything similar again anytime soon. Maybe launch something on the mid/higher end of $ and not aim for the bottom.
When did BuyVM add LA?
Oops, it's Las Vegas.
That is completely true, Las Vegas
It's up, likely just your node.
Was it ever possible to buy something from BuyVM? Every time I went to their website, it said: sold out!
Yeah, it is possible. You can get an email about the availability of their services; it's on their website, just put your email there.
Gotta get in a line and be quick about it to buy anything from them.
Sorry about that.
Ah, that explains why I couldn't connect over SSH, but my box still retained its uptime when I came back later.
Ohh, Fran's gonna get sued.
Hey you're here, Ash! I did enjoy ZX while it lasted, and I knew you tried your best to recover.
Hosting ain't worth it. Stay away.