
Hardware upgrades

Discussion in 'News' started by Kyle, Jan 19, 2014

  • Apr 9, 2007
    Hi guys.

    So you may have been wondering what the esoteric titles were about... I was whining to @Brian on Friday night about how consistently bad our FastDL (hereafter referred to as SlowDL) server is. There are times when it just really chugs *insert crude word here*. We previously talked about getting an SSD, but couldn't justify the downtime, reorganization, and system update (Debian doesn't really support modern hardware, SSDs included).

    So, frustrated, I looked at the LSN server page yet again, and saw they had increased the base bandwidth packages from 6TB to 10TB. I also noticed they had discounted a ton of base packages, RAM, and their gbit ports. I priced out the same web server with much improved specs; it came out about $10 more expensive for a Haswell CPU (the same price if we had gone Ivy Bridge), with memory and disks tripled (and bandwidth doubled). Well, shit; it's the same price, and this was the super discounted server. I priced out new game servers with improved disk capacity (because, you know, Mappers :neutral: ), and that was an actual win in price. So we can upgrade our hardware and save money.

    We went to bed, baffled, planning to pick it back up in the morning. We also made a ticket to request to have our networks transferred, which is almost always the nightmare. Presumably the $20 charge per network now (it was previously free; don't get me started on horrible tactics...) is there to help prevent confusion, since there's an actual transaction behind it. But I digress, onwards!

    Brian rolled out of bed at about 11AM my time (10AM his; we had finished at around 3AM). I talked to live support to figure out what was going on with the E5s, since their sale page has a Special Offer if you use the Live Chat feature.
    ss (2014-01-19 at 070111)

    Success! The deal is 50% off all upgrades! So we start configuring our E3-1270v3 with just a ridiculous amount of upgrades. We mull it over, still under what we were previously paying, then decide to get a disk controller (bringing the total cost for three servers to $2 over what we were paying before). With the 50% discount, it's $10. The primary purpose of the controller is just the caching capabilities; we have no intention of using RAID. We order the server, and I forward the invoice number on to the rep I was talking to. Puzzled, he asks where the E5 server is. I double-check the invoice number, and it all lines up... Then I look at the name beside it: we bought an E3...

    Stuck in a futz, we have all this planned hardware for a server we can't afford without the discount; we have to start over. It's 1:30PM, and of course it's a Saturday, so Sales closes early. I ask if there's any other sort of discount available, and he links me to the New Year's sale page.
    ss (2014-01-19 at 071944)

    Damn, that's 20% off the total price, not just upgrades. We also have a 5% reseller discount, which brings us up to 25% off the total order (I had to inquire, since the coupon didn't seem to work with it; we got it though). The 50% off deal was the better offer, but 25% off is still significant, and we're still saving money if we do upgrades. At 2:30 we order the new game server. Now onto Web, which was the harder problem to solve.

    Web has a single HDD, which locks up quite frequently, as people in Mumble (potentially) know. We're pretty sure it's our backups destroying the drive when they hit (I like to imagine grinding when it happens; Brian does not). We had a bunch of clunker SSDs planned out previously (64G Kingstons); they're consumer drives, and some of the reviews were hilarious (imo), so we changed our minds.

    Apparently the norm (now) in the States is that any new order requires an identity check (we've only done replacements / part upgrades in the past couple of years). Sales was closing in an hour, and we were still trying to figure out what was going on with Web. We didn't know about the check immediately, and lost about 10 minutes just mulling over Web before the account holder (Brian) snapped a picture to upload.

    Eventually, we settled on a single Kingston for the OS, one better-reviewed SSD for the DB and FastDL, then a single 500G drive for backups (saving the Kingston from itself, if it comes to that). Plus, of course, more RAM, bandwidth, etc., and the new drive controller.

    The RAID controller was esoteric, and while ordering I was warned many, many times that it wouldn't work outside of an array (i.e. our configuration wouldn't work). The installation instructions we gave were "don't configure RAID", despite us ordering RAID 1. So the tech added the card, plugged it in, then presumably called it a day. I configured two JBODs, then installed Gentoo. The install was pretty straightforward; nothing crazy happened. I tried GRUB2, as I was feeling adventurous and it's what's recommended now. The install finished and I rebooted the server. Nothing; the server would drop into some dumb UEFI shell after trying to PXE boot. The nightmare begins.
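
    For the curious, the GRUB2 side of that install is only a couple of commands. A rough sketch (package and device names here are placeholders from memory, not the exact run), roughly what the handbook walks you through:

    Code:
        # install the bootloader, then write GRUB2 to the boot disk
        # (/dev/sda is a placeholder for whichever JBOD the OS landed on)
        emerge sys-boot/grub
        grub-install /dev/sda
        # generate grub.cfg from the kernels sitting in /boot
        grub-mkconfig -o /boot/grub/grub.cfg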

    The screen before the actual fire happened was as follows.
    ss (2014-01-18 at 072129)

    The drives show up as independent JBODs, but they don't show up in the BIOS. What the hell? So I play with some BIOS settings, thinking it's some dumb UEFI thing (apparently you can choose to UEFI boot off of PCI devices; presumably those SSDs...). Next thing I know, the controller gets rather upset and corrupts itself.
    ss (2014-01-18 at 073752)

    So then come the tickets. I made a ticket explaining what I did and what happened, and then they wanted to charge me $5 (LimeAid) just to look at it. The controller never worked to begin with, on a brand new server, and they want to charge me cash?! I mention the provisioning ticket; no response after 10 minutes, so I close the ticket and fall back to the original ticket discussing the hardware installation. The same guy comes into the ticket 40 minutes later, reiterating that we need to pay for them to look at the problem. His shift ends, someone else comes on, and it's relatively smooth sailing. The card is replaced, same BIOS settings, no corruption (It Wasn't Me™). Then the SuperMicro board just continues to crash and never boots, so we get moved up from some (presumably ghetto) controller series to one three series up, with 3x the CPU speed (dual core) and twice the cache. It's stable; time to rejoice. The only problem is, I can't boot from my drives...

    There's an option at the bottom of the DOS-style window to convert the JBODs to an array. At this point, it's 4 hours later and I have nothing to lose (the first guy burned a ton of time). I go for it, choosing not to re-initialize the drives so the data is preserved. Suddenly, when rebooting, there's a single virtual drive from the controller exposed to the BIOS. Hurray! I set the boot device to that, hit save, then sail onwards... Or so I thought.

    It's now 1AM, and when we boot from the RAID controller's virtual drive, nothing happens. Clearly it's the sneaky bastard that changed on me, GRUB2. I redo the Gentoo install (another hour), go with GRUB1 this time, recompile everything and install it to the drive, then reboot. The SM board hangs once, which is fine I guess; all of our other SuperMicro boards (from LSN) fail to boot about 20% of the time, so it's not controller related. The virtual drive boots, and nothing, just that damn blinking cursor... We've gone through three controllers: the first legitimately failed, the second was unstable (presumably corrupt as well), and the third resulted in success but just won't boot.

    When we boot from a DVD, the controller is found and the drives are just auto-magically loaded and exposed as pure, labelled drives. To the BIOS, though, they just don't exist. We go back and forth (with the guy we'd been talking to all night, since the actual shysters had left); he keeps pointing to the install as the problem. I'd run through it twice at that point, and it's very straightforward. If it were me messing up, GRUB should at least say something, or something else should throw an error like "Missing OS". That wasn't happening; just the same blinking cursor that appeared before the controller BIOS booted. At this point, he'd re-racked the server at least 7 times trying things. I finally say let's just put the OS drive on its own: if it boots, the problem is the controller; if it doesn't, it's my fault. We're not using striping or anything equally crazy.

    The server is re-racked with the OS drive on its own. The ticket is updated with no joy, and I die a little inside. However, I notice that god damn fast blinking cursor in the top left corner: it's clearly still booting from the controller! I look at the BIOS, and indeed, it's still set to boot from the controller. I look for the Intel drive, and it's not listed anywhere. From how I wrote it, it was "Unplug the Intel drive", not "Unplug the Intel drive and plug it into the main HBA". Once that's sorted (taken off the rack, and put back), we set the Intel drive to boot. Fingers crossed.

    A quick GRUB screen pops up for about half a second, then the OS boots. About 3 seconds later, we're at the login screen. The controller doesn't support booting from standalone volumes, or something equally absurd. I try to SSH in, no joy. I log in using the KVMoIP, check my trusty ifconfig, and the interfaces are missing. I didn't include the igb kernel module in the second go-around, doh. What I found hilarious was that he was still watching (I was messing around in the BIOS earlier too) and said, essentially, "You might need this", then dumped the networking information.
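
    For reference, the fix for the missing interfaces is just making sure the Intel NIC driver actually gets loaded; something along these lines from the KVMoIP console (a sketch, not the exact commands I ran):

    Code:
        # nothing but lo in ifconfig? check whether the igb driver is even loaded
        ifconfig -a
        lsmod | grep igb
        # load it for the running system...
        modprobe igb
        # ...and make it stick: add igb to /etc/conf.d/modules (OpenRC),
        # or just build it into the kernel (CONFIG_IGB=y) like I should have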

    The system eventually comes online, but then we're stuck without the OS on the RAID controller, which loses half the benefit right there (it's a two-disk system). Brian's gone to sleep since it's 2AM; I truck onwards alone with the most helpful tech. I create a btrfs partition on the server drive and start cloning from our original server over the internal network. It took me a while to get the settings right (I started with scp, then realized it wasn't copying symlinks properly (it was following them), so I ended up with rsync). At about 4AM I ask for the Kingston to be removed from the RAID controller so Brian doesn't hit the problem I did, since I haven't solved it. I also notice the new battery backup unit doesn't appear to be working (I thought it was the older card because of this; it was also catalogued wrong internally). The Kingston is removed; the HDD and SSD on the controller are ready to go. I was really tired and had stopped making sense, so I went to bed at about 7:20AM.
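
    The rsync line ended up being something close to this (host and paths are placeholders); the important part is that archive mode copies symlinks as symlinks instead of following them the way scp was:

    Code:
        # clone the web data over the internal network; -a preserves symlinks,
        # permissions and times, -H keeps hard links, --numeric-ids avoids UID remapping
        rsync -aH --numeric-ids --progress root@oldweb:/var/www/ /var/www/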


    I woke up at about noon from the wind picking up and destroying my blinds in the downpour. My phone also went off, so I was up and at it again. The sync went through, but I messed up: I didn't put btrfs on a partition, I put it on the whole drive. Since I built btrfs as a module and not built-in (if something built-in has a problem and causes a panic, you're hooped), there was a race condition between the RAID controller module loading and the btrfs module loading and recognizing the drive as btrfs (apparently if it's detected as corrupt, the device node doesn't show up at all... lol?).
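
    For anyone following along, the two fixes are straightforward: put btrfs on a partition rather than the raw device, and build the filesystem (and the controller's driver) into the kernel so there's no module-ordering race at boot. A sketch, with placeholder device names:

    Code:
        # partition the drive, then format the partition instead of the whole device
        parted /dev/sdb mklabel gpt
        parted /dev/sdb mkpart primary 1MiB 100%
        mkfs.btrfs /dev/sdb1
        # kernel config: build these in rather than as modules
        #   CONFIG_BTRFS_FS=y
        #   (plus whatever driver the RAID controller uses)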

    At about 2PM I make the plaguefest user and sync the assets to their account. Then I log in to find this monstrosity.

    Reference: This is what a shell looks like.
    ss (2014-01-19 at 084524)

    Then we have all this confusing garbage. I don't know; I find the carnival-coloured theme horrible. Especially the newlines with full paths, my word that's bad. Blame list: Brian.

    Anyway, all that was left to do was test the server using ZombieMod. I changed the IP and launched it as-is, then @Tony the Tiger!! :grin: joins, accusing me of crashing ZM. :frown: Oh woe is me. The test seemed to go fine; I don't remember our previous performance numbers, but I remember the CPU maxing out at one point (maybe that was just on Sandy Bridge). We never hit above 70%, so presumably it was within an acceptable margin. I reconfigured the network, ensuring we're not going to hit the reverse path filter bug (why that's still a thing in 2014 I will never know).
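
    For the curious, that's the rp_filter sysctl; with multiple addresses and asymmetric routes it will silently drop replies. Roughly what "ensuring we don't hit it" means, as a sketch (not our exact sysctl file):

    Code:
        # 0 = off, 1 = strict (the one that bites you), 2 = loose
        sysctl net.ipv4.conf.all.rp_filter
        # loosen it so multi-homed / asymmetric traffic isn't dropped
        sysctl -w net.ipv4.conf.all.rp_filter=2
        # persist it across reboots
        echo "net.ipv4.conf.all.rp_filter = 2" >> /etc/sysctl.conf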

    After deliberating with Brian, we decided to remove the controller from the game servers, since each has just the single drive; that drive being an SSD, the benefit would probably be marginal at best. It wasn't worth the $19 x 2 to keep the controller on two servers with pretty fast SSDs (albeit consumer ones; the enterprise Intels are on the OS, and they're slightly slower). We're keeping it on the HDD and SSD on Web, but Brian may be right about its usefulness there too. The controller was removed, but then my drive disappeared. I realized it had happened again, but this one was a little more "oh, come on". I wrote "Can you please remove the RAID controller from our package"; the drive was disconnected, the controller was removed, then the server was re-racked. The Samsung SSD wasn't plugged back in, which was another 20-minute thing on top of the 40 minutes before the card was removed.

    Card removed, we try to reorder the first server (the game server), but notice the button is missing (Brian actually noticed it the night before, since we wanted to order then to grab the New Year's discount before it expired; the controller woes are mostly what prevented us, and we saved $19 by waiting).

    I launched all of the (zombie) SRCDS servers, and they bound to their non-transferred addresses... They're listening, but no one can connect because there's no routing (or gateway). Then we made the IP transfer request. We needed a network change done at the time, which still hasn't been done as of this writing (four hours later; nice ticket response time...). After an hour of "Yeah, this will be handled soon; sure", I explicitly asked if he was capable of transferring networks. Apparently they're called IP blocks in the States, or something equally bizarre, but I digress.
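
    In other words, the daemons happily bind to any address that exists locally, but until the block is actually routed to us (and has a gateway) nothing can reach them. Checking that is along these lines (addresses and interface are placeholders):

    Code:
        # confirm the addresses are assigned and see what routes/gateways exist
        ip addr show
        ip route show
        # once LSN actually routes the block to us, adding the addresses is the easy part
        ip addr add 192.0.2.10/24 dev eth0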

    The esoteric messages were to hint that something might happen without giving anything away. We missed Christmas; hopefully here it is over the next couple of weeks. I don't expect Brian to write as much as I did, but who knows :wink: It was just myself and Brian who knew about this. Source:
    ss (2014-01-19 at 092918)
    Saying I killed beavers, preposterous.

    RE: Limestone Networks: something internal to the company has changed. They used to be really focused on the customer. I mean, there are always assholes in a company who would explicitly try to harm you, and people like that do indeed exist at Limestone Networks; it just feels like there are more of them now. However, the technician I talked to last night really made the whole thing not as bad as it could have been, and for that I applaud him. The positive experience isn't isolated either; the guys on the night shift are really the best. The daytime guys try to ding you for everything and manage to destroy data doing routine things; I find it horrible. There's no support unless you pay them; but they're a corporation in the States, trying to make money. Oh well.
    • Like x 8
    • Funny x 2
    • Winner x 2
    • Wizard! x 2
    • Informative x 2
    • Mapping King x 1
    • Useful x 1
      Kyle, Jan 19, 2014 Last edited by Kyle, Jan 19, 2014
    • Apr 9, 2007
      I've added a little more information than was there previously.

      • Downtime was about 5 seconds for everyone.
        • It looked like the server crashed; the other set was running, ready to go.
      • I bought Harvey's for Lunch on buy day.
        • Two Angus burgers with Large Onion Rings and Drink ($9!)
      • The guy who provisioned the server shorted us on an SSD.
        • Really cheap one, same capacity; not sure if malicious...
      • Firewall Rules were applied at 10:45PM, the ticket was made at 6:12PM
        • I kept hounding; otherwise it would have been next week :neutral: (past experiences).
      • If your LimeAid ticket takes longer than an hour, it's $65.
      • RockWare, their Ticket System, went down for about 45 minutes when we were adding the RAID card to the Web server.
      ss (2014-01-20 at 120500)
      ss (2014-01-20 at 014431)
      • Informative Informative x 4
      • Funny Funny x 1
        Kyle, Jan 19, 2014 Last edited by Kyle, Jan 20, 2014
      • Jan 21, 2011
        I forgot you guys gave the servers ghetto names. LOL
        • Agree Agree x 1
        • Apr 9, 2007
          Brian hasn't asked for a login yet :wink:

          Also; Racist :frown:
          • Like Like x 1
          • Funny Funny x 1
          • Feb 27, 2012
            I was in my friends tab in the server browser, saw Zombie Mod, joined from there, and noticed the server was dead. I figured you broke something since you were the only one there :frown:
            • Agree Agree x 1
            • Jan 21, 2011
              • Agree Agree x 1
                Seeker, Jan 19, 2014 Last edited by Seeker, Jan 19, 2014
              • Feb 3, 2012
                Even the Haplo Fetus crashes the server. :doh:
              • Apr 9, 2012
                So, to sum everything up: there are some guys at your server network company that intentionally wanted to harass you and basically break the servers into something worse than before? I guess there will always be some people in the world who are so bored that they need to harass others in order to make themselves feel important :pain:

                I hope it gets sorted out.
              • Apr 3, 2013
                It had to be done
                • Funny Funny x 2
                • Apr 9, 2007
                  I wrote it for me, clearly not for you.
                  • Zing! Zing! x 2
                  • Winner Winner x 1
                  • Sep 25, 2010
                    So I take it that yesterday, while I was getting some help from you, you were preoccupied with killing beavers? For shame!
                    • Like Like x 1
                    • Funny Funny x 1
                    • Apr 1, 2012
                      For a guy who sucks at IT, I'm scared reading Kyle's epic saga
                    • Aug 7, 2012
                      Always neat to read through the aftermath :3
                    • Dec 11, 2013
                      I agree with it, but I also forced myself to read it (because it's important!). After reading it, my mind was blacked out for 5 min lolzzzzz