The design of the proposed networks is not really optimal for providing low latency. You would do much better by dropping the firewalls and putting the front-end game servers directly on the net with the correct security setup (only running the services needed to answer for the game). On the game servers, have a back-end network for storage and the DB (two networks on a shared NIC with VLANs, or even two different NICs). It really depends on the load and throughput requirements. This is an answer given the design they proposed; L2 and bare metal servers are not the right answer when you are designing from scratch in 2017.
Ideally you do an L3 Clos pod (leaf/spine) behind an SLB that directs traffic to containers on the servers running the game system. Give every container a /32 (BGP fabric on the Clos pulled down to the server, i.e. running routing on the server) and run the internal DB/app/storage on a different set of /32s. Toward the external router, do not announce the internal DB/app/storage networks, and then the world cannot get to them. Redundancy is a feature of the Clos plus BGP plus containers. Scaling is simple if your application is built to scale with the addition of service containers.
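As a rough sketch of what "routing on the server" could look like, here is a hypothetical FRR-style BGP config on a container host: it peers with the leaf and announces only the game containers' /32s, while the internal DB/app/storage /32s are filtered out. All ASNs, addresses, and prefix names are invented for illustration.

```
router bgp 65101
 neighbor 10.0.0.1 remote-as 65001            ! uplink to the leaf switch
 address-family ipv4 unicast
  network 172.16.1.10/32                      ! game container, announced to the fabric
  network 172.16.1.11/32                      ! game container, announced to the fabric
  neighbor 10.0.0.1 prefix-list GAME-ONLY out ! internal /32s never leave the host
 exit-address-family
!
ip prefix-list GAME-ONLY seq 10 permit 172.16.1.0/24 ge 32
```

When a container moves to another host, that host's BGP daemon announces the /32 and the fabric reconverges; nothing external ever learns the internal service prefixes.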
Also, 10G is cheap and the latency is much lower than 1G. Heck, I know people building the second design I described above running 50G to the servers and 6x100G from the leaf to the spines. The cost of the switches (Broadcom Tomahawk ASICs in kit from Arista, Cisco, Juniper, or whitebox) is really low.
It is made clear a few times in the article that this is a "slower" setup.
> With such slower-but-more-secure deployments, our uplink Internet port goes to our firewall; nothing else is connected directly to the Internet – so the firewall can control all the traffic.
Nothing, if you have a need for a single big app that can consume the full system. It makes redundancy more complex, since you cannot move the app if there is a failure; it requires clustering and an SLB-type design.
What you are trying to do is the key here. Let's say you have some web servers and you also have some applications (games). The load on the games depends on the number of users and how popular the game is. If you have a few racks of servers that are all 100% the same, you can bring up containers on demand: web containers, plus game containers started based on load. Bare metal is really for cases where you have a reason to consume all the hardware; say, for a DB where you want to keep the whole working set in memory, use bare metal in a cluster with a big box that has lots of RAM. Like I said, it is all really use-case dependent. The trend today in systems at this scale is load on demand via VM or container.
Also, if you use containers with a routing protocol on the server, they can move, and the network updates as they move.
Firewalling your game servers is just not a good idea if latency matters at all, and it shouldn't really be presented as an option. Separate VLANs, subnets, or some other method of segregating traffic is the correct answer here. I understand there is a disclaimer, but it's just never the right choice.
Out of band management: If you're in a production setting and manually fixing mistakes after breaking your IP configuration, you're doing it wrong. Why are you not using configuration management tools? Why are you manually configuring your servers? If you're using the cloud, why wouldn't you just toss away a broken instance?
Fault tolerance: Vague statements about how building in redundancy is somehow a single point of failure itself. It mentions that further details are available in some other chapter, but that chapter appears not to be on the site, or isn't written yet. This seems nonsensical to me, as there are many proven high-availability configurations.
RAID: Software RAID is fine for a game server vs. hardware RAID? There are so many caveats to this... If your RAID is software-based, your CPU is handling the calculations the RAID requires. Do you have the spare cores for this? Will it negatively impact server performance? Will it perform as well as it would with a dedicated controller? How often are you loading things from disk?
Blade servers: Unsubstantiated anecdotal evidence about how blade servers are less reliable. If they're really less reliable, there should be data on this. This appears to be the beta for a book that will be sold. If this is a product, back up your assertions.
"Mission critical servers": Recommendations to put your mission-critical servers in a 4U because this will somehow make them more reliable than a 1U or 2U, despite many failure modes having absolutely nothing to do with how much rack space the server occupies.
SANs: Apparently these are not acceptable for database servers. I believe many of us are in for rude awakenings.
RAID with battery backed write cache vs. NVMe: Apparently it does not matter what type of storage you are using underneath the RAID as long as you have a write cache. No mention of what happens if you are pushing more data to the RAID controller than can be cached before being flushed to the disks backing it.
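To make that last point concrete, here is a back-of-the-envelope sketch with invented numbers (cache size and throughput figures are assumptions, not from the article) showing how quickly a battery-backed write cache saturates once writes arrive faster than the backing disks can flush:

```python
# Hypothetical figures: a BBWC absorbs bursts, but only until it fills.
cache_gb = 2.0      # write cache on the controller (assumed)
ingest_gbps = 1.5   # sustained writes from the host, GB/s (assumed)
flush_gbps = 0.5    # what the backing disks can actually absorb, GB/s (assumed)

# The cache fills at the difference between ingest and flush rates.
seconds_to_full = cache_gb / (ingest_gbps - flush_gbps)
print(f"cache saturates in {seconds_to_full:.1f} s")
```

After that point, writes are throttled to raw disk speed, which is exactly the scenario the chapter never mentions.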
Cloud storage: "Just now" cloud providers are providing virtual machines with local disk options. Last I checked, many of the big name cloud providers launched their virtual machine offerings with local disk options.
Vendors: Just avoid SuperMicro? No explanation here beyond they're "not mature enough". I suppose two and a half decades isn't mature.
Browsing some of the other sections, like the 'Cloud' chapter...
"Hardware Replacement": Mentions that cloud providers will automatically instantly relaunch a busted instance for you, and that you'll lose your hard disk data. Many providers don't automatically do anything of this sort, and if you are using network storage, you certainly aren't losing your storage because you moved your instance to new hardware.
Network throughput: Somehow a 100mbps port is going to push out 13 petabytes of data in one month. I'm not entirely sure how.
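The arithmetic makes the gap obvious; a fully saturated 100 Mbps port moves on the order of tens of terabytes per month, about three orders of magnitude short of 13 PB:

```python
# Maximum data a 100 Mbps port can push in 30 days at full saturation.
SECONDS_PER_MONTH = 30 * 24 * 3600           # 2,592,000 s
bits = 100_000_000 * SECONDS_PER_MONTH       # port speed (bits/s) * time
terabytes = bits / 8 / 1e12                  # bits -> bytes -> terabytes
print(f"{terabytes:.1f} TB")                 # ~32.4 TB, nowhere near 13 PB
```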
VM migration: Apparently all clouds live migrate VMs, and these live migrations with small latency spikes are worse than complete failure.
Large boxes: Those 4U boxes in the other chapter supposedly do not have equivalents available in the cloud.
No mentions at all of the tradeoffs in performance on virtualization vs. bare metal, mitigating factors, etc.
I honestly feel like this was written by someone who is stuck architecting things like it's 2002.
+1 to everything you said. It is a strange mix of random statements not backed up by facts, and design elements from 12 years ago somehow being shoehorned onto the cloud.