Built to Scale

Team Foundation Server 2010 introduces an evolved architecture built to scale to the most challenging teams and scenarios. Here is a series of best practices you can apply to help your deployment reach its full potential.

Contents

Introduction

The Evolved Architecture

Team Project Collections

Configuration Database

Background Tasks Job Agent

Evolved Architecture Visualized

Best Practices

#1 Start Simple

#2 Optimize I/O bandwidth

#3 Use a Load Balancer or Application Delivery device

#4 Leverage your Team Foundation Proxies

#5 Utilize the new Scale-Out Option

#6 Stay within Team Foundation Server limits

#7 Job Agent Grids for Large Scale deployments

Recommended Hardware for TFS Deployment

Conclusion

Introduction

In software engineering, scalability is a desired property of a system, indicating its ability either to handle growing amounts of work gracefully or to be readily enlarged. A system whose performance improves after adding hardware, proportionally to the capacity added, is said to be a scalable system.

It is important to reiterate that capacity and performance are the main variables of a scalable system. The best practices outlined in this whitepaper will walk you through scenarios where capacity is increased and, as an administrator, you are looking to add resources to the system in order to either improve or maintain the performance levels of the application.

The methods for adding more resources to a system fall into two categories: scale up (vertically) and scale out (horizontally). When you "scale up", you add resources to a single node in the system, typically more CPU, memory, or disk space. The "scale out" model takes a different approach: when adding resources, you add a new node to the system in order to distribute load and achieve greater capacity. An example would be adding a new computer as an Application Tier to distribute user request load. As computer prices drop and performance continues to increase, low-cost "commodity" systems can easily be leveraged in a grid or cluster to achieve large amounts of computing power and performance.
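The difference between the two models can be illustrated with a toy capacity calculation. This is only a sketch under a simplifying assumption of near-linear scaling; the numbers and the 0.9 efficiency factor are illustrative, not measured TFS figures.

```python
# Illustrative sketch (not a TFS API): contrasting scale-up and
# scale-out capacity models under an assumed near-linear scaling.

def scale_up(base_capacity, hardware_multiplier):
    """Scale up: one node, more powerful hardware."""
    return base_capacity * hardware_multiplier

def scale_out(base_capacity, node_count, efficiency=0.9):
    """Scale out: add commodity nodes; each extra node contributes
    slightly less than full capacity due to coordination overhead
    (the 0.9 factor is an illustrative assumption)."""
    return base_capacity * (1 + (node_count - 1) * efficiency)

# A node handling 500 requests/sec: doubled hardware vs. two nodes.
print(scale_up(500, 2.0))   # 1000.0
print(scale_out(500, 2))    # 950.0
```

The point of the sketch is not the exact numbers but the trade-off: scale-out buys capacity with cheaper hardware at the cost of some coordination overhead.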

Team Foundation Server (TFS) 2010, and the best practices outlined in this whitepaper, utilize both models in order to achieve the deployment's full potential.

The Evolved Architecture

Before we can start discussing scalability best practices it is important to outline the elements that compose the evolved architecture of Team Foundation Server 2010. This section sets the groundwork by defining several of the new concepts introduced to support a true scale-out multi-tenant system.

Team Project Collections

The first important concept to understand is what we call team project collections (TPCs). A team project collection is nothing more than a group of tightly related team projects. When thinking about them, it helps to correlate them with products, codebases, or application suites. For example, if your company makes four unique products that have almost no code sharing between them, it would be practical to create four team project collections. If, on the other hand, your company has several products that compose a solution or product suite with high code reuse and framework sharing, then you would have only one team project collection.

Each collection is represented in the back end as a single SQL database, allowing us to provide complete encapsulation, greater mobility, and simpler administration. Thanks to this encapsulation, team project collections are the key pillar of multi-tenancy, and hence enablers of server consolidation: they allow multiple groups within an organization to share the same deployment and infrastructure.

You can read more about the Team Project Collection concept by following these two links: MSDN and Team Foundation Server 2010 Key Concepts

Configuration Database

The introduction of team project collections has brought with it changes to the organization of the TFS databases. The most important change is the creation of a Configuration Database. This "root" database contains a centralized representation of the configuration data, including the list of all team project collections, identities, resources, and global application settings. Customers should treat this database as the core of the Team Foundation Server farm and therefore configure it for high availability.

Background Tasks Job Agent

The TFS Background Job Agent is an executable installed on the Application Tier responsible for processing all of the background tasks generated by the server components. Examples of these tasks are: pumping data into the warehouse via the adapters, processing asynchronous long-running operations like syncing identities from Active Directory, and maintenance tasks like installing an update to TPCs. The agent contacts the configuration database, asks for tasks to execute from a queue, and starts processing those requests by leveraging its plug-ins.
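The loop just described can be sketched as follows. Class, task, and handler names are illustrative stand-ins, not the actual TFS job agent implementation.

```python
# Conceptual sketch of the job agent loop: poll a queue of pending
# tasks and dispatch each one to a registered plug-in.
from collections import deque

class JobAgent:
    def __init__(self):
        self.plugins = {}       # task type -> handler (the "plug-ins")
        self.queue = deque()    # stands in for the configuration DB queue

    def register(self, task_type, handler):
        self.plugins[task_type] = handler

    def run_pending(self):
        """Drain the queue, dispatching each task to its plug-in."""
        results = []
        while self.queue:
            task_type, payload = self.queue.popleft()
            results.append(self.plugins[task_type](payload))
        return results

agent = JobAgent()
agent.register("sync-identities", lambda p: f"synced {p}")
agent.queue.append(("sync-identities", "CONTOSO\\devs"))
print(agent.run_pending())  # ['synced CONTOSO\\devs']
```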

Evolved Architecture Visualized

It is often said that a picture is worth a thousand words, so we have included an image outlining the evolved architecture and its key components. The image shows a scale-out configuration known as a TFS Farm.

Best Practices

#1 Start Simple

When planning your deployment to scale it is necessary to start simple, even if you feel that business priorities and adoption plans are constantly changing. Don't complicate your initial deployment with large-scale plans; rather, build the minimum configuration that meets your usage load and expand from there, one step at a time.

Starting simple means that you should focus only on three questions:

  • How many users am I initially looking to support with this deployment within a one-year timeframe?
  • Am I deploying SharePoint and Analysis Services on box or utilizing already existing instances?
  • What are the machine specs that match my requirements?

During this exercise your most important decision is to accurately assess the number of users who will be using the server. The second question is there to remind you that Team Foundation Server has integration points with other server products, and sharing resources with those other servers can significantly impact overall performance. After determining the user load, reference the section Recommended Hardware for TFS Deployment located in this document (you can also reference the one located on MSDN). With this data you now have your initial hardware specs and are ready for a successful deployment.

On a related topic, unless you are a small or medium team with fewer than 100 total users, our recommendation will always be to install a dual-server configuration so you can easily scale out your nodes (Application Tier, Data Tier) when needed.
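As a rough illustration, the questions above could feed a sizing helper like the following sketch. The thresholds and topology names are hypothetical, loosely mirroring the guidance in this section; they are not official sizing rules.

```python
# Hypothetical sizing helper: thresholds and topology names are
# illustrative, not official TFS guidance.

def recommend_topology(users, shared_sharepoint):
    """Map expected user count (one-year horizon) to a starting topology.

    users: expected number of users within a year.
    shared_sharepoint: True if SharePoint/Analysis Services share the box.
    """
    if users < 100 and not shared_sharepoint:
        return "single-server"
    if users <= 500:
        return "dual-server (separate Application Tier and Data Tier)"
    return "dual-server, plan for scale-out"

print(recommend_topology(50, False))   # single-server
print(recommend_topology(250, True))
```

The helper encodes the document's rule of thumb: past roughly 100 users, or when sharing resources with other server products, separate the tiers so each can be scaled independently later.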

#2 Optimize I/O bandwidth

Team Foundation Server, like other ALM products, can be very I/O intensive due to the load that subsystems like version control, work item tracking, and build generate. For a subset of our customers, the most common performance issue, or impediment to achieving higher scale, is directly related to disk I/O bandwidth. Small customers that deploy on machines with specs similar to development boxes are not initially affected, but once you start scaling to 100+ users and utilizing more features, disk I/O can quickly become a problem.

Our best practice is for administrators to monitor the disk I/O of both the Application and Data Tiers, the latter being of most importance, and to implement mitigation plans when it becomes a limiting factor.
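The monitoring rule can be sketched as a simple threshold check. The "about 2 per spindle" queue-length threshold is a common rule of thumb, assumed here rather than a TFS-specific figure, and the sample values stand in for readings of PerfMon's "\PhysicalDisk(_Total)\Avg. Disk Queue Length" counter.

```python
# Sketch of a disk I/O bottleneck check. Sample values are stand-ins
# for PerfMon "Avg. Disk Queue Length" readings; the threshold of
# ~2 outstanding requests per spindle is a rule-of-thumb assumption.

def io_bottleneck(samples, spindles=1, threshold=2.0):
    """Flag a disk as a likely bottleneck when its average queue
    length stays above threshold * spindles across the samples."""
    avg = sum(samples) / len(samples)
    return avg > threshold * spindles

print(io_bottleneck([0.5, 1.2, 0.8]))   # False
print(io_bottleneck([4.0, 5.5, 6.1]))   # True
```

In practice you would collect these counters continuously on the Data Tier and alert when the flag stays raised, rather than sampling ad hoc.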

There are three technologies you can deploy in order to increase your capacity: disk arrays, Network-Attached Storage (NAS), and Storage Area Networks (SANs). Your IT infrastructure and budget will dictate which of these technologies is best for your company.

#3 Use a Load Balancer or Application Delivery device

Our third best practice focuses on increasing the scale and reliability of the Application Tier node. Loosely defined, a TFS Application Tier is the node that hosts the web application for the Team Foundation web services. In TFS 2010, you can install a new application tier and have it join an existing server deployment, or create a new deployment altogether by also deploying a configuration database with it. Scaling out the Application Tier refers to the former option, where you configure a new node by installing the feature on that machine without deploying any database components.

We recommend that customers configure a load balancer when adding a second or third application tier node to the deployment. The load balancer sits in front of these application tiers and is in charge of effectively balancing the load across them. If you want to learn more about load balancers you can start by following this link, but at the core there are two solutions you can choose from: software (NLB is a good example) and hardware.

Hardware solutions, although usually more expensive, provide the most features, the greatest configuration flexibility, and the best performance. These hardware devices are known in today's market as Application Delivery devices due to their versatility, as they do much more than balance request load (e.g. routing, HTTPS, content acceleration).

In summary, the benefits of having a load balancer are:

  • High availability solution by routing requests to the active/hot nodes
  • Automatically balances load across nodes so users don’t have to selectively connect to individual Application Tier machines
  • Allows seamless scale, increasing or decreasing capacity, by provisioning new nodes and adding them to or removing them from the load balancer's configured list
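Conceptually, what the load balancer does for the Application Tier farm can be sketched as a minimal round-robin router. This illustrates the behavior only; it is not how any particular device or NLB is implemented.

```python
# Minimal round-robin sketch of a load balancer's core behavior:
# spread incoming requests across the configured list of AT nodes.
from itertools import cycle

class RoundRobinBalancer:
    def __init__(self, nodes):
        self.nodes = list(nodes)
        self._ring = cycle(self.nodes)

    def route(self):
        """Pick the next node for an incoming request."""
        return next(self._ring)

    def remove(self, node):
        """Seamless scale-down: drop a node and rebuild the ring."""
        self.nodes.remove(node)
        self._ring = cycle(self.nodes)

lb = RoundRobinBalancer(["AT1", "AT2", "AT3"])
print([lb.route() for _ in range(4)])  # ['AT1', 'AT2', 'AT3', 'AT1']
```

Real devices add health checks so requests are only routed to active/hot nodes, which is what provides the high-availability benefit in the list above.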

To learn more about possible Team Foundation configuration options, you can reference a whitepaper authored in partnership with F5, makers of application delivery devices.

#4 Leverage your Team Foundation Proxies

Within Microsoft’s internal deployment of TFS, the single most called method is download file. This is expected as we have thousands of developers coding features on a daily basis and hundreds of build machines delivering builds to test.

All of this traffic tends to overwhelm the Application Tier, impacting its ability to handle user requests responsively. The files requested are usually in the AT cache but not loaded in memory (they would be too large) and have to be constantly fetched, compressed, and transferred to end users.

This scenario is where the Team Foundation Server Proxy can be effectively leveraged to reduce AT load and solve these performance issues. The Team Foundation Proxy caches these versioned files and is optimized exactly for this function. With low hardware requirements (except for disk speed) and ease of deployment and configuration (a quick install and registration), it is the perfect resource to keep Application Tier resources focused on handling requests rather than delivering content.

The two most popular scenarios for deployments are Offshore and Build Labs. For the Offshore scenario you are looking to have one or more Team Foundation proxies on the LAN premises of the offshore team. The following illustration depicts that deployment.

In the Build Lab scenario your goal is to transfer load from the AT to proxy machines, all within the same LAN. In this deployment you set up a number of proxies and have all download- or read-intensive applications register a proxy for their use. The following illustration depicts that deployment.
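The proxy's role can be sketched as a cache keyed by file path and version that only falls back to the Application Tier on a miss. Here `fetch_from_at` is a hypothetical stand-in for the real download call, not a TFS API.

```python
# Conceptual sketch of the proxy: serve versioned file content from a
# local cache; only a cache miss generates load on the Application Tier.
# fetch_from_at is a hypothetical stand-in, not a real TFS call.

class VersionedFileCache:
    def __init__(self, fetch_from_at):
        self._fetch = fetch_from_at
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def get(self, path, version):
        """Return file content, fetching from the AT only on a miss."""
        key = (path, version)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._fetch(path, version)
        else:
            self.hits += 1
        return self._cache[key]

proxy = VersionedFileCache(lambda p, v: f"<contents of {p}@{v}>")
proxy.get("$/Main/app.cs", 42)
proxy.get("$/Main/app.cs", 42)
print(proxy.hits, proxy.misses)  # 1 1
```

Because versioned files are immutable for a given version, a cache like this is highly effective, which is why the proxy's disk speed matters more than its other specs.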

#5 Utilize the new Scale-Out Option

You have successfully deployed Team Foundation Server and adoption has increased very sharply in the last three months. As this is occurring, users’ complaints about performance are increasing in frequency and it is time to act.

In previous releases the decision was very easy: Scale Up. That meant either buying higher-end machines or improving the memory and CPU specs of the current deployment. In TFS 2010 the decision changes, as you now have the new option to Scale Out.

There are two core questions for scaling out:

  • Which variables get factored in the decision?
  • What is the best topology and machine specs?

Technically, you should Scale Out when 1) you need to increase throughput and it is more cost effective to distribute the load by moving resources (e.g. team project collections) to other nodes, or 2) you need to add more capacity to a deployment configuration that cannot be scaled up.

One of the most important elements is the load distribution of your team project collections. As part of this decision you are trying to optimize around three axes: CPU, memory, and disk I/O (we are assuming disk space is not an issue). Collections with a high number of requests will need more CPU, while collections with "normal" request levels but large data sizes will need more memory.

As you collect the load breakdown for each of the collections, try to match their capacity needs to the adequate hardware configuration. We recommend using our administration tools together with Windows diagnostic tools (e.g. performance counters) to effectively reach those decisions.

Note: Grant Holiday, Program Manager in our team, has a good blog post detailing how to gather and analyze this data, reference it if you are not familiar with our tools.
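The placement exercise can be sketched as a greedy assignment of collections to the data-tier node with the most remaining headroom. The collection names and load numbers are illustrative; a real decision would weigh CPU, memory, and disk I/O together rather than a single load score.

```python
# Hedged sketch of collection placement: greedily assign each
# collection (heaviest first) to the node with the most headroom.
# Names and load scores are illustrative, not measured values.

def place_collections(collections, node_count, node_capacity):
    """collections: list of (name, load); returns node -> [names]."""
    nodes = {f"DT{i+1}": node_capacity for i in range(node_count)}
    placement = {name: [] for name in nodes}
    for name, load in sorted(collections, key=lambda c: -c[1]):
        target = max(nodes, key=nodes.get)  # node with most headroom
        nodes[target] -= load
        placement[target].append(name)
    return placement

plan = place_collections(
    [("Suite", 60), ("Tools", 25), ("Web", 40), ("Legacy", 10)],
    node_count=2, node_capacity=100)
print(plan)
```

Greedy placement is a first approximation; the administration and diagnostic tools mentioned above supply the actual per-collection load data this sketch takes as input.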

In a scale out deployment the rule of thumb is to keep the machine specs as close to “commodity hardware” as possible. Our recommended topology is a set of application tiers and data tiers with the following specs:

  • Application Tier: 1 processor, Quad Core @ 2.0GHz, and 4GB RAM
  • Data Tier: 1 processor, Quad Core @ 2.3GHz, and 8GB RAM

Each of these machines will be able to service the load of 1,000 users.

#6 Stay within Team Foundation Server limits

Although Team Foundation Server is highly scalable there are limits you should be aware of since they directly impact scale-up and scale out decisions. Most of the limits are not enforced by the product, but rather recommendations by the product team in order to maintain a certain level of performance. You can read more about our limits here.

There are some limits which should be closely monitored by the administrator as those have the potential to incur the most impact on performance.

TFS 2010 Limits

  • 200 Team Projects per Team Project Collection
  • 50 – 200 active Team Project Collections per SQL instance (range for 8GB – 64GB of RAM)
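These two limits lend themselves to an automated check, sketched below. The 200-project and 50 – 200 collection figures come from the list above; the linear interpolation across the RAM range is our own simplifying assumption.

```python
# Sketch of a limit check. The 200 projects/TPC and 50-200 TPCs/SQL
# instance figures are from the limits above; the linear RAM
# interpolation is a simplifying assumption, not product guidance.

def max_active_collections(ram_gb):
    """Interpolate the 50-200 active-TPC range across 8-64GB of RAM."""
    ram_gb = min(max(ram_gb, 8), 64)
    return 50 + (ram_gb - 8) * (200 - 50) / (64 - 8)

def within_limits(projects_per_tpc, active_tpcs, ram_gb):
    return (projects_per_tpc <= 200
            and active_tpcs <= max_active_collections(ram_gb))

print(within_limits(150, 40, 8))    # True
print(within_limits(150, 120, 16))  # False
```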

A subset of the limits outlined in the linked document can be hard to monitor or to automate a mitigation plan for. Nevertheless, they are important and should be treated with the same priority as machine configuration or branch management strategies.

#7 Job Agent Grids for Large Scale Deployments

If you are not a hosting company, or are not planning to deploy Team Foundation Server to 10,000+ users, you can skip this best practice and proceed to the next section. Large deployments call for highly scaled-out configurations, and this last best practice provides some insight into achieving those goals.

The TFS Background Job Agent is an executable installed on the Application Tier responsible for processing all of the background tasks generated by the server components. These tasks are usually very CPU intensive and at times contend for machine resources with the TFS Web Application. As load increases there could be instances where AT performance is significantly impacted.

Our best practice for these scenarios is to separate these components into two roles. The deployment is configured to have a farm of Application Tiers handling user requests and a grid of Job Agents handling all the intensive background tasks. Separating the Application Tier into these two roles is achieved by performing an Application Tier Only install and then stopping the Team Foundation Server Website while leaving the Job Agent active.

It is important to reiterate that splitting these components is not recommended for most deployments, as the benefits provide their biggest return on investment (ROI) at large scale. The benefits are centered on:

Lower operational cost by using commodity hardware – previously, when planning for your Application Tier load, you had to consider both your user load and the load incurred on the system by the background tasks. This meant that in order to achieve your desired capacity, the specs for the AT were higher than typical "commodity hardware" specs. With this split you can have two nodes executing different roles, without resource and priority contention, on inexpensive hardware.

Resource needs map clearly to application needs – this configuration provides total control and flexibility over the solutions to application scale needs. If your background task activity is relatively low, but your user read load is very high due to custom tools that interact with the server, you can add application tiers to match your capacity needs exactly without expanding your background task grid. In the end, the elasticity needed to achieve your performance goals is better defined, leading to lower operational, administrative, and troubleshooting costs.