SQL Server Technical Article
Summary: The APS architecture provides Massive Parallel Processing (MPP) scalability that can overcome many of the performance limitations generally associated with loading data into a traditional SQL Server database. In this whitepaper we will examine a number of concepts and considerations relevant to loading data into the Analytic Platform System (APS) appliance for optimal performance.
Writer: Andy Isley, Solution Architect
Technical Reviewers: Charles Feddersen, Mary Long, Brian Mitchell
Published: June 2015
Applies to: Analytics Platform System
Copyright
This document is provided “as-is”. Information and views expressed in this document, including URL and other Internet Web site references, may change without notice. You bear the risk of using it.
Some examples depicted herein are provided for illustration only and are fictitious. No real association or connection is intended or should be inferred.
This document does not provide you with any legal rights to any intellectual property in any Microsoft product. You may copy and use this document for your internal, reference purposes.
© 2014 Microsoft. All rights reserved.
Contents
Introduction
Objectives
Data Loading
Importance of Load Rate
PDW Table Architecture
The Data Movement Service
Hashing Algorithm
APPEND -m vs. FASTAPPEND -m + CTAS
Linear Scaling of Loads
Single File using DWLoader
Scaling of APS by adding Scale Units
Multiple File loading
Multiple Files Using DWLoader
Possible Causes Of Slow Load Speeds
Real World Scenarios
Best Practice
CTAS vs Upsert
Conclusions
Introduction
In this whitepaper, we will present a number of concepts and considerations relevant to loading data into the Analytic Platform System (APS) appliance. The APS architecture provides Massive Parallel Processing (MPP) scalability that can overcome many of the performance limitations generally associated with loading data into a traditional SQL Server database.
All of the tests described in this paper were performed on a single-rack appliance running APS AU3 RTM (v10.0.3181.0) of PDW. As such, no assumptions should be made that the results are accurate for, or reflect the performance of, any previous or future version of PDW, or of any smaller or larger compute rack configuration.
Objectives
- Explain key concepts relating to the Data Movement Service
- Provide examples to demonstrate load performance and how it is affected by database design
- Demonstrate different causes of bottlenecks in the load process
- Discuss the various load methods (SSIS/DWLoader) and load rates based on each method
- Discuss load rates based on table geometry and index strategy
Data Loading
Importance of Load Rate
The rate at which the Parallel Data Warehouse (PDW) Region of an APS System can load data is generally one of the most important questions asked by a potential customer, either during a Proof of Concept or during an implementation. It is a simple and tangible metric that customers can use to make comparisons against their existing implementations or against competing vendors they may also be evaluating. Given the performance limitations of loading data into a single table in a traditional Symmetric Multiprocessing (SMP) database, via either BCP or SSIS, it is important that the load results for an APS demonstrate multiple-factor gains over the existing solution.
PDW Table Architecture
A database in PDW implements two different types of tables to achieve MPP scalability across multiple physical servers. These types are called Replicated and Distributed. In the case of a replicated table, an entire copy of the table exists on each physical compute node, so there is always a 1:1 relationship between compute nodes and the physical instances of a replicated table.
Distributed tables are represented as 8 physical tables on each compute node. A single-rack appliance contains 8 or 9 compute nodes depending on the manufacturer of the APS, so 64 to 72 physical tables are created across these nodes for each distributed table. Data is distributed amongst these physical tables by applying a multi-byte hashing algorithm to a single column of the distributed table. All 8 physical tables per compute node are stored in a single database within the named instance of that compute node. The following diagram explains the physical table layout for a distributed table.
Load speed will vary depending on the type of distribution used. This will be covered in more detail later in the document.
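To make the two table types concrete, the following DDL is a minimal sketch using illustrative table and column names (these objects are not part of the test schema used in this paper):

CREATE TABLE [dbo].[DimRegion]
(
    [region_key]  INT NOT NULL,
    [region_name] VARCHAR(50) NOT NULL
)
WITH (DISTRIBUTION = REPLICATE);    -- a full copy of the table is stored on every compute node

CREATE TABLE [dbo].[FactSales]
(
    [sales_key]    BIGINT NOT NULL,
    [region_key]   INT NOT NULL,
    [sales_amount] FLOAT NOT NULL
)
WITH (DISTRIBUTION = HASH ([sales_key]));    -- rows are hashed on sales_key across the 8 distributions per compute node

The column named in the HASH clause determines which physical distribution receives each row and, as noted above, the distribution choice influences load speed.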
The Data Movement Service
The APS appliance implements the Data Movement Service (DMS) for transporting data between compute nodes during load and query operations. Whilst this paper will focus primarily on the loading of data into the PDW, some of the information regarding DMS buffer management is also applicable to optimizing the query process.
The DMS uses two different buffer sizes when loading data into the appliance. When moving data between the Loading Server and the Compute Nodes, a 256 KB buffer is used to move the data. At this stage of the load the data is still raw and unparsed. Once the data has been received by each Compute Node, converted to the applicable data types, and hashed on the distribution key column, the DMS then uses 32 KB buffers (32768 bytes) to transfer data between the Compute Nodes (shuffle) to complete the load process. The details below focus on the 32 KB buffers, as these are affected by the destination DDL whereas the 256 KB buffer is not.
These 32 KB buffers also contain additional header information transmitted with each buffer (the 256 KB buffer holds only items 1, 2 and 5 from the list below); however, the header does not form part of the 32768 bytes permitted for data use. Whilst a detailed look at the buffer header is beyond the scope of this document, the following items summarize the data stored in the header.
1. The Command Type: either a WriteCommand (0x02) or a CloseCommand (0x03). Every buffer except the last in a load contains the WriteCommand; only the final buffer in a load contains the CloseCommand, and that final buffer contains no data.
2. The PlanId: a unique identifier for the DMS plan associated with the buffer.
3. The Distribution: an integer that defines the target distribution for the buffer.
4. The SourceStepIndex.
5. The SizeOfData: the actual size of the data stored in the buffer, in bytes.
Within each 32 KB buffer the DMS effectively creates fixed width columns to store the data from the source file. This is a very important concept to understand when optimizing the load performance during a POC or implementation. The width specified for each column is determined by the ODBC C data structures that support each SQL type. More information on these data types can be found here: C Data Types.
The byte size of each data type supported in PDW generally aligns with the max_length defined in the SQL Server sys.types catalog view; however, there are some exceptions. Examples of this are provided later in this document. A simple example of the fixed width columns can be made using a varchar(100) data type in the destination table. The following diagram shows how 3 different length strings are stored in the DMS buffer.
This example is deliberately extreme in order to demonstrate unused space in the buffer. You will notice that the column is described as having 101 bytes of data, even though the column DDL definition was for 100 bytes. This is because an extra byte is required in the C language to represent the end-of-string null terminator. This extra byte applies only to the char, varchar, nchar and nvarchar data types. In the case of nchar or nvarchar, an extra 2 bytes are required for the terminator as these types store double-byte data.
In addition to the actual size of the data being stored, there is overhead in the buffer caused by the data type as well as the NULL status of the destination column. If a column in the destination table allows NULL, or if it is a string data type (char, varchar, nchar, nvarchar), then an 8 byte indicator is added to that column in the buffer. This indicator specifies the length of a variable-length column as well as whether the value in the field is NULL. If a column is both nullable and of a variable-length type, the indicator remains 8 bytes in total for that column; it does not grow to a 16 byte indicator. The only way to avoid this 8 byte indicator is to use a fixed-width data type (int, bigint, datetime, etc.) that is defined as NOT NULL in the destination database table.
The following table summarizes whether or not a column will have an 8 byte indicator:
Column Type / 8 Byte Indicator
Fixed Width Data Type NOT NULL / No
Fixed Width Data Type NULL / Yes (8 Bytes)
Variable Width Data Type (NULL or NOT NULL) / Yes (8 Bytes)
As a result of this behavior, a subtle change to table DDL may have a significant impact on how efficiently rows from a source file are packed into each buffer during a load. The same effect occurs when performing a CTAS into a new table where an existing string data type is CAST to an unnecessarily large size. Here are two examples of similar table DDL with their corresponding data sizing.
create table Example1 (
id int not null
, name char(10) not null
, code smallint not null
, address varchar(8) not null)
Column / Indicator Bytes / Data Bytes / Total
id / 0 / 4 / 4
name / 0 / 10 + 1 / 11
code / 0 / 2 / 2
address / 8 / 8 + 1 / 17
Total: 34
create table Example2 (
id bigint null
, name char(10) null
, code char(5) not null
, address varchar(8) null)
Column / Indicator Bytes / Data Bytes / Total
id / 8 / 8 / 16
name / 8 / 10 + 1 / 19
code / 8 / 5 + 1 / 14
address / 8 / 8 + 1 / 17
Total: 66
This example demonstrates that two variations of the same table, either of which could have been used to store the same data, have a significant effect on the number of rows per buffer in the DMS. The best data loading efficiency is achieved by packing as many rows into each buffer as possible. It is common for a customer to supply DDL with excessive data types as part of a POC scope. Generally these are oversized string types, but it may also be the case that BIGINT is applied where a smaller numeric type would be more suitable.
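Using the totals calculated above, a 32 KB buffer (32768 bytes) holds roughly 32768 / 34 ≈ 963 rows of Example1 but only 32768 / 66 ≈ 496 rows of Example2; the looser DDL roughly halves the number of rows moved per buffer. These figures follow from the sizing rules described in this document rather than from measured appliance output.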
Calculating the number of rows in each buffer is important for understanding the efficiency of the load process. As mentioned above, the data sizing for each field used in the buffer is based on the ODBC C data type definitions. The following query can be used to identify the number of rows that can be packed into a DMS buffer when loading a file into the database. The same query can be used to identify buffer utilization for data movements during queries. As the data type sizes do not completely align with the sys.types max_length values, some adjustments have been made in the query to match the sizes of the ODBC C data types used.
-- Estimates how many rows can be packed into a single 32 KB (32768 byte) DMS buffer
-- for a given destination table. Replace 'si' in the WHERE clause with the name of
-- the table being analyzed.
select
     CAST(32768. / SUM(CASE
                           WHEN NullValue = 1 OR VariableLength = 1 THEN 8
                           ELSE 0
                       END + [DataLength]) AS INT) AS RowsPerBuffer
    ,SUM(CASE
             WHEN NullValue = 1 OR VariableLength = 1 THEN 8
             ELSE 0
         END + [DataLength]) AS RowSize
    ,32768. % SUM(CASE
                      WHEN NullValue = 1 OR VariableLength = 1 THEN 8
                      ELSE 0
                  END + [DataLength]) AS BufferFreeBytes
    ,CAST(((32768. % SUM(CASE
                             WHEN NullValue = 1 OR VariableLength = 1 THEN 8
                             ELSE 0
                         END + [DataLength])) / 32768) * 100 AS DECIMAL(8,5)) AS [BufferFreePercent]
from (
    select
         c.name
        ,c.is_nullable AS [NullValue]
        -- string data types always require the 8 byte indicator
        ,CASE
             WHEN ty.name IN ('char','varchar','nchar','nvarchar') THEN 1
             ELSE 0
         END AS [VariableLength]
        -- add the null terminator byte(s) for string data types
        ,CASE
             WHEN ty.name IN ('char','varchar') THEN c.max_length + 1
             WHEN ty.name IN ('nchar','nvarchar') THEN c.max_length + 2
             WHEN ty.name IN ('binary','varbinary') THEN c.max_length
             ELSE ty.max_length
         END AS [DataLength]
    from sys.tables t
    inner join sys.columns c on t.object_id = c.object_id
    inner join (
        -- adjust sys.types max_length to the ODBC C structure sizes used by the DMS
        select
             name
            ,system_type_id
            ,user_type_id
            ,CASE
                 WHEN name = 'time' THEN 12
                 WHEN name IN ('date','datetime','datetime2','datetimeoffset') THEN max_length * 2
                 WHEN name = 'smalldatetime' THEN 16
                 WHEN name = 'decimal' THEN 19
                 ELSE max_length
             END AS max_length
        from sys.types t) ty on
            c.system_type_id = ty.system_type_id and
            c.user_type_id = ty.user_type_id
    where
        t.name = 'si') z
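As a usage sketch, replace 'si' in the WHERE clause with the name of the table to be analyzed. Pointing the query at the Example2 table defined earlier (t.name = 'Example2') would, under the sizing rules described above, report a RowSize of 66 bytes and a RowsPerBuffer of 496; these values are calculated from the rules in this document rather than captured from an appliance.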
To emphasize the impact that oversized DDL has on load performance, two loads of the lineitem table were run with the same source file. The first load was based on the standard DDL for the lineitem table, and the second was into altered DDL where the l_comment column type was set to varchar(8000). The DDL for each of these tables is as follows.
CREATE TABLE [TPC_H].[dbo].[lineitem_h]
(
[l_orderkey] BIGINT NOT NULL,
[l_partkey] BIGINT NOT NULL,
[l_suppkey] BIGINT NOT NULL,
[l_linenumber] BIGINT NOT NULL,
[l_quantity] FLOAT NOT NULL,
[l_extendedprice] FLOAT NOT NULL,
[l_discount] FLOAT NOT NULL,
[l_tax] FLOAT NOT NULL,
[l_returnflag] CHAR(1) COLLATE Latin1_General_100_CI_AS_KS_WS NOT NULL,
[l_linestatus] CHAR(1) COLLATE Latin1_General_100_CI_AS_KS_WS NOT NULL,
[l_shipdate] DATE NOT NULL,
[l_commitdate] DATE NOT NULL,
[l_receiptdate] DATE NOT NULL,
[l_shipinstruct] CHAR(25) COLLATE Latin1_General_100_CI_AS_KS_WS NOT NULL,
[l_shipmode] CHAR(10) COLLATE Latin1_General_100_CI_AS_KS_WS NOT NULL,
[l_comment] VARCHAR(44) COLLATE Latin1_General_100_CI_AS_KS_WS NOT NULL
)
WITH (DISTRIBUTION = HASH ([l_orderkey]));
CREATE TABLE [TPC_H].[dbo].[lineitem_h]
(
[l_orderkey] BIGINT NOT NULL,
[l_partkey] BIGINT NOT NULL,
[l_suppkey] BIGINT NOT NULL,
[l_linenumber] BIGINT NOT NULL,
[l_quantity] FLOAT NOT NULL,
[l_extendedprice] FLOAT NOT NULL,
[l_discount] FLOAT NOT NULL,
[l_tax] FLOAT NOT NULL,
[l_returnflag] CHAR(1) COLLATE Latin1_General_100_CI_AS_KS_WS NOT NULL,
[l_linestatus] CHAR(1) COLLATE Latin1_General_100_CI_AS_KS_WS NOT NULL,
[l_shipdate] DATE NOT NULL,
[l_commitdate] DATE NOT NULL,
[l_receiptdate] DATE NOT NULL,
[l_shipinstruct] CHAR(25) COLLATE Latin1_General_100_CI_AS_KS_WS NOT NULL,
[l_shipmode] CHAR(10) COLLATE Latin1_General_100_CI_AS_KS_WS NOT NULL,
[l_comment] VARCHAR(8000) COLLATE Latin1_General_100_CI_AS_KS_WS NOT NULL
)
WITH (DISTRIBUTION = HASH ([l_orderkey]));
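Applying the buffer query shown earlier to these two definitions illustrates the difference as a calculation rather than a measurement: the standard DDL produces a row of roughly 208 bytes in the DMS buffer, or about 157 rows per 32 KB buffer, whereas widening l_comment to varchar(8000) inflates the row to roughly 8164 bytes and leaves only about 4 rows per buffer, even though the data in the source file is identical.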