Slowly Changing Dimension Transformation
SQL Server 2012
Other Versions
11 out of 17 rated this helpful-Rate this topic
The Slowly Changing Dimension transformation coordinates the updating and inserting of records in data warehouse dimension tables. For example, you can use this transformation to configure the transformation outputs that insert and update records in the DimProduct table of the AdventureWorksDW2012 database with data from the Production.Products table in the AdventureWorks OLTP database.
ImportantThe Slowly Changing Dimension Wizard only supports connections to SQL Server.
The Slowly Changing Dimension transformation provides the following functionality for managing slowly changing dimensions:
- Matching incoming rows with rows in the lookup table to identify new and existing rows.
- Identifying incoming rows that contain changes when changes are not permitted.
- Identifying inferred member records that require updating.
- Identifying incoming rows that contain historical changes that require insertion of new records and the updating of expired records.
- Detecting incoming rows that contain changes that require the updating of existing records, including expired ones.
The Slowly Changing Dimension transformation supports four types of changes: changing attribute, historical attribute, fixed attribute, and inferred member.
- Changing attribute changes overwrite existing records. This kind of change is equivalent to a Type 1 change. The Slowly Changing Dimension transformation directs these rows to an output namedChanging Attributes Updates Output.
- Historical attribute changes create new records instead of updating existing ones. The only change that is permitted in an existing record is an update to a column that indicates whether the record is current or expired. This kind of change is equivalent to a Type 2 change. The Slowly Changing Dimension transformation directs these rows to two outputs:Historical Attribute Inserts OutputandNew Output.
- Fixed attribute changes indicate the column value must not change. The Slowly Changing Dimension transformation detects changes and can direct the rows with changes to an output namedFixed Attribute Output.
- Inferred member indicates that the row is an inferred member record in the dimension table. An inferred member exists when a fact table references a dimension member that is not yet loaded. A minimal inferred-member record is created in anticipation of relevant dimension data, which is provided in a subsequent loading of the dimension data. The Slowly Changing Dimension transformation directs these rows to an output namedInferred Member Updates. When data for the inferred member is loaded, you can update the existing record rather than create a new one.
Note
The Slowly Changing Dimension transformation does not support Type 3 changes, which require changes to the dimension table. By identifying columns with the fixed attribute update type, you can capture the data values that are candidates for Type 3 changes.
At run time, the Slowly Changing Dimension transformation first tries to match the incoming row to a record in the lookup table. If no match is found, the incoming row is a new record; therefore, the Slowly Changing Dimension transformation performs no additional work, and directs the row toNew Output.
If a match is found, the Slowly Changing Dimension transformation detects whether the row contains changes. If the row contains changes, the Slowly Changing Dimension transformation identifies the update type for each column and directs the row to theChanging Attributes Updates Output,Fixed Attribute Output,Historical Attributes Inserts Output, orInferred Member Updates Output. If the row is unchanged, the Slowly Changing Dimension transformation directs the row to theUnchanged Output.
Slowly Changing Dimension Transformation Outputs
The Slowly Changing Dimension transformation has one input and up to six outputs. An output directs a row to the subset of the data flow that corresponds to the update and the insert requirements of the row. This transformation does not support an error output.
The following table describes the transformation outputs and the requirements of their subsequent data flows. The requirements describe the data flow that the Slowly Changing Dimension Wizard creates.
Output / Description / Data flow requirementsChanging Attributes Updates Output / The record in the lookup table is updated. This output is used for changing attribute rows. / An OLE DB Command transformation updates the record using an UPDATE statement.
Fixed Attribute Output / The values in rows that must not change do not match values in the lookup table. This output is used for fixed attribute rows. / No default data flow is created. If the transformation is configured to continue after it encounters changes to fixed attribute columns, you should create a data flow that captures these rows.
Historical Attributes Inserts Output / The lookup table contains at least one matching row. The row marked as “current” must now be marked as "expired". This output is used for historical attribute rows. / Derived Column transformations create columns for the expired row and the current row indicators. An OLE DB Command transformation updates the record that must now be marked as "expired". The row with the new column values is directed to the New Output, where the row is inserted and marked as "current".
Inferred Member Updates Output / Rows for inferred dimension members are inserted. This output is used for inferred member rows. / An OLE DB Command transformation updates the record using an SQL UPDATE statement.
New Output / The lookup table contains no matching rows. The row is added to the dimension table. This output is used for new rows and changes to historical attributes rows. / A Derived Column transformation sets the current row indicator, and an OLE DB destination inserts the row.
Unchanged Output / The values in the lookup table match the row values. This output is used for unchanged rows. / No default data flow is created because the Slowly Changing Dimension transformation performs no work. If you want to capture these rows, you should create a data flow for this output.
Business Keys
The Slowly Changing Dimension transformation requires at least one business key column.
The Slowly Changing Dimension transformation does not support null business keys. If the data include rows in which the business key column is null, those rows should be removed from the data flow. You can use the Conditional Split transformation to filter rows whose business key columns contain null values. For more information, seeConditional Split Transformation.
Optimizing the Performance of the Slowly Changing Dimension Transformation
For suggestions on how to improve the performance of the Slowly Changing Dimension Transformation, seeData Flow Performance Features.
Troubleshooting the Slowly Changing Dimension Transformation
You can log the calls that the Slowly Changing Dimension transformation makes to external data providers. You can use this logging capability to troubleshoot the connections, commands, and queries to external data sources that the Slowly Changing Dimension transformation performs. To log the calls that the Slowly Changing Dimension transformation makes to external data providers, enable package logging and select theDiagnosticevent at the package level. For more information, seeTroubleshooting Tools for Package Execution.
Configuring the Slowly Changing Dimension Transformation
You can set properties through SSIS Designer or programmatically.
For more information about the properties that you can set in theAdvanced Editordialog box or programmatically, click one of the following topics:
- Common Properties
- Transformation Custom Properties
For more information about how to set properties, seeSet the Properties of a Data Flow Component.
Configuring the Slowly Changing Dimension Transformation Outputs
Coordinating the update and insertion of records in dimension tables can be a complex task, especially if both Type 1 and Type 2 changes are used. SSIS Designer provides two ways to configure support for slowly changing dimensions:
- TheAdvanced Editordialog box, in which you to select a connection, set common and custom component properties, choose input columns, and set column properties on the six outputs. To complete the task of configuring support for a slowly changing dimension, you must manually create the data flow for the outputs that the Slowly Changing Dimension transformation uses. For more information, seeData Flow.
- The Load Dimension Wizard, which guides you though the steps to configure the Slowly Changing Dimension transformation and build the data flow for transformation outputs. To change the configuration for slowly change dimensions, rerun the Load Dimension Wizard. For more information, seeConfigure Outputs Using the Slowly Changing Dimension Wizard.
Fuzzy Grouping Transformation
SQL Server 2012
Other Versions
4 out of 4 rated this helpful-Rate this topic
The Fuzzy Grouping transformation performs data cleaning tasks by identifying rows of data that are likely to be duplicates and selecting a canonical row of data to use in standardizing the data.
NoteFor more detailed information about the Fuzzy Grouping transformation, including performance and memory limitations, see the white paper,Fuzzy Lookup and Fuzzy Grouping in SQL Server Integration Services 2005.
The Fuzzy Grouping transformation requires a connection to an instance of SQL Server to create the temporary SQL Server tables that the transformation algorithm requires to do its work. The connection must resolve to a user who has permission to create tables in the database.
To configure the transformation, you must select the input columns to use when identifying duplicates, and you must select the type of match—fuzzy or exact—for each column. An exact match guarantees that only rows that have identical values in that column will be grouped. Exact matching can be applied to columns of any Integration Services data type except DT_TEXT, DT_NTEXT, and DT_IMAGE. A fuzzy match groups rows that have approximately the same values. The method for approximate matching of data is based on a user-specified similarity score. Only columns with the DT_WSTR and DT_STR data types can be used in fuzzy matching. For more information, seeIntegration Services Data Types.
The transformation output includes all input columns, one or more columns with standardized data, and a column that contains the similarity score. The score is a decimal value between 0 and 1. The canonical row has a score of 1. Other rows in the fuzzy group have scores that indicate how well the row matches the canonical row. The closer the score is to 1, the more closely the row matches the canonical row. If the fuzzy group includes rows that are exact duplicates of the canonical row, these rows also have a score of 1. The transformation does not remove duplicate rows; it groups them by creating a key that relates the canonical row to similar rows.
The transformation produces one output row for each input row, with the following additional columns:
- _key_in, a column that uniquely identifies each row.
- _key_out, a column that identifies a group of duplicate rows. The_key_outcolumn has the value of the_key_incolumn in the canonical data row. Rows with the same value in_key_outare part of the same group. The_key_outvalue for a group corresponds to the value of_key_inin the canonical data row.
- _score, a value between 0 and 1 that indicates the similarity of the input row to the canonical row.
These are the default column names and you can configure the Fuzzy Grouping transformation to use other names. The output also provides a similarity score for each column that participates in a fuzzy grouping.
The Fuzzy Grouping transformation includes two features for customizing the grouping it performs: token delimiters and similarity threshold. The transformation provides a default set of delimiters used to tokenize the data, but you can add new delimiters that improve the tokenization of your data.
The similarity threshold indicates how strictly the transformation identifies duplicates. The similarity thresholds can be set at the component and the column levels. The column-level similarity threshold is only available to columns that perform a fuzzy match. The similarity range is 0 to 1. The closer to 1 the threshold is, the more similar the rows and columns must be to qualify as duplicates. You specify the similarity threshold among rows and columns by setting theMinSimilarityproperty at the component and column levels. To satisfy the similarity that is specified at the component level, all rows must have a similarity across all columns that is greater than or equal to the similarity threshold that is specified at the component level.
The Fuzzy Grouping transformation calculates internal measures of similarity, and rows that are less similar than the value specified inMinSimilarityare not grouped.
To identify a similarity threshold that works for your data, you may have to apply the Fuzzy Grouping transformation several times using different minimum similarity thresholds. At run time, the score columns in transformation output contain the similarity scores for each row in a group. You can use these values to identify the similarity threshold that is appropriate for your data. If you want to increase similarity, you should setMinSimilarityto a value larger than the value in the score columns.
You can customize the grouping that the transformation performs by setting the properties of the columns in the Fuzzy Grouping transformation input. For example, theFuzzyComparisonFlagsproperty specifies how the transformation compares the string data in a column, and theExactFuzzyproperty specifies whether the transformation performs a fuzzy match or an exact match.
The amount of memory that the Fuzzy Grouping transformation uses can be configured by setting theMaxMemoryUsagecustom property. You can specify the number of megabytes (MB) or use the value 0 to allow the transformation to use a dynamic amount of memory based on its needs and the physical memory available. TheMaxMemoryUsagecustom property can be updated by a property expression when the package is loaded. For more information, seeIntegration Services (SSIS) Expressions,Use Property Expressions in Packages, andTransformation Custom Properties.
This transformation has one input and one output. It does not support an error output.
Row Comparison
When you configure the Fuzzy Grouping transformation, you can specify the comparison algorithm that the transformation uses to compare rows in the transformation input. If you set theExhaustiveproperty totrue, the transformation compares every row in the input to every other row in the input. This comparison algorithm may produce more accurate results, but it is likely to make the transformation perform more slowly unless the number of rows in the input is small. To avoid performance issues, it is advisable to set theExhaustiveproperty totrueonly during package development.
Temporary Tables and Indexes
At run time, the Fuzzy Grouping transformation creates temporary objects such as tables and indexes, potentially of significant size, in the SQL Server database that the transformation connects to. The size of the tables and indexes are proportional to the number of rows in the transformation input and the number of tokens created by the Fuzzy Grouping transformation.
The transformation also queries the temporary tables. You should therefore consider connecting the Fuzzy Grouping transformation to a non-production instance of SQL Server, especially if the production server has limited disk space available.
The performance of this transformation may improve if the tables and indexes it uses are located on the local computer.
Configuration of the Fuzzy Grouping Transformation
You can set properties through SSIS Designer or programmatically.
For more information about the properties that you can set in theFuzzy Grouping Transformation Editordialog box, click one of the following topics:
- Fuzzy Grouping Transformation Editor (Connection Manager Tab)
- Fuzzy Grouping Transformation Editor (Columns Tab)
- Fuzzy Grouping Transformation Editor (Advanced Tab)
Fuzzy Lookup Transformation
SQL Server 2012
Other Versions
3 out of 4 rated this helpful-Rate this topic
The Fuzzy Lookup transformation performs data cleaning tasks such as standardizing data, correcting data, and providing missing values.
NoteFor more detailed information about the Fuzzy Lookup transformation, including performance and memory limitations, see the white paper,Fuzzy Lookup and Fuzzy Grouping in SQL Server Integration Services 2005.
The Fuzzy Lookup transformation differs from the Lookup transformation in its use of fuzzy matching. The Lookup transformation uses an equi-join to locate matching records in the reference table. It returns records with at least one matching record, and returns records with no matching records. In contrast, the Fuzzy Lookup transformation uses fuzzy matching to return one or more close matches in the reference table.
A Fuzzy Lookup transformation frequently follows a Lookup transformation in a package data flow. First, the Lookup transformation tries to find an exact match. If it fails, the Fuzzy Lookup transformation provides close matches from the reference table.