Tom Kane
MISI 740 ETL Data Load and Analysis
DATA and SQL REQUIREMENTS
#1 - Verification of the successful creation and load of the northwind_mart database
#2a - Screen shot of EmployeeDim (Employee Dimension table) successful load
#2b - Verify source data populated - Northwind.dbo.Employees
to northwind_mart.dbo.EmployeeDim
#2c - Screen shot of Customer_Dim (Customer Dimension table) successful load
#2d - Verify source data populated - Northwind.dbo.Customers
to northwind_mart.dbo.Customer_Dim
#2e - Screen shot of Product_Dim (Product Dimension table) successful load
#2f - Verify source data populated - Northwind.dbo.Products
to northwind_mart.dbo.Product_Dim
#2g - Screen shot of ShipperDim (Shipper Dimension table) successful load
#2h - Verify source data populated - Northwind.dbo.Shippers
to northwind_mart.dbo.ShipperDim
#3a - Verification of DateDim populator successful load - verifies the Date Dimension table creation
#3b - Verification of DateDim populator successful file load - this screen verifies that the DateDim table loaded with 9,162 rows
#4a - Verify that removal of existing Sales_Fact records was successful (step 1)
#4b - INSERT INTO northwind_mart.dbo.Sales_Fact - verification of successful population of the Sales_Fact table (step 2)
#4c - Sales_Fact table update (LineItemFreight) - update calculation/script (step 3)
LineItemTotal SUM verification
(UnitPrice * Quantity) SUM verification
Note: The SUM of LineItemTotal and the SUM of (UnitPrice * Quantity) are equal, confirming that LineItemTotal is the product of UnitPrice and Quantity; for example, the value of product on hand could be pulled directly from LineItemTotal.
SUM (Quantity) verification
SUM (LineItemQuantity) verification
SUM (Discount) verification
SUM (LineItemDiscount) verification
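The reconciliation in the note above can be sketched in miniature. This example uses SQLite via Python purely for illustration (the assignment targets SQL Server); the data is invented, and the column names follow the assignment's naming.

```python
import sqlite3

# Hypothetical miniature Sales_Fact table; data is invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Sales_Fact (
        UnitPrice REAL, LineItemQuantity INTEGER,
        LineItemTotal REAL, LineItemDiscount REAL)
""")
rows = [(10.0, 3, 30.0, 0.0), (4.5, 2, 9.0, 0.1)]
conn.executemany("INSERT INTO Sales_Fact VALUES (?, ?, ?, ?)", rows)

# The reconciliation check: SUM(LineItemTotal) should equal
# SUM(UnitPrice * LineItemQuantity) if LineItemTotal was calculated correctly.
total, derived = conn.execute("""
    SELECT SUM(LineItemTotal), SUM(UnitPrice * LineItemQuantity)
    FROM Sales_Fact
""").fetchone()
print(total, derived, total == derived)
```

If the two aggregates disagree, some line's LineItemTotal was not computed as UnitPrice * Quantity, which is exactly what the verification screenshots are meant to rule out.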
#5a - Write two SQL SELECT statements utilizing the Sales_Fact table and one other (Product_Dim); include a JOIN and GROUP BY - verification
The purpose of this statement is to select the fields required to determine the name of the product, who the supplier is, and the date the product is required to be available. The results are grouped by supplier, product name, and required date so that the business knows who is sending which product, and by when. Because the information resides in two different tables, the tables must be joined on a column common to both; in this case it is ProductKey.
#5b - Write two SQL SELECT statements utilizing the Sales_Fact table and one other (Shipper_Dim); include a JOIN and GROUP BY - verification
The purpose of this statement is to select the fields required to determine the name of the shipper, the shipper ID, and the line item quantity of the product. The results are grouped by ShipperName, ShipperID, and LineItemQuantity so that the business knows the quantity of product and who will be shipping it. The ID may matter to the business as a short integer field that saves data-entry time, while the name still allows the shipper to be identified when needed. Because the information resides in two different tables, the tables must be joined on a column common to both; in this case it is ShipperKey.
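The same pattern applied to Shipper_Dim can be sketched as below, again with SQLite via Python and invented data; the Shipper_Dim/ShipperKey names are assumptions. One small departure from the text: this sketch SUMs LineItemQuantity per shipper rather than grouping by it, since aggregating the measure is the more common design choice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Shipper_Dim (ShipperKey INTEGER PRIMARY KEY,
                              ShipperID INTEGER, ShipperName TEXT);
    CREATE TABLE Sales_Fact (ShipperKey INTEGER, LineItemQuantity INTEGER);
    INSERT INTO Shipper_Dim VALUES (1, 10, 'Speedy Express'),
                                   (2, 20, 'United Package');
    INSERT INTO Sales_Fact VALUES (1, 4), (1, 6), (2, 9);
""")

# Join fact to dimension on the shared ShipperKey, then total the
# shipped quantity per shipper name and ID.
rows = conn.execute("""
    SELECT s.ShipperName, s.ShipperID,
           SUM(f.LineItemQuantity) AS TotalQty
    FROM Sales_Fact f
    JOIN Shipper_Dim s ON s.ShipperKey = f.ShipperKey
    GROUP BY s.ShipperName, s.ShipperID
    ORDER BY s.ShipperID
""").fetchall()
print(rows)
```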
ANALYSIS REQUIREMENTS
1. Operations of the script that populated the date dimension (DateDim) table.
The initial script used to populate the DateDim table created the table with 14 fields and the primary key. The primary key is the column that uniquely identifies each row of the table.
The next set of scripts consisted of table options set to OFF, such as STATISTICS_NORECOMPUTE, which (when OFF) actually allows automatic statistics recomputation to occur. The ALTER TABLE command places a foreign key in the Sales_Fact table that allows DateDim to be joined to Sales_Fact through a like-named primary key/foreign key relationship. The foreign key in the fact table references the primary key in the dimension table, so if data from one table is needed in conjunction with data from the other to make business intelligence decisions, all of the data can be extracted in a single .sql script with a JOIN ... ON clause.
The DECLARE statements define each variable's data type, such as integer, date, or time; this allows specific dates or date ranges to be queried using aggregate functions such as SUM and COUNT to quantify data within designated ranges. The initialization of the variables assigns each variable its starting value.
The next two SELECT statements define the starting date and ending date for the DateDim range. The SELECT statements defining year, quarter, month, day, and week give weekends a value of 0 so they will not be counted; SELECT @Year = DATEPART(year, @DT) defines the year starting at 1995 (i.e., only the year is defined, not the day or month). The IF statements assign values to variables, such as a day equaling 1 and a week equaling 7; this is important for future scripts that need to count by day or week when setting a date range to pull information from.
The IF statement that begins with (@Year % 4 = 0) checks whether the date falls in a leap year so that the correct number of days for the month and year is available; this matters when calculating business or working days for staff overhead expense or days of revenue. The INSERT DateDim statement, followed by the VALUES clause, inserts the fields and their values into the DateDim table. SELECT @DT = DATEADD(DAY, 1, @DT) then advances the current date by one day, so the loop walks through the entire range.
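The populator's day-by-day loop and leap-year check can be sketched in Python as an analogue of the T-SQL (the field names in the emitted rows are assumptions; the actual DateDim has 14 fields):

```python
from datetime import date, timedelta

def is_leap(year):
    # The same rule the T-SQL IF implements: divisible by 4, except
    # century years, which must also be divisible by 400.
    return (year % 4 == 0 and year % 100 != 0) or (year % 400 == 0)

def build_date_dim(start, end):
    """Loop from start to end one day at a time (the Python analogue of
    SELECT @DT = DATEADD(DAY, 1, @DT)), emitting one row per date.
    Weekend days get a 0 weekday value, mirroring the script's handling."""
    rows, dt = [], start
    while dt <= end:
        rows.append({
            "Date": dt.isoformat(),
            "Year": dt.year,
            "Quarter": (dt.month - 1) // 3 + 1,
            "Month": dt.month,
            "Day": dt.day,
            "Weekday": 0 if dt.weekday() >= 5 else dt.weekday() + 1,
        })
        dt += timedelta(days=1)
    return rows

rows = build_date_dim(date(1996, 1, 1), date(1996, 12, 31))
print(len(rows), is_leap(1996), is_leap(1900), is_leap(2000))
```

Because 1996 is a leap year, the loop emits 366 rows; a populator without the leap-year check would be off by one day in every fourth year, which is exactly the error the IF statement guards against.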
2. Operations of the script that populated the sales_fact table.
The initial INSERT INTO script inserts the fields into the table, and in this case the fields also have aliases. The inserted fields include d.DateKey, which is the primary key of the DateDim table; in the fact table the same column serves as a foreign key. (A key made up of multiple columns, by contrast, is called a composite key.) There were six fields in total inserted into the northwind_mart.dbo.Sales_Fact table.
The primary key is the key that establishes or denotes the primary identity of the table. A foreign key is a field (often a primary key field) from one table that is referenced or placed in another so that a JOIN, UNION, or INTERSECT can occur. Other keys include natural keys, which are based on naturally occurring data such as date of birth or Social Security number, as opposed to a generated numbering system for customer or vendor identification. Durable keys are keys that remain stable over time even when the underlying natural data changes or arrives late, so that rows populated or corrected later still link correctly through pre-existing joins or calculation scripts.
Joins, intersects, and unions are necessary when all of the required data does not reside in a single table and two or more tables are needed because of the fields each contains. For example, one table may contain only vendor demographics, another strictly product information, and a third only product cost; to determine the cost of products from a given vendor, it is necessary to draw at least three fields from three different tables, so a join, union, or intersect may be needed.
The script that begins with SUM(o.Freight) aggregates quantities, prices, and discounts by order so that the company can analyze them at the order level rather than checking manually line by line. This is an example of needing a join to acquire the data: an inner join was used so that rows appear only when matching data is found in both tables. The set of dimension table joins also uses left joins. A right or left join returns rows even when there is no match (a NULL) in the joined table; whether RIGHT or LEFT is appropriate is determined by where the table without matching rows appears in the join statement.
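The inner-versus-left-join distinction described above can be demonstrated with a tiny SQLite example (invented data; one order deliberately references a shipper key with no dimension row):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Orders (OrderID INTEGER, ShipperKey INTEGER);
    CREATE TABLE Shipper_Dim (ShipperKey INTEGER, ShipperName TEXT);
    INSERT INTO Orders VALUES (1, 1), (2, 99);  -- key 99 has no match
    INSERT INTO Shipper_Dim VALUES (1, 'Speedy Express');
""")

# INNER JOIN: the unmatched order is dropped from the result entirely.
inner = conn.execute("""
    SELECT o.OrderID, s.ShipperName FROM Orders o
    JOIN Shipper_Dim s ON s.ShipperKey = o.ShipperKey
    ORDER BY o.OrderID
""").fetchall()

# LEFT JOIN: the unmatched order is kept, with NULL for the shipper name.
left = conn.execute("""
    SELECT o.OrderID, s.ShipperName FROM Orders o
    LEFT JOIN Shipper_Dim s ON s.ShipperKey = o.ShipperKey
    ORDER BY o.OrderID
""").fetchall()
print(inner)
print(left)
```

This is why the fact-load script mixes join types: inner joins guarantee fully matched rows, while left joins preserve fact rows even when a dimension lookup fails.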
The next two sets of script spread the freight cost across the order where the order quantity is greater than zero (i.e., there must be something in the order) and the line item quantity is not null (i.e., the data is not missing or empty). The SUM statements give the company the ability to check by line and by order and to reconcile between the two, to ensure that nothing appears in the order quantity or price that should not be there, and that everything that should be there is.
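A minimal sketch of the freight-spreading idea, assuming a proportional-to-quantity allocation rule (the function name and the exact rule are illustrative assumptions, not the script's literal text):

```python
def spread_freight(order_freight, line_quantities):
    """Allocate an order's freight across its lines in proportion to
    each line's quantity. Lines with a missing (None) or zero quantity
    receive no freight, mirroring the script's guards that the order
    quantity be greater than zero and the line quantity not be null."""
    total = sum(q for q in line_quantities if q)
    if not total:  # nothing in the order: nothing to spread
        return [0.0] * len(line_quantities)
    return [round(order_freight * (q or 0) / total, 2)
            for q in line_quantities]

# Example: $30 freight over lines of quantity 2, 4, and one null line.
shares = spread_freight(30.0, [2, 4, None])
print(shares, sum(shares))
```

Because the shares sum back to the order-level freight, the line-level and order-level SUMs reconcile, which is the check the verification screenshots perform.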
3. Assignment experience
This assignment was difficult for me, as I am fluent in neither SQL nor database design. I did a significant amount of reading in the course text, in texts from previous courses, and in online tutorials. As an older student I continue to find it difficult to take courses such as this in an online format, and because it is a shorter course the material is compressed. I learned about commenting within scripts, became more comfortable writing and joining in script, and am beginning, I believe, to visualize how tables need to be connected in order for information to be pulled. As an RN, I came to the realization that a significant number of the difficulties that occur in electronic health records stem from not being able to pull data across tables from different portions or modules of the record, creating the need to perform redundant work. There is also currently no ability to retrieve clinical data and automatically link it with purchasing or cost data, which would provide real-time information on the costs and revenues of procedures or care provided.