Group by Operations

School of Computing, Engineering and InformationSciences
University of Northumbria

‘Group By’Operations

Aim:

To practise SQL retrievals that implement the equivalent of the GroupByrelational algebra operation.

Outline of Session:

Do some queries that require the SQLGROUP BYphrase to produce results.

Do some queries that implement the SQL equivalent of GroupBy with other relational algebra operations.

GROUP BY

This operation splits a relation into several groups of tuples, i.e. into several sub-relations. The grouping uses one or more attributes – the ‘grouping attributes’ – and splits the relationinto groups such that each group contains the same value in the grouping attribute(s).

Each group/sub-relation is treated as the source for a single tuple in the resultrelation.

Each tuple in the result relation consists of :

value(s) of the grouping attribute(s);
attribute(s) whose value(s) are calculated by ‘aggregating’ the values of other, specified attribute(s) in the group/sub-relation : more than one aggregation can be carried out in a GroupBy operation.

In SQL, a GroupByoperation is expressed by :

writing the grouping attribute(s) and the‘aggregation’ in the SELECT phrase,
writing aGROUP BYphraseafter the FROM and any WHERE phrases : this phrase is used to specify the grouping attributes.

The following example shows how one can get the average salary of each group of employees who all have the same marital status.

SQL> SELECT MARITAL_STATUS, AVG(SALARY) AS AVERAGE_SAL

2 FROM EMP

3 GROUP BY MARITAL_STATUS;

This retrieves a table that has 2 columns, called MARITAL_STATUS and AVARAGE_SAL, and which has one row for each of the marital status values existing in table EMP.

AVG is an aggregate function which returns the average (arithmetic mean) value of the values in the column which is its argument (in this case SALARY), i.e. for all thatcolumn’svalues in the group of rows. Other aggregate functions include SUM, MAX, MIN and COUNT.

COUNT is used rather differently from the other aggregate functions; it returns the number of values in a column. For most purposes it does not matter which column is used,and socomplete rows are countedinstead. To do this we writeCOUNT (*),rather than COUNT ( SALARY ) say.

Note that in the above example, a name has been give after the keyword AS to name the new column that contains the average salary. If the ‘AS column-name’ is not provided, then the new column will receive a default name consisting of the expression that determines the aggregation, e.g. in this case AVG( SALARY ). Therefore it is good practice to give the new column a name.

In all the above cases, the aggregationused every value in every row in the column. Sometimes we need the aggregation to apply to the number of different values in a column instead. Remember that unless the column being aggregated is a primary key column, there can be duplicate values in a column. To achieve this, we can putDISTINCT in front of the column name,e.g.

SQL> SELECT MARITAL_STATUS, COUNT( DISTINCT SALARY )
2 AS NO_OF_SALS

3 FROM EMP

4 GROUP BY MARITAL_STATUS;

This returns how many different salary values occur in each MARITAL_STATUS group.

Note that we cannot write COUNT( DISTINCT * ), but then this is logically unnecessary anyway.

‘GROUP BY’ EXERCISES

Exercise: execute the above example SQL queries to satisfy yourself that they work, and give the correct answer.

Exercise: execute the following query.

SQL> SELECT AVG( SALARY ) AS AVERAGE_SAL
2 FROM EMP
3 GROUP BY MARITAL_STATUS;

What is the problem with the answer returned for this query ?

Exercise:develop and execute retrievals for the following queries. Make sure each aggregate column has its own proper name. Check each answer returned to ensure that it is correct.

Get the total salary bill for each marital status group.
How many employees work on each project ?
How many projects does each employee work on ?
For each salaryvalue paid out, how many people receive that salary ?
How many people work in each department ?
Get the total salary bill, and the average individual salarypaid out, for each department. (Do this in one query).
For each department, what is the minimum and maximum salary paid out to an employee ? (Do this in one query).

SQL> SELECT MARITAL_STATUS
2 FROM EMP
3 GROUP BY MARITAL_STATUS;

What does the output from this query tell us ?

How else could a query be written in SQL to give the same result?

SQL> SELECT SALARY, AVG ( SALARY ) AS AVERAGE_SAL
2 FROM EMP
3 GROUP BY MARITAL_STATUS;

Why do you get an error in response to this query ?

‘GROUP BY’ EXERCISES THAT INCLUDE THE SQL EQUIVALENTS OF OTHER ALGEBRA OPERATIONS

Sometimes, in order to answer a query, we need to Join tables together and do the GroupBy on the result of the Join; or we first need to remove rows from a table with a Restrict operation and do the GroupBy on the result of that; or we need to do both a Join and a Restrict before doing the GroupBy.

Exercise: why do we rarely need to Projectout columns before doing a GroupBy ?

Here are some queries that will require a Join and/or Restrict before the GroupBy :-

Exercise: check each answer returned to ensure that it is correct :

1.Get the total salary bill for each project.

2.For each project, get the minimum and maximum salary paid out to someone working on that project.
(Do this in one query).

3.Count the number of department managers who fall into each marital status group.

4.What is the average salary of those employees who work in departments ‘D1’ and ‘D2’ ?

5.For each salary value paid out that is above £15,000, how many people earn that amount ?

6.Ignoring people who earn less than £16,000, how many people work on each project ?

7How many employees work on each project that has a deadline after ’01-Mar-2008’ ?