The role of measurement in software and software development
Tor Stålhane, Ph.D.
SINTEF Informatics and Telematics
Why do we need measurement
The software industry - like all other industries - has as its main goal to deliver products and services to the market. This must be done at a price that the customer accepts as reasonable and at a quality that the customer wants. One of the main characteristics of the market is that it changes, not by itself but through the competition among those who provide the goods that are sold. The changes will not happen at the same time for all industries, but only when the players can no longer expand in any other way. The main results of this competition are lower prices and higher quality - not because the producers want it to be this way, but because it is the only way in which they can survive.
The software industry, like any other industry, can improve only in one way: by understanding the product and the process that produces it. Thus, we can define the first goal of software measurement.
We need measurement in software development to improve the development process so that we can increase product quality and thus increase customer satisfaction
In addition to this long-term need, we also need measurement in order to agree on the quality of a delivered product. The time when the customer in awe watched the blinking lights is gone forever. The customers now want quality and as a result of this we need to be able to establish a common ground for specifying, developing and accepting a software product with an agreed quality.
Even though we accept the ISO definition of quality as the product’s degree of satisfying the customer’s needs and expectations, we need something more concrete when we start to write a contract and develop a software system. The definition given in the standard ISO 9126 is a good starting point even though it concentrates on the product and leaves out important factors such as price, time of delivery and quality of service. Thus, our second need is
We need measurement on software so that we can understand and agree on product quality
The rest of this article will focus on the following ideas:
· What are the challenges when we want to improve the software process and the product quality through the use of measurement?
· How can we meet these challenges - what can we do and who will do it?
· What has been done up till now - national and international experiences.
· Where do we go from here - the future for Norwegian software industry.
The Challenges
“You cannot understand what you cannot measure”, said Lord Kelvin. Some unknown but probably frustrated software engineer has added the corollary: “You cannot measure what you cannot understand”. Both views contain some truth, and all work on understanding the software process and aspects of software quality must relate to both.
The main problem with the idea that we need to measure in order to understand is simply: “What shall we measure to understand what?” To take a practical example: given that we want to understand why there are so many errors in a software system, what should we measure? The suggestions have been - and still are - legion, from number of code lines via McCabe’s cyclomatic number to Halstead’s software science. Unfortunately, none of them has brought us much nearer a good understanding of anything. The reason lies in the approach taken. The research has often moved along the following lines:
1. Get some inspiration and define a set of software metrics.
2. Collect the software metrics and the product characteristic that they are supposed to predict, for instance the number of errors.
3. Do some statistical analysis - mostly regression analysis, although other, more sophisticated methods such as principal component analysis have also been tried.
4. Look at the results and claim that any metric that correlates with the characteristic we want to predict must - ipso facto - be one of its causes.
Statisticians call this particular brand of research “shotgun statistics” and consider it a waste of resources. In addition, it lacks one important component, namely control over - or measurement of - the influence of the environment, both the project’s environment and the company’s environment in general.
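The pitfall is easy to demonstrate with a small simulation (a hypothetical sketch in Python, assuming numpy is available; all numbers are invented): even metrics that are pure noise will occasionally correlate “significantly” with the fault count when we test enough of them.

```python
import numpy as np

rng = np.random.default_rng(42)
n_modules, n_metrics = 30, 20

# Twenty "metrics" that are pure noise - they measure nothing at all.
metrics = rng.normal(size=(n_metrics, n_modules))
# A fault count per module, generated independently of every metric.
faults = rng.poisson(lam=5.0, size=n_modules)

# Approximate 5% two-sided critical value of |r| for n = 30 observations.
R_CRIT = 0.361

corrs = [np.corrcoef(m, faults)[0, 1] for m in metrics]
hits = sum(1 for r in corrs if abs(r) > R_CRIT)

# With a 5% test applied 20 times, about one spurious "predictor" is expected.
print(f"{hits} of {n_metrics} noise metrics correlate 'significantly' with faults")
```

Any metric that passes here does so by chance alone; without a prior hypothesis, a correlation table like this proves nothing.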
Some solutions
All sound science has started out with the practitioners. The approach has always been to start with the observations and knowledge of the practitioners, then to systematise their experience, deduce hypotheses and collect data in order to accept or reject these hypotheses. As part of this, it has also been considered important to agree on key definitions in order to have a common vocabulary and a common frame of reference. There are several important lessons for software engineering here:
· By starting with the practitioners, we make sure that all or most of the available knowledge is included - or at least considered - when we start to build our models.
· By starting with the formulation of hypotheses, we get a data collection process that is goal driven and thus easy to motivate.
· When we reject or accept a hypothesis, we always increase the available body of knowledge.
In this way we are able to accumulate knowledge and models, share them and discuss them with colleagues - in short: create a real software science, or at least a basis for one.
However, one last obstacle has to be surmounted - the fact that software engineering is not a natural science on a par with chemistry or physics. Software is developed by people. In some sense, this is part of the environment problem, since the strong dependence on the software engineers’ skills, experience and knowledge makes it difficult to do repeatable experiments. This means that we must in some way include the vast body of knowledge already available in psychology and sociology. Without this, we miss an important component of software engineering and our understanding will forever be incomplete. Even if we cannot model the developers, we need to consider them in our models, for instance as a source of variance or uncertainty.
To sum up, we need to embark on the following program:
1. Collect the available expert knowledge from the software development community.
2. Define terms in order to improve communication.
3. Formulate hypotheses based on available knowledge and experiences.
4. Collect data, perform statistical tests and accept or reject the hypotheses.
5. Incorporate this knowledge into an agreed body of knowledge.
6. Use the body of knowledge to build models that can be used to predict the effect of changes to a development process concerning cost, lead time and product quality.
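Step 4 of the program above can be sketched in a few lines (hypothetical data, Python with numpy assumed). Suppose the hypothesis is that inspected modules have fewer delivered faults; a simple permutation test lets us accept or reject it without any distributional assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical fault counts per module; both samples are invented for illustration.
inspected = np.array([2, 1, 3, 0, 2, 1, 2])
not_inspected = np.array([4, 3, 5, 2, 6, 3, 4])

# H1: modules that were not inspected have more faults on average.
observed = not_inspected.mean() - inspected.mean()

pooled = np.concatenate([inspected, not_inspected])
n = len(inspected)
B = 10_000

# Under H0 the group labels are arbitrary, so we reshuffle them repeatedly
# and count how often a difference at least as large arises by chance.
count = 0
for _ in range(B):
    perm = rng.permutation(pooled)
    if perm[n:].mean() - perm[:n].mean() >= observed:
        count += 1

p_value = count / B
print(f"observed difference = {observed:.2f}, permutation p-value = {p_value:.4f}")
```

Whether the hypothesis is then accepted or rejected, the outcome goes into the agreed body of knowledge (step 5) - either way, knowledge is gained.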
Do we really need all this? The answer is “Yes”. Today, the software industry does not even have an agreed-upon definition of productivity. Considering this, it cannot come as a surprise that we have problems when we discuss whether a certain tool or method has increased productivity.
What has been done
Even though there is much to criticise in what has been done in the past, the situation is not all bad. In the last ten years, several research communities have come up with important ideas - the ami project, the GQM method for data collection, more interaction with other industries, especially related to the TQM concepts, better knowledge of how to run experiments in an industrial setting, and so on. Unfortunately, the experience gained through design of experiments (DoE) and the work of Taguchi has largely been ignored by the software industry up till now. The same goes for G. Box’s work on the analysis of non-repeatable experiments.
The EU initiative ESSI has been one of the main driving forces in introducing measurement into software development and improvement in Europe. The ESSI projects - the PIEs - have enabled the Commission to collect a large amount of data, measurements and experience that will hopefully someday serve as a basis for real research on software processes and software quality.
Below, we sum up some results that we consider important from improvement work we have done ourselves. We describe three experiments: the reason why each experiment was performed, the strategy used for data collection and analysis, and the most important results as seen from the point of view of the software industry. The three experiments can be summarised as follows:
1. A survey of 100 Norwegian companies where the persons responsible for software procurement were asked to rank a set of product and service quality characteristics on a five-point scale.
2. Analyses of data from integration and system tests from a telecom product in order to see which factors influenced the number of faults in a software component at the time of delivery.
3. Analyses of data from an experiment comparing traditional development with object-oriented development. The important questions were whether object-oriented development led to fewer errors and a more efficient error correction process.
The PROFF project
The PROFF project was a joint undertaking by several Norwegian software producers. The goal of the project was to improve quality and productivity within this industrial segment. The project had a subproject devoted to product quality estimation and control, to a large degree based on ISO 9126. One of the things that we needed to find out at an early stage was how customers ranked the quality characteristics of the ISO model. In order to find out, we performed a survey of 100 Norwegian companies. After having been contacted by telephone, they received a questionnaire where they were asked to give a score to each ISO 9126 product quality factor. In addition, we included a set of service quality factors, adapted from an early version of the ISO standard for service quality - ISO 4100. The analysis was performed by first ranking the results and then applying the Friedman statistic to the ranks. The results were significant at the 5% level.
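The first analysis step - turning each respondent’s raw scores into ranks - can be sketched as follows (a hypothetical Python helper, assuming numpy; the data are invented). Tied scores receive the average of the ranks they occupy, as is usual before applying the Friedman statistic.

```python
import numpy as np

def average_ranks(scores):
    """Rank one respondent's scores: rank 1 = most important.

    Tied scores receive the average of the ranks they would occupy.
    """
    scores = np.asarray(scores, dtype=float)
    # Stable sort in descending score order; position = provisional rank.
    order = np.argsort(-scores, kind="stable")
    provisional = np.empty(len(scores))
    provisional[order] = np.arange(1, len(scores) + 1)
    # Replace each group of tied scores with its average provisional rank.
    ranks = np.empty_like(provisional)
    for value in np.unique(scores):
        tied = scores == value
        ranks[tied] = provisional[tied].mean()
    return ranks

# A hypothetical respondent scoring four factors on the 1-5 scale:
print(average_ranks([5, 3, 3, 1]))  # ranks: 1, 2.5, 2.5, 4
```

Applying this row by row to the questionnaire scores yields the rank table that the Friedman test operates on.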
Overall results
Each respondent scored a set of factors on a scale from 1 (not important) to 5 (highly important). First of all, we looked at the overall score, i.e. the ranking of the factors we got when pooling the responses from all price and product categories. The score for each factor was computed as the average of the scores for all criteria related to that factor. This gave the following results:
Scores for all product categories pooled together

Factor / Score
Service responsiveness / 2.81
Service capacity / 2.74
Product reliability / 2.71
Service efficiency / 2.65
Product functionality / 2.60
Service confidence / 2.60
Product usability / 2.57
Product efficiency / 2.46
Product portability / 2.05
Product maintainability / 1.89
When looking at the results of this survey we found some rather surprising things - at least they surprised us. The most important finding was that the three most important determinants in a buy / no-buy situation were service responsiveness, service capacity and product reliability - in that order.
If we split up the data set according to type of product, we got the following results.
Scores according to product category

Factor / COTS (36) / Standardized software packages (27) / Customized and tailored software (19)
Product functionality / 2.55 / 2.65 / 2.62
Product reliability / 2.65 / 2.77 / 2.74
Product usability / 2.67 / 2.47 / 2.54
Product efficiency / 2.47 / 2.35 / 2.58
Product maintainability / 1.68 / 1.94 / 2.32
Product portability / 2.01 / 2.04 / 2.09
Service confidence / 2.55 / 2.59 / 2.67
Service efficiency / 2.64 / 2.67 / 2.61
Service capacity / 2.69 / 2.74 / 2.84
Service responsiveness / 2.82 / 2.83 / 2.76
In all cases, the producer’s service responsiveness and service capacity are considered the most important factors, while product maintainability and product portability are considered the least important ones. The only product factor that consistently scores high is product reliability.
Statistical method used
We based the tests of the hypotheses on the Friedman statistic, described below; see for instance Lehmann (1975). The method can be described by using the following table:
Example of Friedman’s rank statistics

Factor / Category 1 / Category 2 / Category 3
F1 / R11 / R21 / R31
: / : / : / :
FN / R1N / R2N / R3N
Rank sum / R1. / R2. / R3.
Each category is given a rank score in the range 1 to 3 – 1 for the most important and 3 for the least important. If one of the categories consistently receives the best score, the rank sum for this column will be close to N, while a category that is consistently given bad scores will have a rank sum close to 3N. If there are no differences, the rank sums of all the columns will tend to be equal. Let N be the number of rows in the table, let s be the number of categories (columns), and let R.j be the rank sum of column j. We can then compute the Friedman statistic Q as follows:

Q = [12 / (N·s·(s+1))] · Σj (R.j – N(s+1)/2)²

The N(s+1)/2-term is the expected rank sum in the case that all observations – rankings – were completely random. The Q-value thus measures the difference between the expected and the observed rank sums. As an approximation, Q is Chi-square distributed with s–1 degrees of freedom. We will reject the hypothesis of no inter-column difference at the α-level if Q > c, where c is the upper α-percentile of the Chi-square distribution with s–1 degrees of freedom.
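As a sketch (Python with numpy assumed; the rank table is invented for illustration, and the usual tie correction is ignored), the Friedman statistic can be computed directly from a table of within-row ranks:

```python
import numpy as np

def friedman_q(rank_table):
    """Friedman statistic Q for an N x s table of within-row ranks 1..s."""
    ranks = np.asarray(rank_table, dtype=float)
    N, s = ranks.shape
    col_sums = ranks.sum(axis=0)       # rank sum R.j for each category
    expected = N * (s + 1) / 2.0       # expected rank sum under randomness
    return 12.0 / (N * s * (s + 1)) * ((col_sums - expected) ** 2).sum()

# Hypothetical table: four factors ranked across three product categories.
table = [[1, 2, 3],
         [1, 3, 2],
         [1, 2, 3],
         [2, 1, 3]]

q = friedman_q(table)                  # 4.5 for this table
CHI2_CRIT_5PCT_DF2 = 5.991             # upper 5% point, s - 1 = 2 d.o.f.
print(f"Q = {q:.2f}; reject H0 at 5%? {q > CHI2_CRIT_5PCT_DF2}")
```

Here the first category is clearly favoured, yet Q stays below the 5% critical value: with only four rows, the evidence is not strong enough to reject the hypothesis of no inter-column difference.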