THE TEXTUAL DATA ANALYSIS FOR EXPLORING THE NARRATIVE CONTENT OF CORPORATE REPORTING DOCUMENTS

Gianluca Ginesti

Department of Economics, Management, Institutions

University of Naples "Federico II" (Italy)

Riccardo Macchioni

Department of Economics

Second University of Naples (Italy)

Marco Maffei

Department of Economics, Management, Institutions

University of Naples "Federico II" (Italy)

ABSTRACT

Core (2001) and Berger (2011) call for new natural language techniques to examine corporate documents, in order to extend the boundaries of the empirical literature on disclosure.

We respond to this call for new research on the natural language approach by testing a technique that provides a synthetic and graphical representation, through textual data reduction, of the characterization of the entire text used in corporate narrative documents, such as the Management Commentary. We test our analysis on the outlook section of the Management Commentary using a sample of 50 Italian listed firms. Our first results suggest that only a few firms provide textual information useful for external users to understand the firm's outlook.

______

*Although this study is the joint result of discussion among all authors, paragraph 2 (the textual data analysis) and the related graphics were developed by Maria Spano and Nicole Triunfo - Department of Economics and Statistics, University of Naples "Federico II".

** Corresponding author: Gianluca Ginesti. Email: , .

1. INTRODUCTION

Textual data analysis is a statistical methodology which allows the examination of large amounts of information contained in a text. To this purpose, textual data analysis carries out a set of operations on the corpus of a text and provides a syntagmatic representation of the words, i.e., the words on the syntagmatic axis are considered in connection with all other words in the same text (Bolasco, 2005).

In this paper, we analyze the text of narrative corporate disclosure.

It is well argued that narrative corporate disclosure is fundamental in driving investors' decision-making (Shi Yun Seah and Tarca, 2006). However, the literature also emphasizes that narrative disclosure might leave room for discretion to firms, in terms of what information they provide and how it is reported. With reference to the latter, over the years an increasing number of studies have paid attention to the narrative sections of corporate disclosure by examining the text (Leavy et al., 2010). Some studies examined either the readability (Courtis, 1995; Clatworthy and Jones, 2001; Li, 2008) or the tone of corporate disclosure (Davis and Tama-Sweet, 2012). With specific regard to the accounting literature, Li (2008) introduced natural language processing approaches by investigating the relatively simple issue of how annual report readability is associated with firm performance and earnings persistence. This paper provided crucial evidence on the relevance of computational linguistics for assessing basic aspects of disclosure in relation to managers' motives to obfuscate (negative) performance. This field of research employs natural language processing techniques to capture aspects of disclosure not readily measured by other means. Thus, the natural language approach should be intended as a way to analyze reporting quality (Berger, 2011). Berger (2011) points out that prior research has some limitations, in terms of the multiple ways of measuring disclosure readability and tone, and the lack of agreement on which text would be most useful to extract from corporate narrative reports in order to assess something of interest. Therefore, Berger (2011) agrees with Core (2001) in calling for new techniques, coming from other fields of research, to extend the boundaries of the existing empirical literature on corporate narrative disclosure.

This study responds to the above-mentioned call for new research by testing a technique to examine the text of narrative corporate disclosures which, to the best of the authors' knowledge, has not yet been used in previous empirical studies. This study carries out exploratory empirical research on a sample of Italian listed companies. We focus on the Management Commentary, because this report is mainly narrative, as it should display the objectives and strategies of the firm (IFRS, Practice Statement - Management Commentary, 2010). To explore the new technique, we examined the section of the Management Commentary that provides outlook information. Indeed, this section offers a unique setting to examine forward-looking information, which is mainly qualitative in nature (Lajili and Zéghal, 2005). Moreover, a number of reasons justify our choice to examine the outlook section of the Management Commentary. In this regard, it is well acknowledged that forward-looking information is one of the most important parts of financial reporting. However, several questions still persist as to whether this disclosure is truly informative (Verrecchia, 2001; Li, 2010).

Moreover, it should be highlighted that, although such forward-looking information is mandatory in Italy, the requirements are in practice very vague and not detailed. Thus, preparers have a certain degree of discretion in terms of the information to provide (Quagli, 2004; Beretta and Bozzolan, 2004), and this also affects the semantic dimensions of a text, which can have latent or hidden meanings. Our technique allows detecting the latent semantic dimensions of the text of the outlook section through a reduction of the dimensionality of the representational space. The latent semantic dimensions of the texts are subsequently interpreted through reference to control variables, either financial or non-financial.

The remainder of this paper is organized as follows. Section 2 explains the textual data analysis used in this study. Section 3 reviews the existing literature. Section 4 describes the sample and the variables. Section 5 shows the results. Section 6 concludes the study.

2. THE TEXTUAL DATA ANALYSIS

2.1 METHODOLOGY

This paper uses the Canonical Correspondence Analysis (CCA) of ter Braak (1986), which is a development of the Correspondence Analysis (CA) of Benzécri (1973).

The main difference between CCA and CA concerns the determination of the factorial axes. CA is an unconstrained ordination-based method, and its factorial axes are interpreted in the following ways: i) the researcher needs external knowledge to interpret the factorial axes; ii) the researcher might perform a multiple regression analysis to interpret the factorial axes; iii) the researcher might calculate the correlation coefficients between the factors and further variables not used in the creation of the factorial axes. Instead, CCA imposes the restriction that the factorial axes are a linear combination of the variables chosen by the researcher, which thus contribute to the creation of the factorial axes.

Therefore, CCA can be defined as a correspondence analysis constrained to the subspace generated by the variables, onto which texts and words are projected; the maximum number of dimensions (factors) that can be represented is at most equal to the number of variables used in the analysis, whether these are quantitative and/or categorical. For example, considering n firms, as the number of variables increases the correspondence analysis becomes progressively less constrained, up to the limit case in which the number of variables q ≥ n − 1 and the CCA is consequently nothing more than a CA.

CCA is based on the analysis of a quantitative matrix containing the variables and another matrix known as the lexical matrix. The latter is a rectangular matrix of size n × m, where n is the number of reports and m the number of words in those reports.
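As an illustration, the lexical matrix can be built by counting word frequencies report by report. The following sketch uses three toy report excerpts and a tiny vocabulary, purely as placeholders (they are not drawn from our sample):

```python
# Sketch: building the n-by-m lexical (document-term) matrix described above.
# The reports and vocabulary below are illustrative placeholders only.
from collections import Counter

reports = [
    "growth expected in revenues and margins",
    "uncertainty expected in revenues next year",
    "stable outlook with growth in margins",
]

# Vocabulary: the m distinct words across the n reports.
vocab = sorted({w for text in reports for w in text.split()})

# Lexical matrix Y: Y[i][k] is the frequency of word k in report i.
Y = [[Counter(text.split())[w] for w in vocab] for text in reports]

for row in Y:
    print(row)
```

A presence/absence version, as mentioned above, is obtained by replacing each frequency with 1 if it is positive and 0 otherwise.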

2.2 THE ALGORITHM

Considering a sample of n reports relating to the same number of firms, the frequency or the presence/absence (presence = 1, absence = 0) of m words and the values of q variables (q < n) are quantified. y_ik is the frequency or the presence/absence of word k in report i, and z_ij is the value of variable j for company i.

The first step of the gradient analysis is to summarize most of the variability of the words by ordination. Starting from the assumption that the relationship between the words and the performance indicators follows a Gaussian response curve, Gauch et al. (1974) propose a technique called Gaussian Ordination. The response model for the words is therefore represented by the bell-shaped function:

E(y_{ik}) = c_k \exp\left( -\frac{(x_i - u_k)^2}{2 t_k^2} \right)   (1)

where E(y_ik) represents the expected value of y_ik for report i, whose score on the ordination axis is x_i. The parameters for word k are: c_k, the maximum of the response curve of the word; u_k, the mode, i.e. the value of x at which the response is maximum; and t_k, the tolerance, which controls the width of the curve.
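Eq. (1) can be evaluated numerically as follows; the parameter values are purely illustrative (they are assumptions, not estimates from our data):

```python
import math

def gaussian_response(x, c_k, u_k, t_k):
    """Expected word frequency under the bell-shaped response model of Eq. (1):
    E(y_ik) = c_k * exp(-(x - u_k)^2 / (2 * t_k^2))."""
    return c_k * math.exp(-((x - u_k) ** 2) / (2 * t_k ** 2))

# Illustrative values: the curve peaks at the mode u_k with maximum c_k,
# and falls off as the report score x moves away from u_k.
c_k, u_k, t_k = 5.0, 0.0, 1.0
print(gaussian_response(u_k, c_k, u_k, t_k))        # at the mode: the maximum c_k
print(gaussian_response(u_k + t_k, c_k, u_k, t_k))  # one tolerance away: lower
```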

The next step is to perform a multiple regression analysis, linking the same axis with the indicators:

x_i = b_0 + \sum_{j=1}^{q} b_j z_{ij}   (2)

where b_0 is the intercept, b_j is the regression coefficient of the j-th performance indicator, and x_i is the score of report i on the ordination axis. Note that in the first phase the scores on the ordination axis are obtained from the matrix containing the data on the frequency of words in the reports; subsequently, the regression coefficients b_j are estimated keeping the values x_i fixed.

Therefore, the words are indirectly linked to the variables through the ordination axes. Although this two-step technique, called Gaussian canonical ordination by ter Braak (1985), is statistically more rigorous, from a computational point of view it is also very expensive. For this reason ter Braak, by showing that correspondence analysis approximates the maximum likelihood solution of the Gaussian ordination, introduced canonical correspondence analysis as a heuristic approximation of the Gaussian canonical ordination.

The considerations leading to this approximation are realized in the transition formulas of the CCA (ter Braak, 1986):

\lambda u_k = \frac{\sum_{i=1}^{n} y_{ik} x_i}{y_{.k}}   (3)

x_i^{*} = \frac{\sum_{k=1}^{m} y_{ik} u_k}{y_{i.}}   (4)

b = (Z' R Z)^{-1} Z' R x^{*}   (5)

x = Z b   (6)

where y.k and yi. are respectively the column and row marginal totals of the matrix of word frequencies in the reports; R is a diagonal matrix of size n × n with generic element yi.; Z = {zij} is a matrix of size n × (q+1) containing the values of the performance indicators and a column of ones; and b, x, x* are three column vectors: b = (b0, b1, ..., bq)', x = (x1, ..., xn)' and x* = (x1*, ..., xn*)'. The transition formulas define an eigenvector problem similar to that of correspondence analysis, where λ is the eigenvalue.

This problem can be solved by using the following iterative algorithm:

- Step 1: Assign arbitrary initial scores to the reports;

- Step 2: Calculate the scores of the words as weighted averages of the scores of the reports (Eq. 3 with λ = 1);

- Step 3: Calculate new report scores as weighted averages of the scores of the words (Eq. 4);

- Step 4: Obtain regression coefficients by a weighted multiple regression of the report scores on the performance indicators (Eq. 5), where the weights are the marginal totals yi. of the reports;

- Step 5: Calculate new report scores using Eq. 6. The new scores are the fitted values of the regression of the previous step;

- Step 6: Standardize the new scores, for example to zero weighted mean and unit weighted variance;

- Step 7: Stop on convergence, for example when the new scores of the reports are sufficiently close to those of the previous iteration; otherwise return to Step 2.

The algorithm is analogous to that of the correspondence analysis, but steps 4 and 5 are additional. The second and subsequent axes of the CCA are also linear combinations of quantitative and categorical variables, which maximize the dispersion of words but are constrained to be uncorrelated (orthogonal) with the previous axes.
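The iterative scheme above can be sketched as follows for a single (first) axis; the lexical matrix Y and indicator matrix Z are randomly generated toy data, so the dimensions and values are assumptions made only for illustration:

```python
# Sketch of the iterative CCA algorithm (Steps 1-7) for the first axis,
# on randomly generated toy data (not the study's sample).
import numpy as np

rng = np.random.default_rng(0)
n, m, q = 6, 8, 2                          # reports, words, indicators (q < n)
Y = rng.integers(1, 4, size=(n, m)).astype(float)          # lexical matrix
Z = np.column_stack([np.ones(n), rng.normal(size=(n, q))]) # ones + indicators

r = Y.sum(axis=1)                          # row marginals yi.
c = Y.sum(axis=0)                          # column marginals y.k
R = np.diag(r)

x = rng.normal(size=n)                     # Step 1: arbitrary initial report scores
for _ in range(100):
    u = Y.T @ x / c                        # Step 2: word scores (Eq. 3, lambda = 1)
    x_star = Y @ u / r                     # Step 3: new report scores (Eq. 4)
    b = np.linalg.solve(Z.T @ R @ Z, Z.T @ R @ x_star)  # Step 4: weighted regression (Eq. 5)
    x_new = Z @ b                          # Step 5: fitted values as new scores (Eq. 6)
    x_new = x_new - np.average(x_new, weights=r)            # Step 6: standardize to
    x_new = x_new / np.sqrt(np.average(x_new**2, weights=r))  # weighted mean 0, variance 1
    if np.max(np.abs(x_new - x)) < 1e-10:  # Step 7: stop on convergence
        break
    x = x_new

print(np.round(x_new, 4))                  # constrained report scores, first axis
```

Because the scores in Step 5 are fitted values of the regression on Z, the resulting axis is, by construction, a linear combination of the indicators.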

The final regression coefficients are called canonical coefficients, and the multiple correlation coefficient of the final regression is defined as the words-indicators correlation, and measures how much of the variability in the lexical structure can be explained by the variables. Looking at the signs and the values of canonical coefficients we can determine the importance of each variable in predicting the lexical structure.

2.3 GRAPHICAL REPRESENTATION

The graphical representation of CCA is a tri-plot, which allows the simultaneous display of texts, words, and quantitative and/or categorical variables. Graphically, points represent words and texts, while arrows represent quantitative variables. The similarity (or dissimilarity) between word points, between text points, or between word points and text points should be evaluated in terms of the distance among them; hence, the smaller the distance between two projected points, the more similar are the two words or reports (and therefore the firms to which they refer).

The correlation between two quantitative variables coincides with the cosine of the angle formed by the two vectors that represent them: the smaller the angle between the two vectors, the more the variables are correlated.
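This reading of the tri-plot can be verified numerically: for mean-centred variables, the cosine of the angle between their vectors equals their Pearson correlation. The two variables below are invented for illustration:

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors: dot product over the
    product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Two centred (mean-zero) toy variables pointing in the same direction:
# a zero angle (cosine = 1) corresponds to perfect positive correlation.
roe  = [-1.0, 0.0, 1.0]
size = [-2.0, 0.0, 2.0]
print(round(cosine(roe, size), 4))  # 1.0
```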

3. ASSESSMENTS OF PRIOR LITERATURE

Previous research has addressed the issues of corporate disclosure through different approaches and methods: i) content analysis (Bryan, 1997; Beattie et al., 2004); ii) disclosure indexes (Cooke, 1989 and 1992; Wallace et al., 1994; Haniffa and Cooke, 2002); iii) survey rankings (Clarkson et al., 1999); iv) textual analysis (Shroeder and Gibson, 1990; Courtis, 1995; Li, 2008; Davis and Tama-Sweet, 2012).

Berger (2011) reviews recent literature on corporate disclosure, recognising the increasing importance of the natural language approach, which aims at examining the information provided by firms.

Many studies analyzed the readability of the annual report and/or its components (Soper and Dolphin, 1964; Smith and Smith, 1971; Jones, 1998; Clatworthy and Jones, 2001). In this regard, it is worth highlighting that a major issue relates to the difficulty of defining the concept of readability. Some authors have argued that readability refers to the ease of understanding of a message due to the writing style used by the preparers of reports (Barnett and Leoffler, 1979), whereas understandability refers to the capability of the reader to comprehend the intended meaning (Smith and Taffler, 1992). However, the readability of a report is often associated with its understandability, and it has been considered an indicator of understandability (Adelberg and Razek, 1984; Courtis, 1986). Indeed, readability can reflect understandability because, if a text is simple to read, it will arguably be easier to understand (Smith and Smith, 1971).

A variety of techniques to analyze the readability of a text have been developed over the years. Many of these rely on readability formulas (or indexes), which are constructed using language style elements such as sentence length, word length, syllables and other vocabulary variables (Klare, 1974). According to Courtis (1987, p. 20), "the success of a formula in providing suitable direction depends on its ability to measure elements in the writing that are related to readers' comprehension". The most popular readability formulas within accounting studies are the Flesch reading ease (Barnett and Leoffler, 1979), the LIX and the Fog (Courtis, 1987).
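As an illustration of such formulas, the Gunning Fog index combines average sentence length with the share of "complex" words (three or more syllables): Fog = 0.4 × (words per sentence + 100 × complex words / words). The sketch below uses a crude vowel-group heuristic for syllable counting, which is an assumption of this example and not the procedure used in the studies cited above:

```python
# Minimal sketch of the Gunning Fog readability index:
# Fog = 0.4 * (words / sentences + 100 * complex_words / words).
# The syllable counter is a rough heuristic for illustration only.
import re

def count_syllables(word):
    """Rough syllable count: number of groups of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fog_index(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    complex_words = [w for w in words if count_syllables(w) >= 3]
    return 0.4 * (len(words) / len(sentences)
                  + 100 * len(complex_words) / len(words))

sample = "The company expects revenues to grow. Uncertainty remains considerable."
print(round(fog_index(sample), 2))
```

Higher values indicate text that is harder to read; the index is conventionally interpreted as the years of formal education needed to understand the text on a first reading.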

In this section we analyze some relevant studies which investigated the textual properties of accounting narrative disclosure. Notably, we mainly highlight the following information, where available in the papers: i) the methodology; ii) the country examined; iii) the results and the main methodological concerns.

Shroeder and Gibson (1990) examined three different aspects of readability (e.g., use of the passive voice, word length, sentence length) of the Management Commentary, analyzing a sample of 40 USA firms. They compared the readability of the Management Commentary to that of the Chairman's letter and the financial statement footnotes. The authors found that the Management Commentary is significantly less readable than the Chairman's letter.

Subramanian et al. (1993) analyzed the writing style of the annual reports of 60 USA listed firms with a software program, using the Flesch readability formula. The authors tested the relationship between annual report readability and firm performance. The findings suggested that the average readability level of the annual reports of profitable firms was higher than that of unprofitable firms.

Courtis (1995) used various indexes (Flesch, Fog and Lix) to examine the readability of the English-language annual reports of 32 Hong Kong public firms. The results provide evidence that the annual reports prepared by Hong Kong public firms are very difficult to read for the majority of the adult population living in Hong Kong.

Clatworthy and Jones (2001) analyzed the text used in 60 UK Chairman's statements. They showed that the introduction to the Chairman's statement is systematically easier to read than other parts of the same document. Furthermore, the results suggested that the thematic structure of this document is a key driver of the variability of annual report readability.

Rutherford (2003) analyzed whether poorly performing UK firms use textual complexity in the Management Commentary to obfuscate, and thus undermine effective communication and good governance. The main results suggest that poorly performing firms do not obfuscate their Management Commentary by resorting to textual complexity, and that there is no way to attribute this complexity to that of the underlying activities described.

Li (2008) examined the determinants and implications of the lexical properties of corporate disclosures. The author used two measures of readability: the fog index and the length of the report. The findings provide evidence that the annual reports of firms with poor performance are harder to read, and firms with annual reports that are easier to read have more persistent positive earnings.

Leavy et al. (2010) investigated the effects of the readability of corporate 10-K filings on the behavior of financial analysts. The authors calculated the overall readability of corporate 10-K filings using the Fog index. They provided evidence that analysts following firms with less readable disclosures need, on average, a longer time to issue reports in response to the 10-K filings. Furthermore, they found that less readable 10-K reports are associated with greater dispersion and lower accuracy in analyst earnings forecasts.

Goel et al. (2010) used text-analysis software to examine the readability of the annual reports of fraudulent and non-fraudulent firms. They found systematic differences in the communication and writing style of the annual reports of fraudulent and non-fraudulent firms. The results show that the reports of fraudulent companies are much harder to read and comprehend than those of non-fraudulent firms. Their analysis suggests that linguistic features can be used as an effective means of detecting fraud.

Davis and Tama-Sweet (2012), referring to a large sample of firms, investigated managers' language across alternative communication outlets. The authors used a textual-analysis approach to analyze the content of earnings press releases and the corresponding content of the Management Commentary. They found that firms exhibit lower levels of pessimistic language and higher levels of optimistic language in earnings press releases than in the Management Commentary. They also found a negative association between the level of pessimistic language in the Management Commentary and future firm performance, controlling for pessimistic language in the corresponding earnings press release.