Using quantile regression to forecast poverty rate from current monthly income variable
1 Introduction
This paper summarises the results of the internship I completed as an ENSAE student in the «revenu et patrimoine des ménages» division during summer 2013, under the guidance of Cédric Houdré. The work aimed to provide a method for forecasting at-risk-of-poverty rates from the French part of EU-SILC.
Specifically, given survey information for year N and survey and administrative data for year N-1, the goal is to forecast the at-risk-of-poverty rate for year N without any reference to administrative information on that year. The latest administrative data on income year N only become available in April N+1, so the at-risk-of-poverty rate is not disseminated until September N+1.
A method based on the computation of the recentered influence function was implemented to forecast the at-risk-of-poverty rate for year N from administrative and survey data on year N-1 and survey data on year N, in the hope that it would reflect the characteristics of the equivalised disposable income distribution better than a linear model.
2 Data
The French part of the SILC survey gathers information on metropolitan France. It provides information on two distinct statistical units: households and individuals. Households are cross-section units: each year of the survey provides data on a distinct sample of households. Individuals are panel-data units: for each individual belonging to a household of the sample, information is collected during nine consecutive years.
In the end, the SILC survey involves about 12000 households every year N, among which 10000 were already involved during year N-1.
For each year N, household-declared information in SILC survey provides :
- a main unit that gathers data on the sociodemographic characteristics of the household, its income during year N-1, its financial situation and its living conditions (through deprivation indicators) ;
- a secondary unit dealing with a specific focus on a topic that is renewed every year ;
- a third unit that focuses on a topic formerly treated in permanent surveys on living conditions, renewed every year with a three-year rotation.
Since the aim is to provide a method that can be reproduced from year to year, there is no point in using secondary unit and third unit data. Hence the information used to forecast the at-risk-of-poverty rate can only be drawn from the SILC main unit.
3 Method
This section focuses on two separate questions:
- the theoretical framework of this paper, specifically the use of the recentered influence function of the at-risk-of-poverty rate seen as a functional of the equivalised disposable income distribution ;
- some specific operations that had to be carried out to apply it to the forecasting of the at-risk-of-poverty rate from household-declared information in SILC.
3.1 Theoretical framework : influence function of a functional
The use of the recentered influence function of the at-risk-of-poverty rate, on which this paper is based, is mostly an adaptation of the paper by Firpo, Fortin and Lemieux (2009) on unconditional quantile regressions. Its principle is summarised below.
3.1.1 Influence function and recentered influence function of a functional
Let Y be a real-valued random variable, and Ψ a functional on the space of cumulative distribution functions on R, that is to say Ψ associates with every cumulative distribution function F : R → [0,1] a real number Ψ(F). The influence function measures the change in Ψ(FY), where FY is the cumulative distribution function of Y, when FY changes slightly.
Let FY and GY be two possible cumulative distribution functions for the real-valued random variable Y.
Proposition : For every t in [0,1], FY,t,GY : R → [0,1] defined by FY,t,GY(y) = (1-t) FY(y) + t GY(y) is a possible cumulative distribution function of Y.
Definition : The directional derivative of Ψ at point FY along vector GY-FY is :
\[ \lim_{t \to 0^+} \frac{\Psi(F_{Y,t,G_Y}) - \Psi(F_Y)}{t} \]
when the limit exists.
Let δy be the cumulative distribution function of a Dirac in y, that is to say : δy(s)=0 for s<y and δy(s)=1 for s ≥ y.
Definition : The influence function IF(Ψ, FY, y) of the functional Ψ for the distribution FY at y is the directional derivative of Ψ at point FY along vector δy-FY.
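Proposition : If FY and GY are two possible cumulative distribution functions for the real-valued random variable Y then :
\[ \lim_{t \to 0^+} \frac{\Psi(F_{Y,t,G_Y}) - \Psi(F_Y)}{t} = \int IF(\Psi, F_Y, y)\, d(G_Y - F_Y)(y) \]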
Definition : The recentered influence function RIF(Ψ, FY, y) of the functional Ψ for the distribution FY at y is :
\[ RIF(\Psi, F_Y, y) = \Psi(F_Y) + IF(\Psi, F_Y, y) \]
Proposition : If FY is the cumulative distribution function of the real-valued random variable Y then :
\[ \int RIF(\Psi, F_Y, y)\, dF_Y(y) = E[RIF(\Psi, F_Y, Y)] = \Psi(F_Y) \]
Let X now be a multidimensional random variable that represents the covariates of Y.
Proposition : If FX is the cumulative distribution function of the multidimensional random variable X and FY the cumulative distribution function of the real-valued random variable Y then :
\[ \Psi(F_Y) = \int E[RIF(\Psi, F_Y, Y) \mid X = x]\, dF_X(x) \]
Hence, given a linear model :
\[ E[RIF(\Psi, F_Y, Y) \mid X] = \beta X \]
where β does not depend on FY, one can then write, for any cumulative distribution function FY:
\[ \Psi(F_Y) = \beta\, E[X] \]
This gives the principle that was followed in this attempt to forecast at-risk-of-poverty rates from EU-SILC.
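As an illustration of this principle, the sketch below uses a deliberately simple functional, Ψ(F) = F(s) for a fixed threshold s, whose recentered influence function is the indicator of y ≤ s. The simulated data, the threshold value and the helper name `ols_fit` are assumptions made for the example and are not part of the study. It checks that a least-squares regression of the RIF on a covariate, with an intercept, returns the functional when evaluated at the covariate mean.

```python
import random

def ols_fit(x, r):
    """Least-squares fit of r = b0 + b1 * x (one covariate plus intercept)."""
    n = len(x)
    mx, mr = sum(x) / n, sum(r) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxr = sum((xi - mx) * (ri - mr) for xi, ri in zip(x, r))
    b1 = sxr / sxx
    return mr - b1 * mx, b1

random.seed(0)
# Hypothetical data: x is a covariate, y an income that depends on it.
x = [random.gauss(0.0, 1.0) for _ in range(2000)]
y = [20000 + 5000 * xi + random.gauss(0.0, 4000) for xi in x]

s = 15000.0                                  # fixed threshold (assumption)
rif = [1.0 if yi <= s else 0.0 for yi in y]  # RIF of the functional F(s)
psi = sum(rif) / len(rif)                    # Psi(F_Y): share of y below s

b0, b1 = ols_fit(x, rif)
psi_from_model = b0 + b1 * sum(x) / len(x)   # beta times the mean of X
print(round(psi, 4), round(psi_from_model, 4))
```

Within the estimation sample the two numbers coincide by construction; the forecasting use relies on the coefficients remaining valid when the covariate mean is taken from another year.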
3.1.2 Forecasting the at-risk-of-poverty rate thanks to influence functions
This method can be applied to at-risk-of-poverty rate forecasting:
1. The at-risk-of-poverty rate is seen as a functional Ψ of the equivalised disposable income distribution as it is in the administrative data gathered in SILC.
2. Survey data in SILC is seen as covariates X.
Hence, whenever one has survey and administrative data on year N in SILC, and survey data on year N+1, one may :
1. calculate the recentered influence function of the at-risk-of-poverty rate on year N ;
2. estimate the linear model RIF( Ψ,FY,N,YN) = βN XN on year N ;
3. use the estimated βN to predict Ψ(FY,N+1) from variables XN+1 : Ψ(FY,N+1)= βN E[XN+1].
This may give a reasonable forecast of Ψ(FY,N+1) as long as the relationship RIF(Ψ, FY,N, YN) = βN XN still holds for year N+1, which is plausible as long as FY,N and FY,N+1 on the one hand, and FX,N and FX,N+1 on the other, do not differ much.
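Formally, with the at-risk-of-poverty threshold set at 60% of the median, the recentered influence function of the at-risk-of-poverty rate is (Osier, 2009) :
\[ RIF(\Psi, F_Y, y) = 1_{\{y \le s_Y\}} - 0.6\, \frac{f_Y(s_Y)}{f_Y(m_Y)} \left( 1_{\{y \le m_Y\}} - \frac{1}{2} \right) \]
where sY is the at-risk-of-poverty threshold, mY the median of the equivalised disposable income distribution and fY its probability density function.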
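The computation can be sketched as follows, assuming a threshold at 60% of the median and the expression RIF(y) = 1[y ≤ s] − 0.6 · f(s)/f(m) · (1[y ≤ m] − 1/2), with f the income density. The function names `kde` and `rif_poverty_rate`, the Gaussian kernel with a rule-of-thumb bandwidth, the absence of survey weights and the simulated incomes are all assumptions made for the illustration; they are not the implementation used in the study.

```python
import math
import random
import statistics

def kde(data, point, bandwidth):
    """Gaussian kernel density estimate of `data` evaluated at `point`."""
    z = ((point - d) / bandwidth for d in data)
    total = sum(math.exp(-0.5 * u * u) for u in z)
    return total / (len(data) * bandwidth * math.sqrt(2.0 * math.pi))

def rif_poverty_rate(incomes, share=0.6):
    """RIF of the at-risk-of-poverty rate for each income (unweighted)."""
    m = statistics.median(incomes)            # median m_Y
    s = share * m                             # threshold s_Y
    # Rule-of-thumb bandwidth (an assumption, not the study's choice).
    h = 1.06 * statistics.stdev(incomes) * len(incomes) ** -0.2
    ratio = share * kde(incomes, s, h) / kde(incomes, m, h)
    return [(1.0 if y <= s else 0.0) - ratio * ((1.0 if y <= m else 0.0) - 0.5)
            for y in incomes]

random.seed(1)
# Hypothetical equivalised incomes, for illustration only.
incomes = [random.lognormvariate(10.0, 0.6) for _ in range(1000)]
rif = rif_poverty_rate(incomes)
threshold = 0.6 * statistics.median(incomes)
rate = sum(y <= threshold for y in incomes) / len(incomes)
print(round(rate, 4), round(sum(rif) / len(rif), 4), len(set(rif)))
```

The sample mean of the RIF reproduces the at-risk-of-poverty rate, and the RIF takes one value on each of the three income ranges delimited by the threshold and the median.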
3.2 Practical aspects
3.2.1 Household selection for the estimation
Standard matching between administrative and survey data in SILC matches survey data on year N with administrative data on year N-1. Matching administrative and survey data on year N is in itself a problem:
· equivalised disposable income is household-level data;
· households are cross-section units in the data whereas individuals are panel-data units.
To have a matching that makes sense between administrative and survey data on year N, one has to restrict oneself to the households whose members did not change between year N-1 and year N.
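A minimal sketch of this restriction is given below. It assumes that each year's file can be summarised as a mapping from a household identifier to the set of its individual (panel) identifiers; the function name `stable_households` and all identifiers are hypothetical.

```python
def stable_households(members_prev, members_curr):
    """Return year-N household ids whose member set is identical in year N-1.

    Both arguments map a household id to the set of its individual ids.
    Household ids may differ across years, so the comparison goes through
    the individual (panel) identifiers.
    """
    prev_sets = {frozenset(m) for m in members_prev.values()}
    return sorted(h for h, m in members_curr.items() if frozenset(m) in prev_sets)

# Hypothetical identifiers, for illustration only.
year_prev = {"A1": {1, 2}, "A2": {3}, "A3": {4, 5, 6}}
year_curr = {"B1": {1, 2}, "B2": {3, 7}, "B3": {4, 5}, "B4": {6}}
print(stable_households(year_prev, year_curr))  # ['B1']
```

Only the household whose composition is unchanged is kept; households that gained a member or split between the two years are dropped.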
3.2.2 Choice of covariates
3.2.2.1 Available data
Available information in the survey data in SILC is:
· composition of the household
· characteristics of the reference person of the household
· characteristics and location of the household accommodation
· income of the household as reported in the survey data (as opposed to the administrative data on the income)
· deprivation indicators
The full list of covariates used is reproduced in Appendix A.
3.2.2.2 Survey data on income
Given the formal expression of the recentered influence function of the at-risk-of-poverty rate, it is unlikely to follow a linear relationship with the income of the household as declared in the SILC survey. Estimating a linear model RIF(Ψ, FY,N, YN) = βN XN with XN the survey equivalised disposable income could then lead to non-significant results.
To solve this problem, the recentered influence function of the at-risk-of-poverty rate based on survey equivalised disposable income was used instead of the survey equivalised disposable income in covariates X.
This calculation is based on the estimation of a probability density function. The way households tend to report their income in the survey may render the estimation difficult : their income is frequently reported as a multiple of 10, 50 or 100. This may create accumulation points in the distribution and lead to a poor estimation of the probability density function.
A simulated declared income was attributed to every household, based on the income it actually declared in the SILC survey and on a model of how households report approximate income in the survey, following Heitjan and Rubin (1990). This simulation leads to a smoother distribution on which density estimation is easier. It also made it possible to correct for item non-response: a survey equivalised disposable income was attributed to households that did not answer the income questions but did answer other questions in the survey.
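The rounding pattern described above can be quantified before any correction is applied. The sketch below only measures the share of reported incomes that are exact multiples of 10, 50 or 100; it does not implement the Heitjan and Rubin model. The function name `heaping_shares` and the reported amounts are hypothetical.

```python
def heaping_shares(incomes, bases=(10, 50, 100)):
    """Share of reported incomes that are exact multiples of each base."""
    n = len(incomes)
    return {b: sum(1 for v in incomes if v % b == 0) / n for b in bases}

# Hypothetical reported monthly incomes, for illustration only.
reported = [1200, 1350, 1487, 1500, 2000, 1730, 980, 2500]
print(heaping_shares(reported))  # {10: 0.875, 50: 0.625, 100: 0.5}
```

Shares far above what a smooth distribution would produce (about 10%, 2% and 1% respectively) signal the accumulation points that make density estimation difficult.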
3.2.3 Regression model
In the linear model RIF(Ψ, FY,N, YN) = βN XN, the covariate RIF(Ψ, GZ,N, ZN), where ZN is the household declared income in SILC on year N and GZ,N its cumulative distribution function, is expected to make a major contribution. However, the recentered influence function of the at-risk-of-poverty rate can only take three different values.
The pair (RIF(Ψ, FY,N, YN), RIF(Ψ, GZ,N, ZN)) can therefore only take nine different values. This situation is far from the traditional assumptions of linear regression models, and may lead to a poor estimation.
One could try to estimate the probability of being in each of the three domains of RIF(Ψ, FY,N, YN) given XN, following a polytomic model, but it is then impossible to estimate the values of RIF(Ψ, FY,N, YN) on each of the domains due to identification problems. Furthermore, with a change of weighting, the probability of being below sY is exactly the at-risk-of-poverty rate one wants to forecast. Influence functions then serve no further purpose.
There is no simple solution to this problem. When recentered influence functions were used, the linear model was estimated as a GLM in order to relax the homoscedasticity assumption. This model was compared with three others that do not involve influence functions :
· a logit model where RIF(Ψ, FY,N, YN) (resp. RIF(Ψ, GZ,N, ZN)) was replaced by a dichotomic variable y ≤ sY (resp. z ≤ sZ) ;
· a probit model where RIF(Ψ, FY,N, YN) (resp. RIF(Ψ, GZ,N, ZN)) was replaced by a dichotomic variable y ≤ sY (resp. z ≤ sZ);
· a multinomial logit model where RIF(Ψ, FY,N, YN) (resp. RIF(Ψ, GZ,N, ZN)) was replaced by a three-category variable y ≤ sY , sY < y ≤ mY, mY < y (resp. z ≤ sZ , sZ < z ≤ mZ, mZ < z).
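The dichotomic logit alternative can be sketched as follows: the indicator y ≤ sY is regressed on the indicator z ≤ sZ for year N, and the forecast for year N+1 is the average predicted probability computed from that year's survey indicator. The data-generating process, the function names and the hand-written Newton-Raphson fit are assumptions made for the illustration; the actual models used the full covariate list and survey weights.

```python
import math
import random
import statistics

def fit_logit(x, d, iterations=25):
    """Newton-Raphson fit of P(d = 1 | x) = 1 / (1 + exp(-(a + b * x)))."""
    a = b = 0.0
    for _ in range(iterations):
        p = [1.0 / (1.0 + math.exp(-(a + b * xi))) for xi in x]
        # Gradient of the log-likelihood.
        g0 = sum(di - pi for di, pi in zip(d, p))
        g1 = sum((di - pi) * xi for di, pi, xi in zip(d, p, x))
        # Negative Hessian (2 x 2), inverted by hand.
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

def simulate(n, mu, rng):
    """Hypothetical administrative income y and noisy survey income z."""
    y = [rng.lognormvariate(mu, 0.6) for _ in range(n)]
    z = [yi * math.exp(rng.gauss(0.0, 0.3)) for yi in y]
    return y, z

def below_threshold(values, share=0.6):
    """Indicator of lying below 60% of the median of `values`."""
    s = share * statistics.median(values)
    return [1.0 if v <= s else 0.0 for v in values]

def predict(a, b, x):
    return [1.0 / (1.0 + math.exp(-(a + b * xi))) for xi in x]

rng = random.Random(2)
y_n, z_n = simulate(2000, 10.0, rng)     # year N: both sources available
_, z_next = simulate(2000, 10.02, rng)   # year N+1: survey income only

d_n, x_n = below_threshold(y_n), below_threshold(z_n)
a, b = fit_logit(x_n, d_n)
in_sample = sum(predict(a, b, x_n)) / len(x_n)
forecast = sum(predict(a, b, below_threshold(z_next))) / len(z_next)
print(round(sum(d_n) / len(d_n), 4), round(in_sample, 4), round(forecast, 4))
```

With an intercept in the model, the average fitted probability in the estimation sample equals the observed share of households below the threshold, which is why the average predicted probability is a natural forecast of the rate.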
3.2.4 Levels and evolution, choice of year N
Based on the available data, the strategy described in 3.1.2. was actually adapted in two different ways, depending on whether one wished to predict at-risk-of-poverty rate levels or at-risk-of-poverty rate evolutions.
To predict at-risk-of-poverty rate levels, the steps are to:
1. calculate the recentered influence function of the at-risk-of-poverty rate on year N and on year N - 1 ;
2. estimate the linear model RIF(Ψ, FY, Y) = β X on year N and N - 1 (pooled cross-section) ;
3. use the estimated βN,N-1 to predict Ψ(FY,N+1) from variables XN+1 : Ψ(FY,N+1) = βN,N-1 E[XN+1].
In this case it was chosen to consider N = 2009.
To predict the evolution of the at-risk-of-poverty rate, the steps are to:
1. calculate the recentered influence function of the at-risk-of-poverty rate on year N ;
2. estimate the linear model RIF(Ψ, FY,N, YN) = βNXN on year N ;
3. use the estimated βN to forecast Ψ(FY,N+2) - Ψ(FY,N+1) from covariates XN+1 and XN+2 : Ψ(FY,N+2) – Ψ(FY,N+1) = βN ( E[XN+2] - E[XN+1]).
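Once the coefficients are estimated, both forecasts reduce to products with covariate means. The sketch below uses hypothetical coefficients and covariate means (intercept first) purely to show the arithmetic; none of the numbers come from the study.

```python
def dot(beta, x):
    """Inner product of a coefficient vector and a covariate vector."""
    return sum(b * v for b, v in zip(beta, x))

beta = [0.05, 0.60, -0.02]       # hypothetical coefficients, intercept first
mean_x_n1 = [1.0, 0.20, 1.50]    # hypothetical covariate means, year N+1
mean_x_n2 = [1.0, 0.22, 1.45]    # hypothetical covariate means, year N+2

level_n1 = dot(beta, mean_x_n1)  # forecast of the level in year N+1
level_n2 = dot(beta, mean_x_n2)  # forecast of the level in year N+2
# Forecast of the evolution from the change in covariate means.
evolution = dot(beta, [b - a for a, b in zip(mean_x_n1, mean_x_n2)])
print(round(level_n1, 3), round(level_n2, 3), round(evolution, 3))
```

The evolution forecast equals the difference between the two level forecasts, and the intercept cancels out, so only covariates whose means change between the two years contribute.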
In this case it was chosen to consider N = 2008.
4 Results
4.1 Estimation results
The commented estimation results deal with the estimation on years N-1 and N of the regression model (used to predict at-risk-of-poverty rate level in year N+1).
Since RIF(Ψ, FY, y) is not monotonic in y, the signs and values of the coefficients are difficult to interpret.
Significance levels in the regressions are reproduced in Appendix B. The most striking result is that when estimating RIF(Ψ, FY, Y) = β X on years 2008 and 2009 using a generalised linear model, the contribution of most of the covariates is non-significant, whereas their contribution when estimating dichotomic and polytomic models is significant. In other words, these covariates discriminate well between households that are below and above the at-risk-of-poverty threshold, and discriminate well between households that are below and above the median of the equivalised disposable income distribution, but do not discriminate well between the three different values of the recentered influence function.
The poor estimation results may therefore not be caused by a wrong choice of covariates, but rather, as 3.2.3 suggests, by a misspecification of the regression model.
Given these poor estimation results, another model was added to the comparison. It consists of a GLM regression of RIF(Ψ, FY, Y) = β X on years 2008 and 2009, but with a different choice of covariates: those found to be significant at the 95% level in the first estimation. In the rest of the text, this model is referred to as the reduced model.
These results also hold for the estimation on year N of the regression model (used to predict at-risk-of-poverty rate evolutions between year N+1 and N+2). The results are reproduced in Appendix C.
4.2 Forecast of at-risk-of-poverty rate levels
The five estimated models were then used to forecast the at-risk-of-poverty rate in 2010, and their results were compared with the at-risk-of-poverty rate published by Eurostat from EU-SILC. The results are presented in the following table.
Model / Predicted at-risk-of-poverty rate in 2010
RIF-GLM / 14.0%
Reduced RIF-GLM / 14.3%
Dichotomic logit / 13.4%
Dichotomic probit / 13.3%
Multinomial logit / 12.7%
Results of EU-SILC 2010 / 13.3%
Unfortunately, no confidence intervals were computed for these results, which makes the comparison difficult. However, the rates predicted by the RIF-GLM and reduced RIF-GLM models look disappointing when compared both with the actual at-risk-of-poverty rate in 2010 according to SILC and with the rates predicted by the dichotomic and multinomial models, which do not require the use of influence functions.
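The gaps between each forecast and the published rate can be read off the table directly; the short computation below only restates that arithmetic, in percentage points. The variable names are illustrative.

```python
published = 13.3  # EU-SILC 2010 at-risk-of-poverty rate, in %
predicted = {
    "RIF-GLM": 14.0,
    "Reduced RIF-GLM": 14.3,
    "Dichotomic logit": 13.4,
    "Dichotomic probit": 13.3,
    "Multinomial logit": 12.7,
}
# Forecast minus published rate, in percentage points.
gaps = {model: round(rate - published, 1) for model, rate in predicted.items()}
for model, gap in gaps.items():
    print(f"{model}: {gap:+.1f}")
```

The two RIF-based models overshoot by 0.7 and 1.0 points, whereas the dichotomic logit and probit models are within 0.1 point and the multinomial logit undershoots by 0.6 point.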