CONFIDENCE INTERVALS FOR THE DEGREE
DISTRIBUTION OF A GRAPH UNDER INDUCED
SUBGRAPH SAMPLING
By Robert Garrard
School of Economics, University of Adelaide
We study the problem of constructing con dence intervals for the degree distribution of a graph when degrees are sampled via in-duced subgraph sampling. This sampling method results in obser-vations that are not independent, an ill-conditioned design matrix, and noise whose covariance matrix depends on nuisance parameters. We propose a Monte Carlo method akin to a parametric bootstrap whereby the degree distribution is estimated using a truncated singu-lar value decomposition and several graphs respecting the estimated degree distribution are constructed randomly from which we may draw pseudosamples. For each graph the relevant quantiles of the es-timator are determined and con dence intervals are constructed by taking the minimum and maximum values of the respective quantiles over all the graphs.
E-mail:
Keywords and phrases: Induced subgraph, density estimation, inverse problem
1
2
PREFACE
Thesis Title: Estimation of Distributional Properties of Social Networks Supervisors: Dr Firmin Doko Tchatoka, Dr Virginie Masson
Traditional neoclassical economic theory tends to view economic behavior through the lens of centralized markets which are perfectly competitive and obtain a market clearing price and quantity in equilibrium. In order to study a richer set of phenomena, modern theory attempts to relax some of these simplifying assumptions. Perfect competition is frequently relaxed through the introduction of a continuum of monopolistically competitive intermedi-ate goods rms or an increasing returns to scale production technology, and the existence of a market clearing price and quantity may be substituted for search and matching models (such as the Diamond-Mortensen-Pissaredes model of labor search). The study of social networks attempts to generalize economic exchange away from centralized markets to interactions between sparsely connected agents. Much theoretical headway has been made toward understanding how the structural features of social networks can a ect the underlying behaviors, but empirical techniques for measuring these features are yet to catch up.
My thesis studies procedures for estimating and conducting inference on a key feature of a social network: its distribution of connections.
Chapter one is concerned with testing for stochastic dominance relation-ships between the degree distributions of two networks or the same network measured over time. Under mild assumptions on a social welfare function, ef-ciency of one network architecture over another translates into a stochastic dominance relationship of the degree distributions (similar to ranking in-come distributions). We propose a statistical test which corrects for sample correlation and prove the validity of a bootstrap for the test statistics.
Chapter two, the paper presented here, studies how to estimate the degree distribution under a common sampling technique, induced subgraph sam-pling, and proposes a construction of con dence intervals based on Monte Carlo methods.
Chapter three addresses goodness-of- t testing for induced subgraph sam-ples. The design matrix associated with the sampling method is ill-conditioned, so typical hypothesis tests have remarkably low power. We consider regular-ized regression methods for increasing the power of hypothesis tests while maintaining the advertized size.
3
1. Introduction. Graphs are an increasingly popular tool used to model complex relationships and interactions, such as the spread of contagion through a nancial system (Nier et al. 2007, Gai and Kapadia 2010) , risk sharing (Bramoulle and Kranton 2007), trade in the absence of markets (Kranton and Minehart 2001), and even the interaction between proteins within a cell (Pellegrini et al. 2004).1 One of the many salient features of a graph is its degree distribution, which captures the pattern of direct connec-tions between nodes. Albert et al. (2000) show that the degree distribution is strongly tied to the ability of a graph to withstand failures of some of its components. Doyle et al. (2005) coined the term \robust-yet-fragile" to refer to graphs that are robust to random failures but vulnerable to targeted attacks or failures of certain \key" components. This feature is common to many real-world graphs such as the webgraph of the internet or interbank lending networks in a nancial system (Boss et al. 2004, Gai et al. 2011).
Since graphs are often too large to observe in their entirety, inferences about their features must be made from sampled subgraphs. One such sam-pling method, induced subgraph sampling, involves taking a simple random sample of nodes and observing only the connections between those nodes sampled. This yields a sampled subgraph for which we may compute fea-tures of interest, such as measures of centrality, clustering, etc. However, this sampling design distorts the degree distribution by systematically omitting links from the sample. In this paper we study how to estimate the density of a graph's degree distribution based on an induced subgraph sample and propose a Monte Carlo method for constructing a con dence region.
Let G = (V; E) be a graph and let V0V and E0E be a subset of nodes and edges. G0 = (V0; E0) is said to be an induced subgraph when any pair of nodes a; b2V0 are adjacent in G0 if and only if they are adjacent in G. An induced subgraph sample is formed by simple random sampling a subset of nodes and constructing the subgraph induced by those nodes. Since only connections to other nodes in the sample are measured, a node's degree in the sampled subgraph will be biased downward from its degree in the population graph. Furthermore, despite simple random sampling of the nodes, the observed degrees in the induced subgraph do not constitute independent draws. This takes kernel density estimation of the degree dis-tribution of G o the table. Not only would it be estimating a distorted version of the distribution, but the failure of independence would prevent cross-validation from selecting the optimal bandwidth.
Suppose we draw an induced subgraph sample G0 from a graph G. Let N =jV jand n =jV 0jbe the number of nodes in the population and sample
1The terms graph and network will be used interchangeably.
4
(a) True Distribution(b) Simple Estimator
Fig 1: Estimate of degree distribution using simple inversion. Sample of size n = 2000 drawn from an Erd}os-Renyi random graph on N = 10; 000 nodes with probability parameter such that N p = 7.
respectively. Let 2RN represent the degree distribution of G such that i is the proportion of nodes in G with degree i = 0; : : : ; N 1, and let y2Rn be de ned similarly for G0. Frank (1980) shows that
(1) / E[y] =Xwhere
(2) / Xij= / j / N / 1 / j / N / 1 / 1
i / n / 1 / i / n / 1
The system in (1) is under-determined since N > n. Supposing there is a known maximum degree in G such that only the rst M elements of are possibly non-zero, the system reduces to an inverse problem. De ning y~ and
~
X to be the rstMrows and columns of y and X respectively, a natural unbiased estimator of is
^~ 1~
(3)= Xy
~
However, as shown in Zhang et al. (2015), the matrix X is ill-conditioned with rapidly decaying singular values. This results in an unstable inversion when y is measured with noise. Figure 1 illustrates the undesirable nature of this estimator.
Zhang et al. (2015) derive conditions under which the distribution of noise from sampling error may be considered approximately normal and propose
5
an estimator which is the solution to the following quadratic programming problem.
(4) / minimize / (y~ / X~ / )0C1 / (y~ / X~ / ) + D / 2jj / jj2
subject to / 0
jj jj1= 1
Where C is an estimator of the covariance matrix of the noise, D is a second-di erencing operator to impose smoothness, and is chosen through Monte Carlo SURE. While this enforces desired behavior of the estimator by con-straint, its properties become di cult to study due to the requirement that both the estimator and the tuning parameter be determined numerically. Furthermore, the impact of the `1 constraint is unclear. By the very nature
~
of the problem, X has eigenvalues quickly decaying to zero, violating many of the typical restricted eigenvalue-type conditions for estimation with an `1 penalty, such as in Greenshtein et al. (2004), Fuchs (2005), Donoho (2006), and Zhao and Yu (2006).
The rest of the paper proceeds as follows. Section 2 discusses estimation of the degree distribution. Section 3 proposes a Monte Carlo method for constructing con dence intervals. Section 4 concludes with a discussion.
2. Estimating the Degree Distribution. LetGbe a graph onN nodes with an unknown degree distribution. Let G0 be an induced subgraph formed by simple random sampling n nodes from G. Following Frank (1980) and Zhang et al. (2015) we will assume G has a known maximum degree M n. Let denote the degree distribution of G, whereiis the proportion of nodes with degree i for i = 0; : : : ; M. Similarly, let y denote the empirical degree distribution of the sampled subgraph, where yi is the proportion of nodes in G0 with degree i for i = 0; : : : ; M. Let X be an (M + 1) (M + 1) matrix with ij-entry as in equation (2), for i; j = 0; : : : ; M. y is generated according to the linear model with xed design
(5)y = X + "
Where " is noise introduced by sampling error with mean 0 and covariance matrix . Frank (1980) shows that exhibits both heteroscedasticity and non-zero o -diagonal elements, each of which depend on higher order prop-erties of the graph G. Thus, is not in general consistently estimable due to its dependence on nuisance parameters, making GLS estimation of infeasible.
6
(a) True Distribution(b) Truncated SVD Estimator
Fig 2: Estimate of degree distribution in gure 1 using truncated SVD with sample size n = 2000.
Estimation of requires a solution to the above ill-posed inverse problem for which we will employ a truncated singular value decomposition (SVD). The design matrix may be decomposed into
(6)X = U V 0
Where U and V are (M + 1) (M + 1) orthogonal matrices and is a diagonal matrix whose entries 0 1M 0 are the singular values of X. The inverse of X may be written as
(7)X 1= V1U0
Since X is ill-conditioned, some of its singular values are very close to zero. This leads to extremely large elements in 1, in ating the e ect of noise and creating an unstable inversion. An approximate inverse which is well conditioned may be found by applying the following truncation.
y / 0 / y / ( / 1 / if ii t(8) / X / = V / U / = / ii
y / ii / 0 / otherwise
Ideally, the choice of the bandwidth t would be data driven. However, the dependence within the sample removes the possibility to choose t through cross-validation or bootstrap methods and the properties of the noise violate the assumptions for SURE. Thus we choose t heuristically. Noting that we obtain a consistent estimator only if t! 0, we have found reasonable success choosing t = n1=2 for sampling rates less than 20% and otherwise choosing
7
t = n 1=kwhere k is the tens digit of the sampling rate (eg, k = 3 for 30%). We employ the following consistent estimator.
^y
(9)=X y
+
Where (x)+ := maxfx; 0g to impose non-negativity. Figure 2 illustrates the SVD estimator for a sample taken from an Erd}os-Renyi random graph.
3. Monte Carlo Con dence Intervals. Suppose we wish to con-struct a 1 simultaneous con dence region for . Placing con dence inter-vals around our density estimate typically requires either knowledge of the sampling distribution of the estimator together with a consistent estimator of the standard error, or the use of the non-parametric bootstrap. Neither are possible here since the standard errors depend on nuisance parameters and the observations are not independent. We propose a Monte Carlo method for constructing con dence intervals akin to a parametric bootstrap.
^ / G be the set of all graphs onLet be a consistent estimator of . Let
^ / ^
jj jj1 / nodes such that for each g2 G, g has degree distribution .
Algorithm 1 Con dence Bands
1:for i= 1 tob do
2:Sample a graph gi2 G uniformly at random
3:for j= 1 toc do
4:Sample an induced subgraph, hj, on n nodes from gi
5:Determine the empirical degree distribution, yj , of hj
6: / Let ^ = / Xyy / +7: / end for j / j / ^
8: / Let Li and Ui / be the / and 1 / respectively
2 / 2 / quantiles of
9: end for
10: Let L = minfLig and U = maxfUig
Where the quantiles, minima, and maxima are understood componen-twise. Here the reason for choosing a truncated SVD estimator becomes apparent. The pseudoinverse X may be precalculated such that estimation
^
of reduces to matrix multiplication. A constrained least squares approach would require the solution of a quadratic programming problem at each it-eration.
This approach is similar to a parametric bootstrap because we attempt
^
to mimic the sampling distribution of by rst estimating the parameter and then drawing pseudosamples from a DGP indexed by that parameter. However, since the DGP is not uniquely pinned down by just the degree dis-tribution, it also depends on higher order properties of the graph, we sample
8
uniformly over all possible graphs which have that degree distribution and in e ect consider the worst case scenario. This should lead to con dence inter-vals which are asymptotically conservative. This construction requires the ability to sample uniformly over the set of graphs with a given degree dis-tribution. This can be achieved quite easily with the so-called Con guration model (Bender and Can eld 1978, Bollobas 2001), which reduces the prob-lem of randomly generating a graph with a prescribed degree distribution to that of creating a random matching.
3.1. Sampling Graphs with a Prescribed Degree Distribution. Let d1; : : : ; dN be the set of degrees of each node implied by the degree distribution.2 We wish to construct a graph with this degree sequence. Begin with a set of nodes, i = 1; : : : ; N. Endow each node i, with a set of distubs (or half-links) emanating from the node. Now construct a random matching on the set of stubs, and connect each pair of stubs that are matched to form a link.
To randomly match the stubs, create a list of the node labels in which label i appears di times, then form a random permutation of the list. To construct the graph, start at the rst entry in the permuted list and begin pairing o the stubs of nodes with adjacent labels.
Example 1. Suppose we wish to construct a graph with degree sequence 4; 2; 2; 1; 1.
Fig 3: Nodes with 4, 2, 2, 1, and 1 stubs respectively.
Construct the list: 1111223345. Produce a random permutation: 5113212314. Connect adjacent nodes.
2Which is unique up to a relabeling of the nodes.
9
(a) Connect the stubs.(b) Resulting Graph
Given a degree sequence, d1; : : : ; dm, We randomly sample the adjacency matrix of a graph g2 G on m nodes as follows.
Algorithm 2 Sample a Graph
1:Let S be a vector of stub labels
2:G zeros(m; m)
3: Prandperm(S)
4: fori = 1 to length(P )2 do
5: / G(P (i); P (i + 1)) / 16: / G(P (i + 1); P (i)) / 1
7: / end for
This yields an adjacency matrix G from which we may sample induced subgraphs. Under this sampling method it's possible for the sampled graph to have self-loops and multiple edges between nodes, though the probability of this is declining with the number of nodes. It's feasible to only accept samples which are simple graphs, but for the sake of computational speed we will not do so.
Figure 5 illustrates con dence regions generated by this procedure for various sampling rates. Sampling rates for graphs may be as high as 50% (Jackson et al. 2012, Banerjee et al. 2012). Due to the way in which the con dence intervals are generated, they need not be centered around the density estimate. However, since the threshold parameter for the SVD esti-mator increases with the sampling rate, the con dence bounds automatically become smoother and more tightly concentrated around the estimate.
4. Discussion. In this paper we have proposed a method for construct-ing con dence intervals around a density estimate of a graph's degree distri-bution based on Monte Carlo construction of graphs whose degree distribu-tions agree with the estimate. For these con dence intervals to attain their advertized coverage in nite samples, it's necessary for the density estimator
10
(a) Sampling Rate: 10%(b) Sampling Rate: 20%
(c) Sampling Rate: 30%(d) Sampling Rate: 50%
Fig 5: 95% Con dence bands with b = 100 and c = 100. M = 15.
to have good nite sample estimation risk. Zhang et al. (2015) propose an estimator that attempts to minimize estimation risk both through constraint (enforcing non-negativity and a known `1 norm) and Monte Carlo SURE. However, the combination of estimation through quadratic programming, determining the tuning parameter through Monte Carlo SURE, and form-ing con dence intervals with Monte Carlo construction of graphs proves to be a computational burden. Consequently we rely on a simpler estimation method based on a truncated singular value decomposition where the band-width is chosen heuristically. A data driven choice for this bandwidth given a dependent sample is a possible avenue for further research.
References.
Albert, R., Jeong, H., and Barabasi, A.-L. (2000). Error and attack tolerance of complex networks. nature, 406(6794):378{382.
Banerjee, A., Chandrasekhar, A. G., Du o, E., and Jackson, M. O. (2012). The di usion
11
of micro nance. NBER Working Papers 17743, National Bureau of Economic Research, Inc.
Bender, E. A. and Can eld, E. (1978). The asymptotic number of labeled graphs with given degree sequences. Journal of Combinatorial Theory, Series A, 24(3):296 { 307.
Bollobas, B. (2001). Random Graphs:. Cambridge University Press, Cambridge, 2 edition.
Boss, M., Elsinger, H., Summer, M., and 4, S. T. (2004). Network topology of the interbank market. Quantitative Finance, 4(6):677{684.
Bramoulle, Y. and Kranton, R. (2007). Risk-sharing networks. Journal of EconomicBehavior & Organization, 64(3-4):275{294.
Donoho, D. L. (2006). Compressed sensing. IEEE Transactions on information theory, 52(4):1289{1306.
Doyle, J. C., Alderson, D. L., Li, L., Low, S., Roughan, M., Shalunov, S., Tanaka, R., and Willinger, W. (2005). The robust yet fragile nature of the internet. Proceedings of theNational Academy of Sciences of the United States of America, 102(41):14497{14502.
Frank, O. (1980). Estimation of the number of vertices of di erent degrees in a graph.
Journal of Statistical Planning and Inference, 40(1):45 { 50.
Fuchs, J.-J. (2005). Recovery of exact sparse representations in the presence of bounded noise. IEEE Transactions on Information Theory, 51(10):3601{3608.
Gai, P., Haldane, A., and Kapadia, S. (2011). Complexity, concentration and contagion. Journal of Monetary Economics, 58(5):453 { 470. Carnegie-Rochester Conference onpublic policy: Normalizing Central Bank Practice in Light of the credit Turmoi, 1213 November 2010.
Gai, P. and Kapadia, S. (2010). Contagion innancial networks. Proceedings of the Royal
Society of London A: Mathematical, Physical and Engineering Sciences.
Greenshtein, E., Ritov, Y., et al. (2004). Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli, 10(6):971{988.