PROCEDURES FOR SAMPLING IN VILLAGES

THAT ARE CONSISTENT WITH

PROBABILITY SAMPLING

John W. Hall1

ABSTRACT

Conducting a survey in a village in the developing world can present many challenges. In the face of these challenges, preserving the probabilistic nature of the sample may be difficult. Nonetheless, procedures can be devised that are consistent both with the principles of probability sampling and with data collection methods and conditions in the field. From a few of my own observations (in Kenya) and from descriptions provided by others, I surmise that surveys in villages present some difficulties of a general nature: (1) village boundaries may not be easily discerned; (2) residences may be difficult to find (remote or hidden by natural objects or other buildings); (3) the layout of the village may make it difficult to form discrete subsets of dwellings for subsampling (like blocks in United States cities); (4) linking persons to a unique structure or group of structures may be difficult for an outsider. Other difficulties a survey may encounter include cost, cultural barriers to participation in the survey, and lack of a pool of experienced (or easily trained) field workers. Surveys are a newer phenomenon in much of the developing world than they are in developed countries. Procedures developed in developed countries may not easily fit the geography or culture of developing countries. However, all of the obstacles mentioned above have analogs in surveys I have conducted in the United States.

KEY WORDS: Probability sampling; Weights; Bias; Developing countries; Low-cost surveys.

1. INTRODUCTION

The title of the session is “Sampling in Villages of the World,” and I propose to discuss procedures to follow when sampling in villages for maximum consistency with the concept of probability sampling. In thinking about the problem, I discovered that I wasn’t sure exactly what the word village means. Also, my experience is that terms such as probability sampling often mean different things to different people and even, at times, to different statisticians. Therefore, the first part of my presentation will define these terms, or at least give my understanding of what they mean. The next topic is a review of some problems that a researcher might encounter in sampling the villages of the world. The paper then presents an overview of the often-discussed EPI method for selecting samples and some problems this method presents for both aggregate and local estimates. The final section offers a few suggestions that may help improve sampling.

2. DEFINITIONS

The Random House Dictionary (1987) defines a village as “a small community or group of houses in a rural area.” It also mentions that a village is larger than a hamlet and smaller than a town, but such distinctions do not seem helpful for present purposes. The terms small and rural are helpful but vague. How small is small?

1 John W. Hall, Mathematica Policy Research, Inc., P.O. Box 2393, Princeton, NJ 08543 USA.



Opinions from sources such as family, acquaintances, and colleagues suggested to me that a village is usually thought of as a few hundred persons. For this discussion I suggest a lower bound of about 10 and an upper bound of around 2,000 persons. The choice of the figure 10 comes from the Giriama people of Kenya, among whom a “village” may comprise the members of a single family2. The 2,000 figure was suggested by a colleague from Indonesia and by a newsletter of the Association for India’s Development (Dishaa, 1998). Some entities in the United States that call themselves villages are quite a bit larger than that, ranging up to 10,000 or so persons, but entities of this size present a different set of problems.

The categorization of villages as rural helps limit the discussion because areas near urban centers present different concerns from communities in rural settings. Implicit in the definition seems to be the idea of a fixed place. Sampling of nomads presents interesting issues, but they are beyond the scope of this session. Thus, for the purposes of this paper, a village will be a group of no more than 2,000 people living in a geographically definable place, located in a rural area. Administratively, it may be an entity recognized by some government agency, or it may be a part of such an entity (such as a state, district, or county).

The other term to define is probability sampling. Many texts state that in probability sampling, every element in the population being studied has a known, nonzero probability of being selected. Kish adds: “This probability is attained through some mechanical operation of randomization” (1965, p. 20). The use of probability samples allows us to use statistical theory to evaluate the reliability of the resulting estimates (Levy and Lemeshow, 1991, p. 17). A probability sample does not have to be selected with equal probability for every member; however, samples with unequal probabilities should be weighted before data are analyzed. Where sample weights are required, it is obviously important to know the probabilities of selection, or at least the relative chances of all members of the sample.
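The link between selection probabilities and sample weights can be made concrete with a small sketch. It assumes the usual convention that a sampled element's weight is the inverse of its selection probability; the function names, the immunization indicator, and the probabilities below are hypothetical and serve only as illustration.

```python
def sampling_weights(selection_probs):
    """Return inverse-probability weights, one per sampled element."""
    if any(p <= 0 or p > 1 for p in selection_probs):
        raise ValueError("selection probabilities must lie in (0, 1]")
    return [1.0 / p for p in selection_probs]


def weighted_mean(values, weights):
    """Weighted mean: sum(w * y) / sum(w)."""
    return sum(w * y for w, y in zip(weights, values)) / sum(weights)


# Hypothetical data: 1 = child immunized, 0 = not immunized.
immunized = [1, 1, 0, 1]
# Hypothetical (unequal) selection probabilities for the same four children.
probs = [0.10, 0.10, 0.02, 0.05]

weights = sampling_weights(probs)
unweighted = sum(immunized) / len(immunized)
weighted = weighted_mean(immunized, weights)
print(f"unweighted: {unweighted:.3f}  weighted: {weighted:.3f}")
```

In this made-up case the child with the smallest chance of selection carries the largest weight, so ignoring the weights would overstate coverage. If only relative chances of selection are known, the same weighted mean is obtained, because a common scaling factor cancels in the ratio.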

3. DIFFICULTIES OF SAMPLING IN VILLAGES

Sampling in a village setting is a relatively more important topic in surveying countries where a large proportion of the population lives in rural areas. However, in more urbanized countries, studies may require samples that focus on rural areas. Thus, the issues discussed in this paper do not pertain only to developing countries. The discussion to follow focuses on the village as part of a larger sample. Sampling within a village could also take place because one wants to make estimates about the village itself. An intermediate situation might be one in which the village is part of a domain of interest within a larger survey, so that separate estimates are desired for villages having certain characteristics (size, location).

Surveys in villages present some difficulties of a general nature: (1) village boundaries may not be easily discerned; (2) residences may be difficult to find (remote or hidden by natural objects or other buildings); (3) the layout of the village may make it difficult to form discrete subsets of dwellings for subsampling (like blocks in United States cities); (4) linking persons to a unique structure or group of structures may be difficult for an outsider. Other difficulties a survey may encounter include cost, cultural barriers to participation in the survey and lack of a pool of experienced (or easily trained) field workers.

These difficulties in sampling within a village might be thought of as having roots in three sets of factors: culture, geography, and resources. Culture may be thought of as mostly affecting data collection, but it can play a part even in sample selection. For example, one step in selecting a sample in a village may involve counting (enumerating) persons in the village. Cultural beliefs and practices may interfere with this enumeration. A newspaper in Kenya, the Nation, printed an editorial urging Kenyans not to lie to interviewers about the number of children they had. The practice of lying, the editorial stated, may have “been the result of cultural practices which forbid one from counting one’s children” (Daily Nation, 1998). I asked my wife, who is from Kenya, about the editorial, and although she was not aware of a child-counting taboo, she stated: “These people don’t want to be counted. Counting is something you do to goats or cattle, not to humans.”

2 Private communication, 1998.

Culture can affect cooperation with sampling procedures in other ways. Attitudes toward governmental authority differ across cultures or even across time in the same culture. If the survey is seen as part of the work of the government, cooperation may be affected, depending on whether the government is feared or trusted, respected or resented. For example, an e-mail message I received commented that Asians were undercounted in the 1990 United States Census; this phenomenon was attributed to new immigrants’ fear of authority. A community might hide persons or dwelling units if its members believe that revealing their presence would cause problems for these hidden individuals or for the larger community. Conversely, in the United States, our experience has been that endorsement of a survey by local government officials can increase levels of cooperation among many groups in the population.

Familiarity of survey management and field workers with local customs can also affect sampling procedures. If the person in charge of implementing sampling procedures in a particular village is not familiar with the local culture, it may affect his or her ability to carry out even simple procedures, such as identifying dwelling units. In my experience, it is not always obvious which structures are used as dwellings and which are not. While visiting a Kenyan village, I could not always distinguish dwellings from other structures, and I would have been at a loss to decipher the situation without embarrassing myself, offending my hosts, or both. Fortunately, I was accompanied by in-laws. Survey workers may not be as unfamiliar with their assigned areas as I was with that village; nonetheless, survey workers may live at some distance from the place they are helping to sample, in which case they will be somewhat unfamiliar with both the area and its customs. In Kenya a journey of 50 or 100 miles can take one to an area with a different language and different cultural traditions. In the United States, my company tries to hire survey workers who live near their assigned areas, but those workers may have to travel up to 50 miles to reach their assignments, often from cities into rural villages or farmland. For this reason our survey training includes tips on how to spot “hidden” dwelling units, such as units in hotels, commercial structures, boats, tents, and buses set up on blocks (Mathematica Policy Research, 1993, p. 22).

The search for dwelling units can also be affected by geography. In some village settings, all potential dwelling units are clearly visible; in others, vegetation or terrain may obscure many structures from the casual viewer. Excluding dwelling units that are hard to locate reduces frame coverage, possibly resulting in bias. Similarly, geographic conditions may make it hard to determine the boundaries of the village. Boundaries are important because for every member of the population to have a known, nonzero probability of selection, he or she should be associated with one and only one sampling unit (village or other area) at any stage of sample selection.

Resources are a concern in any survey. Money is important, but the availability of experienced, well-trained field workers may be more important. Even in regions where surveys are conducted frequently, it can be hard to find workers who are familiar with the geography and the culture and who have been trained in implementing statistically sound procedures. Initial training can be expensive, and even the best training will leave gaps that only experience can fill.

The discussion so far might lead those who are planning surveys in villages in developing countries to surrender to perceived difficulties and accept faulty practices. This happens in developed countries as well: non-probability sampling methods, low response rates, and shoddy data collection methods are sometimes accepted as necessary evils--or even justified in the name of efficiency. My argument is that we must keep trying to improve and not underestimate our capabilities. Procedures developed in the past should be reviewed frequently. Those that are sound and still practical should be kept. If a practice is of questionable effectiveness, perhaps it can be supplanted or improved. Changing circumstances may cause once workable procedures to become too expensive or otherwise impractical.

4. THE EPI PROCEDURES

The EPI procedures, named for the Expanded Program on Immunization of the World Health Organization (WHO), have been used at least since 1978 for surveys of immunization coverage (Lemeshow and Stroh, 1988). These procedures have been described or evaluated in several publications (Lemeshow and Stroh, 1988; Levy and Lemeshow, 1991; Fitch et al., 1995; Fitch, 1996; World Health Organization, 1991). The EPI procedures were designed for use in a multistage sample, where the first stage is selected with known probability (usually with probability proportional to size). In the EPI scheme of things, a village would probably be a first-stage selection unit (primary sampling unit or PSU). Dwelling units are sampled within the selected village. It appears that in an EPI survey, all survey-eligible inhabitants of a dwelling unit are selected. However, subsampling within households could also be employed.
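Selection of first-stage units with probability proportional to size can be sketched as follows. This is an illustration only: it uses systematic selection from a cumulated list of size measures, which is one common way to carry out such a selection and is not taken from the WHO materials. The village sizes and the function name are hypothetical.

```python
import random


def pps_systematic(sizes, n, rng):
    """Select n units with probability proportional to size.

    Returns (indices of selected units, inclusion probability of every unit).
    Assumes no unit is large enough for n * size / total to exceed 1.
    """
    total = sum(sizes)
    probs = [n * s / total for s in sizes]
    if max(probs) > 1:
        raise ValueError("a unit is too large for this n; select it with certainty")
    interval = total / n
    start = rng.random() * interval            # random start in [0, interval)
    targets = [start + k * interval for k in range(n)]
    selected, cumulative, idx = [], 0.0, 0
    for i, size in enumerate(sizes):
        cumulative += size
        # A unit is selected when a target falls inside its cumulated range.
        while idx < n and targets[idx] < cumulative:
            selected.append(i)
            idx += 1
    return selected, probs


# Hypothetical size measures (persons) for seven villages.
village_sizes = [120, 300, 80, 500, 200, 50, 750]
selected, probs = pps_systematic(village_sizes, n=2, rng=random.Random(42))
print("selected villages:", selected)
print("inclusion probabilities:", [round(p, 3) for p in probs])
```

The point of the sketch is that the first-stage probability of every village is known in advance, which is what the later stages of selection need to preserve.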

In an EPI survey, sampling in a village would be covered under WHO procedures for rural areas. A WHO training guide (World Health Organization, 1991) describes procedures for sampling with lists of dwelling units and for instances in which no lists are available. If a list of dwelling units exists, or if it is feasible to create one (as when there are 100 or fewer households in the village), the first unit sampled is selected through use of a random number table, or by using the serial numbers on currency notes as a substitute for such a table. In other cases, field workers are instructed as follows:

  • Select a central location in the village or town, such as a market, a mosque, or church. The location should be near the approximate geographical centre of the village or area.
  • Randomly select the direction in which the first household will be located. This can be done in a variety of ways. You may choose to spin a bottle on even ground. Wherever the bottle points when it stops will be the direction for the first household.
  • Count the number of houses which exist along the directional line you selected from the central location to the edge of the village.
  • Then select a random number between 1 and the total number of houses along the directional line selected. This will identify the first house to be visited. For example, if you randomly select the number 9, you will visit the ninth house from the central location along the chosen direction (World Health Organization, 1991, p. 16).

Once the first sampled dwelling unit has been identified, the second sampled unit is the one whose front door is nearest the front door of the first. Subsequent units are identified and sampled in the same way. Units are sampled and data collection attempted until the target number of observations (usually seven) is obtained. The WHO manual is unclear about what to do if no one is at home when a sampled unit is first contacted, but Lemeshow and Stroh (1988, p. 10) state that such households are skipped. It also appears that once a household is selected, there is no subsampling of persons within the household.
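The steps quoted above, together with the nearest-door rule, can be sketched in code for a synthetic village. Several details are assumptions made only to obtain something runnable: houses are points on a plane with the central location at the origin, a house counts as "along the directional line" if it lies within a hypothetical distance `band` of the line, the bottle is spun again if the line passes no house, straight-line distance stands in for door-to-door distance, and households that are not at home are ignored.

```python
import math
import random


def houses_along_ray(houses, angle, band):
    """Indices of houses within `band` of the ray from the centre, nearest first."""
    ux, uy = math.cos(angle), math.sin(angle)
    on_line = []
    for i, (x, y) in enumerate(houses):
        along = x * ux + y * uy            # distance from the centre along the ray
        across = abs(x * uy - y * ux)      # perpendicular distance from the ray
        if along > 0 and across <= band:
            on_line.append((along, i))
    return [i for _, i in sorted(on_line)]


def epi_select(houses, children, target, band, rng, max_spins=1000):
    """Return indices of visited houses, in visit order."""
    for _ in range(max_spins):
        angle = rng.uniform(0, 2 * math.pi)      # spin the bottle
        line = houses_along_ray(houses, angle, band)
        if line:
            break
    else:
        raise RuntimeError("no house found along any sampled direction")
    k = rng.randint(1, len(line))                # random number between 1 and the count
    current = line[k - 1]                        # k-th house from the centre
    visited, found = [current], children[current]
    while found < target and len(visited) < len(houses):
        cx, cy = houses[current]
        remaining = [i for i in range(len(houses)) if i not in visited]
        current = min(
            remaining,
            key=lambda i: math.hypot(houses[i][0] - cx, houses[i][1] - cy),
        )
        visited.append(current)
        found += children[current]               # all eligible children are taken
    return visited


# Hypothetical village: 60 houses, each with 0 to 3 eligible children.
rng = random.Random(2024)
houses = [(rng.uniform(-10, 10), rng.uniform(-10, 10)) for _ in range(60)]
children = [rng.randint(0, 3) for _ in range(60)]
visited = epi_select(houses, children, target=7, band=1.0, rng=rng)
print("houses visited:", visited)
print("children found:", sum(children[i] for i in visited))
```

Note that after the first house, nothing in the walk is random: the remaining houses are fully determined by the layout of the village.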

Do these procedures yield a probability sample of individuals (e.g., children, mothers)? Lemeshow and Stroh (1988, p. 11) conclude that the samples are not self-weighting (self-weighting samples do not require sample weights for analysis) and that the selection methods introduce several potential sources of bias. However, the EPI procedures may lead to samples that not only are not self-weighting, but cannot be properly weighted. To evaluate the procedures, let us first assume that they are implemented correctly as described in the WHO manual: (1) a village is selected with known probability; (2) selection of the first household is done either from a list or by the bottle-spinning technique; (3) if a cluster of units is selected first, workers correctly count the number of units in the cluster; and (4) workers accurately identify the dwelling unit nearest the one previously selected.

The cumulative or unconditional probability of selection for a person eligible for any EPI survey can be viewed as the product of the probabilities of selection at the various stages of selection. In a simple case, where the village is a primary sampling unit, these stages would be:

  • Village
  • Group of dwelling units (if no list is available)
  • Dwelling unit
  • Individual
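One way to write this product, using notation introduced here only for illustration (i indexes the village, j the group of dwelling units, k the dwelling unit, and l the individual):

```latex
\pi_{ijkl} \;=\; \pi_{i}\,\cdot\,\pi_{j \mid i}\,\cdot\,\pi_{k \mid ij}\,\cdot\,\pi_{l \mid ijk},
\qquad
w_{ijkl} \;=\; \frac{1}{\pi_{ijkl}}
```

Here each factor is the probability of selection at one stage, conditional on the selections at the earlier stages, and w is the corresponding sample weight. When dwelling units are selected directly from a list, the second stage drops out (its factor equals 1), and when all eligible persons in a selected dwelling unit are taken, the last factor equals 1. The weight can be computed only if every factor is known.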

As Lemeshow and Stroh (1988, p. 11) point out, EPI procedures, with probability proportional to size (PPS) selection of primary sampling units, are not likely to produce self-weighting samples of children. Another way to state this is that the EPI procedures do not produce an equal probability sample of children. If PSUs were selected with probability proportional to the number (or estimated number) of children, or if the sampling target were a fixed number of households rather than a fixed number of children, the EPI samples would be closer to equal probability. However, the procedures used to select households within PSUs suggest that even these adjustments would not yield an equal-probability or self-weighting sample.
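A small worked example may help show why the choice of size measure matters. The sketch below makes an idealizing assumption that the rest of this section questions, namely that the seven children taken in a selected village are an equal-probability sample of that village's children. The village populations, child counts, and function name are hypothetical.

```python
def child_probability(n_psus, size, total_size, take, n_children):
    """Overall selection probability of a child in an idealized two-stage design."""
    p_village = n_psus * size / total_size       # PPS selection of the village
    if p_village > 1:
        raise ValueError("village too large for this number of PSUs")
    p_within = min(1.0, take / n_children)       # equal chance within the village
    return p_village * p_within


# Hypothetical villages: (total population, number of eligible children).
villages = [(400, 40), (400, 80), (1000, 100), (1000, 250), (600, 90), (800, 120)]
total_pop = sum(pop for pop, _ in villages)
total_children = sum(kids for _, kids in villages)
n_psus, take = 2, 7

# Size measure = total population: probabilities differ across villages.
by_population = [
    child_probability(n_psus, pop, total_pop, take, kids) for pop, kids in villages
]
# Size measure = number of children: probabilities are all equal.
by_children = [
    child_probability(n_psus, kids, total_children, take, kids) for _, kids in villages
]
print("PPS on population:", [round(p, 4) for p in by_population])
print("PPS on children:  ", [round(p, 4) for p in by_children])
```

With population as the size measure, a child's chance of selection depends on the share of children in his or her village; with the number of children as the size measure, the village-level and within-village factors cancel and every child has the same chance.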

My first concern is with the bottle-spinning method. This method selects a group or cluster of dwellings. Unless the village is laid out with dwelling units distributed along easily identifiable radii, it would be very difficult to say what chance any group of households has of being selected. The bottle method selects one out of several clusters. The method more or less satisfies Kish’s requirement of using a mechanical method. However, if the total number of clusters is unknown, then the probability of selection at this stage cannot be calculated. Further, unless each of the clusters can be clearly identified, there may be overlap, so some dwelling units will have higher probabilities of selection than others. If this is true, it seems unlikely that these differing probabilities will be known.
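A rough simulation illustrates how unequal these chances can be. It rests on assumptions of my choosing rather than on anything in the WHO manual: houses are points on a plane with the central location at the origin, a house is "along the line" if it lies within a hypothetical distance `band` of it, and a spin that passes no house is simply repeated. The five-house layout is invented for the purpose.

```python
import math
import random
from collections import Counter


def houses_along_ray(houses, angle, band):
    """Indices of houses within `band` of the ray from the centre, nearest first."""
    ux, uy = math.cos(angle), math.sin(angle)
    on_line = []
    for i, (x, y) in enumerate(houses):
        along = x * ux + y * uy            # distance from the centre along the ray
        across = abs(x * uy - y * ux)      # perpendicular distance from the ray
        if along > 0 and across <= band:
            on_line.append((along, i))
    return [i for _, i in sorted(on_line)]


def first_house_frequencies(houses, band, spins, rng):
    """Estimate each house's chance of being the first house selected."""
    counts = Counter()
    for _ in range(spins):
        line = houses_along_ray(houses, rng.uniform(0, 2 * math.pi), band)
        if not line:
            continue                       # no house on the line: spin again
        counts[line[rng.randint(1, len(line)) - 1]] += 1
    total = sum(counts.values())
    return [counts[i] / total for i in range(len(houses))]


# Hypothetical layout: one house near the centre to the east, one far to the
# north, and a row of three houses running west.
houses = [(1, 0), (0, 5), (-1, 0), (-2, 0), (-3, 0)]
freqs = first_house_frequencies(houses, band=0.5, spins=20000, rng=random.Random(7))
for house, freq in zip(houses, freqs):
    print(house, round(freq, 3))
```

In this layout the isolated house near the centre is chosen far more often than the house at the end of the western row, because nearby houses subtend a wider angle and houses sharing a line must split its chance. In a real village the corresponding probabilities could not be computed without a map of every dwelling, which is the information the method is meant to do without.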