Measuring the very long, fuzzy tail in the occupational distribution in web-surveys and the design implications for a look-up database

Kea Tijdens, University of Amsterdam, Amsterdam Institute for Advanced Labour Studies (AIAS), The Netherlands

Abstract

This study explores the requirements for a look-up table for the survey question ‘What is your occupation?’. In contrast to other survey modes, which predominantly apply an open format question with office-coding, web surveys allow for respondents’ self-coding by using a look-up table with occupational titles. Using a randomly sampled web survey (N=3,224) with a search tree with 1,614 titles, whereby the option ‘other’ with a text box was offered on each third level, 67% of respondents ticked an occupation and 32% used the text box. Approx. 400 titles were ticked more than once and this was sufficient to meet the demands of 91% of respondents. The challenge is to identify which 400 occupational titles should be in the look-up database, to be determined before a web survey starts. The remaining 9% used 257 titles, but given the 10,000s of job titles in the labour force, a sample size of 3,440 is too small to capture all occupations in the long tail of the distribution. The optimal size of a look-up table should therefore include as many rare occupational titles as possible. Analysis showed that occupations in oil, gas and food manufacturing were poorly represented in the database.

Keywords

Occupations, ISCO classification, look-up database, closed survey question, search tree

Acknowledgements

The author thanks the LISS panel of CentERdata in Tilburg for collecting the data used in this paper. She thanks the participants of the ESRA 2013 conference (17 July 2013, Ljubljana, Slovenia), the InGRID workshop New skills new jobs (11 February 2014, Amsterdam, Netherlands) and the GOR 14 conference (05-07 March 2014, University of Applied Sciences, Cologne, Germany). The author acknowledges the contribution of WEBDATANET, a European network for web-based data collection (COST Action IS1004). This paper builds on research conducted as part of the InGRID (Inclusive Growth Infrastructure Diffusion) project, which has received funding from the 7th Framework Program of the European Union [Contract no. 312691, 2013-17].

Introduction

Many surveys have one or more questions with thousands of response categories, the so-called long-list variables, such as occupation, industry, car brand, medical drugs, company name and the like. For these questions an open-ended format with office-coding is typically used, but this is expensive and time-consuming. Alternatively, closed format survey questions could be used, but for most survey modes the number of response categories is limited. In CATI and CAPI modes no more than 5 categories can be asked, because respondents will not memorize more categories. In PAPI a print survey can show at most 25 categories. In CAWI closed format questions offer new opportunities, because the number of response categories is not limited as it is in the other modes. For the survey design this requires a look-up database with all possible responses, a semantic matching tool to ease respondents’ search, and, if desired, a search tree allowing respondents to navigate through the database. Apart from avoiding coding costs, the advantages of a look-up table are that respondents understand what kind of answers the survey holder is looking for and that responses at various levels of aggregation are prevented. This paper explores the requirements for a look-up database for the ‘What is your occupation?’ survey question. Most socio-economic and health surveys include this question.

For several reasons a look-up database for occupations unfortunately cannot cover all possible responses. First, no country has a full registration of job titles, as is for example the case for medical drugs. Second, the stock of occupational titles is very large, as it may easily run into the 10,000s, and it is very dynamic with frequent entries and exits. Third, national labour forces are very unequally distributed over occupations, depicting a highly skewed distribution with a very long tail, which makes it challenging to measure the many rare occupations in the tail. Figure 1 illustrates this long tail. It shows how the Dutch labour force is distributed over three-digit occupational groups, coded according to ISCO-08. Note that ISCO is an abbreviation of International Standard Classification of Occupations, which was updated in 2008 and which is increasingly used as the global standard (Tijdens, 2014a). The Dutch labour force is coded according to 138 of the possible 193 three-digit occupational groups. Of these 138 groups, only 31 each include more than 1% of the labour force.

Figure 1 The distribution of the Dutch labour force 2011 over 138 3-digit ISCO-08 occupational groups

Source: CBS Statline, accessed 13 February 2015

Data and methods

Data

This paper uses the data of the LISS (Longitudinal Internet Studies for the Social Sciences) panel. LISS is a probability-based online panel in the Netherlands and consists of 5,000 households, comprising 9,219 individuals aged 16 and over (October 2009). The LISS panel is part of the MESS project (Measurement and Experimentation in the Social Sciences) and it is administered by CentERdata at Tilburg University, The Netherlands. The panel was drawn from the population register in collaboration with Statistics Netherlands. Even though the questionnaire is completed online, all people in the sample were recruited in traditional ways by letter, followed by a telephone call and/or house visit with an invitation to participate in the panel (for details about the recruitment: Scherpenzeel & Das, 2010; Scherpenzeel & Bethlehem, 2011). Households that could not otherwise participate have been provided with a computer and Internet connection.

Each month the panel members are asked to complete a web survey. In October 2009 the LISS panel was used for a study to gain further insight into bias in volunteer samples and to develop methods to adjust for survey bias. The Dutch version of the voluntary and continuous WageIndicator web survey on work and wages was completed by the LISS panel members. Full details of the results of the comparison of the LISS data with the WageIndicator data can be found in Steinmetz et al. (2014). This paper uses the data with regard to the survey question ‘What is your occupation?’. It does not compare the two datasets, but focusses solely on the data from the LISS panel.

In total 5,577 persons responded to this particular LISS survey, reflecting a response rate of 60.5% (Hootsen, 2010). Note that the monthly response of participants varies between 50 and 80%. For our study, only respondents in paid employment were asked about their occupation (3,444 respondents). Students, retired persons and other individuals not active in the labour market were assumed not to have a job title.

Respondents in the LISS survey could self-identify their occupation by using a compulsory 3-step search tree with a look-up database of 1,614 occupational titles, similar to the way it was used in the WageIndicator web survey at that time. Although today the semantic matching technique is widely used for searching a database, in 2009 this technique was not yet in use for this survey question. The search tree consisted of 23 entries in step 1 (for example ‘Guards, army, police’), 207 entries in step 2 (for example ‘Bodyguard’, ‘Police officer’), and 1,614 entries in step 3. In this step, some occupational titles were inserted in more than one place if the search paths were ambiguous. Note that all 1,614 occupational titles are coded according to the most recent ISCO-08 classification. The search tree used in the LISS panel was similar to the one that has been in use for many years, and still is, in the WageIndicator web survey and in the WageIndicator Salary Check. Given the millions of web visitors using this Salary Check and the very few complaints received by its web manager, we conclude that the quality of the search tree is sufficient. Note that the search tree does not follow the ISCO hierarchy, because ISCO is designed for classification purposes, not to facilitate respondents’ self-coding. At the request of the author, one additional feature was added to the search tree in the survey of the LISS panel. At the bottom of each 3rd step in the search tree an option ‘other’ with a subsequent text box was included.
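The 3-step search tree with an ‘other’ option can be pictured with a small data structure. The following is a minimal sketch only, not the survey's actual implementation: the tree fragment, the step-3 titles, the attached ISCO-08 codes and the function name `self_code` are all assumptions made for illustration.

```python
OTHER = "other"

# Hypothetical fragment of a 3-step tree: step 1 -> step 2 -> {title: ISCO-08 code}.
# The real tree had 23, 207 and 1,614 entries; the titles and codes here are
# illustrative assumptions, not taken from the survey's database.
SEARCH_TREE = {
    "Guards, army, police": {
        "Police officer": {"Police officer": "5412"},
        "Bodyguard": {"Bodyguard": "5414"},
    },
}


def self_code(tree, step1, step2, step3, free_text=None):
    """Return one respondent's answer as a record.

    Ticking a title yields its code directly (self-coding). Ticking 'other'
    at the third step stores the free text for later manual office-coding.
    """
    titles = tree[step1][step2]
    if step3 == OTHER:
        return {"path": (step1, step2), "title": None,
                "isco08": None, "text": (free_text or "").strip()}
    return {"path": (step1, step2), "title": step3,
            "isco08": titles[step3], "text": None}


ticked = self_code(SEARCH_TREE, "Guards, army, police", "Bodyguard", "Bodyguard")
typed = self_code(SEARCH_TREE, "Guards, army, police", "Police officer",
                  OTHER, free_text=" dog handler ")
print(ticked["isco08"], repr(typed["text"]))
```

Keeping the step-1 and step-2 path alongside the free text is a design choice of this sketch: it would let an analyst see in which branch of the tree respondents gave up searching.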

Research objectives

The research objectives of our study are threefold. First, what proportion of respondents identified their occupation via the search tree and what proportion used the text box? If the text box was used, what proportion could have identified their occupation in the search tree and what proportion had an occupation which was absent in the search tree? Second, is the use of the text box related to respondents’ personal characteristics, to respondents’ industries, or to the design of the steps in the search tree? Third, how many of the 1,614 occupational titles were used by respondents, and how many of these were ticked only once? Hence, how can the nature of the long tail best be described and to what extent should a search tree’s database include the long tail of occupations? In other words, what is the optimal size of the look-up database?

Methods

For the purpose of this study, the author coded all text box responses manually and identified whether the coded occupation was actually available in the look-up database or not. Descriptive statistics were used to analyse the first research objective. For the second objective, the likelihood of ticking ‘other’ in the search tree was modelled using explanatory variables for personal characteristics (age, gender and waged employment), for the industry as ticked in the survey, and for the design of the search tree (the 23 items of the first level in the tree). For the third objective the distributional characteristics of the occupation database were used. Time stamps or other para-data have not been used in the analyses here. Drop-out rates in this search tree have been analysed elsewhere (Tijdens, 2014b).

Results

Use of the search tree and the text box

The first objective aims to present descriptive statistics about the use of the search tree and the text box. Table 1 shows that 32% of the respondents ticked ‘other’ and entered their occupation in the text box. After coding these text strings, it turned out that 14% could have identified their occupation, but had not searched long enough. Another 17% expressed an occupation which was indeed absent in the database.

Table 1 Distribution over answer categories

Answer category / N / %
Initial respondents / 3,444 / 100%
Identified occupation in 3rd step / 2,313 / 67%
Ticked ‘other’ / 1,113 / 32%
… of which could have identified occupation / 497 / 14%
… of which occupation was absent in search tree / 600 / 17%
… of which unidentifiable text / 16 / 1%
Drop out during survey completion / 18 / 1%

Source: WageIndicator Questionnaire administered to the LISS panel, October 2009
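The counts in Table 1 can be checked with a few lines of arithmetic. The sketch below uses only the counts reported in the table; the variable and function names are illustrative. It also derives one figure the table does not show directly: the share of text-box users whose occupation was in the search tree after all.

```python
# Counts as reported in Table 1.
TOTAL = 3444
top_level = {
    "identified in 3rd step": 2313,
    "ticked 'other'": 1113,
    "dropped out": 18,
}
other_breakdown = {
    "could have identified": 497,
    "absent in search tree": 600,
    "unidentifiable text": 16,
}


def share(n, base):
    """Percentage of `base`, rounded to one decimal."""
    return round(100 * n / base, 1)


# Both levels of the table add up to their parent row.
assert sum(top_level.values()) == TOTAL
assert sum(other_breakdown.values()) == top_level["ticked 'other'"]

for label, n in {**top_level, **other_breakdown}.items():
    print(f"{label}: {share(n, TOTAL)}% of all respondents")

# Share of text-box users whose title was available in the tree.
missed = share(other_breakdown["could have identified"], top_level["ticked 'other'"])
print(f"title was in the tree for {missed}% of text-box users")
```

Read this way, somewhat under half of the text-box entries reflect respondents who stopped searching too early rather than gaps in the database.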

Who uses the text box?

The second objective is to explore whether the use of the text box is related to personal characteristics, to industries, or to the design of the steps in the search tree. Table 2 shows that of the three personal characteristics, only age and gender contribute significantly to the explanation in the first model, whereas in the third model only age matters: the odds of using the text box increase by 1% for every year of age. When including the industries, the results show that the odds of using the text box are higher for workers in manufacturing and in commercial activities. This finding is largely confirmed when including the steps of the search tree: the odds of using the text box are more than 12 times higher when respondents in the first step selected ‘Oil, gas, mining, utilities’, more than 6 times higher when they selected ‘Food manufacturing’ and more than 2 times higher when they selected ‘Media, graphic, printing, culture, design’, ‘Legal, administration, inspection, policy adviser’, ‘Industrial production, manufacture, metal’, or ‘Cars, mechanics, technicians, engineers’. These results indicate that the look-up database is not sufficiently detailed to reflect the most frequent occupations in these fields.

Table 2 Odds ratios for ticking ‘other’

Variable / M1: Exp(B) / S.E. / Sig. / M2: Exp(B) / S.E. / Sig. / M3: Exp(B) / S.E. / Sig.
Personal characteristics (included in M1, M2 and M3)
In waged employment / 0.90 / 0.09 / / 0.89 / 0.09 / / 0.88 / 0.09 /
Female / 0.84 / 0.07 / ** / 0.87 / 0.08 / * / 1.04 / 0.09 /
Age (16-64) / 1.01 / 0.00 / *** / 1.01 / 0.00 / *** / 1.01 / 0.00 / ***
Industry (included in M2 only: Exp(B) / S.E. / Sig.)
Manufacturing / 1.56 / 0.27 / *
Electricity / 1.72 / 0.41
Construction / 0.99 / 0.30
Wholesale, retail / 1.32 / 0.27
Transport / 1.04 / 0.29
Accommodation / 0.98 / 0.30
ICT / 1.31 / 0.33
Finance / 1.19 / 0.29
Commercial activities / 2.20 / 0.29 / ***
Public administration / 1.31 / 0.29
Education / 1.05 / 0.27
Health care / 1.12 / 0.26
Entertainment / 1.08 / 0.31
Search tree step 1 (included in M3 only: Exp(B) / S.E. / Sig.)
Management, direction / 1.02 / 0.32
Oil, gas, mining, utilities / 12.77 / 0.53 / ***
Media, graphic, printing, culture, design / 2.06 / 0.28 / **
Marketing, PR, advertising / 1.76 / 0.36
Legal, administration, inspection, policy adviser / 2.28 / 0.25 / ***
Language, library, archive, museum / 1.50 / 0.46
IT, automation, telecommunication / 1.87 / 0.25 / **
Industrial production, manufacture, metal / 2.18 / 0.23 / ***
HRM Food manufacturing, labour intermediary, organisation / 1.28 / 0.36
Hospitality, tourism, leisure, sports / 1.07 / 0.25
Health care, paramedics, laboratory / 1.00 / 0.22
Guards, army, police / 0.67 / 0.36
Food manufacturing / 6.51 / 0.35 / ***
Finance, banking, insurance / 1.10 / 0.24
Education, research, training / 1.16 / 0.22
Construction, fittings, housing / 1.58 / 0.25 / *
Commercial, shop, buy and sale / 1.04 / 0.22
Clerks, secretaries, post, telephone / 0.66 / 0.26
Cleaning, housekeeping, garbage, waste / 0.73 / 0.31
Cars, mechanics, technicians, engineers / 2.60 / 0.26 / ***
Care, children, welfare, social work / 1.75 / 0.23 / **
Agriculture, nature, animals, environment / 1.98 / 0.28 / **
Constant / 0.41 / 0.14 / *** / 0.33 / 0.28 / *** / 0.26 / 0.23 / ***
Chi-square (value / df / Sig.) / 15.56 / 3 / *** / 46.33 / 16 / *** / 168.84 / 25 / ***
-2 Log likelihood / 4335.88 / 4305.12 / 4182.61

Source: WageIndicator Questionnaire administered to the LISS panel, October 2009, N=3,440
*** p<0.01; ** p<0.05; * p<0.10
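The Exp(B) values in Table 2 are multipliers on the odds, not on the probability, of ticking ‘other’. The following sketch shows how such multipliers compound and how they translate into probabilities. The function names and the baseline probability of 0.25 are hypothetical and chosen only for illustration; they are not estimates from the models.

```python
def compound_odds_ratio(or_per_unit, units):
    """Odds multiplier after `units` steps of a per-unit odds ratio."""
    return or_per_unit ** units


def shift_probability(p_baseline, odds_ratio):
    """Apply an odds ratio to a baseline probability and return the new probability."""
    odds = p_baseline / (1 - p_baseline)
    new_odds = odds * odds_ratio
    return new_odds / (1 + new_odds)


# Age: Exp(B) = 1.01 per year, so a 30-year age gap multiplies the odds by about 1.35.
print(round(compound_odds_ratio(1.01, 30), 2))

# 'Oil, gas, mining, utilities': Exp(B) = 12.77 in M3. With an assumed
# (hypothetical) baseline probability of 0.25 the probability rises to about 0.81.
print(round(shift_probability(0.25, 12.77), 2))
```

The second example illustrates why an odds ratio of roughly 12 does not mean a twelvefold probability: a probability cannot exceed 1, so the effect on the probability scale depends on the baseline.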

Quality of the look-up database

The third objective is to explore the quality of the look-up database. How many of the 1,614 occupational titles were used by the respondents, and how many of these were ticked only once? Hence, how can the nature of the long tail best be described and to what extent should a look-up database include the long tail of occupations? To phrase it differently, what is the optimal size of an occupation look-up database?

Figure 2 shows the distribution of the respondents over the occupations, covering the 2,313 respondents who ticked an occupation plus the 497 who ticked ‘other’ but who could have identified their occupation in the search tree. The figure reveals that 3.9% of respondents are an ‘Office clerk’, followed by 3.0% who are a ‘Primary school teacher’. Number six in the figure is the ‘Logistics worker’, encompassing 1.1% of the respondents. All other occupations are ticked by less than 1% of the respondents.

The 2,313 respondents who could identify their occupation used 584 of the 1,614 titles in the database. The 497 respondents who completed the text box but could have identified their occupational title used 207 titles from the database, of which 139 were also selected in the group of 2,313 respondents. Jointly these two groups of respondents ticked 652 titles. Only 55 titles were selected by at least 10 respondents, another 214 by 3 to 9 respondents and 126 titles were selected by only 2 respondents. Together these 395 titles were ticked by 91% of the respondents. The remaining 9% selected 257 titles, reflecting the long tail of the distribution. In total 962 of the 1,614 titles in the look-up database were not selected in our survey.

Figure 2 The distribution of 2,810 respondents over 652 coded occupational titles

Source: WageIndicator Questionnaire administered to the LISS panel, October 2009

The 600 respondents who completed the text box and could not have identified their job title in the search tree used 555 job titles in total, after cleaning for misspellings and harmonization of gendered job titles. When defining the threshold for a long tail as those occupations with at most two respondents, in our case the tail consists of 938 titles out of the 1,207 titles used and the 1,614 titles presented in the search tree.
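The title counts reported in this section can be reconciled with a short calculation. The sketch below first defines a generic helper, `tail_summary` (a hypothetical name), that summarises a tally of ticks per title, and then re-derives the reported totals from the reported frequency bands. Counting all 555 text-box titles as part of the tail is an inference from the reported total of 938, not something stated explicitly in the text.

```python
from collections import Counter


def tail_summary(ticks, tail_max=2):
    """Summarise a title -> respondent-count tally with a long-tail threshold."""
    tail = [n for n in ticks.values() if n <= tail_max]
    return {
        "titles": len(ticks),
        "respondents": sum(ticks.values()),
        "tail_titles": len(tail),
        "tail_respondents": sum(tail),
    }


# Toy tally (made-up counts) to show the helper at work.
toy = Counter({"Office clerk": 5, "Primary school teacher": 3,
               "Logistics worker": 2, "Bodyguard": 1})
print(tail_summary(toy))

# Titles per frequency band, as reported for the 2,810 respondents.
bands = {"10 or more": 55, "3 to 9": 214, "exactly 2": 126, "exactly 1": 257}
ticked_titles = sum(bands.values())               # titles ticked in the database
more_than_once = ticked_titles - bands["exactly 1"]
unused_titles = 1614 - ticked_titles              # database titles never ticked
single_share = 100 * bands["exactly 1"] / 2810    # each such title = one respondent

text_box_titles = 555                             # titles absent from the database
titles_used = ticked_titles + text_box_titles
# Assumption: every text-box title counts as a tail title.
tail_titles = bands["exactly 2"] + bands["exactly 1"] + text_box_titles

print(ticked_titles, more_than_once, unused_titles,
      round(single_share, 1), titles_used, tail_titles)
```

Under these assumptions the calculation returns the 652, 962, 1,207 and 938 titles given in the text, and shows that the titles ticked only once account for about 9% of the 2,810 respondents.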

Conclusion/Discussion

This study aimed to explore the requirements for a look-up table for the survey question ‘What is your occupation?’. In contrast to other survey modes, which predominantly apply an open format question with office-coding, web surveys allow for respondents’ self-coding by using a look-up table with occupational titles, all of which are coded according to a classification. Respondents can search the table by text string matching and/or by search trees. In our survey 67% of respondents ticked an occupation. Compared to an open question, the current look-up database therefore reduced the coding workload by two-thirds.

Given the 10,000s of occupational titles and the very long tail in the distribution, it is obvious that a look-up table should at least include the occupational titles that are ticked more than once in a survey. In our study this amounts to approx. 400 occupational titles, which is sufficient to meet the demands of approx. 90% in our sample of 3,224 individuals in the labour force. A major challenge here is to identify which 400 occupational titles should be in the look-up database, to be determined before a web survey starts.

The next challenge is to code as many rare occupational titles as possible in order to meet the demands of the remaining 10% in the sample. This requires the look-up table of occupations to be as large as possible. This is basically the answer to the third objective about the optimal size of the look-up database. However, given the 10,000s of job titles, a sample size of 3,440 is far too small to capture all occupations in the long tail of the distribution.
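Why a sample of this size cannot capture the tail can be illustrated with a small simulation. This is a sketch under strong assumptions that are not taken from the data: a stock of exactly 10,000 job titles whose frequencies follow a Zipf-like pattern (the title at rank r has weight 1/r), and independent random draws. The function name is illustrative.

```python
import random


def distinct_titles_seen(n_titles, sample_size, seed=0):
    """Draw a sample from a Zipf-like title distribution and count distinct titles."""
    rng = random.Random(seed)
    # Assumed popularity pattern: weight 1/rank for the title at each rank.
    weights = [1 / rank for rank in range(1, n_titles + 1)]
    draws = rng.choices(range(n_titles), weights=weights, k=sample_size)
    return len(set(draws))


seen = distinct_titles_seen(n_titles=10_000, sample_size=3_440)
print(f"{seen} of 10,000 hypothetical titles observed in 3,440 draws")
```

Whatever the exact shape of the real distribution, the number of distinct titles observed can never exceed the sample size, and with a skewed distribution it falls well below it because frequent titles are drawn repeatedly.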

Older respondents have more difficulty identifying their occupations in the database and are more likely to tick ‘other’ and use the text box. Our analysis also showed that occupations in oil, gas and food manufacturing were poorly represented in the database.

Over time and across countries the stock of job titles is very dynamic, which requires regular updating of the database. This seems only possible if multiple web surveys use the same occupation database and all add their manually coded occupations to the database.

References

Hootsen, J. (2010). WageIndicator Questionnaire administered to the LISS panel. Tilburg: CentERdata.

Scherpenzeel, A. & Das, M. (2010). True longitudinal and probability-based Internet panels: Evidence from the Netherlands. In: Das, M., Ester, P. & Kaczmirek, L. (eds.) Social and Behavioral Research and the Internet: Advances in Applied Methods and New Research Strategies. Boca Raton: Taylor & Francis.

Scherpenzeel, A. & Bethlehem, J. (2011). How representative are online panels? Problems of coverage and selection and possible solutions. In: Das, M., Ester, P. & Kaczmirek, L. (eds.) Social and Behavioral Research and the Internet: Advances in Applied Methods and New Research Strategies. Boca Raton: Taylor & Francis, pp. 105-132.

Steinmetz, S., Bianchi, A., Tijdens, K.G. & Biffignandi, S. (2014). Improving web survey quality – Potentials and constraints of propensity score adjustments. In: Callegaro, M., Baker, R., Bethlehem, J., Goritz, A., Krosnick, J. & Lavrakas, P. (eds.) Online Panel Research: A Data Quality Perspective. Chichester: Wiley, pp. 273-298.

Tijdens, K.G. (2014a). Reviewing the measurement and comparison of occupations across Europe. Amsterdam: University of Amsterdam, AIAS Working Paper 149.

Tijdens, K.G. (2014b). Drop-out rates during completion of an occupation search tree in web-surveys. Journal of Official Statistics, 30 (1), 23–43.