WP1
Virtual sprint 29-30 September
We have three sources of data:
Unemployment Agency Job Portal (PB) 2012 – June 2016.
Business register (BR)
Job vacancy survey (JVS) 2012 - June 2016 (quarterly data)
First problem
PB contains the organization number for both private and public sector employers, so organizations can be linked to BR at business level (but not at place-of-work level). We also plan to collect data from privately run job portals, which do not hold organization numbers for the advertising businesses. We therefore test linking the PB data to BR using other variables, for example municipality, postal code and name of enterprise, and then use the organization number to evaluate the matching.
Businesses that are linked to BR can also be linked to JVS.
Second problem
Both BR and JVS have an identification number for the place of work (for the private sector). Matching on geographical data and name of enterprise may give more information about the place of work for the PB data as well.
Third problem
Linking to JVS also includes matching on the reference period. JVS asks for the number of vacant jobs on a Wednesday in the middle of the month. Businesses with more than 100 employees answer for all months in a quarter. Smaller businesses answer for one month in a quarter. The period during which an advertisement appears in the PB data needs to be calculated and matched with the reference dates for the JVS data. We can then compare the number of vacancies reported in JVS to the number of advertised jobs on the reference day, at business level.
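The reference-day comparison described above can be sketched as follows. This is a minimal illustration in Python, not the project's implementation. It assumes that "a Wednesday in the middle of the month" means the Wednesday closest to the 15th, and the record fields (`org_nr`, `start`, `end`, `positions`) are hypothetical names for an advert's organization number, publication period and number of advertised jobs.

```python
from datetime import date, timedelta


def mid_month_wednesday(year, month):
    """Wednesday closest to the 15th (an assumed reading of the JVS reference day)."""
    mid = date(year, month, 15)
    # date.weekday(): Monday is 0, Wednesday is 2; the shift lies between -3 and +3 days
    shift = (2 - mid.weekday() + 3) % 7 - 3
    return mid + timedelta(days=shift)


def advertised_jobs(adverts, org_nr, ref_date):
    """Sum the positions of one business's adverts that are live on ref_date."""
    return sum(
        ad["positions"]
        for ad in adverts
        if ad["org_nr"] == org_nr and ad["start"] <= ref_date <= ad["end"]
    )


# Hypothetical advert records for one business (placeholder values)
adverts = [
    {"org_nr": "A", "start": date(2016, 6, 1), "end": date(2016, 6, 30), "positions": 2},
    {"org_nr": "A", "start": date(2016, 6, 20), "end": date(2016, 7, 10), "positions": 1},
]

ref = mid_month_wednesday(2016, 6)
print(ref, advertised_jobs(adverts, "A", ref))
```

The count returned for a business and reference day is the figure that would be set against the number of vacancies the same business reported in JVS for that day.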
Methodology
Use the geographic information available in PB, postal code and municipality, to restrict the list of enterprise names from BR that each advert is compared against. The problem is that many advert records lack geographic information, in which case the comparison list contains around 1.8 million enterprise names. The data are stored in an MS database, and SQL statements are used to retrieve the BR comparison list.
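The restriction step can be sketched as a query that filters on whichever geographic fields an advert actually has and falls back to the full name list when it has none. The example below is only an illustration: it uses Python's built-in `sqlite3` as a stand-in for the MS database, and the table `br`, its columns and the rows in it are hypothetical placeholders.

```python
import sqlite3


def candidate_names(conn, postal_code=None, municipality=None):
    """Return BR names restricted by the geographic fields present in the advert."""
    clauses, params = [], []
    if postal_code:
        clauses.append("postal_code = ?")
        params.append(postal_code)
    if municipality:
        clauses.append("municipality = ?")
        params.append(municipality)
    sql = "SELECT name FROM br"
    if clauses:
        # Parameterized filter; with no geographic fields the whole list is returned
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY name"
    return [row[0] for row in conn.execute(sql, params)]


# Toy in-memory register; identifiers and postal codes are placeholders
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE br (org_nr TEXT, name TEXT, postal_code TEXT, municipality TEXT)"
)
conn.executemany(
    "INSERT INTO br VALUES (?, ?, ?, ?)",
    [
        ("org-1", "NORDEA BANK AB", "00001", "A"),
        ("org-2", "Swedbank AB", "00002", "B"),
        ("org-3", "FOLKSAM HÄLSA AB", "00001", "A"),
    ],
)

print(candidate_names(conn, postal_code="00001"))
print(len(candidate_names(conn)))
```

Combining the available fields with AND keeps the list as short as possible, at the cost of missing the correct enterprise whenever the advert's postal code or municipality differs from the registered one.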
Compare each enterprise name from PB with the retrieved BR list using several fuzzy string matching methods: restricted Damerau-Levenshtein distance (osa), Levenshtein distance (lv), full Damerau-Levenshtein distance (dl), longest common substring distance (lcs)[1], and Jaro-Winkler distance (jw)[2]. The R package ‘stringdist’ is used. Note that the lcs distance in ‘stringdist’ counts the characters left unpaired by the longest common subsequence of the two strings, which is the problem described in [1].
We tested the method on six records from PB, comparing each record with the BR list to identify the matches.
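To make the distances concrete, here is a small Python sketch of three of the five measures (lv, lcs and jw). It is an illustration only and not the ‘stringdist’ code used in the project; osa and dl are left out for brevity. Two assumptions are made because they reproduce the scores reported for the example pairs checked here: names are upper-cased before comparison, and jw is computed as the plain Jaro distance, that is, without any Winkler prefix bonus.

```python
def norm(name):
    """Upper-case a name so that comparison ignores case (assumed preprocessing)."""
    return name.upper()


def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def lcs_distance(a, b):
    """Characters left unpaired by the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return len(a) + len(b) - 2 * prev[-1]


def jaro_distance(a, b):
    """One minus the Jaro similarity (no prefix bonus)."""
    if a == b:
        return 0.0
    if not a or not b:
        return 1.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_hit, b_hit = [False] * len(a), [False] * len(b)
    m = 0
    for i, ca in enumerate(a):
        # Pair a[i] with the first unused equal character of b inside the window
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_hit[j] and b[j] == ca:
                a_hit[i] = b_hit[j] = True
                m += 1
                break
    if m == 0:
        return 1.0
    a_seq = [c for c, hit in zip(a, a_hit) if hit]
    b_seq = [c for c, hit in zip(b, b_hit) if hit]
    t = sum(x != y for x, y in zip(a_seq, b_seq)) / 2  # transpositions
    return 1 - (m / len(a) + m / len(b) + (m - t) / m) / 3


pb, br = norm("Swedbank Telefonbanken"), norm("Swedbank AB")
print(levenshtein(pb, br), lcs_distance(pb, br), round(jaro_distance(pb, br), 7))
```

For this pair the sketch gives 12, 13 and 0.1969697, which agrees with the lv, lcs and jw scores reported for "Swedbank Telefonbanken" in the results.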
Result
NameInPB / NameInBR / Method osa / Method lv / Method dl / Method lcs / Method jw / Type
Folksam / FOLKSAM ÖMSESIDIG SAKFÖRSÄKRING / 24 / 24 / 24 / 24 / 0.2580645 / 3
Nordea Bank AB / NORDEA BANK AB / 0 / 0 / 0 / 0 / 0 / 1
IF Skadeförsäkring / IF SKADEFÖRSÄKRING AB (PUBL) / 10 / 10 / 10 / 10 / 0.1190476 / 2
KUNDKONTOR / SKANDIABANKEN AKTIEBOLAG (PUBL) / 24 / 24 / 24 / 27 / 0.4056836 / 4
Swedbank Stockholmsregionen / Swedbank AB / 18 / 18 / 18 / 20 / 0.2828283 / 2
Swedbank Telefonbanken / Swedbank AB / 12 / 12 / 12 / 13 / 0.1969697 / 1
The table above summarizes the comparison results. Each row gives the distance scores between two names, one from PB and the corresponding correct match in BR. The scores marked in yellow (highlighting in the original table) are those of the methods that returned the correct match. The examples cover four types of scenario, numbered 1 to 4. Type 1 is the best scenario, where more than one method returned the correct match. Type 4 is the worst scenario, where the correct match was not included in the comparison list, so no method can identify it. The detailed comparison results are presented below.
“Folksam”: the true match is not among the 10 best matches of any method. The top 10 matches according to method jw are presented below; jw returned some results that are closer to the true BR name than the other methods did. Methods osa, lv, dl and lcs returned very similar results. Many false matches scored better than the true match.
name / osa / lv / dl / lcs / jw
1: FOSAM / 2 / 2 / 2 / 2 / 0.0952381
2: FTF FOLKSAM / 4 / 4 / 4 / 4 / 0.1212121
3: FLOKS / 3 / 4 / 3 / 4 / 0.1619048
4: Folkhus AB / 4 / 4 / 4 / 5 / 0.1809524
5: FOLKSAM HÄLSA AB / 9 / 9 / 9 / 9 / 0.1875000
6: FOLKKUNSKAP / 5 / 5 / 5 / 6 / 0.1991342
7: Folusz, Kamil / 7 / 7 / 7 / 8 / 0.2014652
8: FOLKESON, ANN MARI / 11 / 11 / 11 / 11 / 0.2037037
9: FOLKSAM LO SVERIGE / 11 / 11 / 11 / 11 / 0.2037037
10: FOLKSAM LO VÄRLDEN / 11 / 11 / 11 / 11 / 0.2037037
“Nordea Bank AB”: the top 10 matches according to method jw are presented below. All five methods identified the correct match.
name / osa / lv / dl / lcs / jw
1: NORDEA BANK AB / 0 / 0 / 0 / 0 / 0.00000000
2: NORDEA BANK S.A. / 3 / 3 / 3 / 4 / 0.08630952
3: NORDEA BANK FINLAND / 6 / 6 / 6 / 7 / 0.12907268
4: Nordnet Bank AB / 2 / 2 / 2 / 3 / 0.13235653
5: NORDEA BANK NORGE ASA / 8 / 8 / 8 / 9 / 0.15079365
6: NORDIC EGG BANK AB / 5 / 5 / 5 / 6 / 0.15486365
7: ANDERS BRAUN AB / 8 / 8 / 8 / 9 / 0.15595238
8: Nordeb AB / 5 / 5 / 5 / 5 / 0.15608466
9: BORENA TRADING AB / 8 / 8 / 8 / 11 / 0.16634346
10: LogTrade BarLink AB / 8 / 8 / 8 / 9 / 0.16753422
“IF Skadeförsäkring”: method jw returned the correct match. The 10 best matches according to jw are presented below.
name / osa / lv / dl / lcs / jw
1: IF SKADEFÖRSÄKRING AB (PUBL) / 10 / 10 / 10 / 10 / 0.1190476
2: IF SKADEFÖRSÄKRING HOLDING AB (PUBL) / 18 / 18 / 18 / 18 / 0.1666667
3: SANDVIK FÖRSÄKRINGS AB / 12 / 12 / 12 / 14 / 0.1904461
4: SVENSK FÖRSÄKRING / 7 / 7 / 7 / 11 / 0.2043262
5: FRYKSÄNDE FÖRSAMLING / 10 / 10 / 10 / 16 / 0.2078704
6: Bliwa Skadeförsäkring AB (publ) / 14 / 14 / 14 / 15 / 0.2082718
7: FRISTADS VÄGFÖRENING / 13 / 13 / 13 / 18 / 0.2182870
8: FORS FÖRSAMLING / 10 / 10 / 10 / 15 / 0.2194444
9: EDEFORS FÖRSAMLING / 11 / 11 / 11 / 18 / 0.2314815
10: DOKKAS FRITIDSFÖRENING / 15 / 15 / 15 / 22 / 0.2315310
“KUNDKONTOR”: the 10 best matches according to method jw are presented below. None of the methods can identify the correct match, because it was not in the comparison list, which was retrieved using the postal code given for “KUNDKONTOR” in the advert. According to the organization number, the correct enterprise is registered under a different postal code. In this scenario the name in the advert (“KUNDKONTOR”, Swedish for customer office) is totally different from the BR name; it may be a workplace that uses the parent company’s organization number. Since the name of the workplace does not resemble the BR name at all, the matching methods do not help in this case.
name / osa / lv / dl / lcs / jw
1: Bostaden Projektutveckling i Linköping AB / 36 / 36 / 36 / 41 / 0.4020325
2: HANDELSBANKENS KONSTFÖRENING / 21 / 21 / 21 / 24 / 0.4089286
3: LKG SVENSKA AKTIEBOLAG / 17 / 17 / 17 / 22 / 0.4242424
4: DIAKONIKRETSEN I DOMKYRKOFÖRSAMLINGEN / 31 / 31 / 31 / 33 / 0.4337337
5: HANDELSBOLAGET FYRTORNET 5 / 20 / 20 / 20 / 24 / 0.4452991
6: LINKÖPINGS DOMKYRKOFÖRSAMLING / 23 / 23 / 23 / 27 / 0.4538793
7: STIFTELSEN DOMKYRKOMUSIKEN I LINKÖPING / 33 / 33 / 33 / 38 / 0.4548246
8: Vissland Invest AB / 14 / 14 / 14 / 20 / 0.4592593
9: Auktionshuset Gomér & Andersson AB / 28 / 28 / 28 / 30 / 0.4599440
10: Vissland Holding AB / 15 / 15 / 15 / 21 / 0.4631579
“Swedbank Stockholmsregionen”: method lcs returned the correct match. The top 10 matches according to lcs are presented below.
name / osa / lv / dl / lcs / jw
1: Swedbank AB / 18 / 18 / 18 / 20 / 0.2828283
2: SWEDBANK ROBUR SKOGSFOND / 14 / 14 / 14 / 21 / 0.2500000
3: SWEDBANK ROBUR MIXFOND / 14 / 14 / 14 / 21 / 0.2875421
4: Swedbank Babs Holding AB / 14 / 14 / 14 / 21 / 0.2989969
5: Swedbank Försäkring AB / 17 / 17 / 17 / 21 / 0.3208754
6: SWEDBANK ROBUR TECHNOLOGY / 16 / 16 / 16 / 22 / 0.2280864
7: SWEDBANK ROBUR HOCKEYFOND / 14 / 14 / 14 / 22 / 0.2600000
8: Swedbank Hypotek AB / 16 / 16 / 16 / 22 / 0.2958322
9: SWEDBANK ROBUR MEDICA / 14 / 14 / 14 / 22 / 0.2989418
10: SWEDBANK ROBUR KINAFOND / 15 / 15 / 15 / 22 / 0.2997517
“Swedbank Telefonbanken”: methods lcs and jw both identified the correct match. The two tables below present the top 10 matches according to jw, followed by the top 10 according to lcs.
name / osa / lv / dl / lcs / jw
1: Swedbank AB / 12 / 12 / 12 / 13 / 0.1969697
2: Swedbank Finans Aktiebolag / 16 / 16 / 16 / 18 / 0.2229174
3: SWEDBANK ROBUR TALENTEN AKTIEFOND MEGA / 22 / 22 / 22 / 26 / 0.2236842
4: SWEDBANK ROBUR REALRÄNTEFOND / 15 / 15 / 15 / 20 / 0.2315448
5: Swedbank Hypotek AB / 12 / 12 / 12 / 17 / 0.2317916
6: SWEDBANK ROBUR AKTIEFOND PENSION / 17 / 17 / 17 / 22 / 0.2386364
7: SWEDBANK ROBUR TALENTEN MIXFOND MEGA / 22 / 22 / 22 / 27 / 0.2465972
8: Swedbank Robur AB / 11 / 11 / 11 / 15 / 0.2495544
9: SWEDBANK ROBUR FINANSFOND / 12 / 12 / 12 / 19 / 0.2525758
10: SWEDBANK ROBUR KINAFOND / 12 / 13 / 12 / 19 / 0.2548584
name / osa / lv / dl / lcs / jw
1: Swedbank AB / 12 / 12 / 12 / 13 / 0.1969697
2: Swedbank Robur AB / 11 / 11 / 11 / 15 / 0.2495544
3: Swedbank Hypotek AB / 12 / 12 / 12 / 17 / 0.2317916
4: REALRÄNTEFONDEN / 14 / 14 / 14 / 17 / 0.3154040
5: Swedbank Finans Aktiebolag / 16 / 16 / 16 / 18 / 0.2229174
6: SWEDBANK ROBUR ASIENFOND / 13 / 13 / 13 / 18 / 0.2645202
7: SWEDBANK ROBUR NY TEKNIK / 12 / 13 / 12 / 18 / 0.2749369
8: SWEDBANK ROBUR JAPANFOND / 12 / 12 / 12 / 18 / 0.2866162
9: SWEDBANK ROBUR FINANSFOND / 12 / 12 / 12 / 19 / 0.2525758
10: SWEDBANK ROBUR KINAFOND / 12 / 13 / 12 / 19 / 0.2548584
Conclusions
The way enterprise names are written in adverts largely determines the matching result. In BR, the names are usually the full registered names, whereas in adverts the presentation of enterprise names varies greatly. Sometimes the names in adverts are shorter than the names in BR. The worst scenario is a workplace that uses a name totally different from the registered name.
Four types of match scenario are identified in our examples. In type 1, the enterprise name in the advert closely resembles the BR name, and more than one method identifies the same, correct best match. This is the best scenario, and agreement between the methods can be used as a check that the match is correct.
In the type 2 scenario, the enterprise names are quite similar to the registered BR names. One of the methods used returned the correct match.
In the type 3 scenario, the correct match is not among the top results of any method used, which means that the enterprise names used in the adverts differ more from the BR names. Further work is needed to bring in other features for matching names in adverts to BR.
Type 4 is the worst scenario, where a workplace uses the organization number of the parent enterprise in the adverts but a name that does not resemble the enterprise name registered in BR at all. A more complex model that includes other features needs to be developed to identify the match.
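The four scenario types can be expressed as a simple rule on the comparison scores, sketched below in Python. This is one possible reading of the definitions rather than a procedure taken from the project: a method is taken to identify the correct match only when the true BR name has the strictly lowest distance for that method. The function and variable names are hypothetical, and the score dictionaries contain only a few rows excerpted from the top-10 lists in the results, not the full comparison lists.

```python
METHODS = ("osa", "lv", "dl", "lcs", "jw")


def scenario_type(true_name, scores):
    """Classify one advert; scores maps each compared BR name to its five
    distances, given in METHODS order."""
    if true_name not in scores:
        return 4  # the correct match is missing from the comparison list
    hits = 0
    for k in range(len(METHODS)):
        best = min(s[k] for s in scores.values())
        winners = [name for name, s in scores.items() if s[k] == best]
        if winners == [true_name]:  # strictly best for this method
            hits += 1
    if hits > 1:
        return 1  # several methods agree on the correct match
    return 2 if hits == 1 else 3


# Excerpt of the "Swedbank Telefonbanken" scores
telefonbanken = {
    "Swedbank AB": (12, 12, 12, 13, 0.1969697),
    "Swedbank Robur AB": (11, 11, 11, 15, 0.2495544),
    "Swedbank Hypotek AB": (12, 12, 12, 17, 0.2317916),
}
# Excerpt of the "Folksam" scores
folksam = {
    "FOLKSAM ÖMSESIDIG SAKFÖRSÄKRING": (24, 24, 24, 24, 0.2580645),
    "FOSAM": (2, 2, 2, 2, 0.0952381),
}

print(scenario_type("Swedbank AB", telefonbanken))
print(scenario_type("FOLKSAM ÖMSESIDIG SAKFÖRSÄKRING", folksam))
```

On these excerpts, "Swedbank Telefonbanken" falls into type 1 because lcs and jw both rank "Swedbank AB" first, and "Folksam" falls into type 3 because no method does so, in line with the types assigned in the results table.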
[1] https://en.wikipedia.org/wiki/Longest_common_subsequence_problem
[2] https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance