The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
With an ever-expanding set of financial data providers, each covering
many of the same firms and entities, matching data sets that do not
share common identifiers has never been more
important.fedmatch is a set of tools for performing record
linkage between two data sets. It allows for a variety of different
matching techniques, letting the user build a matching algorithm for
their specific application. Although fedmatch was designed with economic
data in mind (i.e. loans or companies), it is very flexible, so it can
be used for any matching problem. With fedmatch, a
researcher or analyst can quickly go from having 0 matches between two
datasets to having many. With more time and care, they can use more
advanced techniques to pull out even more matches.
Fedmatch has many features, including:
This vignette will explain the basics of fedmatch, including the
merge_plus function and the clean_strings
function. From there, other vignettes go further into the details of the
different matching types.
Before diving into the matching code, we’ll first go over the
clean_strings function that can help standardize company
names across data sets.
A basic example of clean strings looks like this:
raw_names <- c("Hamlin, Hamlin, McGill", "Schweibert & Cokely ", "Wexler McGill, LLC",
               "Davis and Main, Inc.")
clean_names <- clean_strings(raw_names)
clean_names
#> [1] "hamlin hamlin mcgill"  "schweibert and cokely" "wexler mcgill llc"    
#> [4] "davis and main inc"Without any additional arguments, clean_strings does the
following:
As described in the manual for clean_strings, one can
specify further arguments to remove words or try different replacements.
Fedmatch comes with a set of words that are commonly used for this, but
you can use whatever you’d like. (You can also use
word_frequency to look for common words in your data.)
fedmatch::corporate_words[1:5]
#>      abbr     long.names
#>    <char>         <char>
#> 1:  accep     acceptance
#> 2:   amer        america
#> 3:  assoc     associates
#> 4:     cl company listed
#> 5:  cmnty      community
scrubbed_names <- clean_strings(raw_names, common_words = fedmatch::corporate_words)
scrubbed_names
#> [1] "hamlin hamlin mcgill"                       
#> [2] "schweibert and cokely"                      
#> [3] "wexler mcgill limited liability corporation"
#> [4] "davis and main incorporated"Through string cleaning, we can make it so that even if two different databases record names differently (e.g. “Hamlin Hamlin McGill INC” vs “Hamlin Hamlin McGill Incorporated”), we will still count these records as a match.
merge_plusThe workhorse of fedmatch is merge_plus.
merge_plus is an extremely flexible function that can
perform several different types of matches: exact, fuzzy, and
multivar.
Here are the example datasets that come with fedmatch:
fedmatch::corp_data1
#>                Company Country  State   SIC Revenue unique_key_1
#>                 <char>  <char> <char> <num>   <num>        <int>
#>  1:            Walmart     USA     OH  3300     485            1
#>  2:   Bershire Hataway     USA         2222     223            2
#>  3:              Apple     USA     CA  3384     215            3
#>  4:       Exxon Mobile     USA     TX  2222     205            4
#>  5:          McKesson  Germany     MA   222     192            5
#>  6: UnitedHealth Group     USA     MA    NA     184            6
#>  7:         CVS Health     USA     RI  1112     177            7
#>  8:     General Motors     USA     MI  2222     166            8
#>  9:               AT&T     USA     TN  4000     163            9
#> 10: Ford Motor Company     USA     MI    NA     151           10
fedmatch::corp_data2
#>                   Name country state_code SIC_code earnings unique_key_2
#>                 <char>  <char>     <char>    <num>   <char>        <int>
#>  1:            Walmart     USA         OH     3380  490,000            1
#>  2:  Bershire Hathaway     USA         NE     2220  220,000            2
#>  3:     Apple Computer     USA         CA       NA  220,000            3
#>  4: Exxon Mobile Inc.      USA         TX     2222  210,000            4
#>  5:     McKesson Corp.                 MA     2222  190,000            5
#>  6: UnitedHealth Group     USA         MA     1130  180,000            6
#>  7:                CVS                 RI     1122  180,000            7
#>  8:                 GM                 MI     2222  170,000            8
#>  9:             AT & T     USA         TN     4000  160,000            9
#> 10:         Ford Motor     USA         MI     2222  150,000           10The most basic way to use merge_plus is by simply making
it equivalent to base::merge.
basic_merge <- merge_plus(data1 = corp_data1, 
                          data2 = corp_data2,
                          by.x = "Company",
                          by.y = "Name", match_type = "exact", 
                          unique_key_1 = "unique_key_1",
                          unique_key_2 = "unique_key_2")This code will run merge on the “Company” and “Name” variables, and
return cases where the two have an exact match. The only differences
between this and base::merge are
merge_plus requires data1 and data2 each to have a
“unique key” that can be used to identify an observation.merge_plus returns a list.Let’s take a look at each of the elements of the list returned by
merge_plus. These will always be the same, no matter which
match_type you select in merge_plus.
The first item is the matches themselves. This is a
data.table with one row for each matching observation,
along with all variables present in each data set.
print(basic_merge$matches)
#> Key: <unique_key_2>
#> Index: <tier>
#>               Company Country  State   SIC Revenue unique_key_1 country
#>                <char>  <char> <char> <num>   <num>        <int>  <char>
#> 1:            Walmart     USA     OH  3300     485            1     USA
#> 2: UnitedHealth Group     USA     MA    NA     184            6     USA
#>    state_code SIC_code earnings unique_key_2               Name   tier
#>        <char>    <num>   <char>        <int>             <char> <char>
#> 1:         OH     3380  490,000            1            Walmart    all
#> 2:         MA     1130  180,000            6 UnitedHealth Group    allThe next item is matches_filter. In this example, it’s
empty, because we didn’t supply the argument filter. If we
did supply filter (which can either be a function that
filters, or a numeric cutoff for a matchscore (more on this later)), we
would see a subsample of the matches dataset.
Next in the list is data1_nomatch and
data2_nomatch, which return the rows that were not matched
from the datasets.
print(basic_merge$data1_nomatch)
#>               Company Country  State   SIC Revenue unique_key_1
#>                <char>  <char> <char> <num>   <num>        <int>
#> 1:   Bershire Hataway     USA         2222     223            2
#> 2:              Apple     USA     CA  3384     215            3
#> 3:       Exxon Mobile     USA     TX  2222     205            4
#> 4:          McKesson  Germany     MA   222     192            5
#> 5:         CVS Health     USA     RI  1112     177            7
#> 6:     General Motors     USA     MI  2222     166            8
#> 7:               AT&T     USA     TN  4000     163            9
#> 8: Ford Motor Company     USA     MI    NA     151           10
print(basic_merge$data2_nomatch)
#>                  Name country state_code SIC_code earnings unique_key_2
#>                <char>  <char>     <char>    <num>   <char>        <int>
#> 1:  Bershire Hathaway     USA         NE     2220  220,000            2
#> 2:     Apple Computer     USA         CA       NA  220,000            3
#> 3: Exxon Mobile Inc.      USA         TX     2222  210,000            4
#> 4:     McKesson Corp.                 MA     2222  190,000            5
#> 5:                CVS                 RI     1122  180,000            7
#> 6:                 GM                 MI     2222  170,000            8
#> 7:             AT & T     USA         TN     4000  160,000            9
#> 8:         Ford Motor     USA         MI     2222  150,000           10Finally, there is match_evaluation, which is a
data.table that summarizes how well the match worked. It shows the
number of matches in each dataset broken down by tier (more on tiers
later), along with the percent matched.
We can also use merge_plus to perform “fuzzy” matches. A
fuzzy match uses a string distance algorithm to compute the distance
between one string and a set of other strings, then picks the closest
string that’s over a certain threshold. fedmatch uses
stringdist::amatch to execute these matches, and you can
read more about string distances in the stringdist package
documentation.
Here is an example of how this is implemented in
merge_plus.
fuzzy_result <- merge_plus(data1 = corp_data1, 
                          data2 = corp_data2,
                          by.x = "Company",
                          by.y = "Name", match_type = "fuzzy", fuzzy_settings = build_fuzzy_settings(nthread = 1),
                          unique_key_1 = "unique_key_1",
                          unique_key_2 = "unique_key_2")
print(fuzzy_result$matches)
#> Key: <unique_key_2>
#> Index: <tier>
#>    unique_key_2 unique_key_1 Country  State   SIC Revenue            Company
#>           <int>        <int>  <char> <char> <num>   <num>             <char>
#> 1:            1            1     USA     OH  3300     485            Walmart
#> 2:            2            2     USA         2222     223   Bershire Hataway
#> 3:            6            6     USA     MA    NA     184 UnitedHealth Group
#>                  Name country state_code SIC_code earnings   tier
#>                <char>  <char>     <char>    <num>   <char> <char>
#> 1:            Walmart     USA         OH     3380  490,000    all
#> 2:  Bershire Hathaway     USA         NE     2220  220,000    all
#> 3: UnitedHealth Group     USA         MA     1130  180,000    allWe can see that we picked up an additional match here: “Bershire
Hataway” and “Bershire Hathaway.” These are off by 1 character, so the
exact match didn’t pick them up, but the fuzzy match did. We can also
tweak the fuzzy match settings with the argument
fuzzy_settings. This is a list that will be passed to
stringdist::amatch.
fuzzy_result <- merge_plus(data1 = corp_data1, 
                          data2 = corp_data2,
                          by.x = "Company",
                          by.y = "Name", match_type = "fuzzy", 
                          fuzzy_settings = build_fuzzy_settings(maxDist = .5),
                          unique_key_1 = "unique_key_1",
                          unique_key_2 = "unique_key_2")
print(fuzzy_result$matches)
#> Key: <unique_key_2>
#> Index: <tier>
#>     unique_key_2 unique_key_1 Country  State   SIC Revenue            Company
#>            <int>        <int>  <char> <char> <num>   <num>             <char>
#>  1:            1            1     USA     OH  3300     485            Walmart
#>  2:            2            2     USA         2222     223   Bershire Hataway
#>  3:            3            3     USA     CA  3384     215              Apple
#>  4:            4            4     USA     TX  2222     205       Exxon Mobile
#>  5:            5            5 Germany     MA   222     192          McKesson 
#>  6:            6            6     USA     MA    NA     184 UnitedHealth Group
#>  7:            7            7     USA     RI  1112     177         CVS Health
#>  8:            9            9     USA     TN  4000     163               AT&T
#>  9:           10            8     USA     MI  2222     166     General Motors
#> 10:           10           10     USA     MI    NA     151 Ford Motor Company
#>                   Name country state_code SIC_code earnings   tier
#>                 <char>  <char>     <char>    <num>   <char> <char>
#>  1:            Walmart     USA         OH     3380  490,000    all
#>  2:  Bershire Hathaway     USA         NE     2220  220,000    all
#>  3:     Apple Computer     USA         CA       NA  220,000    all
#>  4: Exxon Mobile Inc.      USA         TX     2222  210,000    all
#>  5:     McKesson Corp.                 MA     2222  190,000    all
#>  6: UnitedHealth Group     USA         MA     1130  180,000    all
#>  7:                CVS                 RI     1122  180,000    all
#>  8:             AT & T     USA         TN     4000  160,000    all
#>  9:         Ford Motor     USA         MI     2222  150,000    all
#> 10:         Ford Motor     USA         MI     2222  150,000    allSo, cranking up the maxDist (maximum distance between
strings, a threshold for determining matches) gave us a bunch more
matches. Note that we return multiple matches per the same unique key
sometimes, for example Ford Motor got matched to General Motors and Ford
Motor Company. There are many tweaks that one can make via
fuzzy_settings, and these change the match behavior
significantly. It is worth exploring various options to see which make
the most sense for your specific application. See the fuzzy matching
vignette for more details, including a new method of string comparison
that we call a “Weighted Jaccard” comparison.
The final setting for match_type in
merge_plus is a “multivariable match”, or “multivar” for
short. This match is complex, and may take some playing around with the
code to fully understand how it works. We’ll just go over the basic
usage here. The idea behind the multivariable match is to use several
variables from each dataset to execute a match, rather than just using
the name of an entity.
One way is to take the set of variables (say, company name, state, and earnings), compare them with some numeric metric, and then perform a linear combination of those metrics to arrive at a final score. Then, you can compare each observation in one dataset to each other observation in the other dataset to pick the match with the highest score.
The other way is similar, but instead of a linear combination of scores, you can use a logit model. In this method, you create a hand-verified match set between your two datasets, then use a logit model to estimate how much each variable contributes to determining a match.
Here’s an example of the first method, the linear combination:
# for simplicity's sake, rename columns in corp_data2
data.table::setnames(corp_data2, c("Name", "country"), c("Company", "Country"))
multivar_linear_result <- merge_plus(corp_data1, corp_data2, 
                                     match_type = "multivar",
                                     by = c("Country", "Company"), 
                                     unique_key_1 = "unique_key_1",
                                     suffixes = c("_1", "_2"),
                                     unique_key_2 = "unique_key_2",
                                    
                                     multivar_settings = build_multivar_settings(compare_type = c("indicator", "stringdist"),
                                                               wgts = c(.5, .5),
                                                              top = 1))
multivar_linear_result$matches                                     
#> Key: <unique_key_2>
#> Index: <tier>
#>     unique_key_1          Company_1 Country_1  State   SIC Revenue
#>            <int>             <char>    <char> <char> <num>   <num>
#>  1:            1            Walmart       USA     OH  3300     485
#>  2:            2   Bershire Hataway       USA         2222     223
#>  3:            3              Apple       USA     CA  3384     215
#>  4:            4       Exxon Mobile       USA     TX  2222     205
#>  5:            5          McKesson    Germany     MA   222     192
#>  6:            6 UnitedHealth Group       USA     MA    NA     184
#>  7:            7         CVS Health       USA     RI  1112     177
#>  8:            9               AT&T       USA     TN  4000     163
#>  9:            8     General Motors       USA     MI  2222     166
#> 10:           10 Ford Motor Company       USA     MI    NA     151
#>              Company_2 Country_2 state_code SIC_code earnings unique_key_2
#>                 <char>    <char>     <char>    <num>   <char>        <int>
#>  1:            Walmart       USA         OH     3380  490,000            1
#>  2:  Bershire Hathaway       USA         NE     2220  220,000            2
#>  3:     Apple Computer       USA         CA       NA  220,000            3
#>  4: Exxon Mobile Inc.        USA         TX     2222  210,000            4
#>  5:     McKesson Corp.                   MA     2222  190,000            5
#>  6: UnitedHealth Group       USA         MA     1130  180,000            6
#>  7: UnitedHealth Group       USA         MA     1130  180,000            6
#>  8:             AT & T       USA         TN     4000  160,000            9
#>  9:         Ford Motor       USA         MI     2222  150,000           10
#> 10:         Ford Motor       USA         MI     2222  150,000           10
#>     Country_compare Company_compare multivar_score   tier
#>               <num>           <num>          <num> <char>
#>  1:               1       1.0000000      1.0000000    all
#>  2:               1       0.9882353      0.9941176    all
#>  3:               1       0.8714286      0.9357143    all
#>  4:               1       0.9333333      0.9666667    all
#>  5:               0       0.9285714      0.4642857    all
#>  6:               1       1.0000000      1.0000000    all
#>  7:               1       0.5333333      0.7666667    all
#>  8:               1       0.9111111      0.9555556    all
#>  9:               1       0.7333333      0.8666667    all
#> 10:               1       0.9111111      0.9555556    allTo specify this type of match, we put in the match_type
as “multivar,” and then we specified how we wanted the match to run by
passing the list multivar_settings. Each element of this
list is a separate argument to go into multivar_match. The
compare_type argument tells the multivar how to compare
each variable in the by argument. Because “Country” is a
binary variable, we specify “indicator”, and because “Company” is a
string variable, we specify “stringdist.” You can see a full list of
options for comparison in the multivar_match documentation,
or equivalently, the merge_plus score_settings
documentation. (merge_plus has an option to compute
matchscores post-hoc, as a method of evaluation.)
Next, here’s an example the second method, using a logit model. First, we’ll set up a fake training table. Normally, one would construct a human-verified match set. Here, I just create a table where the first half are matches, the second half are a mix of matches and not, and then the two comparison variables are biased to be more of a match in the first half of the sample. This is just a way to ensure that our logit model gives us positive coefficients, so that our example makes a little more sense.
set.seed(111)
training_table <- data.table::data.table(match = c(rep(1, 5e4), sample(c(0,1 ), 5e4, replace = TRUE)),
                                Company_compare = seq(1, 0.00001, -.00001),
                                Country_compare = c(rep(1, 5e4), sample(c(1, 0), 5e4, replace = TRUE)))
# training_table
logit_model <- glm(match ~ Company_compare + Country_compare, family = "binomial",
                   data = training_table)
summary(logit_model)
#> 
#> Call:
#> glm(formula = match ~ Company_compare + Country_compare, family = "binomial", 
#>     data = training_table)
#> 
#> Coefficients:
#>                 Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)     -1.27701    0.01702  -75.04   <2e-16 ***
#> Company_compare  4.98075    0.04134  120.49   <2e-16 ***
#> Country_compare  0.56183    0.01821   30.85   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 112745  on 99999  degrees of freedom
#> Residual deviance:  83050  on 99997  degrees of freedom
#> AIC: 83056
#> 
#> Number of Fisher Scoring iterations: 5Then, we plug our logit model into the
multivar_settings. The code will then use our trained logit
model on the variables we specified. Note that the name of the columns
in the training set must match the name of the variables in the match
datasets, with “_compare” at the end.
result <- merge_plus(corp_data1, corp_data2, by = c("Country", "Company"), unique_key_1 = "unique_key_1",
                        unique_key_2 = "unique_key_2", 
                     match_type = "multivar",
                     multivar_settings = list(logit = logit_model, compare_type = c("indicator", "stringdist"),
                        wgts = NULL),
                        suffixes = c("_1", "_2"))
result$matches
#> Key: <unique_key_2>
#> Index: <tier>
#>     unique_key_1          Company_1 Country_1  State   SIC Revenue
#>            <int>             <char>    <char> <char> <num>   <num>
#>  1:            1            Walmart       USA     OH  3300     485
#>  2:            2   Bershire Hataway       USA         2222     223
#>  3:            3              Apple       USA     CA  3384     215
#>  4:            4       Exxon Mobile       USA     TX  2222     205
#>  5:            5          McKesson    Germany     MA   222     192
#>  6:            6 UnitedHealth Group       USA     MA    NA     184
#>  7:            7         CVS Health       USA     RI  1112     177
#>  8:            9               AT&T       USA     TN  4000     163
#>  9:            8     General Motors       USA     MI  2222     166
#> 10:           10 Ford Motor Company       USA     MI    NA     151
#>              Company_2 Country_2 state_code SIC_code earnings unique_key_2
#>                 <char>    <char>     <char>    <num>   <char>        <int>
#>  1:            Walmart       USA         OH     3380  490,000            1
#>  2:  Bershire Hathaway       USA         NE     2220  220,000            2
#>  3:     Apple Computer       USA         CA       NA  220,000            3
#>  4: Exxon Mobile Inc.        USA         TX     2222  210,000            4
#>  5:     McKesson Corp.                   MA     2222  190,000            5
#>  6: UnitedHealth Group       USA         MA     1130  180,000            6
#>  7:                CVS                   RI     1122  180,000            7
#>  8:             AT & T       USA         TN     4000  160,000            9
#>  9:         Ford Motor       USA         MI     2222  150,000           10
#> 10:         Ford Motor       USA         MI     2222  150,000           10
#>     Country_compare Company_compare multivar_score   tier
#>               <num>           <num>          <num> <char>
#>  1:               1       1.0000000      0.9861508    all
#>  2:               1       0.9882353      0.9853272    all
#>  3:               1       0.8714286      0.9740476    all
#>  4:               1       0.9333333      0.9808013    all
#>  5:               0       0.9285714      0.9660386    all
#>  6:               1       1.0000000      0.9861508    all
#>  7:               0       0.8366667      0.9473611    all
#>  8:               1       0.9111111      0.9786024    all
#>  9:               1       0.7333333      0.9496635    all
#> 10:               1       0.9111111      0.9786024    allNote the last few columns in the data.table: we see the comparison
metrics, just like in the linear combination version of
multivar_match. But, note that instead of computing a 50/50
linear combination like before, we are now computing a matchscore as the
fitted probability of a match based on our logit model. In this toy
example, the coefficients are a little strange because of the random
data we fed in. But, we see the behavior we’d expect: a higher company
name comparison and a country match gives us a higher matchscore.
We’ve covered the several different types of matching with
fedmatch: exact matching, fuzzy matching, and multivar
matching. Each match is useful in its own right, and they become even
more useful when combined. That’s where the next step comes in: tier
matching with the function tier_match. See the vignette for
tier match for more details.
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.