Title: | Cluster and Merge Similar Values Within a Character Vector |
---|---|
Description: | These functions take a character vector as input, identify and cluster similar values, and then merge clusters together so their values become identical. The functions are an implementation of the key collision and ngram fingerprint algorithms from the open source tool Open Refine <https://openrefine.org/>. More info on key collision and ngram fingerprint can be found here <https://openrefine.org/docs/technical-reference/clustering-in-depth>. |
Authors: | Chris Muir [aut, cre] |
Maintainer: | Chris Muir <[email protected]> |
License: | GPL-3 |
Version: | 0.3.3 |
Built: | 2024-11-09 03:07:10 UTC |
Source: | https://github.com/chrismuir/refinr |
This function takes a character vector, then edits and merges values that are approximately equivalent yet not identical. It clusters values using the key collision method, described at https://openrefine.org/docs/technical-reference/clustering-in-depth.
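The key collision idea can be sketched in base R. This is a minimal illustration of an OpenRefine-style fingerprint key, not refinr's actual implementation: the helper name `fingerprint_key` is hypothetical, and the sketch skips steps such as ASCII normalization and business suffix handling.

```r
# Hypothetical helper (not part of refinr): build a fingerprint key per string.
fingerprint_key <- function(x) {
  x <- tolower(trimws(x))                  # trim and lowercase
  x <- gsub("[[:punct:]]", "", x)          # drop punctuation
  tokens <- strsplit(x, "[[:space:]]+")    # split into whitespace tokens
  vapply(tokens, function(tok) {
    tok <- tok[nzchar(tok)]
    # sort and de-duplicate tokens, then join them back together
    paste(sort(unique(tok), method = "radix"), collapse = " ")
  }, character(1))
}

x <- c("Acme Pizza, Inc.", "ACME PIZZA INC", "inc pizza, acme", "Nicks Pizza")
keys <- fingerprint_key(x)

# Values that share a key fall into the same cluster.
clusters <- split(x, keys)
```

Here the first three values share the key "acme inc pizza", so they land in one cluster, while "Nicks Pizza" stays on its own.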
key_collision_merge( vect, ignore_strings = NULL, bus_suffix = TRUE, dict = NULL )
vect |
Character vector, items to be potentially clustered and merged. |
ignore_strings |
Character vector, these strings will be ignored during the merging of values within vect. |
bus_suffix |
Logical, indicating whether the merging of records should be insensitive to common business suffixes or not. Default value is TRUE. |
dict |
Character vector, meant to act as a dictionary during the
merging process. If any items within |
Character vector with similar values merged.
x <- c("Acme Pizza, Inc.", "ACME PIZZA COMPANY", "pizza, acme llc",
       "Acme Pizza, Inc.")
key_collision_merge(vect = x)

# Use parameter "dict" to influence how clustered values are edited.
key_collision_merge(vect = x, dict = c("Nicks Pizza", "acme PIZZA inc"))

# Use parameter 'ignore_strings' to ignore specific strings during merging
# of values.
x <- c("Bakersfield Highschool", "BAKERSFIELD high",
       "high school, bakersfield")
key_collision_merge(x, ignore_strings = c("high", "school", "highschool"))
This function takes a character vector, then edits and merges values that are approximately equivalent yet not identical. It uses a two-step process. The first step clusters values based on their ngram fingerprint (described at https://openrefine.org/docs/technical-reference/clustering-in-depth). The second step merges values based on approximate string matching of the ngram fingerprints, using the sd_lower_tri() C function from the package stringdist.
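The first step can be illustrated with a small base R sketch of an ngram fingerprint, following the OpenRefine description. The helper name `ngram_fingerprint` is hypothetical and this is not refinr's internal code; it omits ASCII normalization and business suffix handling.

```r
# Hypothetical helper (not part of refinr): ngram fingerprint of each string.
ngram_fingerprint <- function(x, numgram = 2) {
  # lowercase, then drop all punctuation and whitespace
  x <- gsub("[[:punct:][:space:]]", "", tolower(x))
  vapply(x, function(s) {
    n <- nchar(s)
    if (n < numgram) return(s)             # too short to tokenize
    starts <- seq_len(n - numgram + 1)
    grams <- substring(s, starts, starts + numgram - 1)
    # sort and de-duplicate the ngrams, then concatenate them
    paste(sort(unique(grams), method = "radix"), collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

ngram_fingerprint("Paris")                 # "arispari"
ngram_fingerprint(c("Acme Pizza", "ACME  pizza,"))
```

Strings that differ only in case, spacing, or punctuation produce the same fingerprint, so they collide into one cluster before any approximate matching takes place.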
n_gram_merge( vect, numgram = 2, ignore_strings = NULL, bus_suffix = TRUE, edit_threshold = 1, weight = c(d = 0.33, i = 0.33, s = 1, t = 0.5), ... )
vect |
Character vector, items to be potentially clustered and merged. |
numgram |
Numeric value, indicating the number of characters that will occupy each ngram token. Default value is 2. |
ignore_strings |
Character vector, these strings will be ignored during the merging of values within vect. |
bus_suffix |
Logical, indicating whether the merging of records should be insensitive to common business suffixes or not. Default value is TRUE. |
edit_threshold |
Numeric value, indicating the threshold at which a merge is performed, based on the sum of the edit values derived from param weight. |
weight |
Numeric vector, indicating the weights to assign to the four edit operations (see details below), for the purpose of approximate string matching. Default values are c(d = 0.33, i = 0.33, s = 1, t = 0.5). This parameter gets passed along to the stringdist edit distance function. |
... |
Additional arguments to be passed along to the stringdist edit distance function. |
The values of the weight argument are edit penalties that get passed to the stringdist edit distance function. The parameter takes four named values, one for each type of edit, each with a default penalty:
d: deletion, default value is 0.33
i: insertion, default value is 0.33
s: substitution, default value is 1
t: transposition, default value is 0.5
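To see how these penalties interact, here is a base R sketch of a weighted optimal string alignment distance. It is an illustration only, not the stringdist implementation: the function name `weighted_osa` is hypothetical, it assumes "deletion" means removing a character from the first string, and it is applied to raw strings for readability, whereas the package matches ngram fingerprints.

```r
# Hypothetical helper: weighted optimal string alignment distance from a to b.
weighted_osa <- function(a, b,
                         weight = c(d = 0.33, i = 0.33, s = 1, t = 0.5)) {
  a <- strsplit(a, "")[[1]]
  b <- strsplit(b, "")[[1]]
  na <- length(a)
  nb <- length(b)
  # m[i + 1, j + 1]: cost of turning the first i chars of a
  # into the first j chars of b
  m <- matrix(0, na + 1, nb + 1)
  m[, 1] <- (0:na) * weight[["d"]]
  m[1, ] <- (0:nb) * weight[["i"]]
  for (i in seq_len(na)) {
    for (j in seq_len(nb)) {
      sub_cost <- if (a[i] == b[j]) 0 else weight[["s"]]
      best <- min(m[i, j + 1] + weight[["d"]],   # deletion
                  m[i + 1, j] + weight[["i"]],   # insertion
                  m[i, j] + sub_cost)            # substitution or match
      # transposition of two adjacent characters
      if (i > 1 && j > 1 && a[i] == b[j - 1] && a[i - 1] == b[j]) {
        best <- min(best, m[i - 1, j - 1] + weight[["t"]])
      }
      m[i + 1, j + 1] <- best
    }
  }
  m[na + 1, nb + 1]
}

weighted_osa("piza", "pizza")   # one insertion: 0.33
weighted_osa("acme", "amce")    # one transposition: 0.5
weighted_osa("cat", "cut")      # deletion + insertion (0.66) beats substitution (1)
```

The last call shows a consequence of the default weights: because a deletion plus an insertion costs less than a single substitution, a substituted character effectively costs 0.66. The resulting distance is the quantity that would be compared against edit_threshold.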
Character vector with similar values merged.
x <- c("Acme Pizza, Inc.", "ACME PIZA COMPANY", "Acme Pizzazza LLC")
n_gram_merge(vect = x)

# The performance of the approximate string matching can be adjusted using
# parameters 'weight' or 'edit_threshold'.
n_gram_merge(vect = x, weight = c(d = 0.4, i = 1, s = 1, t = 1))

# Use parameter 'ignore_strings' to ignore specific strings during merging
# of values.
x <- c("Bakersfield Highschool", "BAKERSFIELD high",
       "high school, bakersfield")
n_gram_merge(vect = x, ignore_strings = c("high", "school", "highschool"))
These functions take a character vector as input, identify and cluster similar values, and then merge clusters together so their values become identical. The functions are an implementation of the key collision and ngram fingerprint algorithms from the open source tool Open Refine.
Open Refine Site https://openrefine.org/
Details on Open Refine clustering algorithms https://openrefine.org/docs/technical-reference/clustering-in-depth
refinr features the following functions: key_collision_merge() and n_gram_merge().
Maintainer: Chris Muir [email protected]
Useful links: