merging in R on name
Often people want to merge datasets and have names of countries or locations that they want to merge on. These names are often somewhat similar, but not exactly. A function in R that is hugely useful to merge in this case is called agrep. With this function you can do approximate matching of names (or rather, or strings as subset of other strings). To merge properly, though, you do want to avoid matching the same name twice and you want to prioritize exact matches over very fuzzy matches. The idea is not mine, but Eduardo’s. To do so, I wrote a little R function, which is here in beta version:
agrep.wrapper < - function(x, y, names.x = "name", names.y = "name", ids.x = "id", ignore.case=TRUE, max.threshold=1) {
x <- as.data.frame(x, stringsAsFactors=FALSE)
y <- as.data.frame(y, stringsAsFactors=FALSE)
unique.x.select <- !duplicated(x[,ids.x])
unique.x.names <- x[,names.x][unique.x.select]
unique.x.ids <- x[,ids.x][unique.x.select]
unique.y.select <- !duplicated(y[,names.y])
unique.y.names <- y[,names.y][unique.y.select]
unique.y.ids <- rep(NA,length(unique.y.names))
matching.x.names <- unique.x.names
matching.x.ids <- unique.x.ids
for (threshold in seq(from=0, to=max.threshold, by=.1)) {
i <- 1
while (i <= length(matching.x.names)) {
select <- (1:length(unique.y.ids) %in% agrep(matching.x.names[i], unique.y.names, ignore.case=ignore.case, max.distance=threshold)) & is.na(unique.y.ids)
if (sum(select) > 0) {
unique.y.ids[select] <- matching.x.ids[i]
matching.x.ids <- matching.x.ids[-i]
matching.x.names <- matching.x.names[-i]
} else
i <- i + 1
}
}
unique.data <- merge(data.frame(unique.x.names, unique.x.ids), data.frame(unique.y.names, unique.y.ids), by.x=”unique.x.ids”, by.y=”unique.y.ids”, all=TRUE)
list(matches = unique.data)
}
Add comment March 8th, 2007