merging in R on name
March 8th, 2007
Often people want to merge datasets and have names of countries or locations that they want to merge on. These names are often somewhat similar, but not exactly. A function in R that is hugely useful to merge in this case is called agrep. With this function you can do approximate matching of names (or rather, or strings as subset of other strings). To merge properly, though, you do want to avoid matching the same name twice and you want to prioritize exact matches over very fuzzy matches. The idea is not mine, but Eduardo’s. To do so, I wrote a little R function, which is here in beta version:
agrep.wrapper < - function(x, y, names.x = "name", names.y = "name", ids.x = "id", ignore.case=TRUE, max.threshold=1) {
x <- as.data.frame(x, stringsAsFactors=FALSE)
y <- as.data.frame(y, stringsAsFactors=FALSE)
unique.x.select <- !duplicated(x[,ids.x])
unique.x.names <- x[,names.x][unique.x.select]
unique.x.ids <- x[,ids.x][unique.x.select]
unique.y.select <- !duplicated(y[,names.y])
unique.y.names <- y[,names.y][unique.y.select]
unique.y.ids <- rep(NA,length(unique.y.names))
matching.x.names <- unique.x.names
matching.x.ids <- unique.x.ids
for (threshold in seq(from=0, to=max.threshold, by=.1)) {
i <- 1
while (i <= length(matching.x.names)) {
select <- (1:length(unique.y.ids) %in% agrep(matching.x.names[i], unique.y.names, ignore.case=ignore.case, max.distance=threshold)) & is.na(unique.y.ids)
if (sum(select) > 0) {
unique.y.ids[select] <- matching.x.ids[i]
matching.x.ids <- matching.x.ids[-i]
matching.x.names <- matching.x.names[-i]
} else
i <- i + 1
}
}
unique.data <- merge(data.frame(unique.x.names, unique.x.ids), data.frame(unique.y.names, unique.y.ids), by.x=”unique.x.ids”, by.y=”unique.y.ids”, all=TRUE)
list(matches = unique.data)
}
Entry Filed under: text parsing, R, statistics
Leave a Comment
You must be logged in to post a comment.
Trackback this post | Subscribe to the comments via RSS Feed