Archive for March 8th, 2007

merging in R on name

Often people want to merge datasets and have names of countries or locations that they want to merge on. These names are often somewhat similar, but not exactly. A function in R that is hugely useful to merge in this case is called agrep. With this function you can do approximate matching of names (or rather, or strings as subset of other strings). To merge properly, though, you do want to avoid matching the same name twice and you want to prioritize exact matches over very fuzzy matches. The idea is not mine, but Eduardo’s. To do so, I wrote a little R function, which is here in beta version:

agrep.wrapper < - function(x, y, names.x = "name", names.y = "name", ids.x = "id", ignore.case=TRUE, max.threshold=1) {

    x <- as.data.frame(x, stringsAsFactors=FALSE)
    y <- as.data.frame(y, stringsAsFactors=FALSE)

    unique.x.select <- !duplicated(x[,ids.x])
    unique.x.names <- x[,names.x][unique.x.select]
    unique.x.ids <- x[,ids.x][unique.x.select]
    
    unique.y.select <- !duplicated(y[,names.y])
    unique.y.names <- y[,names.y][unique.y.select]
    unique.y.ids <- rep(NA,length(unique.y.names))
    
    matching.x.names <- unique.x.names
    matching.x.ids <- unique.x.ids
    
    for (threshold in seq(from=0, to=max.threshold, by=.1)) {
        
        i <- 1
        while (i <= length(matching.x.names)) {
            
            select <- (1:length(unique.y.ids) %in% agrep(matching.x.names[i], unique.y.names, ignore.case=ignore.case, max.distance=threshold)) & is.na(unique.y.ids)
                
            if (sum(select) > 0) {
            
                unique.y.ids[select] <- matching.x.ids[i]
                matching.x.ids <- matching.x.ids[-i]
                matching.x.names <- matching.x.names[-i]
            } else
                i <- i + 1
        }
    }
            
    unique.data <- merge(data.frame(unique.x.names, unique.x.ids), data.frame(unique.y.names, unique.y.ids), by.x=”unique.x.ids”, by.y=”unique.y.ids”, all=TRUE)
    
    list(matches = unique.data)
}

Add comment March 8th, 2007


Calendar

March 2007
M T W T F S S
« Jan   Apr »
 1234
567891011
12131415161718
19202122232425
262728293031  

Posts by Month

Posts by Category