Posts filed under 'text parsing'

merging in R on name

Often people want to merge datasets on the names of countries or locations. These names are often similar across datasets, but not identical. A hugely useful R function for this case is agrep, which does approximate matching of names (or rather, of strings as subsets of other strings). To merge properly, though, you want to avoid matching the same name twice, and you want to prioritize exact matches over very fuzzy ones. To do so, I wrote a little R function (the idea is not mine, but Eduardo’s), which is here in beta version:

agrep.wrapper <- function(x, y, names.x = "name", names.y = "name", ids.x = "id", ignore.case = TRUE, max.threshold = 1) {

    x <- as.data.frame(x, stringsAsFactors = FALSE)
    y <- as.data.frame(y, stringsAsFactors = FALSE)

    # first row for each unique id in x
    unique.x.select <- !duplicated(x[, ids.x])
    unique.x.names <- x[, names.x][unique.x.select]
    unique.x.ids <- x[, ids.x][unique.x.select]

    # unique names in y; their ids are filled in as matches are found
    unique.y.select <- !duplicated(y[, names.y])
    unique.y.names <- y[, names.y][unique.y.select]
    unique.y.ids <- rep(NA, length(unique.y.names))

    # names in x that still have to be matched
    matching.x.names <- unique.x.names
    matching.x.ids <- unique.x.ids

    # start with exact (substring) matches at threshold 0, then get fuzzier
    for (threshold in seq(from = 0, to = max.threshold, by = .1)) {

        i <- 1
        while (i <= length(matching.x.names)) {

            # names in y that match at this threshold and have no id yet
            select <- (seq_along(unique.y.ids) %in% agrep(matching.x.names[i], unique.y.names, ignore.case = ignore.case, max.distance = threshold)) & is.na(unique.y.ids)

            if (sum(select) > 0) {
                # assign the id and drop the x name so it cannot match again;
                # i stays put because the next name moves into position i
                unique.y.ids[select] <- matching.x.ids[i]
                matching.x.ids <- matching.x.ids[-i]
                matching.x.names <- matching.x.names[-i]
            } else {
                i <- i + 1
            }
        }
    }

    # all = TRUE keeps the names that found no match
    unique.data <- merge(data.frame(unique.x.names, unique.x.ids), data.frame(unique.y.names, unique.y.ids), by.x = "unique.x.ids", by.y = "unique.y.ids", all = TRUE)

    list(matches = unique.data)
}
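
To see why the function starts at a threshold of 0 and only then loosens up, here is a minimal sketch of agrep on its own, using nothing but base R. The place names are made up for illustration and do not come from any real dataset.

```r
# Hypothetical place names, made up purely for illustration
candidates <- c("Netherlands", "The Netherlands", "Nethrlands", "New Zealand")

# Strict pass: max.distance = 0 allows no edits, so only candidates that
# contain the pattern as a substring are returned
strict <- agrep("Netherlands", candidates, ignore.case = TRUE,
                max.distance = 0, value = TRUE)

# Looser pass: values below 1 are read as a fraction of the pattern length,
# so a small number of edits is tolerated and the misspelling is found too
loose <- agrep("Netherlands", candidates, ignore.case = TRUE,
               max.distance = 0.1, value = TRUE)

print(strict)
print(loose)

# Names that only the looser pass picked up
print(setdiff(loose, strict))
```

Running the strict pass first and the loose pass only on what is left is what keeps a misspelled name from grabbing a match that an exactly spelled name should have had.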

March 8th, 2007

practice of programming

Not much new to tell. Have been reading a few chapters so far of Kernighan & Pike, The Practice of Programming, which is a very good introduction to, well, programming. Very basic concepts, most of which I already knew, but very well laid out and good for sharpening my skills. I’m anxious to spend some time doing their exercises, but without a computer at home that will have to wait, I suppose.

Their somewhat silly example of a random text generator made me think about how to implement some code that reads the news and then runs some kind of factor analysis to group articles. So that might well be my next coding project - something different from online games (just read an extensive blog which convincingly argues how dangerous those games can be) and even kind of related to my studies.

Also discovered that Neoware is going to sell a thin client laptop. That is a laptop without a hard disk, meant primarily for use over the network, including remote desktops and SSH. Cool idea, were it not that it does not actually lead to a cheaper laptop - they still charge $800.

October 19th, 2006

