Is there an algorithm that can match approximate blocks of text?

(6 posts) (3 voices)

Asked 5 years ago by elotroalex
Latest answer from Stéfan Sinclair
This question has a best answer.

elotroalex
Member
This question has been floating in my mind this week during conversations about the future of Juxta. The traditional diff algorithm compares strings of text linearly. So if you have two series, A) a b c d f g h j q z, and B) a b c d e f g i j k r x y z, the diff algorithm recognizes that they share C) a b c d f g j z.

In a series of versions of a given text, where an author and his collaborators have moved or transposed several blocks of text from one version to the next, the diff algorithm fails to recognize these moves, even though diff still works inside each block.

So the question is: given a series X) a b c [d f g h j] q ... n, and a series Y) a [d f i h j] b c q ... n, how can we get the computer to recognize that the two bracketed blocks share Z) [d f h j] and that the block has moved? If I could solve this puzzle, we could teach Juxta to recognize moves automatically, or at least to suggest possible matches that make the scholar's work easier... I think.

One of the problems is that I don't know what kind of problem this is in terms of computer science or math: Is it a matching problem? A string-searching problem? I'm randomly walking in the Math and Computer Science department next week, but I figured I'd run it by the DH community just in case.
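For readers who want to see the behaviour described above, here is a minimal sketch that assumes "diff" boils down to a longest common subsequence (LCS) over tokens. The function name `lcs` and the token lists are illustrative only; this is not Juxta's actual code. The elided "... n" tail of X and Y is left out.

```python
def lcs(a, b):
    """Return one longest common subsequence of two token lists."""
    m, n = len(a), len(b)
    # table[i][j] = length of the LCS of a[i:] and b[j:]
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the top-left corner to recover the tokens.
    out, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return out


A = "a b c d f g h j q z".split()
B = "a b c d e f g i j k r x y z".split()
print(lcs(A, B))  # ['a', 'b', 'c', 'd', 'f', 'g', 'j', 'z'], i.e. series C

# The transposed example (without the elided tail): the LCS keeps one
# ordering and drops whatever has crossed over it.
X = "a b c d f g h j q".split()
Y = "a d f i h j b c q".split()
print(lcs(X, Y))
```

On X and Y the linear alignment can keep either "b c" or the bracketed block, but not both, so whichever it drops shows up as an unrelated deletion plus insertion rather than as a move.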
Posted 5 years ago Permalink
Stéfan Sinclair
Admin

This isn't a tidy algorithmic answer, but you might want to check out what the folks at ARTFL have done with PAIR: Pairwise Alignment for Intertextual Relations.

Posted 5 years ago Permalink
elotroalex
Member

Replying to @Stéfan Sinclair's post:

This is very promising! Thanks, Stéfan. I was talking to some biologist friends over dinner, and they suggested looking at both genetic sequencing and plagiarism-detection software. Interestingly enough, that's where the ARTFL folks say they got their ideas from. I will try to get in touch with them.

Posted 5 years ago Permalink
Wayne Graham
Member
Best Answer

Alex,

I'd classify this as an approximate string matching problem. In more mathematical terms, I believe you want to find the locations where a query of length m matches a substring of a text of length n with k or fewer differences. There are a lot of ways to do this, and there is a long line of literature on it (check out Knuth's work), but the 'big' problem here is computational efficiency. Most of the decisions are going to require trade-offs based on the underlying technology and the amount of time you are willing to make users wait.

For some actual algorithms to get you going, try the Boyer-Moore and Knuth-Morris-Pratt algorithms. You may also find some use in the Stochastic Optimization and Metaheuristic classes of algorithms (variable neighborhood search, adaptive random search, guided local search).
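To make the "k or fewer differences" formulation concrete, here is a minimal sketch. Note that Boyer-Moore and Knuth-Morris-Pratt are exact-matching algorithms, so the sketch swaps in a different technique: the classic dynamic-programming search usually credited to Sellers, which runs in O(mn) time. The function name `find_approx` and the choice of whole tokens as the unit of comparison are assumptions made for illustration.

```python
def find_approx(pattern, text, k):
    """Return (end, distance) pairs for every position in `text` where some
    substring ending there is within edit distance k of `pattern`.
    `end` is 1-based: the match covers tokens up to text[end - 1]."""
    m = len(pattern)
    # Column for the empty text prefix: matching pattern[:i] costs i deletions.
    prev = list(range(m + 1))
    hits = []
    for end, token in enumerate(text, start=1):
        cur = [0]  # cost 0 in row 0: a match may start anywhere in the text
        for i in range(1, m + 1):
            substitution = prev[i - 1] + (pattern[i - 1] != token)
            skip_text_token = prev[i] + 1
            skip_pattern_token = cur[i - 1] + 1
            cur.append(min(substitution, skip_text_token, skip_pattern_token))
        if cur[m] <= k:
            hits.append((end, cur[m]))
        prev = cur
    return hits


# The bracketed block from X, searched for in Y with at most one difference.
block = "d f g h j".split()
Y = "a d f i h j b c q".split()
print(find_approx(block, Y, k=1))  # [(6, 1)]: ends at Y's 6th token, one edit
```

The quadratic cost is the efficiency trade-off mentioned above; faster bit-parallel and filtering variants exist, but they are harder to read.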

Posted 5 years ago Permalink
elotroalex
Member

Replying to @Wayne Graham's post:

Thanks, Wayne! This is a strange world indeed, and fascinating! Reminds me a bit of the time I read René Thom's work on morphogenesis and saw functions and geometric forms in everything for over a week. Now I'm seeing algorithms in my kid's breastfeeding habits.

Let me add a couple of things to your formulation of the problem. Given a query of length m that matches a substring of a text of length n with k or fewer differences, where k is set relative to m and n, it looks like a possible solution would be to start with a relatively small m and, if there is a match, slowly increase m until the relative value of k is no longer acceptable.
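The grow-from-a-small-match idea above can be sketched roughly as follows. Everything here is an assumption made for illustration: the names `edit_distance` and `grow_match`, the use of equal-length windows on both sides, the 0.25 threshold for k relative to m, and the `patience` parameter, which lets the window keep growing past a few bad lengths (setting `patience=0` gives the strict "stop at the first failure" reading).

```python
def edit_distance(a, b):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j - 1] + (x != y),  # substitute or match
                           prev[j] + 1,             # delete x
                           cur[j - 1] + 1))         # insert y
        prev = cur
    return prev[-1]


def grow_match(x, y, i, j, seed_len=2, max_ratio=0.25, patience=2):
    """Grow equal-length windows x[i:i+m] and y[j:j+m] from a seed.
    Return the largest m with edit distance k <= max_ratio * m,
    giving up after more than `patience` consecutive failures."""
    best, misses, m = 0, 0, seed_len
    while i + m <= len(x) and j + m <= len(y):
        k = edit_distance(x[i:i + m], y[j:j + m])
        if k <= max_ratio * m:
            best, misses = m, 0
        else:
            misses += 1
            if misses > patience:
                break
        m += 1
    return best


X = "a b c d f g h j q".split()
Y = "a d f i h j b c q".split()
# Seed on the shared "d f": index 3 in X, index 1 in Y.
m = grow_match(X, Y, i=3, j=1)
print(X[3:3 + m], Y[1:1 + m])
```

A short window is hit hard by a single difference (one edit in three tokens is already 33%), which is why some tolerance for temporary failures seems useful in practice.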

In terms of the big question, I'm guessing that the computing can be done before the final product is presented to the public. We're talking scholarly editions here. Back in the day, the "user" time was days of poring over manuscripts and editions before we could work all this stuff out. I doubt we would have to wait that long with a good solution.

Posted 5 years ago Permalink
Stéfan Sinclair
Admin
Best Answer

If you know something about the structure/scope of the repeating sequences, and if they're regular, then I suspect Normalized Compression Distances could be very effective at finding similar passages. For instance, you could build a matrix of distances between each sentence in one text and each sentence in another text, and align as applicable based on the values.
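A minimal sketch of the sentence-by-sentence distance matrix, assuming zlib as the compressor and the usual definition NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size in bytes. The function names and the sample sentences are made up for illustration, and compressor overhead makes the values noisy on very short strings.

```python
import zlib


def csize(data):
    """Compressed size in bytes, standing in for C(x)."""
    return len(zlib.compress(data, 9))


def ncd(x, y):
    """Normalized Compression Distance between two strings."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = csize(bx), csize(by)
    return (csize(bx + by) - min(cx, cy)) / max(cx, cy)


def ncd_matrix(sentences_a, sentences_b):
    """matrix[i][j] = NCD between sentence i of text A and sentence j of text B."""
    return [[ncd(a, b) for b in sentences_b] for a in sentences_a]


# Hypothetical sample data: the two sentences swap places between versions.
version_1 = [
    "The poet revised the second stanza before the book went to press.",
    "Rain fell on the harbour all through the long grey morning.",
]
version_2 = [
    "Rain fell upon the harbour all through the long grey morning.",
    "The poet revised the second stanza before the book went to the press.",
]

matrix = ncd_matrix(version_1, version_2)
for i, row in enumerate(matrix):
    j = min(range(len(row)), key=row.__getitem__)  # closest sentence in version 2
    print(f"version 1, sentence {i} -> version 2, sentence {j}")
```

Because each sentence is compared with every other sentence regardless of position, a transposed passage is found in the same way as one that stayed put; the cost is one compression per pair of sentences.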

Posted 5 years ago Permalink
