I want to do two different sorts of text comparison, though I suspect a solution to one will also work for the other. Ideally, I'll find a web-based or freeware solution to one or both of the following:
1. I want to see how much of an earlier text (a published article, in this case) was lifted and republished in a later one (a book chapter). Being able to find verbatim reuse of paragraphs, sentences, or even strings of words (more than 5 or so) would be very useful. Both texts have already been identified (i.e. I don't need to search across texts to find candidates for comparison, though that would be nice for the future!), and the ultimate goal is to highlight, in the more recent text, those portions that came from the earlier one.
A side interest (which seems like it would follow from the above) would be to find moments where specific words, examples, or references were swapped out for others.
2. Expanding on the above, I'd like to be able to take the entire set of texts by a given author published before a certain date and find, as before, portions lifted and republished in a text published on that date (a large book, known to have been cobbled together from previous work, for which the MS is unknown, so tracing the provenance of individual passages has proven laborious). The corpus is obviously larger here, and representing the lifted portions is more complicated because each would ideally be tagged as belonging to a specific prior text. The ideal output would be a list of passages from the later text giving (a) their location in the new text, (b) the earlier text in which they appeared, and (c) their location in the earlier text.
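To make both tasks concrete, here is a minimal sketch of one possible approach, assuming Python is an acceptable tool and that each text is already a plain string. It is not a finished solution: it indexes every run of `n` consecutive words in the earlier texts, scans the later text for runs that also occur there, and extends each hit as far as the words keep agreeing. The names `tokenize` and `find_reuse`, and the labels used for the earlier texts, are hypothetical. Positions are word offsets, not page or character positions, and case and punctuation are ignored.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase word tokens, discarding punctuation."""
    return re.findall(r"[\w']+", text.lower())


def find_reuse(new_text, earlier_texts, n=5):
    """Find runs of at least n words in new_text that occur in an earlier text.

    earlier_texts maps a label (hypothetical, e.g. a filename) to its text.
    Returns a list of (new_pos, label, old_pos, passage) tuples, where the
    positions are word offsets into the tokenized texts.
    """
    sources = {label: tokenize(text) for label, text in earlier_texts.items()}

    # Index every n-word window of every earlier text.
    index = defaultdict(list)
    for label, words in sources.items():
        for i in range(len(words) - n + 1):
            index[tuple(words[i:i + n])].append((label, i))

    new_words = tokenize(new_text)
    hits = []
    j = 0
    while j <= len(new_words) - n:
        candidates = index.get(tuple(new_words[j:j + n]))
        if not candidates:
            j += 1
            continue
        # Extend each candidate match word by word and keep the longest.
        best = None
        for label, i in candidates:
            old_words = sources[label]
            length = n
            while (i + length < len(old_words)
                   and j + length < len(new_words)
                   and old_words[i + length] == new_words[j + length]):
                length += 1
            if best is None or length > best[0]:
                best = (length, label, i)
        length, label, i = best
        hits.append((j, label, i, " ".join(new_words[j:j + length])))
        j += length  # continue scanning after the matched run
    return hits
```

For the first task, `earlier_texts` would hold a single entry (the article); for the second, one entry per earlier publication. Each returned tuple corresponds to items (a), (b), and (c) above, plus the matched words. Short gaps between two consecutive hits from the same earlier text are plausible candidates for swapped words or references, though this sketch does not flag them itself.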
Any ideas?
Note #1: All the relevant texts are pre-1923 and available in relatively clean text files (or can be turned into them).
Note #2: I poked around before posting – the closest thread I found is this, though the focus there is music notation.