Some copyright holders would certainly argue that you can't make a full copy of a work, even for a transformative purpose that doesn't involve making that copy fully available. (See Authors Guild et al. v. Google Inc., pending in the U.S. District Court for the Southern District of New York.) If sued, you might successfully defend against an infringement claim by arguing that your use is very different from selling copies to readers and is therefore not infringing.
Getting a machine-readable version of your source documents through a brute-force method like scanning, OCR, and/or keyboarding lets you work with practically any document you can get your hands on, but it's awfully expensive. As you say, it's a lot cheaper to get source files from a publisher -- or to get access to a corpus someone else has built. People who have built corpora of contemporary materials would know how to negotiate with rightsholders.
If you're willing to limit yourself to in-copyright content that rightsholders have agreed to make available to the world through HathiTrust, you might consider using the HathiTrust Data API, described at http://www.hathitrust.org/data, to get at the OCR text. That would save you the digitization costs entirely.