Command-line tool to generate report on large number of XML files
(7 posts) (5 voices)

We have a large number of TEI-encoded XML files, some of which aren't valid (usually because of character-encoding problems). We'd like to generate a report that lists each file and the error(s) encountered, with line and character index, etc. I can imagine writing a quick-and-dirty script that essentially does what we need, but I'm wondering (lazily) if anyone has seen anything that would do the trick.

Posted 8 years ago
Eep. No, I wish I had. Mostly what I see is people writing shell scripts or ant tasks or even little XSLT scripties (dependent on Saxon, usually) to do the job.
If you're really gunning to solve the character-encoding problem first, though, chardet (Python) might help, and there's a bit of a laundry list of encoding-detection libraries too (looks a bit old, may be obsolete). I happened upon a slightly more recent Stack Overflow thread on the subject while I was trying to dig up chardet.
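For example, here's a minimal sketch of the detection step in Ruby, assuming the rchardet gem (a port of the Python chardet library with, as far as I recall, a similar detect API; check the gem's docs before relying on it):

require 'rubygems'
require 'rchardet' # assumption: Ruby port of chardet

# Guess the encoding of each XML file and print the guess with its confidence.
Dir.glob('*.xml').each do |path|
  guess = CharDet.detect(File.binread(path))
  puts "#{path}: #{guess['encoding']} (confidence #{guess['confidence']})"
end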
Posted 8 years ago
It's not really a command-line tool, but it works really well: oXygen XML Editor has a number of tools for Validating XML Documents Against a Schema. I frequently use these functions to validate my project files before committing them to Subversion or uploading them to my production server (eXist).
Posted 8 years ago
Are these new files that are being brought in, or an existing collection? If they're new files and you're using some kind of source-code control, you can use a pre-commit hook to validate each document before it gets into the repository (a rough sketch follows below). If it's the latter, it may be worthwhile to take some time to write a Java (or JRuby) app that could generate this type of report. Lord knows there are enough folks with heaps of XML that is of dubious origin/validity...
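For the pre-commit route, here's a rough, untested sketch of a Subversion hook written in Ruby. It assumes svnlook is on the PATH and the libxml-ruby gem is installed, so treat it as a starting point rather than a drop-in:

#!/usr/bin/env ruby
# Subversion pre-commit hook sketch: reject commits that add or modify
# ill-formed XML. Subversion invokes hooks with the repository path and
# the transaction id as arguments.
require 'rubygems'
require 'libxml'

repos, txn = ARGV
failed = false

`svnlook changed -t #{txn} #{repos}`.each_line do |line|
  status, path = line.chomp.split(' ', 2)
  next if status.include?('D') || File.extname(path.to_s) != '.xml'
  content = `svnlook cat -t #{txn} #{repos} #{path}`
  begin
    LibXML::XML::Parser.string(content).parse
  rescue LibXML::XML::Error => e
    STDERR.puts "#{path}: #{e.message}"
    failed = true
  end
end

# a nonzero exit status makes Subversion reject the commit
exit(failed ? 1 : 0)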
Posted 8 years ago
I started writing some code that could provide a fairly quick-and-dirty solution. There are at least two issues: 1) I don't think character encoding is being examined properly, so errors aren't being raised as I'd like; 2) I don't see a way to validate against a DTD when one is declared, but skip along merrily when it's not (a possible workaround is sketched below, after the usage example).
require 'rubygems'
require 'libxml'
include LibXML

# process a directory recursively
def process_directory(directory)
  Dir.new(directory).each do |file|
    full_file = File.expand_path(file, directory)
    if file[0,1] == '.' # skip dot files
      next
    elsif File.directory? full_file
      process_directory full_file
    elsif File.extname(file) == '.xml'
      process_file full_file
    end
  end
end

# process an XML file
def process_file(file)
  begin
    # this fails when there's no DTD – how to toggle this without a manual read of the file?
    #parser = XML::Parser.file(file, :options => XML::Parser::Options::DTDVALID)
    parser = XML::Parser.file(file)
    parser.parse
  rescue LibXML::XML::Error
    # already reported
  rescue Exception
    puts $!
  end
end

process_directory(ARGV[0].nil? ? Dir.getwd : ARGV[0])
Usage (assuming you've saved the above contents as parseDirectory.rb):
ruby parseDirectory.rb
ruby parseDirectory.rb /someotherdirectory/
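One possible workaround for issue 2 (untested, and it does involve the manual read the comment above hoped to avoid, though only of the first kilobyte): peek at the head of each file for a DOCTYPE declaration and only switch on DTD validation when one is present. Something like:

# replacement for process_file: request DTD validation only when the
# document actually declares a DOCTYPE
def process_file(file)
  head = File.open(file, 'rb') { |f| f.read(1024) }.to_s
  options = head.include?('<!DOCTYPE') ? XML::Parser::Options::DTDVALID : 0
  parser = XML::Parser.file(file, :options => options)
  parser.parse
rescue LibXML::XML::Error
  # already reported by libxml's error handler
rescue Exception
  puts $!
end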
Posted 8 years ago
Replying to @Stéfan Sinclair's post:
Rather than a custom script, it might be simpler to just use xmllint (the command-line tool bundled with libxml2) for this. It can handle multiple files, and if you use the --valid switch it will validate against the DTDs the documents declare. You could combine this with libxml2's catalog support to cache local copies of the relevant DTDs.
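For example (validation errors go to stderr, so redirect that to build the report):

xmllint --noout --valid *.xml 2> validation-report.txt

And to point libxml2 at local copies of the DTDs, you can supply an XML catalog via the XML_CATALOG_FILES environment variable, something like:

XML_CATALOG_FILES=catalog.xml xmllint --noout --valid *.xml 2> validation-report.txt

Posted 8 years ago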