My first thought was to clean it up semi-manually using regular expressions in a good text editor like BBEdit, TextMate, or oXygen (the last is my tool of choice, given how well it handles XML and related formats). I've done this many times, but it was always painful.
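For what it's worth, the regex route can also be scripted rather than done by hand in an editor. Here is a minimal Python sketch. The function name `clean_word_html` and the particular patterns are my own assumptions about typical Word HTML debris (conditional comments, `<o:p>` tags, `Mso` classes, inline styles), not a complete or tested recipe for your file:

```python
import re

def clean_word_html(html: str) -> str:
    """Strip some common Word-specific markup (illustrative, not exhaustive)."""
    # Conditional comments such as <!--[if gte mso 9]> ... <![endif]-->
    html = re.sub(r"<!--\[if.*?<!\[endif\]-->", "", html, flags=re.DOTALL)
    # Namespaced Office tags such as <o:p> and </o:p>
    html = re.sub(r"</?[a-z]+:[^>]*>", "", html, flags=re.IGNORECASE)
    # class=MsoNormal and similar, quoted or unquoted
    html = re.sub(r"""\s+class=(["']?)Mso\w*\1""", "", html)
    # Quoted inline style attributes (may span lines and contain the other quote type)
    html = re.sub(r"""\s+style=(["']).*?\1""", "", html, flags=re.DOTALL)
    return html

sample = (
    "<p class=MsoNormal style='margin:0in;font-family:\"Times New Roman\"'>"
    "Hello<o:p></o:p></p>"
    "<!--[if gte mso 9]><xml>junk</xml><![endif]-->"
)
print(clean_word_html(sample))
```

Regular expressions are a blunt tool for HTML, so expect to add or adjust patterns for whatever your particular export contains.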
But a new feature of oXygen XML Editor v12.1 comes to mind:
Smart Paste - Automatic Conversion to DITA, DocBook, TEI, etc.
Styled content can be inserted by copying content from Office applications (for example Microsoft Word and Microsoft Excel, OpenOffice.org Writer and OpenOffice.org Calc) and Web browsers (Mozilla Firefox, Microsoft Internet Explorer, etc.) and pasting it in the Author editor. The styles and general layout of the copied content, like sections with headings, tables, list items, bold and italic text, hyperlinks, are converted in the target document equivalent XML elements. <oXygen/> provides default implementations for the following document types: DITA, DocBook, TEI, XHTML but this support can be configured also for user defined document types.
So the idea would be to open the Word HTML file in a web browser, copy the desired text, then paste it into a new XHTML document in oXygen's Author mode. I haven't tried it with this combination of source and destination formats, but I was pleased with the results when I pasted generic web content into a new TEI file. It may be promising for you.