With Python you can script OpenOffice:
Step 1: batch convert all your .doc files into .odt files
- .odt files are zip files, and you can extract content.xml from them
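A rough sketch of how step 1 could be driven from Python. It assumes a LibreOffice-style `soffice` binary on your PATH that accepts `--headless --convert-to`; your OpenOffice install may need a different command. The helper names `build_convert_command` and `convert_all` are made up for illustration.

```python
import subprocess
from pathlib import Path


def build_convert_command(doc_path, out_dir):
    """Build the (assumed) soffice command line that converts one .doc to .odt."""
    return [
        "soffice",              # assumption: the office binary is on PATH under this name
        "--headless",           # run without opening a window
        "--convert-to", "odt",  # target format
        "--outdir", str(out_dir),
        str(doc_path),
    ]


def convert_all(src_dir, out_dir):
    """Convert every .doc file in src_dir, one soffice call per file."""
    for doc_path in sorted(Path(src_dir).glob("*.doc")):
        subprocess.run(build_convert_command(doc_path, out_dir), check=True)
```

Calling `convert_all("docs", "odt_out")` (both directory names are placeholders) should leave one .odt per .doc in the output directory, if the assumption about `soffice` holds on your machine.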
Step 2: to extract content.xml, open the .odt file within Python as a zip file (roughly; this has no error checking, etc.)
- now you have your file as XML
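Step 2 can be sketched with the standard `zipfile` module. Like the description above, this has no error checking; the function name `read_content_xml` and the file name in the comment are just placeholders.

```python
import zipfile


def read_content_xml(odt_path):
    """Return the raw bytes of content.xml from an .odt file (no error checking)."""
    with zipfile.ZipFile(odt_path) as odt:
        return odt.read("content.xml")


# Example call, assuming a converted file named foobar.odt exists:
# xml_bytes = read_content_xml("foobar.odt")
```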
Step 3: using Python's XML tools (there are a bunch)
you can transform this into whatever format you need it to be. I'd need to know the specification of the required output format to give you advice that is directly relevant to you, though.
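As one possible starting point for step 3, here is a sketch that pulls the plain paragraph text out of content.xml with `xml.etree.ElementTree`. It assumes you only want the `text:p` paragraphs (headings, which live in `text:h`, are skipped), and the function name is made up.

```python
import xml.etree.ElementTree as ET

# Namespace used for text elements in OpenDocument content.xml
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def paragraphs_from_content_xml(xml_bytes):
    """Return the text of every <text:p> paragraph, in document order."""
    root = ET.fromstring(xml_bytes)
    # itertext() also collects text inside nested spans, links, etc.
    return ["".join(p.itertext()) for p in root.iter("{%s}p" % TEXT_NS)]
```

From there you can write the list out in whatever shape your target format needs.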
If it is a complicated format that really needs to preserve a lot of the information and layout, then you pretty much have to go this route.
If the output format is loosely specified, then you might be able to get away with opening the .doc file directly in Python and stripping out all the null characters and other random .doc crap.
You could do this by:
Step 1: work out what marks the boundaries of the crap
Step 2: output what is left
This isn't PHP code, it's Python, but that is how the forum thingy formatted it. I just threw this together, so I can't guarantee it won't bug out on you or eat your firstborn. I tested it on your sample file and it spits out just the text, so you can play with it to see if it is useful.
It will go something along the lines of:
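    lines = []
    with open("foobar.doc", "rb") as f:  # binary mode: a .doc is not plain text
        for l in f.readlines():
            # latin-1 maps each byte to one character, so decoding never fails
            for s in l.decode("latin-1").split("\r"):  # readlines doesn't split on "\r"
                lines.append(s)
    goodlines = []
    for lin in lines:
        add = True
        for c in lin:
            # reject the whole line if it has any non-printable or non-ASCII character
            if ord(c) < 32 or ord(c) > 126:
                add = False
        if add:
            goodlines.append(lin)
    for lin in goodlines:
        print(lin)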
Hopefully someone else who is more familiar with this stuff will post something else to help you.