Python: Need a stripping program

Jonny · Feb 15, 2011

Hello all,

My programming skills are non-existent, but I need to familiarize myself enough with Python to write a program that strips information from files such as this one and converts it into a format suitable for a data analysis package called Stata.

Before going on, let me say that there is no way to do this by hand: there are tens of thousands of files like that one to convert. This is a nationwide DOT bid contract analysis.

Does anyone have advice on relevant tutorial programs, or information on how to create such a program?

Any info will be helpful. Thanks in advance.

A window to the soul · Feb 15, 2011

Why Python? Just curious...

spin-1/2-nuclei · Feb 16, 2011

With Python you can script OpenOffice:
Step 1: batch convert all your .doc files into .odt files
- .odt files are zip files, so you can extract content.xml from them
Step 2: to extract content.xml, open the .odt in Python as a zip file (roughly; this has no error checking, etc.):
import zipfile
zipfile.ZipFile("file.odt").extract("content.xml")

- now you have your file in xml
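With tens of thousands of files you may not want a content.xml written to disk for each one. A minimal sketch of reading it straight into memory instead is below; the function name read_content_xml is made up for illustration, while ZipFile.read is part of the standard library.

```python
import zipfile

def read_content_xml(odt_path):
    """Return the raw bytes of content.xml from an .odt file."""
    # An .odt file is an ordinary zip archive; content.xml holds the document body.
    with zipfile.ZipFile(odt_path) as odt:
        return odt.read("content.xml")
```

This raises KeyError if the archive has no content.xml, so for a big batch you would probably want to catch that and log the file name.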

Step 3: using Python's XML tools (there are a bunch) you can transform this into whatever format you need, but I'd need to know the specification of the required output format to give you advice that is directly relevant to you.
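As a rough idea of what that step could look like, here is a sketch that pulls the text of each paragraph out of content.xml with the standard library's ElementTree. The function name paragraphs is hypothetical, and it assumes plain paragraph text is all you need; tables and layout would need more work.

```python
import xml.etree.ElementTree as ET

# Namespace that OpenDocument uses for text elements such as <text:p>.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

def paragraphs(content_xml):
    """Return the text of every <text:p> element, in document order."""
    root = ET.fromstring(content_xml)
    result = []
    for p in root.iter("{%s}p" % TEXT_NS):
        # itertext() also collects text inside nested elements like <text:span>.
        result.append("".join(p.itertext()))
    return result
```

From a list of strings like this you can then pick out the fields you care about and write them in whatever layout the output format requires.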

If it is a complicated format that really needs to preserve a lot of the information and layout then you pretty much have to go through this route...

If the output format is loosely specified then you might be able to get away with opening the .doc file directly in Python and stripping out all the null characters and other random .doc crap.
You could do this by:

Step 1: work out which characters mark the boundaries of the crap (null bytes and other non-printable characters)
Step 2: output what is left
This will go something along the lines of:

PHP:

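lines = []

# A .doc file is binary, so read bytes and decode as latin-1 (which never fails).
with open("foobar.doc", "rb") as f:
    text = f.read().decode("latin-1")

# Splitting on "\n" alone is not enough: the file also uses "\r" as a separator.
for l in text.split("\n"):
    for s in l.split("\r"):
        lines.append(s)

goodlines = []

for lin in lines:
    # Keep a line only if every character is printable ASCII (codes 32..126).
    if all(32 <= ord(c) <= 126 for c in lin):
        goodlines.append(lin)

for lin in goodlines:
    print(lin)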

This isn't PHP code, it's Python, but this is how the forum formatted it. I threw this together just now, so I can't guarantee it won't bug out on you or eat your firstborn. I tested it on your sample file and it spits out just the text. You can play with it to see if it is useful.
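Since you have tens of thousands of files, here is a sketch of the same filtering idea wrapped in a function and run over a whole folder, writing a CSV file (Stata can import comma-separated files). The names printable_lines and convert_folder and the file/line/text column layout are assumptions of mine, not anything your data requires.

```python
import csv
import glob
import os

def printable_lines(data):
    """Split raw .doc bytes on CR/LF and keep the non-empty printable-ASCII lines."""
    text = data.decode("latin-1")  # latin-1 maps every byte, so this never fails
    pieces = text.replace("\r", "\n").split("\n")
    return [s for s in pieces if s and all(32 <= ord(c) <= 126 for c in s)]

def convert_folder(folder, out_csv):
    """Write one CSV row per surviving line: file name, line index, text."""
    with open(out_csv, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["file", "line", "text"])
        for path in sorted(glob.glob(os.path.join(folder, "*.doc"))):
            with open(path, "rb") as f:
                kept = printable_lines(f.read())
            for i, s in enumerate(kept):
                writer.writerow([os.path.basename(path), i, s])
```

You would call it as convert_folder("bids", "bids.csv"), where both names are placeholders. Unlike the snippet above, this one also drops empty lines, which is usually what you want in a data file.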

Hopefully someone else who is more familiar with this stuff will post something else to help you.

Good luck,
Cheers,
spin
