
Python: Need a stripping program

Jonny

Joined
Sep 8, 2009
Messages
3,134
MBTI Type
FREE
Hello all,

My programming skills are non-existent, but I need to get familiar enough with Python to write a program that strips information from files such as this one and converts it into a format suitable for a data analysis package called Stata.

Before going on, let me say that there is no way to do this by hand: there are tens of thousands of files like that one to convert. This is a nationwide DOT bid contract analysis.

Does anyone have advice on relevant tutorials, or information on how to write such a program?

Any info will be helpful. Thanks in advance.
 

spin-1/2-nuclei

New member
Joined
May 2, 2010
Messages
381
MBTI Type
INTJ
With Python you can script OpenOffice:
Step 1: batch convert all your .doc files into .odt files
- .odt files are zip files, so you can extract the content.xml
Step 2: to extract the content.xml, open the .odt within Python as a zip file (roughly; this has no error checking, etc.):
import zipfile
zipfile.ZipFile("file.odt").extract("content.xml")

- now you have your file in xml
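
A slightly more defensive version of step 2 might look like the sketch below. It reads content.xml straight into memory instead of writing it to disk, and reports files that are not usable .odt archives. The helper name read_content_xml is made up for illustration.

```python
import zipfile


def read_content_xml(odt_path):
    """Return the raw bytes of content.xml from an .odt file, or None on failure."""
    try:
        with zipfile.ZipFile(odt_path) as archive:
            # read() returns the member's bytes without extracting to disk
            return archive.read("content.xml")
    except zipfile.BadZipFile:
        print(f"{odt_path}: not a zip/odt file")
    except KeyError:
        # ZipFile.read raises KeyError when the member is missing
        print(f"{odt_path}: no content.xml inside")
    except OSError as err:
        print(f"{odt_path}: could not be opened ({err})")
    return None
```

Returning None rather than raising lets a batch run over thousands of files skip the bad ones and keep going.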

Step 3: using Python's XML tools (there are a bunch), you can transform this into whatever format you need. I'd need to know the specification of the required output format to give you advice that is directly relevant to you.
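
As a starting point for step 3, here is a minimal sketch that pulls the paragraph text out of content.xml with the standard library's xml.etree.ElementTree. It assumes the text you want lives in ordinary text:p paragraph elements; the function name paragraphs_from_content_xml is hypothetical, and turning the paragraphs into the final output format is left open since that format isn't specified yet.

```python
import xml.etree.ElementTree as ET

# Namespace used for text elements in OpenDocument content.xml
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def paragraphs_from_content_xml(xml_bytes):
    """Return the text of every non-empty <text:p> paragraph, in document order."""
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for p in root.iter(f"{{{TEXT_NS}}}p"):
        # itertext() also collects text nested in child elements such as spans
        text = "".join(p.itertext()).strip()
        if text:
            paragraphs.append(text)
    return paragraphs
```

Note that tabs and repeated spaces are stored as their own elements in this format, so they would be dropped by this sketch. If column alignment matters in your files, those need separate handling.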

If it is a complicated format that really needs to preserve a lot of the information and layout then you pretty much have to go through this route...

If the output format is loosely specified then you might be able to get away with opening the .doc file directly in python and stripping out all the null characters and other random .doc crap.
you could do this by:

step 1: work out what your flags are for the boundaries of the crap
step 2: output what is left
this will go something along the lines of:

PHP:
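lines = []

# Read raw bytes and decode as latin-1 so every byte maps to one character
# (a binary .doc file is not valid text in any normal encoding).
with open("foobar.doc", "rb") as f:
    text = f.read().decode("latin-1")

# .doc paragraphs end in "\r", so split on both "\n" and "\r".
for l in text.split("\n"):
    for s in l.split("\r"):
        lines.append(s)

goodlines = []

# Keep only lines made up entirely of printable ASCII characters.
for lin in lines:
    if all(32 <= ord(c) < 127 for c in lin):
        goodlines.append(lin)

for lin in goodlines:
    print(lin)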

This isn't PHP code, it's Python, but this is how the forum thingy formatted it. I just threw this together, so I can't guarantee it won't bug out on you or eat your firstborn. I tested it on your sample file and it spits out just the text. You can play with it to see if it is useful.
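
Since there are tens of thousands of files, the same filter could be wrapped in a function and run over a whole folder. A rough sketch only: the names printable_lines and strip_folder and the one-.txt-per-.doc layout are assumptions for illustration, and unlike the snippet above it also drops empty lines.

```python
import re
from pathlib import Path


def printable_lines(data):
    """Split raw bytes on CR or LF; keep non-empty lines of printable ASCII only."""
    text = data.decode("latin-1")  # one character per byte, never fails
    lines = re.split(r"[\r\n]", text)
    return [s for s in lines if s and all(32 <= ord(c) < 127 for c in s)]


def strip_folder(in_dir, out_dir):
    """Write a .txt of surviving lines for every .doc file directly in in_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    for doc in sorted(Path(in_dir).glob("*.doc")):
        kept = printable_lines(doc.read_bytes())
        target = out / (doc.stem + ".txt")
        target.write_text("\n".join(kept) + "\n", encoding="ascii")
        count += 1
    return count  # number of files processed
```

Calling strip_folder("docs", "stripped") would then leave one text file per input in the stripped folder, which is easier to spot-check before building the final Stata-ready output.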

Hopefully someone else who is more familiar with this stuff will post something else to help you...

good luck,
Cheers,
spin
 