    Python: Need a stripping program

    Hello all,

    My programming skills are non-existent, but I need to become familiar enough with Python to write a program that strips information from files such as this one and converts it into a format suitable for a data analysis package called Stata.

    Before going on, let me say that there is no way to do this by hand: tens of thousands of files like that one need to be converted for a nationwide DOT bid contract analysis.

    Does anyone have advice on relevant tutorial programs, or information on how to create such a program?

    Any info will be helpful. Thanks in advance.

    Why Python? Just curious...

    With Python you can script OpenOffice:
    Step 1: batch convert all your .doc files into .odt files
    - .odt files are zip files, and you can extract the content.xml from them
    Step 2: to extract the content.xml, open the .odt within Python as a zip file (roughly - this has no error checking, etc.):
    import zipfile

    - now you have your file in xml
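
The snippet in step 2 appears cut off after the import. A minimal sketch of what that step could look like, assuming the .odt was produced by OpenOffice and therefore contains a member named content.xml (the function name read_content_xml is made up for illustration):

```python
import zipfile


def read_content_xml(odt_path):
    """Return the raw bytes of content.xml stored inside an .odt file."""
    # An .odt file is an ordinary zip archive, so ZipFile can open it directly.
    with zipfile.ZipFile(odt_path) as odt:
        # read() raises KeyError if the archive has no content.xml member.
        return odt.read("content.xml")
```

Calling read_content_xml("foobar.odt") would then hand back the XML as bytes, ready for step 3.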

    Step 3: using Python's XML tools (there are a bunch), you can transform this into whatever format you need, but I'd need to know the specification of the required output format to give you advice that is directly relevant to you.
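
As one possible starting point for step 3, here is a rough sketch using xml.etree.ElementTree from the standard library. It only pulls out the text of each paragraph (text:p elements in the OpenDocument text namespace); how that text should be mapped to Stata variables is not specified in this thread, so that part is left out. The function name paragraphs is an assumption for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace used by OpenDocument for text elements such as text:p.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def paragraphs(content_xml):
    """Return the plain text of every text:p paragraph in content.xml."""
    root = ET.fromstring(content_xml)
    result = []
    for p in root.iter("{%s}p" % TEXT_NS):
        # itertext() also collects text nested in spans inside the paragraph.
        result.append("".join(p.itertext()))
    return result
```

Text inside tables lives in paragraphs too, so this flattens everything into one list; keeping the row and column layout would need a walk over the table elements instead.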

    If it is a complicated format that really needs to preserve a lot of the information and layout then you pretty much have to go through this route...

    If the output format is loosely specified, then you might be able to get away with opening the .doc file directly in Python and stripping out all the null characters and other random .doc crap.
    You could do this by:

    Step 1: work out what your flags are for the boundaries of the crap
    Step 2: output what is left
    This will go something along the lines of:

    PHP Code:
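    lines = []

    # latin-1 maps every byte to one character, so the binary .doc always decodes
    with open("foobar.doc", encoding="latin-1") as f:
        for l in f.readlines():
            for s in l.split("\r"):  # guard: readlines may not split on "\r"
                # drop the trailing newline so it isn't counted as a control character
                lines.append(s.rstrip("\n"))

    goodlines = []

    for lin in lines:
        add = True
        for c in lin:
            # any control or non-ASCII character marks the line as junk
            if ord(c) < 32 or ord(c) > 128:
                add = False
        if add:
            goodlines.append(lin)

    for lin in goodlines:
        print(lin)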
    This isn't PHP code, it's Python, but this is how the forum thingy formatted it. I just threw this together, so I can't guarantee it won't bug out on you or eat your firstborn. I tested it on your sample file and it spits out just the text, so you can play with it to see if it is useful.
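
Since there are tens of thousands of files, the same idea could be wrapped in a loop over a folder. A rough sketch, with a few assumptions that are not from the post above: all the .doc files sit in one directory, one .txt file per input is an acceptable intermediate result, and empty lines can be dropped. The names printable_lines and strip_folder are made up, and the filter is slightly stricter than the one above (it keeps only characters 32 to 126).

```python
from pathlib import Path


def printable_lines(data):
    """Split raw bytes into lines and keep those made only of printable ASCII."""
    text = data.decode("latin-1")  # one character per byte, never fails
    kept = []
    # Treat both "\r" and "\n" as line boundaries.
    for line in text.replace("\r", "\n").split("\n"):
        if line and all(32 <= ord(c) < 127 for c in line):
            kept.append(line)
    return kept


def strip_folder(src_dir, dst_dir):
    """Write a .txt file of printable lines for every .doc file in src_dir."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    count = 0
    for doc in sorted(Path(src_dir).glob("*.doc")):
        lines = printable_lines(doc.read_bytes())
        (dst / (doc.stem + ".txt")).write_text("\n".join(lines), encoding="ascii")
        count += 1
    return count  # number of files processed
```

Running strip_folder("bids", "bids_txt") on a small test folder first would show whether the surviving lines are clean enough to parse further.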

    Hopefully someone else who is more familiar with this stuff will post something else to help you...

    good luck,
