  1. #1
    null Jonny's Avatar
    Join Date
    Sep 2009

    Python: Need a stripping program

    Hello all,

    My programming skills are non-existent, but I need to learn enough Python to write a program that strips the information out of files such as this one and converts it into a format suitable for a data analysis package called Stata.

    Before going on, let me say that there is no way to do this by hand: tens of thousands of files like that one need to be converted. This is a nationwide DOT bid contract analysis.

    Does anyone have advice on relevant tutorials, or information on how to create such a program?

    Any info will be helpful. Thanks in advance.

  2. #2
    A window to the soul


    Why Python? Just curious...

  3. #3
    Senior Member
    Join Date
    May 2010


    With Python you can script OpenOffice:
    Step 1: batch convert all your .doc files into .odt files
    - .odt files are zip files, so you can extract the content.xml from them
    Step 2: to extract the content.xml, open the .odt within Python as a zip file (roughly - this has no error checking, etc.):
    import zipfile
    xml = zipfile.ZipFile("foobar.odt").read("content.xml")  # "foobar.odt" is just an example name

    - now you have your file as XML

    Step 3: using Python's XML tools (there are a bunch)
    you can transform this into whatever format you need - but I'd need to know the specification of the required output format to give you advice that is directly relevant to you.
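
    A rough sketch of steps 2 and 3, assuming all you want out of the .odt is the plain paragraph text. The helper names paragraphs_from_xml and paragraphs_from_odt are made up for illustration. Only <text:p> elements are collected, so headings (<text:h>) are skipped and table layout is lost:

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace that OpenDocument uses for text elements such as <text:p>.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def paragraphs_from_xml(xml_data):
    """Return the text of every non-empty <text:p> element, in document order."""
    root = ET.fromstring(xml_data)
    paragraphs = []
    for p in root.iter("{%s}p" % TEXT_NS):
        # itertext() also picks up text nested in children such as <text:span>.
        text = "".join(p.itertext()).strip()
        if text:
            paragraphs.append(text)
    return paragraphs


def paragraphs_from_odt(path):
    """Read content.xml out of an .odt file (hypothetical helper) and parse it."""
    with zipfile.ZipFile(path) as odt:
        return paragraphs_from_xml(odt.read("content.xml"))
```

    Calling paragraphs_from_odt("foobar.odt") would give back a list of strings, one per paragraph, which you could then reshape into whatever the output format turns out to be.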

    If it is a complicated format that really needs to preserve a lot of the information and layout then you pretty much have to go through this route...

    If the output format is loosely specified then you might be able to get away with opening the .doc file directly in Python and stripping out all the null characters and other random .doc crap.
    You could do this by:

    Step 1: work out what your flags are for the boundaries of the crap
    Step 2: output what is left
    This will go something along the lines of:

    PHP Code:
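    lines = []

    # Read raw bytes; latin-1 maps every byte to one character, so decoding never fails.
    with open("foobar.doc", "rb") as f:
        for l in f.read().decode("latin-1").split("\n"):
            for s in l.split("\r"):  # paragraphs in the .doc end in "\r", so split on that too
                lines.append(s)

    goodlines = []

    # Keep a line only if every character in it is printable.
    for lin in lines:
        add = True
        for c in lin:
            if ord(c) < 32 or ord(c) > 128:
                add = False
        if add:
            goodlines.append(lin)

    for lin in goodlines:
        print(lin)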
    This isn't PHP code, it's Python - but this is how the forum thingy formatted it. I threw this together just now, so I can't guarantee it won't bug out on you - or eat your first born. I tested it on your sample file and it spits out just the text, so you can play with it to see if it is useful.
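
    Since there are tens of thousands of files, the same filtering idea could be wrapped in a loop over a folder. This is only a sketch under some assumptions: the function names printable_lines and strip_folder are invented here, the files are assumed to end in .doc, and the filter is slightly stricter than the one above (it keeps characters 32 to 126 only and drops empty lines):

```python
from pathlib import Path


def printable_lines(data):
    """Split raw .doc bytes on line breaks and keep only the lines made up
    entirely of printable ASCII characters."""
    # latin-1 maps each byte to exactly one character, so decoding never fails.
    text = data.decode("latin-1")
    kept = []
    for line in text.replace("\r", "\n").split("\n"):
        if line and all(32 <= ord(c) < 127 for c in line):
            kept.append(line)
    return kept


def strip_folder(src_dir, dst_dir):
    """Write one .txt file of cleaned lines for each .doc file in src_dir."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    count = 0
    for doc in sorted(Path(src_dir).glob("*.doc")):
        lines = printable_lines(doc.read_bytes())
        out_file = dst / (doc.stem + ".txt")
        out_file.write_text("\n".join(lines) + "\n", encoding="ascii")
        count += 1
    return count  # number of files processed
```

    Something like strip_folder("bids", "bids_txt") would then leave one text file per contract, which is easier to inspect before deciding on the final layout for Stata.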

    Hopefully someone else who is more familiar with this stuff will post something else to help you...

    Good luck.
