Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


zipfile - python-docx: Error opening file - "Bad magic number for file header" / "EOFError"

The company I work for distributes document assembly software that uses the python-docx library. The software runs a function on every generated document that opens the document and does a simple search and replace for characters that weren't escaped properly (namely "&amp;" -> "&").

FYI: the actual document assembly uses python-docx-template. However, the error happens after the document has already been assembled, and it is triggered by the search-and-replace function, which only uses python-docx.

Recently, we've had a few cases where documents are failing to generate on client deployments. They're throwing an error on this line where the document object is instantiated:

doc = Document(docx=Path(doc_path))

We've seen two errors:

raise BadZipFile("Bad magic number for file header")

and

raise EOFError

The software is widely used and we've never had this issue before. We can't reproduce it in our test environments. The error has only started appearing in the past week but has shown up for several clients after they were updated. The software will fail to generate a particular document some number of times but will succeed after a few tries.

We've only seen it happen with one document in particular, but all documents use the same search and replace function, and like I said the error is only intermittent with the problem document.

There have been no changes in code to this search and replace function and I can't think of any other meaningful difference to our doc assembly process that would explain this.

I'm having a lot of trouble finding info on what could cause this specifically with the python-docx library. Is this a sign that the generated document is corrupted? If anyone is able to shed some light on possible causes that would be very helpful!

Here's the stack trace for both errors:

Bad magic number...

File "/home/user/app/application/document_assembly/core_da.py", line 524, in translate_ampersands
    doc = Document(docx=Path(doc_path))
  File "/home/user/app-venv/lib/python3.6/site-packages/docx/api.py", line 25, in Document
    document_part = Package.open(docx).main_document_part
  File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/package.py", line 116, in open
    pkg_reader = PackageReader.from_file(pkg_file)
  File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/pkgreader.py", line 36, in from_file
    phys_reader, pkg_srels, content_types
  File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/pkgreader.py", line 69, in _load_serialized_parts
    for partname, blob, reltype, srels in part_walker:
  File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/pkgreader.py", line 104, in _walk_phys_parts
    part_srels = PackageReader._srels_for(phys_reader, partname)
  File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/pkgreader.py", line 83, in _srels_for
    rels_xml = phys_reader.rels_xml_for(source_uri)
  File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/phys_pkg.py", line 129, in rels_xml_for
    rels_xml = self.blob_for(source_uri.rels_uri)
  File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/phys_pkg.py", line 108, in blob_for
    return self._zipf.read(pack_uri.membername)
  File "/usr/lib/python3.6/zipfile.py", line 1337, in read
    with self.open(name, "r", pwd) as fp:
  File "/usr/lib/python3.6/zipfile.py", line 1396, in open
    raise BadZipFile("Bad magic number for file header")

zipfile.BadZipFile: Bad magic number for file header

EOFError

File "/home/user/app/application/document_assembly/core_da.py", line 524, in translate_ampersands
    doc = Document(docx=Path(doc_path))
  File "/home/user/app-venv/lib/python3.6/site-packages/docx/api.py", line 25, in Document
    document_part = Package.open(docx).main_document_part
  File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/package.py", line 116, in open
    pkg_reader = PackageReader.from_file(pkg_file)
  File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/pkgreader.py", line 36, in from_file
    phys_reader, pkg_srels, content_types
  File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/pkgreader.py", line 69, in _load_serialized_parts
    for partname, blob, reltype, srels in part_walker:
  File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/pkgreader.py", line 110, in _walk_phys_parts
    for partname, blob, reltype, srels in next_walker:
  File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/pkgreader.py", line 105, in _walk_phys_parts
    blob = phys_reader.blob_for(partname)
  File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/phys_pkg.py", line 108, in blob_for
    return self._zipf.read(pack_uri.membername)
  File "/usr/lib/python3.6/zipfile.py", line 1338, in read
    return fp.read()
  File "/usr/lib/python3.6/zipfile.py", line 858, in read
    buf += self._read1(self.MAX_N)
  File "/usr/lib/python3.6/zipfile.py", line 940, in _read1
    data += self._read2(n - len(data))
  File "/usr/lib/python3.6/zipfile.py", line 975, in _read2
    raise EOFError
EOFError
question from:https://stackoverflow.com/questions/65946376/python-docx-error-opening-file-bad-magic-number-for-file-header-eoferror


1 Reply


A .docx file is a zip archive, and both of these errors indicate that the file being opened is not a valid one. So I expect something is going wrong with the writing of the file (in the step prior to find-and-replace).
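To make that concrete, here is a minimal sketch using only the standard-library zipfile module. It builds a tiny archive in memory and damages it in two ways. The helper names (make_zip, read_member_error) and the member name are made up for illustration, and this is not a claim about what actually happened to the file in the question. It only shows that "Bad magic number for file header" can come from an archive whose directory at the end is still readable while the data near the start is damaged, which matches a traceback that fails on reading a member rather than on opening the archive.

```python
import io
import zipfile


def make_zip():
    """Build a small, valid zip archive in memory and return its bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", "<doc/>")  # illustrative member name
    return buf.getvalue()


def read_member_error(data):
    """Open the archive and read every member.

    Returns the exception that was raised, or None if everything was readable.
    """
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            for name in zf.namelist():
                zf.read(name)
    except (zipfile.BadZipFile, EOFError) as exc:
        return exc
    return None


if __name__ == "__main__":
    good = make_zip()

    # Overwrite the signature of the first local file header. The directory
    # at the end of the archive is untouched, so ZipFile() still opens it,
    # but reading the member raises BadZipFile.
    damaged_header = b"\x00" * 4 + good[4:]

    # Cut the archive in half. The directory at the end is gone, so this
    # already fails when the archive is opened.
    truncated = good[: len(good) // 2]

    for label, data in [("good", good),
                        ("damaged header", damaged_header),
                        ("truncated", truncated)]:
        print(label, "->", repr(read_member_error(data)))
```

The EOFError case is not reproduced here. It is raised from the same member-reading path when the compressed data runs out before the expected length.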

I would start by stopping the process after writing the file and seeing if the file is present on the filesystem and whether it can be opened manually using Word. This should bisect the problem and narrow it down to a writing problem or a reading problem.
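If opening the file in Word by hand isn't practical on a client deployment, a rough programmatic check can stand in for it. The sketch below uses only the standard-library zipfile module. The function name describe_docx_problem is hypothetical, and it only checks that the file is a readable zip archive, not that Word would accept it.

```python
import zipfile
from pathlib import Path


def describe_docx_problem(doc_path):
    """Return None if the file looks like an intact zip archive,
    otherwise a short description of what is wrong with it."""
    path = Path(doc_path)
    if not path.exists():
        return "file does not exist"
    size = path.stat().st_size
    if size == 0:
        return "file is empty"
    if not zipfile.is_zipfile(str(path)):
        return "not a zip archive (%d bytes)" % size
    try:
        with zipfile.ZipFile(str(path)) as zf:
            # testzip() reads every member and returns the name of the
            # first damaged one, or None if all of them are readable.
            bad_member = zf.testzip()
    except (zipfile.BadZipFile, EOFError) as exc:
        return "archive could not be read: %r" % (exc,)
    if bad_member is not None:
        return "damaged member: %s" % bad_member
    return None
```

Calling this right after the write step and again right before the find-and-replace step would show whether the file is already bad when it is written or only when it is read.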

It's possible that an error is raised during the write and isn't being caught, leaving an empty or unflushed (still open) file. So having a way to monitor that step is probably a good idea; writing to a log is one way to do that.
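A minimal sketch of what monitoring the write step could look like is below. Everything named here is an assumption for illustration: the logger name, the save_logged helper, and the ".tmp" suffix. Writing to a temporary name and moving it into place with os.replace is an extra precaution that is not mentioned in the question; it is included so that a reader of doc_path never sees a half-written file.

```python
import logging
import os
from pathlib import Path

log = logging.getLogger("document_assembly")  # hypothetical logger name


def save_logged(save, doc_path):
    """Run save(path) against a temporary name, log the outcome, then move
    the result to doc_path. Returns the size of the written file in bytes.

    `save` is any callable that writes a file to the path it is given.
    """
    doc_path = Path(doc_path)
    tmp_path = doc_path.with_name(doc_path.name + ".tmp")
    try:
        save(str(tmp_path))
    except Exception:
        # Record the full traceback, then let the error propagate so the
        # later steps never run against a missing or partial file.
        log.exception("writing %s failed", tmp_path)
        raise
    size = tmp_path.stat().st_size
    log.info("wrote %s (%d bytes)", tmp_path, size)
    if size == 0:
        raise OSError("%s was written but is empty" % tmp_path)
    # The rename only happens after the writer has returned successfully.
    os.replace(str(tmp_path), str(doc_path))
    return size
```

With python-docx this could be called as save_logged(doc.save, doc_path), since Document.save accepts a file path. The logged sizes from failing and succeeding runs then give something concrete to compare.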

Inspecting the particular cases where there is a failure and managing to reproduce it are going to be critically important. If that's not possible, it's going to be a tough road of guesswork and disappointment on both sides.

