Insert large number of documents failed

Apr 16, 2012 at 10:15 AM

I'm trying to generate some documentation using DocX.

Currently I have a lot of methods that each generate a document (a DocX object).
All these objects are then merged into a single DocX, using the InsertDocument method.

This has been working fine, but when the number of documents increased I started to get exceptions:

A first chance exception of type 'System.IO.FileNotFoundException' occurred in mscorlib.dll
Additional information: Could not find file 'C:\Users\****\AppData\Local\IsolatedStorage\rsoflgyo.v2e\5lovqg1f.c0x\Url.bhvcthf3x3enirmdqysxw04d0hklkjpy\Publisher.tf45yme2p4uimfw30gyzehzjob2jrq2v\Files\npadde1w.3eu'.

 

This is how I merge the list of documents:

int i;
for (i = 1; i < documents.Count; ++i)
{
     documents[0].InsertDocument(documents[i]);
}

First of all, why would DocX read from that folder?
Is DocX generating filenames internally (using random names or by hashing?), and is this a collision?

Any help is welcome!

/Dosh

Apr 16, 2012 at 11:07 AM

Forgot to include a call stack:

mscorlib.dll!System.IO.__Error.WinIOError(int errorCode, string maybeFullPath) + 0x4ba bytes
mscorlib.dll!System.IO.LongPathFile.GetLength(string path) + 0xa0 bytes
mscorlib.dll!System.IO.IsolatedStorage.IsolatedStorageFileStream.IsolatedStorageFileStream(string path, System.IO.FileMode mode, System.IO.FileAccess access, System.IO.FileShare share, int bufferSize, System.IO.IsolatedStorage.IsolatedStorageFile isf) + 0x13f bytes
WindowsBase.dll!MS.Internal.IO.Packaging.PackagingUtilities.SafeIsolatedStorageFileStream.SafeIsolatedStorageFileStream(string path, System.IO.FileMode mode, System.IO.FileAccess access, System.IO.FileShare share, MS.Internal.IO.Packaging.PackagingUtilities.ReliableIsolatedStorageFileFolder folder) + 0x3c bytes
WindowsBase.dll!MS.Internal.IO.Packaging.PackagingUtilities.CreateUserScopedIsolatedStorageFileStreamWithRandomName(int retryCount, out string fileName) + 0xa7 bytes
WindowsBase.dll!MS.Internal.IO.Packaging.SparseMemoryStream.SwitchModeIfNecessary() + 0x176 bytes
WindowsBase.dll!MS.Internal.IO.Packaging.SparseMemoryStream.Write(byte[] buffer, int offset, int count) + 0xbe bytes
WindowsBase.dll!MS.Internal.IO.Packaging.CompressEmulationStream.Write(byte[] buffer, int offset, int count) + 0x3d bytes
WindowsBase.dll!MS.Internal.IO.Packaging.CompressStream.Write(byte[] buffer, int offset, int count) + 0x8b bytes
WindowsBase.dll!MS.Internal.IO.Zip.ProgressiveCrcCalculatingStream.Write(byte[] buffer, int offset, int count) + 0xab bytes
WindowsBase.dll!MS.Internal.IO.Zip.ZipIOModeEnforcingStream.Write(byte[] buffer, int offset, int count) + 0x57 bytes
mscorlib.dll!System.IO.StreamWriter.Flush(bool flushStream, bool flushEncoder) + 0x63 bytes
mscorlib.dll!System.IO.StreamWriter.Write(char[] buffer, int index, int count) + 0xa2 bytes
System.Xml.dll!System.Xml.XmlEncodedRawTextWriter.FlushBuffer() + 0xaa bytes
System.Xml.dll!System.Xml.XmlEncodedRawTextWriter.RawText(char* pSrcBegin, char* pSrcEnd) + 0xbc bytes
System.Xml.dll!System.Xml.XmlEncodedRawTextWriter.RawText(string s) + 0x23 bytes
System.Xml.dll!System.Xml.XmlEncodedRawTextWriterIndent.WriteIndent() + 0x22 bytes
System.Xml.dll!System.Xml.XmlEncodedRawTextWriterIndent.WriteStartElement(string prefix, string localName, string ns) + 0x21 bytes
System.Xml.dll!System.Xml.XmlWellFormedWriter.WriteStartElement(string prefix, string localName, string ns) + 0xaa bytes
System.Xml.Linq.dll!System.Xml.Linq.ElementWriter.WriteStartElement(System.Xml.Linq.XElement e) + 0x4c bytes
System.Xml.Linq.dll!System.Xml.Linq.ElementWriter.WriteElement(System.Xml.Linq.XElement e) + 0x3e bytes
System.Xml.Linq.dll!System.Xml.Linq.XElement.WriteTo(System.Xml.XmlWriter writer) + 0x49 bytes
System.Xml.Linq.dll!System.Xml.Linq.XContainer.WriteContentTo(System.Xml.XmlWriter writer) + 0xd0 bytes
System.Xml.Linq.dll!System.Xml.Linq.XDocument.WriteTo(System.Xml.XmlWriter writer) + 0x7d bytes
System.Xml.Linq.dll!System.Xml.Linq.XDocument.Save(System.IO.TextWriter textWriter, System.Xml.Linq.SaveOptions options) + 0x43 bytes
System.Xml.Linq.dll!System.Xml.Linq.XDocument.Save(System.IO.TextWriter textWriter) + 0x1a bytes
DocX.dll!Novacode.DocX.InsertDocument(Novacode.DocX document) + 0x6a9 bytes

Saw in the change list that there was a recent update regarding InsertDocument; could this fix resolve my issue too?
I guess I'll have to build the assembly from code and try it out.

 

Apr 16, 2012 at 1:34 PM

Tried the latest version but it takes forever to complete and generates corrupt documents.
Merging about 200 documents worked fine on the binary version but not 350+ documents, dunno about the latest version.

Gonna try a commercial product instead, good luck with your project.

Developer
Apr 16, 2012 at 1:42 PM

Have you tried compiling the newest version (from sources)? And using code more or less like this:

 

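// Requires: using System.Collections.Generic; using System.Linq; using Novacode;
public static void documentsMerge(object fileName, List<string> arrayList)
{
    // MsWord.Merge(arrayList, (string) fileName, true);
    // deleteFile is a helper that is not shown here; presumably it removes any existing output file.
    bool varTest = deleteFile(fileName.ToString());
    if (varTest)
    {
        // The first file becomes the base document that the others are appended to.
        using (DocX documentToCreate = DocX.Load(arrayList[0]))
        {
            foreach (var alItem in arrayList.Skip(1))
            {
                // Start each merged document on a new page.
                documentToCreate.InsertParagraph().InsertPageBreakAfterSelf();
                // Dispose each source document once it has been merged in.
                using (DocX documentToMergeIn = DocX.Load(alItem))
                {
                    documentToCreate.InsertDocument(documentToMergeIn);
                }
            }
            documentToCreate.SaveAs(fileName.ToString());
        }
    }
}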

 

Apr 16, 2012 at 2:34 PM
Edited Apr 16, 2012 at 2:38 PM

I've tried using the newest version, but no luck (doesn't throw an exception but the generated document is corrupt and won't load in Word).

BTW, the latest version is extremely slow: it takes more than an hour to complete on my i7 machine (vs the "old" binary that completed within 30 seconds for 200 documents).
I've profiled it, and the merge_numbering method consumes all the time. It seems like there's a nested loop in there and I guess that's the problem (I guess it's an O(N^2) operation).
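
A minimal sketch of why a nested loop of the kind suspected here gets slow as the merged document grows. This is not the actual merge_numbering code (it isn't shown in this thread); the class name, the method names, and the use of plain int ids to stand in for numbering definitions are all assumptions made for illustration. If every incoming item is compared against everything merged so far, the total work grows roughly quadratically, while a set lookup keeps it roughly linear.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration only: ints stand in for numbering definition ids.
public static class NumberingMergeSketch
{
    // Quadratic pattern: each incoming id is checked by scanning the whole merged list.
    public static List<int> MergeWithNestedLoop(List<int> target, IEnumerable<int> incoming)
    {
        var result = new List<int>(target);
        foreach (int id in incoming)
        {
            bool found = false;
            foreach (int existing in result) // inner scan gets longer after every merge
            {
                if (existing == id) { found = true; break; }
            }
            if (!found) result.Add(id);
        }
        return result;
    }

    // Same result, but a HashSet answers "already present?" in O(1) on average.
    public static List<int> MergeWithHashSet(List<int> target, IEnumerable<int> incoming)
    {
        var result = new List<int>(target);
        var seen = new HashSet<int>(target);
        foreach (int id in incoming)
        {
            // HashSet<int>.Add returns false when the id is already in the set.
            if (seen.Add(id)) result.Add(id);
        }
        return result;
    }
}
```

Both methods return the same list in the same order; only the cost of the duplicate check differs, which matters once hundreds of documents have been merged into one.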

Your code is functionally equivalent to the code I posted; the only differences are that I don't want a page break before each document and that my documents are all in memory.

Developer
Apr 16, 2012 at 2:49 PM

Do you have headers / footers? I guess I could try merging 300 documents and see how it behaves here.

Developer
Apr 16, 2012 at 2:58 PM

It indeed takes some time. But the errors/corrupted documents were from before the newest version, which was released recently. I guess I will write to Cathal to take a look at the slowdown. If you can wait then wait ;-) If not, then go for the commercial product.

Apr 17, 2012 at 7:30 AM
Edited Apr 17, 2012 at 7:34 AM

The main document (the first one in the array) has headers and footers, so the final document has headers and footers.
The other generated ones do not have any headers.

The documents are generated by doing a search and replace (could probably use a mail merge) from a set of templates.
I also add rows to the tables present in the templates.
I have about ten templates that are used for the 300+ documents.

The sequence is something like this:
1. Load document template.
2. Manipulate (search and replace + add table rows).
3. Add to list of documents. 
4. Repeat 1 - 3 N times.
5. Build one document from list of documents. 

I could probably produce the same output by just building a document from scratch (in code), but then I'd lose the ability to customize / change the design (without a lot of code changes).

As for commercial alternatives, I've found them to be very bloated and/or very expensive for my purpose (http://www.syncfusion.com/ for instance).
Others seem to lack a simple search and replace functionality (http://www.gemboxsoftware.com/) but I could probably make my own.
I'm thinking about trying out the GemBox component since I've already bought their Excel component and it works reasonably well (this is company work and we've got no problems paying for good components, but I still don't want to buy a massive component to solve a minor issue).

EDIT: As for the slow processing I can't really wait hours, minutes would be fine.
Problem is that the number of documents depends on the input data and some future input could probably generate a lot more documents than my current worst case.
So I need to be able to support at least 2x to 3x the number of documents (around 1000 would suffice I guess).

Coordinator
Apr 17, 2012 at 9:04 AM
dosh,

Thank you for posting your feedback, especially the two lines below.

>> BTW, the latest version is extremely slow, takes more than an hour to complete on my i7 machine (vs the "old" binary that completed within 30 seconds for 200 documents).
>> I've profiled and it's the merge_numbers function that consumes all the time, it seems like there's a nested loop in there and I guess that's the problem.

I will spend some time today profiling large examples which contain numbering elements.
I will let you know what I find.

>> Others seem to lack a simple search and replace functionality (http://www.gemboxsoftware.com/) but I could probably make my own.
Search and replace is not a simple function to implement for the DocX format. Internally a document with even simple text can be very fragmented due to edits and revisions.
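
A small sketch of the fragmentation problem, using only System.Xml.Linq rather than DocX itself. The class name, the method names, the sample paragraph, and the `{NAME}` placeholder are made up for illustration; only the WordprocessingML namespace and the `w:p` / `w:r` / `w:t` element names are real. It shows a placeholder that a reader sees as one piece of text but that is stored across two runs, so a replace that looks at each run separately never finds it.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical illustration only; this is not how DocX implements ReplaceText.
public static class FragmentedRunsSketch
{
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // A paragraph whose visible text is "Dear {NAME}, hello",
    // but with the placeholder split across two runs.
    public static XElement SampleParagraph()
    {
        return new XElement(W + "p",
            new XElement(W + "r", new XElement(W + "t", "Dear {NA")),
            new XElement(W + "r", new XElement(W + "t", "ME}, hello")));
    }

    // The text a reader sees: the contents of all w:t elements, concatenated.
    public static string VisibleText(XElement paragraph)
    {
        return string.Concat(paragraph.Descendants(W + "t").Select(t => t.Value));
    }

    // Naive replace: looks inside each w:t on its own and returns the number
    // of text elements changed. Text split across runs is never matched.
    public static int ReplacePerRun(XElement paragraph, string oldValue, string newValue)
    {
        int hits = 0;
        foreach (XElement t in paragraph.Descendants(W + "t").ToList())
        {
            if (t.Value.Contains(oldValue))
            {
                t.Value = t.Value.Replace(oldValue, newValue);
                hits++;
            }
        }
        return hits;
    }
}
```

On the sample paragraph, `VisibleText` contains `{NAME}`, yet `ReplacePerRun(p, "{NAME}", "Dosh")` changes nothing because neither run contains the whole placeholder. A robust implementation has to match against the concatenated text and then map the match back onto the individual runs, which is where the difficulty lies.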

>> I'm thinking about trying out the gembox component since I've already bought their excel component and it works reasonably well (this is company work and we've got no problems paying for good components but I still don't want to buy a massive component to solve a minor issue).
Please let us know how you get on with gembox.

I am sorry that you have found your experience with the InsertDocument() function frustrating.
It is a newly released function, and merging documents with zero loss is, after all, a difficult task.

Kind regards and happy coding,
Cathal

Apr 17, 2012 at 12:11 PM

FYI: for the old DLL that throws an exception:
If I ignore the exception and continue to execute, the produced document will be corrupt.
Word will complain but still manages to fix it when loading, and I couldn't see any errors in the document.
This will work for now, since I just wanted a proof of concept document.
The other alternative is to generate HTML documentation (which is probably needed anyway), but I'd like to get everything as a docx / pdf also.

If it's decided that we should go ahead with an implementation, I'll need to make sure that things are working better and faster.

I understand that S&R can be hard, but in my case I own and control the template documents, so I can make sure that there is no "complex" data in them.

I won't have enough time to try out GemBox.Document before handing off my scope documentation, but if we use it later on I'll share the experience.
I did however look at the documentation and did some small tests (no S&R though) but the trial version was too limiting to try out for real.

..and don't be sorry!
I don't have any expectations on projects without any real funding and I do understand that merging documents correctly is hard (just use google to verify that). Besides it's not really needed for what I want to achieve, it's just convenient to use separate documents as templates, and I'm lazy :) 

/dosh