
bug? in Paragraph.RemoveText

Aug 20, 2009 at 3:33 PM

Hi Cathal,

I'm still working on tracing this out, but it appears that if Paragraph.RemoveText is called with the RegexOptions.Singleline option, an exception is thrown.

This also affects any of the functions which call RemoveText, such as ReplaceText.

I'm still not entirely sure, but I believe it traces down to:

internal Run GetFirstRunEffectedByEdit(int index)
{
    foreach (int runEndIndex in runLookup.Keys)
    {
        if (runEndIndex > index)
            return runLookup[runEndIndex];
    }

    if (runLookup.Last().Value.EndIndex == index)
        return runLookup.Last().Value;

    throw new ArgumentOutOfRangeException();
}

Again, I'm not entirely sure at this point, but the repro is to call RemoveText or ReplaceText with RegexOptions.Singleline.

I'll update this post if I can get any more information.

Thanks!

-Brian

Coordinator
Aug 20, 2009 at 4:31 PM
Hi chickendelicious,

It is possible that runLookup.Last() is returning null and that .Value
is then being called on null. I will try to repro this; however, it should
not be specific to Singleline.

Thanks for reporting this,
Cathal
Aug 20, 2009 at 6:30 PM
Edited Aug 20, 2009 at 6:43 PM

Hi Cathal,

Here is code to repro the issue. It does appear to be related to RegexOptions.Singleline, because if I change that option to RegexOptions.IgnoreCase, the exception does not occur. It appears that the actual exception (System.NullReferenceException) is thrown at:

switch (parentElement.Name.LocalName) in Paragraph.cs, because parentElement is null.
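One possible reason the option makes a difference: without RegexOptions.Singleline, "." does not match "\n", so the pattern never matches a tagged span that crosses a line break, and there is nothing to pass on to ReplaceText. A minimal sketch of that regex behaviour, using only System.Text.RegularExpressions and no DocX calls. The sample string and the CountMatches helper are made up for illustration, and it is an assumption that Paragraph.Text exposes line breaks as "\n".

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglelineDemo
{
    // Hypothetical helper: counts matches of the thread's pattern under the given options.
    public static int CountMatches(string text, RegexOptions options)
    {
        return new Regex("<c>(.*?)</c>", options).Matches(text).Count;
    }

    public static void Main()
    {
        // Made-up sample: a tagged span that crosses line breaks.
        string text = "<c>test\nmultiline\ntext\n</c>";

        // With Singleline, "." also matches "\n", so the whole span matches.
        Console.WriteLine(CountMatches(text, RegexOptions.Singleline)); // 1

        // With IgnoreCase only, "." stops at "\n", so nothing matches.
        Console.WriteLine(CountMatches(text, RegexOptions.IgnoreCase)); // 0
    }
}
```

If that is what happens here, switching to IgnoreCase would only hide the exception (no match, so no edit) rather than show that Singleline itself is at fault.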

Make sure your test doc contains some text similar to the following:

<c>test

multiline

text

</c>

------------------------------------------------------------------------------

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using Novacode;

namespace testDocX
{
    class Program
    {
        static void Main(string[] args)
        {
            DocX doc = DocX.Load(@"C:\path\to\test.docx");
            string regxstr = "<c>(.*?)</c>";
            Regex reg = new Regex(regxstr, RegexOptions.Singleline);
           
            foreach (Paragraph p in doc.Paragraphs)
            {
                MatchCollection matches = reg.Matches(p.Text);
                foreach (Match match in matches)
                {
                    string matchstr = match.Value;
                    string newtext = reg.Replace(matchstr, "$1");
                    p.ReplaceText(matchstr, newtext, false, RegexOptions.Singleline);                   
                }
            }
            doc.Save();
        }
    }
}

-------------------------------------------------------------------

Thanks!

-Brian

 

Coordinator
Aug 20, 2009 at 8:40 PM
Brian,

I have tried to recreate this bug by creating a document as you explained above, but I cannot get it to crash. Can you please send me the following by email:

1) The document that is causing the crash,
2) The code you are running that is causing the crash.

By the way, the code you have written above is great. However, it will not be efficient if a Paragraph contains the same match more than once. For example, if a Paragraph contained "<c>Hello</c> <c>Hello</c> <c>Hello</c>", your example would replace all instances of <c>Hello</c> with Hello in the first call to ReplaceText(); the following two calls would be wasted CPU time.

The code below is more efficient because it exploits the fact that we know the start index and length of every match, so we do not need to search the entire Paragraph using ReplaceText().

// Load a document.
using (DocX document = DocX.Load(@"Test.docx"))
{
    // Create a regex to replace text.
    Regex regex = new Regex("<c>(.*?)</c>", RegexOptions.Singleline);

    // Loop through each Paragraph in this document.
    foreach (Paragraph p in document.Paragraphs)
    {
        // Get a collection of matches.
        MatchCollection matches = regex.Matches(p.Text);

        /*
         * We must process matches in reverse order so that the indexes of the
         * matches before this one are not shifted by the change in text length.
         */
        foreach (Match match in matches.Cast<Match>().Reverse())
        {
            // Remove the matched text.
            p.RemoveText(match.Index, match.Length, false);

            // Insert the matched text after it has been regexed.
            p.InsertText(match.Index, regex.Replace(match.Value, "$1"), false);
        }
    }

    /*
     * Save all changes made to this document as Test2.docx.
     * I always do this the first time I run a new DocX program in case I break the original file.
     */
    document.SaveAs("Test2.docx");
} // Release the document from memory.
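The reverse-order idea can be tried without DocX at all. The sketch below applies the same remove-then-insert steps to a plain StringBuilder standing in for the Paragraph; the StripTags helper is hypothetical, and it uses match.Groups[1].Value, which gives the same text as regex.Replace(match.Value, "$1") for this pattern.

```csharp
using System;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

public static class ReverseOrderDemo
{
    // Hypothetical helper: strips <c>...</c> tags by editing at each match's
    // index, starting with the last match.
    public static string StripTags(string text)
    {
        Regex regex = new Regex("<c>(.*?)</c>", RegexOptions.Singleline);
        StringBuilder sb = new StringBuilder(text);

        foreach (Match match in regex.Matches(text).Cast<Match>().Reverse())
        {
            // The indexes were computed against the original text. Editing from
            // the end leaves the indexes of the earlier matches untouched.
            sb.Remove(match.Index, match.Length);
            sb.Insert(match.Index, match.Groups[1].Value);
        }

        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(StripTags("<c>Hello</c> <c>Hello</c> <c>Hello</c>"));
        // Hello Hello Hello
    }
}
```

Walking the matches front to back with the same indexes would go wrong after the first edit, because every later match would have moved left by the length of the removed tags.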

I hope you find this useful,
Happy coding,
Cathal
Aug 20, 2009 at 10:28 PM

Thanks Cathal,

Looks like the issue must be with the document I'm using. I sent it to you along with the sample code to repro.

Also, thanks for the tip; I've implemented it.

-Brian

Sep 22, 2009 at 9:44 PM

Hi Cathal,

I can reproduce this issue with the following code. It appears that the issue occurs when a single paragraph contains multiple lines of text:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using Novacode;

namespace testDocX
{
    class Program
    {
        static void Main(string[] args)
        {
            DocX docTest = DocX.Create("test.docx");
            docTest.InsertParagraph(
                "test\n<c>test\nmultiline\ntest\n</c>", false);
            docTest.Save();

            DocX doc = DocX.Load("test.docx");
            string regxstr = "<c>(.*?)</c>";
            Regex reg = new Regex(regxstr, RegexOptions.Singleline);
            foreach (Paragraph p in doc.Paragraphs)
            {
                MatchCollection matches = reg.Matches(p.Text);
                foreach (Match match in matches.Cast<Match>().Reverse())
                {
                    // Remove the matched text
                    p.RemoveText(match.Index, match.Length, false);
                    // Insert the matched text after it has been regexed.
                    p.InsertText(match.Index, reg.Replace(match.Value, "$1"), false);
                }
            }
            doc.Save();
        }
    }
}

Nov 11, 2009 at 11:46 AM

Hi

Did you find a solution to the multiline errors?

I am having a similar problem.

Regards

Jonathan

Nov 24, 2009 at 7:39 PM

Sorry Jonathan,

I haven't found a solution to this yet, and I don't think Cathal has a lot of free time to update DocX at the moment.

I'll post here if I find a solution.

-Brian

Apr 12, 2010 at 6:28 PM

Hi Cathal,

I am wondering if this issue will be addressed in the new release?

Thanks again for this library.

-Brian

Coordinator
Apr 12, 2010 at 6:31 PM
Brian,

Yes, it most certainly will. I would once again like to apologize for the delay. My last few weeks of college are upon me.

Kind regards,
Cathal
Apr 14, 2010 at 8:01 PM

Thanks Cathal,

I understand the delay completely. Thanks again for your work on this.

-Brian

Coordinator
Nov 14, 2010 at 6:46 PM

As of change set 57500 this issue has been fixed.

Sorry about the delay on this one, guys. I had to rethink and then rewrite a lot of code to get this working correctly.