Suggested feature: search/replace w/formatting

Aug 18, 2009 at 11:21 AM

Hi, thank you, this looks very promising! I'd like to suggest a feature: it would be nice if search and replace could take formatting into account, e.g. find all text that matches a given string literal or regular expression AND a given formatting (such as font=Arial, weight=bold, color=red), and optionally replace it with new text and/or new formatting.

Coordinator
Aug 18, 2009 at 2:04 PM
Hi Mathetes,

This would be possible to implement, but as of now it is not next on my list. Do you have a serious need for this feature, or is it on your nice-to-have list?

Cathal
Aug 22, 2009 at 7:22 PM

Hi, sorry for the late reply. I really think it would be very useful, because I often use the OOXML SDK v2 to do similar things at several stages of editorial work. Typically I have to deal with entire books, dictionaries and the like written in Word, where each semantically distinct region of text is marked by a unique combination of formatting features (styles and/or direct formatting): e.g. in a dictionary, lemmata are red, bold, size 14; translations are Times, black, size 12; grammatical annotations are italic; and so on. So I typically analyze the document with the SDK to extract the text that has the specified formatting attributes and then pass it to other software for processing; or I use Word add-ins which search the document directly for some formatting and then apply heavy manipulations to the text found. For instance, several texts are written with ad hoc, nonstandard fonts for non-Latin alphabets and need to be recoded into Unicode: so a routine task is to find all contiguous text in font X, convert it as appropriate, replace it with new Unicode text in a different font, and save the modified docx.

At present I use the OOXML SDK to EXTRACT text that will be processed and reused in other formats (typically XML dialects), as with dictionaries, and Word add-ins when I need to directly TOUCH a document, as when converting some text to another encoding. A much more maintainable solution would be to avoid Word altogether (with all its troublesome add-in deployment issues) and directly do things like: find all the text with this formatting that matches this literal/regex, and convert its content into something else.

I know that just determining the effective formatting of a region in DOCX is not trivial, as we have to take into account the long hierarchy chain from themes to template and document defaults, through paragraph and region styles, up to direct formatting. I do something similar in my own code when required, but a solution integrated into a system like yours would be very useful.
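To illustrate the hierarchy chain mentioned above, here is a minimal sketch, assuming each level (document defaults, styles, direct formatting) can be reduced to a simple set of property/value pairs. The class name EffectiveFormatting and the property keys are hypothetical, and this is a deliberate simplification: it is neither the OOXML SDK's nor DocX's resolution logic, and the real format has additional rules that it ignores.

```csharp
using System.Collections.Generic;

// Hypothetical helper, not part of DocX or the OOXML SDK.
public static class EffectiveFormatting
{
    // Layers are passed from least specific (e.g. document defaults)
    // to most specific (direct formatting). A value set in a later
    // layer overrides the same property set in an earlier one.
    public static Dictionary<string, string> Resolve(params IDictionary<string, string>[] layers)
    {
        var result = new Dictionary<string, string>();
        foreach (var layer in layers)
        {
            foreach (var pair in layer)
            {
                result[pair.Key] = pair.Value;
            }
        }
        return result;
    }
}
```

Called with a defaults layer (font, size 12), a style layer (size 14, bold) and a direct-formatting layer (color red), the result keeps the default font, takes size 14 and bold from the style, and adds the red color from the direct formatting.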

Thank you anyway!

Coordinator
Aug 23, 2009 at 6:19 PM
Hi Mathetes,

you have convinced me to add this functionality to DocX. You are correct, this would be a powerful addition, and I am sure that others would also find it useful.

I have a few days off this coming week; I will work on this feature and on some other features that I have been asked to add.

kind regards and thank you for your input,
Cathal
Aug 24, 2009 at 3:56 PM

Thank you, this is great. I'll give it a try as soon as you publish your code (and my deadlines allow it :).

Coordinator
Aug 27, 2009 at 2:57 PM

Hi Mathetes,

I have added new overloads to Paragraph.ReplaceText and DocX.ReplaceText. These new overloads allow you to specify:

1) A new formatting,
2) A match formatting,
3) MatchFormattingOptions 

public void ReplaceText(string oldValue, string newValue, bool trackChanges, RegexOptions options, Formatting newFormatting, Formatting matchFormatting, MatchFormattingOptions fo)

Below is an example of this new functionality in action. I will be adding this to DocX version 1.0.0.8. I want to add more before releasing a new version, so if you would like to test this functionality before release, please send me an email.

happy coding,
Cathal

// Load a document.
using (DocX document = DocX.Load(@"Test.docx"))
{

    // The formatting to match.
    Formatting matchFormatting = new Formatting();
    matchFormatting.Size = 10;
    matchFormatting.Italic = true;
    matchFormatting.FontFamily = new FontFamily("Times New Roman");

    // The formatting to apply to the inserted text.
    Formatting newFormatting = new Formatting();
    newFormatting.Size = 22;
    newFormatting.UnderlineStyle = UnderlineStyle.dotted;
    newFormatting.Bold = true;

    // Loop through the paragraphs in this document.
    foreach (Paragraph p in document.Paragraphs)
    {
        /* 
         * Replace all instances of the string "wrong" with the string "right" and ignore case.
         * Each inserted instance of the string "right" should use the Formatting newFormatting.
         * Only replace an instance of "wrong" if it is Size 10, Italic and Times New Roman.
         * SubsetMatch means that the formatting must contain all elements of the match formatting,
         * but it can also contain additional formatting for example Color, UnderlineStyle, etc.
         * ExactMatch means it must not contain additional formatting.
         */

        p.ReplaceText("wrong", "right", false, RegexOptions.IgnoreCase, newFormatting, matchFormatting, MatchFormattingOptions.SubsetMatch);
    }

    // Save all changes made to this document.
    document.Save();

}// Release this document from memory.
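The difference between SubsetMatch and ExactMatch described in the comment above can be sketched as follows. This is only an illustration of the stated semantics, not DocX's implementation: FormatSpec, MatchMode and FormatMatcher are hypothetical stand-ins, and a null property is assumed to mean "not set".

```csharp
using System;

// Hypothetical stand-in for a formatting description; null means "not set".
public sealed class FormatSpec
{
    public double? Size { get; set; }
    public bool? Bold { get; set; }
    public bool? Italic { get; set; }
    public string FontFamily { get; set; }
}

// Hypothetical mirror of the two options named in the post.
public enum MatchMode { SubsetMatch, ExactMatch }

public static class FormatMatcher
{
    // True if 'actual' satisfies 'wanted' under the given mode.
    public static bool Matches(FormatSpec actual, FormatSpec wanted, MatchMode mode)
    {
        return Check(actual.Size, wanted.Size, mode)
            && Check(actual.Bold, wanted.Bold, mode)
            && Check(actual.Italic, wanted.Italic, mode)
            && Check(actual.FontFamily, wanted.FontFamily, mode);
    }

    private static bool Check(object actual, object wanted, MatchMode mode)
    {
        if (wanted == null)
        {
            // Property not requested: a subset match tolerates any value,
            // an exact match requires that the text does not set it either.
            return mode == MatchMode.SubsetMatch || actual == null;
        }
        // Property requested: the values must be equal in both modes.
        return Equals(wanted, actual);
    }
}
```

Under these assumptions, text that is Size 10, Italic and also Bold matches a Size 10 + Italic pattern with SubsetMatch, but not with ExactMatch, because Bold is extra formatting.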

Aug 29, 2009 at 7:41 AM

Thank you, this is really useful! I'd just like to suggest an additional overload of the ReplaceText method. In most of the scenarios I described in my previous posts there is no simple replacement (like replace A with B); instead, some complex code evaluates the whole text with a specified formatting and outputs new text. So the replacement text is generated at runtime, and I look for ANY text with a specific formatting. For both reasons I cannot simply use a method which accepts an input and an output literal or expression. ReplaceText could not be used in this case unless you provide an additional overload, e.g. one which accepts a delegate to be called whenever a match is found, like:

delegate string GetReplacementText(string sText);

where sText is the input text coming from the match and the return value is the replacement text generated by client code; so we'd have something like this in the ReplaceText implementation (I'm just having a quick look at your code):

// If the formatting matches, do the replace.
// getReplacementText is assumed to be a GetReplacementText delegate instance passed in by the caller.
if (formattingMatch)
{
    InsertText(m.Index + oldValue.Length, getReplacementText(m.Value), trackChanges, newFormatting);
    RemoveText(m.Index, m.Length, trackChanges);
}
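The callback idea can be tried out on plain strings with the MatchEvaluator overload of Regex.Replace from the .NET base class library. The following is only a sketch of the pattern: CallbackReplace and ReplaceWith are hypothetical names, not part of DocX, and the sketch ignores formatting and change tracking entirely.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of DocX.
public static class CallbackReplace
{
    // Replaces every match of 'pattern' in 'input' with the text that
    // the caller-supplied callback returns for the matched text.
    public static string ReplaceWith(string input, string pattern,
                                     Func<string, string> getReplacement,
                                     RegexOptions options)
    {
        return Regex.Replace(input, pattern, m => getReplacement(m.Value), options);
    }
}
```

For example, CallbackReplace.ReplaceWith("a wrong b WRONG", "wrong", s => s.ToUpperInvariant() + "!", RegexOptions.IgnoreCase) yields "a WRONG! b WRONG!": the replacement is computed per match at runtime rather than being a fixed literal.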

(BTW, wandering through your code I found this property setter on Run, which looks like a typo for text = value:

internal string Value { set { value = text; } get { return text; } })

Thanks again!