DocX and HTML

Coordinator
Apr 20, 2011 at 11:33 AM

A lot of people have asked me if DocX can handle HTML and I am sick of saying no.

How would you like this functionality to work? Is the below useful?
Let's discuss it here. Would you like to see this functionality in DocX?

Novacode.Table t = DocX.parseHTML("<table><tr><td>Cell 1 of Row 1</td><td>Cell 2 of Row 1</td></tr></table>");
Novacode.Paragraph p = DocX.parseHTML("<p>Hello World</p>");
Is there a better way to structure this functionality?

Coordinator
Apr 20, 2011 at 4:28 PM

I've been thinking about this for a while now and I am not convinced that this will be a useful feature.

Why?

1) An HTML to DocX converter seems okay, but would it have to support CSS?
2) There are going to be lots of attributes which cannot be transformed from HTML/CSS to DocX.
3) Layout is going to be a massive issue.

Why have so many people asked for this functionality?
Can you please provide me with use cases?

Is it a normal thing for Word users to insert HTML into their documents? Am I missing something :-)

Apr 28, 2011 at 7:37 PM

One use case involves users entering data into a database via a web browser, using a rich text editor like CKEditor or TinyMCE. The users format their text in the WYSIWYG tool, and it is stored as HTML. Users have the option of generating a report on the fly from the web browser. The report generation is simply custom XML injection into a docx. In this case, the raw HTML tags show up as literal text in the injected content.

Jan 3, 2012 at 6:43 PM
Edited Jan 4, 2012 at 7:34 AM

Another use case: we have an application that dates back to .NET 1.0, which we have refactored but couldn't rid of its HTML (now XHTML) content. We now also use the TX Text Control and have kept storing all text in XHTML format. That is, we have customers with DBs full of XHTML data.

If we just ignored CSS, what about Html to OpenXML (http://notesforhtml2openxml.codeplex.com/), which is based on the Open XML SDK for MS Office?

Could this somehow be used together with DocX? This would really, really be great, because the major PITA we are currently experiencing is the different versions of MS Word used by our customers. We are still using MS Word automation via COM and this sucks big time. DocX would stop us from having nightmares if it allowed injection of (X)HTML.

Is there any chance of combining the efforts of DocX and Html to OpenXML?

Best regards,

Kasimier Buchcik

Oct 3, 2012 at 12:28 PM

Is there any solution to this issue?

Cheers,

Raúl

Oct 5, 2012 at 10:49 AM

Another use case is where you need to create a PDF and a Word document of the same content.

Most PDF creation tools have an option to create a document from a string of HTML (or a URL), so it'd be cool if we could have a similar option for DocX.

Dec 10, 2012 at 8:51 AM

I don't think a full HTML/CSS import feature would be useful to enough users to make up for the amount of work it would require. Interpreting HTML/CSS and converting it to anything besides a bitmap is not a trivial problem. Better to focus on making the creation of a Word doc from scratch trivial (which you've already done a great job at).

Also, if people really want HTML/CSS to DocX, that sounds like a different project to me, one that would consume DocX. That is, conversion is its own problem; call it 'Html2DocX' or whatever.

May 19, 2014 at 6:37 PM
coffeycathal wrote:
I've been thinking about this for awhile now and I am not convinced that this will be a useful feature.
Why?
1) A HTML to DocX converter seems okay but would it have to support CSS?
2) There are going to be lots of attributes which cannot be transformed from HTML\CSS to DocX.
3) Layout is going to be a massive issue.
Why have so many people asked for this functionality?
Can you please provide me with use cases.
Is it a normal thing for Word users to insert HTML into their documents? Am I missing something :-)

DocX doesn't need to support the CSS. It just needs to pass it through and include it in the document. For example, take a properly formatted HTML file with absolute CSS link references, inline style blocks, or inline styles. Rename the file to *.docx and open it in Word. Assuming you are using only CSS supported by Word, it will handle it properly. Now obviously that's a hack and not a true Open XML document, but the point is that you do not need to parse the HTML/CSS or "support" anything other than inserting the HTML as HTML instead of as text. You don't need to parse the HTML and try to transform all the elements into something that looks similar in Word.

Here are techniques using Open XML SDK or XPath:

http://stackoverflow.com/questions/18089921/add-html-string-to-openxml-docx-document
http://stackoverflow.com/questions/187448/insert-html-into-openxml-word-document-net
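To make the "insert the HTML as HTML" idea concrete, here is a minimal sketch of what an altChunk import looks like at the package level. It is written in Python using only the standard library so that every part of the zip is visible; it does not use DocX or the Open XML SDK. The names build_docx_with_html, chunk1.html and htmlChunk1 are made up for illustration, and the assumption is that Word converts the embedded HTML part when it opens the file.

```python
import io
import zipfile

# Declares the content type of every part; "html" is registered as text/html.
CONTENT_TYPES = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
  <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
  <Default Extension="xml" ContentType="application/xml"/>
  <Default Extension="html" ContentType="text/html"/>
  <Override PartName="/word/document.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
</Types>"""

# Package-level relationship pointing at the main document part.
ROOT_RELS = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>
</Relationships>"""

# Relationship from the main document to the HTML part (type "aFChunk").
# The Id "htmlChunk1" and the file name "chunk1.html" are illustrative.
DOCUMENT_RELS = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="htmlChunk1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/aFChunk" Target="chunk1.html"/>
</Relationships>"""

# One native paragraph followed by an altChunk that references the HTML part.
DOCUMENT = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
  <w:body>
    <w:p><w:r><w:t>Native paragraph before the HTML.</w:t></w:r></w:p>
    <w:altChunk r:id="htmlChunk1"/>
  </w:body>
</w:document>"""


def build_docx_with_html(html_fragment: str) -> bytes:
    """Return the bytes of a .docx package that embeds html_fragment as an altChunk."""
    # Wrap the fragment in a full HTML document and state the encoding explicitly.
    html = (
        '<html><head><meta charset="utf-8"></head>'
        "<body>" + html_fragment + "</body></html>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("[Content_Types].xml", CONTENT_TYPES.encode("utf-8"))
        zf.writestr("_rels/.rels", ROOT_RELS.encode("utf-8"))
        zf.writestr("word/document.xml", DOCUMENT.encode("utf-8"))
        zf.writestr("word/_rels/document.xml.rels", DOCUMENT_RELS.encode("utf-8"))
        zf.writestr("word/chunk1.html", html.encode("utf-8"))
    return buffer.getvalue()
```

Writing the returned bytes to a file with a .docx extension gives a document where the HTML travels along untouched and nothing in the producing code parses it. The linked answers do the equivalent with the Open XML SDK in C#.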

If someone wants or needs to inline CSS styles before passing the HTML to you, tell them to use something like Premailer.NET. Your library doesn't need to take on that responsibility, at least not for a first pass at the feature.

For me, I am exporting some data to Word documents, but some of the data is snippets of HTML, and I'd like to preserve the display of those snippets as much as possible. I'm willing to accept whatever limitations Word's CSS support has, but DocX has no way for me to insert them, and I can't figure out how to get access through the API to the root document so that I can at least insert them manually when needed (while continuing to use the nice DocX API for the other, non-HTML entities).

May 19, 2014 at 6:49 PM
See this note though:

http://stackoverflow.com/a/17051826/84206

Essentially this makes the output very Microsoft Word specific and not so "open".