How to identify Hierarchy Location when using GetChildElements(true)

Hi guys,

I want to use GemBox.Document to translate a Word document into HTML using user-mapped styles and classes (e.g. the user will decide that the “Heading 1” style maps to the “.TopTitle” class in HTML).

To do this I’m iterating through the document elements using GetChildElements(true), which gives me every Word document component in order. However, because I am navigating up and down the hierarchy tree, I need some way of identifying elements.

I can’t see any sort of “unique identifier” on an element that I can use inside my processing loop to uniquely identify the element/parent I am processing.

Am I missing something or is there a better way?

Hi Dave,

Each Element has an ElementType which specifies what type of element it is, and it has Parent and ParentCollection which specify its location:

foreach (Element element in document.GetChildElements(true))
{
    Console.WriteLine(element.ElementType);

    Element parent = element.Parent;
    Console.WriteLine(parent.ElementType);
}

For more information, see the content model diagram.
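
As a rough illustration of how Parent and ParentCollection can describe an element's location, the sketch below builds a readable path by walking up through Parent and counting the element's position among its siblings in ParentCollection. The ElementPath class and its Describe method are hypothetical names, not part of GemBox.Document, and the sketch assumes that ParentCollection can be enumerated and is null for elements that are not held in a collection.

```csharp
using System.Collections.Generic;
using GemBox.Document;

// Hypothetical helper, not part of GemBox.Document.
public static class ElementPath
{
    // Builds a path such as ".../Paragraph[2]/Run[1]" for the given element.
    public static string Describe(Element element)
    {
        var parts = new List<string>();

        // Walk from the element up to the root through Parent.
        for (Element current = element; current != null; current = current.Parent)
        {
            var siblings = current.ParentCollection;

            // Assumption: no ParentCollection means the element is not stored
            // in a collection, so only its type is recorded.
            if (siblings == null)
            {
                parts.Add(current.ElementType.ToString());
                continue;
            }

            // Find the element's position among its siblings by reference.
            int index = 0;
            foreach (Element sibling in siblings)
            {
                if (ReferenceEquals(sibling, current))
                    break;
                index++;
            }

            parts.Add($"{current.ElementType}[{index}]");
        }

        // Parts were collected leaf-first, so reverse them into root-first order.
        parts.Reverse();
        return string.Join("/", parts);
    }
}
```

Calling ElementPath.Describe(element) inside the GetChildElements(true) loop would give each element a location string. Such a path is only a snapshot: it changes as soon as elements are inserted or removed before it.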

I hope this helps.

Regards,
Mario

You’re right. It does.
I guess my explanation wasn’t very good.

Because I’m “manually” mapping between DOCX and HTML, I effectively need a reference from the HTML node to the equivalent DOCX node, so I know how to organise the HTML hierarchy. I was hoping to do this by just storing some sort of “unique id” link, but it seems I’ll need to build a wrapper around the two nodes (the GemBox Element and the equivalent HTML node) and use that to link them. I wouldn’t mind a bit more detail about “ParentCollection”, though. I’m not sure how that can help me at the moment.

Hi Dave,

There is no “unique id”. Nevertheless, how about doing something like this?

You would iterate through elements for which you want to specify the CSS class and add some sort of placeholder (like a bookmark). After that, you would process the resulting HTML and apply the required modification based on those placeholders.

using System.IO;
using System.Linq;
using GemBox.Document;
using HtmlAgilityPack;

string wordStyle = "heading 1";
string htmlStyle = "TopTitle";

var document = DocumentModel.Load("input.docx");

// Add "class" placeholder to paragraphs.
foreach (Paragraph paragraph in document.GetChildElements(true, ElementType.Paragraph)
    .Cast<Paragraph>()
    .Where(p => p.ParagraphFormat.Style?.Name == wordStyle))
{
    paragraph.Inlines.Insert(0, new BookmarkStart(document, $"class:{htmlStyle}"));
    paragraph.Inlines.Insert(1, new BookmarkEnd(document, $"class:{htmlStyle}"));
}

// Save as HTML.
string html;
var htmlOptions = new HtmlSaveOptions() { HtmlType = HtmlType.HtmlInline };
using (var htmlStream = new MemoryStream())
{
    document.Save(htmlStream, htmlOptions);
    html = htmlOptions.Encoding.GetString(htmlStream.ToArray());
}

// Edit HTML, for instance, with HtmlAgilityPack.
var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(html);

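// Get all "class" placeholders (SelectNodes returns null when nothing matches).
var anchors = htmlDocument.DocumentNode.SelectNodes(@"//a[not(@href) and @name]");
foreach (var anchor in anchors ?? Enumerable.Empty<HtmlNode>())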
{
    string anchorName = anchor.GetAttributeValue("name", null);
    if (!anchorName.StartsWith("class:"))
        continue;

    var parent = anchor.ParentNode;

    // Remove placeholder.
    parent.ChildNodes.Remove(anchor);

    // Remove "style" from paragraph and child elements, like runs.
    parent.Attributes.Remove("style");
    foreach (var child in parent.ChildNodes)
        child.Attributes.Remove("style");

    // Add "class" to paragraph.
    parent.Attributes.Add("class", anchorName.Substring("class:".Length));
}

html = htmlDocument.DocumentNode.OuterHtml;
File.WriteAllText("output.html", html);

I hope this helps.

Regards,
Mario

Hi Mario,
Thanks for this. Quite neat! It’s not far off what I’ve done, but I’ve used a wrapper class instead of manipulating the document.
For every GemBox Element that has children, I’ve made a wrapper class containing both the GemBox Element and the HtmlNode. I also store each GemBox Element’s collection of immediate children in the wrapper. That means that to generate the HTML content I can just iterate through the collection of wrappers, map the “parent” element type to the equivalent I need in HTML (using my mapping lookup) and populate the content using the collection of immediate children. For example, three child runs might generate <p class='topTitle'>this is all my content including <em>italic text</em> <b>and some bold text</b></p>.
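
A minimal sketch of this kind of linking is shown below. It is not the actual implementation described in this thread: NodeLink, LinkedConverter and the classByStyle mapping are hypothetical names, it handles only paragraphs and the runs directly inside them, and it assumes .NET 5 or later (for ReferenceEqualityComparer) with the HtmlAgilityPack package. The idea is that the Element object itself can serve as the “unique id” when it is used as a dictionary key compared by reference.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Net;
using GemBox.Document;
using HtmlAgilityPack;

// Hypothetical wrapper pairing one GemBox element with its HTML counterpart.
public sealed record NodeLink(Element Source, HtmlNode Target);

public static class LinkedConverter
{
    // classByStyle is an assumed user mapping, e.g. "heading 1" -> "TopTitle".
    public static string Convert(
        DocumentModel document, IReadOnlyDictionary<string, string> classByStyle)
    {
        var html = new HtmlDocument();

        // Keyed by object reference, so the element acts as its own identifier.
        var links = new Dictionary<Element, NodeLink>(ReferenceEqualityComparer.Instance);

        // First pass: create one <p> per paragraph and remember the link.
        foreach (Paragraph paragraph in document
            .GetChildElements(true, ElementType.Paragraph).Cast<Paragraph>())
        {
            HtmlNode p = html.CreateElement("p");
            string styleName = paragraph.ParagraphFormat.Style?.Name;
            if (styleName != null && classByStyle.TryGetValue(styleName, out string cssClass))
                p.SetAttributeValue("class", cssClass);

            links[paragraph] = new NodeLink(paragraph, p);
            html.DocumentNode.AppendChild(p);
        }

        // Second pass: navigate "up" from each run to its paragraph's HTML node.
        foreach (Run run in document.GetChildElements(true, ElementType.Run).Cast<Run>())
        {
            // Runs whose parent is not a linked paragraph (for example, runs
            // nested in a hyperlink) are skipped in this sketch.
            if (run.Parent == null || !links.TryGetValue(run.Parent, out NodeLink link))
                continue;

            HtmlNode content = html.CreateTextNode(WebUtility.HtmlEncode(run.Text));
            if (run.CharacterFormat.Italic) content = Wrap(html, "em", content);
            if (run.CharacterFormat.Bold) content = Wrap(html, "b", content);
            link.Target.AppendChild(content);
        }

        return html.DocumentNode.OuterHtml;
    }

    // Wraps a node in a new element with the given tag name.
    private static HtmlNode Wrap(HtmlDocument html, string tag, HtmlNode inner)
    {
        HtmlNode outer = html.CreateElement(tag);
        outer.AppendChild(inner);
        return outer;
    }
}
```

The sketch emits paragraphs as a flat list. Preserving nesting (tables, lists, hyperlinks) would mean linking more element types and appending each HTML node to the node linked to its Parent rather than to the document root.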