HTML import preprocessing with Javascript

ReqEdit Online Documentation

HTML import preprocessing with Javascript

4 min read
last updated: 03/27/2024

HTML Conversion diagram

The Javascript engine enables users to do preprocessing on the html content directly before importing to ReqIF. If no userscript is not selected the pre-processing part is skipped.

The javascript engine is https://github.com/paulbartrum/jurassic – a standalone JS engine with no other dependency, therefore only a reduced set of features are available from the browser world.

Javascript API v.1

functiondescription
removeElementByID(id)id attribute to match and delete
removeElements(nodes)remove the nodes from the dom object
getElementsBy(node, tag, css, id)get back a list of html objects from the specified node optionally filtered by tag name (cs), css class name (ci) or id attribute (ci)
getFirstNodesBy(node, tag)get back the first child node under the specified node that is the specified tag name
trimAll(val)replace all starting and trailing whitespace characters and all newlines with spaces in val
hasClass(node, val)detect if the specified node contains the specified class name (ci) by val
hasID(node, val)detect if the specified node has the specified ID name (ci) by val
cs – case sensitive, ci – case insensitive

User javascript

Each user javascript receives the htmlstring containing the html raw content and must return a htmlstring value by calling setReturnValue(htmlstring). A doc object is available that is basic DOM model of the hierarchical nodes of the htmlstring.

doc – the DOM model based on – https://github.com/xmldom/xmldom (v0.8.6)

Large HTML splitting

Javascript runner has a limitation processing large HTML source files, the content has to be splitted. This is done with the following annotations (in the comment, that’s right)

contentSelector and excludeSelector – uses https://github.com/hcesar/HtmlAgilityPack.CssSelector engine

variabledescription
contentSelectorsplit the html file into multiple elements, each element will be processed separately
excludeSelectorexclude elements from the list based on the specified selector – currently in development
javascript global variables

Example script

/*
contentSelector = "body > *"
excludeSelector = ".footnote, .notes"
*/


// PERFORM CUSTOM ACTIONS

var tables = getElementsBy(doc, "table", "table"); // tables with .table class
var images = getElementsBy(doc, "img");

//console.log("table" + tables.length + " images" + images.length)
// node can be .table or can contain images - leave alone
// node contains super/sub formatted strings replace tags

if (!tables.length) {
    if (!images.length) {
        // if structural table
        if (doc.firstChild.nodeName.toLowerCase() == "table") {
            var line = "<p class='object'>";

            // process first row
            var trs = getFirstNodesBy(doc, 'tr');
            for (var i = 0; i < trs[0].childNodes.length; i++) {
                var td = trs[0].childNodes[i];
                //console.log(td.nodeName + " " + trimAll(td.textContent));
                if (trimAll(td.textContent)) {
                    // first p of TD
                    // var firstp = getFirstNodesBy(td, 'p');
                    // line += trimAll(firstp[0].textContent)+" ";
                    //console.log(trimAll(td.textContent));

                    for (var ii = 0; ii < td.childNodes.length; ii++) {
                        if (td.childNodes[ii].nodeName.toLowerCase() != "table") {
                            line += trimAll(td.childNodes[ii].textContent) + " ";
                        }
                    }

                    var subtables = getFirstNodesBy(td, 'table');
                    if (subtables.length) {
                        line += "<br>";
                        var l2tables = getFirstNodesBy(td, "table");
                        l2tables.forEach(function (table) {
                            line += trimAll(table.textContent) + "<br>";
                            //console.log(trimAll(table.textContent));
                        })
                    }
                }
            }
            line += "</p>";
            //console.log("line");
            htmlstring = line;
            /*
              htmlstring = htmlstring.replace(/<span class="super">(.+?)<\/span>/ig, "<sup>$1</sup>");
              htmlstring = htmlstring.replace(/<span class="sub">(.+?)<\/span>/ig, "<sub>$1</sub>");
              htmlstring = "<p>" + doc.firstChild.textContent + "</p>";
              */
        } else
            if (doc.firstChild.nodeName.toLowerCase() == "div") {
                // annex that has to be processed again
                //console.log('div annex');
                htmlstring = doc.toString();
            }
            else
            {
                var elem = doc.firstChild;
                // is not a table

                // starts with number
                if (/\d+/i.test(elem.textContent)) {
                    htmlstring = "<p class='object'>" + elem.textContent + "</p>";
                    //console.log("p ");
                } else
                    if (hasClass(elem, 'ti-art') || hasID(elem)) {
                        htmlstring = "<h1 class='object heading'>" + elem.textContent + "</h1>";
                        //console.log("h1 ");
                    } else
                        if (hasClass(elem,'.sti-art')) {
                            htmlstring = "<h2 class='object heading'>" + doc.firstChild.textContent + "</h2>";
                            //console.log("h2 ");
                        } else {
                            htmlstring = "<small>" + doc.firstChild.textContent + "</small>";
                            //console.log("small ");
                        }
            }
            //htmlstring = doc.toString();
    } else {
        // has images
        // do not change
        //console.log("has image");
        htmlstring = doc.toString();
    }
} else {
    // content is other then tables
    // do not change
    //console.log("is table");
    htmlstring = doc.toString();
}

//htmlstring = doc.toString();
setReturnValue(htmlstring);

Testing

ReqEdit will execute whatever the userscript contains. Any errors are logged back to the user.

One way of testing is to use JSBIN https://jsbin.com/?js,console – online js sandbox, where the userscript can be tested with html parts. The contentSelector and excludeSelector will not work.

JSBIN online sandbox for JS testing

Always use the actual scripts available in ReqEdit.

  • Take the ReqEdit/Data/JavascriptEngine/dom-parser.js and
    ReqEdit/Data/JavascriptEngine/jsapi.js and paste it in the Javascript panel.
  • Add the custom script
  • End the script with console.log(…) to test the output of the script
  • using in ReqEdit setReturnValue(htmlstring) must be used

Running in ReqEdit

Select HTML
Select user script and change timeout if needed
if progress takes long the process can be stopped and restarted
open the converted ReqIF file

Tips

Clean up and optimize the original HTML as much as possible.

Remove deeply nested elements to produce a nice list of html nodes and let the contentSelector split the html up.

Split large tables into smaller tables.

Create a user script that will clean-up even more the HTML and handle the sup/sub formatting or convert tags.