Explore the future of Web Scraping. Request a free invite to ScrapeCon 2024

2 JavaScript Approaches to Encode/Decode HTML Entities

No plugins required. 2 strategies — (1) A <textarea> HTML element & (2) Regex pattern implementations.

Before diving straight into the implementation details, the following aims to showcase the encoding/decoding of a HTML code snippet respectively so as to specify the context for the rest of this article:

Image by Author | Showcases the raw html string <py-script></py-script> transformed into unicode characters <py-script></py-script> and vice-versa

Image by Author | Showcases the raw html string <py-script></py-script> transformed into unicode characters &#60py-script&#62&#60/py-script&#62 and vice-versa

Why is this necessary?

Technical documentations in general often encompass code snippets as part of its content for readers' future self-references. Hence, when these aforementioned documentations are digitally compiled and rendered in HTML format, encoding HTML entities of code snippets would thus be necessary for proper display and viewing. Case in point, please feel free to check out the following published piece which has highlighted aspects of displaying code snippets in any HTML formatted document:

6 Useful but Lesser-Known HTML Tags & Attributes

  • Note: Tags <code> and <pre> in particular have exemplified this in the above content.

As stated by Vit Tertiumnon in his post: JS: How to decode HTML-entities

There are no JavaScript built-in functionalities to encode or decode HTML entities.

As such, I believe as well that it would be beneficial to share two distinct (proven) means of carrying out this task programmatically for others in the online community.

Method I. Using a <textarea> HTML DOM Element

(Note: Just a disclaimer, this approach was originally taken from Vit Tertiumnon from his post above but with certain tweaks in place)

  • To encode:

    function encodeHTMLEntities(text) {
      let textArea = document.createElement('textarea');
      textArea.innerText = text;
      let encodedOutput=textArea.innerHTML;
      let arr=encodedOutput.split('<br>');
      encodedOutput=arr.join('\n');
      return encodedOutput;
    }
    
  • Usage:

    let sampleRawHTMLStr="<py-script></py-script>";
    let encodedOutput=encodeHTMLEntities(sampleRawHTMLStr);
    
    // output: &lt;py-script&gt;&lt;/py-script&gt;
    
  • To decode encodedOutput:

    function decodeHTMLEntities(text) {
      let textArea = document.createElement('textarea');
      textArea.innerHTML = text;
      return textArea.value;
    }
    
  • Usage:

    let decodedOutput=**decodeHTMLEntities**(encodedOutput);
    
    // output: <py-script></py-script>
    

Method II. Using regex patterns

Alternatively, it is also possible to use regex patterns in JavaScript to attain a desirable output similar to the above:

  • To encode:

    function encodeHTMLEntities(rawStr) {
      return rawStr.replace(/[\u00A0-\u9999<>\&]/g, ((i) => `&#${i.charCodeAt(0)};`));
    }
    
  • Usage:

    let sampleRawHTMLStr="<py-script></py-script>";
    let encodedOutput=encodeHTMLEntities(sampleRawHTMLStr);
    
    // output: &#60;py-script&#62;&#60;/py-script&#62;
    
  • To decode encodedOutput:

    function decodeHTMLEntities(rawStr) {
      return rawStr.replace(/&#(\d+);/g, ((match, dec) => `${String.fromCharCode(dec)}`));
    }
    
  • Usage:

    let decodedOutput=decodeHTMLEntities(encodedOutput);
    
    // output: <py-script></py-script>
    

Summary of Method I & Method II

Based on each respective output, both methods have managed to encode <py-script></py-script> successfully —

I. <textarea> approach

&lt;py-script&gt;&lt;/py-script&gt;

II. Regex approach

&#60;py-script&#62;&#60;/py-script&#62;

While the former is rendered as Unicode characters, the latter has instead encoded each HTML entity in Decimal format. Both approaches would however be displayed no differently from each other on the HTML document:

<py-script></py-script>
  • Feel free to refer to html_entities for the full list of HTML entities and their encoded counterparts.

Personal Comments

By and large I personally prefer the latter's approach (i.e. using regex patterns) since the code is cleaner but it is entirely up to personal preferences. Ideally the encoding standard is consistent throughout the webpage for easy maintenance.

And there you have it! Many thanks for persisting to the end of this article! ❤ Hope you have found this article useful.

— 🌮 Please buy me a Taco ξ(🎀˶❛◡❛)




Continue Learning