How to get text in BeautifulSoup like .innerText rather than .textContent in JS

I have an HTML file that contains text inside a p tag, something like this:

<body>
    <p>Lorem ipsum dolor sit amet, 
        consectetur adipiscing elit. 
        Maecenas sed mi lacus. 
        Vivamus luctus vehicula lacus, 
        ut malesuada justo posuere et. 
        Donec ut diam volutpat</p>
</body>

Using Python and BeautifulSoup I tried to get to the text in the p tag, like:

from bs4 import BeautifulSoup

with open("foo.html", 'r', encoding='utf-8') as f:
    soup = BeautifulSoup(f.read(), 'lxml')
p = soup.p
print(p.text)

and the result: 'Lorem ipsum dolor sit amet, \n\t\tconsectetur adipiscing elit. \n\t\tMaecenas sed mi lacus. \n\t\tVivamus luctus vehicula lacus, \n\t\tut malesuada justo posuere et. \n\t\tDonec ut diam volutpat'

The problem is that the result includes the \n and \t characters that appear in the original file (like .textContent in JS). I need something similar to .innerText in JS, which returns the text as the user sees it in the browser.

I tried p.text.replace("\n", " ").replace("\t", ""), but for more complicated markup, such as a tag nested within a tag, it doesn't work (it leaves unnecessary spaces).

Does anyone have an idea how to do this? Thanks in advance!


from bs4 import BeautifulSoup

def get_visible_text(element):
  """
  Extracts text from a BeautifulSoup element, removing unnecessary whitespace.

  Args:
      element: The BeautifulSoup element to extract text from.

  Returns:
      A string containing the extracted text with proper spacing.
  """
  # Text directly under these parents is not rendered by a browser
  if element.parent.name in ['script', 'style', 'head', 'meta', '[document]']:
    return ''

  if element.name == 'br':
    return '\n'  # A <br> passed in directly stands for a line break

  # stripped_strings yields each text node with leading/trailing whitespace removed
  text = ' '.join(element.stripped_strings)
  return ' '.join(text.split())  # Collapse whitespace runs left inside a text node

with open("foo.html", 'r', encoding='utf-8') as f:
  soup = BeautifulSoup(f.read(), 'lxml')

p_text = get_visible_text(soup.find('p'))

print(p_text)

Explanation:

  1. get_visible_text function:
  • Takes a BeautifulSoup element as input.
  • Returns an empty string when the element's parent is script, style, head, meta, or the document root (optional, depending on your needs).
  • Returns a newline when the element passed in is itself a <br>. It does not convert <br> tags nested inside another element.
  • Extracts text using stripped_strings, which yields each text node with its leading and trailing whitespace removed and skips nodes that contain only whitespace.
  • Joins the extracted strings and collapses any remaining runs of whitespace using ' '.join(text.split()).
  2. Main code:
  • Opens the HTML file with proper encoding.
  • Creates a BeautifulSoup object.
  • Finds the <p> element using soup.find('p').
  • Calls get_visible_text on the <p> element to extract the text.
  • Prints the extracted text.
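
For the "tag within a tag" case from the question, one possible variant is to collapse whitespace on the output of get_text() instead of joining the stripped strings. Joining stripped strings with a space would put a space before the comma in the sample below, because the comma sits in its own text node after the <b> tag. This is only a sketch: collapse_text is a hypothetical helper name, the HTML is made up for illustration, and html.parser is used so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

def collapse_text(element):
    # get_text() concatenates the text nodes exactly as they appear in the
    # markup, so no space is invented at inline tag boundaries.
    raw = element.get_text()
    # split() with no arguments splits on any run of whitespace and drops
    # leading/trailing whitespace; joining with ' ' collapses each run.
    return ' '.join(raw.split())

# Illustrative markup (not from the original file): inline tags plus
# source-level line breaks and indentation.
html = """<p>Lorem <b>ipsum</b>,
        dolor <i>sit
        amet</i>.</p>"""

soup = BeautifulSoup(html, 'html.parser')
result = collapse_text(soup.p)
print(result)
```

This still ignores <br> and block-level elements, so it only approximates .innerText for inline content.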

This code removes whitespace characters such as \n and \t, giving text that resembles what a user would see in a browser for simple content like the paragraph above. It is not a full equivalent of .innerText: only a <br> passed in directly becomes a newline, and line breaks implied by nested <br> tags or block-level elements are not reproduced.
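
If <br> tags nested inside the element should also produce line breaks, a rough sketch is to swap each <br> for a placeholder before extracting the text, then collapse whitespace within each resulting line. The names text_with_breaks and BREAK are hypothetical, and the sketch assumes the placeholder character never occurs in the document text. Block-level elements are still not handled.

```python
import copy
from bs4 import BeautifulSoup

# Hypothetical sentinel; assumed never to occur in the document text.
BREAK = '\x00'

def text_with_breaks(element):
    # Work on a copy so the caller's parse tree is left untouched.
    element = copy.copy(element)
    for br in element.find_all('br'):
        br.replace_with(BREAK)
    # Each part is the text between two <br> tags; source newlines and
    # tabs inside a part are collapsed to single spaces.
    parts = element.get_text().split(BREAK)
    return '\n'.join(' '.join(part.split()) for part in parts)

# Illustrative markup (not from the original file).
html = """<p>First line,
        still first.<br>Second <b>line</b>.</p>"""

soup = BeautifulSoup(html, 'html.parser')
result = text_with_breaks(soup.p)
print(result)
```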