I have an HTML file that contains text inside a p tag, something like this:
<body>
	<p>Lorem ipsum dolor sit amet,
		consectetur adipiscing elit.
		Maecenas sed mi lacus.
		Vivamus luctus vehicula lacus,
		ut malesuada justo posuere et.
		Donec ut diam volutpat</p>
</body>
Using Python and BeautifulSoup I tried to get to the text in the p tag, like:
from bs4 import BeautifulSoup

with open("foo.html", 'r', encoding='utf-8') as f:
    soup = BeautifulSoup(f.read(), 'lxml')
p = soup.p
print(p.text)
and the result: 'Lorem ipsum dolor sit amet, \n\t\tconsectetur adipiscing elit. \n\t\tMaecenas sed mi lacus. \n\t\tVivamus luctus vehicula lacus, \n\t\tut malesuada justo posuere et. \n\t\tDonec ut diam volutpat'
The problem is that the result includes the \n and \t characters that appear in the original file (like .textContent in JS). I need something similar to .innerText in JS, which returns the text as the user sees it in the browser.
I tried using p.text.replace("\n", " ").replace("\t", ""), but for more complicated cases, like a tag within a tag, it doesn't work (it leaves unnecessary spaces).
Does anyone have an idea how to do this? Thanks in advance!
from bs4 import BeautifulSoup

def get_visible_text(element):
    """
    Extracts text from a BeautifulSoup element, removing unnecessary whitespace.

    Args:
        element: The BeautifulSoup element to extract text from.
    Returns:
        A string containing the extracted text with proper spacing.
    """
    # Skip elements that sit directly under script, style, head, meta, or the document
    parent = element.parent
    if parent is not None and parent.name in ['script', 'style', 'head', 'meta', '[document]']:
        return ''
    if element.name == 'br':
        return '\n'  # A bare <br> element becomes a newline
    # stripped_strings yields each text node with surrounding whitespace removed
    text = ' '.join(element.stripped_strings)
    return ' '.join(text.split())  # Collapse whitespace runs left inside the text nodes

with open("foo.html", 'r', encoding='utf-8') as f:
    soup = BeautifulSoup(f.read(), 'lxml')

p_text = get_visible_text(soup.find('p'))
print(p_text)
Explanation:
get_visible_text function:
Takes a BeautifulSoup element as input.
Excludes text from script, style, head, meta, and document elements (optional, depending on your needs).
Returns a newline when the element passed in is itself a <br> tag; <br> tags nested inside the element are not converted.
Extracts text using stripped_strings, which yields each text node with leading and trailing whitespace removed and skips whitespace-only nodes.
Joins the text strings and collapses consecutive whitespace using ' '.join(text.split()).
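To illustrate the whitespace-collapsing step on its own, here is a minimal sketch that compares the chained replace calls from the question with the split/join idiom. The sample string and the helper name collapse_whitespace are made up for illustration.

```python
def collapse_whitespace(text):
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    # str.split() with no arguments splits on any whitespace run and drops
    # leading/trailing whitespace, so joining with " " normalizes the string.
    return " ".join(text.split())


# Hypothetical sample resembling the output shown in the question.
raw = "Lorem ipsum, \n\t\tconsectetur  elit. \n\t\tMaecenas"

# Chained replace turns each "\n" into a space, which doubles up with the
# trailing space that was already on the line.
naive = raw.replace("\n", " ").replace("\t", "")

clean = collapse_whitespace(raw)

print(repr(naive))
print(repr(clean))
```

The split/join version does not need to know which whitespace characters occur in the file, which is why it tends to be more robust than listing them one by one.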
Main code:
Opens the HTML file with proper encoding.
Creates a BeautifulSoup object.
Finds the <p> element using soup.find('p').
Calls get_visible_text on the <p> element to extract the text.
Prints the extracted text.
This code removes whitespace characters such as \n and \t that come from the formatting of the source file, giving you text that is closer to what a user would see in a browser. It is not a full equivalent of .innerText: it does not take CSS into account, and it only treats <br> specially when the element passed in is itself a <br> tag.
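If nested <br> tags and inline tags matter, one possible variation is to walk the element's descendants, start a new line at each <br>, and collapse whitespace per line. This is only a sketch, not a full .innerText implementation: the function name inner_text_sketch is hypothetical, CSS and block-level elements are ignored, and it uses the built-in 'html.parser' so the example runs without lxml.

```python
from bs4 import BeautifulSoup, Comment, NavigableString


def inner_text_sketch(element):
    """Approximate rendered text: <br> starts a new line, whitespace runs collapse."""
    lines = [[]]  # each entry collects the raw text chunks of one output line
    for node in element.descendants:
        if isinstance(node, Comment):
            continue  # HTML comments are never rendered
        if isinstance(node, NavigableString):
            lines[-1].append(str(node))
        elif node.name == "br":
            lines.append([])  # a <br> anywhere inside the element breaks the line
    # Concatenate chunks first so "foo<b>bar</b>" stays "foobar", then collapse
    # whitespace within each line.
    return "\n".join(" ".join("".join(chunks).split()) for chunks in lines)


# Made-up markup with an inline tag, source-file indentation, and a nested <br>.
html = "<p>Hello   <b>big</b>\n\t\tworld<br>next\n line</p>"
soup = BeautifulSoup(html, "html.parser")
print(inner_text_sketch(soup.p))
```

Concatenating the raw text nodes before collapsing keeps the spacing the author wrote between inline tags, instead of inserting or dropping spaces at every tag boundary.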