Parsers - HTML parser
MordinezNLP.parsers.HTML_Parser.HTML_Parser(html_doc: str, separator: str = ' ') → str
Function which removes non-valuable text and tags from HTML documents. It is based on the research described at https://rushter.com/blog/python-fast-html-parser/
IMPORTANT You must be sure that the text you want to process is an HTML document. Otherwise, parts of the source text may be deleted because they are misinterpreted as tags.
- Parameters
html_doc (str) – an HTML document
separator (str) – separator used to join HTML nodes in the selectolax package
- Returns
String containing the valuable text parsed from the HTML document.
- Return type
str
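To make the role of separator more concrete, the following is a minimal sketch of the general idea (drop script and style content, then join the remaining text nodes with a separator). It uses only the standard library html.parser module and is not the implementation used by HTML_Parser, which relies on selectolax. The names TextExtractor and extract_text are hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects visible text, skipping script/style content."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []       # text nodes kept so far
        self._skip_depth = 0   # > 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep only non-empty text that is outside script/style.
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html_doc: str, separator: str = " ") -> str:
    """Join the kept text nodes with the given separator (illustration only)."""
    parser = TextExtractor()
    parser.feed(html_doc)
    parser.close()
    return separator.join(parser.chunks)


sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Title</h1><p>First <b>bold</b> text.</p>"
    "<script>var x = 1;</script></body></html>"
)
print(extract_text(sample))                  # text nodes joined with a space
print(extract_text(sample, separator="\n"))  # one text node per line
```

In this sketch a separator of ' ' keeps the output on one line, while '\n' puts each text node on its own line. The output of HTML_Parser itself may differ, since it applies its own cleaning rules.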
Example usage for HTML files:
from MordinezNLP.parsers import HTML_Parser

# Read the whole HTML file and extract its valuable text
with open("my_html_file.html", "r") as f:
    html_content = HTML_Parser(f.read())

print(html_content)
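Because non-HTML input can lose content when it is treated as markup, it may help to check the input before parsing. The sketch below is one rough heuristic, not part of MordinezNLP: the function looks_like_html and its list of tag names are assumptions made for illustration, and the check can produce both false positives and false negatives.

```python
import re

# Hypothetical heuristic: look for an opening of a few common HTML tags.
# The tag list is an assumption and is deliberately short.
_HTML_TAG_PATTERN = re.compile(
    r"<\s*(?:!doctype|html|head|body|div|p|span|a|h[1-6]|ul|ol|li|table)\b",
    re.IGNORECASE,
)


def looks_like_html(text: str) -> bool:
    """Return True if the text contains something resembling a common HTML tag."""
    return _HTML_TAG_PATTERN.search(text) is not None


print(looks_like_html("<!DOCTYPE html><html><body><p>Hi</p></body></html>"))
print(looks_like_html("Plain text where 3 < 5 and 7 > 2."))
```

A caller could pass the text to HTML_Parser only when looks_like_html returns True and leave it unchanged otherwise. For anything stricter than this rough check, a dedicated content-type detection step would be more reliable.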