Parsers - HTML parser

MordinezNLP.parsers.HTML_Parser.HTML_Parser(html_doc: str, separator: str = ' ') → str

Removes non-valuable text and tags from an HTML document and returns the remaining text. It is based on the research described at https://rushter.com/blog/python-fast-html-parser/

IMPORTANT You must be 100% sure that the text you want to process is an HTML document. Otherwise, some parts of the source text may be deleted because they are mistaken for tags.
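
One way to act on the warning above is to run a cheap check before parsing. The sketch below is a heuristic only and is not part of MordinezNLP: the helper name looks_like_html and the list of tags it looks for are assumptions chosen for illustration, and it can miss unusual documents.

```python
import re

# Hypothetical heuristic (not part of MordinezNLP): treat the input as HTML
# only if it contains a doctype or at least one common structural tag.
_HTML_HINT = re.compile(
    r"<!doctype\s+html\b"
    r"|</?(?:html|head|body|div|p|span|a|table|ul|ol|li|h[1-6]|br)\b[^<>]*>",
    re.IGNORECASE,
)


def looks_like_html(text: str) -> bool:
    """Return True if the text contains something that resembles HTML markup."""
    return _HTML_HINT.search(text) is not None


# Plain text with angle brackets is not mistaken for markup.
print(looks_like_html("<!DOCTYPE html><html><body><p>Hi</p></body></html>"))
print(looks_like_html("Use List<String> when 3 < 5 and 7 > 2"))
```

A caller could pass the text to HTML_Parser only when looks_like_html returns True and keep the raw text otherwise.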

Parameters
  • html_doc (str) – an HTML document

  • separator (str) – separator used to join HTML nodes in the selectolax package; defaults to ' '

Returns

String containing the valuable text parsed from the HTML document.

Return type

str

Example usage for HTML files:

from MordinezNLP.parsers import HTML_Parser

# Read the whole file (assuming it is UTF-8 encoded) and extract its text
with open("my_html_file.html", "r", encoding="utf-8") as f:
    html_content = HTML_Parser(f.read())

print(html_content)
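
To illustrate what the separator parameter does, here is a rough approximation built on the Python standard library. It is not the MordinezNLP implementation, which relies on selectolax. The names TextCollector and html_to_text are hypothetical, and dropping the contents of script and style tags is an assumption about what counts as non-valuable text.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html_doc: str, separator: str = " ") -> str:
    """Stand-in for illustration: join the collected text nodes with separator."""
    collector = TextCollector()
    collector.feed(html_doc)
    collector.close()
    return separator.join(collector.chunks)


doc = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><p>Hello</p><script>var x = 1;</script><p>World</p></body></html>"
)
print(html_to_text(doc))        # text nodes joined with a space
print(html_to_text(doc, "\n"))  # text nodes joined with a newline
```

In this sketch, the separator is simply the string placed between adjacent text nodes, so a newline separator puts each node on its own line.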