Parsers - PDF parser

MordinezNLP.parsers.process_pdf.process_pdf(pdf_input: _io.BytesIO) → List[str]

Reads text from a PDF document passed as a BytesIO object. The function extracts the whole text and then removes text that occurs in tables, because table content is mostly noisy for NLP tasks.

The function works in two parts. The first part removes tokens by exact match when they occur the same number of times in the text and in the tables. The first part operates on a list of tokens; the second operates on tokens joined with spaces.
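The first part can be illustrated with a minimal sketch. This is not the library's implementation: the helper name remove_table_tokens and the sample tokens are hypothetical, and the rule shown (drop a token when its count in the text equals its count in the tables) is one reading of the description above.

```python
from collections import Counter
from typing import List


def remove_table_tokens(text_tokens: List[str], table_tokens: List[str]) -> List[str]:
    """Hypothetical helper: drop tokens whose count in the text equals their count in the tables."""
    text_counts = Counter(text_tokens)
    table_counts = Counter(table_tokens)
    # A token is treated as table-only when every occurrence in the text
    # is accounted for by an occurrence in the tables.
    table_only = {tok for tok, n in table_counts.items() if text_counts[tok] == n}
    return [tok for tok in text_tokens if tok not in table_only]


# Made-up sample data: the full page text contains the table cells as well.
text_tokens = ["Sales", "rose", "in", "2020", "Year", "Total", "2019", "10", "2020", "12"]
table_tokens = ["Year", "Total", "2019", "10", "2020", "12"]

cleaned = remove_table_tokens(text_tokens, table_tokens)
print(cleaned)
```

In this sample, "2020" appears twice in the text but only once in the tables, so the counts differ and both occurrences are kept. The other table tokens appear once in each and are removed.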

Parameters: pdf_input (BytesIO) – A PDF as a BytesIO object
Returns: Parsed text with the text found in tables removed
Return type: List[str]

Example usage for PDF files:

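from io import BytesIO
from MordinezNLP.parsers import process_pdf

# Read the PDF in binary mode and wrap its bytes in a BytesIO object
with open("my_pdf_doc.pdf", "rb") as f:
    pdf = BytesIO(f.read())
    # Extract the text, with table content removed
    output = process_pdf(pdf)
    print(output)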
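When the PDF bytes come from somewhere other than a file on disk, they can be wrapped in a BytesIO object in the same way. The sketch below assumes a hypothetical helper, to_pdf_buffer, which is not part of MordinezNLP. It adds a cheap check that the data starts with the "%PDF-" marker that PDF files normally begin with.

```python
from io import BytesIO


def to_pdf_buffer(data: bytes) -> BytesIO:
    """Hypothetical helper: wrap raw bytes in a BytesIO after a basic sanity check."""
    # PDF files normally begin with the marker "%PDF-"; reject anything else early.
    if not data.startswith(b"%PDF-"):
        raise ValueError("data does not look like a PDF")
    return BytesIO(data)


# Usage (assumes MordinezNLP is installed and raw_bytes holds a real PDF):
# from MordinezNLP.parsers import process_pdf
# output = process_pdf(to_pdf_buffer(raw_bytes))
```

The check only looks at the first bytes, so it catches obviously wrong input but does not prove that the document is valid.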