Parsers - PDF parser

MordinezNLP.parsers.process_pdf.process_pdf(pdf_input: _io.BytesIO) → List[str]

Reads text from a PDF document passed as a BytesIO object. It extracts the whole text and then removes any text that occurs in tables, because table content is mostly noise for NLP tasks.

The function works in two parts. The first removes tokens that match exactly and occur the same number of times in the text and in the tables. The first part operates on a list of tokens; the second operates on tokens joined with spaces.
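The token-level rule in the first part can be illustrated with a minimal sketch. This is not the library's implementation: the helper name remove_table_tokens and the whitespace tokenization are assumptions made for illustration only. It drops a token when its count in the text equals its count in the tables.

```python
from collections import Counter
from typing import List


def remove_table_tokens(text_tokens: List[str], table_tokens: List[str]) -> List[str]:
    """Hypothetical helper: drop tokens whose count in the text equals their count in the tables."""
    text_counts = Counter(text_tokens)
    table_counts = Counter(table_tokens)
    # A token is treated as table-only when every occurrence in the text
    # is accounted for by the tables (same count on both sides).
    table_only = {tok for tok, n in table_counts.items() if text_counts[tok] == n}
    return [tok for tok in text_tokens if tok not in table_only]


# Whitespace tokenization is an assumption of this sketch.
text = "Sales rose sharply . Q1 100 Q2 150".split()
table = "Q1 100 Q2 150".split()
cleaned = remove_table_tokens(text, table)
print(cleaned)
```

A token that appears more often in the text than in the tables is kept, since some of its occurrences must come from running prose.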

Parameters

pdf_input (BytesIO) – A PDF as a BytesIO object

Returns

Parsed text without texts found in tables

Return type

List[str]

Example usage for PDF files:

from io import BytesIO
from MordinezNLP.parsers import process_pdf

with open("my_pdf_doc.pdf", "rb") as f:
    pdf = BytesIO(f.read())
    output = process_pdf(pdf)
    print(output)