PDF Topic Extractor

  • Role: NLP Engineer
  • Client: M&G Investment
  • Technology: NLP - Phrase Extraction
  • Demo URL: Click Here

Project Description:

The objective of this project is to extract text from PDF documents and clean it reliably. An unsupervised phrase extraction algorithm is then run on the cleaned text to group it into topics. This lets a user break any document into well-defined sections and query a topic, for which the tool returns the most relevant paragraphs.
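
The description does not name the phrase extraction algorithm that was used, so the following is only a minimal sketch of one common unsupervised approach: candidate phrases are formed by splitting text at stopwords and punctuation (loosely in the style of RAKE), then ranked by plain frequency, which is a simplification. The stopword list and the function names `candidate_phrases` and `top_phrases` are illustrative assumptions, not part of the project.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real list would be larger).
STOPWORDS = {
    "a", "an", "and", "are", "as", "at", "by", "for", "from", "in", "is",
    "it", "its", "of", "on", "or", "that", "the", "this", "to", "was",
    "were", "with",
}


def candidate_phrases(text):
    """Split text into candidate phrases at stopwords and punctuation."""
    phrases = []
    # Punctuation ends a phrase, so cut the text into fragments first.
    for fragment in re.split(r"[^\w\s-]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                # A stopword closes the phrase collected so far.
                if current:
                    phrases.append(" ".join(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(" ".join(current))
    return phrases


def top_phrases(text, k=3):
    """Rank multi-word candidate phrases by how often they occur."""
    counts = Counter(p for p in candidate_phrases(text) if " " in p)
    return [phrase for phrase, _ in counts.most_common(k)]


sample = (
    "The fund invests in emerging markets. "
    "Emerging markets are volatile, and the fund tracks interest rates."
)
print(top_phrases(sample, k=1))  # ['emerging markets']
```

Frequency ranking keeps the sketch short; a production system would more likely score candidates with word co-occurrence statistics or embeddings.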

Responsibilities:

  • PDF scraping, text extraction and preprocessing.
  • Text cleaning – font identification.
  • Extracting important phrases from short paragraph text.
  • Matching user input text against the extracted phrases.
  • Building a Streamlit-based UI for the tool.
  • Packaging the model in Docker and deploying it to a Heroku server.
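
The step of matching user input against the extracted phrases can be sketched as follows. This is a hedged illustration only: it assumes a hypothetical `topic_index` dictionary that maps each extracted phrase to its paragraphs, and it uses Jaccard token overlap as the similarity measure, which may differ from what the project actually used.

```python
import re


def tokens(text):
    """Return the lowercase word tokens of a string as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_paragraphs(query, topic_index, top_n=1):
    """Return paragraphs of the topic phrases that best overlap the query."""
    q = tokens(query)
    # Rank topic phrases by similarity to the query, best first.
    ranked = sorted(
        topic_index,
        key=lambda phrase: jaccard(q, tokens(phrase)),
        reverse=True,
    )
    results = []
    for phrase in ranked[:top_n]:
        # Skip phrases that share no token with the query.
        if jaccard(q, tokens(phrase)) > 0:
            results.extend(topic_index[phrase])
    return results


# Hypothetical index: extracted phrase -> paragraphs assigned to that topic.
topic_index = {
    "emerging markets": ["Paragraph about emerging markets exposure."],
    "interest rates": ["Paragraph about interest rate risk."],
}
print(best_paragraphs("rates outlook", topic_index))
# ['Paragraph about interest rate risk.']
```

Exact token overlap misses synonyms and inflections, so a real deployment would probably add stemming or an embedding-based similarity.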