Radix Analytics Pvt Ltd

Issues & Objectives

  • To develop an ML-based system for searching content in contracts:
    • The client holds the contracts as PDFs made up of scanned images.
    • These must be converted into machine-readable text and then indexed in a database.

    • The aim is to build an ML-based system that accurately finds the search term in the headings of the contracts.

Solution

  • The PDFs must be OCR’d to extract their text.

  • Headings were extracted from the text with high recall using Tika.

  • Solr:
    • Headings were indexed in Solr.

    • Solr is used to search for the term and retrieve the top results.

  • ML:
    • The top results are then passed to an ML model, which determines the best match among them.

  • Database: MariaDB (SQL)

  • Deployment Framework: Flask API + Gunicorn
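
The retrieve-then-rerank flow described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the field names `heading` and `contract_no`, the core name in the example URL, and both function names are hypothetical, and a simple string-similarity score stands in for the ML model, whose details are not given here.

```python
from difflib import SequenceMatcher
from urllib.parse import urlencode


def build_solr_query(base_url, contract_no, term, rows=10):
    """Build a Solr /select URL for one contract.

    The field names `heading` and `contract_no` are assumptions made
    for this sketch, not the project's actual schema.
    """
    safe_term = term.replace('"', '\\"')  # keep the phrase query well-formed
    params = {
        "q": f'heading:"{safe_term}"',         # search only the indexed headings
        "fq": f'contract_no:"{contract_no}"',  # restrict to the requested contract
        "rows": rows,                          # number of top results to fetch
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"


def rerank(term, candidates):
    """Order Solr candidates by similarity to the search term.

    SequenceMatcher is a placeholder for the ML model: it only
    illustrates where the second-stage scoring would plug in.
    """
    def score(candidate):
        heading = candidate["heading"].lower()
        return SequenceMatcher(None, term.lower(), heading).ratio()

    return sorted(candidates, key=score, reverse=True)


if __name__ == "__main__":
    print(build_solr_query("http://localhost:8983/solr/headings", "C-001", "termination"))
    hits = [
        {"heading": "Payment Terms", "page": 4},
        {"heading": "Termination", "page": 12},
    ]
    print(rerank("termination", hits)[0])
```

Splitting the work this way lets Solr narrow a large index down to a handful of candidates cheaply, so the more expensive model only has to score the top results.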

Project information

Skills

Machine Learning

Client

Regulatory Authority in Qatar

Domain

Text Analytics

Location

Qatar

Challenges

  • The contract PDFs need to be run through an Optical Character Recognition (OCR) system.

  • Only topics/headings are searched, so only the headings need to be extracted from the text.

  • Handling and searching a large amount of data.
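
To illustrate the heading-extraction challenge, here is a minimal sketch of a recall-oriented filter over OCR'd text lines. It is not the Tika-based extraction used in the project: the rules (short line, no sentence-ending punctuation, numbered or capitalised) and the names `is_heading_candidate` and `candidate_headings` are assumptions chosen for this example.

```python
import re

# Matches numbered headings such as "1. Scope" or "2.1 Fees" (assumed convention).
NUMBERED = re.compile(r"^\d+(\.\d+)*[.)]?\s+\S")


def is_heading_candidate(line, max_words=12):
    """Return True if a line looks like a heading under loose, recall-first rules."""
    text = line.strip()
    if not text or len(text.split()) > max_words:
        return False  # empty or too long to be a heading
    if text.endswith((".", ",", ";")):
        return False  # body sentences usually end with punctuation
    if NUMBERED.match(text):
        return True
    if not any(ch.isalpha() for ch in text):
        return False  # page numbers, rules, and other non-text lines
    return text.isupper() or text.istitle()


def candidate_headings(page_text):
    """Return (line_number, text) pairs for lines that may be headings."""
    return [
        (number, line.strip())
        for number, line in enumerate(page_text.splitlines(), start=1)
        if is_heading_candidate(line)
    ]


if __name__ == "__main__":
    sample = (
        "1. Scope of Work\n"
        "The contractor shall deliver the goods on time.\n"
        "TERMINATION\n"
        "payment is due within thirty days\n"
    )
    print(candidate_headings(sample))
```

Loose rules like these favour recall over precision: false positives can be tolerated at this stage because the search and ranking steps filter the candidates later.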

Results

  • An API was built that takes a contract number and a search term as input.

  • It outputs the potential pages where the term could be found.

  • The client’s team then integrates these results into their PDF reader to enable direct search and read operations.
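
The input and output of the API might be shaped roughly as in the sketch below. The actual request and response format is not documented here, so the function name `build_response`, the keys `contract_no`, `term`, and `pages`, and the `max_pages` limit are all hypothetical; in a Flask deployment, a function like this would sit behind a route that returns the dictionary as JSON.

```python
def build_response(contract_no, term, ranked_hits, max_pages=5):
    """Turn ranked heading matches into a list of candidate pages.

    `ranked_hits` is assumed to be ordered best match first, each item
    carrying the page on which its heading appears.
    """
    pages = []
    for hit in ranked_hits:
        if hit["page"] not in pages:  # keep each page once, in rank order
            pages.append(hit["page"])
    return {
        "contract_no": contract_no,
        "term": term,
        "pages": pages[:max_pages],
    }


if __name__ == "__main__":
    hits = [
        {"heading": "Termination", "page": 12},
        {"heading": "Early Termination", "page": 12},
        {"heading": "Payment Terms", "page": 4},
    ]
    print(build_response("C-001", "termination", hits))
```

Returning page numbers rather than text keeps the payload small and lets the PDF reader jump straight to the relevant page.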