Challenges
- The Contacts PDFs need to be run through an Optical Character Recognition (OCR) system
- We search only for topics/headings, so only the headings need to be extracted from the text
- Handling and searching a large amount of data
Solution
- The PDFs are OCR'd so that their text can be loaded
- Headings were extracted from the text with high recall using Tika
- Headings were indexed in Solr to retrieve the top results
- The top results are then passed to a machine learning (ML) model to determine the best match among them
- A MariaDB (SQL) database is used
- The service is deployed as a Flask API served by Gunicorn
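
The two-stage search described above (retrieve the top results from an index, then pick the best match among them) can be sketched roughly as follows. This is a minimal illustration only, not the project's implementation: the in-memory inverted index stands in for Solr, `difflib.SequenceMatcher` stands in for the ML model, and the function names and sample headings are hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def tokenize(text):
    """Lowercase and split on whitespace; a real analyzer would do more."""
    return text.lower().split()


def build_index(headings):
    """Map each token to the set of heading ids that contain it."""
    index = defaultdict(set)
    for heading_id, heading in enumerate(headings):
        for token in tokenize(heading):
            index[token].add(heading_id)
    return index


def top_candidates(query, headings, index, k=3):
    """Stage 1 (assumed stand-in for the Solr query): rank by shared tokens."""
    counts = defaultdict(int)
    for token in set(tokenize(query)):
        for heading_id in index.get(token, ()):
            counts[heading_id] += 1
    # Highest overlap first; ties broken by original position.
    ranked = sorted(counts, key=lambda i: (-counts[i], i))
    return [headings[i] for i in ranked[:k]]


def best_match(query, candidates):
    """Stage 2 (assumed stand-in for the ML model): most similar string."""
    if not candidates:
        return None
    return max(
        candidates,
        key=lambda c: SequenceMatcher(None, query.lower(), c.lower()).ratio(),
    )


# Hypothetical headings, used only to exercise the sketch.
headings = [
    "Termination of Agreement",
    "Payment Terms",
    "Limitation of Liability",
    "Termination for Convenience",
]
index = build_index(headings)
candidates = top_candidates("termination of agreement", headings, index)
result = best_match("termination of agreement", candidates)
```

Splitting the work this way keeps the expensive comparison limited to a handful of candidates, which is one plausible way to cope with a large amount of data.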
