Searchable Contracts for Public Works Authority
Issues & Objectives
- To develop an ML-based system for searching the contents of contracts:
- The client holds the contracts as PDFs of scanned images.
- These have to be converted into a readable format and then indexed in a database.
- Our aim is to build an ML-based system that can accurately find the search term in the headings of the contracts.
Solution
- The PDFs need to be run through OCR to extract their text.
- Headings were extracted from the text with high recall using Tika.
- Solr:
- Headings were indexed in Solr.
- Used for searching terms and getting top results.
- ML:
- The top results are then passed to ML to determine the best match among them.
- Database: MariaDB (SQL)
- Deployment Framework: Flask API + Gunicorn
Project information
Skills
Machine Learning
Client
Regulatory Authority in Qatar
Domain
Text Analytics
Location
Qatar
Challenges
- The contract PDFs need to be run through an Optical Character Recognition system.
- We search only topics/headings, so only the headings need to be extracted from the text.
- Handling and searching a large amount of data.
Results
- An API was built that takes a contract number and a search term as input.
- It outputs the potential pages where the term could be found.
- The client’s team then integrates the results into their PDF reader to enable direct search and read operations.
Address/Entity Matching
Issues & Objectives
- To develop an ML-based system for matching addresses and entity names:
- The same address or entity name can be written in several different ways.
- It may contain spelling errors, the word order may differ, and abbreviations may be used.
- Our aim is to build an ML-based system that can accurately match addresses and entities with the ones already in our database.
Solution
- Two Stage Modelling: Solr + ML
- Solr:
- It is used to search for an address/entity in the database and to retrieve the top 10, 20 or 30 results.
- ML:
- The top results are then passed to ML to determine the best match among them.
- Database: MariaDB (SQL)
- Deployment Framework: Flask API + Gunicorn
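The two-stage idea above (retrieve a short list, then pick the best match) can be sketched as follows. This is only an illustration under stated assumptions: a token-overlap ranking stands in for Solr, a string-similarity ratio stands in for the trained ML model, and the sample addresses and function names are hypothetical.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lower-case, strip punctuation, and sort tokens so word order does not matter."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return " ".join(sorted(cleaned.split()))


def retrieve_candidates(query, records, top_k=10):
    """Stage 1 (stand-in for Solr): rank records by shared tokens, keep the top_k."""
    query_tokens = set(normalize(query).split())
    scored = []
    for record in records:
        overlap = len(query_tokens & set(normalize(record).split()))
        if overlap:
            scored.append((overlap, record))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [record for _, record in scored[:top_k]]


def best_match(query, candidates):
    """Stage 2 (stand-in for the ML model): pick the most similar candidate."""
    if not candidates:
        return None
    target = normalize(query)
    return max(
        candidates,
        key=lambda c: SequenceMatcher(None, target, normalize(c)).ratio(),
    )


# Hypothetical sample data, not taken from the project.
records = [
    "12 MG Road, Bengaluru 560001",
    "45 Park Street, Kolkata 700016",
    "12 Residency Road, Bengaluru 560025",
]
query = "Bengaluru, 12 M.G. Road 560001"
candidates = retrieve_candidates(query, records, top_k=2)
match = best_match(query, candidates)
print(candidates)
print(match)
```

Keeping the first stage cheap and recall-oriented lets the more expensive second stage run on only a handful of candidates per query.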
Project information
Skills
Machine Learning
Client
Corporate Data Aggregator
Domain
Text Analytics
Location
India
Challenges
- The database consists mostly of addresses and entities (company names and person names).
- Separate models are built for addresses and for entities.
- Entity matching can be tricky because it covers both person names and company names, which ideally require different sets of features.
- Handling and searching a large amount of data.
- Consistent annotation of the data by multiple SMEs was a challenge.
Results
Test recall (KPI) for each model:
- Address Matching: 87.47%
- Entity Matching: 88.62%
Topic Prediction for EdTech
Issues & Objectives
- To develop an ML-based prediction system for online test preparation systems:
- Students are given questions for their exam preparation.
- They may have difficulty solving the questions.
- Our aim is to build an ML- and NLP-based system that can accurately predict the topic/chapter of each question.
- Detecting the appropriate topics removes the need for manual tagging and enables faster and more frequent uploads of new questions/tests.
Solution
- Natural Language Processing (NLP):
- Images: Converted to text using appropriate Optical Character Recognition for different subjects.
- Text: Converted to vectors using Word2Vec
- Algorithm(s): Deep Neural Network + Random Forest
- Storage: AWS Cloud
- Database: MariaDB (SQL)
- Deployment Framework: Flask API + Gunicorn
Project information
Skills
Machine Learning
Client
EdTech
Domain
Natural Language Processing
Location
India
Challenges
- The questions are primarily on 4 subjects: Physics, Biology, Mathematics and Chemistry.
- Questions are available as text, but many of them contain images of text, figures, equations and chemical diagrams.
- Converting equations and chemical diagrams to appropriate formats for ML processing.
Results
Test recall (KPI) for each subject:
- Physics: 92%
- Biology: 88%
- Chemistry: 89%
- Mathematics: 89%
Smart Analysis on Bus Transportation System
Issues & Objectives
- Transport regulatory authority in Singapore commissioned a system to:
- Automatically discover wrong fare incidents and flag commuters’ cards affected by wrong fare charging; and
- Detect emerging fault trends in fare collection equipment so that corrective action could be taken in a timely manner
Solution
- Data Storage: Hadoop and MySQL
- Query Tools: Hive and SQL
- Algorithms: rmr (Parallel versions of R) and Java
- Reporting and Dashboards: Pentaho
Project information
Skills
Advanced Statistical Model
Client
Transportation Authority
Domain
Big Data Analytics
Location
Singapore
Challenges
- Large data volume: 15 million transactions per day, which translates to more than 5 billion historical transactions a year, had to be processed to identify fault patterns and trends
- The data consisted of financial, operations, transit and events data of buses
Results
Robust solution in use for over 2 years allows pro-active rather than reactive maintenance
Child Support Case Management Predictive Analytics
Issues & Objectives
- There are two parents in every child support case. One is the Custodial Parent (CP) – the parent who lives with the child the majority of the time and has the primary day-to-day responsibility; the other is the Non-Custodial Parent (NCP), who also has important responsibilities. An aggrieved CP may appeal to the state to enforce child support by the NCP.
- The project objective is to predict the collection category of cases based on their past payment patterns and various attributes
Solution
- Multinomial logistic regression was used to build the predictive models
- Models were developed for 4 major states in the United States.
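As an illustration of how a fitted multinomial logistic regression assigns a case to a collection category, here is a minimal sketch. The category names, the two features and all coefficient values are hypothetical; they are not the project's variables or fitted weights.

```python
import math


def softmax(scores):
    """Convert per-category linear scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_category(features, coefficients, intercepts, categories):
    """Return the most likely category and the full probability table."""
    scores = [
        b + sum(w * x for w, x in zip(weights, features))
        for weights, b in zip(coefficients, intercepts)
    ]
    probabilities = softmax(scores)
    best = max(range(len(categories)), key=lambda i: probabilities[i])
    return categories[best], dict(zip(categories, probabilities))


# Hypothetical categories and coefficients, for illustration only.
categories = ["regular payer", "partial payer", "non-payer"]
coefficients = [[2.0, -1.0], [0.5, 0.2], [-2.0, 1.5]]
intercepts = [0.0, 0.0, 0.0]

# Features: share of months with a payment, and an arrears flag (both assumed).
label, probs = predict_category([0.9, 0.0], coefficients, intercepts, categories)
print(label, probs)
```

Because the available variables differ from state to state, each state would carry its own coefficient table in a setup like this.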
Project information
Skills
Advanced Statistical Model
Client
Child Support Service
Domain
Predictive Modeling
Location
USA
Challenges
- Extremely large data – Approx. 300,000 cases per month
- Available data and relevant variables differ from state to state. Predictive models built for four states so far
- Collection categories definition
Results
- The accuracy of prediction for a dataset of 12 months was 71–83%.
Scoring Customer Quality Experience (QoE)
Issues & Objectives
- To develop a score of the customer quality of experience (QoE) based on objective factors such as number of stalls, frame drops, ghost sessions, and play delay for an internet video service provider
Solution
- Applied Principal Component Analysis (PCA) technique to build models for scoring
- Developed 4 different models using different variable transformation techniques
- Scored 8,800 records/sec
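One way a PCA-based score can be built is to standardize the objective factors and project each session onto the first principal component. The sketch below assumes that approach; the source does not say how the project turned components into a score. The three columns (stalls, frame drops, play delay) and the sample values are hypothetical, and power iteration is used only to keep the example free of external libraries.

```python
import math


def standardize(rows):
    """Scale each column to zero mean and unit variance."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / len(col)) or 1.0
        for col, m in zip(columns, means)
    ]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def first_component(z_rows, iterations=200):
    """Leading eigenvector of the correlation matrix, found by power iteration."""
    n, p = len(z_rows), len(z_rows[0])
    corr = [
        [sum(r[i] * r[j] for r in z_rows) / n for j in range(p)]
        for i in range(p)
    ]
    vec = [1.0] * p
    for _ in range(iterations):
        nxt = [sum(corr[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    return vec


def qoe_scores(rows):
    """Higher score means better experience.

    Assumes every column is an impairment and the columns are positively
    correlated, so the first component measures overall impairment and the
    score is its negative.
    """
    z_rows = standardize(rows)
    loadings = first_component(z_rows)
    return [-sum(w * z for w, z in zip(loadings, row)) for row in z_rows]


# Hypothetical sessions: [stalls, frame drops, play delay in seconds]
sessions = [
    [0, 1, 0.5],
    [1, 2, 0.8],
    [5, 20, 3.0],
    [8, 35, 4.5],
]
scores = qoe_scores(sessions)
print(scores)
```

Once the loadings are fixed, scoring a new record is a single weighted sum, which is what makes high scoring throughput feasible.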
Project information
Skills
Principal Component Analysis
Client
Telecom
Domain
Scoring
Location
India
Challenges
- Extremely large data – Over 2.5 million records
- Data inconsistencies
- Traffic variation at different times of the day
Results
Distribution Analytics – Demand Forecast
Issues & Objectives
- A Singapore-based company provides a multi-country mobile platform for distributed sales representatives (DSRs), who get updated information on demand forecasts, recommendations and target sales
- They wanted to build appropriate models for forecasting
- All outputs were to be pre-processed in a nightly batch run and saved in a centralized database
- Customized software for managerial decision making was also needed
Challenges
- High attrition of DSRs made it hard to collate time series sales data
- The customer base changes during the transition from one DSR to another
- Intermittent sales data for about 30% of customers
- Discontinued or new product SKUs with a short history of sales data
Project information
Skills
Advanced Statistical Model
Client
SaaS Provider
Domain
Demand Forecasting
Location
Singapore
Solution
- Software developed in R Shiny
- K-means and hierarchical clustering and time series forecasting methods were used
- Batch code was developed in R with input and output links to the client database.
Benefits
- A batch run on a dataset of 60K transactions takes less than 10 minutes and produces multiple output tables
- Experiment with customer segments and view a particular subset for any discount/promotion
- View the position of customers and the recommendation to be made
- Review the profile of a DSR and the extent of target achievement
- Employ various methods and visualize actual vs. forecast
Airline O&D Passenger & Revenue Forecasting
Issues & Objectives
- Forecast passenger and revenue for major O&D (Origin & Destination)/POS (Point of Sale) combinations for a large East African Airline
- Short term O&D forecasts for every flight date up to 90 days in the future to be generated every day
- Long term rolling forecasts up to 5-10 years to be generated quarterly
Methodology
- Linear Regression
- ARIMA/ARIMAX
- Neural Networks
- Etc.
Project information
Skills
Advanced Statistical Model
Client
Budget Airlines
Domain
Revenue Management
Location
Africa
Data
- Short term forecasts based on:
- Current bookings
- Historical bookings
- Seasonality
- DOW (Day-of-Week)
- Etc.
- Long term forecasts based on:
- GDP
- Population growth at origin
- Population growth at destination
- Employment growth at origin
- Employment growth at destination
- Etc.
Solution
- O&D forecasting is very challenging because of the small numbers involved
- Good accuracies obtained
Ad-Spot Optimizer (ASO)
Issues & Objectives
- Generate in real time (typically a few minutes) the daily spot allocation plan which determines the program/breaks in which each spot will be aired.
Solution
- Designed and developed software with the following features
- Assured allocation at spot, brand, advertiser, deal level
- Even distribution of spots of different brands, products, clients when the rates are same
- Long term even distribution of spots across day-parts from each deal time-band
Project information
Skills
Mathematical Optimization Models
Client
Leading TV Broadcaster
Domain
Media Analytics
Location
India
Benefits
- Maximizes revenue (2-4% incremental gain)
- Automates the spot allocation process
- Respects FCT Caps
- All allocation rules such as cap on number of ads for the same brand in a program are satisfied
- Checks that all deal conditions are satisfied while allocating ads
Ad Revenue Optimiser
Issues & Objectives
Creating advertising proposals is a vital aspect of a broadcaster’s operations, as it centres around persuading advertisers to commit to investing in advertising slots or campaigns on the broadcaster’s platforms. These proposals encompass different rates for different advertisers based on the frequency and quantum of ad bookings.
Ad Revenue Optimiser (ARO) is a web application to automate and optimise advertisement inventory planning for the client resulting in additional revenue gain. The solution focuses on creation of proposals, planning of ad inventory and post evaluation.
Project information
Skills
Mathematical Optimization Models
Client
One of Leading Broadcasters in Asia
Domain
Media Analytics
Location
India
Solution
- Designed and developed software to serve the client’s evolving requirements, such as
- state of the art inventory visualisation
- advertising on digital and mobile platforms
- interactive content
- comprehensive sales management
Benefits
- Improved inventory pricing.
- Advertiser-specific price variation and
- Demand driven pricing of inventory.
- Improved allocation of inventory.
- Planned inventory overfill to manage day-to-day demand (RO) variation & avoid wastage.
- Reduced servicing issues and
- Reduced make goods effort
- Improved pipeline visibility
- Sales executives performance tracking
- Negotiations history tracking for future reference
- Improved handling of Make goods
- Streamlined, accurate and faster billing
Application Scorecard
Issues & Objectives
SMEs play a crucial role in the Indonesian economy, contributing significantly to employment and economic growth. Our client, a non-bank financial institution based in Indonesia, offers a range of financing services to individuals and businesses.
The objective of this project was to develop an application scoring model for SMEs. The SME portfolio was new to the client, which had recently been acquired by a multinational bank headquartered in Australia. The scoring model was used for SME loan origination decisions.
Project information
Skills
Advanced Statistical Model
Client
Bank Lending to SMEs
Domain
Risk Analytics
Location
Indonesia
Challenges
- Scaling up operations in accordance with Indonesian government directive
- Very small number of data points ≈ 400 making it difficult to obtain reliable results through predictive modelling
- The SME portfolio was new so the history of defaults had not been well established
Solution
- Bootstrapping was used to overcome the limitation of a small sample.
- Reject rates were taken as a surrogate for default rate.
- High quality scorecard was developed
- Model Gini = 66.74 (Gini > 55 indicates a high quality scorecard)
- Model KS = 53.85 (KS > 45 indicates a high quality scorecard)
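For readers unfamiliar with the two statistics quoted above, the sketch below shows one common way to compute them: KS as the largest gap between the cumulative score distributions of bad and good accounts, and Gini as 2 × AUC − 1. It assumes higher scores mean lower risk; the sample scores and labels are made up and are not project data.

```python
def ks_and_gini(scores, is_bad):
    """Return (KS, Gini) in percent for a scorecard.

    Assumes a higher score indicates a lower-risk applicant.
    """
    bads = [s for s, b in zip(scores, is_bad) if b]
    goods = [s for s, b in zip(scores, is_bad) if not b]
    if not bads or not goods:
        raise ValueError("need at least one good and one bad account")

    # KS: maximum distance between the two cumulative distributions.
    ks = max(
        abs(
            sum(s <= t for s in bads) / len(bads)
            - sum(s <= t for s in goods) / len(goods)
        )
        for t in set(scores)
    )

    # AUC: share of (good, bad) pairs where the good account scores higher,
    # counting ties as half. Pairwise counting is fine for small samples.
    concordant = sum((g > b) + 0.5 * (g == b) for g in goods for b in bads)
    auc = concordant / (len(goods) * len(bads))
    return 100 * ks, 100 * (2 * auc - 1)


# Hypothetical scores and outcomes (1 = bad), for illustration only.
scores = [300, 350, 400, 450, 500, 550, 600, 650]
is_bad = [1, 1, 0, 1, 0, 0, 0, 0]
ks, gini = ks_and_gini(scores, is_bad)
print(round(ks, 2), round(gini, 2))
```

With only a few hundred data points, both statistics can vary noticeably between samples, which is one reason resampling techniques such as bootstrapping are useful when reporting them.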
Benefits
- Process Automation
- Ensures Consistency in decision making
- Predictive modelling replaces gut feel
- Scores recalibrated with default data after some time
Calibration of Expert Scorecard by ML Methods
Issues & Objectives
- For the first time in India, a scorecard was developed for the client to keep vigil on listed companies and avoid potential financial disaster
- The scorecard was based on financial as well as non-financial events such as auditors, board of directors, litigation, news etc.
- The task was to refine the expert scorecard with ML methods
Challenges
- The listed/unlisted flag was incomplete in the database
- Many companies had a large amount of missing data
- Frequent modification of event logic
- Running ML models and processing scores with new weights took several hours, which made multiple iterations a challenge
Project information
Skills
Advanced Statistical Techniques
Client
Corporate Data Aggregator
Domain
Scoring Models
Location
India
Solution
- Decision tree, random forest and gradient boosting were used to obtain the weights of the events
- ML methods were run in H2O
- Models for listed and unlisted companies were built
- Separate weights for listed and unlisted companies were used to arrive at the consolidated score of parent companies
Benefits
- The discriminatory power of the calibrated scorecard was found to be higher than that of the expert scorecard
- A decision overlay was applied, which enhanced the predictive power of the scorecard
- Better separation between GOOD and BAD companies in the modelled score
[Figure: GBM score vs. expert score]
Application Scorecard For Auto Loan
Issues & Objectives
- Application scorecard for sub-prime customers
- Review and recalibrate scorecard
- Use insight data to improve alignment between underwriting rules and scores
Challenges
- Methodology for current scorecard not well documented
- Scores not aligned with underwriting rules
- Data in batches – Credit history, product information, loan terms in different files from different time periods
- Performance available only for 8% TTD population who take up loan from 55% approval
Project information
Skills
Advanced Statistical Techniques
Client
Auto Loan Provider
Domain
Credit Scoring
Location
UK
Solution
- Data collation to align all variables from same time-period was carried out using R for analysis
- Customers classification using domain knowledge and statistical methods – Decision tree and cluster analysis.
- Multiple scorecards each with superior performance than existing scorecard
- All scorecards rescaled to have similar odds
- Scorecard as a linear function for easy integration with loan origination system
- Reviewed underwriting rule and corporate reporting system and recommended changes
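Rescaling several scorecards so that the same score implies similar odds is often done with a linear function of the log-odds. The sketch below shows that common convention; the anchor values (600 points at odds of 50:1, 20 points to double the odds) are assumptions for illustration and are not the project's parameters.

```python
import math


def make_scaler(base_score=600.0, base_odds=50.0, points_to_double=20.0):
    """Build score = offset + factor * ln(odds), with good:bad odds.

    All three anchor values are hypothetical defaults.
    """
    factor = points_to_double / math.log(2)
    offset = base_score - factor * math.log(base_odds)

    def scale(log_odds):
        return offset + factor * log_odds

    return scale


scale = make_scaler()
score_at_base = scale(math.log(50))      # odds of 50:1
score_at_double = scale(math.log(100))   # odds of 100:1
print(round(score_at_base, 2), round(score_at_double, 2))
```

Because the mapping is linear in the log-odds, a scorecard expressed this way remains a simple weighted sum of its inputs, which keeps integration with a loan origination system straightforward.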
Credit Scoring for Leasing Company
Issues & Objectives
- A large company in the UK finances leases of office equipment, primarily to small and medium companies, with a ticket size of less than £10K
- Leased items depreciate rapidly, and seizure of collateral does not recover the debt
- The company currently cherry-picks customers who seldom go bad
- They want to expand the customer base while controlling risk
- For this they want a scorecard to replace rule-driven underwriting for better screening
Challenges
- The company's books identified only 2.5% bad leases; payment history data was fraught with inconsistent figures
- After incorporating liquidation/insolvency/dissolution status and credit bureau ratings, the bad incidence rose to 12%. Applicants who did not take up a loan were classified as Good or Bad by a logical method rather than by reject inference
Project information
Skills
Advanced Statistical Techniques
Client
Equipment Leasing Company
Domain
Credit Scoring
Location
UK
Solution
- Model developed in R
- Two scorecards, with and without credit bureau ratings, were delivered
- The discriminatory power of the scorecards was high, as seen from the high KS and Gini values
Benefits
- Scorecard developed by a statistical method
- Scrutiny restricted to high scorers, reducing manual work by a factor of 5-10
Fraud Scoring for Insurance Claims
Issues & Objectives
- A major insurance company in Singapore used to manually examine each travel insurance claim to identify potentially fraudulent ones
- Suspicious claims were subject to a more detailed investigation
- This involved considerable manual effort & inconsistent processes
- The project objective was to develop a score to identify potentially fraudulent claims which would be subject to greater scrutiny.
Challenges
- Data included 77,445 claim records, of which only 120 had been determined to be potentially fraudulent
- The identified potentially fraudulent claims are therefore rare events (0.15%) and hard to detect
- It was, however, expected that there could be a large number of undetected fraudulent claims
Project information
Skills
Advanced Statistical Techniques
Client
Travel Insurance Provider
Domain
Scoring Models
Location
Singapore
Solution
- Gradient Boosting, a powerful machine learning algorithm, was used for detecting potentially fraudulent cases
- Substantial lift demonstrated: on the client test data set it sufficed to examine 7.75% of all claims to identify 91.67% of all fraudulent claims
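The lift statement above compares the share of claims reviewed with the share of fraud caught. A minimal sketch of that calculation follows; the scores and labels are invented for illustration, and the function name is hypothetical.

```python
import math


def capture_rate(scores, is_fraud, review_fraction):
    """Share of all fraud cases found when reviewing the top-scoring claims."""
    n_review = math.ceil(review_fraction * len(scores))
    ranked = sorted(zip(scores, is_fraud), key=lambda pair: pair[0], reverse=True)
    caught = sum(flag for _, flag in ranked[:n_review])
    total = sum(is_fraud)
    return caught / total if total else 0.0


# Hypothetical fraud scores and outcomes (1 = fraudulent).
scores = [0.9, 0.8, 0.1, 0.7, 0.2, 0.3, 0.05, 0.4, 0.15, 0.25]
is_fraud = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]

rate = capture_rate(scores, is_fraud, review_fraction=0.5)
lift = rate / 0.5
print(round(rate, 4), round(lift, 4))
```

Applied to the reported figures, capturing 91.67% of fraudulent claims by reviewing 7.75% of all claims corresponds to a lift of roughly 11.8 over reviewing claims at random.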
Benefits
- Process automation, ensuring consistency, cost saving and increased accuracy
- Scrutiny restricted to high scorers, reducing manual work by a factor of 5-10
Credit Scoring for Micro Finance Provider

Issues & Objectives
- The client is a multi-finance company providing financing facilities for small and medium enterprises in Indonesia.
- The objective was to develop multiple portfolios for different loan types.
- Link the scorecard to the core banking system to produce an instant score at the time of application
Challenges
- On account of data security, the company developed the model in-house. No data was shared with Smart
- Most of the data fields are in the local language
- Smart guided the whole process remotely with only one on-site visit and many hours of online consulting. This involved detailed analysis of numerous variables for each scorecard
Project information
Skills
Advanced Statistical Techniques
Client
NBFC
Domain
Credit Scoring
Location
Indonesia
Solution
- The whole process is automated with Smart's proprietary software, ACreS
- 4 portfolios were built, one each for Retail SME, Corporate SME, Retail Vehicle Loan and Corporate Vehicle Loan.
- In the absence of a core banking system, a web service was created to input data and receive an instant score
- Newly entered data is stored in a database
Benefits
- Process automation, ensured consistency, decreased manual work and increased accuracy.
- Generation of scorecards in real time with performance measures
- Variable transformations are automatically accounted for in the scoring population
- Deployment of the model and scorecard for scoring new applicants.
- Easy monitoring of scores with the help of interactive reports
Scoring Return of COD Consignments
Issues & Objectives
- To develop a scoring model to identify consignments likely to be returned, in case of Cash on Delivery (CoD) payments for one of the largest e-tailer distributors.
Challenges
- Extremely large data – Over 2.5 million records
- Data inconsistencies
- Traffic variation at different times of the day
Project information
Skills
Advanced Statistical Techniques
Client
E-COM Courier Company
Domain
Scoring Models
Location
India
Solution
- Univariate analysis to identify significant variables
- Clustered clients based on number of orders from a specific vendor
- Developed various models and recommended the most suitable one
Benefits
- K-S 36–45: High separation for application scorecard
- Gini 36–45: Average separation, definitely useful