Public Sector &amp; Education

Freight Pricing across US mainland

Issues & Objectives

The trucking company was facing challenges in agreeing on right freight price with truck operators.
Only a few skilled agents with knowledge of certain lanes could negotiate successfully
Suggest appropriate fare to be offered on load-boards for full truck loads (FTL) freight on a specific lane (Origin-Destination) for a particular date, equipment & load features, in the continental US market

Results

A combination of very accurate localized models and broader hub-based models gave ~95% accuracy

Project information

Techniques

Forecasting

Client

Large Trucking Company

Industries

Supply Chain & Retail

Location

Challenges

Very sparse data: 70,000 loads (over 3 years), spread over 70 lanes and 42 equipment to answer a problem for any lane across the continent
Apparently erratic prices: e.g. OH-TX lanes (750-1200 mi) are priced same as IN-PA (500-600 mi), for same equipment!
Prohibitive price of historic data from Truckstop or DAT meant only generic third-party data like Fuel prices & CASS index

Solution

PDFs need to be OCR’d to load texts in it
Ensemble modelling – A very large number of models were developed
Solution deployed using MLFlow, integrated with DataLake and transaction system

What we build, how it performs – Explore our work!

Would love to hear your thoughts!

Telecom Scorecard

Telecom Scorecard

Issues & Objectives

Scorecard for telecom customers for a Malaysian Bureau
Bureau variables as well as information from litigation/legal suit and electronic transfer data used in the model
Logistic Regression and a machine learning model to replace generic bureau score for these customers

Benefits

Performance of both LR and XGBoost scores better than generic bureau scorecard
Implementation code in python facilitated easy integration

Project information

Techniques

Risk Analytics

Client

Telecom Customers for a Malaysian Bureau

Industries

Financial Services

Location

Malaysia

Challenges

Data in batches
Data change during analysis resulting in frequent rework
Variable recoding towards the end of modeling work

Solution

Data processing and analysis carried out in python
Scorecard with reason code for LR model
For machine learning method, XGBoost model showed better performance than Random Forest model
Score for ML model was generated
Score interpretation using LIME (Local Interpretable Model Agnostic Explanation)

What we build, how it performs – Explore our work!

Public Sector & Education, Text Analytics

Would love to hear your thoughts!

Searchable Contracts for Public Works Authority

Searchable Contracts for Public Works Authority

Issues & Objectives

To develop ML based system for searching contents in contracts:

Contacts are present with the client in pdf format with scanned images
We have to convert it into readable formats and then index them in database
Our aim is to build a ML based system that can accurately find the search term in headings of the contracts

Results

API which takes contact number and search term as an input was built
It outputs the potential pages where the term could be found
The client’s team then integrates them in their PDF reader to enable direct search and read operations

Project information

Techniques

Text Analytics

Client

Regulatory Authority in Qatar

Industries

Public Sector & Education

Location

Qatar

Searchable-Contracts

Challenges

The Contacts PDFs needs to be run through an Optical Character Recognition system
We are only searching for topics/headings, therefore we only need to extract the headings from the text
Handling and Searching in a large amount of data

Solution

PDFs need to be OCR’d to load texts in it
Headings were extracted from the texts with a high recall using Tika
Headings were indexed in solr to determine top results
The top results are then passed to ML to determine the best match among them
MariaDB (SQL) database is used
The deployment framework is Flask API + Gunicorn

What we build, how it performs – Explore our work!

Public Sector & Education, Text Analytics

Would love to hear your thoughts!

Address/Entity Matching

Address/Entity Matching

Issues & Objectives

To develop ML based system for matching addresses and entity names:

Same address or entity names can be written in few different ways
It may have spelling issues, order might be different, abbreviations may exist
Our aim is to build a ML based system that can accurately match the addresses and entity with the ones already in our database

Results

Test recall (KPI) for different models are given as below:

Address Matching: 87.47 %
Entity Matching: 88.62 %

Project information

Techniques

Text Analytics

Client

Corporate Data Aggregator

Industries

Public Sector & Education

Location

India

address-matching

Challenges

The database is mostly on addresses and entities (company names & person names)
Different models are being built for address and entity separately
Entity matching can be tricky as it contains both person name and company name which ideally should require different set of features
Handling and Searching in a large amount of data
Annotation of data consistently by multiple SMEs was a challenge

Solution

Solr is used to search an address/entity from the database and to determine top 10,20 or 30 results
The top results are then passed to ML to determine the best match among them
MariaDB (SQL) database is used
The deployment framework is Flask API + Gunicorn

What we build, how it performs – Explore our work!

Public Sector & Education, Text Analytics

Would love to hear your thoughts!

Topic Prediction for EdTech

Topic Prediction for EdTech

Issues & Objectives

To develop ML based system for predictions for Online Test Preparation Systems:

Students are asked questions for there exam preparations
They may face issue with solving questions
Our aim is to build a ML and NLP based system that can accurately predict the topic/chapter of that question
Detecting appropriate topics removes the need for manual tagging and enables faster and frequent uploads of new questions/tests

Challenges

The questions are primarily on 4 subjects: Physics, Biology, Mathematics and Chemistry
Questions are available as text, but many of them contains images of text, figures, equations and chemical diagrams
Converting equations and chemical diagrams to appropriate formats for ML processing

Project information

Techniques

Text Analytics

Client

EdTech

Industries

Public Sector & Education

Location

India

Topic_Prediction_for_EdTech

Solution

Natural Language Processing (NLP):

Images: Converted to text using appropriate Optical Character Recognition for different subjects
Text: Converted to vectors using Word2Vec

Algorithm(s): Deep Neural Network + Random Forest
Storage: AWS Cloud
Database: MariaDB (SQL)
Deployment Framework: Flask API + Gunicorn

Results

Test recall (KPI) for different Subjects are given as below

Physics: 92%
Biology: 88%
Chemistry: 89%
Mathematics: 89%

What we build, how it performs – Explore our work!

Would love to hear your thoughts!

Smart Analysis on Bus Transportation System

Smart Analysis on Bus Transportation System

Issues & Objectives

Transport regulatory authority in Singapore commissioned a system to

Automatically discover wrong fare incidents and flag commuter’s cards affected by wrong fare charging; and
Detect emerging fault trends in fare collection equipment so that corrective action could be taken in a timely manner

Challenges

Large data 15 million transactions per day, which translates to more than 5 billion historical transactions in a year needs to be processed to identify fault patterns and trends
The data consisted of financial, operations, transit and events data of buses

Project information

Techniques

Forecasting

Client

Transportation Authority

Industries

Public Sector & Education

Location

Singapore

Solution

Data Storage: Hadoop and MySQL
Query Tools: Hive and SQL
Algorithms: rmr (Parallel versions of R) and Java
Reporting and Dashboards: Pentaho

Results

Robust solution in use for over 2 years allows pro-active rather than reactive maintenance

What we build, how it performs – Explore our work!

Public Sector & Education, Risk Analytics

Would love to hear your thoughts!

Child Support Case Management Predictive Analytics

Child Support Case Management Predictive Analytics

Issues & Objectives

There are two parents in every child support case. One is the Custodial Parent (CP) – the parent who lives with the child the majority of the time and has the primary day-to-day responsibility; the other is the Non-Custodial Parent (NCP) who also has important responsibilities. An aggrieved CP may appeal to the state to enforce child support by the NCP
The project objective is to predict the collection category of cases based on its past payment pattern and various attributes

Challenges

Extremely large data – Approx. 300,000 cases per month
Available data and relevant variables differ from state to state. Predictive models built for four states so far
Collection categories definition

Project information

Techniques

Risk Analytics

Client

Child Support Service

Industries

Public Sector & Education

Location

USA

child-support

Solution

Multinomial Logistics Regression technique was used to build the predictive models
Models were developed for 4 major states in the United States

Results

The accuracy of prediction for a dataset of 12 months was 71–83%

child_support_analytics

What we build, how it performs – Explore our work!

Public Sector & Education, Risk Analytics

Would love to hear your thoughts!

Scoring Customer Quality Experience (QoE)

Scoring Customer Quality Experience (QoE)

Issues & Objectives

To develop a score of the customer quality of experience (QoE) based on objective factors such as such as number of stalls, frame drops, ghost sessions, and play delay for an internet video service provider

Challenges

Extremely large data – Over 2.5 million records
Data inconsistencies
Traffic variation at different time of the day

Project information

Techniques

Risk Analytics

Client

Telecom

Industries

Public Sector & Education

Location

India

Scoring-Customer-Quality

Solution

Applied Principal Component Analysis (PCA) technique to build models for scoring
Developed 4 different models using different variable transformation techniques
Scored 8,800 records/sec

Results

What we build, how it performs – Explore our work!

Would love to hear your thoughts!

Distribution Analytics – Demand Forecast

Distribution Analytics – Demand Forecast

Issues & Objectives

A Singapore based company provide multi country mobile platform for distributed sales representatives who gets updated information on demand forecast, recommendation and target sale
They wanted to build appropriate models for forecast
All output were to be pre processed in nightly batch run and saved in a centralized database
A customized software for managerial decision making was also needed

Benefits

Batch run for a dataset of 60K transaction take less than 10 minutes producing multiple output tables
Experiment with customer segments and view a particular subset for any discount/promotion
View the position of customers and the recommendation to be made
Review the profile of DSR and extent of target achievement
Employ Various methods and visualize actual vs forecast

Project information

Techniques

Forecasting

Client

SaaS Provider

Industries

Supply Chain & Retail

Location

Singapore

Challenges

High attrition of DSRs made it hard to collate a time series sales data
Customer base changes between transition from one DSR to another
Intermittent sales data for about 30% of customers
Discontinued or new product SKUs with short history of sales data

Solution

Software developed in R Shiny
K-means and hierarchical clustering and time series forecasting methods were used
Batch code is developed in R with input and output link to client database

What we build, how it performs – Explore our work!

Would love to hear your thoughts!

Real Estate Valuation Analytics

Real Estate Valuation Analytics

Issues & Objectives

Loans by a large financial company in India to real estate developers are repaid in a mix of cash and inventory
The projects under development will only be ready for launch (sale to buyers) a number of years after the loan is taken
The financial company therefore requires prediction of property prices at time of launch and a number of years post launch
They commissioned Smart to build customized software for such price analytics

Benefits

Property valuation at launch as well as comparison with competition in few clicks
View relative position of the project in the same locality
Valuation in next 5 years post launch is instantaneous

Project information

Techniques

Forecasting

Client

Large financial company in India

Industries

Supply Chain & Retail

Location

India

Challenges

Data missing for 50% projects
Data mismatches
One-third of the real estate project records could not be used for modeling due to missing price or inventory data
Project amenities specified using free text; same amenity could have multiple descriptions

Solution

Software developed on R Shiny platform
An NLP technique, Word2Vec was used for specification and amenity data
Clusters of micro markets were formed by hierarchical clustering method
Algorithm for forecasting velocity of sale was developed
Link with the database for comparison of any new project

What we build, how it performs – Explore our work!

Would love to hear your thoughts!

Application Scorecard

Application Scorecard

Issues & Objectives

SMEs play a crucial role in the Indonesian economy, contributing significantly to employment and economic growth. Our client, a non-bank financial institution based in Indonesia, offers a range of financing services to individuals and businesses
The objective of this project was to develop application scoring model for SME. The SME portfolio was new to the client which had been recently acquired by a multinational bank headquartered in Australia. The scoring model was used for SME loan origination decisions

Benefits

Process Automation
Ensures Consistency in decision making
Predictive modelling replaces gut feel
Scores recalibrated with default data after sometime

Project information

Techniques

Risk Analytics

Client

Bank Lending to SMEs

Industries

Financial Services

Location

Indonesia

app-scorecard

Challenges

Scaling up operations in accordance with Indonesian government directive
Very small number of data points ≈ 400 making it difficult to obtain reliable results through predictive modelling
The SME portfolio was new so the history of defaults had not been well established

Solution

Bootstrapping was used to overcome the limitation of a small sample
Reject rates were taken as a surrogate for default rate
High quality scorecard was developed
Model Gini = 66.74 (Gini > 55 indicates a high quality scorecard)
Model KS = 53.85 (KS > 45 indicates a high quality scorecard)

What we build, how it performs – Explore our work!

Would love to hear your thoughts!

Calibration of Expert Scorecard by ML Methods

Calibration of Expert Scorecard by ML Methods

Issues & Objectives

For the first time in India, a scorecard was developed for the client to keep vigil on the listed companies to avoid potential financial disaster
Scorecard was based on financial as well nonfinancial events such as auditors, board of directors, litigation, news etc
The task was to refine expert scorecard with ML methods

Benefits

Discriminatory power of the calibrate scorecard was found to be higher than the expert scorecard
Apply a decision overlay which enhanced the predictive power of the scorecard
Better separation between GOOD and BAD companies in modelled score

Project information

Techniques

Risk Analytics

Client

Corporate Data Aggregator

Industries

Financial Services

Location

India

GBM Score

Expert Score

Solution

Decision tree, Random forest and Gradient boosting were used to obtain weights of the events
ML methods were run in h2o
Models for listed and unlisted companies were built
Separate weights for listed and unlisted companies were used to arrive at the consolidated score of parent companies

Challenges

Listed and unlisted flag was incomplete in the database
Many companies had large number of missing data
Frequent modification of event logic
Running ML models and processing score with new weights took several hours posing a challenge to multiple iteration

What we build, how it performs – Explore our work!

Would love to hear your thoughts!

Application Scorecard For Auto Loan

Application Scorecard For Auto Loan

Issues & Objectives

Application scorecard for sub-prime customers
Review and recalibrate scorecard
Use insight data to improve alignment between underwriting rules and scores

Challenges

Methodology for current scorecard not well documented
Scores not aligned with underwriting rules
Data in batches – Credit history, product information, loan terms in different files from different time periods
Performance available only for 8% TTD population who take up loan from 55% approval

Project information

Techniques

Risk Analytics

Client

Auto Loan Provider

Industries

Financial Services

Location

Application-Scorecard-For-Auto-Loan

Solution

Data collation to align all variables from same time-period was carried out using R for analysis
Customers classification using domain knowledge and statistical methods – Decision tree and cluster analysis
Multiple scorecards each with superior performance than existing scorecard
All scorecards rescaled to have similar odds
Scorecard as a linear function for easy integration with loan origination system
Reviewed underwriting rule and corporate reporting system and recommended changes

auto-loan

What we build, how it performs – Explore our work!

Would love to hear your thoughts!

Credit Scoring for Leasing Company

Credit Scoring for Leasing Company

Issues & Objectives

A large company in UK finances lease of office equipment, primarily to small and medium companies with ticket size less than £10K
Leased items depreciates rapidly and seizure of collateral does not recover the debt
The company currently cherry picks customers who seldom go bad
They want to expand customer base while controlling risk
For this they want a scorecard to replace rule driven underwriting for better screening

Benefits

Scorecard developed by Statistical method
Scrutiny restricted to high scorers reducing manual work by a factor of 5 -10

Project information

Techniques

Risk Analytics

Client

Equipment Leasing Company

Industries

Financial Services

Location

Challenges

Company book identified only 2.5% bad lease – payment history data was fraught with inconsistent figures
After incorporating liquidation/insolvency/dissolution status and rating from credit bureau record, the incidence was boosted to 12%. The process classified non takers of loan to Good and Bad by a logical method and not by reject inference

Solution

Model developed by R program
2 scorecards with and without credit bureau ratings were delivered
Discriminatory power of the scorecards were high as seen from high KS and GINI

What we build, how it performs – Explore our work!

Would love to hear your thoughts!

Fraud Scoring for Insurance Claims

Fraud Scoring for Insurance Claims

Issues & Objectives

A major insurance company in Singapore used to manually examine each travel insurance claim to identify potentially fraudulent one
Suspicious claims were subject to a more detailed investigation
This involved considerable manual effort & inconsistent processes
The project objective was to develop a score to identify potentially fraudulent claims which would be subject to greater scrutiny

Benefits

Process automation, ensuring consistency, cost saving and increased accuracy
Scrutiny restricted to high scorers reducing manual work by a factor of 5 -10

Project information

Techniques

Risk Analytics

Client

Travel Insurance Provider

Industries

Financial Services

Location

Singapore

Challenges

Data included 77,445 claim records of which only 120 had been determined to be potentially fraudulent
So identified potentially fraudulent claims are rare events (0.15%) and therefore hard to detect
It was however expected that there could be a large number of undetected fraudulent claims

Solution

Gradient Boosting – a powerful machine learning algorithm was used for detecting potentially fraudulent cases
Substantial lift demonstrated. – on the client test data set it sufficed to examine 7.75% of all claims to identify 91.67% of all fraudulent claims

What we build, how it performs – Explore our work!

Would love to hear your thoughts!

Credit Scoring for Micro Finance Provider

Credit Scoring for Micro Finance Provider

Issues & Objectives

A Multi-finance company providing financing facilities for small and medium enterprises in Indonesia
The objective was to develop multiple portfolios for different loan types
Link the scorecard to core banking system to produce instant score at the time of application

Benefits

Process automation, ensured consistency, decreased manual work and increased accuracy
Generation of scorecards in real time with performance measures
Variable transformations are automatically accounted in scoring population
Deployment of the model and scorecard for scoring new applicants
Easy monitoring of score with the help of interactive reports

Project information

Techniques

Risk Analytics

Client

NBFC

Industries

Financial Services

Location

Indonesia

Challenges

On account of data security, the company developed the model in-house. No data was shared with Smart
Most of the data fields are in local language
Smart guided the whole process remotely with only one on-site visit and many hours of online consulting. This involved detailed analysis of numerous variables for each scorecard

Solution

The whole process is automated with Smart proprietary software, ACreS
4 portfolios were built one each for Retail SME, Corporate SME, Retail Vehicle Loan and Corporate Vehicle Loan
In absence of core banking system, webservice is created to input data and receive instant score
Newly entered data is stored in a database

What we build, how it performs – Explore our work!

Would love to hear your thoughts!

Scoring Return of COD Consignments

Scoring Return of COD Consignments

Issues & Objectives

To develop a scoring model to identify consignments likely to be returned, in case of Cash on Delivery (CoD) payments for one of the largest e-tailer distributors.

Benefits

*K-S 36 – 45	High separation for application scorecard
**Gini36 – 45	Average separation, definitely useful

Project information

Techniques

Risk Analytics

Client

E-COM Courier Company

Industries

Financial Services

Location

India

COD-Consignments

Challenges

Extremely large data – Over 2.5 million records
Data inconsistencies
Traffic variation at different time of the day

Solution

Univariate analysis to identify significant variables
Clustered clients based on number of orders from a specific vendor
Developed various models and recommended the most suitable one

What we build, how it performs – Explore our work!