Accelerating Underwriting with Computer Vision
Author: Chaimae Sriti
Abstract
This paper presents a computer vision pipeline for extracting and using handwritten data from life insurance application forms, a previously untapped source of predictive information for underwriting models. The solution combines classical image processing, template-based region extraction, and convolutional neural networks (CNNs) to automate data ingestion, improve model accuracy, and sharply reduce manual review time in underwriting workflows.
1. Introduction
Life insurance underwriting traditionally relies on structured data, overlooking high-value, unstructured inputs such as handwritten application forms. These forms contain critical risk-related variables like tobacco usage and medical disclosures. Processing these at scale presents a compelling opportunity for improving pricing models and underwriting efficiency.
2. Data Context
The source data includes thousands of scanned, hand-filled insurance applications. These documents are inconsistent in layout and quality but share underlying templates. Extracting information from such sources requires spatial awareness and robustness to noise, distortion, and handwriting variation.

Fig: Example of a handwritten insurance application form showing various fields and handwritten responses
3. Methodology
3.1 Template Matching (Step 1)
To handle form variability, a high-resolution template of each form is maintained. Using OpenCV's template matching, the scanned image is aligned to its reference template. Preprocessing includes color normalization and scaling to ensure consistent matching quality.

Fig: Template matching process showing alignment and normalization
3.2 Region of Interest (ROI) Extraction (Step 2)
Each form template is associated with a configuration file mapping semantic fields (e.g., "Tobacco Use") to pixel coordinates. Once aligned, ROIs are automatically extracted for downstream recognition.

Fig: Region of Interest extraction showing field mapping and coordinate system
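Once a form is aligned, extraction reduces to array slicing against the template's coordinate map. A minimal sketch, in which the field names and box coordinates are invented for illustration; in practice the mapping would be loaded from the template's configuration file:

```python
# Hypothetical per-template configuration mapping semantic field names to
# (x, y, width, height) boxes in template pixel coordinates.
ROI_CONFIG = {
    "tobacco_use": (120, 340, 60, 30),
    "date_of_birth": (400, 210, 180, 30),
}

def extract_rois(aligned_form, config):
    """Crop every configured field from an aligned form image."""
    return {
        field: aligned_form[y:y + h, x:x + w]
        for field, (x, y, w, h) in config.items()
    }
```

Because the scan has already been registered to the template, the same configuration file works for every instance of that form version.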
3.3 Labeling and Dataset Construction
Thousands of ROI images are collected automatically and manually labeled into class folders. This process constructs a labeled dataset for supervised learning without manual cropping.

Fig: Dataset organization and labeling workflow
4. Model Pipeline
4.1 Image Preprocessing
Images undergo transformations (resizing, normalization, format conversion) via OpenCV and Keras pipelines to make them CNN-compatible.
4.2 CNN Architecture and Training (Step 3)
A convolutional neural network is trained to classify binary answers (e.g., Yes/No checkboxes) or to recognize handwritten content in OCR regions (e.g., digits or short text). Training/validation splits are used to benchmark accuracy and F1 score across model configurations.
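For the binary-checkbox case, a small Keras CNN of the following shape would suffice. This architecture is illustrative, since the paper does not publish its exact model configuration:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_checkbox_cnn(input_shape=(64, 64, 1)):
    """Small binary classifier for Yes/No checkbox ROIs
    (illustrative; the production architecture is not published)."""
    model = keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(16, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.5),                      # guard against overfitting
        layers.Dense(1, activation="sigmoid"),    # P(checked)
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```

Training then follows the standard Keras `fit` loop with a held-out validation split, matching the benchmarking described above.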
4.3 Inference and Integration
Once trained, the model predicts classes for new forms in real time, feeding results into downstream risk models or actuarial pipelines.
5. Results & Business Impact
Prediction accuracy: The CNN achieves high accuracy across most fields (numeric performance redacted).
Scalability: Ingestion of handwritten forms is fully automated, reducing weeks of manual review to minutes.
Return on investment: Pricing and underwriting models gain access to previously ignored, high-signal features.
6. Conclusion
This work introduces a robust, scalable pipeline for leveraging handwritten insurance applications through a combination of traditional vision techniques and deep learning. By operationalizing historically ignored data, we unlock new avenues for model sophistication, underwriting precision, and operational speed.