Our Approach

PageWise leverages a variety of data science techniques, including optical character recognition, machine learning, and natural language processing, to classify documents. Our development journey consists of seven phases, from data collection and processing to text extraction and modeling. To tailor the product to our target customers' needs, PageWise is developed as a desktop application compatible with Windows, macOS, and Linux.

Data Exploration

Data Sources

We use public data sources from Kaggle and Philip Morris to train and test our machine learning models. The source documents come in a variety of types, all of which are commonly used in business environments. We transformed the data into tabular format and added file creation timestamps and the actual document types for training and testing purposes.
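The sketch below illustrates one way this tabular transformation could look in Python; the directory layout, column names, and use of pandas are assumptions for illustration rather than our exact pipeline.

    # Hypothetical assembly of raw documents into a tabular training set.
    # The folder structure and column names are illustrative assumptions.
    import os
    import pandas as pd

    def build_dataset(doc_dir: str, doc_type: str) -> pd.DataFrame:
        rows = []
        for name in os.listdir(doc_dir):
            path = os.path.join(doc_dir, name)
            with open(path, encoding="utf-8", errors="ignore") as f:
                text = f.read()
            rows.append({
                "text": text,                          # extracted document text
                "created_at": os.path.getctime(path),  # file creation timestamp
                "doc_type": doc_type,                  # actual document type (label)
            })
        return pd.DataFrame(rows)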

Sampling Distribution

The dataset has an imbalanced class distribution, which could lead to poor model predictability. As part of model optimization, two methods are applied to balance the dataset:

1. Document classes with fewer than 15,000 data points are dropped from the training dataset; and
2. SMOTE (Synthetic Minority Over-sampling Technique) is applied to create synthetic data points for the minority classes, bringing up the number of data points in under-sampled classes (see the sketch below).
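As a hedged illustration, the snippet below sketches both balancing steps; the imbalanced-learn library, variable names, and data layout are assumptions, while the 15,000-point cutoff comes from the list above. Here X and y stand for the numeric feature matrix and encoded labels produced later in the pipeline.

    # Sketch of the two balancing steps; imbalanced-learn is assumed as the SMOTE implementation.
    import pandas as pd
    from imblearn.over_sampling import SMOTE

    def balance(df: pd.DataFrame, X, y, min_count: int = 15000):
        # 1. Drop document classes with fewer than 15,000 data points.
        counts = df["doc_type"].value_counts()
        keep = counts[counts >= min_count].index
        mask = df["doc_type"].isin(keep).to_numpy()
        # 2. Oversample the remaining minority classes with synthetic points.
        return SMOTE(random_state=42).fit_resample(X[mask], y[mask])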

Feature Engineering

Feature #1: Number of words
The length of the word vector, excluding numbers and special characters (illustrated as the blue bars, left y-axis, in the chart on the right).

Feature #2: Numeric text ratio
(Illustrated as the red line, right y-axis, in the chart on the right.)

Feature #3: Word uniqueness ratio
(Illustrated in the chart on the left.)
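Since the exact formulas behind these features are not fully spelled out here, the following sketch shows one plausible way to compute them per document; the regular expressions and token rules are assumptions.

    # Illustrative computation of the three engineered features for one document.
    import re

    def engineer_features(text: str) -> dict:
        tokens = text.split()
        # Feature #1: words only, excluding numbers and special characters.
        words = [t for t in tokens if re.fullmatch(r"[A-Za-z]+", t)]
        # Feature #2: share of tokens that are purely numeric (assumed definition).
        numeric = [t for t in tokens if re.fullmatch(r"\d+(\.\d+)?", t)]
        # Feature #3: share of words that are unique (assumed definition).
        return {
            "num_words": len(words),
            "numeric_ratio": len(numeric) / len(tokens) if tokens else 0.0,
            "uniqueness_ratio": len(set(words)) / len(words) if words else 0.0,
        }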

Model Specifications

Data Cleaning

• The dataset is balanced to normalize the sample size by document type
• Stop words and non-ASCII characters are removed
• Numeric text is removed after feature engineering (see the cleaning sketch below)
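A minimal cleaning sketch appears below; it assumes NLTK's English stop-word list and simple regular expressions, neither of which is confirmed by our write-up.

    # Assumed cleaning step: NLTK stop words, ASCII filtering, numeric removal.
    import re
    from nltk.corpus import stopwords  # requires nltk.download("stopwords")

    STOP_WORDS = set(stopwords.words("english"))

    def clean(text: str) -> str:
        text = text.encode("ascii", errors="ignore").decode()  # drop non-ASCII characters
        text = re.sub(r"\d+", " ", text)                       # drop numeric text (after feature engineering)
        tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
        return " ".join(tokens)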

Tokenization & Vectorization

• CountVectorizer() is leveraged to tokenize the training corpus
• The training corpus is transformed into the matrix X (illustrated on the right)
• The three engineered features are merged into the matrix X
• Document categories are encoded and form the vector Y (see the sketch below)
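A hedged sketch of this step with scikit-learn is shown below; the DataFrame column names carry over from the earlier sketches and are assumptions rather than our exact schema.

    # Sketch: tokenize with CountVectorizer, merge engineered features, encode labels.
    from scipy.sparse import hstack, csr_matrix
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.preprocessing import LabelEncoder

    vectorizer = CountVectorizer()
    X_counts = vectorizer.fit_transform(df["text"])   # token count matrix from the training corpus

    extra = csr_matrix(df[["num_words", "numeric_ratio", "uniqueness_ratio"]].to_numpy())
    X = hstack([X_counts, extra])                     # engineered features merged into X

    y = LabelEncoder().fit_transform(df["doc_type"])  # encoded document categories form Y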

Prediction Model

• We selected Naive Bayes, logistic regression, decision tree, and random forest as the baseline models.
• We randomly selected 80% of X and Y as training data and the remaining 20% as test data in order to evaluate model performance.
• The baseline models are trained with the training data. We then selected the top-performing models and fine-tuned their hyperparameters to improve prediction accuracy.
• We explored and evaluated more advanced classification models and a deep learning model, BERT (Bidirectional Encoder Representations from Transformers); see the training sketch below.
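The snippet below sketches the baseline training and evaluation loop with scikit-learn, continuing from the X and Y built above; the specific hyperparameter grid is illustrative, not the one we actually searched.

    # Baseline models, 80/20 split, and an illustrative hyperparameter search.
    from sklearn.model_selection import train_test_split, GridSearchCV
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    baselines = {
        "naive_bayes": MultinomialNB(),
        "logistic_regression": LogisticRegression(max_iter=1000),
        "decision_tree": DecisionTreeClassifier(),
        "random_forest": RandomForestClassifier(),
    }
    for name, model in baselines.items():
        model.fit(X_train, y_train)
        print(name, model.score(X_test, y_test))

    # Fine-tune a top performer; this parameter grid is an assumption.
    grid = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.01, 0.1, 1, 10]}, cv=5)
    grid.fit(X_train, y_train)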

Classification Model Evaluation

Logistic Regression

Pros
  • High training efficiency
  • Less prone to overfitting
Cons
  • Unable to solve non-linear problems

Random Forest

Pros
  • Amplify predictive capabilities through multiple decision trees
  • Ability to handle categorical and numerical features
Cons
  • Takes a long time to train the model

Naive Bayes

Pros
  • High training efficiency
Cons
  • Limited predictive power with continuous numerical features

BERT

Pros
  • Pre-trained on a large corpus in more than 100 languages 
Cons
  • High computational expense
  • High memory requirements

Model Performance

The Best Performing Model

We selected logistic regression not only because of its top prediction performance (shown on the left) but also because of its high computational efficiency in terms of both time and space complexity. We believe that this model will deliver the best user experience among all the models we experimented with.


A Closer Look at Performance

The confusion matrix on the right dissects the model's performance by document type. The model performs well in identifying most of the types, such as resumes, scientific publications, and functional specifications. News articles have more misclassified data points than other document types. This is within our expectations, since some news articles in fact have content similar to scientific papers and advertisements.
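For readers who want to reproduce this view, the short sketch below builds a confusion matrix with scikit-learn from the tuned model and test split introduced earlier; it is illustrative rather than the exact plotting code behind our figure.

    # Confusion matrix for the tuned model on the held-out test set.
    import matplotlib.pyplot as plt
    from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

    y_pred = grid.best_estimator_.predict(X_test)
    ConfusionMatrixDisplay(confusion_matrix(y_test, y_pred)).plot()
    plt.show()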

GitHub Repo and References

GitHub Repository

Basarkar, A. (2017). Document classification using machine learning. https://doi.org/10.31979/etd.6jmu-9xdt

Guha, A., & Samanta, D. (2020). Real-time application of document classification based on machine learning. Learning and Analytics in Intelligent Systems, 366–379. https://doi.org/10.1007/978-3-030-38501-9_37

Vijay Kumar, G., Yadav, A., Vishnupriya, B., Naga Lahari, M., Smriti, J., & Samved Reddy, D. (2021). Text summarizing using NLP. Recent Trends in Intensive Computing. https://doi.org/10.3233/apc210179  

Python GUIs for Humans. PySimpleGUI. (n.d.). Retrieved August 4, 2022, from https://www.pysimplegui.org/en/latest/

Wagh, V., Khandve, S., Joshi, I., Wani, A., Kale, G., & Joshi, R. (2021, November 1). Comparative study of long document classification. arXiv.org. Retrieved August 4, 2022, from https://arxiv.org/abs/2111.00702  

The RVL-CDIP dataset. RVL-CDIP Dataset. (n.d.). Retrieved August 4, 2022, from https://www.cs.cmu.edu/~aharley/rvl-cdip/  

Gartner, Inc. (n.d.). Competitive landscape: Intelligent document processing platform providers. Gartner. Retrieved August 4, 2022, from https://www.gartner.com/en/documents/4008008