PageWise classifies documents using a combination of data science techniques: optical character recognition, machine learning, and natural language processing. Our development journey consists of 7 phases, running from data collection and processing through text extraction and modeling. To fit our target customers’ needs, PageWise is built as a desktop application compatible with Windows, Mac, and Linux.
We use public data sources from Kaggle and Philip Morris to train and test our machine learning models. The source documents come in a variety of types, all commonly used in business environments. We transformed the data into tabular format and added file creation timestamps and the actual document types for training and testing purposes.
The dataset has an imbalanced class distribution, which could lead to poor predictive performance. As part of model optimization, two methods are applied to balance the dataset:
1. Document classes with fewer than 15,000 data points are dropped from the training dataset; and
2. SMOTE (Synthetic Minority Over-sampling Technique) is applied to create synthetic data points for each minority class, raising the number of data points in under-represented classes.
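The two balancing steps above can be sketched as follows. This is a simplified, dependency-free illustration of the idea behind SMOTE (placing a synthetic point between a minority sample and one of its nearest same-class neighbours), not the implementation PageWise uses. The function names, the `k` default, and the toy data and threshold are assumptions made for the example.

```python
import math
import random
from collections import Counter


def drop_small_classes(X, y, min_count):
    """Keep only samples whose class has at least min_count examples."""
    counts = Counter(y)
    kept = [(x, label) for x, label in zip(X, y) if counts[label] >= min_count]
    return [x for x, _ in kept], [label for _, label in kept]


def smote_like_oversample(X, y, k=2, seed=0):
    """Oversample every class up to the size of the largest class.

    Each synthetic point lies on the segment between a real sample and one
    of its k nearest neighbours from the same class (the core SMOTE idea).
    """
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        members = [x for x, lab in zip(X, y) if lab == label]
        if len(members) < 2:
            continue  # no neighbour to interpolate with
        for _ in range(target - n):
            base = rng.choice(members)
            neighbours = sorted(
                (m for m in members if m is not base),
                key=lambda m: math.dist(base, m),
            )[:k]
            other = rng.choice(neighbours)
            gap = rng.random()  # position along the segment, in [0, 1)
            X_out.append([b + gap * (o - b) for b, o in zip(base, other)])
            y_out.append(label)
    return X_out, y_out


# Toy data: three classes of very different sizes (labels are hypothetical).
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0], [6.0, 5.0], [9.0, 9.0]]
y = ["memo", "memo", "memo", "memo", "email", "email", "form"]

# The source uses a threshold of 15000; 2 is used here only to fit the toy data.
X_kept, y_kept = drop_small_classes(X, y, min_count=2)
X_bal, y_bal = smote_like_oversample(X_kept, y_kept)
print(Counter(y_bal))
```

In practice a maintained library implementation is a safer choice than a hand-rolled one. Oversampling is normally applied to the training portion only, so that synthetic points do not leak into the evaluation data.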
[Figure: the three engineered features, including the length of the word vector excluding numbers and special characters. In the original charts they are shown as blue bars (left y-axis) and a red line (right y-axis) on the right, and in a separate chart on the left.]
The dataset is balanced to normalize the sample size by document type
Stop words and non-ASCII characters are removed
Numeric text is removed after feature engineering
CountVectorizer() is used to tokenize the training corpus
The training corpus is transformed into matrix X (illustrated on the right)
The three engineered features are merged into the matrix X
Document categories are encoded to form the vector Y
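The preprocessing steps above can be illustrated with a small, dependency-free sketch. It is a stand-in for the kind of count matrix that `CountVectorizer()` produces, not the PageWise pipeline itself. The stop-word list, the function names, and the sample documents are assumptions, and only one engineered feature (the word-vector length) is shown because it is the only one described in the text.

```python
import re

# Tiny illustrative stop-word list; the list PageWise uses is not given in the source.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "is", "in", "for"}


def clean(text):
    """Lowercase, then drop non-ASCII characters, tokens containing digits, and stop words."""
    ascii_text = text.encode("ascii", errors="ignore").decode("ascii").lower()
    tokens = re.findall(r"[a-z0-9]+", ascii_text)
    return [t for t in tokens if not any(c.isdigit() for c in t) and t not in STOP_WORDS]


def build_matrix(docs):
    """Return (vocabulary, X), where X[i][j] counts vocabulary[j] in docs[i].

    The last column of each row is an engineered feature: the number of
    word tokens left after cleaning.
    """
    tokenised = [clean(d) for d in docs]
    vocabulary = sorted({t for toks in tokenised for t in toks})
    index = {term: j for j, term in enumerate(vocabulary)}
    X = []
    for toks in tokenised:
        row = [0] * len(vocabulary)
        for t in toks:
            row[index[t]] += 1
        row.append(len(toks))  # engineered feature merged into X
        X.append(row)
    return vocabulary, X


def encode_labels(labels):
    """Map each document category to an integer, in sorted order."""
    classes = sorted(set(labels))
    lookup = {c: i for i, c in enumerate(classes)}
    return classes, [lookup[label] for label in labels]


# Hypothetical sample documents and categories.
docs = [
    "Invoice 2022: total due is €450",
    "• The resume of an engineer",
    "Invoice for the engineer",
]
labels = ["invoice", "resume", "invoice"]

vocabulary, X = build_matrix(docs)
classes, Y = encode_labels(labels)
print(vocabulary)
print(X)
print(classes, Y)
```

With a real corpus the matrix is mostly zeros, so a sparse representation such as the one scikit-learn returns is far more memory-efficient than the nested lists used here.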
We selected Naive Bayes, logistic regression, decision tree, and random forest as the baseline models.
We randomly selected 80% of X and Y as training data and held out the remaining 20% as test data to evaluate model performance.
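A random 80/20 split can be sketched in a few lines. The function name, the fixed seed, and the toy data below are assumptions for illustration; the source does not say which tool performed the split, and scikit-learn's `train_test_split` with `test_size=0.2` is a common way to do the same job.

```python
import random


def split_80_20(X, Y, train_fraction=0.8, seed=42):
    """Shuffle row indices once, then cut them into train and test parts."""
    if len(X) != len(Y):
        raise ValueError("X and Y must have the same number of rows")
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    cut = int(len(indices) * train_fraction)
    train_idx, test_idx = indices[:cut], indices[cut:]

    def pick(data, idx):
        return [data[i] for i in idx]

    return pick(X, train_idx), pick(X, test_idx), pick(Y, train_idx), pick(Y, test_idx)


# Toy data: 10 rows, with each label derived from its row so pairing can be checked.
X = [[i, i * i] for i in range(10)]
Y = [i % 2 for i in range(10)]
X_train, X_test, Y_train, Y_test = split_80_20(X, Y)
print(len(X_train), len(X_test))
```

Shuffling the indices rather than the rows keeps each feature row paired with its label. Fixing the seed makes the split reproducible across runs.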
The baseline models are trained with the training data. We then selected the top performing models and fine-tuned hyperparameters to improve prediction accuracy.
We also explored and evaluated more advanced classification models, as well as the deep learning model BERT (Bidirectional Encoder Representations from Transformers).
We selected logistic regression not only for its top prediction performance (shown on the left) but also for its computational efficiency in both time and space. We believe it would deliver the best user experience among all the models we experimented with.
The confusion matrix on the right breaks down model performance by document type. The model performs well in identifying most types, such as resumes, scientific publications, and functional specifications. News articles are misclassified more often than other document types. This is within our expectations, since some news articles have content similar to that of scientific papers and advertisements.
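For readers unfamiliar with how a confusion matrix exposes per-type errors, here is a minimal sketch. The labels and predictions are made-up toy values chosen to mirror the pattern described above; they are not PageWise results, and the function names are assumptions.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    pairs = Counter(zip(y_true, y_pred))
    return [[pairs[(t, p)] for p in labels] for t in labels]


def per_class_recall(matrix, labels):
    """Fraction of each true class that was predicted correctly."""
    return {
        label: (row[i] / sum(row) if sum(row) else 0.0)
        for i, (label, row) in enumerate(zip(labels, matrix))
    }


# Toy example: one news article is mistaken for a scientific publication.
labels = ["news", "resume", "scientific"]
y_true = ["news", "news", "news", "resume", "resume", "scientific", "scientific"]
y_pred = ["news", "scientific", "news", "resume", "resume", "scientific", "scientific"]

matrix = confusion_matrix(y_true, y_pred, labels)
recall = per_class_recall(matrix, labels)
for label, row in zip(labels, matrix):
    print(label, row, round(recall[label], 2))
```

Off-diagonal cells show which pairs of document types are confused with each other, which is the kind of detail an overall accuracy figure hides.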
Basarkar, A. (2017). Document classification using machine learning. https://doi.org/10.31979/etd.6jmu-9xdt
Guha, A., & Samanta, D. (2020). Real-time application of document classification based on machine learning. Learning and Analytics in Intelligent Systems, 366–379. https://doi.org/10.1007/978-3-030-38501-9_37
Vijay Kumar, G., Yadav, A., Vishnupriya, B., Naga Lahari, M., Smriti, J., & Samved Reddy, D. (2021). Text summarizing using NLP. Recent Trends in Intensive Computing. https://doi.org/10.3233/apc210179
Python GUIs for humans. PySimpleGUI. (n.d.). Retrieved August 4, 2022, from https://www.pysimplegui.org/en/latest/
Wagh, V., Khandve, S., Joshi, I., Wani, A., Kale, G., & Joshi, R. (2021, November 1). Comparative study of long document classification. arXiv.org. Retrieved August 4, 2022, from https://arxiv.org/abs/2111.00702
The RVL-CDIP dataset. RVL-CDIP Dataset. (n.d.). Retrieved August 4, 2022, from https://www.cs.cmu.edu/~aharley/rvl-cdip/
Gartner_Inc. (n.d.). Competitive landscape: Intelligent document processing platform providers. Gartner. Retrieved August 4, 2022, from https://www.gartner.com/en/documents/4008008