Data Extraction from Images
The Table Structure Recognition (TSR) system is designed to automatically detect and extract structured data from tables in document images. It leverages advanced computer vision and machine learning techniques to identify table boundaries, rows, columns, and individual cells, converting complex visual information into machine-readable formats.
By extracting structured data directly from tables in images, the TSR system reduces manual data entry and makes tabular content easier to search and reuse.
The Table Structure Recognition project handles both bordered and borderless tables in document images. It uses a separate architecture for each table type, so that detection and extraction stay accurate in both cases.
For bordered tables, a U-Net architecture is implemented to segment the table structure, effectively detecting rows and columns based on visible borders. The U-Net model is trained on pixel-wise annotations of table structures, enabling precise identification of cell boundaries.
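As an illustration of how a pixel-wise border mask can be turned into cell boundaries, here is a minimal sketch. It assumes the segmentation output has already been binarized into a grid of 0/1 values (1 for a predicted border pixel). The function names, the `min_fraction` threshold, and the tiny demo mask are hypothetical and not taken from the project.

```python
from typing import List, Tuple

Mask = List[List[int]]  # 1 = predicted border pixel, 0 = background


def line_positions(mask: Mask, axis: str, min_fraction: float = 0.8) -> List[int]:
    """Return the centre index of each run of rows (axis="rows") or
    columns (axis="cols") that consists mostly of border pixels."""
    if axis == "rows":
        profile = [sum(row) / len(row) for row in mask]
    else:
        height = len(mask)
        profile = [sum(row[c] for row in mask) / height for c in range(len(mask[0]))]

    positions: List[int] = []
    run: List[int] = []
    for index, fraction in enumerate(profile):
        if fraction >= min_fraction:
            run.append(index)          # still inside a thick border line
        elif run:
            positions.append(run[len(run) // 2])
            run = []
    if run:                            # line touching the last row/column
        positions.append(run[len(run) // 2])
    return positions


def cell_boxes(mask: Mask) -> List[Tuple[int, int, int, int]]:
    """Pair consecutive border lines into (x0, y0, x1, y1) cell boxes."""
    ys = line_positions(mask, "rows")
    xs = line_positions(mask, "cols")
    return [(x0, y0, x1, y1)
            for y0, y1 in zip(ys, ys[1:])
            for x0, x1 in zip(xs, xs[1:])]


# Hypothetical 7x9 mask: horizontal borders at y = 0, 3, 6; vertical at x = 0, 4, 8.
demo_mask = [[1 if (y in (0, 3, 6) or x in (0, 4, 8)) else 0 for x in range(9)]
             for y in range(7)]
demo_rows = line_positions(demo_mask, "rows")
demo_cells = cell_boxes(demo_mask)
```

A projection profile like this only works when the table is roughly axis-aligned; a skewed scan would need deskewing first.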
Borderless tables present a unique challenge because they have no visible cell demarcations. To tackle this, a Faster R-CNN architecture with a ResNet-101 backbone is employed. This model is trained to detect cells, rows, and columns based on the relative positioning of text and whitespace. A crucial aspect of this approach is that row spans and column spans are detected separately and then merged in a post-processing step to produce an accurate final schema for the table.
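The merge step could look roughly like the sketch below. It assumes the detector has already produced row bands and column bands as pixel intervals, plus bounding boxes for spanning cells. The data layout, the 50% overlap rule, and all names are assumptions made for illustration, not the project's actual post-processing.

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float]            # (start, end) of a row or column band
Box = Tuple[float, float, float, float]   # (x0, y0, x1, y1) of a detected span


def covered(bands: List[Interval], lo: float, hi: float,
            min_overlap: float = 0.5) -> List[int]:
    """Indices of bands that overlap [lo, hi] by at least min_overlap of their size."""
    hits = []
    for index, (start, end) in enumerate(bands):
        overlap = min(end, hi) - max(start, lo)
        if overlap >= min_overlap * (end - start):
            hits.append(index)
    return hits


def build_schema(rows: List[Interval], cols: List[Interval],
                 span_boxes: List[Box]) -> List[Dict[str, int]]:
    """Merge span detections into a grid of cells with rowspan/colspan."""
    owner: Dict[Tuple[int, int], Tuple[int, int]] = {}   # grid slot -> anchor cell
    spans: Dict[Tuple[int, int], Tuple[int, int]] = {}   # anchor -> (rowspan, colspan)
    for x0, y0, x1, y1 in span_boxes:
        rs, cs = covered(rows, y0, y1), covered(cols, x0, x1)
        if not rs or not cs or len(rs) * len(cs) == 1:
            continue                                     # not a real span
        if any((r, c) in owner for r in rs for c in cs):
            continue                                     # conflicts with an earlier span
        anchor = (rs[0], cs[0])
        spans[anchor] = (len(rs), len(cs))
        for r in rs:
            for c in cs:
                owner[(r, c)] = anchor

    schema = []
    for r in range(len(rows)):
        for c in range(len(cols)):
            if owner.get((r, c), (r, c)) != (r, c):
                continue                                 # slot absorbed by a span
            rowspan, colspan = spans.get((r, c), (1, 1))
            schema.append({"row": r, "col": c, "rowspan": rowspan, "colspan": colspan})
    return schema


# Hypothetical 3x2 table: a header spanning both columns and a cell spanning rows 1-2.
demo_rows = [(0, 10), (10, 20), (20, 30)]
demo_cols = [(0, 50), (50, 100)]
demo_schema = build_schema(demo_rows, demo_cols,
                           [(2, 1, 98, 9), (1, 11, 49, 29)])
```

Spans that conflict with an earlier one are simply dropped here; a real pipeline would more likely rank them by detection score first.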
Before structure detection, the project preprocesses each image with OpenCV, using grayscale conversion, thresholding, and morphological operations to enhance table lines and remove noise. After structure detection, Tesseract OCR (via the Pytesseract wrapper) is applied to extract the text content of each identified cell. To add semantic understanding, a Named Entity Recognition (NER) model classifies the content of table columns, identifying specific data types such as names, dates, and amounts.
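To make the column-typing idea concrete, the sketch below labels a column by majority vote over its cells. Note that it swaps in simple regular expressions as a stand-in for the NER model, which is not reproduced here. The patterns, labels, and sample data are all assumptions for illustration.

```python
import re
from collections import Counter
from typing import List

# Rule-based stand-in for the NER model: first matching pattern wins.
PATTERNS = [
    ("date", re.compile(r"^\d{4}-\d{2}-\d{2}$|^\d{1,2}/\d{1,2}/\d{2,4}$")),
    ("amount", re.compile(r"^\$?\d{1,3}(,\d{3})*(\.\d+)?$|^\$?\d+(\.\d+)?$")),
    ("name", re.compile(r"^[A-Z][a-z]+(\s[A-Z][a-z]+)+$")),
]


def classify_cell(text: str) -> str:
    """Label one OCR'd cell as date, amount, name, or other."""
    text = text.strip()
    for label, pattern in PATTERNS:
        if pattern.match(text):
            return label
    return "other"


def classify_column(cells: List[str]) -> str:
    """Label a column with the most common label among its non-empty cells."""
    votes = Counter(classify_cell(cell) for cell in cells if cell.strip())
    return votes.most_common(1)[0][0] if votes else "other"


# Hypothetical OCR output, one list per column (header row excluded).
demo_columns = [
    ["Jane Doe", "John Smith", "Ana Lima"],
    ["2024-01-15", "2024-02-01", "n/a"],
    ["$1,250.00", "99.5", "300"],
]
demo_types = [classify_column(column) for column in demo_columns]
```

Voting per column rather than per cell lets a few OCR errors or placeholder values such as "n/a" pass without changing the column's type.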
The final output of the system is a fully structured representation of the table data, which can be exported in formats such as CSV or JSON for integration into business workflows or further analysis. The combined pipeline handles complex table structures, including cells that span multiple rows or columns, even in the challenging case of borderless tables.
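A minimal export sketch, using only the Python standard library, is shown below. It assumes each recognized cell is a dictionary with `row`, `col`, `text`, and optional `rowspan`/`colspan` keys. That layout and the choice to repeat a spanning cell's text in every slot it covers are assumptions, not the project's documented output format.

```python
import csv
import io
import json
from typing import Dict, List


def to_grid(cells: List[Dict], n_rows: int, n_cols: int) -> List[List[str]]:
    """Expand cells into a dense grid, repeating text across spanned slots."""
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for cell in cells:
        for r in range(cell["row"], cell["row"] + cell.get("rowspan", 1)):
            for c in range(cell["col"], cell["col"] + cell.get("colspan", 1)):
                grid[r][c] = cell["text"]
    return grid


def to_csv(grid: List[List[str]]) -> str:
    """Serialize the grid as CSV text (fields with commas are quoted)."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(grid)
    return buffer.getvalue()


def to_json(grid: List[List[str]]) -> str:
    """Serialize the grid as a JSON list of records keyed by the header row."""
    header, *body = grid
    return json.dumps([dict(zip(header, row)) for row in body], indent=2)


# Hypothetical recognized table: "North" spans two rows in the first column.
demo_cells = [
    {"row": 0, "col": 0, "text": "Region"},
    {"row": 0, "col": 1, "text": "Quarter"},
    {"row": 0, "col": 2, "text": "Revenue"},
    {"row": 1, "col": 0, "rowspan": 2, "text": "North"},
    {"row": 1, "col": 1, "text": "Q1"},
    {"row": 1, "col": 2, "text": "1,200"},
    {"row": 2, "col": 1, "text": "Q2"},
    {"row": 2, "col": 2, "text": "1,350"},
]
demo_grid = to_grid(demo_cells, n_rows=3, n_cols=3)
demo_csv = to_csv(demo_grid)
demo_json = to_json(demo_grid)
```

Repeating the spanned value keeps every CSV row self-describing; a consumer that needs the original span structure would be better served by a JSON export that keeps the `rowspan` and `colspan` fields.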