Work Project

Table Structure Recognition (TSR)

Data extraction from images

Client: Intellect Design Arena Ltd.
Year: 2024

Transforming Document Analysis with Table Structure Recognition (TSR)

The Table Structure Recognition (TSR) system is designed to automatically detect and extract structured data from tables in document images. It leverages advanced computer vision and machine learning techniques to identify table boundaries, rows, columns, and individual cells, converting complex visual information into machine-readable formats.

By extracting table data directly from images, the system reduces manual data entry and makes the contents of scanned documents easier to search and reuse.

Key Features:
  • Advanced image preprocessing for table detection
  • U-Net architecture for bordered table structure recognition
  • Faster R-CNN with ResNet-101 for borderless table layout detection
  • Row span and column span detection for complex table structures
  • Pytesseract OCR for text extraction from table cells
  • Named Entity Recognition (NER) for column classification
  • Post-processing for accurate table structure reconstruction

The Table Structure Recognition project addresses the challenge of extracting structured data from tables in document images, covering both bordered and borderless tables. Because the two table types offer different visual cues, the system uses a separate model architecture for each.

For bordered tables, a U-Net architecture is implemented to segment the table structure, effectively detecting rows and columns based on visible borders. The U-Net model is trained on pixel-wise annotations of table structures, enabling precise identification of cell boundaries.
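To illustrate how a segmentation output can become cell boundaries, here is a minimal sketch. It assumes the U-Net output has already been thresholded into a binary mask of table lines (1 = line pixel), represented as a plain list of lists. The function names, the `min_fraction` threshold, and the projection-profile approach are illustrative assumptions, not the project's actual post-processing.

```python
def find_line_positions(mask, axis, min_fraction=0.8):
    """Return centre indices of runs of rows (axis=0) or columns (axis=1)
    in which at least min_fraction of the pixels are line pixels."""
    height, width = len(mask), len(mask[0])
    if axis == 0:
        profile = [sum(row) / width for row in mask]
    else:
        profile = [sum(mask[y][x] for y in range(height)) / height
                   for x in range(width)]

    positions, run_start = [], None
    # The trailing 0.0 is a sentinel that closes a run ending at the border.
    for i, value in enumerate(profile + [0.0]):
        if value >= min_fraction and run_start is None:
            run_start = i
        elif value < min_fraction and run_start is not None:
            positions.append((run_start + i - 1) // 2)
            run_start = None
    return positions


def cells_from_lines(row_lines, col_lines):
    """Build (left, top, right, bottom) boxes between consecutive lines."""
    return [(left, top, right, bottom)
            for top, bottom in zip(row_lines, row_lines[1:])
            for left, right in zip(col_lines, col_lines[1:])]


# Hypothetical 7x9 mask of a 2x2 bordered table.
mask = [[1 if y in (0, 3, 6) or x in (0, 4, 8) else 0 for x in range(9)]
        for y in range(7)]
row_lines = find_line_positions(mask, axis=0)
col_lines = find_line_positions(mask, axis=1)
cells = cells_from_lines(row_lines, col_lines)
```

A projection profile like this only works for roughly axis-aligned tables without merged cells; skewed scans or spanning cells would need a more careful treatment.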

Borderless tables are harder because they lack visible cell demarcations. To handle them, a Faster R-CNN architecture with a ResNet-101 backbone is employed. This model is trained to detect cells, rows, and columns based on the relative positioning of text and whitespace. Row spans and column spans are detected separately and then merged in a post-processing step to produce an accurate final schema for the table.
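The merge step can be pictured with a small sketch. It assumes the detector has produced row bands (y ranges), column bands (x ranges), and cell boxes as `(x1, y1, x2, y2)` tuples; a cell's span is then the number of bands it substantially overlaps. The function names and the 50% overlap threshold are hypothetical choices for illustration, not the project's implementation.

```python
def overlapping_indices(start, end, bands, min_overlap=0.5):
    """Indices of bands that the interval [start, end] covers by at least
    min_overlap of the band's own length."""
    hits = []
    for i, (band_start, band_end) in enumerate(bands):
        overlap = min(end, band_end) - max(start, band_start)
        if overlap >= min_overlap * (band_end - band_start):
            hits.append(i)
    return hits


def assign_spans(cell_boxes, row_bands, col_bands):
    """Map each detected cell box to a grid position with row/column spans."""
    schema = []
    for x1, y1, x2, y2 in cell_boxes:
        rows = overlapping_indices(y1, y2, row_bands)
        cols = overlapping_indices(x1, x2, col_bands)
        if not rows or not cols:
            continue  # box does not line up with any detected row or column
        schema.append({
            "row": rows[0],
            "col": cols[0],
            "rowspan": rows[-1] - rows[0] + 1,
            "colspan": cols[-1] - cols[0] + 1,
        })
    return sorted(schema, key=lambda c: (c["row"], c["col"]))


# Hypothetical detections: a header cell spanning both columns, then two cells.
row_bands = [(0, 20), (20, 40)]
col_bands = [(0, 50), (50, 100)]
cell_boxes = [(0, 0, 100, 20), (0, 20, 50, 40), (50, 20, 100, 40)]
schema = assign_spans(cell_boxes, row_bands, col_bands)
```

Here the first box overlaps both column bands, so it is recorded as one cell with a column span of 2 rather than as two separate cells.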

The project incorporates advanced image preprocessing techniques using OpenCV, including grayscale conversion, thresholding, and morphological operations, to enhance table lines and remove noise. After structure detection, Pytesseract OCR is applied to extract text content from each identified cell. To add semantic understanding, a Named Entity Recognition (NER) model classifies the content of table columns, identifying specific data types such as names, dates, and amounts.
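As a rough illustration of column classification, the sketch below labels a column by majority vote over its cells. It deliberately swaps the trained NER model for a few regular-expression heuristics so that it stays self-contained; the patterns, label names, and function names are assumptions made for this example only.

```python
import re
from collections import Counter

# Simple stand-in patterns; a real NER model would replace these heuristics.
DATE_RE = re.compile(r"^\d{1,2}[/-]\d{1,2}[/-]\d{2,4}$|^\d{4}-\d{2}-\d{2}$")
AMOUNT_RE = re.compile(r"^[$€£₹]?\s?(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?$")
NAME_RE = re.compile(r"^[A-Z][a-z]+(?:\s[A-Z][a-z]+)+$")


def classify_cell(text):
    """Assign a coarse label to the text of a single cell."""
    text = text.strip()
    if DATE_RE.match(text):
        return "DATE"
    if AMOUNT_RE.match(text):
        return "AMOUNT"
    if NAME_RE.match(text):
        return "NAME"
    return "OTHER"


def classify_column(cells):
    """Label a column with the most common label among its non-empty cells."""
    labels = Counter(classify_cell(c) for c in cells if c.strip())
    if not labels:
        return "OTHER"
    return labels.most_common(1)[0][0]


date_label = classify_column(["12/03/2024", "2024-01-15", "n/a"])
amount_label = classify_column(["$1,200.50", "300", "45.00"])
name_label = classify_column(["Jane Doe", "John Smith"])
```

Voting per column rather than per cell makes the label tolerant of occasional OCR errors or placeholder values such as "n/a".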

The final output of the system is a fully structured representation of the table data, which can be exported in formats such as CSV or JSON, ready for integration into business workflows or further analysis. The pipeline is designed to handle complex table structures, including cells that span multiple rows or columns, even in the harder case of borderless tables.
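The export step can be sketched with Python's standard library alone. The example assumes the reconstructed table is a list of rows whose first row is the header; the function names and sample data are hypothetical.

```python
import csv
import io
import json


def table_to_csv(rows):
    """Serialise a list of rows to CSV text, quoting fields where needed."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


def table_to_json(rows):
    """Serialise rows to a JSON array of objects keyed by the header row."""
    header, *body = rows
    return json.dumps([dict(zip(header, row)) for row in body], indent=2)


# Hypothetical extracted table.
rows = [["Name", "Amount"], ["Jane Doe", "1,200.50"]]
csv_text = table_to_csv(rows)
json_text = table_to_json(rows)
```

Cells with row or column spans would need to be flattened first, for example by repeating the value in each covered position, since neither CSV nor a flat JSON record can express spans directly.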
