OMR PDF to MySQL with PHP

Paul · December 3, 2024, 8:41am

Does anyone know a simple way to extract a pdf with OMR sheet that contain Roll Number, Registration Number and Answer of a students, etch page contain 1 single student. I have well over 200 pages Now is there any way to insert this pdf to my SQL database with PHP. If impossible then how can I do it with python? [I’m very begainer at python]

… I try to make some input feild (according to questions) then mennually input data to then then submit this to database, and It’s so weird…

Daniel · December 19, 2024, 8:31am

Extracting data from a PDF with OMR sheets and inserting it into an SQL database can be challenging, but it is possible with the right approach. I’ll guide you through how you can achieve this both using PHP and Python.

Python provides several libraries that can help automate the extraction of data from PDFs and then insert that data into a SQL database. Here’s a step-by-step breakdown:

Step 1: Install Required Libraries

PyPDF2: A Python library that helps extract text from PDF files.
pytesseract: A Python wrapper for Tesseract OCR (Optical Character Recognition), useful if the OMR sheet is not directly readable as text but rather as an image.
pandas: Useful for handling data and organizing it into a structured format.
mysql-connector-python: Allows you to connect to a MySQL database.

To install these, use pip:

pip install PyPDF2 pytesseract pandas mysql-connector-python

Additionally, you’ll need to install Tesseract OCR on your system. Follow the instructions from here.

Step 2: Extract Data from PDF (if it contains images)

If your PDF contains scanned OMR sheets (images), you’ll need to use OCR to extract the data. Here’s a basic approach:

import pytesseract
from PIL import Image
import PyPDF2

# Path to your PDF
pdf_file_path = "your_pdf.pdf"

# Open PDF
with open(pdf_file_path, 'rb') as file:
    pdf_reader = PyPDF2.PdfReader(file)
    total_pages = len(pdf_reader.pages)

    for page_number in range(total_pages):
        page = pdf_reader.pages[page_number]
        # Convert the page to an image (this can be done using pdf2image or similar libraries)
        # For simplicity, assume you have the page image ready.
        image = page_to_image(page)  # Use appropriate method to convert PDF to image
        text = pytesseract.image_to_string(image)

        # Parse the text (extract Roll Number, Registration, and Answer)
        roll_number = extract_roll_number(text)
        registration_number = extract_registration_number(text)
        answers = extract_answers(text)

        # Insert into your database (next step)

You might need to convert the PDF pages into images using a tool like pdf2image or Poppler. Here’s how you could do that:

pip install pdf2image

And in Python:

from pdf2image import convert_from_path

pages = convert_from_path(pdf_file_path, 300)  # Convert at 300 DPI
image = pages[0]  # Extract first page as image

Step 3: Extract Roll Number, Registration, and Answers

You’ll need custom functions to extract the relevant fields like Roll Number, Registration Number, and Answers from the OCR text. Use regular expressions (re module) to capture this information.

import re

def extract_roll_number(text):
    roll_number_match = re.search(r"Roll Number: (\d+)", text)
    if roll_number_match:
        return roll_number_match.group(1)
    return None

def extract_registration_number(text):
    registration_number_match = re.search(r"Registration Number: (\d+)", text)
    if registration_number_match:
        return registration_number_match.group(1)
    return None

def extract_answers(text):
    # Adjust according to your OMR sheet format
    answer_match = re.findall(r"Answer: (\w+)", text)
    return answer_match

Step 4: Insert Data into MySQL

Finally, connect to your SQL database and insert the data:

import mysql.connector

# Database connection
conn = mysql.connector.connect(
    host="localhost",
    user="your_db_user",
    password="your_db_password",
    database="your_db_name"
)
cursor = conn.cursor()

# Insert data into the database
def insert_data(roll_number, registration_number, answers):
    query = "INSERT INTO students (roll_number, registration_number, answers) VALUES (%s, %s, %s)"
    cursor.execute(query, (roll_number, registration_number, ",".join(answers)))
    conn.commit()

# Example usage
insert_data(roll_number, registration_number, answers)

# Close connection
cursor.close()
conn.close()

Step 5: Put Everything Together

You can loop through all the pages, extract the data, and insert it into the database.

Method 2: Using PHP (if you want to use PHP)

While PHP can handle PDF parsing and database insertion, the process is typically less straightforward than in Python, especially when it comes to working with OCR for image-based PDFs.

PHP Libraries: Use fpdf or TCPDF for reading PDF files, and Tesseract for OCR (you can call Tesseract from PHP using shell_exec).
PHP MySQL: Use MySQLi or PDO to connect to your database and insert data.

Here’s an outline of how you might use PHP for inserting data into a MySQL database:

Extract text from the PDF using a library like TCPDF or FPDF.
Use Tesseract OCR (via command line) to read the images if necessary.
Parse the extracted text for roll number, registration number, and answers.
Insert the parsed data into MySQL using PHP.

For example, calling Tesseract from PHP:

$output = shell_exec('tesseract path_to_image.png -');  // Use the correct image path
echo $output;

Once you have the text, you can extract the necessary fields using regular expressions and insert the data into your MySQL database.