Web Scraper Pipeline · Vinay Kumar

01 — Overview

What This Project Does

This project implements a simple but complete end-to-end backend data pipeline using Python. The system collects structured data from a public website, stores it in a relational database, and exposes it through a REST API with pagination and search filtering.

The goal was to build a clean, modular backend workflow demonstrating key backend engineering concepts: web scraping, data persistence, API development, pagination, and query filtering — all in a single cohesive project.

Layer 1

Data Ingestion

requests, BeautifulSoup, scraper.py: Download, parse, and extract book data from HTML

Layer 2

Data Persistence

SQLite, SQLAlchemy, db.py: Store, manage, and query structured records

Layer 3

API Delivery

FastAPI, Uvicorn, app.py: Serve data via paginated REST endpoints

02 — Architecture

How It Works

1 Scraper — scraper.py

The scraper sends an HTTP GET request to books.toscrape.com using Requests. The HTML response is parsed with BeautifulSoup, which traverses the DOM to extract each book's title and price. Extracted records are then passed to the database layer for storage.
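The fetch-and-parse flow described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual scraper.py: the function names parse_books and fetch_books are hypothetical, and the CSS selectors (article.product_pod, h3 a, p.price_color) are assumptions about the site's markup that should be checked against the live page.

```python
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://books.toscrape.com/"


def parse_books(html: str) -> list[dict]:
    """Extract title and price from each book <article> in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    books = []
    for article in soup.select("article.product_pod"):  # assumed selector
        link = article.select_one("h3 a")
        price_tag = article.select_one("p.price_color")
        if link is None or price_tag is None:
            continue  # skip malformed entries rather than failing the run
        # Assumption: the full title sits in the link's "title" attribute.
        title = link.get("title") or link.get_text(strip=True)
        # Keep only digits and the decimal point, e.g. "£51.77" -> "51.77".
        digits = "".join(
            ch for ch in price_tag.get_text() if ch.isdigit() or ch == "."
        )
        if not digits:
            continue
        books.append({"title": title, "price": float(digits)})
    return books


def fetch_books(url: str = BASE_URL) -> list[dict]:
    """Download one page and return its parsed book records."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # surface HTTP errors instead of parsing an error page
    return parse_books(response.text)
```

Keeping the parsing in its own function means it can be exercised against a saved HTML snippet without touching the network.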

2 Database — db.py

A lightweight SQLite database stores the scraped records. SQLite was chosen for its zero-configuration setup, making it ideal for a self-contained backend prototype. The db.py module handles connections, table creation, and insert operations.

Database schema: books table

Column   Type      Key
id       INTEGER   primary key
title    TEXT
price    REAL
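A minimal sketch of what a module like db.py might contain for this schema, using Python's built-in sqlite3 module for brevity. The function names are hypothetical, and since the project lists SQLAlchemy in its stack, the real module may be structured differently.

```python
import sqlite3


def get_connection(path: str = "books.db") -> sqlite3.Connection:
    """Open the SQLite file (created on first use) and return a connection."""
    conn = sqlite3.connect(path)
    conn.row_factory = sqlite3.Row  # rows can be read by column name
    return conn


def create_table(conn: sqlite3.Connection) -> None:
    """Create the books table if it does not exist yet."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS books (
            id    INTEGER PRIMARY KEY,
            title TEXT,
            price REAL
        )
        """
    )
    conn.commit()


def insert_books(conn: sqlite3.Connection, books: list[dict]) -> int:
    """Insert scraped records; returns the number of rows inserted."""
    rows = [(book["title"], book["price"]) for book in books]
    # Placeholders keep scraped text from being interpreted as SQL.
    conn.executemany("INSERT INTO books (title, price) VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)
```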

3 API — app.py

A FastAPI application exposes two endpoints: a paginated books list and a title-search endpoint. FastAPI auto-generates interactive docs at /docs and uses Python's type hints for built-in validation.

GET /books

Returns a paginated list of all books.

Example query: ?page=1&limit=10

GET /books/search

Matches book titles with a SQL LIKE pattern.

Example query: ?title=travel

Example response (JSON): GET /books/search?title=travel

Structured JSON response with metadata:
{
  "searched_title": "travel",
  "returned": 1,
  "data": [
    {
      "id":    5,
      "title": "It's Only the Himalayas",
      "price": 45.17
    }
  ]
}

03 — Structure

Project Structure

The project is split into three core modules — each with a single clear responsibility — plus the database file and config.

backend/
│
├── scraper.py         ← web scraping (requests + BS4)
├── db.py              ← database layer (connect, create, insert)
├── app.py             ← FastAPI app + REST endpoints
├── books.db           ← SQLite database (auto-created)
├── requirements.txt   ← Python dependencies
└── README.md          ← project documentation

04 — Stack

Technology Stack

Data Collection

Python 3, Requests, BeautifulSoup 4

Data Storage

SQLite, SQLAlchemy ORM

API Server

FastAPI, Uvicorn, Pydantic

05 — Features

Key Features

01

Web Scraping

Sends HTTP GET requests to books.toscrape.com and uses BeautifulSoup to parse the HTML DOM. Extracts book title and price from every article element on the page.

02

Database Integration

A dedicated db.py module handles all database concerns: opening connections, creating the books table, and inserting records. Clean separation from the API and scraper logic.

03

Pagination

The /books endpoint accepts page and limit query parameters. SQL LIMIT and OFFSET return only the requested page slice, so a single request never loads or serializes the whole table.
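The page arithmetic can be illustrated with a short sketch. The function name get_page and the 1-indexed page convention are assumptions for illustration; the ORDER BY clause is added here so that consecutive pages do not overlap or skip rows.

```python
import sqlite3


def get_page(conn: sqlite3.Connection, page: int = 1, limit: int = 10) -> list[tuple]:
    """Return one page of books; pages are 1-indexed."""
    if page < 1 or limit < 1:
        raise ValueError("page and limit must be positive integers")
    offset = (page - 1) * limit  # rows to skip before this page starts
    cursor = conn.execute(
        "SELECT id, title, price FROM books ORDER BY id LIMIT ? OFFSET ?",
        (limit, offset),
    )
    return cursor.fetchall()
```

With limit=10, page 1 maps to OFFSET 0 and page 3 to OFFSET 20; a page past the end of the table simply returns an empty list.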

04

Search Filtering

The /books/search?title= endpoint uses SQL LIKE pattern matching to filter books by title keyword. Returns metadata: the search term, count, and matched records.
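One detail worth sketching: LIKE treats % and _ in the search term as wildcards. The helper below is a hypothetical illustration (not taken from the project) of escaping them so that a keyword is matched literally; it relies on SQLite's ESCAPE clause with a backslash as the escape character.

```python
import sqlite3


def like_pattern(keyword: str, escape: str = "\\") -> str:
    """Build a 'contains' pattern, escaping %, _ and the escape character."""
    escaped = (
        keyword.replace(escape, escape + escape)
        .replace("%", escape + "%")
        .replace("_", escape + "_")
    )
    return f"%{escaped}%"


def search_titles(conn: sqlite3.Connection, keyword: str) -> list[str]:
    """Return titles containing the keyword as literal text."""
    # The SQL text seen by SQLite ends in: LIKE ? ESCAPE '\'
    sql = "SELECT title FROM books WHERE title LIKE ? ESCAPE '\\' ORDER BY id"
    return [row[0] for row in conn.execute(sql, (like_pattern(keyword),))]
```

SQLite's LIKE is case-insensitive for ASCII characters by default, so a lowercase keyword such as travel also matches Travel.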

05

Auto API Docs

FastAPI automatically generates interactive Swagger UI at /docs — making every endpoint explorable in the browser without any extra tooling.

06

Error Handling

Database exceptions are caught and returned as structured HTTP error responses, so a database failure yields a meaningful error for the client instead of an unhandled exception.

06 — Results

What This Demonstrates

01

End-to-end backend ownership: from raw HTML to a structured JSON API response, with each layer built separately and then connected.

02

Clean modular architecture — scraper, database, and API are fully decoupled; each file has a single responsibility.

03

Practical API design — pagination, filtering, structured error responses, and auto-generated Swagger documentation.

04

Real data pipeline experience — ingestion → parsing → storage → delivery, the same pattern used in production data systems at scale.

05

Client-agnostic API: the project ships no frontend of its own, and the API can be consumed by any frontend framework or HTTP client.

07 — Roadmap

Future Improvements

Prevent duplicate book entries with a unique constraint and upsert logic

Add automated pytest test suite covering all endpoints and edge cases

Replace SQLite with PostgreSQL for production-grade concurrency and scale

Deploy to Railway / Render with a public live URL

Add scheduled background scraping using APScheduler or Celery
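The first roadmap item could be approached with SQLite's UPSERT syntax (available in SQLite 3.24 and later). A sketch, assuming the title is treated as the unique key; that is an illustrative choice, not something stated in the project.

```python
import sqlite3


def create_table(conn: sqlite3.Connection) -> None:
    """Create the books table with a unique constraint on title."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS books ("
        "id INTEGER PRIMARY KEY, title TEXT NOT NULL UNIQUE, price REAL)"
    )


def upsert_book(conn: sqlite3.Connection, title: str, price: float) -> None:
    """Insert a book, or update its price if the title already exists."""
    conn.execute(
        "INSERT INTO books (title, price) VALUES (?, ?) "
        "ON CONFLICT(title) DO UPDATE SET price = excluded.price",
        (title, price),
    )
    conn.commit()
```

Re-running the scraper would then refresh prices in place rather than appending duplicate rows.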
