Multi-Retailer Price Intelligence & Arbitrage Analytics Platform
A production-ready web scraping and analytics system for comparing product prices across Amazon, Walmart, Kohl's, Kmart, and other major retailers. Features intelligent anti-bot bypass via the Bright Data Site Unblocker API, API reverse-engineering, cloud-native data pipelines, and a modern web application for real-time price comparison and arbitrage opportunity detection.
- Flexible Architecture: Supports both single-instance and distributed modes (scrapy-redis ready)
- Bright Data Integration:
  - Site Unblocker API: Token-based API access (recommended) for automatic anti-bot bypass
  - Traditional Proxy: Username/password proxy support with automatic failover
  - Residential Proxy: Optional residential proxy support for additional resilience
- API Discovery: Automatic detection of hidden JSON APIs from Network Tab data
- Multi-Retailer Support: Spiders for Amazon, Walmart, Kohl's, and Kmart
- Resilience: Exponential backoff for 429 errors, automatic retries, and comprehensive proxy logging
- Multi-Tiered Storage (see the pipeline sketch after this list):
  - Raw HTML archived to Google Cloud Storage (`raw/{site}/{date}/{product_id}.html`)
  - Structured data streamed to Google BigQuery for analytics (batch inserts, auto schema creation)
- BigQuery Schema: Comprehensive schema with product details (brand, model, category, SKU, ratings, reviews)
- Redis Caching: Fast response times with intelligent caching for API endpoints
- FastAPI Backend: RESTful API with automatic OpenAPI documentation
- Product Search: Advanced search with filters (brand, retailer, price range, pagination)
- Price Comparison: Side-by-side comparison across multiple retailers
- Arbitrage Dashboard: Find profitable price differences automatically
- Price History: Visualize price trends over time with interactive charts
- Spider Management: Trigger and monitor scraping jobs via API
- Modern UI: Built with Next.js 14, React 18, TypeScript, and Tailwind CSS
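The storage bullet above translates roughly into a Scrapy item pipeline. The sketch below is an illustrative stand-in for the project's `GCSRawHTMLPipeline`, assuming the item carries `site`, `product_id`, and `raw_html` fields; only the blob path layout (`raw/{site}/{date}/{product_id}.html`) and the `GCS_BUCKET_NAME` setting come from this README.

```python
from datetime import date

from google.cloud import storage  # pip install google-cloud-storage


class RawHTMLToGCSPipeline:
    """Sketch: archive each page to gs://<bucket>/raw/{site}/{date}/{product_id}.html."""

    def __init__(self, bucket_name: str):
        self.bucket = storage.Client().bucket(bucket_name)

    @classmethod
    def from_crawler(cls, crawler):
        # Bucket name comes from GCS_BUCKET_NAME in config/.env via the Scrapy settings.
        return cls(bucket_name=crawler.settings.get("GCS_BUCKET_NAME"))

    def process_item(self, item, spider):
        # Assumes the item carries `site`, `product_id`, and `raw_html` fields.
        blob_path = f"raw/{item['site']}/{date.today().isoformat()}/{item['product_id']}.html"
        self.bucket.blob(blob_path).upload_from_string(
            item["raw_html"], content_type="text/html"
        )
        return item
```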
┌─────────────────────┐
│ Frontend (Next.js) │
│ - Search │
│ - Compare │
│ - Arbitrage │
└──────────┬──────────┘
│
┌──────▼──────┐
│ Backend │
│ (FastAPI) │
└──────┬──────┘
│
┌──────▼──────┐ ┌──────────────┐
│ BigQuery │ │ Redis Cache │
│ Analytics │ │ │
└──────┬──────┘ └──────────────┘
│
┌──────▼──────────────────────────┐
│ Scrapy Spiders │
│ (Amazon, Walmart, Kohl's, Kmart)│
└──────┬──────────────────────────┘
│
┌──────▼──────────────┐
│ Bright Data │
│ Proxy Middleware │
└──────┬──────────────┘
│
┌──────▼──────────────┐
│ Target Retailers │
└─────────────────────┘
- Python 3.11+
- Node.js 20+
- Docker and Docker Compose
- Google Cloud Platform account with BigQuery and GCS access
- Clone the repository

  ```bash
  git clone <repository-url>
  cd WebScrapeAMZN
  ```

- Configure environment variables

  ```bash
  cp config/env_template.txt config/.env
  # Edit config/.env with your credentials
  ```

- Start all services with Docker Compose

  ```bash
  docker-compose up -d
  ```
This will start:
- Redis (port 6379)
- Backend API (port 8000)
- Frontend (port 3000)
To run the services locally instead of via Docker Compose:

Backend:

```bash
cd backend
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
uvicorn app.main:app --reload
```

Frontend:

```bash
cd frontend
npm install
npm run dev
```

Scrapy spiders:

```bash
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
cd scrapy_project
```

Copy `config/env_template.txt` to `config/.env` and configure the following:
- Bright Data API Access (Recommended):
  - `BRIGHT_DATA_API_TOKEN`: Your Bright Data API token (Bearer token)
  - `BRIGHT_DATA_ZONE`: Your zone name (e.g., `webscrape_amzn`)
  - `BRIGHT_DATA_API_ENDPOINT`: API endpoint (default: `https://api.brightdata.com/request`)
  - `BRIGHT_DATA_PROXY_TYPE`: Set to `site_unblocker` for API mode
- Bright Data Traditional Proxy (Alternative):
  - `BRIGHT_DATA_USERNAME`: Site Unblocker username
  - `BRIGHT_DATA_PASSWORD`: Site Unblocker password
  - `BRIGHT_DATA_ENDPOINT`: Proxy endpoint (e.g., `zproxy.lum-superproxy.io:22225`)
- Bright Data Residential Proxy (Optional - for failover):
  - `BRIGHT_DATA_RESIDENTIAL_USERNAME`: Residential proxy username
  - `BRIGHT_DATA_RESIDENTIAL_PASSWORD`: Residential proxy password
  - `BRIGHT_DATA_RESIDENTIAL_ENDPOINT`: Residential proxy endpoint
- Redis: Connection details (currently optional, required only for distributed mode)
  - `REDIS_HOST`: Redis host (default: `localhost`)
  - `REDIS_PORT`: Redis port (default: `6379`)
  - `REDIS_PASSWORD`: Redis password (if required)
- Google Cloud Platform: For data storage
  - `GOOGLE_APPLICATION_CREDENTIALS`: Path to GCP service account JSON key file
  - `GCS_BUCKET_NAME`: Google Cloud Storage bucket name for raw HTML
  - `BQ_DATASET`: BigQuery dataset name
  - `BQ_TABLE`: BigQuery table name
- Resilience Settings (see the backoff sketch below):
  - `BACKOFF_BASE_DELAY`: Base delay for exponential backoff (default: `1` second)
  - `BACKOFF_MAX_RETRIES`: Maximum retry attempts (default: `5`)
  - `BACKOFF_MAX_WAIT`: Maximum wait time (default: `300` seconds)

See `config/env_template.txt` for detailed setup instructions and examples.
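Taken together, the resilience settings above drive a delay schedule along these lines. This is a hedged sketch, not the project's actual `ExponentialBackoffMiddleware`; the `compute_backoff_delay` helper and the jitter term are illustrative assumptions.

```python
import os
import random

# Values loaded from config/.env (names match the variables documented above).
BASE_DELAY = float(os.getenv("BACKOFF_BASE_DELAY", "1"))
MAX_RETRIES = int(os.getenv("BACKOFF_MAX_RETRIES", "5"))
MAX_WAIT = float(os.getenv("BACKOFF_MAX_WAIT", "300"))


def compute_backoff_delay(retry_count: int) -> float:
    """Illustrative schedule: base * 2^retries, capped at BACKOFF_MAX_WAIT, plus jitter."""
    delay = min(BASE_DELAY * (2 ** retry_count), MAX_WAIT)
    return delay + random.uniform(0, delay * 0.1)


# Example: the waits a request would see if it kept hitting HTTP 429.
for attempt in range(MAX_RETRIES):
    print(f"retry {attempt + 1}: wait ~{compute_backoff_delay(attempt):.1f}s")
```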
Amazon Spider:

```bash
cd scrapy_project
scrapy crawl amazon -a start_urls="https://www.amazon.com/s?k=laptop"
```

Walmart Spider:

```bash
cd scrapy_project
scrapy crawl walmart -a start_urls="https://www.walmart.com/search?q=laptop"
```

Kohl's Spider:

```bash
cd scrapy_project
scrapy crawl kohls -a start_urls="https://www.kohls.com/search.jsp?search=laptop"
```

Kmart Spider:

```bash
cd scrapy_project
scrapy crawl kmart -a start_urls="https://www.kmart.com/search=laptop"
```
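Each `-a start_urls=...` argument above has to be consumed by the spider itself. The snippet below is a hedged sketch of that wiring, assuming comma-separated seed URLs; the CSS selectors and yielded fields are placeholders, not the project's actual extraction logic.

```python
import scrapy


class AmazonSpider(scrapy.Spider):
    """Sketch of a spider that accepts -a start_urls="url1,url2" on the command line."""

    name = "amazon"

    def __init__(self, start_urls: str = "", *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Scrapy passes -a arguments as strings; split so several seed URLs can be given.
        self.start_urls = [u.strip() for u in start_urls.split(",") if u.strip()]

    def parse(self, response):
        # Placeholder extraction; the real spiders populate the ProductItem schema.
        for product in response.css("div.s-result-item"):
            yield {
                "title": product.css("h2 a span::text").get(),
                "price": product.css("span.a-offscreen::text").get(),
                "url": response.urljoin(product.css("h2 a::attr(href)").get() or ""),
            }
```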
To enable distributed scraping with multiple workers:

- Enable the Redis scheduler in `scrapy_project/retail_intelligence/settings.py`:

  ```python
  # Replace lines 59-61 with:
  SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
  DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
  SCHEDULER_PERSIST = True
  ```

- Configure the Redis connection in `config/.env`:

  ```bash
  REDIS_HOST=localhost
  REDIS_PORT=6379
  REDIS_PASSWORD=
  ```

- Start Redis (if using Docker Compose):

  ```bash
  docker-compose up -d redis
  ```

- Start multiple workers in separate terminals:

  ```bash
  # Terminal 1
  cd scrapy_project
  scrapy crawl amazon

  # Terminal 2
  cd scrapy_project
  scrapy crawl amazon

  # Terminal 3
  cd scrapy_project
  scrapy crawl amazon
  ```
All workers will share the same Redis queue, automatically distributing work.
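To verify that the workers really are sharing one queue, you can peek at Redis directly. A minimal sketch with `redis-py`, assuming scrapy-redis's default key layout (`<spider>:requests` for the scheduler queue, `<spider>:dupefilter` for seen request fingerprints):

```python
import os

import redis

r = redis.Redis(
    host=os.getenv("REDIS_HOST", "localhost"),
    port=int(os.getenv("REDIS_PORT", "6379")),
    password=os.getenv("REDIS_PASSWORD") or None,
)

spider = "amazon"
queue_key = f"{spider}:requests"         # pending requests shared by every worker
dupefilter_key = f"{spider}:dupefilter"  # fingerprints of requests already seen

# The default scrapy-redis priority queue is a sorted set; fall back to LLEN for list queues.
key_type = r.type(queue_key).decode()
pending = r.zcard(queue_key) if key_type == "zset" else r.llen(queue_key)
print(f"pending requests: {pending}")
print(f"deduplicated fingerprints: {r.scard(dupefilter_key)}")
```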
The middleware supports multiple proxy access methods:
1. Bright Data Site Unblocker API (Recommended)
   - Uses token-based API access for automatic anti-bot bypass
   - Configure with `BRIGHT_DATA_API_TOKEN` and `BRIGHT_DATA_ZONE`
   - Set `BRIGHT_DATA_PROXY_TYPE=site_unblocker`
2. Traditional Proxy Access
   - Uses username/password authentication
   - Configure with `BRIGHT_DATA_USERNAME` and `BRIGHT_DATA_PASSWORD`
   - Works with both Site Unblocker and Residential proxies
3. Proxy Selection Strategy
   Set `BRIGHT_DATA_PROXY_TYPE` in `.env`:
   - `site_unblocker`: Use only the Site Unblocker (API or proxy)
   - `residential`: Use only the Residential Proxy
   - `auto`: Try the Site Unblocker first, automatically fall back to the Residential Proxy on failures

Features:
- Automatic retry with exponential backoff on API timeouts
- Automatic failover from Site Unblocker to Residential Proxy (in `auto` mode)
- Comprehensive logging of proxy usage and performance metrics
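For orientation, the API mode boils down to a single authenticated POST per target URL. The body fields below (`zone`, `url`, `format`) are an assumption based on the configuration keys described above; check the Bright Data documentation for the authoritative request schema. The project's real routing lives in `BrightDataProxyMiddleware`.

```python
import os

import requests

API_ENDPOINT = os.getenv("BRIGHT_DATA_API_ENDPOINT", "https://api.brightdata.com/request")
API_TOKEN = os.getenv("BRIGHT_DATA_API_TOKEN")
ZONE = os.getenv("BRIGHT_DATA_ZONE")


def fetch_via_site_unblocker(url: str, timeout: float = 60.0) -> str:
    """Fetch a page through the Site Unblocker API (assumed JSON body: zone, url, format)."""
    response = requests.post(
        API_ENDPOINT,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        json={"zone": ZONE, "url": url, "format": "raw"},
        timeout=timeout,
    )
    response.raise_for_status()
    return response.text  # raw HTML of the target page


if __name__ == "__main__":
    print(len(fetch_via_site_unblocker("https://www.amazon.com/s?k=laptop")), "bytes of HTML")
```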
WebScrapeAMZN/
├── backend/ # FastAPI Backend
│ ├── app/
│ │ ├── main.py # FastAPI application entry point
│ │ ├── api/ # API route handlers
│ │ │ ├── products.py # Product search and retrieval
│ │ │ ├── comparison.py # Price comparison endpoints
│ │ │ ├── arbitrage.py # Arbitrage opportunity detection
│ │ │ └── spiders.py # Spider job management
│ │ ├── services/ # Business logic services
│ │ │ ├── bigquery_service.py # BigQuery data access
│ │ │ ├── cache_service.py # Redis caching
│ │ │ └── gcs_service.py # GCS file operations
│ │ └── models/ # Pydantic data models
│ │ ├── product.py
│ │ ├── comparison.py
│ │ └── arbitrage.py
│ ├── scripts/ # Utility scripts
│ │ ├── check_and_fix_bigquery_schema.py
│ │ └── create_bigquery_views.sql
│ ├── Dockerfile
│ └── requirements.txt
├── frontend/ # Next.js 14 Frontend
│ ├── app/ # Next.js App Router (pages)
│ │ ├── page.tsx # Home page
│ │ ├── search/ # Product search page
│ │ ├── compare/ # Price comparison page
│ │ ├── arbitrage/ # Arbitrage dashboard
│ │ └── product/[id]/ # Product detail page
│ ├── components/ # React components
│ │ ├── ProductCard.tsx
│ │ ├── ComparisonTable.tsx
│ │ ├── ArbitrageCard.tsx
│ │ ├── PriceChart.tsx
│ │ └── ...
│ ├── lib/ # Utilities
│ │ └── api.ts # API client
│ ├── Dockerfile
│ └── package.json
├── scrapy_project/ # Scrapy Scraping System
│ ├── scrapy.cfg # Scrapy project configuration
│ └── retail_intelligence/
│ ├── settings.py # Scrapy settings (loads from config/.env)
│ ├── items.py # ProductItem schema definition
│ ├── pipelines.py # Data processing pipelines
│ │ ├── GCSRawHTMLPipeline # GCS upload pipeline
│ │ └── BigQueryAnalyticsPipeline # BigQuery insertion pipeline
│ ├── middlewares.py # Request/response middlewares
│ │ ├── BrightDataProxyMiddleware # Bright Data proxy routing
│ │ ├── ExponentialBackoffMiddleware # 429 error handling
│ │ └── ProxyLoggingMiddleware # Proxy statistics
│ ├── spiders/ # Spider implementations
│ │ ├── amazon_spider.py
│ │ ├── walmart_spider.py
│ │ ├── kohls_spider.py
│ │ └── kmart_spider.py
│ └── utils/ # Utility modules
│ ├── api_discovery.py # API endpoint discovery
│ ├── curl_cffi_client.py # TLS fingerprint client
│ └── schema_mapper.py # Data normalization
├── config/
│ ├── env_template.txt # Environment variable template
│ └── gcp-credentials.json # GCP service account key (not in git)
├── docker-compose.yml # Docker Compose configuration
├── requirements.txt # Python dependencies (Scrapy)
└── README.md
- `GET /api/products/search` - Search products with filters
- `GET /api/products/{product_id}` - Get single product
- `GET /api/products/brands/list` - Get available brands
- `POST /api/comparison/compare` - Compare multiple products
- `GET /api/comparison/{product_id}` - Get all retailers for a product
- `GET /api/arbitrage/opportunities` - Get arbitrage opportunities
- `GET /api/arbitrage/price-history/{product_id}` - Get price history
- `POST /api/spiders/trigger` - Trigger scraping job
- `GET /api/spiders/status/{job_id}` - Get job status
API documentation is available at http://localhost:8000/docs when the backend is running.
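As a quick smoke test of the search endpoint once the stack is up, something like the following should work. The query parameter names (`q`, `brand`, `min_price`, `max_price`) and the response shape are assumptions about the FastAPI route; the OpenAPI docs above are the authoritative reference.

```python
import requests

API_BASE = "http://localhost:8000"

# Hypothetical query parameters; verify the real names at /docs.
params = {"q": "laptop", "brand": "Lenovo", "min_price": 200, "max_price": 800}

resp = requests.get(f"{API_BASE}/api/products/search", params=params, timeout=10)
resp.raise_for_status()

for product in resp.json().get("results", []):
    print(product.get("title"), product.get("price"), product.get("retailer"))
```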
- Discovery: Spiders discover hidden APIs from Network Tab data
- Scraping: Requests routed through Bright Data proxies with exponential backoff
- Storage:
- Raw HTML → Google Cloud Storage
- Cleaned data → Google BigQuery (with brand, model, category, SKU)
- Analytics: BigQuery materialized views for fast queries
- API: FastAPI backend serves data to frontend
- Frontend: Next.js app displays comparisons and arbitrage opportunities
- Caching: Redis caches frequent queries for performance
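The caching step is essentially a get-or-compute wrapper around Redis with a TTL. A minimal sketch of that pattern, assuming JSON-serializable results; the function and key names are illustrative, not the actual `cache_service.py` API.

```python
import json
import os
from typing import Callable

import redis

r = redis.Redis(
    host=os.getenv("REDIS_HOST", "localhost"),
    port=int(os.getenv("REDIS_PORT", "6379")),
    decode_responses=True,
)


def cached_query(key: str, compute: Callable[[], dict], ttl_seconds: int = 300) -> dict:
    """Return the cached value for `key`, or compute it, cache it with a TTL, and return it."""
    hit = r.get(key)
    if hit is not None:
        return json.loads(hit)
    value = compute()
    r.setex(key, ttl_seconds, json.dumps(value))
    return value


# Example: cache a (pretend) BigQuery search result for five minutes.
result = cached_query(
    "search:laptop:page=1",
    compute=lambda: {"results": [], "total": 0},  # stand-in for a real BigQuery call
)
```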
Proxy statistics are logged periodically and on spider close:
- Request counts per proxy type
- Success/failure rates
- Average response times
- Error tracking
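One way statistics like these could be accumulated is sketched below; it is an illustrative stand-in for the idea behind `ProxyLoggingMiddleware`, not its actual implementation.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ProxyStats:
    """Per-proxy-type counters for requests, failures, latency, and error codes."""

    requests: int = 0
    failures: int = 0
    total_response_time: float = 0.0
    errors: dict = field(default_factory=lambda: defaultdict(int))

    def record(self, elapsed: float, ok: bool, error: str | None = None) -> None:
        self.requests += 1
        self.total_response_time += elapsed
        if not ok:
            self.failures += 1
            if error:
                self.errors[error] += 1

    def summary(self) -> dict:
        success = self.requests - self.failures
        return {
            "requests": self.requests,
            "success_rate": success / self.requests if self.requests else 0.0,
            "avg_response_time": self.total_response_time / self.requests if self.requests else 0.0,
            "errors": dict(self.errors),
        }


# One bucket per proxy type (e.g. site_unblocker, residential).
stats = defaultdict(ProxyStats)
stats["site_unblocker"].record(elapsed=1.8, ok=True)
stats["site_unblocker"].record(elapsed=4.2, ok=False, error="429")
print({k: v.summary() for k, v in stats.items()})
```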
- Ensure Redis is running: `docker-compose ps`
- Check `REDIS_HOST` and `REDIS_PORT` in `.env`
- Verify Bright Data credentials in dashboard
- Check proxy endpoint format matches Bright Data documentation
- Ensure service account has required permissions
- Verify dataset and table exist (or enable auto-creation)
- Run `backend/scripts/create_bigquery_views.sql` to create optimized views
- Ensure backend API is running on port 8000
- Check the `NEXT_PUBLIC_API_URL` environment variable
- Clear browser cache if seeing stale data
- Check Redis connection: `docker-compose ps redis`
- Verify GCP credentials path in environment variables
- Check API logs: `docker-compose logs backend`
- Verify BigQuery table exists: check `BQ_DATASET` and `BQ_TABLE` in `.env`
- API Token Invalid: Verify `BRIGHT_DATA_API_TOKEN` matches your Bright Data dashboard
- Zone Not Found: Ensure `BRIGHT_DATA_ZONE` matches your zone name exactly
- API Timeouts: The middleware automatically retries with exponential backoff (up to 3 retries)
- Fallback to Proxy: If the API fails after retries, the system falls back to the traditional proxy if configured
[Your License Here]
Contributions are welcome! If you'd like to improve this project, fix bugs, or add new features, feel free to fork the repository, make your changes, and submit a pull request. Your efforts will help make this price intelligence platform even better!
If you found this project helpful or learned something new from it, you can support the development with just a cup of coffee ☕. It's always appreciated and keeps the ideas flowing!