masinideteren.ro
Overview
masinideteren.ro is a web scraping system that aggregates offroad car listings from two Romanian automotive marketplaces, OLX and Autovit. It uses a distributed worker architecture to collect, process, and analyze vehicle listings.
How it started
I'm passionate about offroad cars and have owned several over the years, including two Suzuki Vitaras and two VW Touaregs. Along the way I was constantly searching for the perfect offroad vehicle and spending countless hours browsing listings on various platforms. I realized I needed a more efficient solution: an automated crawler that could find and track offroad cars across multiple marketplaces. This project was born out of that personal need, combining my passion for offroad vehicles with my technical skills to build a tool that makes car hunting much easier.
Architecture
The system is built using Docker Swarm for orchestration and consists of three specialized workers:
1. Listing Scraper Worker
- Runs every 30 minutes
- Searches and processes listing pages from OLX and Autovit
- Identifies new offroad car listings
- Queues individual ads for detailed scraping
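The "identify new listings, then queue them" step can be sketched as below. This is a minimal illustration, not the project's actual code: the `ListingSummary` shape, the in-memory `seenKeys` set, and the `adQueue` array are all assumptions standing in for whatever store and queue the real workers use.

```typescript
// Hypothetical shape of one result found on a search/listing page.
interface ListingSummary {
  id: string; // marketplace-specific ad identifier (assumed to exist)
  source: "olx" | "autovit";
  url: string;
}

// In-memory stand-ins for a persistent store and a job queue (assumptions).
const seenKeys = new Set<string>();
const adQueue: ListingSummary[] = [];

// The source states the worker runs every 30 minutes; a scheduler
// would call queueNewListings on that interval.
const RUN_INTERVAL_MS = 30 * 60 * 1000;

// Queues listings that were not seen in an earlier run and returns them.
function queueNewListings(found: ListingSummary[]): ListingSummary[] {
  const fresh: ListingSummary[] = [];
  for (const listing of found) {
    // Prefix with the source so equal ids on different sites do not collide.
    const key = `${listing.source}:${listing.id}`;
    if (seenKeys.has(key)) continue;
    seenKeys.add(key);
    adQueue.push(listing);
    fresh.push(listing);
  }
  return fresh;
}
```

Keying on source plus id keeps the check cheap and makes repeated runs idempotent: a listing that shows up again 30 minutes later is skipped instead of being scraped twice.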
2. Single Ad Scraper Worker
- Processes individual listing URLs
- Extracts detailed information including descriptions
- Downloads and stores vehicle images
- Uses Cheerio for static content and Puppeteer for dynamic content
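One way to combine the two tools is to try the cheap static path first and fall back to a rendered page only when the needed fields are missing. The sketch below assumes that ordering; the source only says which tool handles which kind of content. The `Fetcher` and `Extractor` types and the `scrapeAd` function are hypothetical names. In a real worker `fetchStatic` might wrap a plain HTTP request parsed with Cheerio, and `fetchRendered` a Puppeteer page load.

```typescript
// A fetcher takes a URL and resolves to that page's HTML.
type Fetcher = (url: string) => Promise<string>;

interface AdDetails {
  title: string;
  description: string;
}

// Returns null when the HTML does not contain the fields we need.
type Extractor = (html: string) => AdDetails | null;

interface ScrapeResult {
  details: AdDetails;
  via: "static" | "rendered";
}

// Tries static HTML first, then falls back to browser-rendered HTML.
async function scrapeAd(
  url: string,
  fetchStatic: Fetcher,
  fetchRendered: Fetcher,
  extract: Extractor,
): Promise<ScrapeResult> {
  const fromStatic = extract(await fetchStatic(url));
  if (fromStatic !== null) {
    return { details: fromStatic, via: "static" };
  }
  // Static HTML lacked the data, so pay for a full browser render.
  const fromRendered = extract(await fetchRendered(url));
  if (fromRendered === null) {
    throw new Error(`Could not extract ad details from ${url}`);
  }
  return { details: fromRendered, via: "rendered" };
}
```

Passing the fetchers in as parameters keeps the decision logic testable without a network connection or a headless browser.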
3. LLM Processing Worker
- Analyzes scraped ad content using a Large Language Model
- Extracts structured data from unstructured descriptions
- Enhances listings with additional metadata
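Turning a free-text description into structured data usually comes down to two steps: asking the model for JSON, then validating what comes back. The sketch below shows only those two steps and leaves out the model call itself. The `VehicleFacts` fields, the prompt wording, and both function names are assumptions chosen for illustration; the source does not say which fields or which model the worker uses.

```typescript
// Hypothetical set of fields to pull out of an ad description.
interface VehicleFacts {
  make: string | null;
  model: string | null;
  year: number | null;
  fourWheelDrive: boolean | null;
}

// Builds the text sent to the model for one ad.
function buildExtractionPrompt(description: string): string {
  return [
    "Extract vehicle facts from the ad below.",
    'Reply with JSON only, using the keys "make", "model", "year", "fourWheelDrive".',
    "Use null for anything the ad does not state.",
    "",
    description.trim(),
  ].join("\n");
}

// Parses the model's reply, keeping only fields with the expected type.
function parseVehicleFacts(raw: string): VehicleFacts {
  // Replies may wrap the JSON in extra text, so take the outermost braces.
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end <= start) {
    throw new Error("No JSON object found in model reply");
  }
  const data: unknown = JSON.parse(raw.slice(start, end + 1));
  if (typeof data !== "object" || data === null) {
    throw new Error("Model reply is not a JSON object");
  }
  const rec = data as Record<string, unknown>;
  const make = rec["make"];
  const model = rec["model"];
  const year = rec["year"];
  const fwd = rec["fourWheelDrive"];
  return {
    make: typeof make === "string" ? make : null,
    model: typeof model === "string" ? model : null,
    year: typeof year === "number" && Number.isInteger(year) ? year : null,
    fourWheelDrive: typeof fwd === "boolean" ? fwd : null,
  };
}
```

Treating every field as nullable and type-checking each one means a malformed or partial reply degrades to missing metadata instead of corrupting the listing.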
Tech Stack
Frontend:
- React
Backend:
- Express
- Node.js
Scraping Tools:
- Cheerio - for parsing static HTML
- Puppeteer - for dynamic content and JavaScript-heavy pages
Infrastructure:
- Docker Swarm - for distributed worker orchestration