VC / Portfolio Extractor
Automatically extract portfolio companies from VC sites, with robust parsing, cleaning, and deduplication.
Project Snapshot
Web scraping · Heuristics · Dedup · Automation
Overview
VC portfolio pages are messy: some link out to company sites, some use Webflow tiles, and some hide entries behind a “load more” button. This tool aims to extract company names and websites reliably, then apply strong post-processing to remove noise and merge duplicates.
Key Design Choices
- Prefer external link domains to identify companies; fall back to tile text/anchor labels when needed.
- Standardized cleaning: filter logo assets, screenshot-like tokens, and decorative labels.
- Post-processing: dedup by base-domain, merge near-duplicate names, keep root/homepage URLs when possible.
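The choices above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the tool's actual implementation: the names `AnchorCollector`, `base_domain`, `extract_companies`, and `ASSET_SUFFIXES` are hypothetical, the base-domain logic is a naive "last two host labels" rule (a real tool would likely consult the Public Suffix List), and near-duplicate name merging is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Assumption: links ending in these extensions are logo/screenshot assets, not company sites.
ASSET_SUFFIXES = (".png", ".jpg", ".jpeg", ".svg", ".gif", ".webp")


class AnchorCollector(HTMLParser):
    """Collect (href, label) pairs for every <a href=...> in a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            label = " ".join("".join(self._text).split())
            self.links.append((self._href, label))
            self._href = None


def base_domain(url):
    """Naive base domain: last two host labels (wrong for suffixes like .co.uk)."""
    host = (urlparse(url).hostname or "").lower()
    return ".".join(host.split(".")[-2:])


def extract_companies(html, vc_url):
    """Return one {name, website} entry per external base domain, in page order."""
    parser = AnchorCollector()
    parser.feed(html)
    own = base_domain(vc_url)
    seen = {}
    for href, label in parser.links:
        parsed = urlparse(href)
        if parsed.scheme not in ("http", "https"):
            continue  # relative links, mailto:, etc. are not external company sites
        if parsed.path.lower().endswith(ASSET_SUFFIXES):
            continue  # filter logo/screenshot assets
        domain = base_domain(href)
        if not domain or domain == own or domain in seen:
            continue  # skip the VC's own pages and duplicate domains
        # Fall back to the domain label when the anchor has no usable text.
        name = label or domain.split(".")[0].capitalize()
        # Keep the root/homepage URL rather than a deep link.
        seen[domain] = {"name": name, "website": f"{parsed.scheme}://{parsed.hostname}/"}
    return list(seen.values())
```

In this sketch the first link seen for a base domain wins, and its URL is normalized to the site root. Whether to prefer the first occurrence or the one with the best label is a design choice that the confidence scores listed under Next Steps could inform.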
CLI Example
python extract_portfolio.py https://example-vc.com
# output: companies.json (name + website)
Next Steps
- Add site-adapter templates for common CMS/component patterns.
- Output confidence scores per extracted entry.
- Optional: Playwright support for dynamic “load more” pages.