VC / Portfolio Extractor

Automatically extract portfolio companies from VC sites, with robust parsing, cleaning, and deduplication.

Project Snapshot
Web scraping, Heuristics, Dedup, Automation

Overview

VC portfolio pages are messy: some link out to company sites, some use Webflow tiles, and some hide entries behind a “load more” button. This tool aims to extract company names and websites reliably, then post-process the results to remove noise and merge duplicates.

Key Design Choices

  • Prefer external link domains to identify companies; fall back to tile text/anchor labels when needed.
  • Standardized cleaning: filter logo assets, screenshot-like tokens, and decorative labels.
  • Post-processing: dedup by base-domain, merge near-duplicate names, keep root/homepage URLs when possible.
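
The design choices above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the tool's actual implementation: the names `LinkCollector`, `base_domain`, `extract_companies`, `ASSET_SUFFIXES`, and `IGNORED_DOMAINS` are all hypothetical, the base-domain logic is a naive "last two labels" rule (a public-suffix list would be needed for domains like `.co.uk`), and near-duplicate name merging is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical filter lists; a real tool would likely need longer ones.
ASSET_SUFFIXES = (".png", ".jpg", ".jpeg", ".svg", ".gif", ".webp")
IGNORED_DOMAINS = {"twitter.com", "linkedin.com", "facebook.com"}


def base_domain(url: str) -> str:
    """Naive base domain: the last two labels of the hostname."""
    host = (urlparse(url).hostname or "").lower()
    labels = host.split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else host


class LinkCollector(HTMLParser):
    """Collects (href, anchor text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            label = " ".join("".join(self._text).split())
            self.links.append((self._href, label))
            self._href = None


def extract_companies(html: str, page_url: str) -> list:
    """Keep one entry per external base domain, preferring root URLs."""
    parser = LinkCollector()
    parser.feed(html)
    site = base_domain(page_url)
    by_domain = {}
    for href, text in parser.links:
        url = urljoin(page_url, href)
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, etc.
        domain = base_domain(url)
        if not domain or domain == site or domain in IGNORED_DOMAINS:
            continue  # internal navigation or social/utility link
        if parsed.path.lower().endswith(ASSET_SUFFIXES):
            continue  # link target is a logo or screenshot asset
        is_root = parsed.path in ("", "/")
        current = by_domain.get(domain)
        if current is None or (is_root and not current["is_root"]):
            # Fall back to the domain label when the tile has no text.
            fallback = current["name"] if current else domain.split(".")[0].title()
            by_domain[domain] = {
                "name": text or fallback,
                "website": url,
                "is_root": is_root,
            }
    return [{"name": e["name"], "website": e["website"]} for e in by_domain.values()]


sample_html = """
<a href="https://www.acme.com/">Acme</a>
<a href="https://acme.com/blog/launch">Acme launch post</a>
<a href="/team">Team</a>
<a href="https://twitter.com/examplevc">Twitter</a>
<a href="https://globex.io/about"><img src="logo.png"></a>
"""
companies = extract_companies(sample_html, "https://example-vc.com/portfolio")
print(companies)
```

In this sample, the two `acme.com` links collapse into a single entry that keeps the homepage URL, the internal and social links are dropped, and the image-only tile falls back to a name derived from its domain.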

CLI Example

Illustrative invocation; the command and output below are placeholders to be replaced with a real run.

python extract_portfolio.py https://example-vc.com
# output: companies.json (name + website)
        

Next Steps

  • Add site-adapter templates for common CMS/component patterns.
  • Output confidence scores per extracted entry.
  • Optional: Playwright support for dynamic “load more” pages.
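
As one possible shape for the confidence-score item above, a simple additive heuristic could look like the sketch below. Everything here is an assumption made for illustration: the function name `confidence`, the signals it checks, and the weights are hypothetical and untuned, and the entry format (`name` plus `website`) follows the output described in the CLI example.

```python
from urllib.parse import urlparse


def confidence(entry: dict) -> float:
    """Hypothetical score in [0, 1]; weights are illustrative, not tuned."""
    score = 0.0
    website = entry.get("website") or ""
    if website:
        score += 0.5  # an external link is treated as the strongest signal
        if urlparse(website).path in ("", "/"):
            score += 0.2  # root/homepage URLs are less likely to be noise
    name = (entry.get("name") or "").strip()
    if name:
        score += 0.2  # a non-empty label was found
        if len(name.split()) <= 4:
            score += 0.1  # short labels look more like company names
    return round(score, 2)


for entry in [
    {"name": "Acme", "website": "https://acme.com/"},
    {"name": "Globex", "website": "https://globex.io/about"},
    {"name": "Acme", "website": ""},
]:
    print(entry["name"], confidence(entry))
```

A score like this could be written alongside each entry so that low-confidence rows are easy to review by hand.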