
Scraping LinkedIn

This is Part 2 of a series on MonitorIntent. In Part 1, we built an AI outreach tool that nobody trusted and pivoted to intent-based lead generation. This part covers building the MVP and the technical challenges we hit.

Our MVP took 2 months from first commit to something we could show people. That “something” was a Replit dashboard displaying a spreadsheet produced by a local Python script I ran on my laptop. No hosting or “deployment pipeline” (lmao). Quite literally a massive Python project running locally that spat out CSV files.

But before we could build anything, we had to figure out what we were actually looking for.

Two flavors of intent

We identified two flavors of buying intent worth tracking:

Indirect Intent meant tracking our customers’ competitors - finding them through recursive semantic search over LinkedIn companies (Exa made this possible) - then monitoring who was liking, commenting, and interacting with their content. The key filter: exclude employees of the competitor. Anyone else engaging heavily with a competitor’s posts is a potential buyer shopping around.
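For the curious, the recursive expansion looked something like this - a simplified sketch using Exa’s find_similar endpoint, where the depth, the URL check, and the employee filter are illustrative rather than our production code:

```python
from exa_py import Exa  # Exa's Python SDK

exa = Exa("YOUR_EXA_API_KEY")

def expand_competitors(seed_url: str, depth: int = 2, found: set | None = None) -> set:
    """Recursively collect LinkedIn company pages semantically similar to a seed page."""
    found = found if found is not None else set()
    if depth == 0 or seed_url in found:
        return found
    found.add(seed_url)
    for result in exa.find_similar(seed_url, num_results=10).results:
        if "linkedin.com/company" in result.url:  # keep only company pages
            expand_competitors(result.url, depth - 1, found)
    return found

def is_potential_buyer(engager_employer: str, competitor_name: str) -> bool:
    """The key filter: a competitor's own employees don't count as intent."""
    return engager_employer.strip().lower() != competitor_name.strip().lower()
```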

Direct Intent meant finding people talking about, posting about, or interacting with topics relevant to our customers’ business. Complaints were gold. Someone publicly frustrated with a tool in your category? That’s a lead. We also tracked people engaging with our customers’ own content - likes, comments, reposts.

Simple in theory. Brutal in practice, as it turned out.

The backend was just me and a script

Let me be honest about the architecture. The backend was a behemoth of a script running entirely on my machine. It scraped, processed, scored, and spat out a CSV that I manually forwarded to our intern or my co-founder. That was the pipeline.

The frontend was handled by our intern and the CEO using a combination of Replit, Framer, and custom scripts - lots of creative hacks to get something usable fast. We were doing it the founder way: paying a variable amount of attention to code quality depending on how much that code needed to be reliable.

About 50-70% of our codebase was AI-generated. But “AI-generated” doesn’t mean “unsupervised.” The backend was built with reuse in mind. It had coordination primitives for rate limiting, distributed locks, circuit breakers, and a caching layer I was genuinely proud of - whenever we modified a function, the cache entries corresponding to that function’s old version would auto-invalidate. No redeployment needed, no rebuilding the entire cache from scratch.
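The invalidation trick is simple enough to show. A minimal sketch of the idea (our real version also handled persistence and TTLs): hash the function’s source into the cache key, so editing a function orphans its old entries on its own.

```python
import functools
import hashlib
import inspect

_cache: dict = {}

def versioned_cache(fn):
    """Cache keyed on the function's source hash: edit the function and its
    old entries stop matching - no redeploy, no full cache rebuild."""
    source = inspect.getsource(fn)
    version = hashlib.sha256(source.encode()).hexdigest()[:12]

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        key = (fn.__qualname__, version, repr(args), repr(sorted(kwargs.items())))
        if key not in _cache:
            _cache[key] = fn(*args, **kwargs)
        return _cache[key]
    return wrapper

@versioned_cache
def scrape_profile(url: str) -> dict:
    ...  # expensive scrape goes here
```

In this sketch old entries simply never get hit again once the hash changes; in a persistent cache you’d expire them with a TTL.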

We cared about engineering, but we cared about speed more.

Scraping at scale is where dreams go to die

We realized quickly that scraping the internet en masse was a monumental task. Finding a single good lead required a chain of operations:

  1. Run automated searches across various topics to find a relevant post
  2. Scrape who’s interacting with that post
  3. Analyze what they’re saying - is it a comment, a repost, something meaningful?
  4. Scan their profile for relevance
  5. Check their employer for signals - recent funding, growth, news

Every step was a scrape. Every scrape was a potential rate limit, a blocked request, a stale page. Multiply that by thousands of posts and tens of thousands of profiles.
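Strung together, the chain looked roughly like this. Every function here is a hypothetical stand-in for a real scraper or scorer - the point is the shape of the funnel and the early exits, not the details:

```python
# Hypothetical stand-ins for the real scrapers and scorers - each one wrapped
# a network call, i.e. a potential rate limit, block, or stale page.
def search_posts(topic): ...
def scrape_engagers(post): ...
def analyze_interaction(person, post): ...   # returns None if not meaningful
def scrape_profile(person): ...              # returns None if irrelevant
def scrape_company(employer): ...
def score_lead(profile, interaction, company): ...

def leads_for_topic(topic: str) -> list:
    """One pass of the lead funnel, exiting early to save downstream scrapes."""
    leads = []
    for post in search_posts(topic):                         # 1. find relevant posts
        for person in scrape_engagers(post):                 # 2. who interacted?
            interaction = analyze_interaction(person, post)  # 3. meaningful?
            if interaction is None:
                continue
            profile = scrape_profile(person)                 # 4. profile relevance
            if profile is None:
                continue
            company = scrape_company(profile["employer"])    # 5. employer signals
            leads.append(score_lead(profile, interaction, company))
    return leads
```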

We threw everything we had at the optimization problem. We cached scrape results for a couple of days. We pre-filtered people who interacted with a post by skimming their headline instead of scraping and processing their entire profile upfront. We built detailed observability into the system so we could see step-by-step decisions made by our AI agents and their justifications - which let us work with early customers to tune the scoring.
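The headline pre-filter, for example, was conceptually just a cheap screen in front of the expensive scrape. Something like this, with made-up keywords:

```python
# Illustrative headline screen: a cheap substring check decides whether a
# full profile scrape is worth the request budget. Keywords are invented here.
RELEVANT_HINTS = ("sales", "revenue", "growth", "founder", "rev ops")

def worth_full_scrape(headline: str) -> bool:
    text = headline.lower()
    return any(hint in text for hint in RELEVANT_HINTS)

worth_full_scrape("VP Sales at Acme")  # True - go scrape the full profile
```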

The biggest win was our LLM cost optimization. We used models of varying intelligence at different stages - cheap models for simple classification, expensive ones only when nuance mattered. Within a month, we cut LLM spend by 5x. That mattered because we had to offer free trials to convince anyone to stick around, and the per-lead cost was exorbitant before that optimization.
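The cascade itself is easy to sketch. Model names and the call_llm client below are placeholders for whatever stack you run; the shape is what matters:

```python
def call_llm(model: str, prompt: str) -> str:
    """Stand-in for your LLM client - wire up OpenAI, Anthropic, etc. here."""
    raise NotImplementedError

def classify_intent(post_text: str) -> str:
    """Cheap model first; escalate to the expensive model only when it punts."""
    label = call_llm(
        model="cheap-small-model",  # hypothetical name
        prompt=f"Label this post as INTENT, NO_INTENT, or UNSURE:\n{post_text}",
    ).strip()
    if label != "UNSURE":
        return label  # most posts should end here, on the cheap tier
    return call_llm(
        model="expensive-frontier-model",  # hypothetical name
        prompt=f"Does this post show buying intent? Answer INTENT or NO_INTENT:\n{post_text}",
    ).strip()
```

If the cheap tier handles the bulk of posts, the expensive model’s spend becomes a rounding error.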

Social media has a terrible signal-to-noise ratio

Here’s the thing about LinkedIn intent data that we learned the hard way: people just don’t complain hard enough on LinkedIn. Not even in the lead generation space, where you’d expect some frustration to surface publicly.

LinkedIn isn’t Twitter. People curate. They stay professional. They don’t post “this tool is garbage and I’m switching” - they post “grateful for the learning experience” and quietly evaluate alternatives. So our direct intent signal - the complaints, the public frustration - was sparse.

We had to fall back on likes as a primary signal. A like. The lowest-effort interaction on the platform. Trying to infer genuine buying intent from someone double-tapping a post about sales automation is… not ideal.
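When likes are most of what you have, all you can do is weight them honestly. A hypothetical weighting - illustrative numbers, not our actual model:

```python
# Hypothetical interaction weights - illustrative values, not our real scoring.
INTERACTION_WEIGHTS = {
    "like": 0.1,       # lowest-effort signal on the platform
    "comment": 0.5,    # at least they typed something
    "repost": 0.7,     # a public endorsement
    "complaint": 1.0,  # rare on LinkedIn, but gold when it happens
}

def intent_score(interactions: list[str]) -> float:
    return sum(INTERACTION_WEIGHTS.get(kind, 0.0) for kind in interactions)
```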

The physical cost

The MVP development was so intense that I developed carpal tunnel in both hands from typing. That’s not a brag. That’s a warning about what happens when you’re a solo backend engineer trying to build scraping infrastructure that funded companies staff entire teams to run.

We had a working product. We had customers trying it. We had the cost under control. But the foundation - LinkedIn as a reliable source of buying intent - was shakier than we wanted to admit.

Post 2 of 4 in On MonitorIntent