The Intelligence Feed That Builds Itself

One Command. Done.

"Add this article to our feed."

npm run url-news "https://techcrunch.com/article-url"

That's it. Article extracted, analyzed, tagged, and added.

The Architecture: Hybrid Local + LLM

We initially tried two extremes:

  1. Pure LLM: Sending raw HTML to GPT-4. Expensive, slow, and prone to hallucinating metadata.
  2. Pure Scraper: Regex and Cheerio. Fast, but brittle and terrible at summarizing nuance.

The solution was a hybrid architecture.

Phase 1: The Local Extractor (Fast & Cheap)

Before we waste a single token, we process the content locally.

We built a robust extractor that runs right on the machine. It handles:

  1. Fetching & Rendering: Handles redirects and basic cleanup.
  2. Metadata Extraction: Pulls og:title, authors, dates, and site names using standard meta tags.
  3. Content Cleaning: Uses a cascade of selectors to find the actual article body, stripping away navigation, ads, and footers without needing an AI to "read" the page.
  4. Heuristic Relevance Scoring: A weighted keyword algorithm immediately discards irrelevant noise (spam, ads, off-topic posts) before they reach the expensive steps.
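To make the cleaning step concrete, here is a minimal sketch of the idea. Our real extractor uses Cheerio selectors; plain regexes stand in for the same cascade here, and the tag list is illustrative, not our production configuration.

```javascript
// Elements that are almost never part of the article body.
const NOISE_TAGS = ["nav", "aside", "footer", "header", "script", "style"];

// Strip noise elements, then collapse the remaining markup to plain text.
function cleanHtml(html) {
  let out = html;
  for (const tag of NOISE_TAGS) {
    // Remove each noise element and everything inside it.
    out = out.replace(new RegExp(`<${tag}[\\s\\S]*?</${tag}>`, "gi"), "");
  }
  return out
    .replace(/<[^>]+>/g, " ") // drop any remaining tags
    .replace(/\s+/g, " ") // collapse whitespace
    .trim();
}
```

Running `cleanHtml('<nav>Menu</nav><article><p>Grid news.</p></article><footer>(c)</footer>')` yields just `"Grid news."` — the navigation and footer never reach the LLM.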

Phase 2: The LLM Enrichment (Smart)

Once we have clean, high-signal text, we bring in the heavy guns. We pass the cleaned JSON to an LLM (Claude or OpenAI) for the tasks that actually require intelligence:

  • Synthesis: "Summarize this for a CTO worried about grid reliability." (Something regex can't do)
  • Sentiment Nuance: Distinguishing between "investment announced" (good) and "project delayed" (bad) in complex sentences.
  • Structured Extraction: Converting "two billion dollars" and "$2.5B" into standardized numbers for our database.
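The hand-off to Phase 2 can be sketched as a prompt builder plus a defensive parser. The field names and prompt wording below are illustrative, not our production schema; the point is that the model only ever sees cleaned text, and malformed responses fail soft.

```javascript
// Hypothetical enrichment prompt: asks the LLM for structured JSON.
function buildEnrichmentPrompt(article) {
  return [
    "You are an energy-infrastructure analyst.",
    "Return JSON with keys: summary, sentiment (positive|negative|neutral),",
    'amounts_usd (array of numbers, e.g. 2500000000 for "$2.5B").',
    `Title: ${article.title}`,
    `Body: ${article.text}`,
  ].join("\n");
}

// Defensive parse: if the model returns malformed or incomplete JSON,
// return null so the caller can fall back to the local extraction.
function parseEnrichment(raw) {
  try {
    const data = JSON.parse(raw);
    if (typeof data.summary !== "string") return null;
    return data;
  } catch {
    return null;
  }
}
```

The parser is deliberately strict: a hallucinated or truncated response becomes `null` rather than a corrupt database row.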

Why This Pattern Matters

1. Cost Control
By cleaning the HTML locally, we reduce the token count by 60–80% before the API call. We pay to process information, not <div> tags.

2. Speed
Local relevance scoring means we can discard low-value URLs in milliseconds without network latency.

3. Reliability
If the LLM is down or hallucinates, we still have the locally extracted title, date, and raw content. The system degrades gracefully.
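The merge step that makes this degradation graceful is simple. A sketch, with illustrative field names: local extraction is the base record, LLM enrichment is optional icing.

```javascript
// Combine the guaranteed local fields with optional LLM enrichment.
// If the LLM call failed, `enrichment` is null and we still store a
// usable record.
function mergeRecord(local, enrichment) {
  return {
    // Always present: produced locally in Phase 1.
    title: local.title,
    date: local.date,
    content: local.content,
    // Present only when the LLM call succeeded.
    summary: enrichment?.summary ?? null,
    sentiment: enrichment?.sentiment ?? null,
  };
}
```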

The Extraction Logic

Title & Content

We use a "waterfall" strategy. Try the most reliable method (Open Graph tags); if missing, fall back to semantic HTML (<article>); if missing, use heuristics (largest text block). This ensures we get something usable from almost any site.
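A minimal sketch of that waterfall for titles, assuming regexes in place of a real HTML parser (the meta-tag pattern also assumes `property` appears before `content`, which is the common order but not guaranteed):

```javascript
// Return the first capture group of `re` in `html`, or null.
function matchFirst(html, re) {
  const m = html.match(re);
  return m ? m[1].trim() : null;
}

// Waterfall: og:title -> <title> -> first <h1> -> placeholder.
function extractTitle(html) {
  return (
    matchFirst(html, /<meta[^>]+property=["']og:title["'][^>]+content=["']([^"']+)["']/i) ||
    matchFirst(html, /<title[^>]*>([^<]+)<\/title>/i) ||
    matchFirst(html, /<h1[^>]*>([^<]+)<\/h1>/i) ||
    "Untitled"
  );
}
```

Each rung is cheaper and less reliable than the one above it, which is exactly why the reliable rung goes first.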

Relevance Scoring

We maintain a weighted dictionary of domain-specific terms ("PPA", "interconnection", "H100"). An article must cross a point threshold to be considered "intelligence." This simple filter saves us from filling our database with generic tech news.
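The scoring itself fits in a few lines. A sketch with illustrative weights and threshold — the real dictionary is larger and tuned — using naive substring counting, so short terms can match inside longer words:

```javascript
// Illustrative weighted dictionary of domain terms (lowercase).
const WEIGHTS = {
  ppa: 5,
  interconnection: 5,
  h100: 4,
  "data center": 3,
  grid: 2,
};
const THRESHOLD = 6; // illustrative cutoff

// Sum weight * occurrence count over the whole text.
function relevanceScore(text) {
  const haystack = text.toLowerCase();
  let score = 0;
  for (const [term, weight] of Object.entries(WEIGHTS)) {
    // Count every occurrence of the term, not just the first.
    const hits = haystack.split(term).length - 1;
    score += hits * weight;
  }
  return score;
}

function isRelevant(text) {
  return relevanceScore(text) >= THRESHOLD;
}
```

A sentence like "The PPA covers interconnection upgrades" scores 10 and passes; a generic gadget headline scores 0 and is discarded before any API call.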

The Bottom Line

You don't need AI for everything.

The most effective AI systems are often 20% AI and 80% solid engineering. By letting code do what code does best (scraping, filtering, formatting), we free up the AI to do what it does best (reasoning and synthesis).


See our Intelligence Feed in action at /intelligence-feed. The source code for our extractor is available in our repository.
