From TF-IDF to Related Posts in 40 Lines of Python

Practical Python walkthrough of TF-IDF and cosine similarity for blog related-post systems, using scikit-learn and a real WordPress dataset.


Updated May 2026. Rewritten with cleaner scikit-learn code, a working WordPress example, honest notes on when sentence embeddings beat TF-IDF, and production observations from running this on our own site.

TF-IDF and cosine similarity is the cheapest way to build a related-posts system on a content site, and it still works well in 2026. It runs in milliseconds on a laptop, requires no external API, and the output is usually good enough that readers do not notice they are reading machine-picked recommendations.

We use it on this site. We are Osher Digital, a Brisbane-based AI and automation consultancy, and the related-posts widgets at the foot of every blog page are powered by exactly the recipe below. This guide is the working code, plus the observations from running it across our 200-plus articles and updating it nightly.

If you have arrived here looking for the cosine similarity formula or the TF-IDF maths, those are in section two, kept short. The longer half is how to wire it together end-to-end and where it falls down.


Why TF-IDF Still Beats Sentence Embeddings (Here)

The temptation in 2026 is to reach for sentence-transformers, embed every post, store vectors in pgvector, and call cosine similarity on those. That is the right answer for a lot of problems and the wrong answer for related-posts on a content site. Five reasons.

Embedding 200 posts with an API model costs about $0.20 USD; running it on a self-hosted MiniLM costs nothing but adds a model dependency and a deploy step. TF-IDF needs no model and no API. The dependency surface is scikit-learn, which you almost certainly have.

Sentence embeddings reward semantic similarity, which is sometimes the wrong objective. For technical blogs, you want posts about Docker to link to other posts about Docker, not posts about “container ecosystems” in general. TF-IDF rewards literal vocabulary overlap, which on a tightly-themed blog is what you actually want.

TF-IDF is interpretable. When the suggested related post looks wrong, you can print the top terms by TF-IDF weight and immediately see why. Debugging is a one-liner, not an embedding-space visualisation.

The cold-start problem is real for small blogs. A new post embedded by a transformer model has no relationship to your existing corpus’s term distribution until you re-vectorise. With TF-IDF you re-fit the vectoriser in milliseconds and ship the new vectors.

And finally: for content under 5,000 posts, the quality difference does not justify the operational cost. We have measured it. Move to embeddings when your corpus crosses 5,000-10,000 documents or when you specifically need semantic recall over vocabulary overlap.


The Maths, Quickly

Two ideas. They are simple enough that you can hold them in your head without notes.

TF-IDF scores a word’s importance to a document. Term frequency (how often the word appears in this document) multiplied by inverse document frequency (the log of how rare the word is across the corpus). Common words like “the” score low because they appear everywhere. Rare-but-on-topic words like “kubernetes” score high in posts about Kubernetes.

TF(t, d) = count of t in d / total terms in d
IDF(t)   = log(N / number of documents containing t)
TF-IDF(t, d) = TF(t, d) * IDF(t)
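
To make the three formulas concrete, here is a minimal pure-Python sketch that applies them exactly as written to a made-up three-document corpus. The function names and the corpus are illustrative assumptions, and the tokenisation is a plain whitespace split.

```python
import math

def tf(term, doc_tokens):
    # Share of this document's tokens that are `term`.
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    # Assumes the term occurs in at least one document (otherwise this divides by zero).
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

corpus = [
    "the kubernetes cluster runs the pods".split(),
    "the docker image runs the app".split(),
    "the blog covers the news".split(),
]

print(tf_idf("the", corpus[0], corpus))         # 0.0, because "the" is in every document
print(tf_idf("kubernetes", corpus[0], corpus))  # (1/6) * ln(3), roughly 0.183
```

scikit-learn's TfidfVectorizer does not use this textbook form exactly. With default settings it uses raw counts for TF, smooths IDF as ln((1 + N) / (1 + df)) + 1, and L2-normalises each row, so its numbers differ from the ones printed here while the intuition carries over.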

Cosine similarity measures the angle between two vectors. If two posts have very similar TF-IDF vectors (the same rare words showing up with similar weights), the cosine of the angle between them approaches 1. If they share no important vocabulary, the cosine approaches 0.

cos(A, B) = (A . B) / (||A|| * ||B||)

For non-negative vectors (which TF-IDF always produces) the result sits between 0 and 1. Identical content scores 1. Unrelated content scores near 0. That is the entire mechanism.
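
The formula is short enough to write out by hand. The sketch below is a plain-Python version for illustration only (the pipeline itself uses scikit-learn's cosine_similarity); the vectors are invented, and returning 0.0 for an all-zero vector is a convention we chose for the example.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for empty vectors, chosen for this example
    return dot / (norm_a * norm_b)

a = [0.5, 0.0, 0.8]

print(cosine(a, a))                # same direction: 1 (up to float rounding)
print(cosine(a, [1.0, 0.0, 1.6]))  # same direction, twice the length: still 1
print(cosine(a, [0.0, 0.3, 0.0]))  # no shared non-zero terms: 0.0
```

The second case is why cosine suits documents of different lengths: scaling a vector changes its magnitude but not its angle.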


The 40-Line Working Implementation

This is the production version, simplified to the essentials. It runs on a CSV of posts (id, title, content) and produces a JSON file with the top 5 related posts for each entry. Drop it into any static-site or CMS build pipeline.

import json
import pandas as pd
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def html_to_text(html: str) -> str:
    # The "lxml" parser needs the lxml package installed;
    # "html.parser" is the standard-library alternative.
    return BeautifulSoup(html, "lxml").get_text(separator=" ", strip=True)

def build_related(csv_path: str, output_path: str, top_n: int = 5):
    df = pd.read_csv(csv_path)
    # Empty cells load as NaN; treat them as empty strings so the
    # concatenation and the HTML parser do not choke on them.
    df["text"] = (df["title"].fillna("") + " " + df["content"].fillna("")).map(html_to_text)

    vectoriser = TfidfVectorizer(
        stop_words="english",
        ngram_range=(1, 2),
        min_df=2,
        max_df=0.85,
        sublinear_tf=True,
    )
    matrix = vectoriser.fit_transform(df["text"])
    # Dense N x N matrix of pairwise similarities; fine at a few hundred posts.
    similarities = cosine_similarity(matrix)

    related = {}
    for i, post_id in enumerate(df["id"]):
        scores = similarities[i].copy()
        scores[i] = 0  # exclude self
        # Indices of the top_n highest scores, best first.
        top_indices = scores.argsort()[::-1][:top_n]
        related[int(post_id)] = [
            {"id": int(df["id"].iloc[j]), "title": df["title"].iloc[j], "score": float(scores[j])}
            for j in top_indices if scores[j] > 0.05
        ]

    # JSON object keys are strings, so post ids are written as "123", not 123.
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(related, f, indent=2)

if __name__ == "__main__":
    build_related("posts.csv", "related.json")

That is the entire system. About 35 working lines once you strip the import statements. The four scikit-learn parameters that actually matter, plus the score threshold applied in the loop, are explained below.

ngram_range=(1, 2)

Captures unigrams (“docker”) and bigrams (“docker compose”). Bigrams matter on technical blogs because “machine learning” and “learning machine” should not be confused with each other. Going to trigrams rarely improves quality and explodes vocabulary size.

min_df=2, max_df=0.85

Terms appearing in only one post are noise (typos, one-off names). Terms appearing in over 85% of posts are noise too (your brand name, your domain). Cutting both ends tightens the signal.

sublinear_tf=True

Replaces the raw term count tf with 1 + log(tf), so a word appearing 50 times in a post does not dominate over a word appearing 5 times. Particularly important if your post lengths vary a lot (ours range from 1,500 to 4,500 words).
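
A quick illustration of the damping, using the 1 + log(tf) scaling that scikit-learn documents for sublinear_tf. The helper name sublinear is ours, and this only mimics the scaling step, not the full vectoriser.

```python
import math

def sublinear(count):
    # 1 + natural log of the raw count; absent terms stay at 0.
    return 1 + math.log(count) if count > 0 else 0.0

for count in (1, 5, 50):
    print(count, round(sublinear(count), 2))
# 1 1.0
# 5 2.61
# 50 4.91
```

A tenfold gap in raw counts (50 against 5) shrinks to a little under twofold after scaling.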

scores[j] > 0.05

The threshold below which we discard suggestions entirely. Better to show three good related posts than five with the last two being random. We set it at 0.05 on our corpus; the right value for yours depends on the term distribution.
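
One way to choose a starting value, offered as a suggestion rather than the method we used: look at the distribution of each post's best non-self score and set the cutoff below the bulk of it. The helper best_match_scores and the 3x3 matrix are illustrative; in practice you would pass the output of cosine_similarity.

```python
import numpy as np

def best_match_scores(similarities):
    """For each document, the highest similarity to any other document."""
    sims = np.array(similarities, dtype=float)  # copy, so the input is left untouched
    np.fill_diagonal(sims, 0.0)                 # ignore self-similarity
    return sims.max(axis=1)

# Stand-in for a real similarity matrix.
toy = np.array([
    [1.00, 0.40, 0.02],
    [0.40, 1.00, 0.10],
    [0.02, 0.10, 1.00],
])

best = best_match_scores(toy)
print(best)                               # best match per post
print(np.percentile(best, [10, 50, 90]))  # spread of those best matches
```

If a large share of posts have a best match barely above your threshold, the threshold is doing very little filtering.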


Preprocessing Decisions That Actually Matter

Most TF-IDF tutorials spend three pages on tokenisation, stop-word lists, and stemming. The honest take: scikit-learn’s defaults handle 90% of it, and the remaining 10% is corpus-specific. Four decisions where we override defaults.

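Strip HTML, do not feed it raw. The example above uses BeautifulSoup. Otherwise markup leaks into the vocabulary: the default tokeniser drops punctuation and single-character tokens, so a bare <p> disappears, but tag and attribute names such as div, span, class, and href become terms. Quick fix, huge quality improvement.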

Weight the title. The title is the most distinctive 5-10 words in a post. We duplicate it into the input text three times before vectorising (the simplified script above includes it only once). This is a hack and it works.

Skip stemming. Porter stemmer collapses “automate”, “automation”, and “automating” to “autom”. Sometimes useful. Often not, on technical content where “n8n” and “n8ns” should stay separate. We tested with and without and saw no quality difference on our corpus. We left it off.

Drop code blocks if your blog has many. Code blocks are full of one-off variable names, library imports, and syntax tokens that pollute the vocabulary. We strip <pre> and <code> sections from the text that goes into the vectoriser; the published post is untouched. Quality went up noticeably after we did this.
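
The title weighting and code-block stripping described above can be sketched in one small function. This is an illustration, not our production code: prepare_text and its title_weight parameter are names invented here, and it uses the standard-library "html.parser" backend to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

def prepare_text(title, html, title_weight=3):
    """Build vectoriser input: repeated title plus body text without code blocks."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove <pre> and <code> one at a time, always searching the live tree,
    # so a <code> nested inside an already-removed <pre> is never touched twice.
    tag = soup.find(["pre", "code"])
    while tag is not None:
        tag.decompose()
        tag = soup.find(["pre", "code"])
    body = soup.get_text(separator=" ", strip=True)
    return " ".join([title] * title_weight + [body])

sample = "<p>Run the stack with Compose.</p><pre><code>docker compose up -d</code></pre>"
print(prepare_text("Docker Compose Basics", sample))
```

To use it with the main script, you would build the text column from this function instead of concatenating title and content directly.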


Connecting It to WordPress

The pattern we use on this site. The Python script runs nightly via a scheduled task, pulls every published post via the WordPress REST API, computes the related-posts JSON, and writes it back as post metadata. The theme template reads the metadata at render time.

import json
import os
import requests

WP_BASE = "https://example.com/wp-json/wp/v2"
WP_AUTH = (os.environ["WP_USER"], os.environ["WP_APP_PASSWORD"])

def fetch_all_posts():
    posts, page = [], 1
    while True:
        resp = requests.get(
            f"{WP_BASE}/posts",
            params={"per_page": 100, "page": page, "_fields": "id,title,content"},
            timeout=30,
        )
        # WordPress answers 400 once the page number runs past the last page.
        if resp.status_code == 400:
            break
        resp.raise_for_status()  # any other error should stop the run
        batch = resp.json()
        if not batch:
            break
        posts.extend(batch)
        page += 1
    return posts

def write_related_to_post(post_id: int, related: list):
    resp = requests.post(
        f"{WP_BASE}/posts/{post_id}",
        json={"meta": {"_related_posts": json.dumps(related)}},
        auth=WP_AUTH,
        timeout=30,
    )
    resp.raise_for_status()  # fail loudly rather than silently skipping a post

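The _related_posts meta field is registered on the theme side with show_in_rest => true so the REST API can write it. Because the key starts with an underscore, WordPress treats it as protected, so the registration also needs an auth_callback that permits the write. The footer template loops over the field and renders the cards. The whole pipeline runs in about 12 seconds for our 200-post corpus, including the API calls.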
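
One detail the fetch code leaves implicit: the REST API returns title and content as objects with a rendered key, not as plain strings. A minimal sketch of flattening that response into the id, title, content columns the vectorising step expects. posts_to_frame is a name invented here, and the sample record is made up.

```python
import pandas as pd

def posts_to_frame(posts):
    """Flatten WordPress REST post objects into id / title / content columns."""
    rows = [
        {
            "id": post["id"],
            "title": post["title"]["rendered"],      # HTML-escaped title
            "content": post["content"]["rendered"],  # full post body as HTML
        }
        for post in posts
    ]
    return pd.DataFrame(rows, columns=["id", "title", "content"])

# Shape of one record as returned with _fields=id,title,content (values invented).
sample = [
    {"id": 7, "title": {"rendered": "Hello"}, "content": {"rendered": "<p>World</p>", "protected": False}},
]
df = posts_to_frame(sample)
print(df)
```

From here you can either write the frame to CSV or adapt build_related to accept a DataFrame directly.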

If you are running a static site (Bridgetown, Eleventy, Jekyll, Astro, Hugo), the same script writes to a JSON or front-matter file at build time. Even simpler. The original Ruby plugin we ship for Bridgetown is at bridgetown-related-posts on GitHub.


Production Observations

Three years of running this on our blog have taught us a few things the tutorials skip.

The system gets worse before it gets better. On a 20-post corpus, every post is “related” to every other post. The IDF term cannot do its job. We did not see good related-posts output until we crossed 60 posts.

Bad related posts almost always trace to thin content. If a 600-word post matches a 3,500-word post poorly, the issue is that the short post has too few distinctive terms. The fix is either to fatten the short post (better for everyone) or to weight by post length when ranking (faster but a band-aid).

A new high-traffic post can suddenly become the suggested related-post for too many other posts because its distinctive vocabulary overlaps with the rest of the corpus. We cap the number of times any single post can appear as a recommendation across the corpus. A simple post-processing pass.
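
A sketch of what such a capping pass could look like. The function name, the max_appearances parameter, and the strategy (hand out slots to the highest-scoring pairs first) are our choices for this example; the input has the same shape as the related dictionary built by the main script.

```python
from collections import Counter

def cap_recommendations(related, max_appearances=3):
    """Limit how often any one post is recommended, keeping the strongest pairs."""
    pairs = [
        (cand["score"], source_id, cand)
        for source_id, candidates in related.items()
        for cand in candidates
    ]
    # Strongest matches claim their slots first.
    pairs.sort(key=lambda pair: pair[0], reverse=True)

    counts = Counter()
    capped = {source_id: [] for source_id in related}
    for _, source_id, cand in pairs:
        if counts[cand["id"]] < max_appearances:
            capped[source_id].append(cand)
            counts[cand["id"]] += 1
    return capped

related = {
    1: [{"id": 9, "score": 0.9}, {"id": 2, "score": 0.3}],
    2: [{"id": 9, "score": 0.8}, {"id": 1, "score": 0.3}],
    3: [{"id": 9, "score": 0.2}, {"id": 1, "score": 0.1}],
}
capped = cap_recommendations(related, max_appearances=2)
print({source: [c["id"] for c in cands] for source, cands in capped.items()})
# {1: [9, 2], 2: [9, 1], 3: [1]}
```

Capping can leave a post with fewer suggestions than before, so in practice you may want to generate more candidates per post than you intend to display.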

Click-through on related-posts widgets, in our experience, plateaus around 8-12% on technical content. That is the realistic ceiling. If you are getting 2%, the related posts are wrong. If you are getting 20%, you are either lucky or your content all genuinely belongs to the same narrow topic.


When TF-IDF Is the Wrong Tool

TF-IDF struggles in a few specific shapes of content.

Heterogeneous corpora. If your blog covers cooking, software, and travel with equal weight, TF-IDF will cleanly separate the three clusters but will struggle to find good cross-cluster relations even when they exist (a post on “recipes in code” matching a cooking post). Sentence embeddings handle this better.

Synonyms that matter. If your customers search for “self-hosted” and your post uses “on-premise”, TF-IDF will not connect them. A sentence-embedding model trained on tech content will. For technical blogs where the vocabulary varies, this is the strongest case for moving up.

Very short content. Tweets, product descriptions, FAQ entries. The TF inside a 30-word document is too noisy. Use embeddings or a different similarity metric.

Cross-lingual matching. TF-IDF cannot match an English post to a French post about the same thing. Multilingual sentence-transformers can. We have not needed this on our blog, but clients running global content sites do.

If none of those apply, TF-IDF is the right tool. Resist the temptation to over-engineer.


Scaling Notes

At our scale (200 posts) the full pipeline runs in 12 seconds. Below are the points where the naïve approach starts hurting.

Over 10,000 documents: the full cosine similarity matrix is N-squared and gets memory-hungry. Use sklearn.neighbors.NearestNeighbors with metric='cosine' to compute only the top-k for each document. Drops memory from O(N²) to O(N*k).
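
A minimal sketch of the top-k approach, assuming a sparse TF-IDF matrix as input. The function name top_k_related and the four-document corpus are illustrative. Note that NearestNeighbors returns cosine distance, so the code converts back to similarity.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

def top_k_related(matrix, k=5):
    """Return, per document, up to k (index, similarity) pairs, best first."""
    # Ask for k + 1 neighbours because each document is normally its own nearest one.
    n_neighbors = min(k + 1, matrix.shape[0])
    nn = NearestNeighbors(n_neighbors=n_neighbors, metric="cosine", algorithm="brute")
    nn.fit(matrix)
    distances, indices = nn.kneighbors(matrix)

    results = []
    for i in range(matrix.shape[0]):
        pairs = [
            (int(j), 1.0 - float(d))  # distance -> similarity
            for j, d in zip(indices[i], distances[i])
            if j != i                 # drop the document itself
        ]
        results.append(pairs[:k])
    return results

docs = [
    "docker compose networking",
    "docker compose volumes",
    "kubernetes ingress controller",
    "kubernetes ingress tls",
]
matrix = TfidfVectorizer().fit_transform(docs)
results = top_k_related(matrix, k=1)
print(results)
```

The same score threshold as in the main script still applies: a nearest neighbour is not necessarily a good one.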

Over 100,000 documents: approximate nearest neighbours becomes useful. annoy, hnswlib, or pgvector with HNSW indexing. The cost is some accuracy in the top-k; you trade exact matching for sub-millisecond query time.

Incremental updates: the natural pattern is to refit the vectoriser nightly and recompute everything. That is fine until your corpus is big or your build pipeline is slow. For huge corpora, freeze the vectoriser’s vocabulary on a periodic schedule (monthly) and only re-vectorise new documents in between. Quality drifts slightly between full refits; for most use cases it is invisible.
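
The frozen-vocabulary pattern can be sketched as follows: fit once, then call transform (not fit_transform) on new posts and compare them against the stored matrix. The corpus and the new post are invented for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

existing = [
    "docker compose networking guide",
    "docker compose volumes guide",
    "kubernetes ingress controller setup",
    "kubernetes ingress tls setup",
]

# Periodic full refit: learns the vocabulary and the IDF weights.
vectoriser = TfidfVectorizer()
existing_matrix = vectoriser.fit_transform(existing)

# Between refits: reuse the frozen vocabulary and IDF weights.
# Words the vectoriser has never seen ("secrets" here) are silently ignored.
new_post = ["docker compose networking secrets"]
new_vector = vectoriser.transform(new_post)

scores = cosine_similarity(new_vector, existing_matrix).ravel()
best = int(scores.argmax())
print(best, scores.round(3))
```

The ignored out-of-vocabulary words are where the quality drift comes from: a new post about a genuinely new topic has little to match on until the next full refit.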


Frequently Asked Questions

What is the cosine similarity formula?

The cosine of the angle between two vectors A and B is cos(A, B) = (A · B) / (||A|| × ||B||), where A · B is the dot product and ||A|| is the magnitude of A. For non-negative vectors like TF-IDF the result sits between 0 and 1: identical content scores 1, unrelated content scores near 0. In Python, scikit-learn’s cosine_similarity function computes it pairwise across a matrix.

What is the difference between cosine distance and cosine similarity?

Cosine distance is 1 minus cosine similarity. Similarity of 1 (identical) equals distance of 0 (zero apart). Most libraries return one or the other depending on convention; scikit-learn returns similarity, while many nearest-neighbour libraries return distance. Worth double-checking which one you have before sorting.
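
A small check of that relationship using scikit-learn's two pairwise functions. The three vectors are arbitrary; the point is that the sort direction flips between the two conventions.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances, cosine_similarity

X = np.array([
    [1.0, 0.0],
    [1.0, 1.0],
    [0.0, 1.0],
])

sim = cosine_similarity(X)
dist = cosine_distances(X)
print(np.allclose(dist, 1 - sim))  # True

# Ranking neighbours of row 0: descending for similarity, ascending for distance.
by_similarity = np.argsort(sim[0])[::-1]
by_distance = np.argsort(dist[0])
print(by_similarity, by_distance)  # same order either way
```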

Should I use TF-IDF cosine similarity or sentence embeddings in 2026?

For a corpus under 5,000 documents, with consistent topic coverage and stable vocabulary, TF-IDF cosine similarity is faster, cheaper, more interpretable, and produces comparable quality. For larger corpora, mixed-topic content, or matching across synonyms and paraphrases, sentence embeddings (sentence-transformers, OpenAI embeddings, Cohere embeddings) usually pull ahead. Start with TF-IDF unless you have a specific reason not to.

How does cosine similarity work for text matching?

You represent each document as a vector — TF-IDF weights, sentence embeddings, or any other numerical representation. Cosine similarity measures the angle between two of these vectors. Documents that emphasise the same terms with similar weights produce vectors that point in similar directions, and their cosine similarity is close to 1. Documents that emphasise different vocabulary produce vectors pointing in different directions, with cosine similarity near 0.

Where is cosine similarity used in information retrieval?

Document ranking against a query, near-duplicate detection, related-content recommendations, clustering of documents, semantic search backends, recommender systems for non-text content (anywhere you can vector-encode items), and RAG retrieval over chunked documents in LLM applications. Different vector representations (TF-IDF, BM25, embeddings) under the same cosine metric cover most retrieval tasks.

Does using TF-IDF and cosine similarity help SEO?

Indirectly. Good related-content recommendations increase pages per session and time on site, both of which correlate with stronger search performance. The internal linking structure that TF-IDF helps you build is genuinely useful for SEO because it spreads link equity to deeper pages and gives search crawlers signals about topical relationships. The act of computing TF-IDF does nothing for SEO; the structure it lets you build does.

How much does a TF-IDF related-posts system cost to run?

Essentially nothing. The CPU time to compute related-posts for a 200-post corpus is under 15 seconds; we run it nightly inside an existing build job at no marginal cost. For larger corpora a small VPS handles tens of thousands of documents for a few dollars a month. Compare to sentence-embedding APIs which can run into hundreds of dollars per re-embed across a large corpus.

Why do my related posts look wrong?

Three usual culprits: your corpus is too small (under 60 posts the IDF signal is too weak), your similarity threshold is too low (raising the cutoff drops the noise at the cost of returning fewer suggestions), or one or two posts have very thin content that produces low TF-IDF magnitudes and matches everything weakly. Print the top-10 TF-IDF terms for the misbehaving post and the answer is usually obvious within seconds.


TF-IDF and cosine similarity is one of those techniques that survives every generation of fancier tools because it solves a specific problem cheaply and well. If you are building anything more ambitious — semantic search, multi-modal retrieval, RAG over a large corpus — we build that kind of infrastructure for clients in Brisbane and across Australia. Get in touch if you want a second opinion on the right tool for your shape of data.
