Legal Cluster Search

The Legal Services Matcher was born out of a simple frustration: how do regular people know what type of lawyer they actually need? I mean, if you're dealing with a messy divorce, do you need a family lawyer? A divorce specialist? What if there are property disputes involved? It's confusing!

So I thought, what if we could just describe our legal problem in plain English and let AI figure out which specialist we need? That's exactly what this project does, and it was a blast to build.

The "Aha!" Moment

Picture this: you're sitting at your computer, stressed about a legal issue, and you don't even know where to start. You type something like "my landlord won't fix the broken heater and it's freezing" into the app, and boom - it tells you that you probably need a Real Estate Lawyer specializing in tenant rights. That's the magic I wanted to create.

The cool part? It's not just matching keywords. The system actually understands what you're asking about using the same technology that powers modern chatbots and search engines.

How It Actually Works (The Non-Boring Version)

The Brain: DistilBERT

At the heart of this project is something called DistilBERT - think of it as a compact, lightning-fast version of BERT, which is basically the Swiss Army knife of natural language understanding. DistilBERT has been trained on massive amounts of text and knows how words relate to each other semantically.

When you type "car accident injury," it doesn't just see three separate words. It understands that concepts like "vehicular collision," "bodily harm," and "compensation claims" are all related. Pretty neat, right?

The Memory: Embedding Vectors

Here's where it gets a bit sci-fi. Every legal description in our system gets converted into a 768-dimensional vector. I know, I know - "What the heck is a 768-dimensional vector?"

Think of it like this: if we were describing lawyers in 3D space, we might use axes like "corporate vs. personal," "criminal vs. civil," and "litigation vs. advisory." Now imagine we have 768 of these axes instead of 3, capturing incredibly nuanced differences between legal specializations. Each lawyer description becomes a point in this high-dimensional space.

When you search for something, we convert your query into the same kind of point, then find the closest lawyer-points using basic geometry (cosine similarity, if you want to get technical). Birds of a feather flock together, and in our case, similar legal needs cluster in the same regions of this 768D space!
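If you'd like to see that geometry in code, here is a minimal sketch of cosine similarity in plain Python. The function name and the tiny 3-D vectors are made up for illustration; the real embeddings have 768 dimensions, and the project itself leans on scikit-learn for this step rather than hand-rolled math.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy 3-D stand-ins for the real 768-D embeddings (values are made up).
query = [0.9, 0.1, 0.2]
tenant_rights = [0.8, 0.2, 0.1]
tax_law = [0.1, 0.9, 0.3]

print(cosine_similarity(query, tenant_rights))  # close to 1.0
print(cosine_similarity(query, tax_law))        # noticeably lower
```

A score near 1 means the two vectors point the same way, which is what "similar legal need" looks like in this space.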

The Storage: Not Fancy, But It Works

Right now, everything gets saved in pickle files. Are they the most elegant solution? Nope. But they work perfectly for a proof-of-concept! Every time someone adds a new legal specialization, we:

Convert the description to an embedding
Save it to description_pool.pkl
Save the actual text to description_list.pkl
Move on with our lives

Simple and effective. In production, you'd definitely want a real database, but for now? This gets the job done.
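The save steps above can be sketched roughly like this. The helper names load_pickle and append_to_pool are hypothetical (the project's own function is not shown here), the embedding is a short made-up list rather than a real 768-D vector, and only the two file names come from the project.

```python
import os
import pickle
import tempfile

def load_pickle(path, default):
    """Return the unpickled object at path, or default if the file is missing."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)

def append_to_pool(embedding, description, pool_path, list_path):
    """Append one embedding and its text, then persist both files together."""
    pool = load_pickle(pool_path, [])
    descriptions = load_pickle(list_path, [])
    pool.append(list(embedding))
    descriptions.append(description)
    # Write both files in the same call so the two lists never drift apart.
    for path, obj in ((pool_path, pool), (list_path, descriptions)):
        with open(path, "wb") as f:
            pickle.dump(obj, f)
    return len(pool)

# Demo in a throwaway directory; the embedding is a short made-up vector.
workdir = tempfile.mkdtemp()
pool_path = os.path.join(workdir, "description_pool.pkl")
list_path = os.path.join(workdir, "description_list.pkl")
count = append_to_pool([0.1, 0.2, 0.3], "Handles tenant rights disputes", pool_path, list_path)
print(count)
```

Keeping the two appends in one function is the point: position i in the pool always matches position i in the description list.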

Features That Make Life Easier

1. Smart Semantic Search

This isn't your grandpa's keyword search. Type "my neighbor's dog bit my kid" and the system knows you're probably looking for a Personal Injury Lawyer, even though you never used those exact words.

The k-Nearest Neighbors algorithm looks at your query's embedding and finds the top N most similar legal specializations. You can choose how many results you want - whether it's the top 1 match or top 5 alternatives.
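Here is a rough sketch of that lookup using scikit-learn's NearestNeighbors with cosine distance. The function name top_matches, the three descriptions, and the 4-D vectors are all invented for the example; the real pool holds 768-D DistilBERT embeddings, and the actual project code may be organized differently.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy 4-D stand-ins for the 768-D description embeddings (values are made up).
descriptions = ["Personal Injury Lawyer", "Tax Lawyer", "Real Estate Lawyer"]
pool = np.array([
    [0.9, 0.1, 0.0, 0.1],
    [0.0, 0.9, 0.2, 0.0],
    [0.1, 0.0, 0.9, 0.2],
])

def top_matches(query_embedding, n_items=1):
    """Return the n_items descriptions closest to the query by cosine distance."""
    n_items = min(n_items, len(descriptions))  # never ask for more than we have
    index = NearestNeighbors(n_neighbors=n_items, metric="cosine").fit(pool)
    distances, indices = index.kneighbors(np.asarray([query_embedding]))
    return [(descriptions[i], float(d)) for i, d in zip(indices[0], distances[0])]

for name, distance in top_matches([0.8, 0.2, 0.1, 0.0], n_items=2):
    print(name, round(distance, 3))
```

Cosine distance is 1 minus cosine similarity, so smaller numbers mean better matches and the results come back best-first.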

2. Growing Knowledge Base

Started with 29 legal specializations, but you can add more anytime! Each new addition gets its own embedding and becomes part of the searchable pool. The knowledge base grows without the model needing any retraining.

Want to add "Cryptocurrency Fraud Lawyer"? Just POST the description to the API and it's instantly searchable. The model already knows what cryptocurrency and fraud mean, so it'll automatically place this new specialty in the right neighborhood of our 768D space.

3. RESTful API

Built with Flask, the API is clean and straightforward:

Adding descriptions:

POST /submit
{
  "description": "Specializes in tech startup litigation and venture capital disputes"
}

Searching:

GET /api/search
{
  "to_search": "my startup co-founder is stealing money",
  "n_items": 3
}

No authentication (yet), no complicated setup. Just JSON in, JSON out.
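For a feel of what calling these endpoints might look like from Python, here is a small client-side sketch using only the standard library. The base URL is an assumption (Flask's default development address), build_request is a hypothetical helper, and the response format isn't documented above, so the sketch stops at building the requests.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5000"  # assumption: Flask's default dev address

def build_request(method, path, payload):
    """Build a JSON request for the API without sending it."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )

submit = build_request("POST", "/submit", {
    "description": "Specializes in tech startup litigation and venture capital disputes",
})
search = build_request("GET", "/api/search", {
    "to_search": "my startup co-founder is stealing money",
    "n_items": 3,
})

print(submit.get_method(), submit.full_url)
print(search.get_method(), search.full_url)

# With the server running, each request could be sent like this:
# with urllib.request.urlopen(search) as response:
#     print(json.load(response))
```

One thing to keep in mind: a GET request that carries a JSON body is unusual, and some HTTP clients and proxies drop it, so it's worth testing with whatever client you plan to use.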

4. Pre-loaded with Real Specializations

The system comes with descriptions for:

Divorce Lawyers
Criminal Defense
Personal Injury
Real Estate
Corporate Law
Intellectual Property
Tax Law
Immigration
And 21 more!

Each one carefully crafted to give the model good semantic understanding of what that specialist does.

The Tech Stack (For the Curious)

Backend: Flask - because sometimes you just need a lightweight framework that doesn't overthink things

AI/ML:

PyTorch for the heavy lifting
HuggingFace Transformers for easy model access
DistilBERT specifically because it's fast and doesn't need a GPU

Similarity Search: scikit-learn's NearestNeighbors with cosine distance (the classic approach that just works)

Data Handling: Pandas for reading that initial CSV, pickle for persistence

The Journey: What I Learned

Challenge #1: Model Loading Time

First version took like 10 seconds to load the model on every request. Terrible user experience. Solution? Load once, save to pickle, reuse forever. Now loading is instant.
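The "load once, save to pickle, reuse" idea can be sketched like this. The names load_or_build and slow_build are hypothetical, and a small dictionary stands in for the real model so the example runs anywhere; treat it as an illustration of the pattern, not the project's actual loader.

```python
import os
import pickle
import tempfile

def load_or_build(cache_path, build):
    """Unpickle the cached object if present; otherwise build it and cache it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    obj = build()
    with open(cache_path, "wb") as f:
        pickle.dump(obj, f)
    return obj

build_calls = []

def slow_build():
    """Stand-in for the expensive model load; records each invocation."""
    build_calls.append(1)
    return {"name": "stand-in model"}

cache_path = os.path.join(tempfile.mkdtemp(), "model.pkl")
first = load_or_build(cache_path, slow_build)   # builds and writes the cache
second = load_or_build(cache_path, slow_build)  # reads the cache instead
print(len(build_calls))
```

In a Flask app, the returned object would typically also be kept in a module-level variable, so even the unpickling happens only once per process.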

Challenge #2: Embedding Pooling

DistilBERT gives you embeddings for each token in your text. But we need ONE vector per description. Mean pooling (averaging all token embeddings) turned out to be the sweet spot - simple but effective.
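Here is a minimal NumPy sketch of mean pooling. The function name mean_pool and the tiny 2-D token vectors are made up, and the project works with PyTorch tensors rather than NumPy arrays. The attention mask is my own addition for the case where inputs are padded to a common length; with a single unpadded description it reduces to a plain average of all token embeddings.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors into one vector, ignoring padding positions."""
    token_embeddings = np.asarray(token_embeddings, dtype=float)  # (tokens, dim)
    mask = np.asarray(attention_mask, dtype=float)[:, None]       # (tokens, 1)
    total = (token_embeddings * mask).sum(axis=0)
    count = max(mask.sum(), 1.0)  # guard against an all-padding input
    return total / count

# Three real tokens plus one padding row, in 2-D instead of 768-D.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [0.0, 0.0]]
mask = [1, 1, 1, 0]
print(mean_pool(tokens, mask))
```

The output has the same width as a single token embedding, which is how a description of any length ends up as one fixed-size vector.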

Challenge #3: Keeping Embeddings and Descriptions in Sync

Had a bug early on where I'd add embeddings but forget to add the corresponding descriptions. Users would get back index numbers instead of actual lawyer descriptions. Facepalm moment. Fixed by always updating both lists together in tokenise_append_to_pool().

Real-World Use Cases

Scenario 1: The Confused Client
Someone types: "my boss fired me because I got pregnant"
System returns: Labor and Employment Lawyer specializing in discrimination

Scenario 2: The Startup Founder
Query: "need help with patent for my AI invention"
Result: Intellectual Property Lawyer with patent expertise

Scenario 3: The Homeowner
Input: "contractor abandoned my kitchen renovation halfway"
Output: Construction Lawyer who handles contractor disputes

What's Next?

This is very much a V1. Here's what I'm thinking for the future:

Better Storage: Move from pickle files to PostgreSQL or MongoDB. Maybe add caching with Redis.

Smarter Matching: Could implement hybrid search combining semantic similarity with keyword matching. Best of both worlds.

User Accounts: Let law firms create profiles, add their specializations, get matched with clients.

Multi-language Support: DistilBERT has multilingual versions. Could help people find lawyers in Spanish, French, etc.

Location Awareness: Add geographic filtering. Find similar specialists in your area.

Feedback Loop: Let users rate matches. Use this data to fine-tune the system over time.

Why This Matters

Legal help is expensive and intimidating. The first step - just figuring out what kind of lawyer you need - shouldn't be another barrier. This project is a tiny step toward making legal services more accessible.

Plus, it's a great demonstration of how modern NLP can solve real-world problems without needing massive computing resources or complicated infrastructure. The entire thing runs on a basic laptop.

Want to Try It?

Clone the repo, run pipenv install, initialize the models, and fire up the Flask server. Within minutes you'll have your own legal matching engine running locally.

The codebase is straightforward - no architectural astronautics here. Just clean, readable Python that gets the job done. Perfect for learning about NLP, embeddings, and similarity search without drowning in complexity.

Final Thoughts

Building this project taught me that sometimes the best solutions aren't the most complex ones. A relatively small model (DistilBERT), a simple similarity search algorithm (k-NN), and a basic web framework (Flask) combined to create something genuinely useful.

The real magic isn't in any single technology - it's in how they work together to solve a real problem. And honestly? That's what makes software engineering fun.

If you're interested in NLP, semantic search, or just want to see how modern language models can be practically applied, this project is a great starting point. The code is accessible, the concepts are fascinating, and the potential applications are endless.

Now go forth and match some lawyers! ⚖️✨