How I created a policy reader, and how it works.

Most people rarely read privacy policies, yet that's where services hide their weird behavior in plain sight. So maybe, just maybe, if there was a way to quickly get the important parts of any policy without going through the whole thing, you'd make better, more informed choices, and making that knowledge more accessible would help in fighting back against these weird behaviors.

The name of the project is ... drum-roll please ... Privacy Peek.

Underwhelming, I know. It's a work in progress.

The Idea (Put Simply)

The goal is quite simple: give you an idea of how respectful of your privacy a website or service is, without you having to go through the privacy policy yourself. A couple of years ago, this would have been a complex problem to solve, because it would require building a tool that crawls policies and extracts relevant clauses, probably with regex matching or some custom machine learning model. But I only started working on this in 2024 - a ripe time in the AI age. That meant I could instead use existing AI models to do exactly what they do best: read and write text. With that, the idea was kind of solid - give the model a policy and tell it to give you a privacy score on a scale of 1-100.

Now if you know a little something about AI, you're probably screaming at me right now, "Come on, these models are non-deterministic. You ask the same question twice and it'll give you a different answer each time." And you would be right. It would be foolish to hand a model a set of clauses you extracted and tell it to give you a score from 1 to 100, 1 being bad and 100 being good. That's vague to say the least. So here's my workaround for now.

I've defined five categories across which the system extracts clauses from policies:

  1. Data Collection

  2. Data Sharing

  3. Data Retention and Security

  4. User Rights and Controls

  5. Transparency and Clarity

This gives us a bit of structure to work with. Once we have all relevant clauses categorized neatly, we then send each set of clauses to the model along with a set of rubrics. This is the tricky part though, because it dictates how a string of words is converted to a simple number. Category scores run on a scale of 1 through 10. So the rubric for each category is an array of 10 objects. Each object has two fields: score and description. For example:

{
  score: 10,
  description: 'Only collects data absolutely essential for core service functionality with explicit, granular consent for each data type'
}

With this, the model would then compare the provided clauses against the rubric, much like a switch statement. If the behavior observed in the set of clauses matches the description for a certain score, then that's the score that is assigned for that category.
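As a rough illustration, here's how a per-category scoring prompt could be assembled. The function name and prompt wording are hypothetical sketches of the approach, not the project's actual code:

```javascript
// Hypothetical sketch: build one category's scoring prompt from its
// clauses and rubric. Rubric entries match the { score, description }
// shape shown above.
function buildCategoryPrompt(categoryName, clauses, rubric) {
  const rubricLines = rubric
    .map((r) => `${r.score}: ${r.description}`)
    .join('\n');
  const clauseLines = clauses.map((c, i) => `${i + 1}. ${c}`).join('\n');
  return [
    `Category: ${categoryName}`,
    'Score the clauses below against this rubric (1 = worst, 10 = best).',
    'Pick the single score whose description best matches the observed behavior.',
    '',
    'Rubric:',
    rubricLines,
    '',
    'Clauses:',
    clauseLines,
    '',
    'Respond with JSON: { "category": ..., "score": ..., "reasoning": "..." }',
  ].join('\n');
}
```

Pinning the model to one rubric entry, rather than asking for a free-form number, is what keeps the "switch statement" comparison honest.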

This would then need to be done five times, the result of which would be an array of five items, each taking this shape:

{
  category: [Category Name],
  score: [1-10],
  reasoning: "..."
}


The reasoning field would be a simple sentence in which the model explains why a particular score was awarded for that category.
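Since model output is ultimately untrusted text, it helps to validate each result before using it. A minimal sketch - the helper name is mine, not from the project:

```javascript
// Parse and sanity-check one category result returned by the model.
// Throws if the shape or score range is off, so bad outputs fail loudly
// instead of silently skewing the overall score.
function parseCategoryResult(raw, expectedCategories) {
  const result = JSON.parse(raw);
  if (!expectedCategories.includes(result.category)) {
    throw new Error(`Unknown category: ${result.category}`);
  }
  if (!Number.isInteger(result.score) || result.score < 1 || result.score > 10) {
    throw new Error(`Score out of range: ${result.score}`);
  }
  if (typeof result.reasoning !== 'string' || result.reasoning.length === 0) {
    throw new Error('Missing reasoning');
  }
  return result;
}
```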

With that, it would now be easy to calculate an overall score. For that, we need to assign weights to all five categories. This allows us to play around with each category's influence on the final overall score, since the alternative route, getting the average of all five, would be quite naive in my opinion.

Think of it like this: a service could score poorly in Data Collection, meaning they collect too much data, but then score well on Transparency and Clarity, meaning they are not afraid to say that they collect too much data. The high score on Transparency would give such a service an unfair edge over another that scores just a little higher on Data Collection but much worse on Transparency.

This would lead to unfair judgement, since the relatively more privacy-respecting service would be painted as less trustworthy.

To counter this, we let weights influence how much a category contributes to the overall score. The mathematical function goes something like this:

weights_total = sum of all weights (i.e., sum of Data_Collection.weight + Data_Sharing.weight + ...)
weight_x_score_sum = sum of each category's score multiplied by its weight
overall_score = 10 * (weight_x_score_sum / weights_total)

The resulting overall_score is a number between 10 and 100 (category scores start at 1, so the lowest possible weighted average is 1, which scales to 10).
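The formula above fits in a small function. The weight values here are made-up placeholders to show the mechanics, not the project's actual tuning:

```javascript
// Compute the overall score as a weighted average of category scores,
// scaled from the 1-10 range up to the 10-100 range.
function overallScore(results, weights) {
  const weightsTotal = Object.values(weights).reduce((a, b) => a + b, 0);
  const weightedSum = results.reduce(
    (sum, r) => sum + r.score * weights[r.category],
    0
  );
  return 10 * (weightedSum / weightsTotal);
}

// Placeholder weights, biased toward Data Collection and Data Sharing:
const weights = {
  'Data Collection': 3,
  'Data Sharing': 3,
  'Data Retention and Security': 2,
  'User Rights and Controls': 1,
  'Transparency and Clarity': 1,
};
```

With these weights, a Transparency score can only move the final number a third as much as a Data Collection score, which is exactly the lever needed for the scenario described above.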

This, combined with proper weighting, lets us cater for situations like the one described earlier. Once we have that, we make one final pass with the model. We provide all our findings: the category scores, the reasoning behind each score, and the computed overall score. The model then outputs a simple sentence describing why the service got that overall score and how each category influenced it.

There's definitely a lot I have not covered here, like how we go from a simple input on the website to URL discovery, fetching policy documents via web search tools, and persisting metadata in the database to avoid having to re-discover sites.

The full details are available in the project repository: https://github.com/KigoJomo/privacy-peek

The Challenge

It's evident that this project relies heavily on AI for its core functionality. But here is where things get rough. AI inference is not free, and it's not cheap either.

So there are a lot of improvements to be made in how analysis is done. Possible routes are self-hosting small models that can run queued jobs 24/7, or implementing a Bring Your Own Key (BYOK) model where power users cover much of the analysis cost and the results stay available for everyone else.

All in all, this is mostly a proof of concept - corpo talk for "I am readily welcoming all ideas and contributions".
I'm available on X, Discord and GitHub.
The journey to a more private internet begins with one AI-heavy analysis website ... I guess 😂😂