Dingo Engine Evolution

Has it really been two years since the first post? Wow. Ok. It’s been an interesting couple of years! Since the initial tranche of posts Ringo Dingo (RD) has changed course a little – I quickly realised that rather than being simply a decision engine, it could leverage all the useful information about decisions and put spam to good use for a change.

We’ve seen all the massive benefits of legislation such as GDPR (DPA 2018 for us in the UK), along with some side effects where organisations have probably overreacted significantly, e.g. with WHOIS data. I’ve seen most groups err on the side of caution by cloaking all WHOIS data in case they miss something that could be classified as PII.

This happens even where the registrant is an organisation. Data protection regulations apply to data subjects, not companies, and a data subject’s records as a director are publicly available in the registers of companies worldwide (some free, some behind a paywall). My personal view, for what it’s worth, is that if you’re a director and you’re already on the public record as such, the WHOIS entry need only contain the corporate registration detail.

Now there are ways round this if you have a legitimate query – I’ve had positive outcomes from conversations with every registrar or domain host I’ve needed to speak to. Of course, each has required proof of offences (duly provided), and verified I am who I say I am. One organisation considered asking me to apply for a Norwich Pharmacal order – which I completely understand given their predicament.

WHOIS made it immensely easy to track spammers and their behaviours, but it’s far from the only marker. I suspect organisations who sell anti-spam & security products have likely faced similar dilemmas and evolved to remove reliance on such markers.

And that is exactly where Ringo Dingo has gone.

Rule Types

In very high-level terms, emails contain a plethora of markers which allow us to route deterministically. Assessments can be made of active flags (sender emails, dodgy sender domains etc.); passive flags (derived from other secondary layers of information beyond just emails); geo-blacklisting; and of course proprietary factor analysis based on other indicators.

Each rule type is now configurable separately and the decision engine will allow use of regular expressions (which means header-specific rules).
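To give a feel for what a header-specific regex rule might look like, here is a minimal sketch in Python. It is not the RD implementation: the `HeaderRule` class, the `first_match` function, the rule names, the action labels and the `spam-example.test` domain are all made up for illustration. Only the standard library `re` and `email` modules are used.

```python
import re
from dataclasses import dataclass
from email import message_from_string
from email.message import Message
from typing import List, Optional


@dataclass(frozen=True)
class HeaderRule:
    """Hypothetical rule: test one named header against a regular expression."""
    name: str
    header: str
    pattern: "re.Pattern[str]"
    action: str  # illustrative labels only, e.g. "reject" or "quarantine"


# Example rules (assumptions, not real RD rules).
RULES: List[HeaderRule] = [
    # Senders that differ only by subdomain, e.g. mail7.spam-example.test
    HeaderRule("burner-subdomains", "From",
               re.compile(r"@[a-z0-9-]+\.spam-example\.test>?$", re.I), "reject"),
    # Subject lines using familiar advance-fee wording
    HeaderRule("prize-subject", "Subject",
               re.compile(r"\b(beneficiary|lottery)\b", re.I), "quarantine"),
]


def first_match(msg: Message, rules: List[HeaderRule] = RULES) -> Optional[HeaderRule]:
    """Return the first rule whose pattern matches any value of its header."""
    for rule in rules:
        for value in msg.get_all(rule.header, []):
            if rule.pattern.search(str(value).strip()):
                return rule
    return None


if __name__ == "__main__":
    raw = "From: Bob <bob@mail7.spam-example.test>\nSubject: Hello\n\nHi there"
    hit = first_match(message_from_string(raw))
    print(hit.name if hit else "no rule matched")
```

Because each rule names the header it applies to, the same mechanism covers sender, subject or any other header without special cases.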

Decision Engine Performance

Fairly early on it became apparent that caching base factors for re-runs e.g. domain meta-data, IP variations, etc would mean massive reductions of analysis times in many cases. Some early code was … well… rubbish experimental code frankly and needed refactoring.
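The caching idea can be sketched roughly as below. This is an assumption-laden illustration rather than the real caching tier: `TTLCache`, `slow_domain_lookup` and the one-hour expiry are hypothetical names and values, and the lookup simply returns a placeholder dictionary where a real enrichment would make network calls.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Tiny in-memory cache: reuse a loaded value until it is older than ttl_seconds."""

    def __init__(self, loader: Callable[[str], dict], ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._store: Dict[str, Tuple[float, dict]] = {}

    def get(self, key: str) -> dict:
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]          # fresh enough: skip the expensive lookup
        value = self._loader(key)  # cache miss or expired entry: reload
        self._store[key] = (now, value)
        return value


def slow_domain_lookup(domain: str) -> dict:
    """Hypothetical stand-in for an expensive enrichment (registration data, IPs...)."""
    return {"domain": domain, "looked_up": True}


if __name__ == "__main__":
    cache = TTLCache(slow_domain_lookup, ttl_seconds=3600)
    cache.get("spam-example.test")         # first scan pays for the lookup
    print(cache.get("spam-example.test"))  # re-run is served from memory
```

The saving comes from re-runs: the second scan of a message from the same domain never touches the slow lookup until the entry expires.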

Having moved away from MySQL and its family, I’ve elected for PostgreSQL for relational and Mongo for non-relational data stores. Both are performing well but I do miss some aspects of the MSSQL access model – and full stored procedure capabilities.

Over the last two years the average scan time for emails has gone from about 3.1s to roughly 0.75s. Obviously the decision engine portion of that is small but the enrichments take varying amounts of time.

Change In Architecture

In the background there are now two types of decision engine housing – one based on a Postfix milter which provides rule-based decisions, and another designed as a configurable Thunderbird extension. The idea is that the engine is agnostic to your choice of email provider and you can opt for whatever level of response is deemed appropriate:

  • Had an email address compromised and only ever get spam on it? Hey, let’s just reject all emails to that email address via an active capture rule
  • Maybe we note a string of similar sender factors that vary only by subdomain or other similar string – regex-based capture rules now deal with those reducing
  • Have a sender who is a persistent offender? Probably easier to apply an active rule, or use passive rules if they try and use multiple sender domains or addresses
  • Perhaps we’re a bit tired of Nigerian princes and UN Beneficiary funds from Vietnam. Probably easier just to block all emails relating to those locations
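The scenarios above could be modelled along these lines. Treat it as a sketch under stated assumptions: the `Email` and `RuleSet` classes, the `decide` function, the result strings and the order in which rules are checked are all hypothetical, and `origin_country` is assumed to be filled in by an earlier enrichment step as a two-letter country code.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Email:
    recipient: str
    sender: str
    origin_country: Optional[str] = None  # assumed output of a geo enrichment step


@dataclass
class RuleSet:
    burned_recipients: Set[str] = field(default_factory=set)  # compromised addresses
    blocked_senders: Set[str] = field(default_factory=set)    # persistent offenders
    blocked_countries: Set[str] = field(default_factory=set)  # two-letter codes


def decide(email: Email, rules: RuleSet) -> str:
    """Check the rules in a fixed (assumed) order and return the first verdict."""
    if email.recipient.lower() in rules.burned_recipients:
        return "reject:burned-recipient"
    if email.sender.lower() in rules.blocked_senders:
        return "reject:blocked-sender"
    if email.origin_country and email.origin_country.upper() in rules.blocked_countries:
        return "reject:geo"
    return "accept"


if __name__ == "__main__":
    rules = RuleSet(
        burned_recipients={"old-address@example.test"},
        blocked_senders={"pest@spam-example.test"},
        blocked_countries={"NG", "VN"},
    )
    print(decide(Email("me@example.test", "prince@mail.test", "ng"), rules))
```

Keeping each rule type in its own set mirrors the idea that rule types are configured separately, and a regex-based rule type would slot in as another check.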

There are many more scenarios, and the detection sophistication has increased with smarter spammers trying to hide the obvious markers. It’s been quite an interesting challenge and to suit this the focus has moved from Gnome Evolution on the client side to Mozilla Thunderbird to make integration development easier (allegedly).

The (assumed) hosted email solution feels like it works in every model, and putting an MTA in front to act as an initial filter is definitely an option. I’ve gone through two different email providers in the last year, eventually settling on one which provides expression-based, domain-based and indicator-based (SPF, DMARC) blocking rules. More importantly, the latest provider allows org-level choices about whether to quarantine or simply reject email entirely based on your rules.

In this scenario I didn’t feel the RD MTA was needed – plus that’s one less component to maintain.

High Level View

It’s a pretty simple layout, and all spam-detection events are carried through to a dedicated topic for later use in analysis. You don’t need the whole emails for this and attachments aren’t necessary at all but security is key here. Later iterations will use the event stream for real-time analysis, but I don’t have the time or the driver to complete that just yet (other projects are now taking priority).

Given the ready availability of ML in all three of the big cloud vendors, it won’t be too difficult to provision ML-ops to do the job. Defining the logic for the learning models… well that’s a lot more difficult!

For now though, simple grouping of events by recipient, date and sender email will show the patterns of distribution of data. I can easily discern who sold which data-set to whom, and roughly when they started using it.
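As a rough illustration of that grouping, the sketch below counts events per recipient, day and sender, and records when each sender first hit each recipient address. The event shape (a dictionary with `timestamp`, `recipient` and `sender` keys) and the function names `group_events` and `first_seen` are assumptions, since the real event schema isn’t shown here.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, Iterable, Mapping, Tuple


def group_events(events: Iterable[Mapping[str, str]]) -> Counter:
    """Count spam events per (recipient, ISO date, sender)."""
    counts: Counter = Counter()
    for event in events:
        day = datetime.fromisoformat(event["timestamp"]).date().isoformat()
        counts[(event["recipient"], day, event["sender"])] += 1
    return counts


def first_seen(events: Iterable[Mapping[str, str]]) -> Dict[Tuple[str, str], datetime]:
    """Earliest timestamp at which each sender reached each recipient address."""
    earliest: Dict[Tuple[str, str], datetime] = {}
    for event in events:
        ts = datetime.fromisoformat(event["timestamp"])
        key = (event["recipient"], event["sender"])
        if key not in earliest or ts < earliest[key]:
            earliest[key] = ts
    return earliest


if __name__ == "__main__":
    sample = [  # made-up events for illustration
        {"timestamp": "2023-01-05T10:00:00", "recipient": "shop-a@example.test",
         "sender": "x@spam-example.test"},
        {"timestamp": "2023-01-05T18:30:00", "recipient": "shop-a@example.test",
         "sender": "x@spam-example.test"},
        {"timestamp": "2023-01-02T09:00:00", "recipient": "shop-a@example.test",
         "sender": "y@other-example.test"},
    ]
    for key, count in sorted(group_events(sample).items()):
        print(key, count)
```

If each recipient address was only ever given to one organisation, the first-seen dates per sender give a rough timeline of when that address started circulating.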

Dingo’s Future

A plateau has now been established with the core rating operating as a systemd service, hosting a plain API, callable from any client – Thunderbird extension, Postfix or another plugin type. The caching tier is currently being moved from dev into cloud ops, and this very blog will likely follow suit. Having effectively ditched Ionos to go back to a combination of Azure and AWS should make this a lot more manageable (and cheaper too).

It’s fully operational and has been cataloguing events for some time now, so I suspect I’ll let it carry on in the background for a while whilst getting some of my other projects back into shape. Next steps? Well that’s absolutely shiny toy territory… automated generation of new detection rule sets, based on real-time analysis of potentially undetected spam events. The decision engine will be allowed to operate from automatically generated rule sets.

It’s been very satisfying to see the Dingo decision engine quietly push all the Trumptard spam, phishing scam and data trader-initiated emails into the deleted folder without needing to check with me – the only reason I knew they were there was because I just couldn’t resist checking up on the results!