
Dingo Engine Evolution


Has it really been two years since the first post? Wow. Ok. It’s been an interesting couple of years! Since the initial tranche of posts RD has changed course a little – I quickly realised that rather than building just a decision engine, I could leverage all the useful information about decisions and put spam to good use for a change.

We’ve seen all the massive benefits of legislation such as GDPR (DPA 2018 for us in the UK), and some of the side effects where organisations have probably overreacted significantly, e.g. with WHOIS data. I’ve seen most groups err on the side of caution by cloaking all WHOIS data in case they miss something that could be classified as PII.

Even where the registrant is an organisation – data protection regulations apply to data subjects, not companies, and a data subject’s records as a director are publicly available in the registers of companies worldwide (some free, some behind a paywall). My personal view, for what it’s worth, is that if you’re a director and you’re already on the public record as such, the WHOIS entry need only contain the corporate registration detail.

Now there are ways round this if you have a legitimate query – I’ve had positive outcomes from conversations with every registrar or domain host I’ve needed to speak to. Of course, each has required proof of offences (duly provided), and verified I am who I say I am. One organisation considered asking me to apply for a Norwich Pharmacal order – which I completely understand given their predicament.

WHOIS made it immensely easy to track spammers and their behaviours, but it’s far from the only marker. I suspect organisations who sell anti-spam & security products have likely faced similar dilemmas and evolved to remove reliance on such markers.

And that is exactly where Ringo Dingo has gone.

Rule Types

In very high-level terms, emails contain a plethora of markers which allow us to route deterministically. Assessments can be made of active flags – sender emails, dodgy sender domains etc.; passive flags – derived from secondary layers of information beyond just the emails themselves; geo-blacklisting; and of course proprietary factor analysis based on other indicators.

Each rule type is now configurable separately and the decision engine will allow use of regular expressions (which means header-specific rules).
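To make the idea concrete, here is a minimal sketch of how separately configurable rule types might be modelled – the class names, fields and patterns are all hypothetical and not RD’s actual implementation:

```python
import re
from dataclasses import dataclass

# Hypothetical rule model: "active" matches exact sender values, "regex"
# allows header-specific pattern rules, "geo" blocks by origin country.
@dataclass
class Rule:
    rule_type: str   # "active", "regex" or "geo"
    field: str       # message field the rule inspects
    pattern: str     # literal value, regular expression or country code

    def matches(self, message: dict) -> bool:
        value = message.get(self.field, "")
        if self.rule_type == "active":
            return value.lower() == self.pattern.lower()
        if self.rule_type == "regex":
            return re.search(self.pattern, value, re.IGNORECASE) is not None
        if self.rule_type == "geo":
            return message.get("origin_country") == self.pattern
        return False

rules = [
    Rule("active", "sender", "scam@example.com"),
    Rule("regex", "subject", r"beneficiary|inheritance"),
    Rule("geo", "origin_country", "XX"),
]

msg = {"sender": "prince@example.org",
       "subject": "UN Beneficiary Fund",
       "origin_country": "GB"}
print(any(r.matches(msg) for r in rules))  # the regex rule fires on the subject
```

Keeping each rule type behind the same `matches` interface is what makes them configurable independently of one another.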

Decision Engine Performance

Fairly early on it became apparent that caching base factors for re-runs – e.g. domain meta-data, IP variations, etc. – would mean massive reductions in analysis times in many cases. Some early code was… well… frankly rubbish experimental code and needed refactoring.
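The caching win is easy to demonstrate. A toy sketch (the lookup function is a stand-in for a real WHOIS/DNS/IP enrichment call, which is where the actual cost lies):

```python
import time
from functools import lru_cache

# Stand-in for an expensive enrichment lookup (WHOIS, DNS, IP reputation).
def fetch_domain_metadata(domain: str) -> dict:
    time.sleep(0.05)  # simulates a network round trip
    return {"domain": domain, "fetched_at": time.time()}

# Caching the base factors means a re-run of the decision engine for the
# same domain skips the expensive lookup entirely.
@lru_cache(maxsize=4096)
def cached_domain_metadata(domain: str) -> dict:
    return fetch_domain_metadata(domain)

start = time.perf_counter()
cached_domain_metadata("example.com")   # cold: pays the lookup cost
cold = time.perf_counter() - start

start = time.perf_counter()
cached_domain_metadata("example.com")   # warm: served from cache
warm = time.perf_counter() - start
print(f"cold={cold:.3f}s warm={warm:.6f}s")
```

In a real deployment the cache would also need a TTL, since domain meta-data goes stale – `lru_cache` here is just the simplest possible illustration.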

Having moved away from MySQL and its family, I’ve elected for PostgreSQL for relational and Mongo for non-relational data stores. Both are performing well but I do miss some aspects of the MSSQL access model – and full stored procedure capabilities.

Over the last two years the average scan time for emails has gone from about 3.1s to roughly 0.75s. Obviously the decision engine portion of that is small but the enrichments take varying amounts of time.

Change In Architecture

In the background there are now two types of decision engine housing – one based on a Postfix milter which provides rule-based decisions, and another designed as a configurable Thunderbird extension. The idea is that your choice of email provider is agnostic and you can opt for whatever level of response is deemed appropriate:

  • Had an email address compromised and only ever get spam on it? Hey, let’s just reject all emails to that email address via an active capture rule
  • Maybe we note a string of similar sender factors that vary only by subdomain or other similar string – regex-based capture rules now deal with those, reducing the number of individual rules needed
  • Have a sender who is a persistent offender? Probably easier to apply an active rule, or use passive rules if they try and use multiple sender domains or addresses
  • Perhaps we’re a bit tired of Nigerian princes and UN Beneficiary funds from Vietnam. Probably easier just to block all emails relating to those locations
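The second scenario above is a good illustration of why regex capture rules earn their keep: a single pattern (hypothetical here) collapses every subdomain variant of a persistent sender into one rule:

```python
import re

# One hypothetical regex replaces a pile of per-address active rules for
# a sender that rotates subdomains.
pattern = re.compile(r"^offers@[a-z0-9-]+\.spam-mailer\.example$")

senders = [
    "offers@mail1.spam-mailer.example",
    "offers@eu-west.spam-mailer.example",
    "offers@newsletter.legit.example",
]
blocked = [s for s in senders if pattern.match(s)]
print(blocked)  # only the two spam-mailer.example variants match
```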

There are many more scenarios, and detection sophistication has had to increase as smarter spammers try to hide the obvious markers. It’s been quite an interesting challenge, and to suit this the focus has moved from Gnome Evolution on the client side to Mozilla Thunderbird, to make integration development easier (allegedly).

The (assumed) hosted email solution feels like it works in every model, and putting an MTA in front to act as an initial filter is definitely an option. I’ve gone through two different email providers in the last year, eventually settling on one which provides expression-based, domain-based and indicator-based (SPF, DMARC) blocking rules. More importantly, the latest provider allows org-level choices about whether to quarantine or simply reject email entirely based on your rules.

In this scenario I didn’t feel the RD MTA was needed – plus that’s one less component to maintain.

High Level View

It’s a pretty simple layout, and all spam-detection events are carried through to a dedicated topic for later use in analysis. You don’t need whole emails for this – attachments aren’t necessary at all – but security is key here. Later iterations will use the event stream for real-time analysis, but I don’t have the time or the driver to complete that just yet (other projects are now taking priority).
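One possible shape for those events – entirely an assumption on my part, not RD’s actual schema – keeps only the analysis fields and hashes the recipient, so the topic never holds bodies, attachments or plain recipient addresses:

```python
import hashlib
import json
from datetime import datetime, timezone

# Hypothetical event payload for the detection topic: no body, no
# attachments, recipient hashed so the stream holds no plain addresses
# (hashing still allows grouping by recipient later).
def make_spam_event(sender: str, recipient: str, rule_id: str) -> str:
    event = {
        "sender": sender,
        "recipient_hash": hashlib.sha256(recipient.encode()).hexdigest(),
        "rule_id": rule_id,
        "detected_at": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(event)  # ready to publish to the topic

print(make_spam_event("prince@example.org", "me@mydomain.example", "regex-042"))
```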

Given the ready availability of ML in all three of the big cloud vendors, it won’t be too difficult to provision ML-ops to do the job. Defining the logic for the learning models… well that’s a lot more difficult!

For now though, simple grouping of events by recipient, date and sender email will show the patterns of distribution of data. I can easily discern who sold which data-set to whom, and roughly when they started using it.
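That grouping needs nothing clever – with toy events standing in for the detection topic, a sketch might look like:

```python
from collections import defaultdict

# Toy events standing in for the detection topic.
events = [
    {"recipient": "shop@me.example", "sender": "deals@trader-a.example", "date": "2021-03-01"},
    {"recipient": "shop@me.example", "sender": "deals@trader-a.example", "date": "2021-03-04"},
    {"recipient": "shop@me.example", "sender": "promo@trader-b.example", "date": "2021-05-12"},
]

# Group by (recipient, sender); the earliest date per group shows roughly
# when each sender started using that address.
groups = defaultdict(list)
for e in events:
    groups[(e["recipient"], e["sender"])].append(e["date"])

for (recipient, sender), dates in sorted(groups.items()):
    print(recipient, sender, "first seen:", min(dates))
```

If one address only ever leaked to one place, the first-seen dates per sender line up neatly with who sold which data-set to whom.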

Dingo’s Future

A plateau has now been established, with the core rating engine operating as a systemd service, hosting a plain API callable from any client – Thunderbird extension, Postfix or another plugin type. The caching tier is currently being moved from dev into cloud ops, and this very blog will likely follow suit. Effectively ditching Ionos to go back to a combination of Azure and AWS should make this a lot more manageable (and cheaper too).

It’s fully operational and has been cataloguing events for some time now, so I suspect I’ll let it carry on in the background for a while whilst getting some of my other projects back into shape. Next steps? Well, that’s absolutely shiny toy territory… automated generation of new detection rule sets, based on real-time analysis of potentially undetected spam events. The decision engine will be allowed to operate from automatically generated rule sets.

It’s been very satisfying to see the Dingo decision engine quietly push all the Trumptard spam, phishing scam and data trader-initiated emails into the deleted folder without needing to check with me – the only reason I knew they were there was because I just couldn’t resist checking up on the results!

MariaDB & MySQL


The transport layer security dance. Like it?

I was pretty disappointed with both of these RDBMS, having been catching up on the latest developments recently. For almost the last fifteen years I’ve been involved with high-transaction-rate, low-latency platforms and solutions – almost entirely focusing on MS SQL Server, Oracle and PostgreSQL in the relational world. Non-relational solutions provide capability in other areas, but MySQL isn’t really capable of the “big stuff” on its own (without some serious middleware, a.k.a. Google and Wikipedia), though it’s been a favourite of mine in the past where SME-level solutions have needed open-source approaches.

You may remember also that Big Red scooped up Sun in 2009, catching MySQL in the same net – much to the consternation of MySQL’s users. Oracle’s reputation isn’t good in the vendor space, nor has it got a great track record on technical / developer support.

In fact, such was Oracle’s concern at an open-source contender like MySQL gaining traction in Oracle’s own enterprise RDBMS space that it tried pulling the rug from under MySQL’s feet a few times – most notably by buying the BerkeleyDB engine a few years prior to the Sun acquisition. Shortly after that sale took place MySQL dropped the engine, as it was barely used anyway.

New feature requests and ideas being mulled over on the MySQL dev side would also potentially mean that MySQL could go toe-to-toe with Oracle RDBMS in most scenarios.

Oracle weren’t going to give up so easily and the development community were acutely aware of this.

So much so that Michael Widenius led the “Save MySQL” campaign, petitioning the European Commission to stop the merger. That effort was largely unsuccessful, so he forked MySQL the day the sale was announced, created the MariaDB moniker and wandered out of the Sun enclosure. A number of Sun’s MySQL developers followed him, such was the concern at Oracle’s actions.

That tale being told, I’m still bewildered as to why repository builds of both MySQL Community and MariaDB are built with YaSSL – an ageing TLS library from WolfSSL. WolfSSL don’t even link to the product directly on their website anymore, nor include it in their product comparison chart.

OpenSSL, WolfSSL and BoringSSL are just three of the transport-layer security libraries available; it seemed counterintuitive to compile with a library that doesn’t even support TLS 1.2. From what I can see it doesn’t support EC ciphers at all either. Strange that we as a community would want to continue using it.
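For contrast, it’s easy to see what a modern TLS stack offers. A quick check against the OpenSSL build behind Python’s `ssl` module – just an illustration of the capabilities in question, nothing to do with MariaDB’s build itself:

```python
import ssl

# Inspect the local OpenSSL build for the two capabilities the post notes
# YaSSL lacks: TLS 1.3 support and EC (ECDHE) key-exchange ciphers.
ctx = ssl.create_default_context()
ec_ciphers = [c["name"] for c in ctx.get_ciphers() if "ECDHE" in c["name"]]

print("TLS 1.3 available:", ssl.HAS_TLSv1_3)
print("EC key-exchange ciphers offered:", len(ec_ciphers))
```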

But with some digging I found Debian bug 787118 from 2014. Digging further into that it became clear that this was a legal issue:

“OpenSSL cannot be linked to from GPL code without the authors providing an exception to clause 6 of the GPL, since the OpenSSL license imposes restrictions incompatible with the GPL.”

I just learned something new. There’s a bit of back-and-forth on that same Debian bug and an inference that the YaSSL reference will be replaced by CYaSSL. Sounds better?

Nope. The first remark visible on the CYaSSL github repo is to warn people to use WolfSSL instead. In fact it looks like CYaSSL hasn’t been updated since 2015 / 3.3.2 and the patch to it was never implemented.

(Screenshot: default MariaDB install SSL status – YaSSL??)

So no OpenSSL or BoringSSL, and no sign of the MariaDB maintainers looking to compile with WolfSSL instead of obsolete libraries? WolfSSL is also an embedded TLS library, and it supports TLS 1.3 and EC ciphers. I know there’s a lot of talk about how clunky & unwieldy the OpenSSL code base has become, so I can see mileage in opting for something more efficient.

Frustratingly, I can see that there has been work to this end on the MariaDB JIRA, which appears linked to 10.4. Looks like the work is done in upstream versions but Debian unstable hasn’t seen a whiff of this yet, with mariadb-10.3 still hogging the limelight. Unfortunately this highlights one of the few disadvantages with widely-used FOSS products, in that time is required to properly evaluate and then commit to replacing something as important as security libraries.

AWS enforces certificate-based authentication, which is really useful in this context, and AWS do offer MariaDB RDS instances, so I suspect the underlying OS is not Debian Stable. However, this shows that it is possible and operational. For now I just don’t see the value in setting up stunnel to work around the problem, as this essentially means the mechanism used for development won’t replicate the mechanism in prod.

That matters because of a basic principle: we should never develop against a security design that is instantly & fundamentally different from the one outside the dev environment.

I’m not comfortable using non-EC transport-layer security for something as serious as security data & meta-data, especially in data centre scenarios. Think I’ll let MariaDB get this under control before exploring any further, but I fully intend to write both PostgreSQL and MariaDB-compatible data layers for Ringo Dingo eventually.

For now though, I’ll continue to focus the work on PostgreSQL & OpenSSL (for its sins).