Category: General

Dingo Engine Evolution

Has it really been two years since the first post? Wow. Ok. It’s been an interesting couple of years! Since the initial tranche of posts RD has changed course a little – I quickly realised that rather than simply building a decision engine, I could leverage all the useful information about decisions and put spam to good use for a change.

We’ve seen all the massive benefits of legislation such as GDPR (DPA 2018 for us in the UK), and some of the side effects where organisations have probably overreacted significantly e.g. WHOIS data. I’ve seen most groups err on the side of caution by cloaking all WHOIS data in case they miss something that could be classified as PII.

This happens even where the registrant is an organisation. Data protection regulations apply to data subjects, not companies, and a data subject’s records as a director are publicly available in the registers of companies worldwide (some free, some behind a paywall). My personal view, for what it’s worth, is that if you’re a director and you’re already on the public record as such, the WHOIS entry need only contain the corporate registration detail.

Now there are ways round this if you have a legitimate query – I’ve had positive outcomes from conversations with every registrar or domain host I’ve needed to speak to. Of course, each has required proof of offences (duly provided), and verified I am who I say I am. One organisation considered asking me to apply for a Norwich Pharmacal order – which I completely understand given their predicament.

WHOIS made it immensely easy to track spammers and their behaviours, but it’s far from the only marker. I suspect organisations who sell anti-spam & security products have likely faced similar dilemmas and evolved to remove reliance on such markers.

And that is exactly where Ringo Dingo has gone.

Rule Types

In very high level terms, emails contain a plethora of markers which allow us to route deterministically. Assessments can be made of active flags – sender emails, dodgy sender domains etc; passive flags – derived from other secondary layers of information beyond just emails; geo-blacklisting and of course proprietary factor analysis based on other indicators.

Each rule type is now configurable separately and the decision engine will allow use of regular expressions (which means header-specific rules).
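As a rough illustration of what a header-specific regex rule could look like, here is a minimal Python sketch. It is not the Ringo Dingo code: the `HeaderRule` class, the `decide` function, the rule names, the action names and the example domains are all hypothetical, made up purely to show the idea of matching one header against one pattern.

```python
import re
from dataclasses import dataclass
from email import message_from_string
from email.message import Message


@dataclass(frozen=True)
class HeaderRule:
    """Hypothetical rule: test a single header against a regular expression."""
    name: str
    header: str
    pattern: str
    action: str  # assumed action names, e.g. "reject" or "quarantine"

    def matches(self, msg: Message) -> bool:
        # A missing header is treated as an empty string, so it never matches
        value = str(msg.get(self.header, ""))
        return re.search(self.pattern, value, re.IGNORECASE) is not None


def decide(msg: Message, rules: list, default: str = "accept") -> str:
    """Return the action of the first matching rule, else the default."""
    for rule in rules:
        if rule.matches(msg):
            return rule.action
    return default


# Illustrative rules only; the addresses and domains are placeholders
RULES = [
    HeaderRule("burned-address", "To", r"old-alias@example\.org", "reject"),
    HeaderRule("rotating-subdomains", "From",
               r"@[a-z0-9-]+\.spam-example\.net\b", "quarantine"),
]

SPAM = ("From: Offers <news@mail7.spam-example.net>\n"
        "To: me@example.org\nSubject: Hi\n\nBody\n")
CLEAN = "From: Friend <pal@example.com>\nTo: me@example.org\n\nHello\n"

spam_action = decide(message_from_string(SPAM), RULES)
clean_action = decide(message_from_string(CLEAN), RULES)
```

Rules are evaluated in order and the first match wins, which keeps the outcome deterministic. A real engine would presumably add priorities and the passive and geo rule types on top.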

Decision Engine Performance

Fairly early on it became apparent that caching base factors for re-runs e.g. domain meta-data, IP variations, etc would mean massive reductions of analysis times in many cases. Some early code was … well… rubbish experimental code frankly and needed refactoring.
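The caching idea can be sketched in a few lines of Python. This is an assumption-laden toy, not the real caching tier: `slow_domain_lookup` is a stand-in for whatever enrichment is actually expensive, and an in-process `functools.lru_cache` is used only because it is the simplest way to show the effect.

```python
from functools import lru_cache

# Counts how often the expensive lookup really runs (for demonstration only)
LOOKUPS = {"count": 0}


def slow_domain_lookup(domain: str) -> dict:
    """Hypothetical stand-in for a slow enrichment (DNS, WHOIS, reputation)."""
    LOOKUPS["count"] += 1
    return {"domain": domain, "tld": domain.rsplit(".", 1)[-1]}


@lru_cache(maxsize=4096)
def _cached_metadata(domain: str) -> tuple:
    # Store an immutable tuple so callers cannot mutate the cached entry
    return tuple(sorted(slow_domain_lookup(domain).items()))


def domain_metadata(domain: str) -> dict:
    """Normalise the key first so 'Example.COM' and 'example.com' share a slot."""
    return dict(_cached_metadata(domain.strip().lower()))


first = domain_metadata("Example.COM")
second = domain_metadata("example.com")  # served from the cache
```

Only the first call pays for the lookup; re-runs on the same domain reuse the cached result. A shared cache with expiry would be needed once more than one process is involved.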

Having moved away from MySQL and its family, I’ve elected for PostgreSQL for relational and Mongo for non-relational data stores. Both are performing well but I do miss some aspects of the MSSQL access model – and full stored procedure capabilities.

Over the last two years the average scan time for emails has gone from about 3.1s to roughly 0.75s. Obviously the decision engine portion of that is small but the enrichments take varying amounts of time.

Change In Architecture

In the background there are now two types of decision engine housing – one based on a Postfix milter which provides rule-based decisions, and another designed as a configurable Thunderbird extension. The idea is that the engine is agnostic to your choice of email provider and you can opt for whatever level of response is deemed appropriate:

  • Had an email address compromised and only ever get spam on it? Hey, let’s just reject all emails to that email address via an active capture rule
  • Maybe we note a string of similar sender factors that vary only by subdomain or other similar string – regex-based capture rules now deal with those reducing
  • Have a sender who is a persistent offender? Probably easier to apply an active rule, or use passive rules if they try and use multiple sender domains or addresses
  • Perhaps we’re a bit tired of Nigerian princes and UN Beneficiary funds from Vietnam. Probably easier just to block all emails relating to those locations

There are many more scenarios, and the detection sophistication has increased with smarter spammers trying to hide the obvious markers. It’s been quite an interesting challenge and to suit this the focus has moved from Gnome Evolution on the client side to Mozilla Thunderbird to make integration development easier (allegedly).

The (assumed) hosted email solution feels like it works in every model and putting an MTA to act as an initial filter is definitely an option. I’ve gone through two different email providers in the last year, eventually settling on one which provides expression-based, domain-based and indicator (SPF, DMARC) based blocking rules. More importantly the latest provider allows org-level choices about whether to quarantine or simply reject email entirely based on your rules.

In this scenario I didn’t feel the RD MTA was needed – plus that’s one less component to maintain.

High Level View

It’s a pretty simple layout, and all spam-detection events are carried through to a dedicated topic for later use in analysis. You don’t need the whole emails for this and attachments aren’t necessary at all but security is key here. Later iterations will use the event stream for real-time analysis, but I don’t have the time or the driver to complete that just yet (other projects are now taking priority).

Given the ready availability of ML in all three of the big cloud vendors, it won’t be too difficult to provision ML-ops to do the job. Defining the logic for the learning models… well that’s a lot more difficult!

For now though, simple grouping of events by recipient, date and sender email will show the patterns of distribution of data. I can easily discern who sold which data-set to whom, and roughly when they started using it.
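A minimal sketch of that grouping in Python follows. The event shape (keys `recipient`, `sender`, `received`) and the sample addresses are assumptions for illustration; the real event schema isn’t described here.

```python
from collections import Counter
from datetime import datetime

# Hypothetical events; each recipient address was handed to exactly one company
EVENTS = [
    {"recipient": "shop-a@example.org", "sender": "deals@trader1.example",
     "received": "2020-07-01T09:15:00"},
    {"recipient": "shop-a@example.org", "sender": "deals@trader1.example",
     "received": "2020-07-01T17:40:00"},
    {"recipient": "shop-a@example.org", "sender": "win@trader2.example",
     "received": "2020-07-03T08:00:00"},
    {"recipient": "bank-b@example.org", "sender": "win@trader2.example",
     "received": "2020-06-20T11:30:00"},
]


def group_events(events):
    """Count events per (recipient, date, sender)."""
    counts = Counter()
    for event in events:
        day = datetime.fromisoformat(event["received"]).date().isoformat()
        counts[(event["recipient"], day, event["sender"])] += 1
    return counts


def first_seen(events):
    """Earliest timestamp at which each sender used each recipient address."""
    seen = {}
    for event in events:
        key = (event["recipient"], event["sender"])
        stamp = datetime.fromisoformat(event["received"])
        if key not in seen or stamp < seen[key]:
            seen[key] = stamp
    return seen


counts = group_events(EVENTS)
starts = first_seen(EVENTS)
```

When each recipient address has only ever been given to one organisation, the `first_seen` view gives a rough date for when a given sender started using that data-set.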

Dingo’s Future

A plateau has now been established with the core rating operating as a systemd service, hosting a plain API, callable from any client – Thunderbird extension, Postfix or another plugin type. The caching tier is currently being moved from dev into cloud ops, and this very blog will likely follow suit. Having effectively ditched Ionos to go back to a combination of Azure and AWS should make this a lot more manageable (and cheaper too).

It’s fully operational and has been cataloguing events for some time now, so I suspect I’ll let it carry on in the background for a while whilst getting some of my other projects back into shape. Next steps? Well that’s absolutely shiny toy territory…. automated generation of new detection rule sets, based on real-time analysis of potentially undetected spam events. The decision engine will be allowed to operate from automatically generated rule sets.

It’s been very satisfying to see the Dingo decision engine quietly push all the Trumptard spam, phishing scam and data trader-initiated emails into the deleted folder without needing to check with me – the only reason I knew they were there was because I just couldn’t resist checking up on the results!

Thoughts on TikTok

Updated: The current attention to TikTok appears to be largely politically motivated from the Trump administration, so please fact-check all assessments on this topic.

TikTok’s sister app – Douyin – is only available within The Great Firewall of China but seems to retain a number of similarities (unconfirmed directly). However, one of the key issues was content such as deep fakes propagated on the platform, which predates the evidence collected from analysis of the app’s traffic and reverse-engineered codebase.

Love it or hate it, you cannot deny that the platform’s meteoric success generated massive popularity for the mobile app. Content on the app evolved from its lip-syncing origins into staged comedy and more, gaining more and more popularity.

Extrapolations from the codebase are more difficult due to the obfuscation used, so some of the guesses in this area are trickier to confirm. However those inferences are backed up by behavioural analysis done on the calls made by the app in sandbox environments by Talal Bakry and Tommy Mysk.

Firstly suspicion is raised because the app checks the clipboard frequently – bear in mind that this is not a word processor or IM platform so there are very few reasons why this action could be justified.

Whilst unconfirmed, there is some anecdotal evidence of concern relating to a U.S. lawsuit filed in California. The claimant in the lawsuit states that TikTok created a user profile without her permission and without any action from her, alleging that the firm sent all sorts of PII back to China. The case is ongoing and there is no preliminary finding. Separately, given that TikTok has removed content offensive to the Chinese government, it appears that the platform has the capability to lock out devices belonging to those posting content it deems inappropriate.

In the case of Feroza Aziz there is a debate to be had on whether a string of previous content was appropriate – there’s too little information to make a judgement. However on balance it does appear that TikTok moderation is far more heavy-handed than US platforms such as Facebook.

That being said, we could also theorise that the current global political and economic climate – combined of course with the anti-China rhetoric from the U.S. administration – is the largest driver of the efforts to find problems with the platform.

Even so, I’ve built a mechanism to block TikTok from your network based on Debian Linux and unbound (combined with appropriate configuration for your wireless and edge routers). The script could easily be modified for PiHole-based DNS (FTLDNS), although I suspect PiHole may add TikTok-based blocks in the near future.

You can read about that blocking mechanism here.
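For a flavour of the DNS side, here is a small Python sketch that generates unbound `local-zone` entries. It is not the script referred to above: the function name is made up, and the domain list is a short, illustrative and certainly incomplete placeholder rather than a vetted blocklist.

```python
# Illustrative placeholder list only; a real blocklist would be longer
# and would need regular maintenance
EXAMPLE_DOMAINS = ["tiktok.com", "tiktokv.com", "TikTokCDN.com.", "tiktok.com"]


def unbound_block_config(domains):
    """Build an unbound config fragment that answers NXDOMAIN for each zone."""
    cleaned = {d.strip().lower().rstrip(".") for d in domains if d.strip()}
    lines = ["server:"]
    for domain in sorted(cleaned):
        # always_nxdomain covers the zone itself and every name beneath it
        lines.append(f'    local-zone: "{domain}." always_nxdomain')
    return "\n".join(lines) + "\n"


config_text = unbound_block_config(EXAMPLE_DOMAINS)
```

The output would typically be written to a file that unbound includes, followed by a reload of the service. Clients on the network also have to be forced to use that resolver, which is where the router configuration comes in.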

Diversion

So whilst there’s been substantial progress on RD across all tiers (currently doing data architecture and working on PostgreSQL), a problem that keeps cropping up across all enterprises…cropped up again. For me this is a smaller piece of design and development work, which has benefits for a wide range of user groups – and is perfect for the open source model.

Almost feels like a distraction that I’d meant to do something about a few years ago.

Often where organisations want to take the next stage in maturity of their architecture practice, they need to look at how they manage their overall enterprise continuum. The starting point is a large volume of flat Visio drawings, a tome of Word documents and probably a whole chunk of PowerPoint presentations. I’d say Writer and Draw but am unsure of how many people would get the reference 🙂

If you’re fortunate enough you may be operating a “living document” approach in platforms like Confluence, but even with some of the diagrammatic markdown there, inevitably most of the drawings will still be in Visio. Living document platforms allow you to ditch disparate single document files in order to design & deliver change wiki-style. With the linkage to JIRA it really comes alive of course.

Even really powerful Visio drawings are just drawings. It’s not like there’s a data dictionary or much meta-data behind the shapes – unlike DWG, iServer store or Sparx EA repositories.

What do you have against Visio then?

Nothing at all. It’s a great tool – I remember working for a small architecture & surveying practice in the late 90s, building CAD tools for them to use in AutoCAD and Microstation. AutoDesk were trying to get their developer network (the ADN) interested in developing on their flat drawing product, Actrix Technical. Intended to be a template & stencil-based CAD tool (sound familiar?), focusing on simplicity vs. AutoCAD-level drawing dictionaries, it was struggling to gain traction with the ADN.

It looked like the biggest problem was that, in an industry already using AutoCAD, it was difficult to sell the benefits of a simplified tool. Surveyors just weren’t that interested. At the company I worked for at the time we hit upon the idea of using it for estate agents – when they’re assessing a property it might be a good way of quickly drawing a layout. We import a survey DWG into Actrix as the home layout, then give the estate agents a bunch of to-scale stencils with household items e.g. furniture and appliances.

Sounded good enough that we approached AutoDesk to see if we could get some help with funding for the project – they even came to Nottingham to see us.

And so it begins…

The journey from product roadmap to backlog prioritisation is always challenging…

I’ve already written a good portion of the middle-ware for the product post-pilot, but am now realising it’s a pretty big job for one person. Building servers and selecting appropriate EU IaaS providers, adhering to legislation and keeping technically relevant are only small parts of the roadmap.

In 2016 there was a lot of spam going about. At one point I was getting thousands of emails a week of which approximately half were to addresses I’d used explicitly to send SARs (subject access requests).

It wasn’t just appalling re-use of data, it was wanton flouting of data protection and privacy laws for profit. Some of the attitudes showed a Facebook-level appreciation of data protection law – including one response to a standard ASA complaint which made the ASA uncomfortable and gave a few of us a laugh:

In response to a complaint to the ASA, which related to an advertiser (AdView & Roxburghe) buying lists of names and email addresses from Indian data traders, failing to verify or ask for explicit consent, and then using those people’s details to send them spam advertising the Roxy services

Fortunately two things happened:

  1. I had acquired a lot of skills & some significant experience in technical fields over the years
  2. The more spam I received, the more data & meta-data I was accumulating on spammers

After a couple of years of trying direct action in the UK County Courts (with mixed success) I realised I could use the meta-data to build an email security product which I could then distribute on open-source. I started tinkering with Python and my usual email client, Gnome Evolution – as it allows you to easily create mail filters which call a script.
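To give an idea of how small such a filter script can start out, here is a hedged Python sketch. It assumes the mail client pipes the raw message to the script’s standard input and acts on the exit code; the blocked domain set and the function names are hypothetical, and this is not the original pilot code.

```python
import sys
from email import message_from_bytes
from email.utils import parseaddr

# Hypothetical blocklist; the pilot's real data came from accumulated spam meta-data
BLOCKED_DOMAINS = {"spam-example.net"}


def sender_domain(raw: bytes) -> str:
    """Extract the lower-cased domain of the From address, or '' if absent."""
    msg = message_from_bytes(raw)
    _, address = parseaddr(str(msg.get("From", "")))
    return address.rpartition("@")[2].lower()


def main(stream) -> int:
    """Return 1 when the sender domain is blocked, otherwise 0."""
    return 1 if sender_domain(stream.read()) in BLOCKED_DOMAINS else 0


# When installed as a filter script, finish with:
#     sys.exit(main(sys.stdin.buffer))
```

The filter in the mail client would then be set to move or delete the message when the script returns 1. Keeping the decision in `main` makes it easy to test without a mail client attached.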

That evolved into a much wider capability that I’ve piloted on my own mailbox for the last two years or so. All seems to work reasonably well and efficiently. However after visiting a few of the stands at the InfoSec Europe 2019 trade show at Olympia, I realised there are a lot of companies selling the same or similar platforms for a lot of money.

However none of them seemed to interact with spam email the way I was designing my product.

Which brings us to the subject of this blog – an email security product code-named “Ringo Dingo”, after asking for suggestions from everyone at home. Next time perhaps I’ll pick a random word from a dictionary.

So I needed a way of tracking the random thoughts crossing my brain about it, rather than forget something critical or unusual that would be good to add to the overall capability. I started using my usual kanban, Trello, to log ideas and triage the good stuff from the crap, and have progressed all workable options to Gitlab.com boards.

The pilot is pretty much done and dusted, so I started redesigning Ringo Dingo as middleware – which would enable access from any mail client or MTA. What was missing was a book of progress on the overall piece… which is where this blog comes in.

My other blogs focus on other topics relating to data protection, whereas this record is purely R&D.