Building an automatic fraud prevention system from scratch
Over the last six months I have been building an automatic analysis system without any prerequisites in place. Today the ideas we discovered and implemented in this system help us find and analyse a lot of fraudulent activity. In this article I’d like to share the principles we followed and what we did to reach the current state of the system.
Principles of our system
When you hear terms like “automatic” and “fraud”, you most likely start thinking about Machine Learning, Apache Spark, Hadoop, Python, Airflow and other technologies from the Apache Software Foundation ecosystem and the data science field. I think there is one aspect of these tools that is rarely mentioned: they require certain prerequisites to be in place in your enterprise before you can start using them. In short, you need an enterprise data platform that comprises a data lake and a data warehouse. But what if you don’t have such a platform and still need to develop this practice? The principles described below helped us reach the point where we can focus on improving our ideas rather than on finding one that works. Nevertheless, the project has not reached a plateau. There are still many items on our roadmap, from both the technological and the business standpoint.
Principle 1: Business value first
We put business value at the head of all our efforts. In general, any automatic analysis system is a sophisticated system with a high level of automation and technical complexity. Building a complete solution from scratch takes a tremendous amount of time. We decided to put delivering business value first and technological completeness second. In practice this means that we don’t treat technological best practices as dogma. We choose the technology that works best for us at the moment. Over time it may turn out that we need to re-implement some modules. This is a trade-off we accepted.
Principle 2: Human augmented intelligence
I bet most people who are not deeply involved in developing ML solutions think that replacing humans is the goal. In reality ML solutions are far from perfect, and replacement is possible only in certain areas. We dropped that idea from the very beginning for two reasons: data on fraudulent activity is imbalanced, and it is impossible to provide an exhaustive list of features for ML models. Instead we chose augmented intelligence: an alternative conceptualization of artificial intelligence that focuses on AI’s assistive role and emphasizes that cognitive technology is designed to enhance human intelligence rather than replace it.
With this in mind, developing a complete ML solution from the start would have required enormous effort and delayed delivering value to our business. We decided to build the system iteratively, growing the ML part under the guidance of our domain experts. A challenging part of developing such a system is that the cases it provides to our analysts cannot be limited to a binary decision on whether something is fraud or not. In general, any anomaly in customer behavior is a suspicious case that experts need to investigate and respond to. Only a portion of the captured cases can really be categorized as fraudulent activity.
Principle 3: Extensive analytics data platform
The most difficult part of our system is the end-to-end verification of its workflow. Analysts and developers should be able to easily obtain datasets for past periods with all the metrics that were used for analysis. In addition, the data platform should provide an easy way to extend the existing set of metrics with new ones. The processes we build, and not only the programmable ones, should make it easy to recalculate previous periods, add new metrics and change data projections. We could achieve that by accumulating all the data our production system generates, but then the data would slowly become a liability. We would need to store and protect a growing volume of data that we don’t use. Over time this data would become less and less relevant while still requiring effort to manage. Data hoarding made no sense for us, so we took another approach. We decided to organize real-time data warehouses around the target entities we want to classify, and to store only the data that allows verification of the most recent and relevant periods. The challenge is that our system is heterogeneous, with multiple data stores and software modules that require careful planning to work consistently.
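To make the idea of an entity-centred store with bounded retention more concrete, here is a minimal in-memory sketch. It is not our production implementation: the class name `EntityWindowStore`, the `retention` parameter and the metric fields are hypothetical, and the sketch assumes that observations for one entity arrive in time order.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta
from typing import Callable, Deque, Dict, List, Tuple

# One stored observation: a timestamp plus the raw values seen at that time.
Row = Tuple[datetime, Dict[str, float]]


class EntityWindowStore:
    """Hypothetical store that keeps only recent rows per target entity."""

    def __init__(self, retention: timedelta) -> None:
        self.retention = retention
        self._rows: Dict[str, Deque[Row]] = defaultdict(deque)

    def add(self, entity_id: str, ts: datetime, raw: Dict[str, float]) -> None:
        """Append a row and drop rows older than the retention window.

        Assumes rows for one entity arrive in non-decreasing time order.
        """
        rows = self._rows[entity_id]
        rows.append((ts, raw))
        cutoff = ts - self.retention
        while rows and rows[0][0] < cutoff:
            rows.popleft()

    def recompute(
        self, entity_id: str, metric: Callable[[Dict[str, float]], float]
    ) -> List[Tuple[datetime, float]]:
        """Calculate a (possibly new) metric over the retained period."""
        return [(ts, metric(raw)) for ts, raw in self._rows.get(entity_id, ())]


store = EntityWindowStore(retention=timedelta(days=2))
start = datetime(2024, 1, 1)
for day, amount in [(0, 10.0), (1, 20.0), (3, 400.0)]:
    store.add("customer-1", start + timedelta(days=day),
              {"amount": amount, "operations": 2.0})

# A metric added after the data was collected can still be calculated,
# but only for the period that is still retained.
average_amount = store.recompute(
    "customer-1", lambda raw: raw["amount"] / raw["operations"]
)
```

Because the raw values are kept next to the timestamp, adding a new metric is just another function passed to `recompute`, while the retention window keeps the stored volume bounded.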
Constructive concepts of our system
Our system has four main components: the ingestion system, the computational system, the analysis (BI) system and the tracking system. Each serves a specific, isolated purpose, and we keep them isolated by following certain development approaches.
Contract first design
First of all, we agreed that components should rely only on the defined data structures (contracts) that are transmitted between them. This makes integration easy and does not force a particular composition or order of components. For example, in some cases it allows us to integrate the ingestion system directly with the alerts tracking system. Such an integration follows the agreed contract for alerts, which means that both components are integrated through a contract that any other component can use. We won’t add a separate contract for ingesting alerts into the alerts tracking system. This approach enforces a predefined minimal number of contracts and keeps the system and its communications simple. Essentially, we take the approach called “Contract First Design” and apply it to streaming data contracts.
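As an illustration of what a shared alert contract could look like, here is a small sketch. The `AlertContract` class and all of its fields are assumptions made for this example, not our real contract, and JSON is used only because it is easy to show.

```python
import json
from dataclasses import asdict, dataclass, fields


@dataclass(frozen=True)
class AlertContract:
    """Hypothetical alert contract shared by every producer and consumer."""

    entity_id: str   # entity the alert is about (assumed field)
    metric: str      # metric that triggered the alert (assumed field)
    value: float     # observed value of that metric
    source: str      # component that emitted the alert
    version: int = 1

    def to_message(self) -> bytes:
        """Serialize the alert to bytes that any message broker can carry."""
        return json.dumps(asdict(self), sort_keys=True).encode("utf-8")

    @classmethod
    def from_message(cls, raw: bytes) -> "AlertContract":
        """Parse a message and reject payloads that break the contract."""
        payload = json.loads(raw.decode("utf-8"))
        expected = {f.name for f in fields(cls)}
        if set(payload) != expected:
            raise ValueError(f"payload does not match contract: {sorted(payload)}")
        return cls(**payload)


# The ingestion system and the computational system emit the same structure,
# so the alerts tracking system needs only one parser.
alert = AlertContract("customer-42", "withdrawals_per_hour", 17.0, "ingestion")
restored = AlertContract.from_message(alert.to_message())
```

The consumer depends only on the contract, not on who produced the message, which is what lets components be composed in any order.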
Streaming all the way
Keeping and managing state in the system inevitably complicates its implementation. The state has to be shared so that any component can access it, it has to be consistent and give every component the most recent value, and it has to be reliable. In addition, calls to persistent storage to obtain the latest state would increase the number of I/O operations and the complexity of the algorithms in our real-time pipelines. Because of that we decided to keep transmissions as stateless as possible. This approach requires each transmission to carry all the data it needs. For example, if we need a total number of some cases (a count of operations or of cases with specific characteristics), we calculate it in memory and generate a stream of such values. Dependent modules use partitioning and batching to read the data they require. This removed the need for persistent disk storage for such data. Our system uses Kafka as a message broker, and it is possible to use Kafka as a database with KSQL. But doing so would tightly couple our solution to Kafka, so we decided against it. The approach we chose lets us replace Kafka with another message broker without heavy changes to the integrations.
This concept doesn’t mean we don’t use disk storage and databases. To verify and analyse system performance we need to store on disk a significant portion of the data, representing different metrics and states. The important point is that the real-time algorithms do not depend on this data. In most cases we use stored data for offline analysis, debugging, and tracing of specific cases and results that the system produces.
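The in-memory counting described above can be sketched roughly as follows. The function names and message fields are hypothetical, and plain Python iterables stand in for broker topics so that the example does not depend on Kafka or any other broker.

```python
from collections import Counter
from itertools import islice
from typing import Dict, Iterable, Iterator, List

Message = Dict[str, object]


def running_counts(events: Iterable[Message]) -> Iterator[Message]:
    """Emit each event enriched with the running total for its key.

    The totals live only in memory; every emitted message carries the
    value a downstream module needs, so nothing is looked up on disk.
    """
    totals: Counter = Counter()
    for event in events:
        key = (event["entity_id"], event["operation"])
        totals[key] += 1
        yield {**event, "count": totals[key]}


def in_batches(messages: Iterable[Message], size: int) -> Iterator[List[Message]]:
    """Group a stream into fixed-size batches for a dependent module."""
    iterator = iter(messages)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


events = [
    {"entity_id": "a", "operation": "login"},
    {"entity_id": "a", "operation": "login"},
    {"entity_id": "b", "operation": "login"},
]
batches = list(in_batches(running_counts(events), size=2))
```

Since the downstream module reads everything it needs from the messages themselves, swapping the broker only changes how the iterables are produced and consumed.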
Challenges of our system
There are certain problems that we have solved to some degree but that require better solutions. I’ll only mention them here, because each is worth a separate article.
- We still need to define processes and policies that foster accumulation of meaningful and relevant data for our automatic analysis, data discovery and exploration.
- We need to introduce a human into the process of automatically adjusting the system to recent incoming data. This means not only updating our model, but also updating our processes and improving our understanding of the data.
- We need to find a balance between the deterministic IF-ELSE approach and ML. Someone said: “ML is the tool for the desperate”, meaning that you reach for ML when you no longer understand how to optimize and improve your algorithms. On the other hand, the deterministic approach cannot find anomalies that were not defined in advance.
- We need an easy way to check our hypotheses about new features and about dependencies between data metrics.
- The system should have several levels of true positive results. Fraudulent cases are only a portion of all the cases that the system could count as true positives. For example, analysts capture cases for verification and only a small part of them is fraud. The system should efficiently deliver all such cases to analysts, regardless of whether they are pure fraud or just suspicious behavior.
- The data platform should allow obtaining datasets from past periods with computations created and calculated on the fly.
- We need easy and automatic deployment of any of the system’s components to at least three different environments: production, experimental (beta) and development.
- Last but not least, we need to build an extensive performance verification platform where we can analyse our models.
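To illustrate the point in the list above about several levels of true positive results, here is a rough sketch of how precision could be measured at two levels. The `Verdict` labels and the `precision_by_level` function are hypothetical names chosen for this example; they assume that analysts assign one of three verdicts to each reviewed case.

```python
from enum import Enum
from typing import Dict, Iterable, Tuple


class Verdict(Enum):
    """Hypothetical labels an analyst assigns after reviewing a case."""

    NORMAL = 0
    SUSPICIOUS = 1
    FRAUD = 2


def precision_by_level(cases: Iterable[Tuple[bool, Verdict]]) -> Dict[str, float]:
    """Precision of flagged cases at two levels of "true positive".

    Each case is a pair (was_flagged_by_system, analyst_verdict).
    """
    flagged = [verdict for was_flagged, verdict in cases if was_flagged]
    if not flagged:
        return {"fraud": 0.0, "suspicious_or_fraud": 0.0}
    fraud = sum(v is Verdict.FRAUD for v in flagged)
    relevant = sum(v is not Verdict.NORMAL for v in flagged)
    return {
        "fraud": fraud / len(flagged),
        "suspicious_or_fraud": relevant / len(flagged),
    }


reviewed = [
    (True, Verdict.FRAUD),
    (True, Verdict.SUSPICIOUS),
    (True, Verdict.SUSPICIOUS),
    (True, Verdict.NORMAL),
    (False, Verdict.NORMAL),
]
scores = precision_by_level(reviewed)
```

Measured only against confirmed fraud, the flagged cases in this sample look weak, while the broader level shows that most of them were worth an analyst’s attention.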