The Proceedings of the 9th Annual International Digital Government Research Conference
Uses of Artificial Intelligence in the Brazilian Customs Fraud Detection System

Luciano A. Digiampietri∗, Institute of Computing, Av. Albert Einstein, 1251, 13084-971 Campinas, SP (BRAZIL)
Norton Trevisan Roman∗, Institute of Computing, Av. Albert Einstein, 1251, 13084-971 Campinas, SP (BRAZIL)
Luis A. A. Meira∗, Institute of Computing, Av. Albert Einstein, 1251, 13084-971 Campinas, SP (BRAZIL)
Jorge Jambeiro Filho†, Brazil’s Federal Revenue, Rodovia Santos Dummont, Km 66, 13055-900 Campinas, SP (BRAZIL)
Cristiano D. Ferreira∗, Institute of Computing, Av. Albert Einstein, 1251, 13084-971 Campinas, SP (BRAZIL)
Andreia A. Kondo∗, Institute of Computing, Av. Albert Einstein, 1251, 13084-971 Campinas, SP (BRAZIL)
∗ {luciano.digiampietri,nortontr,augustomeira,crferreira,andreia.kondo}@gmail.com
† [email protected]

ABSTRACT
There is increasing concern about the control of customs operations. While globalization encourages the opening of markets, the growing volume of imports and exports has been used to conceal several illicit activities, such as tax evasion, smuggling, money laundering, and drug trafficking. This makes it paramount for governments to find automatic or semi-automatic solutions to guide customs activities in order to minimize the number of manual inspections of goods. In this context, this paper presents an overview of some approaches developed in the HARPIA project, a partnership between universities and the Brazilian Federal Revenue for the development of computational intelligence solutions for the management of customs risk.

Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous; J.1 [Administrative Data Processing]: Government

Keywords
E-government, fraud detection, outlier detection

1. INTRODUCTION
Imports and exports are fundamental aspects of the global economy. Goods are typically taxed proportionally to their value, at a rate that varies according to the type of product. Each product is classified following a specific classification system. For Mercosul [12], this classification system is called NCM (Mercosul Common Nomenclature), which is similar to the “Harmonized Commodity Description and Coding System” used by the World Customs Organization (WCO) [15]. This classification system has approximately ten thousand different codes that describe product categories (instead of specific products). Often, it is not trivial to assign a category code to a product, due to the great number of categories and the fact that the descriptions of some categories are abstract. Moreover, many importers assign an incorrect category to products in order to pay a smaller tax. However, product misclassification is only one of several frauds related to customs operations [14]. We highlight other kinds of fraud: overvaluation, undervaluation, smuggling and drug trafficking. All these kinds of fraud can be used to support terrorists, drug traffickers and organized crime in general.

Each country is responsible for inspecting customs operations in order to identify frauds and punish the transgressors. Given the limited amount of available resources, it has become impossible to inspect all customs operations and identify all frauds. The goal of this paper is to describe part of an ongoing project, called HARPIA (Risk Analysis and Applied Artificial Intelligence). This project is a partnership between Brazilian universities and the Brazilian Federal Revenue for detecting several types of fraud through the application of artificial intelligence. In this paper we describe two aspects of this project: (i) an outlier based detection system that helps customs officers to identify suspicious customs operations; and (ii) a product and foreign exporter information system that aims to help importers in the registration and classification of their products and corresponding exporters.

The rest of this paper is organized as follows. Section 2 presents the related work. Section 3 describes our approach to the problems of identifying suspicious customs operations and registering goods and exporters. Section 4 presents the conclusions and future steps.

2. RELATED WORK
Detecting fraud using normal audit procedures is an expensive and laborious task. There are few customs officers with the necessary expertise, and hundreds (or sometimes thousands or even millions) of operations that must be verified. This brings up a new challenge: how to construct computational solutions to automatically or semi-automatically identify suspicious operations. Data mining and statistical approaches are being applied to try to identify these fraudulent operations.
The detection of suspicious activities is a problem in several domains, such as credit card fraud, telecommunications fraud, terrorism detection, financial crime detection, and computer intrusion detection. Detecting fraud is essential because prevention mechanisms fail [17], and a good detection system must be self-adaptive in order to detect new fraudulent behaviors.

There are several approaches to fraud detection. We highlight the use of neural networks [3, 5], Bayesian networks [11], expert systems [2], rule based systems [1] and the detection of statistical outliers [6, 14, 16]. These approaches can be subdivided into two groups: supervised and unsupervised. In the supervised approaches there is a training set of operations that are labeled either as fraudulent or normal. These operations are used as input to systems, such as neural networks, that need labeled inputs to construct the model that will be used to detect frauds.

The use of supervised learning by Brazilian customs to select goods for human verification was originally described in [4]. Alternative strategies were employed in [7] without benefits, but a novel approach, described in [8], achieved significant improvements in some performance measures. The unsupervised approaches do not need labeled inputs, as they either use a set of rules to classify an operation as a fraud or compare each operation with the previous ones to identify those that might be considered suspicious (outliers).

Rule based systems are unsupervised approaches that use a set of rules to classify the operations as fraudulent or normal, or to assign to each operation a value corresponding to the chance that it is a fraud. The rules are typically constructed following the advice of experts. These systems have the advantage of being unsupervised and of taking the experts’ knowledge into account when constructing the rules that evaluate each operation. One of the disadvantages of these systems is that the rules frequently need to be updated to deal with new fraudulent behaviors. Otherwise, the rules will eventually become obsolete.

The identification of frauds using outlier detection (e.g. [14, 16]) is an unsupervised approach that identifies suspicious operations by comparing each operation with the previous ones. One advantage of this approach is its capability to adapt to (and identify) new behaviors as new operations are stored in the system. Another advantage is the clear statistical meaning that is assigned to each suspicious operation. For example, the system can calculate that one operation is four standard deviations away from its expected value and that this happens only once in one thousand operations. This operation is an outlier and deserves to be inspected (as it is a suspicious operation).

One important prerequisite of outlier detection systems for fraud detection is that the majority of the operations stored in the system must be normal (not fraudulent). Moreover, it is important to emphasize that being an outlier does not mean being a fraud. Besides this assumption, it is also important to ensure that the importers, exporters and products are correctly registered and classified.

Every day, hundreds or even thousands of import declarations are written. Since there is no global database of foreign companies and products, each importer must re-type the name, description and classification of the products and the name of the company (exporter) that sold them. This process is susceptible to several kinds of errors. We highlight (i) the misclassification of products (because it is a laborious task to assign one of the ten thousand categories to each product), and (ii) the registration of companies or products with mistakes such as misspellings. To avoid these two problems, a common approach is the development of spell verification systems and/or approximate search engines that try to identify what the user is trying to type.

The first approach used in the HARPIA project to avoid redundancy in the Brazilian foreign companies database was based on a modified edit-distance algorithm [13]. This solution extended the edit distance proposed by Levenshtein [10]. The main idea of the modified algorithm is to break the strings into words, compare the words and compute the distance between them, and search for the minimum cost “path” that links them together. See [13] for details about this approach.

The edit-distance based approach presented good initial results, but it was not robust enough to deal with all the problems in the products and foreign exporter database. Section 3.2 presents a more complex approach using Markov chains and n-grams [9].

3. OUR APPROACH
Our approach to identifying possible frauds is based on the interaction between the customs officer and the decision support system we developed [14]. This system, called Carancho, highlights suspicious operations through outlier detection. It assumes that the majority of international commerce operations are correct, i.e., they are in accordance with the law and the products are correctly classified.

Due to the large number of products and companies (exporters, importers, transporters, etc.), it is very difficult to ensure that the products and the companies are correctly classified, avoiding misclassification or multiple registration of the same company. To solve this problem, we are developing a Product and Foreign Exporter Information System (PFEIS) that uses features from orthographic verification to suggest possible duplicates (i.e. when the user tries to register an already registered company or product) and to help with their classification.

Although both systems may seem only loosely related, they actually belong to a bigger picture, as shown in Figure 1, which presents some of the main modules that make up the artificial intelligence part of the HARPIA architecture. This figure also illustrates the strategies followed by the HARPIA project to tackle the problem of customs fraud detection. These strategies, in turn, concentrate mainly on (i) building a reliable database of products and foreign exporters (PFEIS), (ii) trying to identify suspicious operations before (Carancho) and after (ANACOM) clearance, and (iii) controlling small imports coming into the country through the express mailing service. Thus, as can be seen, PFEIS and Carancho are linked together by the former building the dataset needed by the latter. This paper describes only the
Carancho and the Product and Foreign Exporter Information System modules.

3.1 THE CARANCHO SYSTEM
Instead of trying to formulate an exhaustive set of rules to cover the broadest possible range of frauds, the approach we followed relies upon the graphical visualization of historical import/export data (see Carancho [14]). In a nutshell, it takes the historical record of import operations as a starting point and presents it to the user. The user can then check whether some specific transaction can be considered an outlier according to a number of predefined dimensions.

As our main interest is detecting under and overvaluation, we have chosen a set of dimensions thought to be sensitive to such problems, according to the customs officers’ practical experience and expertise. In a sense, then, this approach combines the visual detection of outliers with the officers’ empirical knowledge.

The main advantage of this approach shows up most clearly when the trading system changes, as when the goods classification scheme changes, for instance. While these changes would demand that the set of rules mapping conditions to consequences be updated, some dimensions (like weight and price, for example) remain untouched, i.e., they can still be used to characterize any import operation. That fact makes them a naturally long-lasting choice for detecting any abnormal behavior. In the same way, changes in the importers’ behavior, which would otherwise also demand updating the set of rules, are naturally captured by this approach, as it accounts for all the import operations that took place in some time range.

To verify the practical applicability of this idea, we have developed a computer system capable of analyzing the whole set of data and showing it to the user in a way that lets him/her clearly spot any outliers (Figure 2). The rationale behind this approach is that it allows an automatic outlier detection technique to be used alongside the user’s decisions, either concurrently or in support of them.

Originally designed to deal with only three dimensions, the first version of this system shows the data distribution according to the predefined dimensions, along with the operation under evaluation (portrayed as a thin horizontal red line in Figure 3). However, such an approach presents a major limitation to the user, namely, it only allows data to be analyzed along a single axis (as it is a histogram), thereby losing any information concerning the relation that different dimensions might hold with each other.

To deal with this shortcoming, and once more based on the customs officers’ expertise, we have redesigned the way the system outputs the data. The new representation, as illustrated in Figure 4, deals with pairs of dimensions, allowing the user to determine any trend that might exist inside each pair. In this figure, the four importers responsible for the highest number of operations (numbered 0 to 3) are portrayed in different shapes and colors. A fifth shape (and corresponding color) is reserved for the rest of the data, i.e. the data coming from all the remaining importers.

As one may notice, this representation lacks information about the relative amount of import operations for a specific pair of dimensions, thereby lacking the very information needed to give the user some insight into the importance of a specific point (like an outlier, for example). Even worse, overlapping points might generate some distortion in the coloring scheme, hiding some points and perhaps rendering the whole visualization less reliable.

To avoid these drawbacks, the user can tick the “Densidade 2D” (2D Density) box, bringing the density of operations into the picture. The system, in turn, colors each point according to the relative amount of imports it might contain, from yellow (the groups with fewer import operations) to red (the groups with the highest number of operations in the data). When the system colors some point, it does so using a Normal curve for intensity, i.e., the color smooths out as it moves away from the point, as illustrated in Figure 5.

Although this new representation seems to sort out most of the problems with the data visualization, it still suffers from a fundamental difficulty, namely, the considerably high degree of subjectivity brought to the system by the current goods classification scheme (i.e. the NCM). This subjectivity, which lets considerably different products be correctly classified in the same category, has the undesired property of grouping together very sparse data, thereby making it difficult for the user to determine what an outlier would look like, given such a dataset.

The solution we found to this problem was to develop a registration system to identify each foreign exporter and his/her corresponding exported goods. This system, described in the next section, should be able to evolve over time, naturally adapting to the new products brought forth by the market (and to new exporters coming into it), without any intervention from the customs office. Once this is accomplished, the system would give every exporter and product a unique identifier, allowing Carancho to group together only products that are really close to each other, thereby increasing the reliability of its output.

3.2 PRODUCT AND FOREIGN EXPORTER INFORMATION SYSTEM
It is a difficult task for the Brazilian Federal Revenue to create unique identifiers for companies located outside Brazil. This requirement appears every time these companies buy or sell goods across our borders. Without a unique identifier, a foreign company can be repeatedly fraudulent without attracting any special attention from the Federal Revenue, and it can be treated as if it were a new enterprise at each transaction. To cope with this problem, we are developing a catalog to assign unique identifiers to each company. This catalog aims to minimize redundancy by providing the importer with a search engine, so that s/he can search for a previous registration of a company before registering it again.

The effort to keep foreign enterprises correctly registered can be naturally extended to the products traded among them. The goods that enter or leave the country have similar demands for unique identifiers. These identifiers are desirable to facilitate automatic or semi-automatic fraud detection systems (see Section 3.1). Inside the HARPIA project,
Figure 1: Part of HARPIA’s AI architecture.
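
PFEIS, one of the modules in Figure 1, relies on a spelling checker built on Markov chains and n-grams (Section 3.2) to estimate whether a string is a plausible word in its domain. As a rough illustration of that idea only, the sketch below trains a character-bigram Markov chain on a small vocabulary and scores strings against it. The class name, the boundary symbols, the add-one smoothing and the toy vocabulary are all assumptions made for this example; the paper does not specify the model actually used.

```python
import math
from collections import defaultdict


class BigramWordModel:
    """Character-bigram Markov chain over a vocabulary (illustrative only)."""

    START, END = "^", "$"  # assumed boundary symbols, not taken from the paper

    def __init__(self):
        # counts[prev][cur] = how often character `cur` followed `prev`
        self.counts = defaultdict(lambda: defaultdict(int))
        self.alphabet = {self.END}

    def add_word(self, word):
        """Update the transition counts with one registered word."""
        chars = [self.START] + list(word.lower()) + [self.END]
        for prev, cur in zip(chars, chars[1:]):
            self.counts[prev][cur] += 1
            self.alphabet.add(cur)

    def log_prob(self, word):
        """Average log-probability per transition, using add-one smoothing."""
        chars = [self.START] + list(word.lower()) + [self.END]
        size = len(self.alphabet) + 1  # +1 keeps some mass for unseen characters
        total = 0.0
        for prev, cur in zip(chars, chars[1:]):
            row = self.counts.get(prev, {})
            total += math.log((row.get(cur, 0) + 1) / (sum(row.values()) + size))
        return total / (len(chars) - 1)


# Hypothetical vocabulary; a real catalog would be fed with registered names.
model = BigramWordModel()
for w in ["international", "industrial", "trading", "national", "ltd", "co"]:
    model.add_word(w)

plausible = model.log_prob("internacional")  # resembles registered words
implausible = model.log_prob("xqzvkj")       # resembles nothing registered
```

In this toy setting the string that resembles known vocabulary receives the higher score. Because every newly registered word is fed back through `add_word`, the counts evolve with the catalog, which mirrors the paper's remark that probabilities and frequencies are updated on each insertion.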
Figure 2: The system’s interface.
Figure 3: Output of the system’s first version.
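
The histogram in Figure 3 compares one operation with the historical distribution along a single dimension. A simple statistic in the same spirit is the number of standard deviations that separate the operation from the historical mean, as in the "four standard deviations" example of Section 2. The sketch below is an illustration under stated assumptions, not Carancho's implementation: the function name, the threshold of 3 and the sample unit prices are all hypothetical.

```python
from statistics import mean, stdev


def deviation_score(history, value):
    """Number of sample standard deviations between `value` and the mean of `history`."""
    if len(history) < 2:
        raise ValueError("need at least two historical operations")
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # All past values identical: any different value is infinitely far away.
        return 0.0 if value == mu else float("inf")
    return abs(value - mu) / sigma


# Hypothetical declared unit prices for one product category (same currency).
past_prices = [10.2, 9.8, 10.5, 10.1, 9.9, 10.0, 10.3, 9.7]

score_typical = deviation_score(past_prices, 10.1)
score_suspect = deviation_score(past_prices, 6.0)

THRESHOLD = 3.0  # assumed cut-off; a real system would have to tune this
flag_for_inspection = score_suspect > THRESHOLD
```

A high score only marks the operation as a candidate for inspection. As Section 2 stresses, being an outlier does not mean being a fraud, and the statistic is only meaningful if most of the stored operations are normal.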
Figure 4: Relationship between the data in each pair of dimensions.
Figure 5: Relationship between the dimensions (and their density).
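
The density coloring illustrated in Figure 5 spreads the contribution of each operation around its position using a Normal curve. One way to approximate that effect is to accumulate a Gaussian bump per operation on a grid and map the result to a yellow-to-red scale. The grid size, the bandwidth and the color mapping below are assumptions for illustration; the paper does not give the parameters Carancho uses.

```python
import math


def density_grid(points, width, height, bandwidth=1.5):
    """Sum one Gaussian bump per (x, y) point over a width x height grid."""
    grid = [[0.0] * width for _ in range(height)]
    for px, py in points:
        for gy in range(height):
            for gx in range(width):
                d2 = (gx - px) ** 2 + (gy - py) ** 2
                grid[gy][gx] += math.exp(-d2 / (2 * bandwidth ** 2))
    return grid


def to_color(value, max_value):
    """Map a density to RGB: yellow (255, 255, 0) when low, red (255, 0, 0) when high."""
    ratio = value / max_value if max_value > 0 else 0.0
    return (255, round(255 * (1 - ratio)), 0)


# Hypothetical operations in two already-scaled dimensions (e.g. weight, price).
ops = [(2, 2), (2, 3), (3, 2), (3, 3), (8, 7)]
grid = density_grid(ops, width=10, height=10)
peak = max(max(row) for row in grid)
colors = [[to_color(v, peak) for v in row] for row in grid]
```

With these sample points, the cells around the cluster of four operations accumulate more density than the cell of the isolated operation at (8, 7), so the cluster is drawn closer to red while the isolated point stays closer to yellow.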
two catalogs are being developed: the Product Catalog System and the Foreign Importer/Exporter Catalog System.

National enterprises that trade with other countries are uniquely identified in Brazil by the CNPJ number, which is a unique identifier provided by the Federal Revenue. When these enterprises make an international transaction, the Brazilian Federal Revenue will have them register their international partners, following a specific protocol. First, the user designated by the national enterprise queries the Importer/Exporter Catalog looking for the partner company. The system, in turn, looks up the database, returning any match it finds, ranked according to a probability function.

If, on the other hand, no satisfactory match is found, the national company can create and register its foreign partner in the system. Once this operation is confirmed, the new foreign company is registered in the catalog and a unique identifier is created. This identifier can then be used by the national company whenever it makes an international transaction with the foreign company it represents. The same procedure will be followed whenever the importer tries to register a new product.

There is, however, more to these catalogs than a simple search engine. The users of a search engine are very interested in finding whatever they describe in their queries. Companies that want to commit fraud, however, do not want their foreign partners or the products they are pursuing to be recognized. To carry out this task properly, we need to care about spelling errors, i.e., we must take into account, among other things, the possibility that the user mistypes his/her query. Also, the system must be able to identify and correct multiple instances of the same company or product. To do so, the catalogs have a built-in probabilistic spelling checker, along with methods for insertion, deletion, merging and correction of records, in an attempt to keep the database consistent.

The spelling checker’s implementation is based on Markov chains and n-grams. These techniques are used mainly to calculate a word similarity value, based on string matching operations, and to calculate the probability that a given string is, in fact, a valid word in a given domain. Observe that, in a multi-language domain of proper names, new words can be considered neither wrong nor right, for there is no proper lexicon to match them against.

Under these constraints, the system must deal with unreliable information, that is, a dataset that might also contain ill-formed strings, being potentially as problematic as the query string from the user. For this reason, our systems use a probabilistic model that takes into account the most common misspellings, keyboard character position, and the semantics of a set of special words, like “international”, “ltd” and “co”, for instance. Whenever a new product or enterprise is inserted in the catalog, its words are added to the vocabularies and the probabilities and frequencies are updated.

4. CONCLUSIONS AND FUTURE WORK
Fraud detection systems in customs operations are very important to minimize the manual inspection of goods and maximize the number of frauds detected. They are complex systems that must deal with several problems, such as high cardinality attributes, imbalanced databases, and misspelling problems.

In this paper, we presented some artificial intelligence approaches used in the Brazilian customs fraud detection system. The main contributions are (i) the ability to help identify outliers (suspicious operations), and (ii) the products and foreign exporters information system (including databases and tools to identify redundancies and to suggest a category for each product).

As for future work, we are currently developing some automatic outlier detection techniques, which will be used in conjunction with the visual techniques to show the user both the graphics and the probability values. These values, in turn, would represent the system’s confidence that, according to the current dataset, a given product actually costs the amount declared by the importer.

5. ADDITIONAL AUTHORS
Additional authors:
Everton R. Constantino (Institute of Computing, UNICAMP, email: [email protected]),
Rodrigo Rezende (Institute of Computing, UNICAMP, email: [email protected]),
Bruno C. Brandao (Institute of Computing, UNICAMP, email: [email protected]),
Helder S. Ribeiro (Institute of Computing, UNICAMP, email: [email protected]),
Pietro K. Carolino (CLE-IFCH, UNICAMP, email: [email protected]),
Antonella Lanna (Brazil’s Federal Revenue, email: [email protected]),
Jacques Wainer (Institute of Computing, UNICAMP, email: [email protected]) and
Siome Goldenstein (Institute of Computing, UNICAMP, email: [email protected]).

6. REFERENCES
[1] A. Deshmukh and T. Talluru. A rule based fuzzy
reasoning system for assessing the risk of management
fraud. Journal of Intelligent Systems in Accounting,
Finance & Management, 4:669–673, 1997.
[2] M. M. Eining, D. R. Jones, and J. K. Loebbecke.
Reliance on decision aids: an examination of auditors’
assessment of management fraud. Auditing: A Journal
of Practice and Theory, 16(2):1–19, 1997.
[3] K. Fanning and K. Cogger. Neural network detection
of management fraud using published financial data.
International Journal of Intelligent Systems in
Accounting, Finance & Management, 17(1):21–24,
1998.
[4] M. A. C. Ferreira. Uso de redes de crença para seleção
de declarações de importação. Master’s thesis,
Instituto Tecnológico de Aeronáutica, 2003.
[5] B. P. Green and J. H. Choi. Assessing the risk of
management fraud through neural network technology.
Auditing: A Journal of Practice and Theory,
16(1):14–28, 1997.
[6] V. Hodge and J. Austin. A survey of outlier detection
methodologies. Artificial Intelligence Review,
22(2):85–126, 2004.
[7] J. Jambeiro Filho and J. Wainer. Analyzing Bayesian
networks with local structure and cardinality
reduction over a practical case. In Proceedings of the
Workshop on Computational Intelligence (WCI), 2006.
[8] J. Jambeiro Filho and J. Wainer. Using a hierarchical
Bayesian model to handle high cardinality attributes
with relevant interactions in a classification problem.
In Proceedings of the International Joint Conference
of Artificial Intelligence (IJCAI). AAAI Press, 2007.
[9] D. Jurafsky and J. H. Martin. Speech and Language
Processing: An Introduction to Natural Language
Processing, Computational Linguistics, and Speech
Recognition. Prentice Hall, Englewood Cliffs, New
Jersey, 2000.
[10] V. I. Levenshtein. Binary codes capable of correcting
deletions, insertions and reversals. Soviet Physics
Doklady, 10(8):707–710, 1966.
[11] S. Maes, K. Tuyls, B. Vanschoenwinkel, and
B. Manderick. Credit card fraud detection using
Bayesian and neural networks. In Proceedings of the
1st International NAISO Congress on Neuro Fuzzy
Technologies, 2002.
[12] Mercosul/Mercosur – Southern Common Market.
http://www.mercosur.int/msweb/ (as of 2007-10-25).
[13] B. W. Paleo, C. G. G. Hita, J. C. Lima, C. H. Ribeiro,
and J. Jambeiro Filho. A modified edit-distance
algorithm for record linkage in a database of
companies. In Proceedings of the 2nd Workshop em
Algoritmos e Aplicações de Mineração de Dados
(WAAMD), 2006.
[14] N. T. Roman, E. R. Constantino, H. Ribeiro, J. J.
Filho, A. Lanna, S. K. Goldenstein, and J. Wainer.
Carancho – a decision support system for customs. In
Proceedings of ECML PKDD Workshop on Practical
Data Mining: Applications, Experiences and
Challenges, pages 100–103, September 2006.
[15] World Customs Organization.
http://www2.wcoomd.org/ie/index.html (as of
2007-10-25).
[16] K. Yamanishi, J. Takeuchi, G. Williams, and P. Milne.
On-line unsupervised outlier detection using finite
mixtures with discounting learning algorithms. Data
Mining and Knowledge Discovery, 8(3):275–300, 2004.
[17] D. Yue, X. Wu, Y. Wang, Y. Li, and C.-H. Chu. A
review of data mining-based financial fraud detection
research. In International Conference on Wireless
Communications, Networking and Mobile Computing
(WiCom), pages 5514–5517, September 2007.