3 Analysing administrative sources – input data quality

THEORY AND QUALITY OF REGISTER-BASED STATISTICS

coding system also has some weaknesses that lead to bad quality; categories such
as ‘other types of theft’ are easy for stressed police officers to use.
Example: A common opinion is that taxpayers only submit data that serve their
purposes and that tax data are consequently of low quality. Most people would
like to pay as little tax as possible, so deductions may be higher than is
justifiable. The example below is taken from a leading Swedish newspaper.
80 per cent of Swedish people’s tax deductions are pure tax evasion
Taxpayers submit errors worth billions in their tax declarations. Complicated rules and unclear legislation have made it hard for the country’s tax authorities to check all the deductions. Errors can be found
primarily in the deductions for share transactions, management fees and other share-related charges.

Deductions for the sales of shares
– 1/3 of all share sales contain errors
– 700 000 taxpayers report profit of around SEK 50 billion and losses of around SEK 10 billion
– Tax errors are difficult to judge and amount to billions of Swedish kronor
– Many inadvertent errors occur because of the complicated rules

Deductions for management fees
– 125 000 taxpayers claimed deductions of a total of SEK 515 million
– 66% of these deductions contain incorrect information
– Tax errors can in total be calculated at SEK 90 million
– A deduction for fees for fund managers is the most common error; the fee is deducted automatically

Deductions for other expenditure
– 700 000 taxpayers claimed deductions of a total of SEK 2.8 billion
– 82% of these deductions contain incorrect information
– Tax errors can in total be calculated to amount to around SEK 700 million

The headline exaggerates in several ways. ‘80 per cent’ is an exaggeration, and
‘tax evasion’ is often a misunderstanding caused by complicated rules:
– Deductions for share sales: the errors are largely unintentional.
– Deductions for management fees: the errors are on average 17% (90/515) and
the most common error may be unintentional due to misunderstanding.
– Deductions for other expenditure: 82% of these deductions contain errors but the
deductions are on average 25% (700/2800) incorrect.
Another perspective is gained by comparing these errors with the total income of
all those filing tax declarations: the error is then 0.3%.
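The percentages quoted above follow directly from the newspaper's figures. A minimal sketch of the arithmetic, where the dictionary and function names are illustrative and not taken from the source:

```python
# Figures quoted in the newspaper example, in SEK million.
claimed = {"management fees": 515, "other expenditure": 2800}
erroneous = {"management fees": 90, "other expenditure": 700}


def error_share(kind):
    """Share of the claimed deduction amount that is incorrect."""
    return erroneous[kind] / claimed[kind]


for kind in claimed:
    print(f"{kind}: {error_share(kind):.1%} of the claimed amount is incorrect")
```

The share of deductions that contain some error (66% and 82%) is therefore much larger than the share of the deducted amount that is wrong.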
The fact that deductions in the declarations are too high, and that consequently
the tax is too low, does not mean that the statistics in the Income Register are of
low quality, even though they are based on these declarations. Assume that we
have data for a person who makes excessively high deductions on her/his tax
declaration, but otherwise declares correctly:
Income from employment              257 600   The income is correct.
Deductions for other expenditure     25 500   The deduction is too high but is accepted.
Taxable income                      232 100   Taxable income is incorrect according to the tax rules but is not used for the statistics.
Tax                                 100 000   The tax is incorrect and too low according to the tax rules, but statistically it is correct, as this is the tax that the person actually paid.
Disposable income                   157 600   Statistically correct

Statistics Sweden’s statistics on earned income are not incorrect because of this
person’s data; neither are the statistics regarding disposable income incorrect, as
this is formed by calculating the difference between income and tax actually paid.
Our conclusion is that taxation data are quite suitable for statistical use, even
though some criminals submit false data. Sending statistical questionnaires to these
criminals would not give us better statistical data.
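The argument can be checked with the figures from the example tax declaration above. The sketch below only restates that arithmetic; the variable names are illustrative assumptions.

```python
# Values from the example tax declaration above (SEK).
income_from_employment = 257_600
deduction_other_expenditure = 25_500  # too high, but accepted by the tax authority
tax_paid = 100_000                    # too low under the tax rules, but actually paid

# Administrative variable: wrong under the tax rules, and not used for statistics.
taxable_income = income_from_employment - deduction_other_expenditure

# Statistical variable: depends only on the correct income and the tax actually paid.
disposable_income = income_from_employment - tax_paid

print(taxable_income, disposable_income)
```

The inflated deduction affects only `taxable_income`. Disposable income is computed from the income and the tax actually paid, which is why it remains statistically correct.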
Systematic check of input data quality
Daas et al. (2010) suggest that each administrative source should be evaluated for
statistical purposes prior to use. However, if sources are evaluated one by one there
is a risk that sources that could be combined with other sources for a new survey or
be used for some improvement of the system will be overlooked.
As a rule, the understanding of how an administrative source should best be used
by a statistical office requires time to develop. New competence and new methods
must be developed, and as there may be many potential ways of using the source, it
may be necessary to evaluate the source more than once. Each administrative
register or source that a statistical office considers using for statistical purposes
should first be analysed to see if it is usable and how it could be used.
The quality indicators in Charts 7.3, 7.4, 9.4 and 15.7 below were developed by
Laitila, Wallgren and Wallgren (2012) and are used for quality assessment of an
administrative source. During the work with these indicators, the statistical usability of the source is analysed, many usages are considered, and many combinations
with other sources or surveys are evaluated.
In Laitila, Wallgren and Wallgren (2012), the indicators above are used to analyse the
yearly income statements from all employers regarding gross pay, social security
contributions and preliminary tax during 2009. Individual data regarding all employees are delivered during January after the year in question.
From Chart 7.3: Analyse metadata
Indicator  Quality factor
A1  Relevance of population
A2  Relevance of units
A3  Relevant matching keys
A4  Relevance of variables
A5  Relevance of reference time
A6  Relevant study domains
A7  Comprehensiveness
A8  Updates, delivery time and punctuality
A9  Comparability over time

From Chart 7.4: Analyse the source
Indicator  Quality factor
B1  Quality of identifying variables
B2  Quality of reference variables
B3  Duplicates
B4  Missing values
B5  Wrong values
B6  Quality of preliminary data

From Chart 9.4: Compare source with base register
Indicator  Quality factor
C1  Undercoverage in base register
C2  Overcoverage in base register
C3  Undercoverage in the source
C4  Overcoverage in the source
C5  Can the source improve base register?

The indicators A1–A9 measure the relevance of the
source, and B1–C5 measure aspects of accuracy.
The quality indicators D1–D4 in Chart 15.7 require
much work and analytical capability. The work should
be carried out by a team with subject-matter competence and methodological training.

Chart 15.7 Compare with a relevant set of surveys

Indicator  Quality factor
    Description

D1  Is the source good or bad?
    a) Compare populations  b) Compare units  c) Compare variables

D2  Is the production system good or bad?
    a) Compare populations  b) Compare units  c) Compare variables

D3  Can the source improve other surveys?
    a) Will population be better?  b) Will units be better?  c) Will variables be better?

D4  Can the source be combined with other sources?
    a) Will population be better?  b) Will units be better?  c) Will variables be better?

What characteristics does the source have? How can it be used? There may be many
possible ways of using a source. How should the source be treated to make it
usable? Should it be combined with other sources? The analysis can be done for
different combinations of sources.

Metadata – information from the Administrative Authority²
The relevance of income statements for statistical purposes should be assessed by
subject-matter specialists. As this is work with economic statistics, experts from
the unit working with the National Accounts should also be consulted. The tax
form with explanations and the brochure on income statements (about 40 pages),
which is available to all employers, are the main sources of information from the
National Tax Agency that should be analysed. The results of this analysis regarding
the income statements are summarised in Chart 15.8.
Chart 15.8 Information from the administrative authority – relevance

Indicator  Quality factor
    Description

A1  Relevance of population
    The source contains information on jobs as employed, employed persons, enterprises that are employers, and local units where employed persons work. All of these are relevant as statistical populations.

A2  Relevance of units
    The source contains four kinds of relevant units (jobs, employees, legal units that are employers, local units with employees).

A3  Relevant matching keys
    Three important keys are combined: Identity number of the employer, Personal Identity Number of the employee, and Work site number.

A4  Relevance of variables
    Gross salary on the tax form plus benefits corresponds to the definitions used by the National Accounts.

A5  Relevance of reference time
    The Income Statements give information on wages and salaries paid to the employees during the calendar year. This definition is in accordance with the needs of the National Accounts.

A6  Study domains
    All kinds of study domains are possible.

A7  Comprehensiveness
    The source covers all employees and all employers. The source is comprehensive.

A8  Updates, delivery, punctuality
    The source is yearly. Income Statements are delivered to the Tax Board during January, but corrections are made during the whole year. Preliminary statistics can be produced before summer and final estimates during the autumn.

A9  Comparability over time
    Comparability over time is good.

² The presentation in the rest of Section 15.3 is based on the report by Laitila, Wallgren and Wallgren (2012).

The relevance of this source is very high. This data source is necessary for the
Employment Register that is a part of the register-based census. The income statements are also the best source for statistics on gross wages and are used by the
National Accounts. As three identities are combined in the income statements, this
source is a very important part of Statistics Sweden’s production system that makes
it possible to link records from many different sources with each other.
Analysis and Data Editing of the Source
The data in the income statements are in most cases generated by the employers’
computer systems or by an internet-based application. This means that editing is
done at the same time as the data are generated. Identities must be correct as tax
payments of persons and enterprises are administered with income statements.
However, two variables in the income statements are not used by the National
Tax Agency but are used only by Statistics Sweden. These are employment time
and the work site number on the income statement. If these variables are analysed,
statistically important quality issues are found. These findings are described in
Chart 15.9.
Chart 15.9 Information from analysis and data editing of the source – accuracy

Indicator  Quality factor
    Description

B1  Quality of identifying variables (Primary keys)
    190 701 or 6.4% of all income statements from enterprises with more than one local unit have missing establishment/local unit identities. After a register maintenance survey of about 4 400 of these enterprises, the local unit identity on 188 962 income statements was corrected.

B2  Quality of reference variables
    Link to the Population Register – PIN usable: Of employed persons, 5 028 405 or 99.94% had a usable PIN; 3 107 did not.
    Link to Business Register – BIN usable: All

B3  Duplicates
    Duplicates are not a problem.

B4  Missing values
    Employment time defined as the month from and month up to: 0.06% of values are missing.

B5  Wrong values
    Employment time defined as the month from and month up to: Many employers answer from January up to December even if the actual work was done during a shorter period. The aggregate wage can be small while the employment period is “long”; this indicates measurement errors.

B6  Quality of preliminary data
    Income statements are corrected by employers and this causes delay. Preliminary and final estimates were compared, and it was decided that early estimates based on data that are available during September should be used instead of final data that are available in December.

On the whole, accuracy is good but the input data quality is not sufficiently good
for the local unit identity numbers. However, after a register maintenance survey,
where questionnaires were sent to more than 4 000 enterprises, the quality of this
variable is sufficient.
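Indicators such as B1, B2 and B4 are simple proportions over the records of the source. The sketch below assumes a hypothetical record layout (the field names `pin`, `employer` and `local_unit` are not from the source) and also recomputes the share of usable PINs from the two counts reported in Chart 15.9.

```python
def share_missing(records, field):
    """Fraction of records where `field` is None or an empty string."""
    if not records:
        return 0.0
    missing = sum(1 for record in records if record.get(field) in (None, ""))
    return missing / len(records)


# Hypothetical income statements; the field names are illustrative only.
statements = [
    {"pin": "P1", "employer": "E1", "local_unit": "L1"},
    {"pin": "P2", "employer": "E1", "local_unit": None},
    {"pin": "P3", "employer": "E2", "local_unit": "L7"},
    {"pin": "", "employer": "E2", "local_unit": "L7"},
]
print(share_missing(statements, "local_unit"))

# Counts reported for indicator B2 in Chart 15.9.
usable_pin, unusable_pin = 5_028_405, 3_107
share_usable = usable_pin / (usable_pin + unusable_pin)
print(f"{share_usable:.2%}")
```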
Integrate the Source with the Base Register
The Income Statement Register is an important source for the Activity Register,
one of the four base registers. Income statements can be linked with two other base
registers – the Population Register and the Business Register. Income statements
that cannot be linked with these indicate undercoverage in these base registers. The
quality indicators C1–C5 in Chart 15.10 are based on comparisons between the
Income Statement Register and the base registers.
Chart 15.10 Information from integrating the source with base registers – accuracy

Indicator  Quality factor
    Description

C1  Undercoverage in base register
    In all, 57 905 foreigners who work and pay tax in Sweden were found in the Income Statement Register that were not found in the Population Register. The fraction of undercoverage among the population of all employed persons in the Employment Register is 1.4%.
    In the Business Register, there were 315 380 enterprises that were classified as active employers during one calendar year. According to the income statements, there were 31 393 more enterprises that were active as employers during the year in question.

C2  Overcoverage in base register
    Of the 315 380 enterprises that were classified as active employers, 11 301 enterprises or 4% were not active according to income statements.

C3  Undercoverage in the source
    Black work is a problem.

C4  Overcoverage in the source
    No problem

C5  Can the source improve the base register?
    Comparisons with the Income Statement Register show that both the coverage of the Population Register and the Business Register can be improved.
    However, the income statements should not be used for these improvements; the Population Register should be improved with data from the National Tax Agency; and the Business Register should be improved with the monthly reports from employers that today are used for the Quarterly Gross Pay survey that are available much earlier.

Using the Income Statement Register, it was possible to find potentially important
quality flaws in both the Population Register and the Business Register. Both base
registers suffer from undercoverage, and the Business Register also suffers from
overcoverage.
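The comparisons behind indicators C1 and C2 amount to set differences between the identities in the source and those in the base register. A minimal sketch, assuming both are available as lists of identity numbers; the function name and the example identities are hypothetical.

```python
def compare_coverage(source_ids, register_ids):
    """Compare unit identities in an administrative source with a base register."""
    source, register = set(source_ids), set(register_ids)
    return {
        # In the source but not in the register: possible undercoverage in the register (C1).
        "only_in_source": source - register,
        # In the register but not in the source: possible overcoverage in the register (C2).
        "only_in_register": register - source,
        "matched": source & register,
    }


# Hypothetical employer identities, for illustration only.
employers_in_statements = ["E1", "E2", "E3", "E5"]
employers_in_register = ["E1", "E2", "E4"]

result = compare_coverage(employers_in_statements, employers_in_register)
overcoverage_rate = len(result["only_in_register"]) / len(set(employers_in_register))
print(result, overcoverage_rate)
```

A non-match is only an indication of a coverage error. Whether the unit is missing from the register or wrongly present in the source still has to be judged with subject-matter knowledge.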
Integrate the Source with Surveys with Similar Variables
The Income Statement Register has been integrated with the following surveys:
– The Labour Force Survey, LFS – employment as employed can be compared
with employment indicated by the Income Statement Register. It is also important to compare employment by institutional sector and industry.
– The Quarterly Gross Pay Survey, QGP – based on monthly administrative data
with gross wages and salaries from all enterprises registered as active employers,
can be used to compare annual gross pay by sector and industry.
– The Structural Business Statistics Survey, SBS, contains aggregate wages that
can be compared with similar information in the Income Statement Register.
– The local units according to the Income Statement Register can be compared
with the local units in the Business Register.
The Income Statement Register can be directly linked with the Population Register
using the linking variable PIN. The Income Statement Register must first be
aggregated by employers before it can be linked to the Business Register. This gives the
Annual Gross Pay Register, AGP, with gross pay data for enterprises.
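The aggregation step can be sketched as a sum of gross pay per employer identity. The record layout (`pin`, `employer_id`, `gross_pay`) and the values are assumptions made for the illustration, not the actual register design.

```python
from collections import defaultdict


def aggregate_by_employer(income_statements):
    """Sum gross pay over all income statements issued by each employer."""
    totals = defaultdict(int)
    for statement in income_statements:
        totals[statement["employer_id"]] += statement["gross_pay"]
    return dict(totals)


# Hypothetical income statements: one record per job (employee-employer pair).
income_statements = [
    {"pin": "P1", "employer_id": "E1", "gross_pay": 300_000},
    {"pin": "P2", "employer_id": "E1", "gross_pay": 250_000},
    {"pin": "P1", "employer_id": "E2", "gross_pay": 40_000},
]

agp = aggregate_by_employer(income_statements)
print(agp)
```

The result has one record per employer, so it can be matched with the Business Register on the employer identity.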
Chart 15.11 Information from integrating the source with related surveys – accuracy

Indicator  Quality factor
    Description

D1  Is the source good or bad?
    When compared with the LFS, the QGP and SBS surveys, the population, units and variables in the Income Statement Register were found to be without quality flaws except that black work is not covered in administrative sources such as income statements.

D2  Is the production system good or bad?
    Many errors were found in the LFS, the QGP survey and the SBS survey after comparison with the Income Statement Register. Coverage errors in the LFS and SBS were found. Different enterprise units are used in different surveys and in surveys from different periods.
    The sector and ISIC variables were not consistent between different surveys. The coding system for sector and economic activity, ISIC, used in the LFS should be improved. The method for adjusting for black work in the National Accounts should be evaluated.

D3  Can the source improve other surveys?
    Many potential problems were found thanks to the income statements in the LFS and the QGP survey, but income statements and their aggregated version AGP are too late to be used as a source for improvement of these surveys. But the AGP can be used to improve the SBS. The quality of the SBS survey can be improved by selective editing and imputation models.

D4  Can the source be combined with other sources?
    The Income Statement Register is used for creating some of the variables in the Income Register. Income statements must be combined with other sources to give a full picture of disposable income.
    Income statements alone do not give a complete picture of the economically active population, but if they are combined with yearly income declarations for enterprises, it is possible to cover employed and self-employed persons. This combination is the basis for the Employment Register.

Above, we tested the work process and the quality indicators on one administrative
source, the income statements. The Income Statement Register has been compared
with the Labour Force Survey and the aggregated version of the income statements,
the Annual Gross Pay survey, has been compared with the Quarterly Gross Pay
survey and the Structural Business Statistics survey. We have found many potential
problems and inconsistencies within this system of surveys. The surveys studied
are currently not coherent due to these inconsistencies. These quality problems are
summarised in Charts 15.8–15.11 above.
Our work with quality assessment is intimately related to design or improvements of the surveys in the system. We have found the causes of problems and
inconsistencies, and the next step should be to reduce the effects of these problems
so that coherence is improved. We refer to the simultaneous work of improving or
redesigning a system of surveys as survey system design. The system-oriented work
with quality assessment that we have used here should be the first step in such
work with survey system design.
The systems orientation has proved to be important. Potential problems in a statistics production system can be detected when we compare many sources and
surveys. This is illustrated by the results presented above. The traditional way of
working is to consider one survey or one administrative source at a time. It is
necessary to abandon this tradition for quality and efficiency reasons and adopt a
statistical systems approach as the general method for working with administrative
data.
Conclusions
After analysing the indicators A1–D4, we conclude that the input data quality of
the income statements is very high. Indicators C1–D4 also provide information on
the quality of the production system. Coverage errors and lack of coherence are
errors that can be measured, but a better strategy is to use the information and
improve the system so that these errors are reduced. The errors of the improved
estimates will be smaller, but unknown.
This explains why error measures are rare regarding estimates from register surveys – in contrast to random errors, non-random errors can be measured and the
estimates can thereafter be corrected. But once we have made corrections, we no
longer have any quality measure:
1. Search for errors with quality indicators A1–D4 above and find the reasons for
the errors that have been found.
2. Redesign the surveys that have the errors you have found. Calculate new estimates and describe the errors of the old estimators by taking differences between
old and new estimates.
3. Be satisfied with the fact that the new estimates are the best possible. If you do
not have other sources or surveys for comparison, you cannot describe the
quality.
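Step 2 above describes the errors of the old estimators as differences between old and new estimates. A minimal sketch of that calculation by domain, using made-up numbers that do not come from any survey:

```python
def revisions(old_estimates, new_estimates):
    """Old minus new estimate for every domain present in both sets of estimates."""
    return {
        domain: old_estimates[domain] - new_estimates[domain]
        for domain in old_estimates
        if domain in new_estimates
    }


# Hypothetical estimates by domain (not from the source).
old = {"manufacturing": 1200, "construction": 830, "trade": 1510}
new = {"manufacturing": 1185, "construction": 845, "trade": 1510}

diffs = revisions(old, new)
print(diffs)
```

These differences describe the old estimates only. As step 3 notes, they say nothing about the remaining error in the new estimates.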
Berka et al. (2012) have developed a method for quality assessment regarding the
multiple sources that are used for the Austrian register-based census. They use
information based on judgement of metadata, proportion of usable values and
proportion of consistent records to obtain quality measures for the variables in the
census registers.

15.4 Output data quality
Output data are the final estimates that are produced with the statistical register that
has been created. The quality of these estimates can be described with the same
quality dimensions that are used for surveys in general:
– Relevance can be analysed with the indicators in Chart 15.8.
– Accuracy can be analysed with the indicators in Charts 15.9–15.11 and is discussed in Sections 15.5 and 15.6.
– Timeliness requires no special indicators for register-based statistics.
– Comparability and coherence require that populations, units and variables are
comparable and that integration errors are small. This is discussed in Section 15.5.
– Availability and clarity require no special indicators for register-based statistics.
Section 15.3 contains a case study where the work with analysing the input data
quality of one administrative source, the income statements, is described.

But the same method also gives a description of a number of nonsampling errors in
the Labour Force Survey and the Structural Business Statistics survey. Also errors
regarding the Population Register and the Business Register were found.
This illustrates that the basic method for quality assurance regarding both input
data quality and output data quality consists of systematic comparisons between
related registers and surveys.

15.5 The integration process – integration errors
Section 15.3 analyses the input data quality of administrative sources and investigates the quality of the register system. The third factor that determines the quality
of register-based statistics consists of the methods used to process the data when
new registers are created. This processing is often called micro integration.
Sampling errors have been regarded for a long time as the most important error in
sample surveys. Therefore, sampling designs and estimation methods have been
developed to reduce this kind of error. Twelve of the 13 chapters in Cochran
(1963) are devoted to these issues. The last chapter refers to measurement and
nonresponse errors.
There is no sampling phase in register surveys. Instead, this kind of survey is
dominated by the integration phase, where data from different sources are integrated into a new statistical register. The register population and derived objects are
created during the integration phase; variables are imported from different sources
and derived variables are created. The kinds of errors that have their origin in the
integration phase should be called integration errors.³ This category includes
coverage errors, matching errors, missing values due to non-match and aggregation
errors.
When we discuss integration errors below, we should distinguish between three
different situations with regard to the possibility of improving or describing the
quality of register surveys.
1. Register surveys where we can obtain detailed measures of one or more kinds of
errors and correct or reduce these errors. After the correction we have no quality
measures for the corrected estimates, as there are no more sources that can be
used for comparisons.
2. Register surveys where we can obtain measures of one or more kinds of errors
(perhaps from a sample survey) but not at a detailed level. Errors can therefore
not be corrected or reduced, but we have quality measures on an aggregate level.
3. Register surveys where we have not been able to measure errors.
Group 1 should be maximised and group 3 should be minimised. The method for
quality assessment consists of comparisons with other sources or surveys in the
production system. We do not have to restrict ourselves to administrative sources
in this work. When we suspect that a certain category of a register population has
quality flaws, we can conduct a register maintenance survey and send questionnaires to this category of units to measure and improve quality.
We should also use existing sample surveys to evaluate the quality of administrative sources. Sometimes we can conduct a sample survey with the primary aim of
evaluating registers and register surveys.

³ The term data processing errors could also be used, as for sample surveys; see Biemer and Lyberg (2003). We
prefer a different term for register surveys, as the processing is quite different compared with sample surveys.

15.5.1 Creating register populations – coverage errors
There are five kinds of coverage errors that can occur when different sources are
integrated to create the population for a new statistical register:
– Overcoverage, discussed in Sections 9.1.2, 13.1 and 13.2.
– Undercoverage, discussed in Sections 9.1.2, 13.1 and 13.3.
– Missing values due to undercoverage in the base register that is used for the new
statistical register. One example of this is described in Section 1.5.5.
Two kinds of errors arise due to lack of coordination between surveys:
– Overcoverage due to double counting; the same units are included in more than
one survey but they should have been included only in one.
– Undercoverage because some units have been excluded from all surveys, but
they should have been included in one.
The role of the base registers is to define the populations of all surveys at the
statistical office: register surveys, censuses and sample surveys. Therefore, the aim
should be that the base registers be of the highest possible quality – all relevant
sources should be used and the methods used to create the base registers should be
the best possible.
Section 7.3.6 measures undercoverage errors and overcoverage errors for the
Business Register. The errors could be measured when the Business Register was
combined with all relevant sources that had not been used in the creation of the
Business Register.⁴ This is an example where we can obtain detailed measures of
the errors and correct or reduce the errors. After the correction, it is not possible to
have any measures of coverage errors for the corrected estimates.
Errors due to lack of coordination between surveys
Enterprises are difficult statistical units because they split and merge and change.
Thus it is difficult to produce economic statistics where different economic surveys
are consistent. Those who work with the National Accounts are accustomed to
obtaining inconsistent estimates and must make the necessary adjustments to
produce GDP estimates and other estimates for the different accounts.
The target population of enterprises for the Yearly National Accounts consists of
all enterprises that were active during some part of the calendar year. Section 7.3.6
describes this kind of enterprise population. A calendar year population can be
created if all relevant administrative sources are used, and this register can be used
to measure inconsistencies in the system of enterprise surveys. In Chart 5.10 the
calendar year population of all legal units by sector and economic activity is
shown. A number of surveys are used to measure different economic variables for
all these domains of study.

⁴ The Business Register at Statistics Sweden is being revised. More sources will be used in the new IT system, so
coverage errors will be reduced.

From Chart 5.10 Legal units by institutional sector and economic activity – which
units are included in each survey describing parts of this calendar year population?

                                             Institutional sector
Economic activity                  Non-financial      Financial     Government   Sole traders     Non-profit
                                     enterprises    enterprises                                organisations
Agriculture, forestry, fishing            11 354              0             13        236 467            546
Manufacturing, mining, energy             33 743              1             13         23 717            139
Construction                              44 611              0              0         49 161             62
Trade and transport                       96 626              1              5         61 606            246
Hotels and restaurants                    18 598              2              0         10 966            255
Information, communication                29 010              1              1         25 807            318
Financial intermediation                  10 852          2 060             10            683          1 116
Real estate, business activities         157 163             15             49        112 719         10 914
Government                                    70              0            298             61            247
Education                                  8 738              0            120         14 277            985
Health and social work                    14 196              0            256         17 847            979
Personal and cultural services            21 837              1             94         80 281         25 949

To achieve an estimate of GDP, all enterprises in this population should be measured once – no enterprise should be double counted and no enterprise should be
excluded. However, the population is not measured by one survey. Instead, a
number of economic surveys measure different parts defined by sector and industry. Different units at Statistics Sweden are responsible for some of the surveys and
a number of national institutes for agriculture, energy, etc. are responsible for their
respective parts.
These surveys are monthly, quarterly or yearly; some are sample surveys and
others are register surveys. All these factors make it very difficult to achieve consistent and coherent estimates. However, the inconsistencies can be measured
afterwards.
When all surveys have been completed and all administrative sources are available, all the economic surveys can be checked against the calendar year register.
Overcoverage due to double-counting can be measured as well as undercoverage
due to exclusions. We have found evidence of substantial inconsistencies in this
way. This kind of knowledge should be used to improve the system of surveys.
This is a difficult task as it involves many statistical agencies and managers.
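Checking the surveys against the calendar year register can be sketched as counting how many survey populations each register unit belongs to. The unit and survey names below are hypothetical, and the sketch assumes that every unit should be covered by exactly one survey.

```python
from collections import Counter


def coordination_errors(register_units, survey_populations):
    """Return (double_counted, excluded) units relative to the calendar year register."""
    # Count in how many surveys each unit appears (at most once per survey).
    counts = Counter(
        unit for units in survey_populations.values() for unit in set(units)
    )
    double_counted = {unit for unit in register_units if counts[unit] > 1}
    excluded = {unit for unit in register_units if counts[unit] == 0}
    return double_counted, excluded


# Hypothetical legal units and survey populations.
calendar_year_register = {"U1", "U2", "U3", "U4"}
surveys = {
    "survey_a": ["U1", "U2"],
    "survey_b": ["U2", "U3"],
}

double_counted, excluded = coordination_errors(calendar_year_register, surveys)
print(double_counted, excluded)
```

Units in `double_counted` contribute to overcoverage due to double counting, and units in `excluded` contribute to undercoverage due to exclusion.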
Checking register populations with area sampling
The population of persons and households and the population of local units can be
measured with sample surveys based on area frames. Theoretically, these area
frame-based estimates are free of coverage errors.
This method, which is described in Section 13.3, enables the measurement of
coverage errors of register populations at an aggregate level. However, the information regarding errors is not detailed enough to correct all register-based estimates.

15.5.2 Creating statistical units – errors in units
Enterprise units and households are two kinds of statistical units that cause difficult
methodological issues and quality problems. They change over time and changes
due to mergers and splits are often not recorded in the administrative sources.
To find errors in units, data from different sources should be combined and similar variables should be compared by consistency editing, as described in Section
9.1.3. When similar variables differ significantly within the same record, this
indicates that something is wrong.
The problem here is that measurement errors and errors in units look alike and
there is a risk that errors in units are misinterpreted as measurement errors or errors
in variables. The symptom is the same, but the treatment should be completely
different. Errors in units should be treated by creating better derived units or rejecting false positive matches. Errors in variables should be treated by replacing discarded values by imputed values. It is important that the editing to find and correct
errors in units is completed before the work with editing to find and correct errors
in variables.
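Consistency editing of this kind needs a function that flags records whose similar variables disagree. The turnover chart under "Enterprise units" in this section lists a Distance value per record without defining the function. The sketch below assumes the mean absolute pairwise difference, which reproduces the listed values for the rows used here after rounding. This is an inference from the printed numbers, not the source's stated definition, and the threshold is hypothetical.

```python
from itertools import combinations


def distance(*values):
    """Mean absolute difference over all pairs of values (assumed definition)."""
    pairs = list(combinations(values, 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


# (SBS, YIT, VAT) turnover in USD million for three records from the chart.
records = {
    160001: (7179, 11941, 8089),
    160002: (2954, 0, 0),
    164159: (34, 34, 34),
}

THRESHOLD = 100  # hypothetical cut-off for "significantly different"
flagged = [bin_id for bin_id, turnover in records.items() if distance(*turnover) > THRESHOLD]
print(flagged)
```

A flagged record shows only that something is wrong. Whether the cause is an error in units or an error in variables has to be decided in the subsequent editing.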
Enterprise units
In the chart below, we can eliminate two explanations for the extreme inconsistencies between turnover values for the records with the same identity numbers. As we
know that the quality of the BIN identities is very good, we can eliminate the
explanation that the inconsistencies for BIN 160001–160013 have been created by
false positive matches. We can also eliminate the explanation that the YIT and
VAT values have large measurement errors – taxation data for these big companies
should be trustworthy.
From Chart 2.3 Yearly turnover for the same enterprises in three sources, USD million

BIN          SBS       YIT       VAT  Distance
160001     7 179    11 941     8 089     3 175
160002     2 954         0         0     1 969
160003       843     3 561       918     1 812
160004     5 514     2 888     2 895     1 751
160005        26       538     2 536     1 673
160006     2 301         0         0     1 534
160007     2 211         0     2 239     1 493
160008     1 316     1 316         0       877
160009       638       638         0       425
160010       456         0       435       304
160011       141       141         0        94
160012       113         0       127        85
160013        65         0        63        43

164159        34        34        34         0
164160        19        19        19         0

BIN = Business identity number of each legal unit/entity
SBS = Turnover according to Statistics Sweden’s SBS questionnaire
YIT = Turnover according to the yearly income tax returns
VAT = Turnover according to 12 monthly VAT returns
Distance = an editing function defined to find records with inconsistent turnover values


The explanation instead is that the three columns with turnover values contain data
describing enterprise units that can be different, even if the same identity numbers