打开

Chapter 9

第九章

Types of Data Checks

数据核实的类型

By Ian Tebbutt

著:Ian Tebbutt,译:岚ind

In the last chapter, we looked at data cleaning and the checking processes that are necessary to make that happen. Here, we’ll take a more in-depth look at data checking and talk about other validation processes, both before and after cleaning occurs.

在上面的章节中,我们已经了解到数据清洗和数据检核查是数据分析中重要部分。这章,我们会深入探讨几种其他的数据核查及验证过程,包括在数据清理前及之后后进行。

Data checking is crucial if you and your audience are going to have confidence in its insights. The basic approach is quite straightforward: you have fields of data and each of those fields will have expected values. For instance, an age should be between 0 and 120 years (and in many cases will be less than 80 years). Transaction dates should be in the recent past, often within the last year or two, especially if you’re dealing with an internet-only data source, such as a Twitter stream.

如果你和你的用户相信数据核查的洞察力,那么它就至关重要。它的基本方法非常简单:你有多个字段的数据,每个字段都会有预期值或者说是范围。例如,年龄字段的值应该在0~120之间(很多情况下小于80).交易日期应该刚刚过去不久,通常在过去的1到2年内,尤其是你在处理像推特流一样的网络数据源时。

However, data checking, although easy to understand and important to do, is a complex problem to solve because there are many ways data can be wrong or in a different format than we expect.

然而,尽管数据核实容易理解且非常重要,它仍然是一个极需解决的复杂问题,因为导致数据错误的原因繁多,出错的形式可能超出意料。

When to Check

何时核实

Consider this dataset from a telecom company with information about customers who are changing phone numbers. Here, they provided the database but didn’t check the data which were aggregated from hundreds of smaller phone service providers. The database is still in daily use and is a great example of why checking is important. Imagine you’re tracking the types of phone charges by age. The example below shows a few of the issues.

想象一下,一个电信公司的数据集,涉及更换电话号码的客户信息。这里,他们提供了数据库,但并没有核实这个由上千个小型电话服务供应商汇集的数据。目前,这个数据库仍然每天在使用,恰好它可以作为说明数据核实为何如此重要的典型例子。想象一下,当你需要追踪不同年龄阶段的电话费时。下面的例子展示了其中的一些问题。

RecordId PhoneNumberType Age New Customer Price
1 MOBILE 0 NO $12.45
2 Mobile 47 Y 12 45
3 Land Line 34 YES 37
4 LandLine 23 YES 1.00
5 LL 112 Y $1K

The basic rule for data checking is check early, check often. By checking early, when data are first entered, there is a chance to immediately correct those data. For example, if your “New Customer” field expects the values YES or NO, but the user enters something different, such as an A or a space, then the user can be prompted to correct the data. If the validation isn’t done until later then incorrect values will reach the database; you’ll know they’re wrong but will be unable to fix the issue without returning to the user to ask for the information. Occasionally, it’s possible to compare incorrect fields with other linked datasets and then use that information to fix the original data. That can be complex and lead to further issues, since you have to decide which data source is correct.

数据核实的基本原则是尽早核实、经常核实。尽早核实,即第一次纳入数据时,可以尽快更正错误数据。例如,在”新客户“字段中你希望顾客填写”是/否“,但用户填写了其他内容,比如一个字母A或者一个空格,那么可以及时提醒用户纠正错误。如果没有这个验证过程,错误的数据就会进入数据集;你知道它们是错误的,但不将问题返还用户询问相关信息,就不能修正错误。偶尔也会将错误字段与其他关联数据集做对比,用这一信息修正原始数据。这会很复杂,而且会导致其他问题,因为你必须决定哪个数据源的信息是正确的。

If you’re in the happy position of controlling how the data are gathered, you have a great advantage, as one of the easiest forms of checking is done as soon as the data are entered. This type of data check is called a front-end check or a client-side check because it happens at the moment that the user enters the data, before the data are submitted to the database. This is most commonly done by making sure that your data collection application or web page is designed to only accept valid types of input. You have probably encountered this type of data validation yourself when filling out forms on the web before.

如果你能控制数据收集的方法,那么你有一个巨大的优势,数据核实最简单的一种形式是在数据进入数据集时立即检测。这种数据核实被称为前端审核或客户端审核,因为它在用户输入数据时、数据提交到数据库之前进行核实。我们可以编写只接收正确数据类型的数据收集应用程序或网页来实现前端审核(客户端审核)。也许以前在网页上填写表格时,你也接触过这种类型的数据验证。

Form validation on the web

For example, states and countries should be selected from a list and if you’re dealing with international data, the choice of country should limit which state choices are available.

例如,如果你在处理国际数据,国家名称应该从特定的列表中选择,接下来选择具体的州市时,其选择项跟着先前所选国家而变化。

Country and state drop down lists

In this way your system will be constrained to only allow good data as they are entered. The approach isn’t perfect though. A classic internet speed bump is data entry that waits until submission before letting on there was an issue in a field at the beginning of the form. A better approach is to check every field as it is entered, but that has other disadvantages as it can be harder to code and can result in a continual stream of checking requests being sent to the server and potential error messages being returned to the user. As a compromise, some simple checking and validation can be carried out entirely in the browser while leaving the more complicated work for server-side processing. This will be necessary as some checking will require data or processes that are only available on the server itself. This often occurs when secure values—credit card verification, for instance—are being checked.

通过这种方式使系统受限,以便数据输入时只允许正确数据进入。但这种方法并不完美。一个典型的网络减速状况在于数据输入等待提交环节。因为大多数情况下,都是在提交时才被提醒前面表格中某个字段填写有问题。一种较好的方法是输入数据时核实每一字段,但这种方法有其他缺点,它要求的编码难度更大而且会造成审核请求不断的发送到服务器,并将潜在的错误的信息返还用户。作为一种折衷方法,一些简单的核实与验证可以完全在浏览器端进行,而更为复杂的工作则留给服务器处理。这种方法很有必要,因为一些核实会要求只能在服务器可用的数据或程序支持。这种方法经常在核实安全系数时使用(如,信用卡认证)。

Other good survey design policies to consider to minimize data preparation time include:

为了最小化数据的准备时间,其他好的调查设计策略包括:

Trust Nobody

不要相信任何人

No matter how well you have designed your form and how much validation you have put into your front-end checks, an important rule of thumb is never trust user data. If it has been entered by a person, then somewhere along the line there will be mistakes. Even if there are client side checks, there should always be server side or back-end checks, too—these are the checks that happen after the data are submitted. There are many good reasons for this. You might not have designed the data gathering tools and if not, you could have different front end applications providing data. While some may have excellent client side checks, others might not. Unclean or unchecked data may arrive in your system through integration with other data services or applications. The example telecom database had too many owners with little oversight between them, resulting in a messy dataset. A small amount of extra effort up front saves us time and frustration down the road by giving us a cleaner dataset.

不管你设计的格式有多规范,你的前端检查中设置了多少审核程序,一条重要的原则是不要相信用户数据。即使是个人提交的数据,那某处数据便会有错误。即使经过客户端检查,也应改在服务器端或后台检查,这是数据提交后进行的检查。这样做有很多理由。你也许没有并没有设计数据采集工具,如果是这样,你需要使用不同的前端应用程序提供数据。也许有些会有完美的客户端检查,但另一些没有。不干净或未核实的数据也许会通过其他服务器或应用程序的集成进入你的系统。电信数据集的例子中,有太多小运营商没有对数据监管,导致数据集凌乱。前端少许额外工作可以向我们提供更为干净的数据,进而节省我们的时间,减轻数据处理过程中的无奈感。

A second golden rule is to only use text fields where necessary. For instance, in some countries it’s normal to gather address data as multiple fields, such as Line1, Line2, City, State, Country, postcode, but in the UK it’s very common to just ask for the postcode and the street number as those two pieces of information can be then be used to find the exact address. In this way the postcode is validated automatically and the address data are clean since they aren’t not entered by the user. In other countries, we have to use text fields, and in that case lots of checking should occur.

第二条黄金定律是只在必要时使用文本字段。例如,某些国家通常会使用多字段地址数据信息,如1号线、2号线、市、省、国家、邮编,但在英国,只收集邮编和街道号码很正常,因为这两条信息可以查找到精确的地址。这样,邮编是自动变量,而地址信息由于不是用户输入得到,所以很“干净”。在另外一些国家,我们必须使用文本字段,这样会产生大量核实工作。

Commas in data can cause all kinds of problems as many data files are in comma separated (CSV) format. An extra comma creates an extra unexpected field and any subsequent fields will be moved to the right. For this reason alone it’s good to not cut/paste data from an application; instead save to a file and then read into your next application.

数据中的逗号会产生各种问题,因为许多数据文件的不同字段都是以逗号分隔的(CSV)。一个额外的逗号会产生一个意想不到的字段,其后的字段会移到原本所在位置的右侧。仅仅因为这个原因,最好不要从应用程序中剪切/粘贴数据;而是保存到文件并读取到你的下一个应用程序中。

Title Name FamilyName Address1 Address2 Town State Country  
Bill Short   13 A The Street Hastings VIC AUS  
  Mr William Tall 27 The Close   Guildford VIC AUS

An additional comma in the second record has pushed the data to the right. This is an easy problem for a human to spot, but will upset most automated systems. A good check is to look for extra data beyond where the last field (“country”) would be. This example emphasizes the importance of combining computerized and manual approaches to data checking. Sometimes a quick visual scan of your data can make all the difference!

第二条记录中的额外逗号把数据向右推移。人们很容易发现这个问题,但大部分自动化系统发现这个问题却有一定难度。一个好的核查方法是查看在最后一个字段(“国家”字段)右侧的数据是什么。这个例子强调了数据核查过程中计算机和人工结合的重要性。有时对数据的快速浏览可能造成完全不同的结果。

Data Formats and Checking

数据格式与审核

When dealing with numbers there are many issues you should check for. Do the numbers make sense? If you’re handling monetary figures, is the price really $1,000,000 or has someone entered an incorrect value? Similarly, if the price is zero or negative, does that mean the product was given away or was someone paid to remove it? For accounting reasons many data sources are required to record negative prices in order to properly balance the books.

在处理数据的时候,你有很多方面需要去核查。这些数字有意义吗?如果你处理货币指数,物品的价格真的是$1,000,000还是某人输入了错误的数值?与此相似,如果价格是0或负数,这是否表明产品被送掉了,还是某人为移除它做了付款?因为审计原因,许多数据源要求记录负数价格,以便妥善处理收支平衡。

In addition, checking numeric values for spaces and letters is useful, but currency and negative values can make that hard as your data may look as follows. All of these are different and valid ways of showing a currency, and contain non-numeric character values.

此外,核实数值中的空格和字母是十分有用的,但货币和负数使这种操作变得困难,因为你的数据如下所示。货币数据的有效的表示方式多种多样,同时包含非数值型的数据。

$-1,123.45
(1123.45)
-US$1123.45
-112345E-02

Letters in numbers aren’t necessarily wrong, and negative values can be formatted in a variety of ways.

数值中的字母并不一定是错误的,负数有多种不同的格式。

Dates also exhibit many issues that we have to check for. The first is the problem of differences in international formatting. If you see the date 1/12/2013, that’s January 12, 2013 in America, but in the UK it’s December 1. If you’re lucky, you’ll receive dates in an international format such as 2014-01-12. As a bonus, dates in this standardized format (http://whatis.techtarget.com/definition/ISO-date-format) can be sorted even if they’re stored as text. However, you might not be lucky, so it’s important to check and make sure you know what dates your dates are really supposed to be, particularly if you’re receiving data from respondents in multiple countries with different date formats. A good way to handle this if you are designing the data entry form is to make the date field a calendar button field, where the user selects the date off a calendar instead of entering it manually. Alternatively, you can specify the format next to the entry box as a sort of instruction for the user.

日期也会展示出我们必须核实的各种问题。第一个问题是国际格式的不同。如果你看到写成1/12/2013的日期,在美国它表示2013年1月12日,但在英国它是12月1日的表述。如果你够幸运,你接收到的日期是国际通用格式,如2014-01-12。作为鼓励,这个标准格式网站(http://whatis.techtarget.com/definition/ISO-date-format)可以归类,即使原始数据是文本性。然而,你也许没有这么幸运,所以核实并确保你知道你要的日期格式非常重要,特别是当你的数据从多个国家的反馈者中收集而来,并以不同的日期格式展现。处理这个问题的一个好方法是在设计数据条目格式时将日期字段设置成一个日历按钮,使用户从日历中选择日期而不是手工输入。另一种方法是,你可以在临近输入框的地方向用户说明所指定的输入格式。

Examples of input fields

Another checking task that you may encounter is the analysis and validation of others’ work to make sure the visualizations and numbers actually make sense. This can happen in a work situation where you need to proof other people’s work of others or online where public visualizations will sometimes provide the underlying data so you can try your own analysis. In both instances the first check is to just recalculate any totals. After that, look at the visualization with a critical eye: do the figures make sense, do they support the story or contradict it? Checking doesn’t have to be just for errors. It can be for understanding, too. This will give you good experience when moving on to your own data checking and is the first thing to try when someone passes you a report.

你可能遇到的另一个核查任务是分析、验证其他人的工作,以确保可视化和数值具有实际意义。一般出现在你需要证明其他人的工作或者验证网络上暗含意义的公共可视化数据,这样你可以自己尝试进行分析。在这两个例子中第一次的核查仅仅需要统计各项的数据之和。在此之后,我们要以严苛的眼光看待可视化成果:数据是否有意义,它们是否支持结论或与结论相悖?核查并不仅仅针对数据错误,也可以针对数据的理解。当继续你的数据核查时,这会带给你良好的体验,这也是别人向你发送一份报告后你应该做的第一件事情。

Data Versions

数据版本

Another big source of data checking problems is the version of the data you’re dealing with.

另一个数据核查问题的主要来源是你所处理的数据版本。

As applications and systems change over the years, fields will be added, removed, and—most problematic—their purpose will be changed. For instance the Australian postcode is 4 digit and is stored in a 4 character field. Other systems have been changed to use a more accurate 5 digit identifier called the SLA. When data from those systems are combined, we often see the 5 digit values chopped down to fit into a postcode field. Checking fields for these kinds of changes can be hard: for postcodes and SLAs, the lookup tables are in the public domain, but it takes additional investigation to realize why a location field with 4 digits matches values from neither table.

由于应用程序和系统随着时间推移不断变化,字段可能会增加或移除而且它们的设定目的会发生改变。例如,澳大利亚的邮编是4位的,并存储在4个字符字段中。其他系统则使用SLA方法改用更为精确的5位识别码。当系统合并时,我常常看到5位的数值被减掉一位数以适合邮编字段。这类变动的字段核查非常困难,公共领域的邮编和SLA数值的速查表与任何值都不匹配,但需要额外调查以回答位置字段为何是4位。

You should consider collecting additional fields which won’t be part of the actual visualization or final report but will give you important info about your records, like when they were created. If new fields are added after the dataset is created, any existing records won’t have that field filled and if someone is improperly handling the dataset, the older records may have new numeric fields filled with zeroes. This will throw off averages and the effect on your visualizations would be huge. If you have the record creation date, you can go through and change the incorrectly added zeroes to a missing value to save your data. For those fields that have been removed, a similar issue might be seen. It’s relatively rare for unused fields to be removed from data but they can sometimes be repurposed, so figuring out the meaning of a specific piece of data can be challenging if the functional use of a field has changed over time.

你应该认真考虑采集附加字段,附加字段不是实际调查中的一部分,但当创建附加字段时,它们会提供与记录相关的重要信息。如果在数据集生成后添加了新的字段,任何现有的记录都不会填充这一字段,如果有人对数据集做了不当处理,原有数据的新增数值字段可能会填充为0。这会降低平均值,同时对你的可视化成果产生巨大影响。如果你有记录的创建日期,你可以回溯并将错误输入的0改为缺省值以便保存数据。对已经移除的字段,可以看到相似的问题。从数据中移除未使用的字段相对少见,只因为有时会将它们改作他用,所以如果字段的使用功能随着时间发生了改变,指出特定数据的含义是一件具有挑战性的事情。

Amount PaymentType ServerId CreatedOn
$100 CC    
$143 Cash    
$27 Amex 237 3/1/2013
$45 Cash 467 3/1/2013

Here you can see the effect of adding two new fields, ServerId and CreatedOn, to an existing data source. It’s likely that change was put into place 03/01/2013 (March 1, 2013), so if your analysis is only looking at data since that date then you can track sales/server. However, there’s no data before that in this source, so if you want to look at what was happening on January 1, 2013, you need to find additional data elsewhere.

这里你可以看到在已存在的数据中新增了“服务器ID”和“创建日期”字段后的影响。似乎数据集在2013年3月1日开始了变化,如果你的分析只限于那天之后的数据,那你就可以追踪销售额/服务人员。但是这一数据源没有此前的相关数据,所以如果你想查找2013年1月1日的数据,你需要在其他地方寻找更多的数据。

One of the most important checking issues is that the meaning of fields and the values in them may change over time. In a perfect world, every change would be well-documented so you would know exactly what the data means. The reality is that these changes are rarely fully documented. The next best way of knowing what the data in a field really represents is to talk the administrators and users of the system.

核实问题最重要的问题之一是字段的含义和数值也许会随时间而改变。在理想世界中,每一次的改变都有详细记录,这样你可以明了数据的真实含义。但实际情况是这些改变很少会有完整的记录文档。要了解某一字段的数据的真实含义,另一个好办法是与系统管理员和用户进行沟通。

These are just some of the steps that you can take to make sure you understand your data and that you’re aware of potential errors. In the next chapter, we’ll talk about some of the other sneaky errors that may be lurking in your data, and how to make sense of those potential errors.

这仅仅是一些可以确保你能理解数据并意识到潜在错误的步骤。在下一章里,我们会继续讨论可能隐藏在数据中的不明显错误,以及如何识别这些潜在错误。