打开

Chapter 10

第十章

What Data Cleaning Can and Can’t Catch

数据清洗能做什么不能做什么

By Dyanna Gregory

Now that we understand what data cleaning is for and what methods and approaches there are to shape up our dataset, there is still the question of what cleaning can and can’t catch.

现在我们了解了数据清洗的目的以及改进我们数据集的方法和途径,但仍然存在一个问题:数据清洗能做什么不能做什么?

A general rule for cleaning a dataset where each column is a variable and the rows represent the records is:

对列代表变量而行代表记录的数据集进行清理的一个常用规则是:

It should be made clear that exclusion is not the same as deletion! If you decide that you don’t want to include a row or column in your analysis or visualization, you should set them aside in a separate dataset rather than deleting them altogether. Once data are deleted, you can’t retrieve them any longer, even if you realize later on that there was a way to fill in the missing values. Unless you are absolutely certain that you will not use a record or variable again, do not just delete it.

需要弄清楚的是剔除并不等同于删除!如果你决定在你的分析或可视化中不使用某行或某列的话,你应该把它们放入备份数据集而不是完全删除。一旦数据被删除,你再也无法恢复它们,甚至以后你想到了填补该缺失值的方法也不行。除非你完全确定再也不会用这个记录或变量,否则不要删除它。

In the last few chapters, we have talked about several different processes for data cleaning and have seen the types of problems they can help identify and fix. When we’re searching for errors and mistakes, we are able to detect potential problems such as:

在前几章中,我们已经谈过几种不同的数据清洗过程,并且看到了它们能够帮助识别并解决的问题类型。当我们在查找错误时,我们能探测出这些潜在问题:

What we haven’t talked a lot about yet is what data cleaning can’t catch. There may be incorrect values that are nevertheless both within the acceptable range for the data and that make complete sense. For example, if someone enters the number 45 instead of 54 into your dataset and your valid range of numbers is 0-100, it will be unlikely that you’ll catch that error unless that field is one that you’re cross-checking with another field or you’re verifying the information with an outside source record.

对于数据清洗不能做什么,我们还没说太多。可能会有些不正确的值,但它们在可接受范围内且有完整意义。举个例子,如果有人在你的数据集里输入45来替代54,同时你的数据有效范围是0-100,除非是你用别的领域来检查这个领域,或者你用外部数据源的记录来验证信息,否则你不大可能发现这个错误。

Similar to that, you may be receiving information from an online survey form and the person filling it out may have selected the button for “Strongly Agree” when they actually meant to select “Strongly Disagree.” Again, unless this answer is somehow cross-checked with another variable or source, you will have no easy way to detect this error. Sometimes this type of error is more critical than others. If a person selects “Strongly Agree” instead of “Agree” on an opinion survey, that is unlikely to have the same impact on the results as if someone accidentally marks the wrong gender on a form for a research study where you are using gender as a grouping category for treatment assignments.

类似的,你也许会收到线上调查表格的信息,而填表人可能本来想选 “强烈反对”却点了“强烈同意”的按钮。还是那句话,除非这个回答用别的变量或数据源以某种方式进行核查,否则你想检测出这个错误是不太容易的。有时候这种形式的错误更为严重。如果一个人在一项意见调查中选择了“强烈同意”而不是“同意”,这种情况对结果造成的影响和如果有人意外地将用于研究的表格上的性别填错了而你又把性别作为处理任务的分类标签所造成的影响还不太一样。

Data cleaning also can’t tell if a missing value is truly missing (i.e. the question was accidentally skipped or the data were not collected for some reason) or the question was purposely skipped (i.e. the participant declined to answer) unless “Prefer not to answer” was an answer choice. This may be relevant in some cases (particularly in demographics), though in others, you may decide to just treat both as missing data. This is why, as mentioned before, you need to include a “Prefer not to answer” choice for any question of a personal nature where you want to know if the data are truly missing, since some people may actively choose to not answer questions about race/ethnicity, income, political affiliation, sexual orientation, etc.

数据清洗也不能告诉你缺失值是否真的缺失了(也就是,该问题被意外地跳答了或者因为某种原因数据没有被收集)还是问题被故意跳答了(也就是参与者谢绝回答),除非“不愿意回答”也作为一个回答选项。这个在有些情况下可能是意义重大的(尤其在人口统计中),尽管在其他情况下你可能决定把以上两种都当作数据缺失处理。这就是为什么,如前面提到的,如果你想知道数据是否真的缺失,就需要对任何有关个人本质的问题设计一个“不愿意回答”的选项,因为有些人可能不愿意回答关于人种/种族,收入,政治背景,性取向等方面的问题。