Chapter 1

Basic Data Types

Author: 米歇尔·卡斯特罗

There are several different basic data types and it’s important to know what you can do with each of them so you can collect your data in the most appropriate form for your needs. People describe data types in many ways, but we’ll primarily be using the levels of measurement known as nominal, ordinal, interval, and ratio.

Levels of Measurement

Let’s say you’re on a trip to the grocery store. You move between sections of the store, placing items into your basket as you go. You grab some fresh produce, dairy, frozen foods, and canned goods. If you were to make a list that included what section of the store each item came from, this data would fall into the nominal type. The term nominal comes from the Latin word “nomen,” which means “name”; we call this data nominal data because it consists of named categories into which the data fall. Nominal data is inherently unordered; produce as a general category isn’t mathematically greater or less than dairy.

Nominal

Nominal data can be counted and used to calculate percentages, but you can’t take the average of nominal data. It makes sense to talk about how many items in your basket are from the dairy section or what percentage is produce, but you can’t calculate the average grocery section of your basket.
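
The counting and percentage calculations described above can be sketched in a few lines of Python. This is only an illustration: the basket contents and the helper name `section_percentages` are made up for the example. The most frequent category (the mode) is shown as the closest thing to a "typical value" that nominal data offers.

```python
from collections import Counter

# Hypothetical basket: one grocery-section label per item (nominal data).
basket_sections = [
    "produce", "produce", "dairy", "frozen",
    "canned", "dairy", "produce", "canned",
]

def section_percentages(sections):
    """Return {section: percent of items} for a list of nominal labels."""
    counts = Counter(sections)
    total = len(sections)
    return {section: 100 * n / total for section, n in counts.items()}

counts = Counter(basket_sections)
percentages = section_percentages(basket_sections)

# The most frequent category is well defined; an "average section" is not.
most_common_section = counts.most_common(1)[0][0]

print(counts["dairy"])         # how many items came from the dairy section
print(percentages["produce"])  # what percent of the basket is produce
print(most_common_section)
```

Note that nothing here adds or averages the labels themselves; only the tallies are treated as numbers.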

When there are only two categories available, the data is referred to as dichotomous. The answers to yes/no questions are dichotomous data. If, while shopping, you collected data about whether an item was on sale or not, it would be dichotomous.

[Figure: shopping basket proportions]

Ordinal

At last, you get to the checkout and try to decide which line will get you out of the store the quickest. Without actually counting how many people are in each queue, you roughly break them down in your mind into short lines, medium lines, and long lines. Because data like these have a natural ordering to the categories, they’re called ordinal data. Survey questions that have answer scales like “strongly disagree,” “disagree,” “neutral,” “agree,” “strongly agree” are collecting ordinal data. No category on an ordinal scale has a true mathematical value. Numbers are often assigned to the categories to make data entry or analysis easier (e.g., 1 = strongly disagree, 5 = strongly agree), but these assignments are arbitrary and you could choose any set of ordered numbers to represent the groups. For instance, you could just as easily decide to have 5 represent “strongly disagree” and 1 represent “strongly agree.”
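
To make the "arbitrary coding" point concrete, here is a small Python sketch. The responses and the three coding dictionaries are hypothetical; the idea is simply that any ordered set of numbers, including the reversed one, identifies the same middle category.

```python
from statistics import median_low

# Answer scale from the text, listed from lowest to highest agreement.
ORDER = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]

# Hypothetical survey responses (an odd number, so the median is one response).
responses = ["agree", "neutral", "strongly agree", "disagree", "agree"]

# Three codings that all respect the ordering of the categories.
coding_1_to_5 = {label: i + 1 for i, label in enumerate(ORDER)}
coding_tens = {label: 10 * (i + 1) for i, label in enumerate(ORDER)}
coding_reversed = {label: 5 - i for i, label in enumerate(ORDER)}  # 5 = strongly disagree

def median_category(responses, coding):
    """Code the responses, take the median code, and map it back to a label."""
    decode = {code: label for label, code in coding.items()}
    return decode[median_low([coding[r] for r in responses])]

for coding in (coding_1_to_5, coding_tens, coding_reversed):
    print(median_category(responses, coding))
```

The numeric codes differ in every run of the loop, but the category they point to does not, which is why the codes carry order and nothing more.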

[Figure: scale]

Like nominal data, you can count ordinal data and use them to calculate percents, but there is some disagreement about whether you can average ordinal data. On the one hand, you can’t average named categories like “strongly agree” and even if you assign numeric values, they don’t have a true mathematical meaning. Each numeric value represents a particular category, rather than a count of something.

On the other hand, if the difference in degree between consecutive categories on the scale is assumed to be approximately equal (e.g. the difference between strongly disagree and disagree is the same as between disagree and neutral, and so on) and consecutive numbers are used to represent the categories, then the average of the responses can also be interpreted with regard to that same scale.
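
A minimal sketch of that second position follows, assuming equal spacing between categories and consecutive codes from 1 to 5. The responses and the helper `bracketing_labels` are invented for illustration; whether the equal-spacing assumption is reasonable for a given survey is a judgment the code cannot make.

```python
import math
from statistics import mean

ORDER = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]
CODES = {label: i + 1 for i, label in enumerate(ORDER)}  # consecutive codes 1..5

# Hypothetical responses to one survey question.
responses = [
    "agree", "agree", "neutral", "strongly agree",
    "disagree", "agree", "neutral", "agree",
]

average_code = mean(CODES[r] for r in responses)

def bracketing_labels(average):
    """Return the two scale labels that an average code falls between."""
    lower = ORDER[math.floor(average) - 1]
    upper = ORDER[math.ceil(average) - 1]
    return lower, upper

print(average_code)
print(bracketing_labels(average_code))
```

Read against the same scale, the average lands between two adjacent labels, which is the interpretation the paragraph above describes.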

Interval

Enough ordinal data for the moment… back to the store! You’ve been waiting in line for what seems like a while now, and you check your watch for the time. You got in line at 11:15am and it’s now 11:30. Time of day falls into the class of data called interval data, so named because the interval between each consecutive point of measurement is equal to every other. Because every minute is sixty seconds, the difference between 11:15 and 11:30 has the exact same value as the difference between 12:00 and 12:15.

Interval data is numeric and you can do mathematical operations such as addition and subtraction on it, but it doesn’t have a “meaningful” zero point; that is, the value of zero doesn’t indicate the absence of the thing you’re measuring. 12:00 am isn’t the absence of time; it just marks the start of a new day. Other interval data that you encounter in everyday life are calendar years and temperature. A value of zero for years doesn’t mean that time didn’t exist before that, and a temperature of zero (when measured in Celsius or Fahrenheit) doesn’t mean there’s no heat.
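
One way to see what the missing zero point costs you is to compare differences with ratios. The sketch below uses the clock times from the story and two arbitrary example temperatures; `c_to_f` is the standard Celsius-to-Fahrenheit conversion.

```python
from datetime import datetime

# Differences between clock times are meaningful and comparable.
fmt = "%H:%M"
wait_a = datetime.strptime("11:30", fmt) - datetime.strptime("11:15", fmt)
wait_b = datetime.strptime("12:15", fmt) - datetime.strptime("12:00", fmt)
print(wait_a == wait_b)  # both intervals are 15 minutes long

def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32

# Ratios are not meaningful: the answer changes with the unit,
# because neither scale's zero means "no heat".
ratio_in_celsius = 20 / 10
ratio_in_fahrenheit = c_to_f(20) / c_to_f(10)
print(ratio_in_celsius, ratio_in_fahrenheit)

# The difference, by contrast, converts consistently (10 C degrees = 18 F degrees).
print(c_to_f(20) - c_to_f(10))
```

Saying that 20 degrees is "twice as hot" as 10 degrees only holds in one unit, which is a sign that the statement has no real meaning for interval data.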

Ratio

Seeing that the time is 11:30, you think to yourself, “I’ve been in line for fifteen minutes already?” When you start thinking about the time this way, it’s considered ratio data. Ratio data is numeric and a lot like interval data, except it does have a meaningful zero point. In ratio data, a value of zero indicates an absence of whatever you’re measuring: zero minutes, zero people in line, zero dairy products in your basket. In all these cases, zero actually means you don’t have any of that thing, which differs from the data we discussed in the interval section. Some other frequently encountered variables that are often recorded as ratio data are height, weight, age, and money.
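
Because elapsed time has a true zero, ratios of it stay the same whatever unit you measure in. A brief sketch, using the fifteen-minute wait from the story and a hypothetical thirty-minute wait for comparison:

```python
from datetime import timedelta

first_wait = timedelta(minutes=15)   # the wait described in the text
second_wait = timedelta(minutes=30)  # hypothetical longer wait

# Zero elapsed time really does mean "no waiting at all".
no_wait = timedelta(0)
print(no_wait.total_seconds())

# The ratio is the same whether we work in minutes or in seconds.
ratio_in_minutes = (second_wait.total_seconds() / 60) / (first_wait.total_seconds() / 60)
ratio_in_seconds = second_wait.total_seconds() / first_wait.total_seconds()
print(ratio_in_minutes, ratio_in_seconds)

# Dividing one duration by another gives the same ratio directly.
print(second_wait / first_wait)
```

Here "twice as long" is a claim that survives any change of unit, unlike the temperature comparison for interval data.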

Interval and ratio data can be either discrete or continuous. Discrete means that you can only have specific amounts of the thing you are measuring (typically integers) and no values in between those amounts. There have to be a whole number of people in line; there can’t be a third of a person. You can have an average of, say, 4.25 people per line, but the actual count of people has to be a whole number. Continuous means that the data can be any value along the scale. You can buy 1.25 lbs of cheese or be in line for 7.75 minutes. This doesn’t mean that the data have to be able to take all possible numerical values – only all the values within the bounds of the scale. You can’t be in line for a negative amount of time and you can’t buy negative lbs of cheese, but these are still continuous.
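
The distinction can be expressed as two simple validity checks. This is a sketch only: the function names are invented, and the counts are a hypothetical set of four checkout lines chosen to reproduce the 4.25 average mentioned above.

```python
from statistics import mean

def is_valid_count(value):
    """Discrete: people in a line must be a non-negative whole number."""
    return isinstance(value, int) and value >= 0

def is_valid_amount(value):
    """Continuous: any non-negative number is allowed (pounds, minutes)."""
    return isinstance(value, (int, float)) and value >= 0

# Hypothetical head counts for four checkout lines.
people_per_line = [3, 5, 4, 5]
average_per_line = mean(people_per_line)

print(all(is_valid_count(n) for n in people_per_line))  # every count is whole
print(average_per_line)                                 # the average need not be
print(is_valid_count(average_per_line))                 # ...so it is not a valid count
print(is_valid_amount(1.25), is_valid_amount(7.75))     # cheese in lbs, wait in minutes
print(is_valid_amount(-1.0))                            # outside the bounds of the scale
```

The average of discrete values is a summary, not an observation, so it is free to fall between the values that any single line could actually take.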

To review, let’s take a look at a receipt from the store. Can you identify which pieces of information are measured at each level (nominal, ordinal, interval, and ratio)?

Date: 06/01/2014    Time: 11:32 am

Item                 | Section | Aisle | Quantity | Cost (US$)
Oranges - Lbs        | Produce |     4 |        2 |       2.58
Apples - Lbs         | Produce |     4 |        1 |       1.29
Mozzarella - Lbs     | Dairy   |     7 |        1 |       3.49
Milk - Skim - Gallon | Dairy   |     8 |        1 |       4.29
Peas - Bag           | Frozen  |    15 |        1 |       0.99
Green Beans - Bag    | Frozen  |    15 |        3 |       1.77
Tomatoes             | Canned  |     2 |        4 |       3.92
Potatoes             | Canned  |     3 |        2 |       2.38
Mushrooms            | Canned  |     2 |        5 |       2.95
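
As a partial illustration (not a full answer to the exercise), the sketch below applies the summaries discussed earlier to two of the receipt's columns: counting for Section, which the chapter treats as nominal, and summing for Cost, since money is listed among the ratio variables. It assumes each Cost value is the total for that line rather than a unit price; the other columns are left for you to classify.

```python
from collections import Counter
from decimal import Decimal

# Rows copied from the receipt: (item, section, aisle, quantity, cost in US$).
RECEIPT = [
    ("Oranges - Lbs", "Produce", 4, 2, Decimal("2.58")),
    ("Apples - Lbs", "Produce", 4, 1, Decimal("1.29")),
    ("Mozzarella - Lbs", "Dairy", 7, 1, Decimal("3.49")),
    ("Milk - Skim - Gallon", "Dairy", 8, 1, Decimal("4.29")),
    ("Peas - Bag", "Frozen", 15, 1, Decimal("0.99")),
    ("Green Beans - Bag", "Frozen", 15, 3, Decimal("1.77")),
    ("Tomatoes", "Canned", 2, 4, Decimal("3.92")),
    ("Potatoes", "Canned", 3, 2, Decimal("2.38")),
    ("Mushrooms", "Canned", 2, 5, Decimal("2.95")),
]

# Section is a named category: count the lines that fall into each one.
lines_per_section = Counter(row[1] for row in RECEIPT)

# Cost has a true zero and equal units: sums (and averages) make sense.
# Assumption: each cost is a line total, not a per-unit price.
total_cost = sum(row[4] for row in RECEIPT)

print(lines_per_section)
print(total_cost)
```

`Decimal` is used for the dollar amounts so that the total is exact rather than subject to binary floating-point rounding.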

Variable Type Vs. Data Type

If you look around the internet or in textbooks for info about data, you’ll often find variables described as being one of the data types listed above. Be aware that many variables aren’t exclusively one data type or another. What often determines the data type is how the data are collected.

Consider the variable age. Age is frequently collected as ratio data, but can also be collected as ordinal data. This happens on surveys when they ask, “What age group do you fall in?” There, you wouldn’t have data on your respondents’ individual ages – you’d only know how many were between 18-24, 25-34, etc. The same goes for other variables: you might collect actual cholesterol measurements from participants for a health study, or you might simply ask whether their cholesterol is high. Again, this is a single variable with two different data collection methods and two different data types.

The general rule is that you can go down in level of measurement but not up. If it’s possible to collect the variable as interval or ratio data, you can also collect it as nominal or ordinal data, but if the variable is inherently only nominal in nature, like grocery store section, you can’t capture it as ordinal, interval or ratio data. Variables that are naturally ordinal can’t be captured as interval or ratio data, but can be captured as nominal. However, many variables that get captured as ordinal have a similar variable that can be captured as interval or ratio data, if you so choose.

Ordinal Level Type | Corresponding Interval/Ratio Level Measure     | Example
Ranking            | Measurement that ranking is based on           | Record runners’ marathon times instead of what place they finish
Grouped scale      | Measurement itself                             | Record exact age instead of age category
Substitute scale   | Original measurement the scale was created from | Record exact test score instead of letter grade
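
The first row of the table can be sketched in Python. The runner names and finishing times are hypothetical, and the helper ignores ties for simplicity. Places can always be derived from times, but two very different races can produce exactly the same places.

```python
def finishing_places(times_in_seconds):
    """Map each runner to a finishing place (1 = fastest). Ties are not handled."""
    ordered = sorted(times_in_seconds, key=times_in_seconds.get)
    return {runner: place for place, runner in enumerate(ordered, start=1)}

# Hypothetical marathon results, recorded as ratio data (seconds).
race_a = {"runner_a": 10800, "runner_b": 10805, "runner_c": 12600}
race_b = {"runner_a": 9000, "runner_b": 11000, "runner_c": 11001}

places_a = finishing_places(race_a)
places_b = finishing_places(race_b)

print(places_a)
print(places_a == places_b)  # same ranking, very different gaps between runners

# With the times, the gap between first and second place is available...
print(race_a["runner_b"] - race_a["runner_a"])
print(race_b["runner_b"] - race_b["runner_a"])
# ...but nothing in places_a or places_b lets you recover it.
```

Recording the times keeps both options open; recording only the places does not.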

It’s important to remember that the general rule of “you can go down, but not up” also applies during analysis and visualization of your data. If you collect a variable as ratio data, you can always decide later to group the data for display if that makes sense for your work. If you collect it as a lower level of measurement, you can’t go back up later on without collecting more data. For example, if you do decide to collect age as ordinal data, you can’t calculate the average age later on and your visualization will be limited to displaying age by groups; you won’t have the option to display it as continuous data.
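
The age example can be made concrete with a short sketch. The two samples are invented, and the `age_group` helper uses the 18-24 and 25-34 bands from the earlier survey example plus two catch-all bands that are assumptions for illustration. Grouping exact ages is easy, but two samples with different averages can collapse into identical groups.

```python
from collections import Counter
from statistics import mean

def age_group(age):
    """Collapse an exact age (ratio data) into an ordered age band (ordinal data)."""
    if age < 18:
        return "under 18"      # assumed band, for illustration
    if age <= 24:
        return "18-24"
    if age <= 34:
        return "25-34"
    return "35 or older"       # assumed band, for illustration

# Two hypothetical samples of exact ages.
sample_a = [18, 19, 25, 26]
sample_b = [24, 24, 34, 34]

grouped_a = Counter(age_group(age) for age in sample_a)
grouped_b = Counter(age_group(age) for age in sample_b)

print(mean(sample_a), mean(sample_b))  # the exact ages have different averages
print(grouped_a == grouped_b)          # the grouped data cannot tell them apart
```

Going down a level is a one-line function; going back up would require the exact ages, which the grouped counts no longer contain.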

When it doesn’t increase the burden of data collection, you should collect the data at the highest level of measurement that you think you might want available later on. Few things in data work are as disappointing as sitting down to make a graph or run a calculation, only to realize you didn’t collect the data in a way that lets you generate what you need!

Other Important Terms

There are some other terms that are frequently used to talk about types of data. We are choosing not to use them here because there is some disagreement about their meanings, but you should be aware of them and what their possible definitions are in case you encounter them in other resources.

Categorical Data

We talked about both nominal and ordinal data above as splitting data into categories. Some texts consider both to be types of categorical data, with nominal being unordered categorical data and ordinal being ordered categorical data. Others only call nominal data categorical, and use the terms “nominal data” and “categorical data” interchangeably. These texts just call ordinal data “ordinal data” and consider it to be a separate group altogether.

Qualitative and Quantitative Data

Qualitative data, roughly speaking, refers to non-numeric data, while quantitative data is typically data that is numeric and hence quantifiable. There is some consensus with regard to these terms. Certain data are always considered qualitative, as they require pre-processing or different methods than quantitative data to analyze. Examples are recordings of direct observation or transcripts of interviews. In a similar way, interval and ratio data are always considered to be quantitative, as they are only ever numeric. The disagreement comes in with the nominal and ordinal data types. Some consider them to be qualitative, since their categories are descriptive and not truly numeric. However, since these data can be counted and used to calculate percentages, some consider them to be quantitative, since they are in that way quantifiable.

To avoid confusion, we’ll be sticking with the level of measurement terms above throughout the rest of this book, except in our discussion of long-form qualitative data in the survey design chapter. If you come across the terms “categorical,” “qualitative data,” or “quantitative data” in other resources or in your work, make sure you know which definition is being used and don’t just assume!
