Chapter 6
第六章
By Jane Foo
著:Jane Foo,译:眼神
Running your own study to collect data is not the only or best way to start your data analysis. Using someone else’s dataset and sharing your data is on the rise and has helped advance much of the recent research. Using external data offers several benefits:
运行你自己收集的数据可能不是进行数据分析的唯一或者最优的方法。利用别人的数据并且分享你的数据,这种情况越来越多,这也有助于发展越来越多的研究。利用外部数据有很多好处:
Time / Cost | Can decrease the work required to collect and prepare data for analysis |
Access | May allow you to work with data that requires more resources to collect than you have, or data that you wouldn’t otherwise have access to at all |
Community | Promotes new ideas and interesting collaborations by connecting you to people who are interested in the same topic |
时间 / 成本 | 能降低收集和准备所需数据的工作量 |
可访问性 | 让你有更多资源来收集数据,或者是那些你根本无法访问的数据。 |
社区 | 与那些和你有共同智趣的人交流可以激发你新的想法和令人感兴趣的合作。 |
All those benefits sound great! So where do you find external data? To help narrow your search, ask yourself the following questions:
这些好处听起来不错吧!那么在哪去找到这些外部数据呢?为了帮助你减少寻找的范围,先问一下自己以下的问题:
Scope | What is the scope of the data you’re looking for? What are the:
|
Type | What type of data are you looking for? Do you need:
|
Contribution | How will the data contribute to your existing data analysis? Do you need several external datasets to complete your analysis? |
范围 | 你在寻找的数据的范围是什么?他们是:
|
类型 | 你在寻找的数据类型是什么? 你需要的是:
|
贡献度 | 这些数据对于你现在的分析有多大的贡献? 你需要一些外部的数据集完成你的分析吗? |
Once you have a better idea of what you’re looking for in an external dataset, you can start your search at one of the many public data sources available to you, thanks to the open content and access movement that has been gaining traction on the Internet. Many institutions, governments, and organizations have established policies that support the release of data to the public in order to provide more transparency and accountability and to encourage the development of new services and products. Here’s a breakdown of public data sources:
一旦你有了去寻找外部数据的想法,你就可以开始去搜索哪些公开的数据源,这都要感谢开放内容及访问运动在网络上的推动。许多院校、政府、组织已经制定政策,支持数据公开发布,这可以提供更好的透明度和可问责性,并且能鼓励开发新的服务和产品。下面表格是一些公开数据源:
Source | Examples |
---|---|
来源 | 出处 |
Search Engines 搜索引擎 |
|
Data Repositories 数据仓库 |
re3data.org DataBib DataCite Dryad DataCatalogs.org Open Access Directory Gapminder Google Public Data Explorer IBM Many Eyes Knoema |
Government Datasets 政府数据 |
World Bank United Nations Open Data Index Open Data Barometer U.S. Government Data Kenya’s Open Data Initiative |
Research Institutions 研究机构 |
Academic Torrents American Psychological Association Other professional associations Academic institutions |
If you decide to use a search engine (like Google) to look for datasets, keep in mind that you’ll only find things that are indexed by the search engine. Sometimes a website (and the resource associated with it) will be visible only to registered users or be set to block the search engine, so these kinds of sites won’t turn up in your search result. Even still, the Internet is a big playground, so save yourself the headache of scrolling through lots of irrelevant search results by being clear and specific about what you’re looking for.
如果你打算使用像谷歌这样的搜索引擎来寻找数据,要知道你通过搜索引擎找到的只是相关的索引。有时,一个网站只对注册的用户开放或是设置阻止搜索引擎检索,所以这类网站的内容不会出现在你搜索的结果中。即使这样,在互联网这样的一个巨大的平台上,搜索引擎还把你从大量毫不相关的内容拯救到你需要的那些清晰具体的数据上。
If you’re not sure what to do with a particular type of data, try browsing through the Information is Beautiful awards for inspiration. You can also attend events such as the annual Open Data Day to see what others have done with open data.
如果你还不太清楚哪种集体类型的数据,试着去看看信息之美获奖作品来激发一下灵感。你也可以试着去参加一下类似于开放数据日年会这样的活动,看看其他人都是怎么处理开放数据的。
Open data repositories benefit both the contributors and the users by providing an online forum to share and brainstorm new ways to study and discuss data. In some cases, data crowdsourcing has led to new findings that otherwise would have developed at a much slower rate or would have not been possible in the first place. One of the more publicized crowdsourcing projects is Foldit from the University of Washington, a Web-based puzzle game that allows anyone to submit protein folding variations which are used by scientists to build new innovative solutions in bioinformatics and medicine. And recently, Cancer Research UK released a mobile game called Genes in Space that tasks users with identifying cancer cells in biopsy slides which in turn helps researchers cut down data analysis time.
开放数据库对于贡献者和使用者都是有好处的,它利用在线的论坛进行分享和头脑风暴去研究和讨论数据。在一些情况下,数据众包推动了许多新的发现,否则数据的研究发展仍将在较慢的速度,也不能是现在这种地位。其中一个参与比较广泛的众包项目是发自华盛顿大学的 Foldit (http://fold.it/portal/info/about) 。这是一个类似于拼图的游戏并且是基于网络的,允许任何人提交蛋白质折叠变体,科学家由此可以在生物信息学和生物医学创建创新的解决方案。并且最近,英国癌症研究部门发布了一款名为太空基因的手机游戏,玩家的任务是在活组织切片的幻灯片上识别癌症细胞,由此帮助研究人员缩减数据分析的时间。
Of course, not all data is public. There may come a time when you have access to a special collection of data because of your status within a particular network or through an existing relationship. Or maybe you come across a dataset that you can buy. In either case, you typically have to agree to and sign a license in order to get the data, so always make sure that you review the Terms of Use before you buy. If no terms are provided, insist on getting written permission to use the dataset.
当然,不是所有的数据都是公开的。可能有些时候你不得不采集一个特定的数据集为了你的统计工作,这需要一个特定的网络或者通过某种关系。也或者你需要的数据只能花钱购买。不管哪种情况,你都需要同意并且签署某种授权才能得到这些数据,所以在你购买这些数据前你要明确使用条款。即使没有条款提供,在你使用这些数据前,也要坚持得到书面的允许。
Let’s say you’ve found a dataset that fits your criteria. But is the quality good enough?
假设你已经获得了一个符合你要求的数据集。但是它的质量足够好吗?
Assessing data quality means looking at all the details provided about the data (including metadata, or “data about the data,” such as time and date of creation) and the context in which the data is presented. Good datasets will provide details about the dataset’s purpose, ownership, methods, scope, dates, and other notes. For online datasets, you can often find this information by navigating to the “About” or “More Information” web pages or by following a “Documentation” link.
评估数据的质量意味着要对所提供的数据的细节做全面的评价(包括元数据,或叫“关于数据的数据”,比如创建数据的时间和日期)和数据被提出时的背景。好的数据集应该包括数据集的目的、所有者、分类的方法、范围、日期和其他的说明。对于网上的数据集,你经常会发现这样的信息通过浏览“关于”或“更多信息”这样的网页,也可能是通过某条“参考资料”的链接。
Feel free to use general information evaluation techniques when reviewing data. For instance, one popular method used by academic libraries is the CRAAP Test, which is a set of questions that help you determine the quality of a text. The acronym stands for:
进行数据检验时,用一般的信息评价技术是很方便的.例如,一种被高校图书馆广泛采用的CRAAP评价方法,这种方法是通过一系列问题来帮助你确定文本的质量。CRAAP的含义:
Currency | Is the information up-to-date? When was it collected / published / updated? |
Relevancy | Is the information suitable for your intended use? Does it address your research question? Is there other (better) information? |
Authority | Is the information creator reputable and has the necessary credentials? Can you trust the information? |
Accuracy | Do you spot any errors? What is the source of the information? Can other data or research support this information? |
Purpose | What was the intended purpose of the information collected? Are other potential uses identified? |
时效性 | 信息是最新的吗? 信息是什么时候采集的/ 什么时候发布的 / 什么时候更新的? |
相关性 | 信息是否符合你的用途? 它是否解决了你的问题? 是否还有其他(更好)的信息? |
权威性 | 这些信息创建者是否很有声望?这些信息是否有必要的认证? 你能相信这些信息吗? |
准确性 | 你是否发现了任何错误? 这些信息的来源是哪里? 其他的数据或研究能支持这些信息吗? |
目的性 | 收集这些信息的目的是什么?是否能确定这些信息有其他潜在用途? |
Finally, when you review the dataset and its details, watch out for the following red flags:
最后,当你进行数据集的检验和它的细节时,关注一下下面红色的标签:
So now you have a dataset that meets your criteria and quality requirements, and you have permission to use it. What other things should you consider before you start your work?
在你已经有了一个能满足你的标准和质量要求的数据集,并且你也被允许使用它后。在你开始你的工作前,还有什么事情是你需要考虑的呢?
Checklist | |
---|---|
Did you get all the necessary details about the data? | Don’t forget to obtain variable specifications, external data dictionaries, and referenced works. |
Is the data part of a bigger dataset or body of research? | If yes, look for relevant specifications or notes from the bigger dataset. |
Has the dataset been used before? | If it has and you’re using the data for an analysis, make sure your analysis is adding new insights to what you know has been done with the data previously. |
How are you documenting your process and use of the data? | Make sure to keep records of licensing rights, communication with data owners, data storage and retention, if applicable. |
Are you planning to share your results or findings in the future? | If yes, you’ll need to include your data dictionary and a list of your additional data sources. |
检查例表 | |
---|---|
你获得了所有必需的数据集的细节? | 别忘了获取变量的规范说明,外部数据词典,引用的材料 |
这个数据是某个更大数据集或研究的一部分吗? | 如果是,从更大的数据集总寻找相关的规范或者说明 |
这个数据集以前被使用过吗? | 如果使用过, 并且你也正在使用这些数据在做分析,要确保你的分析相比于之前的分析增加了新的见解。 |
如何证明处理和使用的数据? | 如果可以,要保留授权的文件,与数据所有者的沟通记录,数据存储和保留的权利。 |
你是否有打算将来分享你的研究成果? | 如果是,你需要编撰你的数据词典或者附加数据来源的清单。 |
Your answers to these questions can change the scope of your analysis or prompt you to look for additional data. They may even lead you to think of an entirely new research question.
你对上面问题的解答可能会改变你分析问题的范围,或者,进一步寻找更多的数据。这些问题可能会让你思考到一个全新的研究课题。
The checklist encourages you to document (a lot). Careful documentation is important for two big reasons. First, in case you need to redo your analysis, your documentation will help you retrace what you did. Second, your documentation will provide evidence to other researchers that your analysis was conducted properly and allow them to build on your data findings.
上面的检查列表鼓励你提供许多参考文献。做出详细的参考文献有两个重要的理由。第一,这种情况下你需要重做你的分析,这些参考文献会重推你所做的工作。第二,这些参考文献会为其它研究者提供证明,你的研究处理是正确的,让他们能重做你的研究结果。
Simply put, crediting the source of your external dataset is the right thing to do. It’s also mandatory. Ethical research guidelines state that crediting sources is required for any type of research. So always make sure that you properly credit any external data you use by providing citations.
简单地说,感谢外部数据的出处是正确的。它也是强制性的。从研究的道德准则来说,感谢出处对于任何研究都是必要的。所以你要通过正确的引文对每条外部数据表示感谢。
Good citations give the reader enough information to find the data that you have accessed and used. Wondering what a good citation looks like? Try using an existing citation style manual from APA, MLA, Chicago, Turabian, or Harvard. Unlike citations for published items (like books), citations for a dataset vary a great deal from style to style.
好的引文可以给读者足够的信息去找到你访问和使用的数据。好的引文长成什么样呢?尝试一下从 APA,MLA,Chicago,Turabian,或者Harvard的引文风格指南。不像出版物的引文(比如图书),数据集的引文风格非常多样。一般来说,无论什么风格都需要说明作者和标题。另外,编辑、出品人、发行的信息(地点,出版日期),取存的时间(你第一次看到数据的时间),数据集的详细资料(唯一标识符、版本、材料类型),还有链接网址也都是需要的。对于官方的数据,可以用政府部门、委员会或者机构的名字作为作者的名字。
As a general rule, all styles require the author and the title. In addition, editor, producer or distributor information (location, publication date), access date (when you first viewed the data), details about the dataset (unique identifier, edition, material type), and the URL may be needed. For government datasets, use the name of the department, committee or agency as the group / corporate author.
For example, let’s say you’re using the U.S. Census Annual Survey of Public Employment and Payroll.
举例来说,假设你正在使用美国就业和工资的年度普查
The APA Style Manual (Publication Manual of the American Psychological Association, 6th edition) would cite this the following way:
APA的引文风格指南(Publication Manual of the American Psychological Association, 6th edition)应该是下面这种样子
while the MLA Style Manual (MLA Handbook for Writers of Research Paper, 7th edition) cites the same census data as:
如果是MLA风格处理同样的数据时(MLA Handbook for Writers of Research Paper, 7th edition):
Data repositories and organizations often have their own citation guidelines and provide ready citations that you can use “as is”. The Interuniversity Consortium for Political and Social Research (ICPSR), The National Center for Health Statistics, Dryad, PANGAEA, and Roper Center Data all provide guidelines for citing their datasets.
数据库和一些组织都有自己的引用指南,提供给你用来作为成型的引文。政治和社会研究校际联盟(ICPSR),国家卫生统计中心, Dryad, PANGAEA,罗珀数据中心都提供了数据引用指南。
This chapter gives you a brief look into external data: the important takeaway is that we are only at the start of a significant growth in data thanks to the technologies that now make massive data storage and processing an affordable reality. Open datasets in particular have the potential to become a de facto standard for anyone looking for data to analyze.
本章对于外部数据作了简要的考察:很重要的一点是,我们正处于数据爆发性增长的开端,新技术让海量的数据存储和处理成为了现实。特别是开放数据库让每个人去做数据分析成为了一种现实。