
Chapter 6


Finding external data


By Jane Foo

Translated by 眼神

Running your own study to collect data is not the only way to start your data analysis, and it is not always the best one. Using someone else’s dataset and sharing your own data are both on the rise, and together they have helped advance much of the recent research. Using external data offers several benefits:


Time / Cost: Can decrease the work required to collect and prepare data for analysis
Access: May allow you to work with data that requires more resources to collect than you have, or data that you wouldn’t otherwise have access to at all
Community: Promotes new ideas and interesting collaborations by connecting you to people who are interested in the same topic

Where to Find External Data


All those benefits sound great! So where do you find external data? To help narrow your search, ask yourself the following questions:


Scope: What is the scope of the data you’re looking for? What are the:
  • geographic boundaries?
  • specific data attributes (such as age range)?
  • time periods?
Type: What type of data are you looking for? Do you need:
  • statistics?
  • research data?
  • raw data?
  • data that have been collected using a specific method?
Contribution: How will the data contribute to your existing data analysis? Do you need several external datasets to complete your analysis?

Public Data


Once you have a better idea of what you’re looking for in an external dataset, you can start your search at one of the many public data sources available to you, thanks to the open content and access movement that has been gaining traction on the Internet. Many institutions, governments, and organizations have established policies that support the release of data to the public in order to provide more transparency and accountability and to encourage the development of new services and products. Here’s a breakdown of public data sources:


Search Engines
Data Repositories
  • Open Access Directory
  • Google Public Data Explorer
  • IBM Many Eyes
Government Datasets
  • World Bank
  • United Nations
  • Open Data Index
  • Open Data Barometer
  • U.S. Government Data
  • Kenya’s Open Data Initiative
Research Institutions
  • Academic Torrents
  • American Psychological Association
  • Other professional associations
  • Academic institutions

If you decide to use a search engine (like Google) to look for datasets, keep in mind that you’ll only find things that are indexed by the search engine. Sometimes a website (and the resource associated with it) will be visible only to registered users or be set to block the search engine, so these kinds of sites won’t turn up in your search results. Even so, the Internet is a big playground, so save yourself the headache of scrolling through lots of irrelevant search results by being clear and specific about what you’re looking for.


If you’re not sure what to do with a particular type of data, try browsing through the Information is Beautiful awards for inspiration. You can also attend events such as the annual Open Data Day to see what others have done with open data.


Open data repositories benefit both contributors and users by providing an online forum for sharing data and brainstorming new ways to study and discuss it. In some cases, data crowdsourcing has led to findings that would otherwise have developed much more slowly or would not have been possible at all. One of the more publicized crowdsourcing projects is Foldit (http://fold.it/portal/info/about) from the University of Washington, a web-based puzzle game that lets anyone submit protein folding variations, which scientists use to build innovative solutions in bioinformatics and medicine. More recently, Cancer Research UK released a mobile game called Genes in Space that tasks users with identifying cancer cells in biopsy slides, which in turn helps researchers cut down data analysis time.

Non-Public Data


Of course, not all data is public. There may come a time when you have access to a special collection of data because of your status within a particular network or through an existing relationship. Or maybe you come across a dataset that you can buy. In either case, you typically have to agree to and sign a license in order to get the data, so always make sure that you review the Terms of Use before you buy. If no terms are provided, insist on getting written permission to use the dataset.


Assessing External Data


Let’s say you’ve found a dataset that fits your criteria. But is the quality good enough?


Assessing data quality means looking at all the details provided about the data (including metadata, or “data about the data,” such as time and date of creation) and the context in which the data is presented. Good datasets will provide details about the dataset’s purpose, ownership, methods, scope, dates, and other notes. For online datasets, you can often find this information by navigating to the “About” or “More Information” web pages or by following a “Documentation” link.
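If you review many datasets, it can help to make this check routine. The following is a minimal Python sketch, assuming you have copied a dataset’s documentation into a dictionary. The field names and the example values are hypothetical; they simply mirror the details listed above.

```python
# Hypothetical field names mirroring the details a good dataset documents.
EXPECTED_FIELDS = ["purpose", "ownership", "methods", "scope", "dates", "notes"]


def missing_metadata(metadata):
    """Return the expected fields that are absent or blank in the metadata."""
    missing = []
    for field in EXPECTED_FIELDS:
        value = metadata.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


# Illustrative metadata, invented for this example.
example = {
    "purpose": "Track annual public employment",
    "ownership": "Example Statistics Office",
    "methods": "",          # present but blank, so it counts as missing
    "dates": "2010-2012",
}
print(missing_metadata(example))
```

A long list of missing fields does not prove the data is bad, but it tells you which questions to ask the data owner before relying on it.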


Feel free to use general information evaluation techniques when reviewing data. For instance, one popular method used by academic libraries is the CRAAP Test, which is a set of questions that help you determine the quality of a text. The acronym stands for:


Currency: Is the information up-to-date? When was it collected / published / updated?
Relevancy: Is the information suitable for your intended use? Does it address your research question? Is there other (better) information?
Authority: Is the creator of the information reputable, with the necessary credentials? Can you trust the information?
Accuracy: Do you spot any errors? What is the source of the information? Can other data or research support this information?
Purpose: What was the intended purpose of the information collected? Are other potential uses identified?

Finally, when you review the dataset and its details, watch out for the following red flags:


Using External Data


So now you have a dataset that meets your criteria and quality requirements, and you have permission to use it. What other things should you consider before you start your work?


Did you get all the necessary details about the data? Don’t forget to obtain variable specifications, external data dictionaries, and referenced works.
Is the data part of a bigger dataset or body of research? If yes, look for relevant specifications or notes from the bigger dataset.
Has the dataset been used before? If it has and you’re using the data for an analysis, make sure your analysis is adding new insights to what you know has been done with the data previously.
How are you documenting your process and use of the data? Make sure to keep records of licensing rights, communication with data owners, data storage and retention, if applicable.
Are you planning to share your results or findings in the future? If yes, you’ll need to include your data dictionary and a list of your additional data sources.

Your answers to these questions can change the scope of your analysis or prompt you to look for additional data. They may even lead you to think of an entirely new research question.


The checklist encourages you to document (a lot). Careful documentation is important for two big reasons. First, in case you need to redo your analysis, your documentation will help you retrace what you did. Second, your documentation will provide evidence to other researchers that your analysis was conducted properly and allow them to build on your data findings.
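One lightweight way to keep this kind of documentation is a small structured record saved next to the data. The sketch below assumes Python and JSON output; the class name, every field name, and the example values are hypothetical and should be adapted to your own project.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class DatasetRecord:
    """Provenance notes for one external dataset (all fields are illustrative)."""
    title: str
    source: str
    license_terms: str
    accessed: str  # date you obtained the data, in ISO format
    contacts: list = field(default_factory=list)          # communication with data owners
    processing_steps: list = field(default_factory=list)  # what you did, in order

    def log_step(self, description):
        """Append one processing step so the analysis can be retraced later."""
        self.processing_steps.append(description)

    def to_json(self):
        """Serialize the record as indented JSON text."""
        return json.dumps(asdict(self), indent=2, ensure_ascii=False)


# Invented example values.
record = DatasetRecord(
    title="Example Survey 2012",
    source="Example Statistics Office",
    license_terms="Written permission received by email",
    accessed="2014-01-15",
)
record.log_step("Removed rows with missing region codes")
record.log_step("Converted salaries to annual figures")
print(record.to_json())
```

Because the steps are stored in order, the same record serves both purposes described above: it lets you retrace your own work, and it can be shared with other researchers alongside your findings.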


Giving Credit to External Data Sources


Simply put, crediting the source of your external dataset is the right thing to do. It’s also mandatory. Ethical research guidelines state that crediting sources is required for any type of research. So always make sure that you properly credit any external data you use by providing citations.


Good citations give the reader enough information to find the data that you have accessed and used. Wondering what a good citation looks like? Try using an existing citation style manual from APA, MLA, Chicago, Turabian, or Harvard. Unlike citations for published items (like books), citations for a dataset vary a great deal from style to style.

As a general rule, all styles require the author and the title. In addition, editor, producer or distributor information (location, publication date), access date (when you first viewed the data), details about the dataset (unique identifier, edition, material type), and the URL may be needed. For government datasets, use the name of the department, committee or agency as the group / corporate author.
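To see how these elements fit together, here is a rough Python sketch that joins whichever elements you have into a single line. The function name and the layout are assumptions made for illustration; the output does not follow APA, MLA, or any other style manual, so check the manual you are using for the exact format.

```python
def dataset_citation(author, title, year=None, distributor=None,
                     identifier=None, url=None, accessed=None):
    """Join the available citation elements into one line.

    This generic layout is an illustration only; it does not follow
    APA, MLA, or any other style manual.
    """
    # Author and title are the two elements every style requires.
    if not author or not title:
        raise ValueError("author and title are required")

    parts = [f"{author} ({year})." if year else f"{author}."]
    parts.append(f"{title}.")
    if distributor:
        parts.append(f"{distributor}.")
    if identifier:
        parts.append(f"{identifier}.")
    if url:
        parts.append(url)
    if accessed:
        parts.append(f"Accessed {accessed}.")
    return " ".join(parts)


# Invented example values; a government dataset uses the agency as author.
print(dataset_citation(
    author="Example Statistics Office",
    title="Example Survey",
    year=2012,
    url="https://example.org/survey",
    accessed="2014-01-15",
))
```

Keeping the elements as separate fields, rather than as one typed-out string, makes it easier to reformat the same source later for a different style.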

For example, let’s say you’re using the U.S. Census Annual Survey of Public Employment and Payroll.


The APA Style Manual (Publication Manual of the American Psychological Association, 6th edition) would cite this in the following way:

APA citation

while the MLA Style Manual (MLA Handbook for Writers of Research Papers, 7th edition) cites the same census data as:

MLA citation

Data repositories and organizations often have their own citation guidelines and provide ready-made citations that you can use “as is.” The Inter-university Consortium for Political and Social Research (ICPSR), the National Center for Health Statistics, Dryad, PANGAEA, and Roper Center Data all provide guidelines for citing their datasets.

This chapter gives you a brief look into external data: the important takeaway is that we are only at the start of a significant growth in data thanks to the technologies that now make massive data storage and processing an affordable reality. Open datasets in particular have the potential to become a de facto standard for anyone looking for data to analyze.
