Different Levels of Data Documentation

There are many different ways to set up and organise your documentation.

Project Level

Project level documentation gives contextual information about the study/project: it explains the aims of the study, the research questions, the methodologies, etc.

Project level documentation also seeks answers to questions such as:

  • For what purpose was the data created?
    Describe the project history, its aims, objectives, concepts and hypotheses, including:
    • The title of the project
    • Authors, creators and co-workers of the dataset
    • The institution of the author(s)/creator(s)
    • Funders
    • Grant numbers
    • References to related projects
    • Publications from the data.
  • What does the dataset contain?
    • Kind of data (interviews, images, questionnaires, instrumental, etc.)
    • Organization & structure
    • Relationships between files
    • Description of data file(s): version and edition, structure of the database, associations, links between files, external links, formats, compatibility
  • How was the data collected?
    • The methodology and technique used in collecting and creating the data
    • Description of all the sources the data originate from
    • The methods/modes of data collection (for example):
      • The instruments, hardware and software used to collect the data
      • Digitisation or transcription methods
      • Data collection protocols
      • Sampling design and procedure
      • Target population, units of observation
  • What possible manipulations were done to the data? How was the data processed?
    • Modifications made to data over time since their original creation and identification of different versions of datasets
    • Describe workflow and specific tools, instruments, procedures, hardware/software or protocols you might have used to process the data
    • Anonymisation/pseudonymisation strategy
  • What were the quality assurance procedures?
    • Checking for equipment and transcription errors
    • Quality control of materials
    • Data integrity checks
    • Calibration procedures
    • Data capture resolution and repetitions
    • Other procedures related to data quality such as weighting, calibration, reasons for missing values, checks and corrections of transcripts, transformations.
  • How can the data be accessed?
    Describe the use and access conditions of the data:
    • Where the data can be found
    • Access conditions such as embargo
    • Parts of the data that are restricted, protected or confidential
    • Licences
    • Permanent identifiers
    • Copyright and ownership issues

A complete academic thesis normally contains this information in detail, but a published article may not. If a dataset is shared, it needs to be accompanied by a detailed technical report so that users can understand how the data were collected and processed. You should also provide a sample bibliographic citation that shows secondary users how to cite your data in their publications.
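As an illustration only, the questions above can be captured in a small machine-readable record that travels with the dataset. The sketch below is a minimal example, not a standard: the field names, the `<placeholder>` convention and the helper functions `missing_fields` and `to_json` are all assumptions made for this example.

```python
import json

# Hypothetical project-level record; every value is a placeholder to fill in.
project_record = {
    "title": "<project title>",
    "creators": ["<name, institution>"],
    "funders": ["<funder, grant number>"],
    "purpose": "<aims, objectives, hypotheses>",
    "contents": "<kind of data, structure, relationships between files>",
    "collection": "<methods, instruments, sampling design>",
    "processing": "<modifications, versions, anonymisation strategy>",
    "quality_assurance": "<checks, calibration, handling of missing values>",
    "access": "<location, licence, embargo, permanent identifier>",
    "citation": "<sample bibliographic citation>",
}


def missing_fields(record):
    """Return the keys whose value is empty or still a <placeholder>."""
    def unfilled(value):
        items = value if isinstance(value, list) else [value]
        return not items or any(
            not str(item).strip() or str(item).strip().startswith("<")
            for item in items
        )
    return [key for key, value in record.items() if unfilled(value)]


def to_json(record):
    """Serialise the record so it can be saved next to the data files."""
    return json.dumps(record, indent=2, ensure_ascii=False)
```

Running `missing_fields(project_record)` before sharing a dataset gives a quick list of the documentation questions that are still unanswered.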

File or Database Level

File or database level documentation documents how all the files (or tables in a database) that make up the dataset relate to each other, what format they are in, whether they supersede or are superseded by previous files, etc.

For this purpose, a codebook is advised. A codebook can be kept as a separate file or embedded within the data file. A separate codebook allows for much flexibility, but it is yet another document to maintain. An embedded codebook sits close to the data and is easy to use, but it is far less flexible and may get lost when the file is converted to another format.

The separate codebook
The embedded codebook

Source: Ellis SE, Leek JT. 2017. How to share data for collaboration. PeerJ Preprints 5:e3139v5 https://doi.org/10.7287/peerj.preprints.3139v5
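A separate codebook can be as simple as a CSV file with one row per variable. The following sketch assumes a four-column layout (`variable`, `label`, `unit`, `values`); the layout, the two example entries and the function names are hypothetical and only meant to show the idea.

```python
import csv
import io

# Hypothetical codebook entries; names, labels and codes are illustrative only.
CODEBOOK = [
    {"variable": "bmi", "label": "Body mass index", "unit": "kg/m^2", "values": ""},
    {"variable": "gender", "label": "Gender of respondent", "unit": "",
     "values": "1=female; 2=male; 9=missing"},
]

FIELDS = ["variable", "label", "unit", "values"]


def codebook_to_csv(entries):
    """Render the codebook as CSV text, ready to be saved as a separate file."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(entries)
    return buffer.getvalue()


def undocumented(columns, entries):
    """Return the data file columns that have no codebook entry."""
    documented = {entry["variable"] for entry in entries}
    return [column for column in columns if column not in documented]
```

Comparing the column names of a data file against the codebook with `undocumented` is one way to notice when the separate document has fallen behind the data.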

Data level documentation should also seek to document the processing steps, answering questions such as:

  • What happens between data files and why?
  • What is the chronology like? What happens when, and why?
  • Use annotated scripts or cookbooks that describe all steps, decisions and the study protocol
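One possible way to keep such a chronology is to let the processing script record each step as it runs. The sketch below is a minimal illustration; the function names, the log structure and the file names in the usage note are assumptions, not part of any particular tool.

```python
from datetime import date


def log_step(log, source, result, what, why):
    """Append one processing step: which file became which, what was done, and why."""
    entry = {
        "date": date.today().isoformat(),
        "source": source,
        "result": result,
        "what": what,
        "why": why,
    }
    log.append(entry)
    return entry


def render_log(log):
    """Format the log as plain text, one line per step, in chronological order."""
    return "\n".join(
        f"{e['date']}: {e['source']} -> {e['result']}: {e['what']} ({e['why']})"
        for e in log
    )
```

For example, `log_step(log, "raw.csv", "clean.csv", "dropped duplicate rows", "double data entry")` (with hypothetical file names) records both what happened between two data files and the reason for it.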

Variable or Item Level

Variable or item level documentation documents how an object of analysis came about. For example, it does not just document a variable name at the top of a spreadsheet file, but also the full label explaining the meaning of that variable in terms of how it was operationalised.

Best practices regarding variable names:

  • Use valid variable names
  • Use meaningful abbreviations, e.g. bmi, not var1
  • Refer to numbering system in instrument, e.g. q1a, q1b, q2, q3a
  • Avoid simplistic numerical ordering systems such as v1, v2, v3
  • Be consistent
  • Don’t change variable names across (versions of) datasets, e.g. do not switch between gender and sex
  • Use one language
  • Keep names short and in lower case, with no spaces or special characters (e.g. gender, not Gender)
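Several of these naming rules can be checked automatically. The sketch below is one possible check, under stated assumptions: underscores are tolerated, names must start with a letter, and the 32-character limit is an arbitrary choice for this example (the guidance above only says "short").

```python
import re

# Assumption: lower case letters, digits and underscores, starting with a letter.
NAME_PATTERN = re.compile(r"[a-z][a-z0-9_]*")
# Bare sequence names such as v1, v2 or var1 carry no meaning.
SEQUENTIAL = re.compile(r"(v|var)\d+")
MAX_LENGTH = 32  # assumed limit, not taken from the guidance


def name_problems(name):
    """Return a list of rule violations for one variable name (empty if none)."""
    problems = []
    if not NAME_PATTERN.fullmatch(name):
        problems.append("use lower case letters and digits only, no spaces or special characters")
    if SEQUENTIAL.fullmatch(name.lower()):
        problems.append("avoid bare sequence names such as v1 or var1")
    if len(name) > MAX_LENGTH:
        problems.append(f"keep the name under {MAX_LENGTH + 1} characters")
    return problems
```

Names that follow the numbering of the instrument, such as q1a or q3a, pass this check, while Gender (upper case) and var1 (sequence name) are flagged.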

Best practices regarding variable descriptions:
Variables in tabular data should have descriptive labels.

  • Be brief, max. 80 characters
  • Spaces or special characters are ok
  • Include unit of measurement where applicable
  • Refer to the number used in the instrument, e.g. variable q11bhexw with the label "q11b: hours spent taking physical exercise in a typical week". Here the description gives both the unit of measurement (hours) and a reference to the question number (q11b).
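The label rules can be checked in the same spirit. This is a minimal sketch: the 80-character limit comes from the list above, while the convention that a label starts with the question number is an assumption taken from the q11b example, and `label_problems` is a hypothetical helper name.

```python
MAX_LABEL_LENGTH = 80  # "max. 80 characters" from the guidance above


def label_problems(label, question_number=None):
    """Return a list of rule violations for one variable label (empty if none)."""
    problems = []
    if not label.strip():
        problems.append("label is empty")
    if len(label) > MAX_LABEL_LENGTH:
        problems.append(f"label is longer than {MAX_LABEL_LENGTH} characters")
    # Assumed convention: the label begins with the instrument's question number.
    if question_number and not label.lower().startswith(question_number.lower()):
        problems.append(f"label does not refer to question {question_number}")
    return problems
```

The example label from the list, "q11b: hours spent taking physical exercise in a typical week", passes all three checks when called with `question_number="q11b"`.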

Additional resources