File Formats

Choosing the right file format is crucial for your research data. The video below explains why.

CC BY – UGent Data Stewards, 2020

The remainder of this topic is based on the online Research Data Management training ‘MANTRA’ of The University of Edinburgh (CC BY: https://mantra.edina.ac.uk/) and Managing Data @ Melbourne.

When collecting data it is important to consider the formats of your data files. Not all file formats are created equal: some are better suited to your data’s needs than others. When you need to convert, migrate, compress or share your data, you should choose the best available method and file format for the intended use. In this part of the course you will learn to:

  • Understand why research data formatting and transformation is important;
  • Identify the risks of file transformations;
  • Understand the difference between proprietary and open file formats;
  • Understand why data centres require you to deposit your data in a preferred data format;
  • Make informed decisions about data file formatting, conversion and migration.

The file formats you use to generate your research data will influence how you can manage them over time, because a program or application must be able to recognise the file format in order to access the data within the file.
For example, a web browser is able to process and display a file in the HTML file format so that it appears as a web page. If the browser encounters another file type, it may need to call on a special plug-in to view it. Or it may simply let you download the file so that you can view it in another program that recognises it.

To identify the file format, files usually have a file name extension: a suffix that follows a full stop in the file name and typically contains three or four letters, for example:

  • .txt
    text
  • .pdf
    portable document format
  • .jpg
    joint photographic experts group
  • .csv
    comma separated values
  • .html
    hypertext markup language
  • .xml
    extensible markup language
  • .rtf
    rich text format
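As a rough illustration of how software uses the extension to recognise a format, the sketch below reads the suffix of a file name and asks Python's standard `mimetypes` module which format it would assume. The function name `describe_file` and the example file names are assumptions made for this illustration. Note that the extension is only a hint: renaming a file does not change its actual contents.

```python
import mimetypes
from pathlib import Path
from typing import Optional, Tuple


def describe_file(name: str) -> Tuple[str, Optional[str]]:
    """Return the lower-cased extension and the guessed MIME type for a file name."""
    suffix = Path(name).suffix.lower()  # e.g. ".csv"; empty string if there is none
    mime_type, _encoding = mimetypes.guess_type(name)  # None if the extension is unknown
    return suffix, mime_type


# Hypothetical file names, used only to show the lookup.
for example in ["notes.txt", "report.pdf", "photo.jpg", "page.html"]:
    print(example, describe_file(example))
```

The lookup is based on the file name alone, so the result can differ between systems and says nothing about whether the file can actually be opened.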

Proprietary and open formats

File formats that are non-proprietary (e.g. open source), or in widespread use, will tend to retain the best chance of being readable in the future. Proprietary formats, especially non-standard formats, used only by a specific software program or a specific software version, are likely to present problems for future (re)use.

Rapid changes in technology and the market mean that file formats can become obsolete quickly. It happens all too often that a software application is unable to read a file created by an earlier version of itself. The implications for research data management depend on how long data need to be retained for future use by yourself or others.

Data formats that conform to an agreed international standard are less likely to become obsolete, because a variety of software applications should be able to read them. However, there are likely to be trade-offs in terms of software functionality, for example, loss of formatting or macros.

File formats may be proprietary and closed, or open and published by a company, standards organisation or collective for anyone to use.

Files in proprietary formats usually must be opened by the software in which they were created. Someone without a license to the software may not be able to open the file at all. Open formats, in which the software company or collective publishes the format rather than keeps it proprietary, can be opened by more than one application. Adobe PDF is an example of an open format that may be viewed in a number of applications, not limited to just Adobe products.

It is recommended that you use open formats for your research data. If this is not possible, try to store a copy of your data in an open format. For example, you might use .xls to store your spreadsheet data, but by saving an additional copy in an open format (such as .csv) you greatly improve the chances that your data will remain readable in the future.

Even though some proprietary file formats can also be opened in other programs, distortions and loss of data may occur. For example, a .ppt file can be opened on the iOS operating system, but some features may no longer function. Therefore, it is always advised to store your data in an open file format, although this does not guarantee against loss of features and information either. At least you are not locked in by one supplier.

Preferred formats

Generally speaking, those file formats that allow for sustainable access are preferred. Very often they share these characteristics:

  • Non proprietary (not protected by trademark, patent or copyright)
  • Open, documented standard
  • Common usage by research community
  • Lossless compression (as opposed to lossy compression)

You do not need to know all the technical ins and outs of data formats. It is important, however, to have an idea of the factors involved, and to know that long-term, sustainable storage requires a suitable data format.

Data archives often maintain a list of preferred formats for the delivery of data. Only for data delivered in a preferred format can data archives like DANS and 4TU.Centre for Research Data guarantee long-term storage and accessibility.

There are reasons for preferring some data formats over others. For example, the preferred formats for image files are:

  • JPEG, a universally used format which has the advantage of being easily accessible from many applications.
  • TIFF, a format which allows you to preserve images at the highest quality, without compression, albeit with large file sizes.
  • PNG, a high-quality format of a smaller file size than TIFF, with the disadvantage of not allowing the inclusion of metadata such as the type of camera used to take the picture. JPEG and TIFF do possess this functionality.

File conversion and migration

At some point during your research you may need to convert or migrate your data files from one format to another or from one system to another. This may be due to a new computer, new software or because you want to share your data with someone who has different software. In such cases, conversion is necessary in order to be able to keep using the data.

Conversion can be done via an export function if the software provides one, via ‘Save as’ or via scripts. Converting data from one file format to another can result in loss of internal metadata (e.g. from SPSS to Excel) or loss of editing, formatting and formulas (e.g. XLSX to CSV). Therefore, it is important to understand what is at stake if you lose information for the type of data you are working with. Check for errors and changes after the conversion, and choose your formats carefully.
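The advice to check for changes after a conversion can be sketched with a small round trip. The example below is a minimal sketch using only Python's standard `csv` module: it writes a few made-up records to CSV text, reads them back, and compares the result with the original. The records and the function names `to_csv_text` and `from_csv_text` are assumptions for this illustration, not part of any particular tool.

```python
import csv
import io

# Hypothetical records, standing in for data held in a richer format.
records = [
    {"participant": 1, "age": 34, "score": 7.5},
    {"participant": 2, "age": 29, "score": 8.0},
]


def to_csv_text(rows):
    """Write a list of dicts to CSV text, using the keys of the first row as header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def from_csv_text(text):
    """Read CSV text back into a list of dicts; every value comes back as a string."""
    return list(csv.DictReader(io.StringIO(text)))


csv_text = to_csv_text(records)
restored = from_csv_text(csv_text)

# Check 1: same number of rows, and every value is still there as text.
values_match = len(records) == len(restored) and all(
    str(original[key]) == copy[key]
    for original, copy in zip(records, restored)
    for key in original
)

# Check 2: are the restored records identical to the originals, types included?
types_match = restored == records

print("values preserved:", values_match)
print("types preserved:", types_match)
```

In this sketch the values survive the conversion but the information that they were numbers does not, because CSV stores everything as text. This is a small-scale version of the metadata loss described above, and it is the kind of difference a post-conversion check is meant to surface.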

Data compression

At some point you may choose to compress your data files for the purpose of local or networked storage, transportation or transmission. Compression, also called bit-rate reduction, involves encoding information using fewer bits than the original representation.

Zip (.zip) is a widely used de facto standard compression format, though there are others, sometimes specific to a particular operating system. For example, a self-executing zip file (.exe) should not be used if the file is to be decompressed on another operating system.

Zip is a ‘lossless’ type of compression, which means the file should be identical to the original once unzipped. There are also ‘lossy’ types of compression associated with some multimedia file formats which may result in loss of quality/fidelity when played.
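One way to convince yourself that zip compression is lossless is to compare a checksum of the data before compression and after extraction. The sketch below does this in memory with Python's standard `zipfile` and `hashlib` modules. The sample data and the archive member name `data.csv` are made up for the illustration, and the sketch assumes the `zlib` module is available, as it is in most Python installations.

```python
import hashlib
import io
import zipfile


def sha256(data: bytes) -> str:
    """Return the SHA-256 checksum of some bytes as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


# Made-up, highly repetitive sample data, so that it compresses well.
original = b"participant,age,score\n" + b"1,34,7.5\n" * 1000

# Compress the data into a zip archive held in memory.
archive = io.BytesIO()
with zipfile.ZipFile(archive, mode="w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("data.csv", original)

# Extract the data again from the same archive.
with zipfile.ZipFile(archive, mode="r") as zf:
    extracted = zf.read("data.csv")

compressed_size = len(archive.getvalue())
is_identical = sha256(original) == sha256(extracted)

print("original size:", len(original), "bytes")
print("archive size:", compressed_size, "bytes")
print("identical after unzipping:", is_identical)
```

The same checksum comparison can be applied to real files, for example before and after transferring an archive to a collaborator.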