LIAC-ARFF v2.0

Introduction

The liac-arff module implements functions to read and write ARFF files in Python. It was created in the Connectionist Artificial Intelligence Laboratory (LIAC) at the Federal University of Rio Grande do Sul (UFRGS), in Brazil.

ARFF (Attribute-Relation File Format) is a file format created to describe datasets that are commonly used in machine learning experiments and software. The format was created for use in Weka, a widely used tool for automated machine learning experiments.

An ARFF file can be divided into two sections: header and data. The header describes the metadata of the dataset, including a general description of the dataset, its name, and its attributes. The following is an example of the header section of an XOR dataset:

% 
% XOR Dataset
% 
% Created by Renato Pereira
%            rppereira@inf.ufrgs.br
%            http://inf.ufrgs.br/~rppereira
% 
% 
@RELATION XOR

@ATTRIBUTE input1 REAL
@ATTRIBUTE input2 REAL
@ATTRIBUTE y REAL

The data section of an ARFF file lists the observations of the dataset. For the XOR dataset:

@DATA
0.0,0.0,0.0
0.0,1.0,1.0
1.0,0.0,1.0
1.0,1.0,0.0
% 
% 
% 

Notice that several lines start with a % symbol, which denotes a comment. Lines beginning with % are ignored, except for the description block at the beginning of the file. The declarations @RELATION, @ATTRIBUTE, and @DATA are all case insensitive and obligatory.
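The rules above can be illustrated with a minimal sketch that splits ARFF text into its description, header declarations, and data rows. The function name split_arff is hypothetical and this is not how liac-arff itself parses files; in particular, the naive comma split does not handle quoted values.

```python
def split_arff(text):
    """Split ARFF text into description lines, header declarations, and data rows."""
    description, header, data = [], [], []
    in_data = False
    seen_declaration = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith('%'):
            # Comments before the first declaration form the description;
            # every later comment is ignored.
            if not seen_declaration:
                description.append(line[1:].strip())
            continue
        seen_declaration = True
        if in_data:
            # Simplification: values are split on commas without quote handling.
            data.append(line.split(','))
        elif line.upper().startswith('@DATA'):
            # Declarations are matched case-insensitively.
            in_data = True
        else:
            header.append(line)
    return description, header, data


sample = """% XOR Dataset
@relation XOR
@ATTRIBUTE input1 REAL
@attribute y REAL
@DATA
0.0,1.0
% trailing comment
1.0,0.0
"""

description, header, data = split_arff(sample)
print(description)
print(header)
print(data)
```

The leading comment ends up in the description, while the comment inside the data section is dropped.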

For more information and details about the ARFF file description, consult http://www.cs.waikato.ac.nz/~ml/weka/arff.html

ARFF Files in Python

This module uses built-in Python objects to represent a deserialized ARFF file. A dictionary is used as the container for the data and metadata of the ARFF file, and it has the following keys:

  • description: (OPTIONAL) a string with the description of the dataset.

  • relation: (OBLIGATORY) a string with the name of the dataset.

  • attributes: (OBLIGATORY) a list of attributes with the following template:

    (attribute_name, attribute_type)
    

    the attribute_name is a string, and attribute_type must be a string or a list of strings.

  • data: (OBLIGATORY) a list of data instances. Each data instance must be a list with values, depending on the attributes.

The keys above are case sensitive and must be written exactly as shown. The attribute_type must be one of these strings (which are not case sensitive): NUMERIC, INTEGER, REAL or STRING. For nominal attributes, the attribute_type must be a list of strings.
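As a rough illustration of these rules, the sketch below checks a dictionary against them. The helper check_arff_object and its messages are hypothetical and not part of liac-arff; the row-width check assumes one value per attribute, and the string checks assume Python 3.

```python
VALID_TYPES = ('NUMERIC', 'INTEGER', 'REAL', 'STRING')


def check_arff_object(obj):
    """Return a list of problems found in a dictionary describing an ARFF dataset."""
    problems = []
    # Keys are case sensitive, so 'Relation' does not count as 'relation'.
    for key in ('relation', 'attributes', 'data'):
        if key not in obj:
            problems.append('missing key: %s' % key)
    for name, type_ in obj.get('attributes', []):
        if isinstance(type_, list):
            # Nominal attribute: every allowed value must be a string.
            if not all(isinstance(v, str) for v in type_):
                problems.append('non-string nominal value in %s' % name)
        elif not isinstance(type_, str) or type_.upper() not in VALID_TYPES:
            # Type names are compared case-insensitively.
            problems.append('bad type for %s: %r' % (name, type_))
    # Assumption: each data instance carries one value per attribute.
    width = len(obj.get('attributes', []))
    for i, row in enumerate(obj.get('data', [])):
        if len(row) != width:
            problems.append('row %d has %d values, expected %d' % (i, len(row), width))
    return problems


good = {
    'relation': 'XOR',
    'attributes': [('input1', 'real'), ('y', ['0', '1'])],
    'data': [[0.0, '1']],
}
bad = {
    'Relation': 'XOR',
    'attributes': [('input1', 'FLOAT')],
    'data': [[0.0, 1.0]],
}

good_problems = check_arff_object(good)
bad_problems = check_arff_object(bad)
print(good_problems)
print(bad_problems)
```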

In this format, the XOR dataset presented above can be represented as the following Python object:

xor_dataset = {
    'description': 'XOR Dataset',
    'relation': 'XOR',
    'attributes': [
        ('input1', 'REAL'),
        ('input2', 'REAL'),
        ('y', 'REAL'),
    ],
    'data': [
        [0.0, 0.0, 0.0],
        [0.0, 1.0, 1.0],
        [1.0, 0.0, 1.0],
        [1.0, 1.0, 0.0]
    ]
}
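Since the object is built from plain lists and tuples, it can be processed with ordinary Python. The sketch below, which uses only the structure shown above (the variable names are illustrative), pairs each data instance with the attribute names and extracts a single column.

```python
xor_dataset = {
    'description': 'XOR Dataset',
    'relation': 'XOR',
    'attributes': [
        ('input1', 'REAL'),
        ('input2', 'REAL'),
        ('y', 'REAL'),
    ],
    'data': [
        [0.0, 0.0, 0.0],
        [0.0, 1.0, 1.0],
        [1.0, 0.0, 1.0],
        [1.0, 1.0, 0.0]
    ]
}

# Attribute names, in declaration order.
names = [name for name, _ in xor_dataset['attributes']]

# One dictionary per data instance, keyed by attribute name.
records = [dict(zip(names, row)) for row in xor_dataset['data']]

# All values of the 'y' attribute, located by its position in the header.
y_index = names.index('y')
targets = [row[y_index] for row in xor_dataset['data']]

print(records[1])
print(targets)
```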

Features

This module provides several features, including:

  • Reads and writes ARFF files using Python built-in structures, such as dictionaries and lists;
  • Supports the following attribute types: NUMERIC, REAL, INTEGER, STRING, and NOMINAL;
  • Has an interface similar to other built-in modules such as json or zipfile;
  • Supports reading and writing file descriptions;
  • Supports missing values and names with spaces;
  • Supports unicode values and names;
  • Fully compatible with Python 2.6+ and Python 3.4+;
  • Released under the MIT License.

How To Install

Via pip:

$ pip install liac-arff

Via easy_install:

$ easy_install liac-arff

Manually:

$ python setup.py install

Basic Usage

arff.load(fp)

Load a file-like object containing the ARFF document and convert it into a Python object.

Parameters: fp – a file-like object.
Returns: a dictionary.

arff.loads(s)

Convert a string instance containing the ARFF document into a Python object.

Parameters: s – a string object.
Returns: a dictionary.

arff.dump(obj, fp)

Serialize an object representing the ARFF document to a given file-like object.

Parameters:
  • obj – a dictionary.
  • fp – a file-like object.

arff.dumps(obj)

Serialize an object representing the ARFF document, returning a string.

Parameters: obj – a dictionary.
Returns: a string with the ARFF document.
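Because these four functions mirror the load, loads, dump, and dumps functions of the json module, a round trip through an in-memory buffer can be sketched generically. The helper round_trip below is hypothetical. The json module stands in for arff so that the sketch runs without liac-arff installed; passing the arff module instead is assumed, not verified here, to behave analogously. Attribute pairs are written as lists because JSON has no tuple type.

```python
import io
import json


def round_trip(obj, codec):
    """Dump obj to an in-memory buffer with codec.dump, then read it back with codec.load."""
    buffer = io.StringIO()
    codec.dump(obj, buffer)
    buffer.seek(0)  # rewind so that load starts reading at the beginning
    return codec.load(buffer)


dataset = {
    'relation': 'XOR',
    'attributes': [['input1', 'REAL'], ['y', 'REAL']],
    'data': [[0.0, 0.0], [1.0, 1.0]],
}

# json is a stand-in here; with liac-arff available, the intended call
# would be round_trip(dataset, arff) after "import arff".
restored = round_trip(dataset, json)
print(restored == dataset)
```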

Encoders and Decoders

class arff.ArffDecoder

An ARFF decoder.

decode(s)

Returns the Python representation of a given ARFF file.

When a file object is passed as an argument, this method reads lines iteratively, which avoids loading unnecessary information into memory.

Parameters: s – a string or file object with the ARFF file.

class arff.ArffEncoder

An ARFF encoder.

encode(obj)

Encodes a given object to an ARFF file.

Parameters: obj – the object containing the ARFF information.
Returns: the ARFF file as a unicode string.

iter_encode(obj)

The iterative version of arff.ArffEncoder.encode.

This method encodes a given object iteratively and returns the lines of the ARFF file one by one.

Parameters: obj – the object containing the ARFF information.
Returns: (yields) the ARFF file as unicode strings.
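The line-by-line pattern can be sketched with a plain generator. The function iter_lines below is a simplified, hypothetical stand-in and not the actual ArffEncoder: it ignores descriptions, nominal attributes, quoting, and missing values. It only shows why yielding lines lets a caller write each one out as soon as it is produced.

```python
import io


def iter_lines(obj):
    """Yield the lines of a simplified ARFF document one at a time."""
    yield '@RELATION %s' % obj['relation']
    yield ''
    for name, type_ in obj['attributes']:
        yield '@ATTRIBUTE %s %s' % (name, type_)
    yield ''
    yield '@DATA'
    for row in obj['data']:
        # Simplification: values are written with str() and no quoting.
        yield ','.join(str(value) for value in row)


xor = {
    'relation': 'XOR',
    'attributes': [('input1', 'REAL'), ('input2', 'REAL'), ('y', 'REAL')],
    'data': [
        [0.0, 0.0, 0.0],
        [0.0, 1.0, 1.0],
        [1.0, 0.0, 1.0],
        [1.0, 1.0, 0.0],
    ],
}

# Each line is written as soon as it is produced, so the full document
# never has to exist as a single string.
out = io.StringIO()
for line in iter_lines(xor):
    out.write(line + '\n')

lines = out.getvalue().splitlines()
print(lines[0])
print(lines[-1])
```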

Exceptions

exception arff.BadRelationFormat

Error raised when the relation declaration is in an invalid format.

exception arff.BadAttributeFormat

Error raised when some attribute declaration is in an invalid format.

exception arff.BadDataFormat

Error raised when some data instance is in an invalid format.

exception arff.BadAttributeType

Error raised when an invalid type is provided in the attribute declaration.

exception arff.BadNominalValue

Error raised when a value is used in some data instance but is not declared in its respective attribute declaration.

exception arff.BadNumericalValue

Error raised when an invalid numerical value is used in some data instance.

exception arff.BadLayout

Error raised when the layout of the ARFF file is invalid.

exception arff.BadObject(msg='')

Error raised when the object representing the ARFF file is invalid.

Unicode

LIAC-ARFF works with unicode (in Python 2.6+; in Python 3.x this is the default). To take advantage of it, load the ARFF file using codecs, specifying its encoding:

import codecs
import arff

file_ = codecs.open('/path/to/file.arff', 'rb', 'utf-8')
arff.load(file_)
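The standard library's io.open is an alternative way to obtain a file object that yields decoded unicode strings on both Python 2.6+ and Python 3. The sketch below shows only the file-handling part, using a temporary file with made-up content; it does not call liac-arff, and passing the resulting file object to arff.load is left as the assumed next step.

```python
import io
import os
import tempfile

# Illustrative content with non-ASCII characters in the relation name and data.
content = u'@RELATION caf\u00e9\n\n@ATTRIBUTE name STRING\n\n@DATA\nJos\u00e9\n'

# Write a small UTF-8 encoded file to a temporary location.
handle, path = tempfile.mkstemp(suffix='.arff')
os.close(handle)
with io.open(path, 'w', encoding='utf-8') as fp:
    fp.write(content)

# Reopen it in text mode: the file object now yields decoded unicode strings
# rather than raw bytes.
with io.open(path, 'r', encoding='utf-8') as fp:
    text = fp.read()
os.remove(path)

print(text == content)
```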

Examples

Dumping An Object

Converting an object to ARFF:

import arff

obj = {
    'description': u'',
    'relation': 'weather',
    'attributes': [
        ('outlook', ['sunny', 'overcast', 'rainy']),
        ('temperature', 'REAL'),
        ('humidity', 'REAL'),
        ('windy', ['TRUE', 'FALSE']),
        ('play', ['yes', 'no'])
    ],
    'data': [
        ['sunny', 85.0, 85.0, 'FALSE', 'no'],
        ['sunny', 80.0, 90.0, 'TRUE', 'no'],
        ['overcast', 83.0, 86.0, 'FALSE', 'yes'],
        ['rainy', 70.0, 96.0, 'FALSE', 'yes'],
        ['rainy', 68.0, 80.0, 'FALSE', 'yes'],
        ['rainy', 65.0, 70.0, 'TRUE', 'no'],
        ['overcast', 64.0, 65.0, 'TRUE', 'yes'],
        ['sunny', 72.0, 95.0, 'FALSE', 'no'],
        ['sunny', 69.0, 70.0, 'FALSE', 'yes'],
        ['rainy', 75.0, 80.0, 'FALSE', 'yes'],
        ['sunny', 75.0, 70.0, 'TRUE', 'yes'],
        ['overcast', 72.0, 90.0, 'TRUE', 'yes'],
        ['overcast', 81.0, 75.0, 'FALSE', 'yes'],
        ['rainy', 71.0, 91.0, 'TRUE', 'no']
    ],
}

# The parenthesized form works on both Python 2 and Python 3.
print(arff.dumps(obj))

resulting in:

@RELATION weather

@ATTRIBUTE outlook {sunny, overcast, rainy}
@ATTRIBUTE temperature REAL
@ATTRIBUTE humidity REAL
@ATTRIBUTE windy {TRUE, FALSE}
@ATTRIBUTE play {yes, no}

@DATA
sunny,85.0,85.0,FALSE,no
sunny,80.0,90.0,TRUE,no
overcast,83.0,86.0,FALSE,yes
rainy,70.0,96.0,FALSE,yes
rainy,68.0,80.0,FALSE,yes
rainy,65.0,70.0,TRUE,no
overcast,64.0,65.0,TRUE,yes
sunny,72.0,95.0,FALSE,no
sunny,69.0,70.0,FALSE,yes
rainy,75.0,80.0,FALSE,yes
sunny,75.0,70.0,TRUE,yes
overcast,72.0,90.0,TRUE,yes
overcast,81.0,75.0,FALSE,yes
rainy,71.0,91.0,TRUE,no
%
%
%

Loading An Object

Loading an ARFF file:

import arff
import pprint

file_ = '''@RELATION weather

@ATTRIBUTE outlook {sunny, overcast, rainy}
@ATTRIBUTE temperature REAL
@ATTRIBUTE humidity REAL
@ATTRIBUTE windy {TRUE, FALSE}
@ATTRIBUTE play {yes, no}

@DATA
sunny,85.0,85.0,FALSE,no
sunny,80.0,90.0,TRUE,no
overcast,83.0,86.0,FALSE,yes
rainy,70.0,96.0,FALSE,yes
rainy,68.0,80.0,FALSE,yes
rainy,65.0,70.0,TRUE,no
overcast,64.0,65.0,TRUE,yes
sunny,72.0,95.0,FALSE,no
sunny,69.0,70.0,FALSE,yes
rainy,75.0,80.0,FALSE,yes
sunny,75.0,70.0,TRUE,yes
overcast,72.0,90.0,TRUE,yes
overcast,81.0,75.0,FALSE,yes
rainy,71.0,91.0,TRUE,no
%
%
% '''
d = arff.loads(file_)
pprint.pprint(d)

resulting in:

{u'attributes': [(u'outlook', [u'sunny', u'overcast', u'rainy']),
                 (u'temperature', u'REAL'),
                 (u'humidity', u'REAL'),
                 (u'windy', [u'TRUE', u'FALSE']),
                 (u'play', [u'yes', u'no'])],
 u'data': [[u'sunny', 85.0, 85.0, u'FALSE', u'no'],
           [u'sunny', 80.0, 90.0, u'TRUE', u'no'],
           [u'overcast', 83.0, 86.0, u'FALSE', u'yes'],
           [u'rainy', 70.0, 96.0, u'FALSE', u'yes'],
           [u'rainy', 68.0, 80.0, u'FALSE', u'yes'],
           [u'rainy', 65.0, 70.0, u'TRUE', u'no'],
           [u'overcast', 64.0, 65.0, u'TRUE', u'yes'],
           [u'sunny', 72.0, 95.0, u'FALSE', u'no'],
           [u'sunny', 69.0, 70.0, u'FALSE', u'yes'],
           [u'rainy', 75.0, 80.0, u'FALSE', u'yes'],
           [u'sunny', 75.0, 70.0, u'TRUE', u'yes'],
           [u'overcast', 72.0, 90.0, u'TRUE', u'yes'],
           [u'overcast', 81.0, 75.0, u'FALSE', u'yes'],
           [u'rainy', 71.0, 91.0, u'TRUE', u'no']],
 u'description': u'',
 u'relation': u'weather'}