mung.io module

This module implements functions for reading and writing the data formats used by MUSCIMA++.

Data formats

All MUSCIMA++ data is stored as XML, in <Node> elements. These are grouped into <Nodes> elements, which are the top-level elements in the *.xml dataset files.

The list of object classes used in the dataset is also stored as XML, in <NodeClass> elements (within a <NodeClasses> element).

Node

To read a Node list file (in this case, a test data file):

>>> from mung.io import read_nodes_from_file
>>> import os
>>> file = os.path.join(os.path.dirname(__file__), '../test/test_data/01_basic.xml')
>>> nodes = read_nodes_from_file(file)

The Node string representation is an XML object:

<Node xml:id="MUSCIMA-pp_1.0___CVC-MUSCIMA_W-01_N-10_D-ideal___25">
  <Id>25</Id>
  <ClassName>grace-notehead-full</ClassName>
  <Top>119</Top>
  <Left>413</Left>
  <Width>16</Width>
  <Height>6</Height>
  <Mask>1:5 0:11 (...) 1:4 0:6 1:5 0:1</Mask>
  <Outlinks>12 24 26</Outlinks>
  <Inlinks>13</Inlinks>
</Node>

The Nodes are themselves kept as a list:

<Nodes>
  <Node> ... </Node>
  <Node> ... </Node>
</Nodes>

Parsing is only implemented for files that consist of a single <Nodes>.
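To make the layout concrete, here is a minimal sketch that reads a single <Nodes> element with the standard library's xml.etree.ElementTree and returns plain dictionaries. It is not the parser used by read_nodes_from_file. The function names parse_nodes_sketch and read_id_list, the dictionary keys, and the hand-written sample are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the format described above (not a real dataset file).
SAMPLE = """<Nodes>
  <Node>
    <Id>25</Id>
    <ClassName>grace-notehead-full</ClassName>
    <Top>119</Top>
    <Left>413</Left>
    <Width>16</Width>
    <Height>6</Height>
    <Outlinks>12 24 26</Outlinks>
    <Inlinks>13</Inlinks>
  </Node>
</Nodes>"""


def read_id_list(element, tag):
    """Return the whitespace-separated ids stored under `tag` (empty if absent)."""
    return [int(token) for token in (element.findtext(tag) or "").split()]


def parse_nodes_sketch(xml_text):
    """Hypothetical helper: one plain dict per <Node> child of a single <Nodes>."""
    root = ET.fromstring(xml_text)
    if root.tag != "Nodes":
        raise ValueError("expected a single <Nodes> top-level element")
    records = []
    for element in root.findall("Node"):
        records.append({
            "id": int(element.findtext("Id")),
            "class_name": element.findtext("ClassName"),
            "top": int(element.findtext("Top")),
            "left": int(element.findtext("Left")),
            "width": int(element.findtext("Width")),
            "height": int(element.findtext("Height")),
            "outlinks": read_id_list(element, "Outlinks"),
            "inlinks": read_id_list(element, "Inlinks"),
        })
    return records


records = parse_nodes_sketch(SAMPLE)
print(records[0]["class_name"], records[0]["outlinks"])
```

The sketch ignores <Mask> and <Data>; for real data files, read_nodes_from_file is the function to use.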

Additional information

Caution

This part may easily be deprecated.

Arbitrary data can be added to the Node using the optional <Data> element. It should encode a dictionary of additional information about the Node that may only apply to a subset of Nodes (this optionality is what distinguishes the purpose of the <Data> element from just subclassing Node).

For example, encoding the pitch, duration and precedence information about a notehead could look like this:

<Node>
    ...
    <Data>
        <DataItem key="pitch_step" type="str">D</DataItem>
        <DataItem key="pitch_modification" type="int">1</DataItem>
        <DataItem key="pitch_octave" type="int">4</DataItem>
        <DataItem key="midi_pitch_code" type="int">63</DataItem>
        <DataItem key="midi_duration" type="int">128</DataItem>
        <DataItem key="precedence_inlinks" type="list[int]">23 24 25</DataItem>
        <DataItem key="precedence_outlinks" type="list[int]">27</DataItem>
    </Data>
</Node>

The Node will then contain in its data attribute the dictionary:

self.data = {'pitch_step': 'D',
             'pitch_modification': 1,
             'pitch_octave': 4,
             'midi_pitch_code': 63,
             'midi_duration': 128,
             'precedence_inlinks': [23, 24, 25],
             'precedence_outlinks': [27]}

This is also a basic mechanism to allow you to subclass Node with extra attributes without having to re-implement parsing and export.
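The conversion from <DataItem> elements to a dictionary can be sketched as follows. This is an illustration, not the module's actual code: the names SCALAR_TYPES, convert_data_item and parse_data_sketch are hypothetical, and the sketch assumes the type attribute is one of int, float, str, or a list[...] of those, with list values separated by whitespace.

```python
import xml.etree.ElementTree as ET

# Assumed mapping from the `type` attribute to a Python constructor.
SCALAR_TYPES = {"int": int, "float": float, "str": str}


def convert_data_item(text, type_name):
    """Apply the declared type to the text of one <DataItem>."""
    text = text or ""
    if type_name.startswith("list[") and type_name.endswith("]"):
        item_type = SCALAR_TYPES[type_name[5:-1]]
        # List values are whitespace-separated.
        return [item_type(token) for token in text.split()]
    return SCALAR_TYPES[type_name](text)


def parse_data_sketch(data_xml):
    """Hypothetical helper: build the data dict from a <Data> element."""
    root = ET.fromstring(data_xml)
    return {
        item.get("key"): convert_data_item(item.text, item.get("type"))
        for item in root.findall("DataItem")
    }


SAMPLE = """<Data>
  <DataItem key="pitch_step" type="str">D</DataItem>
  <DataItem key="midi_pitch_code" type="int">63</DataItem>
  <DataItem key="precedence_inlinks" type="list[int]">23 24 25</DataItem>
</Data>"""

data = parse_data_sketch(SAMPLE)
print(data)
```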

Warning

Do not misuse this! The <Data> mechanism is primarily intended to encode extra information for MUSCIMarker to display.

Individual elements of a <Node>

  • <Id> is the unique integer ID of the Node inside this document

  • <ClassName> is the name of the object’s class (such as noteheadFull, beam, numeral3, etc.).

  • <Top> is the vertical coordinate of the upper left corner of the object’s bounding box.

  • <Left> is the horizontal coordinate of the upper left corner of the object’s bounding box.

  • <Width>: the number of columns that the Node spans.

  • <Height>: the number of rows that the Node spans.

  • <Mask>: a run-length-encoded binary (0/1) array that denotes the area within the Node’s bounding box (specified by top, left, height and width) that the Node actually occupies. If the mask is not given, the object is understood to occupy the entire bounding box. For the representation, see Implementation notes below.

  • <Inlinks>: whitespace-separated id list, representing Nodes from which a relationship leads to this Node. (Relationships are directed edges, forming a directed graph of Nodes.) The ids are valid in the same scope as the Node’s id: don’t mix Nodes from multiple scopes (e.g., multiple documents)! If you are using Nodes from multiple documents at the same time, make sure to check against the ``unique_id``s.

  • <Outlinks>: whitespace-separated id list, representing Nodes to which a relationship leads from this Node. (Relationships are directed edges, forming a directed graph of Nodes.) The ids are valid in the same scope as the Node’s id: don’t mix Nodes from multiple scopes (e.g., multiple documents)! If you are using Nodes from multiple documents at the same time, make sure to check against the ``unique_id``s.

  • <Data>: a list of <DataItem> elements. The elements have two attributes: key and type. The key is what the item should be called in the data dict of the loaded Node. The type attribute encodes the Python type of the item and gets applied to the text of the <DataItem> to produce the value. Currently supported types are int, float, and str, as well as list[int], list[float], and list[str]. The lists are whitespace-separated.

The parser function provided for Nodes does not check against the presence of other elements. You can extend Nodes for your own purposes – but you will have to implement parsing.

Implementation notes on the mask

The mask is a numpy array that will be saved using run-length encoding. The numpy array is first flattened, then runs of successive 0’s and 1’s are encoded as e.g. ``0:10`` for a run of 10 zeros.
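The encoding can be sketched in a few lines. To stay dependency-free, this illustration uses nested Python lists instead of a numpy array, and the function names rle_encode and rle_decode are hypothetical. It assumes the value:count token format with single spaces between tokens, as in the example above.

```python
from itertools import groupby


def rle_encode(mask_rows):
    """Flatten the mask row by row (C order) and encode runs as 'value:count'."""
    flat = [value for row in mask_rows for value in row]
    return " ".join(f"{value}:{len(list(run))}" for value, run in groupby(flat))


def rle_decode(text, height, width):
    """Rebuild a height x width mask from its run-length encoding."""
    flat = []
    for token in text.split():
        value, count = token.split(":")
        flat.extend([int(value)] * int(count))
    if len(flat) != height * width:
        raise ValueError("encoded mask does not match the bounding box size")
    return [flat[row * width:(row + 1) * width] for row in range(height)]


mask = [
    [0, 1, 1, 0],
    [1, 1, 1, 1],
    [0, 1, 1, 0],
]
encoded = rle_encode(mask)
print(encoded)  # 0:1 1:2 0:1 1:4 0:1 1:2 0:1
```

Decoding needs the bounding box height and width, because the encoded string itself carries no shape information.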

How much space does this take?

Objects tend to be relatively convex, so after flattening, we can expect more or less two runs per row (flattening is done in C order). Because each run takes (approximately) 5 characters, each mask takes roughly 5 * n_rows bytes to encode. This makes it efficient for objects wider than 5 pixels, with a compression ratio approximately n_cols / 5. (Also, the numpy array needs to be made C-contiguous for that, which explains the NODE_MASK_ORDER='C' hack in set_mask().)

NodeClass

This is what a single NodeClass element might look like:

<NodeClass>
    <Id>1</Id>
    <Name>notehead-empty</Name>
    <GroupName>note-primitive/notehead-empty</GroupName>
    <Color>#FF7566</Color>
</NodeClass>

See e.g. test/test_data/mff-muscima-classes-annot.xml, which is incidentally the real NodeClass list used for annotating MUSCIMA++.

Similarly to a <Nodes>, the <NodeClass> elements are organized inside a <NodeClasses>:

<NodeClasses>
   <NodeClass> ... </NodeClass>
   <NodeClass> ... </NodeClass>
</NodeClasses>
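As an illustration of this layout, a <NodeClasses> document can be read with the standard library alone. The sketch below is not parse_node_classes itself. NodeClassRecord is a hypothetical stand-in for the real NodeClass, and the field names are assumptions.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class NodeClassRecord:
    """Hypothetical stand-in for NodeClass, holding the four XML fields."""
    class_id: int
    name: str
    group_name: str
    color: str


SAMPLE = """<NodeClasses>
  <NodeClass>
    <Id>1</Id>
    <Name>notehead-empty</Name>
    <GroupName>note-primitive/notehead-empty</GroupName>
    <Color>#FF7566</Color>
  </NodeClass>
</NodeClasses>"""


def parse_node_classes_sketch(xml_text):
    """Return one record per <NodeClass> child of the <NodeClasses> root."""
    root = ET.fromstring(xml_text)
    if root.tag != "NodeClasses":
        raise ValueError("expected <NodeClasses> as the top-level element")
    return [
        NodeClassRecord(
            class_id=int(element.findtext("Id")),
            name=element.findtext("Name"),
            group_name=element.findtext("GroupName"),
            color=element.findtext("Color"),
        )
        for element in root.findall("NodeClass")
    ]


node_classes = parse_node_classes_sketch(SAMPLE)
print(node_classes[0])
```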

The NodeClass represents one possible Node symbol class, such as a notehead or a time signature. Aside from defining the “vocabulary” of available object classes for annotation, it also contains some information about how objects of the given class should be displayed in the MUSCIMarker annotation software (ordering related object classes together in menus, implementing a sensible color scheme, etc.). There is nothing particularly interesting about this class. We pulled it into the mung package because the object grammar (i.e. which relationships are allowed and which are not) depends on having NodeClass objects as its “vocabulary”. You will probably want to manipulate the data based on the objects’ relationships (like reassembling notes from notation primitives: notehead plus stem plus flags…), and the grammar file is a reference for doing that.

mung.io.export_node_list(nodes: List[Node], file_path: str, document: str = None, dataset: str = None) None
mung.io.export_nodeclass_list(node_classes: List[NodeClass]) str[source]

Writes the Node data as an XML string. Does not write to a file – use with open(output_file, 'w') as out_stream: etc.

mung.io.get_edges(nodes: List[Node], validate: bool = True) List[Tuple[int, int]][source]

Collects the inlink/outlink Node graph and returns it as a list of (from, to) edges.

Parameters:
  • nodes – A list of Node instances. All are expected to be within one document.

  • validate – If set, will raise a ValueError if the graph defined by the Nodes is invalid.

Returns:

A list of (from, to) id pairs that represent edges in the Node graph.
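The idea behind the edge list can be sketched with a small stand-in class. NodeStub and collect_edges are hypothetical names, and the sketch assumes each edge is recorded in the outlinks of its source node. It performs no validation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class NodeStub:
    """Hypothetical stand-in carrying only the fields needed for the graph."""
    id: int
    inlinks: List[int] = field(default_factory=list)
    outlinks: List[int] = field(default_factory=list)


def collect_edges(nodes: List[NodeStub]) -> List[Tuple[int, int]]:
    """Return (from, to) pairs by walking every node's outlinks."""
    edges = []
    for node in nodes:
        for target_id in node.outlinks:
            edges.append((node.id, target_id))
    return edges


nodes = [
    NodeStub(1, outlinks=[2, 3]),
    NodeStub(2, inlinks=[1], outlinks=[3]),
    NodeStub(3, inlinks=[1, 2]),
]
edges = collect_edges(nodes)
print(edges)  # [(1, 2), (1, 3), (2, 3)]
```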

mung.io.parse_node_classes(filename: str) List[NodeClass][source]

Extract the list of NodeClass objects from an XML file with a <NodeClasses> top-level element and <NodeClass> children.

mung.io.read_nodes_from_file(filename: str) List[Node][source]

From an XML file with a <Nodes> top-level element, parse a list of nodes. (See Node class documentation for a description of the XML format.)

Let’s test whether the parsing function works:

>>> test_data_dir = os.path.join(os.path.dirname(os.path.dirname(__file__)),
...                              'test', 'test_data')
>>> file = os.path.join(test_data_dir, '01_basic.xml')
>>> nodes = read_nodes_from_file(file)
>>> len(nodes)
48

Let’s also test the data attribute:

>>> file_with_data = os.path.join(test_data_dir, '01_basic_binary_2.0.xml')
>>> nodes = read_nodes_from_file(file_with_data)
>>> nodes[0].data['pitch_step']
'G'
>>> nodes[0].data['midi_pitch_code']
79
>>> nodes[0].data['precedence_outlinks']
[8, 17]
>>> nodes[0].dataset
'MUSCIMA-pp_2.0'
>>> nodes[0].document
'01_basic_binary'

Returns:

A list of ``Node``s.

mung.io.validate_document_graph_structure(nodes: List[Node]) bool[source]

Check that the graph defined by the inlinks and outlinks in the given list of Nodes is valid: no relationships leading from or to objects with non-existent ``id``s.

Checks that all the Nodes come from one document. (Raises a ValueError otherwise.)

Parameters:

nodes – A list of Node instances.

Returns:

True if graph is valid, False otherwise.
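The kind of check described here can be sketched with plain dictionaries. This is an assumption-laden illustration rather than the library's implementation: document_graph_is_valid is a hypothetical name, and each node is assumed to be a dict with "id", "document", "inlinks" and "outlinks" keys.

```python
def document_graph_is_valid(nodes):
    """Return True if every inlink and outlink points at an existing id."""
    documents = {node["document"] for node in nodes}
    if len(documents) > 1:
        raise ValueError("all nodes must come from one document")
    known_ids = {node["id"] for node in nodes}
    for node in nodes:
        for linked_id in node["inlinks"] + node["outlinks"]:
            if linked_id not in known_ids:
                return False
    return True


valid_nodes = [
    {"id": 1, "document": "doc_a", "inlinks": [], "outlinks": [2]},
    {"id": 2, "document": "doc_a", "inlinks": [1], "outlinks": []},
]
# Node 1 points at id 9, which does not exist in this list.
dangling_nodes = [
    {"id": 1, "document": "doc_a", "inlinks": [], "outlinks": [9]},
]

print(document_graph_is_valid(valid_nodes))
print(document_graph_is_valid(dangling_nodes))
```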

mung.io.validate_nodes_graph_structure(nodes: List[Node]) bool[source]

Check that the graph defined by the inlinks and outlinks in the given list of Nodes is valid: no relationships leading from or to objects with non-existent ``id``s.

Can deal with Nodes coming from a combination of documents, through the Node document property. Warns about documents which are found inconsistent.

Parameters:

nodes – A list of Node instances.

Returns:

True if graph is valid, False otherwise.
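One way to handle Nodes from several documents is to group them by document and check each group separately, since ids are only meaningful within one document. The sketch below assumes dict-based nodes with "id", "document", "inlinks" and "outlinks" keys. The name validate_across_documents is hypothetical, and the real function may report problems differently.

```python
import warnings
from collections import defaultdict


def validate_across_documents(nodes):
    """Check links per document; warn about each inconsistent document."""
    by_document = defaultdict(list)
    for node in nodes:
        by_document[node["document"]].append(node)

    all_valid = True
    for document, members in by_document.items():
        known_ids = {node["id"] for node in members}
        dangling = sorted({
            linked_id
            for node in members
            for linked_id in node["inlinks"] + node["outlinks"]
            if linked_id not in known_ids
        })
        if dangling:
            warnings.warn(f"Document {document!r} links to missing ids: {dangling}")
            all_valid = False
    return all_valid


nodes = [
    {"id": 1, "document": "doc_a", "inlinks": [], "outlinks": [2]},
    {"id": 2, "document": "doc_a", "inlinks": [1], "outlinks": []},
    # Same id as above, but in another document, with a dangling outlink.
    {"id": 1, "document": "doc_b", "inlinks": [], "outlinks": [5]},
]

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = validate_across_documents(nodes)

print(result, len(caught))
```

Note that id 1 appears in both documents without conflict, because the lookup set is built per document.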

mung.io.write_nodes_to_file(nodes: List[Node], file_path: str, document: str = None, dataset: str = None) None[source]
mung.io.write_nodes_to_string(nodes: List[Node], document: str = None, dataset: str = None) str[source]

Writes the Node data as an XML string. Does not write to a file – use write_nodes_to_file if you want that behavior.