mung.io module¶
This module implements functions for reading and writing the data formats used by MUSCIMA++.
Data formats¶
All MUSCIMA++ data is stored as XML, in <Node> elements. These are grouped into <Nodes> elements, which are the top-level elements in the *.xml dataset files.
The list of object classes used in the dataset is also stored as XML, in <NodeClass> elements (within a <NodeClasses> element).
Node¶
To read a Node list file (in this case, a test data file):
>>> from mung.io import read_nodes_from_file
>>> import os
>>> file = os.path.join(os.path.dirname(__file__), '../test/test_data/01_basic.xml')
>>> nodes = read_nodes_from_file(file)
The Node string representation is an XML object:
<Node xml:id="MUSCIMA-pp_1.0___CVC-MUSCIMA_W-01_N-10_D-ideal___25">
<Id>25</Id>
<ClassName>grace-notehead-full</ClassName>
<Top>119</Top>
<Left>413</Left>
<Width>16</Width>
<Height>6</Height>
<Mask>1:5 0:11 (...) 1:4 0:6 1:5 0:1</Mask>
<Outlinks>12 24 26</Outlinks>
<Inlinks>13</Inlinks>
</Node>
The Nodes are themselves kept as a list:
<Nodes>
<Node> ... </Node>
<Node> ... </Node>
</Nodes>
Parsing is only implemented for files that consist of a single <Nodes> element.
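As an illustration of the format only, the following minimal sketch reads a <Nodes> document with the standard library instead of mung. The function name parse_nodes, the sample XML, and the dictionary keys are assumptions made for this example; they are not part of the mung API, and the mask is left out for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document that follows the element layout shown above.
SAMPLE_XML = """<Nodes>
  <Node>
    <Id>25</Id>
    <ClassName>grace-notehead-full</ClassName>
    <Top>119</Top>
    <Left>413</Left>
    <Width>16</Width>
    <Height>6</Height>
    <Outlinks>12 24 26</Outlinks>
    <Inlinks>13</Inlinks>
  </Node>
</Nodes>"""


def parse_nodes(xml_text):
    """Parse a single <Nodes> document into a list of plain dictionaries."""
    root = ET.fromstring(xml_text)
    if root.tag != "Nodes":
        raise ValueError("expected <Nodes> as the top-level element")

    def id_list(element, tag):
        # Link lists are whitespace-separated; a missing element means no links.
        return [int(token) for token in element.findtext(tag, default="").split()]

    parsed = []
    for element in root.iter("Node"):
        parsed.append({
            "id": int(element.findtext("Id")),
            "class_name": element.findtext("ClassName"),
            "top": int(element.findtext("Top")),
            "left": int(element.findtext("Left")),
            "width": int(element.findtext("Width")),
            "height": int(element.findtext("Height")),
            "outlinks": id_list(element, "Outlinks"),
            "inlinks": id_list(element, "Inlinks"),
        })
    return parsed


nodes = parse_nodes(SAMPLE_XML)
```

For real data, read_nodes_from_file remains the intended entry point; the sketch is only meant to make the element layout concrete.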
Additional information¶
Caution
This part may easily be deprecated.
Arbitrary data can be added to the Node using the optional <Data> element. It should encode a dictionary of additional information about the Node that may only apply to a subset of Nodes (this optionality is what distinguishes the purpose of the <Data> element from just subclassing Node).
For example, encoding the pitch, duration and precedence information about a notehead could look like this:
<Node>
...
<Data>
<DataItem key="pitch_step" type="str">D</DataItem>
<DataItem key="pitch_modification" type="int">1</DataItem>
<DataItem key="pitch_octave" type="int">4</DataItem>
<DataItem key="midi_pitch_code" type="int">63</DataItem>
<DataItem key="midi_duration" type="int">128</DataItem>
<DataItem key="precedence_inlinks" type="list[int]">23 24 25</DataItem>
<DataItem key="precedence_outlinks" type="list[int]">27</DataItem>
</Data>
</Node>
The Node will then contain the following dictionary in its data attribute:
self.data = {'pitch_step': 'D',
'pitch_modification': 1,
'pitch_octave': 4,
'midi_pitch_code': 63,
'midi_duration': 128,
'precedence_inlinks': [23, 24, 25],
'precedence_outlinks': [27]}
This is also a basic mechanism to allow you to subclass Node with extra attributes without having to re-implement parsing and export.
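How the type attribute might be applied to the text of each <DataItem> can be sketched as follows. This is a stand-alone illustration under stated assumptions, not mung's actual parsing code: the CONVERTERS table and the parse_data function are hypothetical names, and only the types listed in this documentation (int, float, str and their list variants) are covered.

```python
import xml.etree.ElementTree as ET

# Hypothetical converter table: maps the "type" attribute to a callable
# that turns the element text into a Python value.
CONVERTERS = {
    "int": int,
    "float": float,
    "str": str,
    "list[int]": lambda text: [int(t) for t in text.split()],
    "list[float]": lambda text: [float(t) for t in text.split()],
    "list[str]": lambda text: text.split(),
}


def parse_data(data_element):
    """Build a dictionary from the <DataItem> children of a <Data> element."""
    data = {}
    for item in data_element.iter("DataItem"):
        convert = CONVERTERS[item.get("type")]  # KeyError on unsupported types
        data[item.get("key")] = convert(item.text or "")
    return data


SAMPLE_DATA = """<Data>
  <DataItem key="pitch_step" type="str">D</DataItem>
  <DataItem key="pitch_octave" type="int">4</DataItem>
  <DataItem key="precedence_inlinks" type="list[int]">23 24 25</DataItem>
</Data>"""

data = parse_data(ET.fromstring(SAMPLE_DATA))
```

A lookup table keeps the set of supported types explicit, so an unsupported type fails loudly instead of being silently stored as a string.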
Warning
Do not misuse this! The <Data> mechanism is primarily intended to encode extra information for MUSCIMarker to display.
Individual elements of a <Node>¶
- <Id> is the unique integer ID of the Node inside this document.
- <ClassName> is the name of the object’s class (such as noteheadFull, beam, numeral3, etc.).
- <Top> is the vertical coordinate of the upper left corner of the object’s bounding box.
- <Left> is the horizontal coordinate of the upper left corner of the object’s bounding box.
- <Width>: the number of columns that the Node spans.
- <Height>: the number of rows that the Node spans.
- <Mask>: a run-length-encoded binary (0/1) array that denotes the area within the Node’s bounding box (specified by top, left, height and width) that the Node actually occupies. If the mask is not given, the object is understood to occupy the entire bounding box. For the representation, see Implementation notes below.
- <Inlinks>: whitespace-separated id list, representing Nodes from which a relationship leads to this Node. (Relationships are directed edges, forming a directed graph of Nodes.) The ids are valid in the same scope as the Node’s id: don’t mix Nodes from multiple scopes (e.g., multiple documents)! If you are using Nodes from multiple documents at the same time, make sure to check against the unique_ids.
- <Outlinks>: whitespace-separated id list, representing Nodes to which a relationship leads from this Node. (Relationships are directed edges, forming a directed graph of Nodes.) The ids are valid in the same scope as the Node’s id: don’t mix Nodes from multiple scopes (e.g., multiple documents)! If you are using Nodes from multiple documents at the same time, make sure to check against the unique_ids.
- <Data>: a list of <DataItem> elements. The elements have two attributes: key and type. The key is what the item should be called in the data dict of the loaded Node. The type attribute encodes the Python type of the item and gets applied to the text of the <DataItem> to produce the value. Currently supported types are int, float, and str, and list[int], list[float] and list[str]. The lists are whitespace-separated.
The parser function provided for Nodes does not check for the presence of other elements. You can extend Nodes for your own purposes, but you will have to implement the parsing yourself.
Implementation notes on the mask¶
The mask is a numpy array that will be saved using run-length encoding. The numpy array is first flattened, then runs of successive 0’s and 1’s are encoded as e.g. 0:10 for a run of 10 zeros.
How much space does this take?
Objects tend to be relatively convex, so after flattening, we can expect more or less two runs per row (flattening is done in C order). Because each run takes (approximately) 5 characters, each mask takes roughly 5 * n_rows bytes to encode. This makes it efficient for objects wider than 5 pixels, with a compression ratio of approximately n_cols / 5.
(Also, the numpy array needs to be made C-contiguous for that, which explains the NODE_MASK_ORDER='C' hack in set_mask().)
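The encoding described above can be sketched in a few lines. To stay dependency-free, this illustration uses nested Python lists in place of a numpy array, and the function names rle_encode and rle_decode are assumptions made for the example rather than mung functions.

```python
from itertools import groupby


def rle_encode(mask):
    """Encode a 2D 0/1 mask (a list of rows) as 'value:length' runs."""
    # Row-major traversal corresponds to flattening a numpy array in C order.
    flat = [int(value) for row in mask for value in row]
    return " ".join(f"{value}:{len(list(run))}" for value, run in groupby(flat))


def rle_decode(text, height, width):
    """Rebuild the list-of-rows mask from its run-length encoding."""
    flat = []
    for token in text.split():
        value, length = token.split(":")
        flat.extend([int(value)] * int(length))
    if len(flat) != height * width:
        raise ValueError("encoded mask does not match the bounding box size")
    return [flat[row * width:(row + 1) * width] for row in range(height)]


mask = [
    [0, 1, 1, 0],
    [1, 1, 1, 1],
    [0, 1, 1, 0],
]
encoded = rle_encode(mask)   # "0:1 1:2 0:1 1:4 0:1 1:2 0:1"
decoded = rle_decode(encoded, height=3, width=4)
```

Decoding needs the bounding box dimensions because the encoded string only stores the flattened runs, which is why the mask is always interpreted together with the height and width of the Node.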
NodeClass¶
This is what a single NodeClass element might look like:
<NodeClass>
<Id>1</Id>
<Name>notehead-empty</Name>
<GroupName>note-primitive/notehead-empty</GroupName>
<Color>#FF7566</Color>
</NodeClass>
See e.g. test/test_data/mff-muscima-classes-annot.xml, which is incidentally the real NodeClass list used for annotating MUSCIMA++.
Similarly to <Nodes>, the <NodeClass> elements are organized inside a <NodeClasses> element:
<NodeClasses>
<NodeClass> ... </NodeClass>
<NodeClass> ... </NodeClass>
</NodeClasses>
The NodeClass represents one possible Node symbol class, such as a notehead or a time signature. Aside from defining the “vocabulary” of available object classes for annotation, it also contains some information about how objects of the given class should be displayed in the MUSCIMarker annotation software (ordering related object classes together in menus, implementing a sensible color scheme, etc.). The class itself is simple. It is part of the mung package because the object grammar (i.e. which relationships are allowed and which are not) depends on having NodeClass objects as its “vocabulary”. You will probably want to manipulate the data based on the objects’ relationships (like reassembling notes from notation primitives: notehead plus stem plus flags…), and the grammar file is a reference for doing that.
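To make the <NodeClasses> layout concrete, here is a minimal standard-library sketch that reads such a document into dictionaries. It does not use parse_node_classes; the function name read_node_class_dicts, the dictionary keys, and the conversion of the color to an RGB tuple are assumptions added for illustration.

```python
import xml.etree.ElementTree as ET

# Sample document built from the <NodeClass> element shown above.
SAMPLE_CLASSES = """<NodeClasses>
  <NodeClass>
    <Id>1</Id>
    <Name>notehead-empty</Name>
    <GroupName>note-primitive/notehead-empty</GroupName>
    <Color>#FF7566</Color>
  </NodeClass>
</NodeClasses>"""


def hex_to_rgb(color):
    """Convert a '#RRGGBB' string into an (r, g, b) tuple of integers."""
    digits = color.lstrip("#")
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))


def read_node_class_dicts(xml_text):
    """Parse a <NodeClasses> document into a list of plain dictionaries."""
    root = ET.fromstring(xml_text)
    if root.tag != "NodeClasses":
        raise ValueError("expected <NodeClasses> as the top-level element")
    return [
        {
            "id": int(element.findtext("Id")),
            "name": element.findtext("Name"),
            "group_name": element.findtext("GroupName"),
            "color": hex_to_rgb(element.findtext("Color")),
        }
        for element in root.iter("NodeClass")
    ]


node_classes = read_node_class_dicts(SAMPLE_CLASSES)
```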
- mung.io.export_node_list(nodes: List[Node], file_path: str, document: str = None, dataset: str = None) None ¶
- mung.io.export_nodeclass_list(node_classes: List[NodeClass]) str [source]¶
Writes the NodeClass data as an XML string. Does not write to a file; use with open(output_file, 'w') as out_stream: etc.
- mung.io.get_edges(nodes: List[Node], validate: bool = True) List[Tuple[int, int]] [source]¶
Collects the inlink/outlink Node graph and returns it as a list of (from, to) edges.
- Parameters:
nodes – A list of Node instances. All are expected to be within one document.
validate – If set, will raise a ValueError if the graph defined by the Nodes is invalid.
- Returns:
A list of (from, to) id pairs that represent edges in the Node graph.
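The idea of turning outlinks into (from, to) pairs can be sketched without mung as follows. SimpleNode is a hypothetical stand-in for the real Node class, and collect_edges is an illustrative name, not the implementation of get_edges.

```python
from collections import namedtuple

# Hypothetical stand-in for mung's Node, carrying only the fields used here.
SimpleNode = namedtuple("SimpleNode", ["id", "inlinks", "outlinks"])


def collect_edges(nodes):
    """Return a sorted list of (from, to) id pairs derived from outlinks."""
    edges = []
    for node in nodes:
        for target in node.outlinks:
            edges.append((node.id, target))
    return sorted(edges)


nodes = [
    SimpleNode(id=1, inlinks=[], outlinks=[2, 3]),
    SimpleNode(id=2, inlinks=[1], outlinks=[3]),
    SimpleNode(id=3, inlinks=[1, 2], outlinks=[]),
]
edges = collect_edges(nodes)   # [(1, 2), (1, 3), (2, 3)]
```

Reading only the outlinks is sufficient when the graph is consistent, because every edge then appears once as an outlink of its source and once as an inlink of its target.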
- mung.io.parse_node_classes(filename: str) List[NodeClass] [source]¶
Extract the list of NodeClass objects from an XML file with a <NodeClasses> as the top element and <NodeClass> children.
- mung.io.read_nodes_from_file(filename: str) List[Node] [source]¶
From an XML file with a <Nodes> as the top element, parse a list of nodes. (See Node class documentation for a description of the XML format.)
Let’s test whether the parsing function works:
>>> test_data_dir = os.path.join(os.path.dirname(os.path.dirname(__file__)),
...                              'test', 'test_data')
>>> file = os.path.join(test_data_dir, '01_basic.xml')
>>> nodes = read_nodes_from_file(file)
>>> len(nodes)
48
Let’s also test the data attribute:
>>> file_with_data = os.path.join(test_data_dir, '01_basic_binary_2.0.xml')
>>> nodes = read_nodes_from_file(file_with_data)
>>> nodes[0].data['pitch_step']
'G'
>>> nodes[0].data['midi_pitch_code']
79
>>> nodes[0].data['precedence_outlinks']
[8, 17]
>>> nodes[0].dataset
'MUSCIMA-pp_2.0'
>>> nodes[0].document
'01_basic_binary'
- Returns:
A list of Nodes.
- mung.io.validate_document_graph_structure(nodes: List[Node]) bool [source]¶
Check that the graph defined by the inlinks and outlinks in the given list of Nodes is valid: no relationships leading from or to objects with non-existent ids.
Checks that all the Nodes come from one document. (Raises a ValueError otherwise.)
- Parameters:
nodes – A list of Node instances.
- Returns:
True if graph is valid, False otherwise.
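The core check, that no link points to a non-existent id, can be illustrated with a short stand-alone sketch. Nodes are represented here as plain dictionaries, and find_dangling_links and is_valid_graph are hypothetical helper names; the single-document check and the ValueError behaviour of the real function are not reproduced.

```python
def find_dangling_links(nodes):
    """Return (from, to) pairs in which one endpoint is not a known node id."""
    known_ids = {node["id"] for node in nodes}
    dangling = []
    for node in nodes:
        for source in node["inlinks"]:
            if source not in known_ids:
                dangling.append((source, node["id"]))
        for target in node["outlinks"]:
            if target not in known_ids:
                dangling.append((node["id"], target))
    return dangling


def is_valid_graph(nodes):
    """True if every inlink and outlink refers to an id present in the list."""
    return not find_dangling_links(nodes)


good = [
    {"id": 1, "inlinks": [], "outlinks": [2]},
    {"id": 2, "inlinks": [1], "outlinks": []},
]
bad = [
    {"id": 1, "inlinks": [], "outlinks": [99]},  # 99 is not in the list
]
```

Returning the offending pairs, rather than only a boolean, makes it easier to report which relationships need to be repaired in the annotation.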
- mung.io.validate_nodes_graph_structure(nodes: List[Node]) bool [source]¶
Check that the graph defined by the inlinks and outlinks in the given list of Nodes is valid: no relationships leading from or to objects with non-existent ids.
Can deal with Nodes coming from a combination of documents, through the Node document property. Warns about documents which are found inconsistent.
- Parameters:
nodes – A list of Node instances.
- Returns:
True if graph is valid, False otherwise.