To help develop graph-handling applications, test data is needed. We collect it here in graph-test-data.
We group formats and their test files into four categories: binary, text, xml, or json.
The graph-format-registry provides stable identifiers for graph formats. New formats should be added to the registry first.
Every graph format has an ID like format:graphml-1.1 or format:connected-json-8.0.0.
A graph format can be a version of a more general family of formats.
For example, format:graphml-1.1 is a version of format:graphml.
All GraphML versions are XML-based formats, while connected-json is JSON-based.
So the test files for the two example formats are in /xml/graphml/graphml-1.1 and /json/connected-json/connected-json-8.0.0, respectively.
This makes it possible to load test data by simply filtering on the path prefix.
Test data whose exact version is not determined may also be filed directly under the /xml/graphml path.
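The prefix filtering described above can be sketched in a few lines of Python. This is only an illustration: the helper names `is_under` and `select` and the sample file names are made up, and the sketch assumes paths are written with forward slashes as in this README.

```python
from pathlib import PurePosixPath


def is_under(path, prefix):
    """Return True if `path` lies inside the directory `prefix`."""
    path_parts = PurePosixPath(path).parts
    prefix_parts = PurePosixPath(prefix).parts
    # Compare whole path segments so that /xml/graphml does not match /xml/graphml2.
    return path_parts[:len(prefix_parts)] == prefix_parts


def select(paths, prefix):
    """Keep only the paths below `prefix`, preserving their order."""
    return [p for p in paths if is_under(p, prefix)]


# Hypothetical file names, for illustration only.
paths = [
    "/xml/graphml/graphml-1.1/a.graphml",
    "/xml/graphml/unversioned.graphml",
    "/json/connected-json/connected-json-8.0.0/b.json",
]

# Only files for exactly GraphML 1.1:
print(select(paths, "/xml/graphml/graphml-1.1"))
# All GraphML files, including those filed without an exact version:
print(select(paths, "/xml/graphml"))
```

Comparing path segments rather than raw strings is a design choice of the sketch; a plain string prefix test would also accept sibling directories whose names merely start with the same characters.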
Test files can be labeled as synthetic (created with the purpose of testing) or collected (created for various, often unknown purposes).
Furthermore, we often collect several test files that form a sub-collection (a dataset), for example all test files collected from a certain website. Datasets should state the source they have been collected from. Dataset names are short and arbitrary, but must not reuse the reserved format ID names defined in the graph-format-registry.
Metadata is recorded in simple meta.ddot files.
These are plain-text files with embedded, human- and machine-readable ddot.it triples.
For longer text, we use a meta.adoc (or meta.md if you prefer) file, which can also include ddot.it syntax.
We are interested in:
- ..source url.. stating the URL where the data was collected from.
- ..download date.. stating the ISO date when the data was downloaded.
- ..license.. which is very often not stated at all. Public files on the web are considered public domain unless otherwise stated.
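As a rough illustration, such triples could be read with a small line-based parser. The sketch assumes that each triple sits on its own line in the shape `subject ..predicate.. object`, as in the examples in this README; the real ddot.it syntax may allow more than that. The function name `parse_triples` and all sample values (URL, date, license) are made up.

```python
import re

# Assumed line shape: <subject> ..<predicate>.. <object>
TRIPLE = re.compile(
    r"^\s*(?P<subject>\S+)\s+\.\.(?P<predicate>[^.]+)\.\.\s+(?P<object>.+?)\s*$"
)


def parse_triples(text):
    """Return (subject, predicate, object) tuples; other lines are skipped."""
    triples = []
    for line in text.splitlines():
        match = TRIPLE.match(line)
        if match:
            triples.append(
                (match["subject"], match["predicate"].strip(), match["object"])
            )
    return triples


# Made-up sample content, for illustration only.
SAMPLE = """\
Notes on this folder (free text is ignored by the sketch).
ddot.it/this ..source url.. https://example.org/graphs
ddot.it/this ..download date.. 2024-01-31
got-graph.graphml ..license.. CC0-1.0
"""

for triple in parse_triples(SAMPLE):
    print(triple)
```

Subjects containing spaces are not handled by this sketch.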
There are two ways to attach metadata, pick whichever fits:
- Per-directory meta.ddot: use when a directory holds many files that share the same provenance (e.g. one dataset, one source, one license). The triples use ddot.it/this for facts about the whole folder and the file name as the subject for facts about a single file, e.g. got-graph.graphml ..license.. ….
- Per-file sidecar <filename>.ddot: use when a directory holds many small, unrelated files with differing sources/licenses. Each data file gets its own .ddot file next to it, named by appending .ddot to the full file name (e.g. got-graph.graphml is described by got-graph.graphml.ddot, and planar.graphml.xml by planar.graphml.xml.ddot). Inside a sidecar, the subject is always ddot.it/this (it refers to the one file the sidecar belongs to).
A sidecar .ddot is just a meta.ddot scoped to a single file; the triple
vocabulary is identical.
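A tool that wants the metadata for one data file has to look in both places. The sketch below lists where to look and which subject to expect there. The function name `metadata_lookups` is hypothetical, and the order (sidecar first, then file-specific facts in meta.ddot, then folder-wide facts) is an assumption of this sketch, not something this README prescribes.

```python
from pathlib import PurePosixPath

THIS = "ddot.it/this"


def metadata_lookups(data_file):
    """Return (metadata file, subject) pairs to check, most specific first."""
    path = PurePosixPath(data_file)
    # The sidecar name is the full file name with .ddot appended.
    sidecar = path.with_name(path.name + ".ddot")
    shared = path.with_name("meta.ddot")
    return [
        (str(sidecar), THIS),       # sidecar: subject is always ddot.it/this
        (str(shared), path.name),   # meta.ddot: facts about this single file
        (str(shared), THIS),        # meta.ddot: facts about the whole folder
    ]


for lookup in metadata_lookups("xml/graphml/planar.graphml.xml"):
    print(lookup)
```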
Some test files are intentionally broken so that parsers can be tested against bad input.
We mark them with a tag in the file name, written with a double dash -- directly before the
extension(s): name--TAG.ext.
| Tag | Meaning |
|---|---|
| --INVALID | The file is intentionally invalid. Optionally append the format to say how it is invalid, using the format family from the graph-format-registry (lower-case, no dots), e.g. --INVALIDgraphml or --INVALIDxml. |
| --FIXED | A manually corrected, valid counterpart of a broken sibling file, useful to show the intended, repaired form next to the invalid one. |
Keep the marker right before the extension so the tag survives multi-dot extensions
(e.g. foo--INVALIDxml.graphml.xml). Tags are matched case-insensitively.
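The naming rule can be checked mechanically. The following sketch extracts the tag from a file name; the function name `parse_tag` is made up, and the sketch assumes that the optional format suffix consists of letters and digits only.

```python
import re

# --INVALID or --FIXED, an optional format suffix, then zero or more
# dot-separated extensions up to the end of the name.
TAG = re.compile(
    r"--(?P<kind>invalid|fixed)(?P<fmt>[a-z0-9]*)(?:\.[^.]+)*$",
    re.IGNORECASE,
)


def parse_tag(filename):
    """Return (kind, format) such as ("INVALID", "xml"), or None if untagged."""
    match = TAG.search(filename)
    if match is None:
        return None
    return match["kind"].upper(), match["fmt"].lower()


print(parse_tag("foo--INVALIDxml.graphml.xml"))
print(parse_tag("foo--fixed.graphml"))
print(parse_tag("got-graph.graphml"))
```

Because the pattern is anchored at the end of the name, a tag that is not directly in front of the extension(s) is not recognized.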
A valid-input test run should skip the broken files. Filter them out by the --INVALID tag in the
file name:
- To skip all intentionally-invalid files: drop any file whose name (before the extension) contains --INVALID.
- To skip only files invalid for your format: drop names containing --INVALID<yourformat> (e.g. a GraphML reader skips --INVALIDgraphml and --INVALIDxml). Note that an --INVALIDxml file is also invalid for every XML-based format, so exclude --INVALIDxml as well when consuming an XML-based format.
--FIXED files are valid and should be treated as normal input.
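Both filtering modes can be sketched with one small helper. The name `is_skipped` is hypothetical. In the per-format mode, the sketch also skips files tagged with a bare --INVALID; that is a cautious assumption of the sketch, since this README does not say how such files should be handled there.

```python
import re

INVALID = re.compile(r"--invalid(?P<fmt>[a-z0-9]*)(?:\.[^.]+)*$", re.IGNORECASE)


def is_skipped(filename, formats=None):
    """Decide whether a valid-input test run should skip `filename`.

    formats=None skips every --INVALID file; otherwise only files that are
    invalid for one of the given lower-case formats (or tagged bare --INVALID).
    """
    match = INVALID.search(filename)
    if match is None:
        return False  # untagged and --FIXED files are normal input
    if formats is None:
        return True
    fmt = match["fmt"].lower()
    # Assumption: a bare --INVALID tag is skipped in per-format mode as well.
    return fmt == "" or fmt in formats


# A GraphML reader excludes --INVALIDgraphml and --INVALIDxml.
graphml_formats = {"graphml", "xml"}
names = ["a--INVALIDjson.json", "b--INVALIDxml.graphml.xml", "c--FIXED.graphml"]
print([n for n in names if not is_skipped(n, graphml_formats)])
```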
If you want your data to be removed from the test collection, please file an issue at the tracker.