RATT
RATT can only be used in combination with TriplyDB. Contact info@triply.cc to receive your token to access the RATT package.
RATT is a TypeScript package that is developed by Triply. RATT makes it possible to develop and maintain production-grade linked data pipelines. It is used in combination with one of the TriplyDB subscriptions to create large-scale knowledge graphs.
RATT Connectors
RATT Connectors are modules that allow various backend systems to be connected to a RATT pipeline.
RATT Connectors generate RATT Records. These records are used to configure the rest of the pipeline, which decouples pipeline configuration from source system structure. This decoupling is one of the essential features that sets RATT apart from most other pipeline systems.
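For intuition, a RATT Record is a simple key/value structure. The following sketch shows what a record for a single row of a tabular source might look like (the column names here are hypothetical):

// A CSV row `1,Alice` under the header `id,name` yields a record like:
const record = { id: '1', name: 'Alice' }

The rest of the pipeline is configured against such records, independently of the backend system they came from.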
Use Assets for static source data
Assets are a feature of TriplyDB that allows storage of arbitrary files, including source data files.
Source data is often made available in static files. For example, a pipeline may make use of a Microsoft Excel file and a collection of ESRI ShapeFiles. Or a pipeline may use a relational database in addition to a set of CSV text files that store information that is not in the relational dataset.
If your pipeline needs to connect to static data files, it is a best practice to upload such files as TriplyDB Assets. This has the following benefits:
- Shareable: TriplyDB Assets can be added to any TriplyDB Dataset. This means that collaborators who have access to a dataset also have access to the static data files that are needed to create the linked data in that dataset.
- Secure: TriplyDB Assets are accessible under the same access levels as the TriplyDB Dataset to which they belong. This means that you can share static data files with your collaborators in a secure way.
- Versioned: TriplyDB Assets are versioned. If a new version of the same static file becomes available, it can be uploaded to the same TriplyDB Asset. If there are problems with the new data files, your collaborators can always roll back to an earlier version of the source data.
- Transparent: All collaborators have access to the same TriplyDB Assets. This makes it transparent which static data files are needed and which versions are available. This is much more transparent than sharing (versions of) files over email or by other indirect means.
- Backed up: TriplyDB instances that are maintained by Triply are backed up regularly, including the static data files that are uploaded as TriplyDB Assets. This is much safer than storing static data files on a local laptop that can break, or where files can get lost otherwise.
Microsoft Excel (XLSX) files
Microsoft Excel (file name extension .xlsx) is a popular file format for storing tabular source data.
RATT has a dedicated connector for Excel files. After such files are uploaded as TriplyDB Assets, RATT can connect to them as follows:
const account = 'my-account'
const dataset = 'my-dataset'
app.use(
  fromXlsx(Ratt.Source.TriplyDb.asset(account, dataset, {name: 'my-table.xlsx'})),
)
Comma Separated Values (CSV) and Tab Separated Values (TSV) files
Comma Separated Values (file name extension .csv) and Tab Separated Values (file name extension .tsv) are popular file formats for storing tabular source data.
RATT has dedicated connectors for CSV and TSV files. After your CSV files are compressed and uploaded as TriplyDB Assets, RATT can connect to them as follows:
const account = 'my-account'
const dataset = 'my-dataset'
app.use(
  fromCsv(Ratt.Source.TriplyDb.asset(account, dataset, {name: 'my-table.csv.gz'})),
)
This connector also handles CSV variants that use a cell separator other than the comma (,). For TSV files, use fromTsv() accordingly.
Collect records from a specified OAI endpoint (fromOai)
fromOai allows a RATT pipeline to be run over the self-contained RATT Records that come from the specified OAI endpoint. The handling of resumption tokens and the iteration over the array members of each batch are abstracted away by this middleware, simplifying the use of OAI endpoints for the RATT user. The middleware supports XML parsing; by default, the content type is determined automatically from the returned Content-Type header. The received content can be cached. If caching is enabled, each record contains metadata indicating whether it came from a cached result or whether it is a new record.
Function signature
This function has the following signature:
app.use(
  fromOai({
    since: Time,
    url: "https://somethingsomething.redacted/webapioai/oai.ashx",
    set: "xyzname",
    cacheOverride: "use cache",
    maxCacheAgeDays: number,
  })
)
The function can be configured in the following ways:
- Time is a value that is modified for testing purposes, for example in unit testing by the development team. Using this value in combination with caching is dangerous, and it is therefore not preset in the ETLs.
- https://somethingsomething.redacted/webapioai/oai.ashx is the specified OAI endpoint.
- xyzname is the name of the specific dataset.
- use cache starts the caching process.
- number is a natural number that indicates the number of days after which the cache is cleared.
Keeping track of records in the cache
To keep track of new or modified records in the cache, we can use a custom middleware, for example:
(ctx, next) => next({
  ...ctx.getAny('metadata.record'),
  fromCache: ctx.getBoolean('header.fromCache'),
}),
Notice the following details:
- header.fromCache returns true if the record exists in the cache.
- The fromCache property is added to the record with a value of either true or false.
- metadata.record retrieves the record.
The modified record is then passed to the next middleware.
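In context, such a middleware is placed directly after the fromOai connector. A sketch, reusing the endpoint and set name from the signature above:

app.use(
  fromOai({
    url: 'https://somethingsomething.redacted/webapioai/oai.ashx',
    set: 'xyzname',
  }),
  // Annotate each record with its cache status:
  (ctx, next) => next({
    ...ctx.getAny('metadata.record'),
    fromCache: ctx.getBoolean('header.fromCache'),
  }),
)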
Standards-compliance
RATT supports the official CSV standard: RFC 4180. Unfortunately, there are some ‘CSV’ files that do not follow the RFC 4180 standard. Strictly speaking these files are not CSV files, although they may look similar to CSV files in some parts.
If your files do not follow the official CSV standard, then RATT may or may not be able to process your tabular data files correctly. In general, it is a good idea to correct tabular data files that deviate from the CSV standard. Such files are likely to cause issues in other data processing tools too.
Known limitations of CSV/TSV
While CSV/TSV files are often used in practice, they come with significant limitations.
Specifically, CSV/TSV does not allow the type of values to be specified: all values have type 'string'.
This is specifically an issue when tabular data contains numeric information. Such numeric information will only be available as strings, and these strings must be explicitly transformed to numbers in RATT (see the change function).
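For example, the following sketch casts the string values of a hypothetical height key to numbers (assuming the {key, type, change} signature of the change middleware):

app.use(
  change({
    key: 'height',           // hypothetical record key
    type: 'string',          // the current type of the value
    change: value => +value, // '1.80' becomes the number 1.8
  }),
)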
More advanced tabular formats like Microsoft Excel are able to store the types of values.
JSON sources
JSON (JavaScript Object Notation) is a popular open standard for interchanging tree-shaped data.
The following example uses a JSON source that is stored as a TriplyDB asset:
const account = 'my-account'
const dataset = 'my-dataset'
app.use(
  fromJson(Ratt.Source.TriplyDb.asset(account, dataset, {name: 'my-data.json'})),
)
The following example uses an in-line specified JSON source:
app.use(
  fromJson([{ a: "a", b: "b", c: "c" }]),
)
SPARQL queries
RATT is able to use existing SPARQL queries as data sources. This can be used to tap into existing RDF sources for transformation and/or enrichment.
This section assumes a SPARQL query has been saved inside TriplyDB. See the SPARQL endpoints section on how to use SPARQL endpoints without such saved queries. Notice that using saved queries is significantly better than querying endpoints, especially in production systems.
RATT is able to load RDF data from a SPARQL construct query. Such queries can be used to transform data from one graph structure to another. For example, this can be used to transform DCAT metadata records into Schema.org metadata records.
The following one-liner runs an existing saved construct query in TriplyDB:
loadRdf(Ratt.Source.TriplyDb.query('my-account', 'my-query')),
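Like the other RATT Connectors, this line is used inside app.use:

app.use(
  loadRdf(Ratt.Source.TriplyDb.query('my-account', 'my-query')),
)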
Similar to the other RATT Connectors, the above snippet automatically performs multiple requests in the background, if needed, to retrieve the full result set. This is not supported by bare SPARQL endpoints which lack a standardized form of pagination. See the page on SPARQL Pagination for more information on how this works.
The above example is identical for public and non-public TriplyDB Saved Queries. This makes it easy to start out with private or internal queries, and move them to public once the project matures. This is not supported by raw SPARQL endpoints.
Specifying a result graph
It is often useful to store the results of construct queries in a specific graph. For example, when internal data is enriched with external sources, it is often useful to store the external enrichments in a separate graph. Another example is the use of a query that applies RDF(S) and/or OWL reasoning. In such cases the results of the reasoner may be stored in a specific graph.
The following example stores the results of the specified construct query in a special ‘enrichment’ graph:
const graph = Ratt.prefixer('https://example.com/id/graph/')
const myQuery = Ratt.Source.TriplyDb.query(
  'my-account',
  'my-query',
  {toGraph: graph('enrichment')}
)
loadRdf(myQuery)
The value of the toGraph option can be any IRI that is specified inside RATT. In the above example the graph prefix is used together with the enrichment local name to produce the absolute IRI https://example.com/id/graph/enrichment.
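The same composition works for any local name. A small illustration (the external local name is made up):

const graph = Ratt.prefixer('https://example.com/id/graph/')
graph('enrichment') // the IRI https://example.com/id/graph/enrichment
graph('external')   // the IRI https://example.com/id/graph/external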
Using a specific query version
In production systems, applications must be able to choose whether to use the latest version of a query (acceptance mode), a specific recent version (production mode), or a specific older version (legacy mode).
This is supported by TriplyDB Saved Queries. A specific version can be used by specifying the version option in RATT. For example, the following snippet always uses the first version of the specified query:
const myQuery = Ratt.Source.TriplyDb.query(
  'my-account',
  'my-query',
  {
    toGraph: graph('results'),
    version: 1,
  }
)
loadRdf(myQuery)
Not specifying the version option automatically uses the latest version. There is no standardized support for query versioning with raw SPARQL endpoints.
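As an illustration of that default behaviour, the following snippet always uses the latest version of the query, because no version option is given:

const myQuery = Ratt.Source.TriplyDb.query('my-account', 'my-query')
loadRdf(myQuery)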
Specifying API variables
In production systems, applications often request different information based on a limited set of input variables. This is supported by TriplyDB Saved Queries, for which API variables can be configured. The API variables ensure that the query string is parameterized correctly, maintaining the RDF syntax and semantics.
The following example binds the ?country variable inside the query string to the literal 'Holland'. This allows the results for Holland to be returned.
const myQuery = Ratt.Source.TriplyDb.query(
  'my-account',
  'my-query',
  {
    toGraph: graph('results'),
    variables: {country: 'Holland'},
    version: 1,
  }
)
loadRdf(myQuery)
There is no standardized support for specifying API variables with raw SPARQL endpoints.
Specifying dynamic API variables
In the previous section the value 'Holland' for the API variable country was known at the time of writing the RATT pipeline. But what do we do if the requested country is not known at the time of writing, but depends on data that is read or transformed during the execution of the pipeline?
In such cases we can use the following custom middleware to run the SPARQL query:
// `account` must be a TriplyDB.js account object, not a string; it can,
// for example, be retrieved with TriplyDB.js as follows (a sketch):
//   const triply = App.get({token: process.env.TRIPLYDB_TOKEN})
//   const account = await triply.getAccount('my-account')
app.use(
  async (context, next) => {
    const api_variables = {
      country: context.getString('COUNTRY')
    }
    const myQuery = await account.getQuery('my-query')
    for await (const statement of myQuery.results(api_variables).statements()) {
      statement.graph = graph('enrichment')
      context.store.addQuad(statement)
    }
    return next()
  }
)
In the above example, different countries are specified by data values that are read dynamically from the COUNTRY key. This key can be a column in a table, an element in XML, or some other dynamic data location, depending on the RATT source that is used.
The following line is used to configure the graph where the results from the queries are stored:
statement.graph = graph('enrichment')
SPARQL endpoints
The previous section explained how RATT pipelines can be connected to TriplyDB Saved Queries. It is also possible to connect RATT to raw SPARQL endpoints, including non-TriplyDB endpoints. Unfortunately, raw SPARQL endpoints do not offer the same production-grade features as TriplyDB Saved Queries. For example, there is no standardized way to retrieve larger result sets.
The following code snippet issues a raw SPARQL query against a public SPARQL endpoint:
const myQuery = Ratt.Source.url(
  'https://dbpedia.org/sparql',
  {
    request: {
      headers: {
        accept: 'text/csv',
        // The SPARQL 1.1 Protocol uses this Media Type for a directly POSTed query.
        'content-type': 'application/sparql-query',
      },
      body: 'select * { ?s ?p ?o. } limit 1',
      method: 'POST',
    },
  }
)
Since we specified CSV as the result set format (Media Type text/csv), the above SPARQL query can be accessed like any other CSV source in RATT:
app.use(
  fromCsv(myQuery),
)
Extensible Markup Language (XML) files
Extensible Markup Language (file name extension .xml) is similar to HTML, but allows you to define your own tags. This makes it a very useful format for storing, searching, and sharing data.
RATT has a dedicated connector for XML files. After such files are uploaded as TriplyDB Assets, RATT can connect to them as follows:
const account = 'my-account'
const dataset = 'my-dataset'
app.use(
  fromXml(
    Ratt.Source.TriplyDb.asset(account, dataset, {name: 'my-data.xml'}),
    { selectors: ['first-element'] }
  )
)
selectors is an array of paths indicating which XML elements should be stored as records.
- Example: if you have an XML of the format:
<root>
  <a>
    ...
  </a>
</root>
by using the array ['root', 'a'] as a selector, you would add as records the elements that are nested inside the <a> tag. Note that you must specify the full path in the selector, from the root down to the node you want as a record.
This function transforms XML to JSON.
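For illustration, a sketch of the record that a single element might produce (the element content here is hypothetical, and the exact record shape depends on the XML parser that RATT uses):

// Hypothetical input element:
//   <a id="1"><name>Alice</name></a>
// A possible corresponding record after the XML-to-JSON transformation:
const record = {
  id: '1',       // XML attributes become string values
  name: 'Alice', // child elements become keys
}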
Specify multiple source files
The RATT connectors for source files allow an arbitrary number of files to be specified.
The following example code connects two CSV files to a RATT pipeline:
const account = 'my-account'
const dataset = 'my-dataset'
app.use(
  fromCsv([
    Ratt.Source.TriplyDb.asset(account, dataset, {name: 'my-table-1.csv.gz'}),
    Ratt.Source.TriplyDb.asset(account, dataset, {name: 'my-table-2.csv.gz'}),
  ]),
)
This also works with sources that are specified in the RATT context:
const account = 'my-account'
const dataset = 'my-dataset'
const app = new Ratt({
  sources: {
    table1: Ratt.Source.TriplyDb.asset(account, dataset, {name: 'my-table-1.csv.gz'}),
    table2: Ratt.Source.TriplyDb.asset(account, dataset, {name: 'my-table-2.csv.gz'}),
  }
})
app.use(
  fromCsv([
    app.sources.table1,
    app.sources.table2,
  ]),
)
Iterate over all Assets
While it is possible to explicitly specify one or more source files that are uploaded as TriplyDB Assets, RATT also makes it easy to use all Assets at once:
const app = new Ratt({
  sources: {
    one: Ratt.Source.TriplyDb.asset(dataset),
    two: Ratt.Source.TriplyDb.asset(account, dataset),
  }
})
Source one provides an iterator over all Assets in the dataset that is published under the user account associated with the current API Token. Source two provides an iterator over all Assets in the dataset that is published under the specified account.
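For example, if every Asset in the dataset is a compressed CSV file, such an iterator can be passed to a connector directly (a sketch):

app.use(
  fromCsv(app.sources.one),
)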
File compression for plain text files
It is a best practice to compress static data files if they are plain text files. Compression should be applied prior to uploading such files as TriplyDB Assets.
The following command shows how a local CSV file can be compressed using GNU Zip (.gz):
gzip my-table.csv
Running this command replaces the file my-table.csv with the compressed file my-table.csv.gz.
Using local files
Some people like to work with local files. This is generally a bad idea, because your work cannot be shared with others. Still, if you understand the implications of using local files, you can connect them to your RATT pipeline.
The following example connects a local CSV file:
app.use(
  fromCsv(Ratt.Source.file('my-table.csv.gz')),
)
Files from publicly accessible URLs
Some people like to work with publicly accessible URLs on the Internet. This is generally a bad idea: it cannot be used for data that is not known to have a public license, and because a lot of data has no clear license, this approach can almost never be used legitimately. Also, Internet resources may not always be available, making them risky to depend on in a pipeline, and their contents may change, which may affect the pipeline.
Still, if you understand the implications of using publicly accessible URLs, you can connect them to your RATT pipeline.
The following example connects an imaginary remote CSV file, published at a made-up public URL, to RATT:
app.use(
  fromCsv(Ratt.Source.url('https://example.com/my-table.csv.gz')),
)
Using a local string
When you create an Extract-Transform-Load (ETL) pipeline, there are several use cases in which you do not have an external resource to transform, but instead a string that contains the data. To support these use cases, RATT allows a local string to be used as a Source. This is useful when you want to test certain parts of your ETL, or when you want to learn how RATT works.
Ratt.Source.string() can be used in combination with three middlewares: loadRdf, validateShacl, and fromJson. All three are shown below.
The following example loads two in-line specified RDF triples that are used as the input source for loadRdf:
app.use(
  loadRdf(Ratt.Source.string(`
    <https://triplydb.com/me> a <https://triplydb.com/Person> .
    <https://triplydb.com/me> <https://triplydb.com/name> "me".`)),
)
The following example loads two in-line specified RDF triples (a SHACL node shape) that are used as the input source for validateShacl:
app.use(
  validateShacl(Ratt.Source.string(`
    prefix sh: <http://www.w3.org/ns/shacl#>
    <https://triplydb.com/PersonShape> a sh:NodeShape .
    <https://triplydb.com/PersonShape> sh:targetClass <https://triplydb.com/Person>.`)),
)
The following example loads an array of two in-line JSON objects as the input source for fromJson. Note that the string must contain valid JSON:
app.use(
  fromJson(Ratt.Source.string(`
    [
      { "name": "Alice", "type": "Person" },
      { "name": "Bob", "type": "Person" }
    ]`)),
)
Working with IRIs
Linked data uses IRIs for uniquely identifying data items. This means that IRIs are often mentioned inside RATT pipelines. Because IRIs can be long and complex, it is a best practice to declare short aliases that can be used to abbreviate IRIs.
It is a best practice to declare such prefixes together and at the top of the TypeScript file that implements the RATT pipeline:
- When all prefix declarations appear together, it is less likely that the same prefix is accidentally declared twice.
- When all prefix declarations appear at the top of the file, this avoids situations in which a prefix cannot be used because it has not yet been declared.
Declaring IRI prefixes in RATT
RATT has a special function that creates prefixes. It works as follows:
const ALIAS = Ratt.prefixer(IRI_PREFIX)
This allows a potentially complex and long IRI_PREFIX to be used through a short and simple object called ALIAS.
To distinguish objects that denote prefix declarations from objects that denote other things, it is common to place prefix declarations in an object called prefix:
const prefix = {
  ex: Ratt.prefixer('https://example.com/'),
}
After this prefix has been declared, prefix.ex can be used instead of the longer IRI 'https://example.com/'. In linked data, an alias (in this example: ex) denotes a namespace: a collection of IRI terms that share the same IRI prefix.
It is common to place IRI terms that belong to the same namespace in an object that is named after the corresponding prefix alias.
For example, the following three IRI terms belong to the ex namespace:
const ex = {
  john: prefix.ex('john'),
  knows: prefix.ex('knows'),
  mary: prefix.ex('mary'),
}
Later in the RATT pipeline, these terms can be used to create statements:
app.use(
  // “John knows Mary.”
  triple(ex.john, ex.knows, ex.mary),
)
External prefixes
In linked data it is common to reuse existing vocabularies and datasets. Such external vocabularies and datasets use their own IRIs, so it is a good idea to declare prefixes for them whenever they are used in RATT.
The following example adds prefix declarations for the Friend of a Friend (FOAF) and Resource Description Framework (RDF) vocabularies to the prefix object:
const prefix = {
  ex: Ratt.prefixer('https://example.com/'),
  foaf: Ratt.prefixer('http://xmlns.com/foaf/0.1/'),
  rdf: Ratt.prefixer('http://www.w3.org/1999/02/22-rdf-syntax-ns#'),
}
Once these prefixes have been declared, they can be used to create terms within these namespaces:
const foaf = {
  Agent: prefix.foaf('Agent'),
  Person: prefix.foaf('Person'),
}

const rdf = {
  type: prefix.rdf('type'),
}

const a = rdf.type
These declared terms can be used later in the RATT pipeline to create statements:
app.use(
  // “John is a person.”
  triple(ex.john, a, foaf.Person),
  // “Mary is a person.”
  triple(ex.mary, a, foaf.Person),
)
Because the foaf and ex objects have been declared at the start of the pipeline, the rest of the pipeline can use autocompletion for IRI terms. This works by typing the namespace alias and a dot (for example: foaf.) and pressing Ctrl + Space (Control and Space at the same time). In modern code editors this brings up a list of autocompletion results.
Notice that the RATT notation for statements is purposefully close to the widely used Turtle/TriG syntax.
prefix ex: <https://example.com/>
prefix foaf: <http://xmlns.com/foaf/0.1/>

# “John is a person.”
ex:john a foaf:Person.
# “Mary is a person.”
ex:mary a foaf:Person.
This makes it easy to read and maintain statement declarations in RATT.
Custom abbreviations
It is possible, but not common, to introduce special abbreviations for linked data terms. In the previous section we saw an example of this:
const a = rdf.type
The custom abbreviation a is also available in the popular Turtle/TriG syntax for RDF, so it is recognizable to people familiar with linked data. In Turtle/TriG syntax this abbreviation may only be used in the predicate position. This restriction is not enforced in RATT: the programmer has to enforce it themselves.
It is possible to create additional abbreviations as needed:
const is_a = rdfs.subClassOf
The rdfs.subClassOf relation implements the subsumption relation. This relation is commonly denoted as the is_a relation in many other modeling languages. (The abbreviation is_a is not supported by any of the linked data standards.)
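For the above abbreviation to work, the rdfs object must first be declared, following the same pattern as the foaf and rdf declarations above (a sketch using the standard RDFS namespace):

const prefix = {
  // ...the prefix declarations shown earlier...
  rdfs: Ratt.prefixer('http://www.w3.org/2000/01/rdf-schema#'),
}

const rdfs = {
  subClassOf: prefix.rdfs('subClassOf'),
}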
The following example uses the introduced custom abbreviation for subsumption:
app.use(
  // "A person is an agent."
  triple(foaf.Person, is_a, foaf.Agent),
)