Welcome to TripleGeo: An open-source tool for extracting geospatial features into RDF triples
TripleGeo is a utility developed by the Institute for the Management of Information Systems http://www.ipsyp.gr/ at Athena Research Center http://www.athena-innovation.gr/en.html under the EU/FP7 project GeoKnow: Making the Web an Exploratory for Geospatial Knowledge http://geoknow.eu. This general-purpose, open-source tool can be used for integrating features from geospatial databases into RDF triples.
TripleGeo is based on the open-source utility geometry2rdf https://github.com/boricles/geometry2rdf/tree/master/Geometry2RDF. This earlier tool (2010) has been substantially modified and enhanced to extract non-geographical attributes and to interact with diverse geographical and triple formats. TripleGeo is written in Java and is still under development; more enhancements will be included in future releases. Nevertheless, all supported features have been tested and work smoothly on both MS Windows and Linux platforms.
The Java source code for TripleGeo is freely available from https://github.com/GeoKnow/TripleGeo.
Converting geospatial features into triples
From a user’s perspective, the utility runs from the command line, and its execution is parameterized with a configuration file that declares the user's preferences for the conversion. TripleGeo provides the following functionality:
- It can take as input ESRI shapefiles, geographic files (in KML or GML), as well as spatial tables hosted in major DBMSs.
- It currently handles most common spatial data types, including points, (multi)linestrings and (multi)polygons.
- It can perform on-the-fly transformation of a given dataset into another spatial reference system.
- It can export geometries in several serialized formats, including WKT as prescribed by the recent GeoSPARQL standard https://portal.opengeospatial.org/files/?artifact_id=47664.
When initiated, this process iterates through all features in the original dataset and emits a series of triples per record. Every geometric feature is turned into properly formatted triple(s), according to the specified vocabulary. Additional descriptive or thematic (i.e., non-spatial) attributes can be extracted, including identifiers, names, or classifications.
CAUTION: We stress that the main purpose of TripleGeo is to extract geometric/geographic information from various formats (spatial DBMS, ESRI shapefiles, GML, KML) and transform it into RDF. Hence, support for thematic attributes is kept to a minimum; currently, only an identifier, a name and a classification value can be exported per geometry. These attributes can be specified in the configuration file. For the time being, such attributes are exported as literals, without taking into account any underlying ontology.
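To make the per-record output more concrete, here is a minimal sketch of the triples that a single point feature could yield under the GeoSPARQL vocabulary. It is not TripleGeo's own code: the class name, the base URI, the `/geometry` suffix and the sample values are hypothetical, and the identifier is assumed to be safe for use inside a URI. Only the GeoSPARQL terms (`hasGeometry`, `asWKT`, `wktLiteral`) and `rdfs:label` are taken from the respective vocabularies.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; not part of the TripleGeo code base.
final class GeoSparqlPointSketch {
    static final String GEO = "http://www.opengis.net/ont/geosparql#";
    static final String RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label";

    /** Escapes backslashes and double quotes for an N-Triples string literal. */
    static String escape(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /**
     * Builds N-Triples lines for one point feature.
     * baseUri and the "/geometry" suffix are hypothetical naming choices.
     */
    static List<String> toNTriples(String baseUri, String id, String label,
                                   double lon, double lat) {
        String feature = "<" + baseUri + id + ">";
        String geometry = "<" + baseUri + id + "/geometry>";
        // WKT lists the coordinates as "x y", i.e. longitude first.
        // Plain string concatenation is adequate for degree-range values.
        String wkt = "POINT (" + lon + " " + lat + ")";

        List<String> triples = new ArrayList<>();
        triples.add(feature + " <" + GEO + "hasGeometry> " + geometry + " .");
        triples.add(geometry + " <" + GEO + "asWKT> \"" + wkt
                + "\"^^<" + GEO + "wktLiteral> .");
        // Thematic attributes are plain literals; missing values are skipped.
        if (label != null && !label.isEmpty()) {
            triples.add(feature + " <" + RDFS_LABEL + "> \"" + escape(label) + "\" .");
        }
        return triples;
    }

    public static void main(String[] args) {
        for (String triple : toNTriples("http://example.org/resource/",
                "poi42", "Athens", 23.7275, 37.9838)) {
            System.out.println(triple);
        }
    }
}
```

Keeping the geometry as a separate resource linked through `hasGeometry` follows the GeoSPARQL pattern of distinguishing a feature from its geometry; the label triple stands in for the few thematic attributes that can be exported.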
Architecture
TripleGeo has been implemented with several Java classes in a modular fashion as illustrated in the following flow diagram:
- Connectors to source data are required in order to access geometric features. In the case of a DBMS, this is possible thanks to suitable JDBC drivers. With respect to shapefiles, the integrated GeoTools library provides all required functionality.
- A configuration file lists the properties that control each stage: how the input source will be accessed, which data is involved, what geometric representation should be used, whether geometries must be transformed into another reference system, and the output format.
- A parser iterates through each input record and converts geometries into a suitable representation according to user specifications. It also consumes non-spatial attribute values (e.g., types, names) of the features involved and emits properly formatted literals.
- A Jena model http://jena.apache.org/documentation/javadoc/jena/com/hp/hpl/jena/rdf/model/Model.html holds in memory the entire collection of generated triples.
- Optionally, reprojection of geometries into another spatial reference system is also available.
- Export of generated triples into files is performed by the Jena API http://jena.apache.org/tutorials/rdf_api.html. The output is written into a single file, in the triple format chosen by the user.
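As an illustration of what the optional reprojection step does to coordinates, the sketch below converts WGS84 longitude/latitude (degrees) into spherical Web Mercator metres using the standard closed-form formulas. This is only a worked example of one transformation: the choice of target system and the class name are assumptions made for illustration, and the sketch does not show how TripleGeo itself performs the transformation.

```java
// Illustrative sketch only; not part of the TripleGeo code base.
final class WebMercatorSketch {
    // WGS84 semi-major axis in metres, used as the sphere radius.
    static final double EARTH_RADIUS_M = 6378137.0;

    /** Converts WGS84 degrees to spherical Web Mercator metres {x, y}. */
    static double[] toWebMercator(double lonDeg, double latDeg) {
        double x = EARTH_RADIUS_M * Math.toRadians(lonDeg);
        double y = EARTH_RADIUS_M
                * Math.log(Math.tan(Math.PI / 4.0 + Math.toRadians(latDeg) / 2.0));
        return new double[] { x, y };
    }

    public static void main(String[] args) {
        double[] xy = toWebMercator(23.7275, 37.9838);
        System.out.println("x = " + xy[0] + " m, y = " + xy[1] + " m");
    }
}
```

The point is that a reprojected dataset carries the same features with different coordinate values, so the spatial reference system declared in the configuration must match the coordinates actually written to the output.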
Input
The current version of the TripleGeo utility can access geometries from:
- ESRI shapefiles, a widely used file-based format for storing geospatial features.
- Geographical data stored in GML (Geography Markup Language) and KML (Keyhole Markup Language).
- INSPIRE-aligned datasets for seven Data Themes (Annex I) in GML format: Addresses, Administrative Units, Cadastral Parcels, Geographical Names, Hydrography, Protected Sites, and Transport Networks (Roads).
- Several geospatially-enabled DBMSs, including:
  - Oracle Spatial
  - PostGIS (spatial module for PostgreSQL)
  - MySQL
  - IBM DB2 with Spatial Extender.
Geospatial data must reside in a single table (in case of a database), a single GML/KML file, or one shapefile. Currently, there is no support for combining information from several sources (e.g., by joining two or more tables).
Output
In terms of output serializations, triples can be obtained in one of the following formats:
- RDF/XML (default)
- RDF/XML-ABBREV
- N-TRIPLES
- N3
- TURTLE (also abbreviated as TTL).
Concerning geospatial representations, triples can be exported according to:
- the GeoSPARQL standard for several geometric types (including points, linestrings, and polygons) https://portal.opengeospatial.org/files/?artifact_id=47664
- the WGS84 RDF Geoposition vocabulary for point features http://www.w3.org/2003/01/geo/
- the Virtuoso RDF vocabulary for point features http://docs.openlinksw.com/virtuoso/rdfsparqlgeospat.html.
Results are written into a local file, so that they can be readily imported into a triple store.
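For point features, the WGS84 RDF Geoposition vocabulary is the simplest of the three representations: the coordinates are attached to the resource as two separate properties rather than as a single geometry literal. The sketch below shows this as N-Triples. The class name and resource URI are hypothetical, and typing the values as `xsd:double` is an assumption made for illustration; the vocabulary itself only defines the `lat` and `long` properties.

```java
// Illustrative sketch only; not part of the TripleGeo code base.
final class Wgs84PointSketch {
    static final String WGS84 = "http://www.w3.org/2003/01/geo/wgs84_pos#";
    // Assumption: coordinates are emitted as xsd:double typed literals.
    static final String XSD_DOUBLE = "http://www.w3.org/2001/XMLSchema#double";

    /** Returns two N-Triples lines: one for latitude, one for longitude. */
    static String[] toNTriples(String resourceUri, double lon, double lat) {
        String subject = "<" + resourceUri + ">";
        return new String[] {
            subject + " <" + WGS84 + "lat> \"" + lat + "\"^^<" + XSD_DOUBLE + "> .",
            subject + " <" + WGS84 + "long> \"" + lon + "\"^^<" + XSD_DOUBLE + "> ."
        };
    }

    public static void main(String[] args) {
        for (String triple : toNTriples("http://example.org/resource/poi42",
                23.7275, 37.9838)) {
            System.out.println(triple);
        }
    }
}
```

Because this vocabulary has no notion of a linestring or polygon, it suits datasets consisting of points only; other geometric types call for the GeoSPARQL representation.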
Configuration settings
Before attempting any conversion using TripleGeo, a configuration file must be prepared. This file lists crucial properties that define how input data will be accessed, where they will be exported and into which format, as well as optional features (e.g., reprojection into another spatial reference system).
These settings include properties concerning:
- Input and output parameters, including paths for necessary files and output triple format.
- Target RDF vocabulary for geometric representation.
- Database credentials and features (when accessing a DBMS) OR shapefile features (from the file system)
- Namespace parameters and prefixes for the resources that will be generated as well as for the utilized ontology.
- Spatial Reference Systems, when geometries must be transformed from one system into another.
- Optional parameters (e.g., default language for string literals).
In the specific case of SHAPEFILES, configuration parameters have the following meaning:
- featureString : the name of the shapefile (without the extension .shp);
- attribute : the attribute in the shapefile that contains values to be used as (unique) identifiers in URIs and as string literals in "rdfs:label" triples;
- ignore : specifies values (e.g., UNK) in attributes that should not be exported as literals; NULL values are automatically suppressed;
- type : a user-defined name for the resources that will be created; you may specify any valid string (with no wildcard characters) you wish;
- name : attribute in the shapefile containing values that should be extracted as name literals; such values will become the objects of "georesource:name" triples;
- class : attribute in the shapefile containing values that should be extracted as type literals; such values will become objects of "rdf:type" triples.
You may consult sample configurations from https://github.com/GeoKnow/TripleGeo/tree/master/test/conf, which cover several indicative cases in terms of data access and supported geometric types.
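As a rough illustration of the shapefile parameters described above, the following sketch parses a small configuration and reports which expected entries are missing. It assumes the configuration uses the Java properties syntax (`key = value`); the parameter names come from this documentation, but every value, the choice of which keys to treat as required, and the helper class itself are hypothetical. Consult the sample configurations for the authoritative layout.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

// Illustrative sketch only; not part of the TripleGeo code base.
final class ShapefileConfigCheck {
    // Assumption: these keys are treated as mandatory for a shapefile conversion.
    static final List<String> REQUIRED = Arrays.asList(
            "inputFile", "outputFile", "tmpDir", "featureString", "attribute");

    // All values below are made-up placeholders.
    static final String SAMPLE =
            "inputFile = ./test/data/points.shp\n"
          + "outputFile = ./test/output/points.rdf\n"
          + "tmpDir = ./tmp\n"
          + "featureString = points\n"
          + "attribute = id\n"
          + "ignore = UNK\n"
          + "type = point\n"
          + "name = name\n"
          + "class = category\n";

    /** Parses configuration text written in Java properties syntax. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Lists the required keys that are absent or blank. */
    static List<String> missingKeys(Properties props) {
        List<String> missing = new ArrayList<>();
        for (String key : REQUIRED) {
            String value = props.getProperty(key);
            if (value == null || value.trim().isEmpty()) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        System.out.println("Missing keys: " + missingKeys(parse(SAMPLE)));
    }
}
```

Forward slashes are used in the sample paths because the properties syntax treats a backslash as an escape character, which is easy to overlook when writing Windows paths.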
Execution
In order to use TripleGeo for extracting triples from a spatial dataset, follow these steps (shown for a Windows platform; the steps are similar on Linux):
- Download the current software bundle from https://github.com/GeoKnow/TripleGeo/archive/master.zip
- Extract the downloaded .zip file into a separate folder, e.g., c:\temp.
- Open a terminal window (a command prompt in Windows or a shell in Linux) and navigate to the directory where TripleGeo has been extracted, e.g.,
  cd c:\temp\TripleGeo-master
  This directory must be the one that holds the LICENSE file. For convenience, this is where you can place your configuration file (e.g., options.conf), although you can specify another path for your configuration if you like.
- Normally, under this same folder there must be a lib/ subdirectory with the required libraries. Make sure that the actual TripleGeo.jar is under the bin/ subdirectory.
- Verify that Java JRE (or SDK) version 1.7 or later is installed. The currently installed version of Java can be checked from the command line using:
  java -version
- Next, specify all properties in the required configuration file, e.g., options.conf. You must specify correct paths to files (i.e., in parameters inputFile, outputFile, and tmpDir), which are RELATIVE to the executable.
- In case triples will be extracted from ESRI shapefiles, give the following command (in one line):
  java -cp lib/*;bin/TripleGeo.jar eu.geoknow.athenarc.triplegeo.ShpToRdf options.conf
  On Linux, classpath entries are separated by a colon instead of a semicolon, and the classpath should be enclosed in quotes so that the shell does not expand the asterisk.
- Make sure that the specified paths to .jar files are correct. You must modify these paths to the libraries and/or the configuration file if you run this command from a path other than the one containing the LICENSE file, as specified in the third step above.
- While conversion is running, it periodically issues notifications about its progress. Note that for large datasets (i.e., hundreds of thousands of records), conversion may take several minutes.
- As soon as processing is finished and all triples are written into a file, the user is notified about the total number of extracted triples and the overall execution time.
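When the conversion is launched from another Java program or a build script rather than typed by hand, the platform-specific classpath separator is a common stumbling block. The sketch below assembles the same command with `java.io.File.pathSeparator`, which is a semicolon on Windows and a colon on Linux. The helper class is hypothetical; the main class name and the relative paths are the ones given in the steps above.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; not part of the TripleGeo code base.
final class TripleGeoCommand {
    /** Builds the argument list for a shapefile conversion. */
    static List<String> build(String pathSeparator, String configFile) {
        String classpath = "lib/*" + pathSeparator + "bin/TripleGeo.jar";
        return Arrays.asList("java", "-cp", classpath,
                "eu.geoknow.athenarc.triplegeo.ShpToRdf", configFile);
    }

    public static void main(String[] args) {
        // File.pathSeparator resolves to the separator of the current platform.
        List<String> command = build(File.pathSeparator, "options.conf");
        System.out.println(String.join(" ", command));
        // To actually run it from the TripleGeo directory, one could use:
        // new ProcessBuilder(command).inheritIO().start();
    }
}
```

Passing the arguments as a list to `ProcessBuilder` bypasses the shell, so the asterisk in `lib/*` reaches the Java launcher unexpanded and no quoting is needed.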
Resources for testing
- The Java source code for TripleGeo is freely available from https://github.com/GeoKnow/TripleGeo.
- Precompiled Java executable binaries for the TripleGeo utility can be freely downloaded from https://github.com/GeoKnow/TripleGeo/archive/master.zip. Note that in order to execute TripleGeo directly from these binaries, Java JRE (or SDK) 1.7 or later must be installed and properly configured on your local machine.
- Sample geographic datasets for testing are available in ESRI shapefile format from https://github.com/GeoKnow/TripleGeo/tree/master/test/data.
- Sample configuration files for several cases are also available from https://github.com/GeoKnow/TripleGeo/tree/master/test/conf. You can edit any of these files in order to prepare suitable configuration settings for accessing a geospatial repository (from shapefile or DBMS) before executing TripleGeo on its contents.
License
The contents of this project are licensed under the GPL v3 License https://github.com/GeoKnow/TripleGeo/blob/master/LICENSE.
Development: © 2013-2015 Institute for the Management of Information Systems,
Athena Research Center, Greece.
Please send any comments to: kpatro AT dblab DOT ece DOT ntua DOT gr
Last updated: 8 November 2015 11:30:00 EET.