Cutting long stories short: a plethora of new facts extracted from Wikipedia text

We proudly announce the release of new datasets extracted from Wikipedia text

The DBpedia Extraction Framework is fairly mature when it comes to Wikipedia's semi-structured content, such as infoboxes, links and categories.
However, unstructured content (typically text) plays the most crucial role because of the amount of knowledge it can deliver, and little effort has gone into extracting structured data from it.

For instance, given the Wikipedia article on the Germany national football team, we want to extract a set of meaningful facts and structure them as machine-readable statements. The following sentence:

“In Euro 1992, Germany reached the final, but lost 0–2 to Denmark”

would produce:

[Germany, defeat, Denmark]
[defeat, score, 0-2]
[defeat, winner, Denmark]
[defeat, competition, Euro 1992]
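
To make the shape of these statements concrete, here is a minimal Python sketch that holds them as plain triples and looks up everything attached to one node. The `Fact` class and the `describe` helper are hypothetical names made up for this illustration; they are not part of the fact extractor's codebase.

```python
from typing import NamedTuple


class Fact(NamedTuple):
    """One [subject, predicate, object] statement (hypothetical container)."""
    subject: str
    predicate: str
    obj: str


# The four statements from the Euro 1992 example, copied as plain strings.
facts = [
    Fact("Germany", "defeat", "Denmark"),
    Fact("defeat", "score", "0-2"),
    Fact("defeat", "winner", "Denmark"),
    Fact("defeat", "competition", "Euro 1992"),
]


def describe(node: str, statements: list[Fact]) -> dict[str, str]:
    """Collect every predicate/object pair whose subject is `node`."""
    return {f.predicate: f.obj for f in statements if f.subject == node}


# Everything we know about the "defeat" event.
print(describe("defeat", facts))
```

The first statement links the two teams, while the remaining three hang the score, the winner and the competition off the `defeat` node itself.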

The Google Summer of Code 2015 project “Fact Extraction from Wikipedia Text” has brought this idea to reality.
Mentor Marco and student Emilio have been working hard all summer.
Outcome: the computer can now read human language!

Marco and Emilio built a fact extractor, which understands the semantics of a sentence thanks to Natural Language Processing (NLP) techniques.

So there you go with plenty of new facts extracted from soccer player articles in the Italian Wikipedia:

Supervised approach      Triples   Download
All the facts            213479    nt.gz
Confident facts          110102    nt.gz
Confidence scores        43893     nt.gz

Unsupervised approach    Triples   Download
All the facts            216451    nt.gz
Confident facts          118895    nt.gz
Confidence scores        40489     nt.gz

The datasets are also loaded into the official SPARQL endpoint, so feel free to directly query the knowledge base.
Each dataset belongs to a different named graph: remember to use the FROM clause in your queries, followed by the URI of the dataset you want to explore:

http://fact.extraction.org/supervised
http://fact.extraction.org/supervised/confident
http://fact.extraction.org/supervised/scores
http://fact.extraction.org/unsupervised
http://fact.extraction.org/unsupervised/confident
http://fact.extraction.org/unsupervised/scores

Here is a query to give you an idea:

“All the soccer players who participated in some competition, and when”

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX fact: <http://dbpedia.org/fact-extraction/>

SELECT ?player ?competition ?when
FROM <http://fact.extraction.org/unsupervised>
WHERE
{
    ?player dbo:careerStation ?activity .
    ?activity fact:competition ?competition ;
              fact:time ?when .
}
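
Since the datasets differ only in their named graph, the same query can be pointed at any of them by swapping the URI in the FROM clause. The Python sketch below shows one way to do that. The `GRAPHS` short keys and the `build_query` helper are assumptions made for this illustration; only the graph URIs and the query text come from this post.

```python
# Named graphs listed in this post, keyed by hypothetical short names.
GRAPHS = {
    "supervised": "http://fact.extraction.org/supervised",
    "supervised/confident": "http://fact.extraction.org/supervised/confident",
    "supervised/scores": "http://fact.extraction.org/supervised/scores",
    "unsupervised": "http://fact.extraction.org/unsupervised",
    "unsupervised/confident": "http://fact.extraction.org/unsupervised/confident",
    "unsupervised/scores": "http://fact.extraction.org/unsupervised/scores",
}

# The example query, with the graph URI left as a placeholder.
# Literal braces are doubled so that str.format leaves them alone.
QUERY_TEMPLATE = """PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX fact: <http://dbpedia.org/fact-extraction/>

SELECT ?player ?competition ?when
FROM <{graph}>
WHERE
{{
    ?player dbo:careerStation ?activity .
    ?activity fact:competition ?competition ;
              fact:time ?when .
}}"""


def build_query(dataset: str) -> str:
    """Return the example query pointed at the named graph for `dataset`."""
    if dataset not in GRAPHS:
        raise KeyError(f"unknown dataset: {dataset!r}")
    return QUERY_TEMPLATE.format(graph=GRAPHS[dataset])


# Same question, asked of the confident facts from the supervised approach.
print(build_query("supervised/confident"))
```

The resulting string can be pasted into the endpoint's query form or sent with any SPARQL client.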

If you feel adventurous, you can check out the project codebase here.
