Temporal expression identification

Data

Start by downloading the dataset.

[1]:

from tieval import datasets
datasets.download("TempEval_3")

Dataset tempeval_3 was already on data.

[2]:

te3 = datasets.read("TempEval_3")

100%|████████████████████████████████████████| 275/275 [00:00<00:00, 311.13it/s]

Some statistics of the corpus:

[3]:

print(f"Number of Documents: {len(te3)}\n")

n_train_events = 0
n_train_timexs = 0
n_train_tlinks = 0
for doc in te3.train:
    n_train_events += len(doc.events)
    n_train_timexs += len(doc.timexs)
    n_train_tlinks += len(doc.tlinks)


print("Train")
print("-----")
print(f"Number Documents: {len(te3.train)}")
print(f"Number Events: {n_train_events}")
print(f"Number Timex: {n_train_timexs}")
print(f"Number TLinks: {n_train_tlinks}\n")

n_test_events = 0
n_test_timexs = 0
n_test_tlinks = 0
for doc in te3.test:
    n_test_events += len(doc.events)
    n_test_timexs += len(doc.timexs)
    n_test_tlinks += len(doc.tlinks)


print("Test")
print("----")
print(f"Number Documents: {len(te3.test)}")
print(f"Number Events: {n_test_events}")
print(f"Number Timex: {n_test_timexs}")
print(f"Number TLinks: {n_test_tlinks}")

Number of Documents: 275

Train
-----
Number Documents: 255
Number Events: 11028
Number Timex: 2065
Number TLinks: 10952

Test
----
Number Documents: 20
Number Events: 748
Number Timex: 158
Number TLinks: 929
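
The two loops above repeat the same tally for each split. As a sketch, a small helper could do the counting once. `count_annotations` is a hypothetical name, and it assumes only that each document exposes `events`, `timexs`, and `tlinks` collections, as used above. The stand-in documents exist just to make the snippet runnable without the corpus.

```python
from collections import Counter
from types import SimpleNamespace


def count_annotations(docs):
    """Tally documents, events, timexs and tlinks over an iterable of documents."""
    totals = Counter(documents=0, events=0, timexs=0, tlinks=0)
    for doc in docs:
        totals["documents"] += 1
        totals["events"] += len(doc.events)
        totals["timexs"] += len(doc.timexs)
        totals["tlinks"] += len(doc.tlinks)
    return totals


# Stand-in documents (hypothetical). With the corpus loaded, one would pass
# te3.train or te3.test instead.
toy_docs = [
    SimpleNamespace(events=["e1", "e2"], timexs=["t1"], tlinks=["l1", "l2", "l3"]),
    SimpleNamespace(events=["e3"], timexs=["t2", "t3"], tlinks=[]),
]

toy_totals = count_annotations(toy_docs)
for name, value in toy_totals.items():
    print(f"Number {name}: {value}")
```

Calling the helper once per split keeps the reporting code identical for train and test.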

[4]:

doc = te3["wsj_0006"]  # wsj_0006 is the smallest document of the corpus
print(doc)

Pacific First Financial Corp. said shareholders approved its acquisition by Royal Trustco Ltd. of Toronto for $27 a share, or $212 million.
The thrift holding company said it expects to obtain regulatory approval and complete the transaction by year-end.

[5]:

for timex in doc.timexs:
    print(timex)
    print(timex.is_dct)  # dct stands for document creation time
    print(timex.text)
    print(timex.endpoints)
    print("-----")

Timex("year-end")
False
year-end
(245, 253)
-----
Timex("11/02/89")
True
11/02/89
None
-----
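
The `endpoints` printed above read like character offsets into the document text, and the document creation time has none because it is not part of that text. The sketch below checks this reading, assuming the raw text is exactly the two sentences shown earlier joined by a single newline. `SimpleTimex` and `span_matches` are hypothetical stand-ins, not tieval's own classes or functions.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class SimpleTimex:
    """Hypothetical stand-in mirroring the attributes printed above."""
    text: str
    is_dct: bool
    endpoints: Optional[Tuple[int, int]]


# Assumption: the raw document text is the two printed sentences joined by "\n".
raw_text = (
    "Pacific First Financial Corp. said shareholders approved its acquisition "
    "by Royal Trustco Ltd. of Toronto for $27 a share, or $212 million.\n"
    "The thrift holding company said it expects to obtain regulatory approval "
    "and complete the transaction by year-end."
)

timexs = [
    SimpleTimex("year-end", False, (245, 253)),
    SimpleTimex("11/02/89", True, None),
]


def span_matches(timex, text):
    """Return True when slicing the text at the endpoints yields the timex text."""
    if timex.endpoints is None:  # e.g. the DCT, which has no span in the raw text
        return False
    start, end = timex.endpoints
    return text[start:end] == timex.text


for timex in timexs:
    print(timex.text, span_matches(timex, raw_text))
```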

Model

tieval provides pretrained models for temporal expression identification. To access them, import the models module.

[6]:

from tieval import models

To check the available models, refer to the documentation. For this demonstration we will use the TimexIdentificationBaseline model.

[15]:

model = models.TimexIdentificationBaseline()
predictions = model.predict(te3.train)
print(predictions["wsj_0006"])

[Timex("year-end")]

Note that the predictions are missing one of the temporal expressions from the annotation. This is expected, since the missing expression is the document creation time (DCT), which is part of the document metadata and not of the raw text.
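
Because the document creation time never appears in the raw text, one may want to drop it from the annotations before comparing them with predictions by hand. The sketch below does that under stated assumptions: `in_text_timexs` is a hypothetical helper that relies only on the `is_dct` attribute shown earlier, and the toy objects stand in for real annotations. Whether `evaluate.timex_identification` already accounts for the DCT internally is not covered here.

```python
from types import SimpleNamespace


def in_text_timexs(timexs):
    """Keep only the expressions that are not the document creation time."""
    return [timex for timex in timexs if not timex.is_dct]


# Toy annotations (hypothetical) mirroring the wsj_0006 example.
toy_annotations = {
    "wsj_0006": [
        SimpleNamespace(text="year-end", is_dct=False),
        SimpleNamespace(text="11/02/89", is_dct=True),
    ],
}

filtered = {
    name: in_text_timexs(timexs) for name, timexs in toy_annotations.items()
}
print([timex.text for timex in filtered["wsj_0006"]])
```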

Evaluation

[16]:

from tieval import evaluate

On the training set.

[17]:

annotations = {doc.name: doc.timexs for doc in te3.train}
result = evaluate.timex_identification(annotations, predictions, verbose=True)

|       |    f1 |   precision |   recall |
|-------+-------+-------------+----------|
| macro | 0.921 |       0.907 |    0.935 |
| micro | 0.949 |       0.949 |    0.949 |
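
The two rows aggregate differently: macro scores are commonly averaged per document, while micro scores pool the counts across documents first. The sketch below illustrates that distinction with exact-match spans. It follows the common definition and is not necessarily how `evaluate.timex_identification` computes its numbers. The function names and toy spans are hypothetical.

```python
def prf(tp, n_pred, n_gold):
    """Precision, recall and F1 from a true-positive count and the two totals."""
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def micro_macro(gold, pred):
    """Score dicts that map a document name to a set of (start, end) spans."""
    per_doc = []
    tp_total = pred_total = gold_total = 0
    for name, gold_spans in gold.items():
        pred_spans = pred.get(name, set())
        tp = len(gold_spans & pred_spans)  # exact span matches only
        per_doc.append(prf(tp, len(pred_spans), len(gold_spans)))
        tp_total += tp
        pred_total += len(pred_spans)
        gold_total += len(gold_spans)
    if per_doc:
        # Macro: average each metric over documents.
        macro = tuple(sum(values) / len(per_doc) for values in zip(*per_doc))
    else:
        macro = (0.0, 0.0, 0.0)
    # Micro: compute the metrics once from the pooled counts.
    micro = prf(tp_total, pred_total, gold_total)
    return {"macro": macro, "micro": micro}


# Toy spans (hypothetical) for two documents.
gold = {"a": {(0, 4), (10, 14)}, "b": {(5, 9), (30, 34)}}
pred = {"a": {(0, 4)}, "b": {(5, 9), (20, 25)}}

scores = micro_macro(gold, pred)
for name, (precision, recall, f1) in scores.items():
    print(f"{name}: precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

With many short documents, macro scores are sensitive to documents that contain only one or two expressions, whereas micro scores are dominated by documents with many expressions.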

On the test set.

[18]:

predictions = model.predict(te3.test)
annotations = {doc.name: doc.timexs for doc in te3.test}
result = evaluate.timex_identification(annotations, predictions, verbose=True)

|       |    f1 |   precision |   recall |
|-------+-------+-------------+----------|
| macro | 0.778 |       0.817 |    0.742 |
| micro | 0.746 |       0.746 |    0.746 |
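
The test scores are noticeably lower than the training scores, so it can help to inspect which expressions were missed or spuriously predicted in individual documents. The sketch below assumes gold and predicted expressions have been reduced to `(start, end)` character spans. `span_errors` and the toy spans are hypothetical.

```python
def span_errors(gold_spans, pred_spans):
    """Split disagreements into missed (gold only) and spurious (predicted only)."""
    gold_set, pred_set = set(gold_spans), set(pred_spans)
    return {
        "missed": sorted(gold_set - pred_set),
        "spurious": sorted(pred_set - gold_set),
    }


# Toy spans (hypothetical) for a single document.
errors = span_errors(
    gold_spans=[(245, 253), (300, 310)],
    pred_spans=[(245, 253), (12, 18)],
)
print(errors)
```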