PoliGraph Dataset

This page releases the datasets used in the paper “PoliGraph: Automated Privacy Policy Analysis using Knowledge Graphs”.

Please check our Artifact Appendix (to appear on the USENIX’23 website) and Artifact Evaluation document for further instructions on how to use the dataset to reproduce the results in our paper.

PoliCheck Privacy Policies (Mar 2023 version)

File: policheck-dataset-202303.tar.xz

Description: The paper mainly used privacy policies from the PoliCheck project as the benchmark dataset. The original dataset (flow.csv) only provides a list of Android apps and privacy policy URLs. We downloaded these privacy policy webpages in March 2023.

The archive file has four directories:

  • by-url: Originally crawled data. Each directory is named with the Base64-encoded form of the corresponding URL in flow.csv.
  • dedup: Full deduplicated dataset. Each directory is named with the Blake2s hash of its cleaned.html. There are 6,084 unique privacy policies in total.
  • s_test: A subset of privacy policies used for evaluation purposes.
  • s_dev: A subset of privacy policies used to create the training data for NER and purpose classification.
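
The directory naming described above can be sketched in Python as follows. This is only an illustration, not the authors' tooling: the page does not say which Base64 alphabet is used or how the Blake2s digest is encoded, so the URL-safe alphabet, the hexadecimal digest, and the helper names `dirname_to_url` and `content_hash` are all assumptions to check against the extracted archive.

```python
import base64
import hashlib
from pathlib import Path


def dirname_to_url(dirname: str) -> str:
    """Decode a by-url directory name back to its URL.

    Assumption: the URL-safe Base64 alphabet is used. Padding is
    re-added in case it was stripped from the directory name.
    """
    padded = dirname + "=" * (-len(dirname) % 4)
    return base64.urlsafe_b64decode(padded).decode("utf-8")


def content_hash(html_path) -> str:
    """Blake2s hash of a file's bytes.

    Assumption: default 32-byte digest, hex-encoded.
    """
    return hashlib.blake2s(Path(html_path).read_bytes()).hexdigest()
```

If the assumptions hold, `content_hash` applied to the HTML file inside a dedup directory should reproduce that directory's name; a mismatch suggests a different digest size or encoding.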

Each subdirectory inside these directories holds one privacy policy and contains the following files:

  • cleaned.html is the HTML cleaned by the Readability.js library.
  • readability.json is the output dictionary from Readability.js, kept for debugging purposes.
  • accessibility_tree.json is the accessibility tree generated by Firefox, which helps to parse the DOM tree.
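
A minimal sketch of loading one policy directory in Python, assuming the three files listed above are UTF-8 encoded and that both JSON files parse with the standard `json` module. The helper name `load_policy` and the keys of the returned dictionary are hypothetical, and the internal structure of the two JSON files is not described on this page, so the sketch returns them unmodified.

```python
import json
from pathlib import Path


def load_policy(policy_dir):
    """Read the three files of a single privacy policy directory."""
    policy_dir = Path(policy_dir)
    # Cleaned HTML produced by Readability.js
    html = (policy_dir / "cleaned.html").read_text(encoding="utf-8")
    # Raw Readability.js output, kept for debugging
    readability = json.loads(
        (policy_dir / "readability.json").read_text(encoding="utf-8")
    )
    # Accessibility tree generated by Firefox
    tree = json.loads(
        (policy_dir / "accessibility_tree.json").read_text(encoding="utf-8")
    )
    return {
        "html": html,
        "readability": readability,
        "accessibility_tree": tree,
    }
```

The function can be applied to every subdirectory of s_test or s_dev, for example with `Path(...).iterdir()`, to iterate over a whole subset.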

PoliCheck Privacy Policies (2019 version from Internet Archive)

File: policheck-dataset-2019-internet-archive.tar.xz

Description: The data flows in the PoliCheck dataset were obtained around 2019. For the data flow-to-policy analysis, we therefore downloaded historical versions of the privacy policies from around that time from the Internet Archive.

The archive file has only one directory, wb2019. Its subdirectories are named with the Blake2s hash of the URLs in PoliCheck’s flow.csv. Each subdirectory contains the same set of files as in the Mar 2023 version.
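
To locate the 2019 snapshot of a given policy, the subdirectory name can be derived from its URL. The sketch below assumes that the URL is hashed as UTF-8 bytes exactly as it appears in flow.csv and that the digest is the default 32-byte Blake2s output in hexadecimal; neither detail is stated on this page, and the helper name `wb2019_dir` is hypothetical.

```python
import hashlib
from pathlib import Path


def wb2019_dir(archive_root, url: str) -> Path:
    """Return the expected wb2019 subdirectory for a policy URL.

    Assumptions: UTF-8 encoding of the URL, default Blake2s digest
    size, hex-encoded directory names.
    """
    digest = hashlib.blake2s(url.encode("utf-8")).hexdigest()
    return Path(archive_root) / "wb2019" / digest
```

Checking `wb2019_dir(root, url).is_dir()` for a few URLs from flow.csv is a quick way to confirm or reject these assumptions.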

External Data for Artifact Evaluation

File: poligraph-artifacts-external.tar.xz

Description: Some results in our paper rely on manually labeled data or output from other tools. We provide these artifacts to make the results easier to reproduce. Please check our Artifact Evaluation document for instructions on how to use these files.

License

The following terms comprise the Acceptable Use Policy and Data License Agreement for all publicly accessible datasets (the “Public Agreement”) made available here.

UCI Networking Group’s authorization to access the data grants You a limited, non-exclusive, non-transferable, non-assignable, and terminable license to copy, modify, and use the data in accordance with this Public Agreement. No license is granted for any other purpose and there are no implied licenses in this Agreement. Nothing in this License is intended to limit any rights You may have arising from fair use or due to other limitations on UCI Networking Group’s exclusive rights under copyright law or other applicable laws. The UCI Networking Group has the authority and reserves the right, in its sole discretion, to discontinue further access and use to anyone who violates this Public Agreement.

If You create a publication (including web pages, papers published by a third party, and publicly available presentations) using data from this dataset, You should cite the corresponding paper as follows:

@inproceedings{cui2023poligraph,
  title     = {{PoliGraph: Automated Privacy Policy Analysis using Knowledge Graphs}},
  author    = {Cui, Hao and Trimananda, Rahmadi and Markopoulou, Athina and Jordan, Scott},
  booktitle = {Proceedings of the 32nd USENIX Security Symposium (USENIX Security 23)},
  year      = {2023}
}

We also encourage You to provide the UCI Networking Group with a link to your publication. We use this information in reports to our funding agencies.

Access the Data

To access the data, please fill out the form below and we will email you the data. Note that by filling out the form, you agree to our Privacy Policy.
