CV-Inspector Dataset

The CV-Inspector datasets consist of JSON, HTML, CSV, and PNG files that result from systematic crawling of the web. Note that the data may be split by site rank.

  • Ground Truth (GT) Dataset: ~2.3K URLs extracted from the anti-circumvention list and the Tranco Top-2K. The dataset is used to train the classifier for CV-Inspector.
    • ground_truth/ground_truth_from_top2k: data crawled from the top2k and annotated using EasyList.
    • ground_truth/ground_truth_from_anticv_list: data crawled from domains extracted from the Anti-CV list and annotated using EasyList.
  • Tranco-20K Dataset: ~29.3K URLs extracted from the Tranco Top sites. This is our in-the-wild dataset.
    • top20k_easylist: data crawled from top20k sites and annotated using EasyList. Zip files are broken up by rank.
    • top20k_easylist/top20k_easylist_screenshots: screenshots of top20k.
    • top20k_easylist/subpages_and_screenshots: additional data from subpages of top20k sites.

This page details the datasets’ format and provides a form to request access to the datasets (at the bottom of the page). By requesting access to the datasets, you agree to the terms of the dataset license below.

Dataset Structure and Format

The CV-Inspector datasets are made available as a zip file that comprises the following:

  • Recall that CV-Inspector visits a site for two main cases: (1) No Adblocker, (2) With Adblocker. It will also do this four times for each case.
    • Files with “_trial[0-3]” represent the four times that CV-Inspector visited the URL.
  • The data is split into folders of 100 sites each. For example, a folder may be named “0_to_100”.
  • The subfolders are as follows:
    • crawl_data_XXX: holds the JSON files for web requests and DOM mutation.
      • control_dommutation: holds the JSON for DOM mutation (No Adblocker)
      • control_webrequests: holds the JSON for web requests (No Adblocker)
      • variant_dommutation: holds the JSON for DOM mutation (With Adblocker + EasyList)
      • variant_webrequests: holds the JSON for web requests (With Adblocker + EasyList)
    • pagesource_XXX: holds the HTML files for the page source. Here, “control” files correspond to No Adblocker and “variant” files to With Adblocker.
    • screenshot_XXX: holds the PNGs for the screenshots.
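
As a rough illustration of the layout above, the following Python sketch groups the files of one extracted folder by case and trial. It relies only on the conventions stated above (the “control”/“variant” markers and the “_trial[0-3]” suffix). The helper names rank_folder, classify, and index_folder are hypothetical, and the assumption that rank folders start at multiples of 100 is inferred from the single example name “0_to_100”; adjust both to what you actually find in the zip files.

```python
import re
from collections import defaultdict
from pathlib import Path

# Matches the documented "_trial[0-3]" suffix in a file name.
TRIAL_RE = re.compile(r"_trial([0-3])")


def rank_folder(rank, step=100):
    """Hypothetical helper: folder name such as '0_to_100' for a site rank.

    Assumes folders start at multiples of `step`; whether ranks are
    0-based or 1-based is not documented, so treat the result as a guess.
    """
    start = (rank // step) * step
    return f"{start}_to_{start + step}"


def classify(rel_path):
    """Return (case, trial) for a file path, or None if a marker is missing.

    The case marker may sit in the file name or in a parent folder name
    (for example 'control_webrequests'), so the whole relative path is checked.
    """
    match = TRIAL_RE.search(rel_path.name)
    if match is None:
        return None
    text = rel_path.as_posix().lower()
    if "control" in text:
        case = "control"   # No Adblocker
    elif "variant" in text:
        case = "variant"   # With Adblocker
    else:
        return None
    return case, int(match.group(1))


def index_folder(root):
    """Group all files under `root` by (case, trial)."""
    root = Path(root)
    groups = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Classify on the path relative to root so that the location
            # of the extracted dataset does not influence the result.
            key = classify(path.relative_to(root))
            if key is not None:
                groups[key].append(path)
    return dict(groups)
```

For example, index_folder("0_to_100") would return a dictionary keyed by pairs such as ("control", 0), each mapping to the files recorded for that visit. Files without both markers are skipped rather than guessed at.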

CV-Inspector Workflow

The data collection for CV-Inspector is described in the figure below. For each site given to CV-Inspector, it will visit the page for two main cases: (1) No Adblocker, (2) With Adblocker. It will do this four times for each case and collect the following types of data: (1) web requests (JSON), (2) DOM mutation (JSON), (3) time series (CSV), (4) page source (HTML). It will also take a screenshot of each visit. This results in 8 screenshots (PNG).
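
The visit schedule above can be written out as a small Python sketch, which may help when checking whether the data for a site is complete. The case labels "control" and "variant" follow the folder naming used in the datasets; the names CASES, TRIALS, DATA_TYPES, and expected_visits are illustrative and not part of CV-Inspector itself.

```python
from itertools import product

CASES = ("control", "variant")  # (1) No Adblocker, (2) With Adblocker
TRIALS = range(4)               # trial0 .. trial3

# Data types collected per the description above, with their file formats.
DATA_TYPES = {
    "web requests": "JSON",
    "DOM mutation": "JSON",
    "time series": "CSV",
    "page source": "HTML",
}


def expected_visits():
    """Enumerate every (case, trial) pair that CV-Inspector performs per site."""
    return list(product(CASES, TRIALS))


visits = expected_visits()
# 2 cases x 4 trials = 8 visits, and one screenshot is taken per visit.
print(len(visits))
```

A completeness check could compare this list against the (case, trial) pairs actually present for a site and report any that are missing.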

For more details on how we collected and labeled the GT dataset, see Section IV-D of the paper.

[Figure: CV-Inspector Workflow]

License

The CV-Inspector data sharing agreement is inspired by a similar one from CAIDA. This is a basic policy to which you must agree before we give you access to any part of our dataset.

CV-INSPECTOR DATASETS ACCEPTABLE USE AGREEMENT for DATA COLLECTED BY UCI NETWORKING GROUP.

Usage of these datasets is subject to agreeing to the following terms.

LICENSE

UCI Networking Group authorization to access the data grants You a limited, non-exclusive, non-transferable, non-assignable, and terminable license to copy, modify, and use the data only for non-profit research and education. No license is granted for any other purpose and there are no implied licenses in this Agreement. Nothing in this License is intended to limit any rights You may have arising from fair use or due to other limitations on UCI Networking Group’s exclusive rights under copyright law or other applicable laws. UCI Networking Group has the authority and reserves the right, in its sole discretion, to discontinue further access and use to anyone who violates this AUA.

You will not disclose the datasets to anyone other than those employed by your institution who are collaborating with you using the datasets. Other entities must request access to the datasets separately using our form below.

You will make no attempts to reverse engineer, decrypt, or otherwise identify any personal information in the CV-Inspector datasets. We have done our best to anonymize the datasets to protect our systems. However, if you find any remaining vulnerabilities or credentials in the datasets, you must responsibly disclose them to us.

If You create a publication (including web pages, papers published by a third party, teaching material, and publicly available presentations) using data from these datasets, You must cite the corresponding paper as follows:

@inproceedings{le2021cvinspector,
  title={{CV-Inspector: Towards Automating Detection of Adblock Circumvention}},
  author={Le, Hieu and Markopoulou, Athina and Shafiq, Zubair},
  booktitle={The Network and Distributed System Security Symposium (NDSS)},
  url = {https://dx.doi.org/10.14722/ndss.2021.24055},
  doi = {10.14722/ndss.2021.24055},
  year={2021}
}

We also encourage You to provide the UCI Networking Group with a link to your publication. We use this information in reports to our funding agencies.

DISCLAIMER OF WARRANTIES. UCI NETWORKING GROUP USES ITS BEST EFFORTS TO PROVIDE DATA IN ACCORDANCE WITH ETHICAL PRINCIPLES AND SCIENTIFIC INTEGRITY. HOWEVER, THE DATA PROVIDED HEREIN IS ON AN “AS IS” BASIS. NEITHER CV-INSPECTOR DATASETS, ITS RESEARCHERS, RESEARCH PARTNERS, LICENSORS, AND DATA PROVIDERS, NOR THE UNIVERSITY OF CALIFORNIA AND ITS TRUSTEES, OFFICERS, EMPLOYEES, AND AGENTS MAKE ANY WARRANTY, EITHER IMPLIED OR EXPRESS, OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, INCLUDING, BUT NOT LIMITED TO, THE ACCURACY, TIMELINESS, COMPLETENESS, RELIABILITY, OR AVAILABILITY OF CV-INSPECTOR DATA, APPLICATIONS, OR SERVICES ACCESSIBLE THROUGH OR MADE AVAILABLE BY UCI NETWORKING GROUP.

LIMITATION OF LIABILITY. TO THE EXTENT ALLOWED BY LAW, IN NO EVENT SHALL UCI NETWORKING GROUP AND THE UNIVERSITY OF CALIFORNIA BE LIABLE TO YOU OR ANY THIRD PARTY FOR ANY INDIRECT, CONSEQUENTIAL, INCIDENTAL, SPECIAL OR PUNITIVE DAMAGES, ARISING FROM YOUR USE OF THE DATA.

If You have any questions about the data or about this Public Agreement, please email athinagroupreleases@gmail.com.

Access the Data

To access the data, please fill out the form below. Note that by filling out the form, you agree to our Privacy Policy and the dataset license above. The datasets are hosted on Google Drive. Once you submit the form, we will give you access to the datasets through Google Drive. Please do not share the Google Drive link with anyone else. Instead, please refer any other interested party to this access form. Keeping track of dataset access is important to us because it facilitates accurate reporting to our funding agencies.
