AutoFR Dataset

The AutoFR dataset is organized as one zip file per crawled site. It covers the different experiments that we conducted in our paper. Overall, the dataset contains the 1042 sites within the Top-5K on which we detected ads. This page details the dataset’s format and provides a form at the bottom of the page to request access. By requesting access to the datasets, you agree to the terms of the dataset license below.

Dataset Overview

Our dataset contains 1042 zips, one per site. The dataset also includes Top5K_rules.csv, which aggregates the results for each site along with the corresponding rules that we generated using AutoFR.

Below, we summarize each directory within each zip file and how it corresponds to each experiment. Recall that each zip file corresponds to a site.

  • Reward: These directories contain our output from running the AutoFR Algorithm (see Sec. 5 on how we evaluated the rules that we generated).
  • Snapshots: These directories contain our site snapshots (see Sec. 4 on how we collect site snapshots, Sec. 5.1 and Table 3 column 1).
  • Init: These directories contain the 10 visits to the site to collect the outgoing HTTP requests, which are needed to build our action space. This corresponds to the INITIALIZE procedure of our AutoFR algorithm (see Sec. 3.2.1, 3.3, and Fig. 4).
  • Custom_applied: These directories contain the in-the-wild evaluation of our rules. For example, for a particular site X, we visit the site 10 times with the rules that we generated applied, and record the results as counters of ads, images, and text (see Sec. 5.1 and Table 3 column 2).
  • Easylist_applied: These directories contain the in-the-wild evaluation of EasyList rules. The file EasyList_filter_rule_transformed_default.txt contains the exact rules that we applied. For example, for a particular site X, we visit the site 10 times with the EasyList rules applied, and record the results as counters of ads, images, and text (see Sec. 5.1 and Table 3 column 4).
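
As a quick sanity check after downloading, the directory layout above can be inspected without unpacking a zip. The following is a minimal sketch, not part of the AutoFR tooling: the function name count_files_by_top_dir is hypothetical, and it assumes the experiment directories sit at the top level of each archive. If your copy wraps them in a site-named folder, that folder will show up as the only key.

```python
import zipfile
from collections import Counter


def count_files_by_top_dir(zip_path):
    """Count the files under each top-level directory of a zip archive.

    zip_path may be a filesystem path or a binary file-like object.
    Files stored at the archive root are counted under ".".
    """
    counts = Counter()
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # skip explicit directory entries
            parts = info.filename.split("/")
            top = parts[0] if len(parts) > 1 else "."
            counts[top] += 1
    return counts
```

Calling count_files_by_top_dir on a site's zip should list the directories described above (for example Reward, Snapshots, and Init), which makes it easy to spot a site where an experiment directory is missing.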

Dataset Structure and Format

We describe the structure and format within each directory below.

  • Snapshots:
    • adgraph_networkx: This holds the site snapshots. (graphml files)
    • init_adgraph_site_feedback: This holds the raw AdGraphs before annotations. (See Sec. 4.1 for how we annotate raw AdGraphs.) There are 10 of these, one per visit to the site. (JSON files)
  • Reward:
    • action_values.csv: This contains information about the multi-arm bandit run, such as the q-value of each action, the number of pulls per action, and whether we put the arm to sleep or not.
    • dh_graph.json: This is the hierarchical action space in JSON format. (See Sec. 3.2.1 on how we build the action space.)
    • dh_nodes_history.json: This holds additional logging information about which actions we took at each time step t.
    • final_rules.txt: These are the filter rules that we generated. (Ignore the header comments in this file.)
    • low_q_rules.txt: These are the rules that could block ads but caused breakage beyond the threshold w. (See Sec. 3.2.2 and 3.3)
    • unknown_rules.txt: These are rules that we put to sleep. (See Sec. 3.2.2 and Algorithm 1 line 23.)
    • log.log: This holds the verbose logging of our algorithm run.
  • Init:
    • filter_lists: This holds the rules that we applied (if any). (Text files)
    • json: This holds the collected outgoing HTTP requests.
    • screenshots: This holds a screenshot of the site. (PNG files)
    • stats_init.csv: This holds the counters of ads, images, and text.
    • log.log: This holds the verbose logging of the site visit.
  • Custom_applied:
    • Same format as Init; see the Init description above.
  • Easylist_applied:
    • Same format as Init; see the Init description above.
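
Since final_rules.txt carries header comments that should be ignored, a small loader helps when comparing rules across sites. This is a minimal sketch under an assumption: the header follows the common Adblock Plus filter-list convention, in which comment lines start with "!" and an optional first line looks like "[Adblock Plus 2.0]". The exact header in the dataset files is not documented here, so check one file before relying on it. The function name load_filter_rules is hypothetical.

```python
def load_filter_rules(text):
    """Return the filter rules in `text`, dropping blanks and header comments.

    Assumes Adblock Plus style comments: lines starting with "!" and
    bracketed header lines such as "[Adblock Plus 2.0]".
    """
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("!"):
            continue  # blank line or "!" comment
        if line.startswith("[") and line.endswith("]"):
            continue  # bracketed list header
        rules.append(line)
    return rules
```

The same function should work for low_q_rules.txt and unknown_rules.txt if they use the same comment convention, which is also an assumption.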

AutoFR Algorithm

The algorithm works as follows. During the INITIALIZE procedure, we visit the site 10 times to collect its network traffic and site snapshots. The network traffic from all 10 visits is used to create the action space and to compute the baseline representation of the site, that is, the expected counts of ads, images, and text. Then, during the main AutoFR algorithm, we use the site snapshots to run our multi-arm bandits solution. See Sec. 3 for the formulation and Sec. 4 for the implementation.
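
To make the baseline representation concrete, the sketch below averages the ad, image, and text counters over all recorded visits. It is an illustration under stated assumptions, not the paper's implementation: it treats the "expected count" as a plain arithmetic mean, and the column names ads, images, and text are hypothetical placeholders, so pass the actual header names from stats_init.csv. The function name baseline_counts is also hypothetical.

```python
import csv
import io
from statistics import mean


def baseline_counts(csv_texts, columns=("ads", "images", "text")):
    """Average the given counter columns over every row of the given CSV texts.

    csv_texts: iterable of CSV file contents (strings), each with a header row.
    columns:   counter column names (hypothetical defaults, adjust as needed).
    """
    values = {c: [] for c in columns}
    for text in csv_texts:
        for row in csv.DictReader(io.StringIO(text)):
            for c in columns:
                values[c].append(float(row[c]))
    # statistics.mean raises StatisticsError if no rows were found.
    return {c: mean(v) for c, v in values.items()}
```

Because the function iterates over every row of every input, it works whether the visits are stored as one row per file or as several rows in a single file.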

License

The AutoFR data sharing agreement is inspired by a similar one from CAIDA. This is a basic policy to which you must agree before we give you access to any part of our dataset.

AUTOFR DATASETS ACCEPTABLE USE AGREEMENT for DATA COLLECTED BY UCI NETWORKING GROUP.

Usage of these datasets is subject to agreeing to the following terms.

LICENSE

UCI Networking Group authorization to access the data grants You a limited, non-exclusive, non-transferable, non-assignable, and terminable license to copy, modify, and use the data only for non-profit research and education. No license is granted for any other purpose and there are no implied licenses in this Agreement. Nothing in this License is intended to limit any rights You may have arising from fair use or due to other limitations on UCI Networking Group’s exclusive rights under copyright law or other applicable laws. UCI Networking Group has the authority and reserves the right, in its sole discretion, to discontinue further access and use to anyone who violates this AUA.

You will not disclose the datasets to anyone other than those employed by your institution who are collaborating with you using the datasets. Other entities must request access to the datasets separately using our form below.

You will make no attempts to reverse engineer, decrypt, or otherwise identify any personal information in the AutoFR datasets. We have done our best to anonymize the datasets to protect our systems. However, if you find any remaining vulnerabilities or credentials in the datasets, you must responsibly disclose them to us.

If You create a publication (including web pages, papers published by a third party, teaching material, and publicly available presentations) using data from these datasets, You must cite the corresponding paper as follows:

@inproceedings{le2023autofr,
  title={{AutoFR: Automated Filter Rule Generation for Adblocking}},
  author={Le, Hieu and Elmalaki, Salma and Markopoulou, Athina and Shafiq, Zubair},
  booktitle={32nd USENIX Security Symposium (USENIX Security)},
  year={2023},
  month=aug,
  address={Anaheim, CA}
}

We also encourage You to provide the UCI Networking Group with a link to your publication. We use this information in reports to our funding agencies.

DISCLAIMER OF WARRANTIES. UCI NETWORKING GROUP USES ITS BEST EFFORTS TO PROVIDE DATA IN ACCORDANCE WITH ETHICAL PRINCIPLES AND SCIENTIFIC INTEGRITY. HOWEVER, THE DATA PROVIDED HEREIN IS ON AN “AS IS” BASIS. NEITHER AUTOFR DATASETS, ITS RESEARCHERS, RESEARCH PARTNERS, LICENSORS, AND DATA PROVIDERS, NOR THE UNIVERSITY OF CALIFORNIA AND ITS TRUSTEES, OFFICERS, EMPLOYEES, AND AGENTS MAKE ANY WARRANTY, EITHER IMPLIED OR EXPRESS, OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, INCLUDING, BUT NOT LIMITED TO, THE ACCURACY, TIMELINESS, COMPLETENESS, RELIABILITY, OR AVAILABILITY OF AUTOFR DATA, APPLICATIONS, OR SERVICES ACCESSIBLE THROUGH OR MADE AVAILABLE BY UCI NETWORKING GROUP.

LIMITATION OF LIABILITY. TO THE EXTENT ALLOWED BY LAW, IN NO EVENT SHALL UCI NETWORKING GROUP AND THE UNIVERSITY OF CALIFORNIA BE LIABLE TO YOU OR ANY THIRD PARTY FOR ANY INDIRECT, CONSEQUENTIAL, INCIDENTAL, SPECIAL OR PUNITIVE DAMAGES, ARISING FROM YOUR USE OF THE DATA.

If You have any questions about the data or about this Public Agreement, please email athinagroupreleases@gmail.com.

Access the Data

To access the data, please fill out the form below. Note that by filling out the form, you agree to our Privacy Policy and the dataset license above. The datasets are hosted on Google Drive. Once you submit the form, we will give you access to the datasets through Google Drive. Please do not share the Google Drive link with anyone else. Instead, please refer any other interested party to this access form. Keeping track of dataset access is important to us because it facilitates accurate reporting to our funding agencies.
