AntShield Dataset

Data Format

This dataset is the one used in the paper “Privacy Leak Classification on Mobile Devices”, in Proc. of SPAWC (19th IEEE Int’l Workshop on Signal Processing Advances in Wireless Communications) 2018. We manually interacted with 100 popular Android apps on a test phone using test accounts (no human subjects were involved). In addition, we used the UI/Application Exerciser Monkey to automatically interact with 400 popular Android apps. Of these apps, 297 used the network during our testing. For these apps, we captured packet traces and then extracted and saved fields from each packet, primarily from HTTP/S and IP headers. Each packet was also annotated with additional fields, such as a label indicating whether or not the packet contains personally identifiable information (PII) and which app generated the packet.

Packet Traces

The AntShield dataset provides the aforementioned information in JSON format, building on and extending the JSON format used by ReCon. An example of a packet in our JSON format is shown below.

    "1488757348051,e8998c3e-7c17-452a-8ceb-3bbf556e128b": {
        "domain": "", 
        "dst_ip": "52.9.207.173", 
        "dst_port": 80, 
        "headers": {
            "Accept-Encoding": "gzip", 
            "Charset": "UTF-8", 
            "Connection": "Keep-Alive", 
            "Content-Length": "404", 
            "Content-Type": "application/x-www-form-urlencoded", 
            "Host": "cm.gcm.ksmobile.com", 
            "User-Agent": "Dalvik/2.1.0 (Linux; U; Android 7.0; Nexus 6 Build/NBD91Y)"
        }, 
        "host": "cm.gcm.ksmobile.com", 
        "is_foreground": true, 
        "is_host_ip": 0, 
        "label": 1, 
        "md5": null, 
        "method": "POST", 
        "package_name": "com.cleanmaster.mguard", 
        "package_version": "5.15.9", 
        "pii_types": [
            "AndroidId"
        ], 
        "platform": "android", 
        "post_body": "appflag=khcleanmaster&phonelanguage=en_&cmlanguage=en&mcc=202&mnc=01&apkversion=5.15.9.6769&dataversion=2017.2.14.737&sdkversion=7.0&manufacture=motorola&channel=200001&trdmarket=1&cl=_en&aid=RECON_AndroidId&timezone=America/Los_Angeles&enabled=1&regid=APA91bGGRwv3t-bAdvaophwAJ1_oy8JEot9C9c0UgnyLvWqPxMK57LHuTnBPIvjuWjK0eokpwnwvQ_E8Dn_hrIkVL10xAqaQ91tAZeMLSEPDeVoOBWiHfqYu11Z6nz8FgThDelzk4x-e&regtime=0", 
        "protocol": "HTTP", 
        "referrer": null, 
        "scr_port": 47069, 
        "src_ip": "192.168.0.2", 
        "tk_flag": null, 
        "ts": "1488757348051", 
        "uri": "/rpc/gcm/report", 
        "user_agent": "Dalvik/2.1.0 (Linux; U; Android 7.0; Nexus 6 Build/NBD91Y)"
    }
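
As a minimal sketch of reading this format in Python, the snippet below assumes that a trace is a single JSON object mapping each key to one packet record, as in the example above. The record is abridged from that example; the `summarize` helper is hypothetical, and a real trace would be read from a file whose name is not specified here.

```python
import json

# Abridged copy of the example record above. A real trace would be read with
# json.load() from a file; the file name and layout are assumptions.
SAMPLE = '''
{
  "1488757348051,e8998c3e-7c17-452a-8ceb-3bbf556e128b": {
    "dst_ip": "52.9.207.173",
    "dst_port": 80,
    "host": "cm.gcm.ksmobile.com",
    "label": 1,
    "method": "POST",
    "package_name": "com.cleanmaster.mguard",
    "pii_types": ["AndroidId"],
    "protocol": "HTTP",
    "scr_port": 47069,
    "src_ip": "192.168.0.2",
    "ts": "1488757348051",
    "uri": "/rpc/gcm/report"
  }
}
'''

def summarize(packets):
    """Return one (app, destination, has_pii) tuple per packet record."""
    rows = []
    for pkt in packets.values():
        # Fall back to the destination IP when no host name was recorded.
        dest = f'{pkt["host"] or pkt["dst_ip"]}:{pkt["dst_port"]}'
        # Assumption: label == 1 marks a packet that contains PII.
        rows.append((pkt["package_name"], dest, pkt["label"] == 1))
    return rows

packets = json.loads(SAMPLE)
summary = summarize(packets)
print(summary)
```

Note that the source port key is spelled "scr_port" in the example record, so code that reads it should use that spelling.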

In addition to the information extracted from HTTP and TCP/UDP/IP headers, the JSON for each packet contains the following extra information:

  • “label” – this is the label indicating whether or not this packet contains PII. Please see the paper for details of how these labels are obtained.
  • “pii_types” – indicates which types of PII were found in the packet. Since we used a test phone and test accounts, no data from actual users were collected (i.e., no human subjects were involved). In addition, we redacted every PII value to the best of our ability, keeping only the PII type. For instance, in the above example the Android ID value was replaced with “RECON_AndroidId,” and the only information retained is that this packet contained some PII of type “AndroidId.”
  • “package_name” – the package name of the app responsible for the TCP/UDP connection that sent the packet.
  • “package_version” – the version number of the app named in the “package_name” field.
  • “post_body” – this field has multiple purposes. For an HTTP POST message, it contains the body of the message. For a non-HTTP(S) TCP/UDP packet, we use this field to store the packet’s data. In most cases such “raw” TCP packets belong to a segmented HTTP(S) connection.
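
The fields above can be combined to locate where a redacted value sits inside a request body. The following is a sketch under stated assumptions, not the authors' tooling: it assumes that label 1 marks a packet with PII, that "post_body" is form-encoded, and that every redaction placeholder starts with "RECON_" (only "RECON_AndroidId" appears in the example). The helper names and the second packet key are hypothetical.

```python
from urllib.parse import parse_qsl

def redacted_params(post_body, marker="RECON_"):
    """Return form parameters whose value is a redaction placeholder."""
    pairs = parse_qsl(post_body or "")
    return {name: value for name, value in pairs if value.startswith(marker)}

def redactions_by_packet(packets):
    """Map each PII-labelled packet key to its redacted parameters."""
    found = {}
    for key, pkt in packets.items():
        if pkt.get("label") != 1:
            continue  # assumption: only label 1 packets contain PII
        hits = redacted_params(pkt.get("post_body"))
        if hits:
            found[key] = hits
    return found

packets = {
    # Body abridged from the example record above.
    "1488757348051,e8998c3e-7c17-452a-8ceb-3bbf556e128b": {
        "label": 1,
        "pii_types": ["AndroidId"],
        "post_body": "cl=_en&aid=RECON_AndroidId&timezone=America/Los_Angeles&enabled=1",
    },
    # Hypothetical record without PII, included only to show the filter.
    "hypothetical-key-without-pii": {"label": 0, "pii_types": [], "post_body": None},
}

found = redactions_by_packet(packets)
print(found)
```

Because "post_body" may also hold raw data from non-HTTP(S) packets, form parsing will simply find no placeholders in such records; a substring search for the placeholder may be more appropriate there.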

Aggregated Data

To ease analysis, we also provide two CSV files containing aggregated data from the packet traces:

  • “apps_info.csv” – contains information from the Google Play Store on each of the 297 apps in our dataset: the Google Play URL, app category, average rating, number of ratings, and number of installs.
  • “exposures.csv” – lists each unique combination of PII type, the app that sent it, destination host, and protocol (HTTP vs. HTTPS), together with the number of times that combination occurs in our dataset.
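
To illustrate the kind of aggregation that “exposures.csv” describes, the sketch below counts (PII type, app, host, protocol) combinations directly from packet records. It is an illustration only and not necessarily how the CSV was produced; the function name is hypothetical, the column layout of the actual file is not assumed, and the two input records are copies of the example packet.

```python
from collections import Counter

def aggregate_exposures(packets):
    """Count (pii_type, app, host, protocol) combinations over PII packets."""
    counts = Counter()
    for pkt in packets.values():
        if pkt.get("label") != 1:
            continue  # assumption: label 1 marks a packet that contains PII
        for pii in pkt.get("pii_types") or []:
            counts[(pii, pkt["package_name"], pkt["host"], pkt["protocol"])] += 1
    return counts

# Fields taken from the example packet; the keys below are placeholders.
record = {
    "host": "cm.gcm.ksmobile.com",
    "label": 1,
    "package_name": "com.cleanmaster.mguard",
    "pii_types": ["AndroidId"],
    "protocol": "HTTP",
}
packets = {"packet-a": record, "packet-b": dict(record)}

exposures = aggregate_exposures(packets)
for (pii, app, host, proto), n in exposures.items():
    print(pii, app, host, proto, n, sep=",")
```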

License

The AntShield data sharing agreement is inspired by a similar one from CAIDA. This is a basic policy to which you must agree before we give you access to any part of our dataset.

ANTSHIELD ACCEPTABLE USE AGREEMENT for DATA COLLECTED BY ANTSHIELD

Usage of this dataset is subject to agreeing to the following terms.

LICENSE

AntShield authorization to access the data grants You a limited, non-exclusive, non-transferable, non-assignable, and terminable license to copy, modify, and use the data in accordance with this Public Agreement. No license is granted for any other purpose and there are no implied licenses in this Agreement. Nothing in this License is intended to limit any rights You may have arising from fair use or due to other limitations on AntShield’s exclusive rights under copyright law or other applicable laws. AntShield has the authority and reserves the right, in its sole discretion, to discontinue further access and use to anyone who violates this AUA. You will make no attempts to reverse engineer, decrypt, or otherwise identify any personal information in the AntShield dataset.

If You create a publication (including web pages, papers published by a third party, and publicly available presentations) using data from this dataset, You should cite the corresponding paper as follows:

@inproceedings{shuba2018privacy,
  title={{Privacy Leak Classification on Mobile Devices}},
  author={Shuba, Anastasia and Bakopoulou, Evita and Markopoulou, Athina},
  booktitle={2018 IEEE 19th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC)},
  year={2018},
  organization={IEEE}
}

We also encourage You to provide the AntShield Team with a link to your publication. We use this information in reports to our funding agencies.

DISCLAIMER OF WARRANTIES. ANTSHIELD USES ITS BEST EFFORTS TO PROVIDE DATA IN ACCORDANCE WITH ETHICAL PRINCIPLES AND SCIENTIFIC INTEGRITY. HOWEVER, THE DATA PROVIDED HEREIN IS ON AN “AS IS” BASIS. NEITHER ANTSHIELD, ITS RESEARCHERS, RESEARCH PARTNERS, LICENSORS, AND DATA PROVIDERS, NOR THE UNIVERSITY OF CALIFORNIA AND ITS TRUSTEES, OFFICERS, EMPLOYEES, AND AGENTS MAKE ANY WARRANTY, EITHER IMPLIED OR EXPRESS, OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, INCLUDING, BUT NOT LIMITED TO, THE ACCURACY, TIMELINESS, COMPLETENESS, RELIABILITY, OR AVAILABILITY OF ANTSHIELD DATA, APPLICATIONS, OR SERVICES ACCESSIBLE THROUGH OR MADE AVAILABLE BY ANTSHIELD.

LIMITATION OF LIABILITY. TO THE EXTENT ALLOWED BY LAW, IN NO EVENT SHALL ANTSHIELD AND THE UNIVERSITY OF CALIFORNIA BE LIABLE TO YOU OR ANY THIRD PARTY FOR ANY INDIRECT, CONSEQUENTIAL, INCIDENTAL, SPECIAL OR PUNITIVE DAMAGES, ARISING FROM YOUR USE OF THE DATA.

If You have any questions about the data or about this Public Agreement, please email antmonitor.uci@gmail.com.

Access the Data

To access the data, please fill out the form below and we will email you the data. Note that by filling out the form, you agree to our Privacy Policy.

Please use your university or business email address. Addresses from Gmail and similar free email providers will not be accepted.