Originally posted at Wired
FOR THE PAST year, Apple has touted a mathematical tool that it describes as a solution to a paradoxical problem: mining user data while simultaneously protecting user privacy. That secret weapon is “differential privacy,” a novel field of data science that focuses on carefully adding random noise to an individual user’s information before it’s uploaded to the cloud. That way, the total dataset of a company such as Apple can reveal meaningful results without spilling any one person’s secrets.
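To make that noise-adding idea concrete, here is a minimal sketch of “randomized response,” the classic technique that local differential privacy builds on. This is an illustrative example, not Apple’s proprietary code; the function names and the epsilon value are assumptions chosen for clarity.

```python
import math
import random

def randomized_response(truth: bool, epsilon: float) -> bool:
    """Report the true bit with probability e^eps / (e^eps + 1);
    otherwise report its opposite. Smaller epsilon means more noise."""
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + 1)
    return truth if random.random() < p_truth else not truth

def estimate_true_rate(reports, epsilon: float) -> float:
    """Invert the noise to get an unbiased estimate of the population rate."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 1)
    observed = sum(reports) / len(reports)
    # observed = p * true_rate + (1 - p) * (1 - true_rate), solved for true_rate:
    return (observed - (1 - p)) / (2 * p - 1)

# Example: 100,000 simulated users, 10% of whom truly use a given emoji.
random.seed(0)
truths = [random.random() < 0.10 for _ in range(100_000)]
reports = [randomized_response(t, epsilon=1.0) for t in truths]
print(f"Estimated rate: {estimate_true_rate(reports, epsilon=1.0):.3f}")  # ~0.10
```

Each device lies with a carefully chosen probability, so no single report can be trusted on its own, yet the noise averages out across a large population and the aggregate statistic survives.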
But differential privacy isn’t a simple toggle switch between total privacy and no-holds-barred invasiveness. And a new study, which delves deeply into how Apple actually implements the technique, suggests the company has ratcheted that dial further toward aggressive data-mining than its public promises imply.
Epsilon, Epsilon
Researchers at the University of Southern California, Indiana University, and China’s Tsinghua University have dug into the code of Apple’s MacOS and iOS operating systems to reverse-engineer just how the company’s devices implement differential privacy in practice. They’ve examined how Apple’s software injects random noise into personal information, ranging from emoji usage to your browsing history to HealthKit data to search queries, before your iPhone or MacBook uploads that data to Apple’s servers.
Ideally, that obfuscation helps protect your private data from any hacker or government agency that accesses Apple’s databases, advertisers Apple might someday sell it to, or even Apple’s own staff. But differential privacy’s effectiveness depends on a variable known as the “privacy loss parameter,” or “epsilon,” which determines just how much specificity a data collector is willing to sacrifice for the sake of protecting its users’ secrets. By taking apart Apple’s software to determine the epsilon the company chose, the researchers found that MacOS uploads significantly more specific data than the typical differential privacy researcher might consider private. iOS 10 uploads even more. And perhaps most troubling, according to the study’s authors, is that Apple keeps both its code and epsilon values secret, allowing the company to potentially change those critical variables and erode their privacy protections with little oversight.
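For readers who want the formal version: the standard textbook guarantee (a general definition, not anything Apple-specific) says that whether or not any one person’s data is included, the probability of the system producing any given output can change by at most a factor of e raised to epsilon.

```latex
% For any two datasets D and D' differing in one person's data,
% and any set of possible outputs S of the mechanism M:
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(D') \in S]
```

The smaller the epsilon, the closer that factor is to 1, and the less anyone can learn about a single individual from the output.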
In response to the study, Apple points out that its data collection is purely opt-in. (Apple prompts users to share “diagnostics and usage” information with the company when its operating systems first load.) And it fundamentally disputes many of the study’s findings, including the degree to which Apple can link any specific data to a specific user.
But the study’s authors stand by their claims and maintain that Apple oversells its differential privacy protections. “Apple’s privacy loss parameters exceed the levels typically considered acceptable by the differential privacy research community,” says USC professor Aleksandra Korolova, a former Google research scientist who worked on Google’s own implementation of differential privacy until 2014. She says the dialing down of Apple’s privacy protections in iOS in particular represents an “immense increase in risk” compared to the uses most researchers in the field would recommend.
Frank McSherry, one of the inventors of differential privacy and a former Microsoft researcher, puts his interpretation of the study’s findings more candidly: “Apple has put some kind of handcuffs on in how they interact with your data,” he says. “It just turns out those handcuffs are made out of tissue paper.”
‘Not a Reassuring Guarantee’
To determine the exact parameters Apple uses to handicap its data mining, the Indiana, USC, and Tsinghua researchers spent more than six months digging through the code of MacOS and iOS 10, identifying the specific files Apple assembles, encrypts, and uploads from iPhones and Macs back to its servers once a day. They used the reverse-engineering tool Hopper to pull the code apart, and debugging tools to watch it run in real time and see, step by step, how it functions.
Based on those observations, Korolova says, the research team determined that MacOS’s implementation of differential privacy uses an epsilon of 6, while iOS 10 has an epsilon of 14. As that epsilon value increases, the risk that an individual user’s specific data can be ascertained increases exponentially.
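To put rough numbers on that exponential growth (using the standard e-to-the-epsilon bound from the definition above, not any Apple-specific formula):

```latex
e^{1} \approx 2.7, \qquad e^{6} \approx 400, \qquad e^{14} \approx 1{,}200{,}000
% Each value bounds how much one user's upload can shift an observer's
% odds about that user, so moving from 6 to 14 multiplies the worst-case
% leakage by roughly e^{8} \approx 3{,}000.
```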
According to differential privacy coinventor McSherry, academics generally see any value of epsilon over one as a serious privacy compromise. iOS and MacOS send that data back to Apple once a day; McSherry says the risk increases with every day’s upload. “Anything much bigger than one is not a very reassuring guarantee,” McSherry says. “Using an epsilon value of 14 per day strikes me as relatively pointless” as a safeguard. (The researchers found that the beta version of iOS 11 has an epsilon of 43, but the virtually nonexistent privacy protection that implies could be a temporary measure in the beta that will be changed in the full-fledged version of the OS.)
To better understand the implications of those numbers, McSherry suggests a simplified example: Say someone has told their phone’s health app they have a one-in-a-million medical condition, and their phone uploads that data to the phone’s creator on a daily basis, using differential privacy with an epsilon of 14. After one upload obfuscated with an injection of random data, McSherry says, the company’s data analysts would be able to figure out with 50 percent certainty whether the person had the condition. After two days of uploads, the analysts would know about that medical condition with virtually 100 percent certainty. “Even after just one day, you’ve already blown this really strong security,” McSherry says.
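The arithmetic behind that example follows from combining the e-to-the-epsilon bound with Bayes’ rule; the figures below are a back-of-the-envelope reconstruction consistent with McSherry’s description, not his exact calculation.

```latex
% Prior odds of the one-in-a-million condition:
\text{odds}_0 = \frac{10^{-6}}{1 - 10^{-6}} \approx 10^{-6}
% One upload at epsilon = 14 can shift those odds by at most e^{14} \approx 1.2 \times 10^{6}:
\text{odds}_1 \approx e^{14} \cdot 10^{-6} \approx 1.2
    \;\Rightarrow\; \Pr[\text{condition}] \approx 55\%
% Per-day epsilons add up, so two uploads allow a shift of e^{28}:
\text{odds}_2 \approx e^{28} \cdot 10^{-6} \approx 1.4 \times 10^{6}
    \;\Rightarrow\; \Pr[\text{condition}] \approx 99.9999\%
```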
Apple, however, strongly disputes the researchers’ epsilon values for its privacy system. The company argues that its differential privacy system adds different levels of noise to different types of data, with far more protection for each kind than the researchers acknowledged, especially for sensitive data like that uploaded by HealthKit. Apple also points out that the researchers essentially added together the epsilons for each type of data, and assumed that those different data points could be correlated to learn about a user over time. But Apple says it doesn’t correlate those different kinds of data; in fact, it says it’s not sure how disparate data types like emoji use and health data could be meaningfully correlated. The company adds that it doesn’t assemble profiles of individuals over time, that it imposes limits on data storage that would make such correlation impossible, and that it throws out any data, like IP addresses, that could be used as a unique identifier tying findings to a particular person.
Once and Future Threats
Korolova counters that the point of differential privacy is that it assumes the worst behavior from the company collecting the data. When properly implemented, users don’t have to trust a company to handle their personal data in a certain way on its faraway servers; the guarantees of differential privacy govern how much specific data the company can access in the first place. Apple may add other layers of defense to protect its users’ privacy, as it says, but Korolova argues those other measures can change without users’ knowledge, and they don’t make the company’s vaunted implementation of differential privacy itself any less weak.
As for the supposed difficulty of correlating disparate data types, Korolova argues that differential privacy exists to prevent exactly this sort of “failure of imagination.” “Just because they can’t think of how to correlate this data doesn’t mean that someone else can’t,” she says. Differential privacy helps provide mathematical guarantees against clever correlation techniques that don’t yet exist, not just those currently in use.
But Apple shouldn’t be judged too harshly for its differential privacy imperfections, argues McSherry. The company, after all, has devoted enormous resources toward other privacy-preserving technologies, like full-disk encryption in the iPhone and end-to-end encryption in iMessage and FaceTime. And it’s only one of a small number of Silicon Valley companies that has at least taken a first step towards a more privacy-preserving form of data mining, he says. Some differential privacy likely beats none at all. “It’s a bit like agreeing to the Paris Climate Accords and then realizing you’re a megapolluter and way over your limits,” McSherry says. “It’s still an interesting and probably good first step.”
For comparison, the other major player in the differential privacy world is Google, whose differential privacy system for Chrome is known as RAPPOR, or Randomized Aggregatable Privacy-Preserving Ordinal Response. According to Google’s own analysis, that system claims to have an epsilon of 2 for any particular data sent to the company, and an upper limit of 8 or 9 over the lifetime of the user. That’s in theory better than the Indiana, USC, and Tsinghua study’s assessment of Apple’s differential privacy. And Google also open-sources its RAPPOR code, so that any changes to its epsilon values would be far more obvious. Then again, companies like Facebook and Microsoft have made no public efforts to implement differential privacy, despite Microsoft researchers inventing it over a decade ago.
What Korolova finds most troubling about Apple’s approach, however, is its opacity. It took six months of analysis by a team of researchers to determine the epsilon of its differential privacy systems, when Apple could simply publish it openly. “They’re saying ‘yes, we implement differential privacy, trust us, but we’re not going to tell you at what level we do it,’” Korolova says. “By virtue of not revealing what their parameters are, they’re kind of breaking any real commitment.”
She hopes that the outcome of her study will not be to shame Apple, but to pressure it to open up about how exactly it uses differential privacy—and potentially create more competition among other companies to match or exceed that standard. “If Apple were more transparent about what they’re doing,” she says, “that would be a win for privacy overall.”
And in the meantime, if Apple’s data-sharing protections are too lax or opaque for your liking, you can always stop sharing your data.