# Thoughts on ICML

It’s the end of day 2 of the main conference – I’m here primarily for the Data-Efficient Machine Learning workshop (though the Personalization and Machine Learning for Social Good workshops look really good, and I hope to be able to sneak into them). It’s really exciting (and legitimately overwhelming!) being here and seeing faces I recognize from their academic websites walking around. The scope of the conference is rather broad (even considering the magnitude of interest in deep learning / RL / deep RL), but the papers are rather specific: it’s been most rewarding to engage with the state of the art in some fields that I’ve kind-of, sort-of worked in.

Some interesting ideas and things I took away:

- “Oracles and politicians” - an extension to the oracle framework (which assumes you have ‘oracle’ access to a function evaluation at a point, even if you don’t have full access to a general gradient, etc.)
- causal inference and machine learning: relevant because in many areas of interest we have many covariates / contexts but perhaps no supervised labels

Some ideas that I don’t want to forget from ICML talks:

- “Politicians and Oracles” - Sébastien Bubeck. The main idea is that you can practically improve on the oracle model of computation for first-order optimization methods, in which you assume you can query the function (e.g. its value or gradient) at a point but don’t have access to the full function, by instead accessing the function via a “politician”. The politician differs from the oracle in that if you ask it for the value of f() at some point x, it instead gives you the value at some other point y. Where this politician differs from real life is that it conducts a line search in some region so as to answer at a point that leads to strictly better performance. For example, the politician might keep track of an ellipsoid within which it knows the optimum lies. How would such a politician be computed? You can run a QR decomposition on the span of the gradients.
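A toy sketch of the oracle/politician distinction, not the algorithm from the talk. The quadratic `f`, the fixed grid of line-search steps, and the helper names `oracle`, `politician`, and `descend` are all assumptions made up for illustration. Note that this politician only guarantees it answers at a point that is no worse than the one you asked about.

```python
def f(x):
    # Toy ill-conditioned quadratic (an assumption for illustration).
    return 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)


def grad(x):
    return [x[0], 10.0 * x[1]]


def oracle(x):
    """First-order oracle: answers exactly at the queried point x."""
    return x, f(x), grad(x)


def politician(x, steps=(0.0, 0.01, 0.03, 0.1, 0.3, 1.0)):
    """Answers at some other point y instead of x.

    y is picked by a crude line search along the negative gradient at x.
    Because step 0.0 is among the candidates, f(y) <= f(x) always holds.
    """
    g = grad(x)
    candidates = [[xi - t * gi for xi, gi in zip(x, g)] for t in steps]
    y = min(candidates, key=f)
    return y, f(y), grad(y)


def descend(query, x0, step=0.05, iters=20):
    """Plain gradient descent that only sees the function through `query`."""
    x = list(x0)
    for _ in range(iters):
        y, _, g = query(x)  # the answer may be about y rather than x
        x = [yi - step * gi for yi, gi in zip(y, g)]
    return f(x)


final_with_oracle = descend(oracle, [1.0, 1.0])
final_with_politician = descend(politician, [1.0, 1.0])
```

On this toy problem the same descent loop ends at a lower function value when it talks to the politician, because every answer already includes the benefit of a line search.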

“Doubly Robust Off-Policy Value Evaluation for Reinforcement Learning” How do you evaluate a policy for RL without deploying it in real life and incurring costs? This talk presented work that used importance sampling.
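A minimal sketch of off-policy evaluation by importance sampling, in the simplest one-step (bandit) setting rather than the full sequential RL setting of the talk. The second function is the standard doubly robust estimator for that one-step case, which adds a reward model as a baseline. The logged data, both policies, and the reward model `q_hat` are hypothetical numbers chosen so the arithmetic is easy to check by hand.

```python
def importance_sampling_value(logs, target, behavior):
    """Average of reward * pi(a) / mu(a) over logged (action, reward) pairs."""
    return sum(target[a] / behavior[a] * r for a, r in logs) / len(logs)


def doubly_robust_value(logs, target, behavior, q_hat):
    """Model-based baseline plus an importance-weighted correction of its error."""
    baseline = sum(target[a] * q_hat[a] for a in target)
    correction = sum(
        target[a] / behavior[a] * (r - q_hat[a]) for a, r in logs
    ) / len(logs)
    return baseline + correction


# Hypothetical log collected under a uniform behavior policy.
behavior = {0: 0.5, 1: 0.5}
target = {0: 0.2, 1: 0.8}
logs = [(0, 1.0), (0, 0.0), (1, 1.0), (1, 1.0)]

# A deliberately imperfect reward model (assumption for illustration).
q_hat = {0: 0.4, 1: 0.9}

is_value = importance_sampling_value(logs, target, behavior)
dr_value = doubly_robust_value(logs, target, behavior, q_hat)
```

In this log, action 0 earns 0.5 on average and action 1 earns 1.0, so the target policy's value is 0.2 * 0.5 + 0.8 * 1.0 = 0.9. Both estimators recover that here, without ever deploying the target policy.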

“Statistical Limits of Convex Relaxation” This work analyzed the statistical suboptimality of convex relaxations of the SOS hierarchy, in particular for two settings. They note that previous work relies on the planted clique hypothesis; in contrast, their work is constructive.

“Faster Convex Optimization: Simulated Annealing with an Efficient Universal Barrier” Abernethy & Hazan - they present a really nice analysis relating simulated annealing and interior point methods.

“Provable Algorithms for Inference in Topic Models” Sanjeev Arora, Rong Ge, Frederic Koehler, Tengyu Ma, Ankur Moitra

Following up on their work getting theoretical bounds for NMF, they consider the problem of inference: how do you determine the topic distribution of a document? In particular, you can’t ask for more “samples” from the document to figure out its composition. They address this by expressing the problem as a linear map, E[y] = Ax, where the difficulty is that the variance tends to be high. They compute the pseudoinverse of A, the matrix B such that BA = I, so that By is an unbiased estimate of x.
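A small sketch of the linear-map view, assuming a made-up 3-word, 2-topic matrix `A` whose columns are word distributions. It uses the Moore-Penrose pseudoinverse from NumPy as the left inverse B (so BA = I when A has full column rank). The talk's method may well choose B differently to control variance, so treat this only as an illustration of why By recovers x in expectation.

```python
import numpy as np

# Hypothetical topic-word matrix: rows are words, columns are topics.
A = np.array([
    [0.7, 0.1],
    [0.2, 0.3],
    [0.1, 0.6],
])

# Left inverse of A: B @ A is the 2x2 identity because A has full column rank.
B = np.linalg.pinv(A)

# Hypothetical topic proportions of one document.
x = np.array([0.25, 0.75])

# Expected word frequencies of that document, E[y] = A x.
y = A @ x

# Applying B to the (here noiseless) word frequencies recovers x.
x_hat = B @ y
```

With a real document, `y` would be the observed word frequencies, which are noisy around `A @ x`. The estimate `B @ y` is still unbiased, and the noise shows up as variance.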

There were several talks on learning choice models from rank data or comparison data (e.g. Plackett-Luce models), which I didn’t really understand because I’m not familiar with the literature.
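For reference, the Plackett-Luce model itself is simple to state: a ranking is built by repeatedly picking the next item with probability proportional to its weight among the items not yet placed. A minimal sketch, with made-up item names and weights:

```python
from itertools import permutations


def plackett_luce_prob(ranking, weights):
    """Probability of `ranking` (best item first) under Plackett-Luce."""
    remaining = sum(weights[item] for item in ranking)
    prob = 1.0
    for item in ranking:
        # Chance that `item` is chosen next among the items still unplaced.
        prob *= weights[item] / remaining
        remaining -= weights[item]
    return prob


# Hypothetical positive weights ("worths") for three items.
weights = {"a": 3.0, "b": 2.0, "c": 1.0}

p_abc = plackett_luce_prob(("a", "b", "c"), weights)

# The probabilities of all possible rankings should add up to 1.
total = sum(plackett_luce_prob(p, weights) for p in permutations(weights))
```

Here the ranking a, b, c has probability (3/6) * (2/3) * (1/1) = 1/3. Learning from rank data then amounts to fitting the weights to observed rankings.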

“Recommendations as Treatments: Debiasing Learning and Evaluation” Riding the causal inference wave, this work considered the inverse propensity score estimator, which reweights samples in the ERM estimator by the inverse of their propensity score, to debias the dataset for a standard recommendation system. In particular, the data is “missing not at random”: there are definite tendencies for people to provide data for works that they already like.
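A minimal sketch of the inverse propensity reweighting, applied to estimating an average rating rather than a full ERM objective (the same weights would be applied to a per-entry loss). The ratings, the observation pattern, and the propensities are hypothetical and assumed known. In practice the propensities would have to be estimated.

```python
def naive_estimate(ratings, observed):
    """Average over observed entries only. Biased when data is missing not at random."""
    seen = [r for r, o in zip(ratings, observed) if o]
    return sum(seen) / len(seen)


def ips_estimate(ratings, observed, propensities):
    """Weight each observed entry by 1 / propensity, then average over ALL entries."""
    n = len(ratings)
    weighted = sum(
        r / p for r, o, p in zip(ratings, observed, propensities) if o
    )
    return weighted / n


# Hypothetical data: four liked items (rating 5) that are always rated,
# and four disliked items (rating 1) that are rated only half the time.
ratings = [5, 5, 5, 5, 1, 1, 1, 1]
propensities = [1.0, 1.0, 1.0, 1.0, 0.5, 0.5, 0.5, 0.5]
observed = [1, 1, 1, 1, 1, 1, 0, 0]

true_mean = sum(ratings) / len(ratings)
naive = naive_estimate(ratings, observed)
ips = ips_estimate(ratings, observed, propensities)
```

The naive average over observed ratings comes out too high because liked items are over-represented. The reweighted estimate recovers the true mean of 3.0 on this example.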

“Stability of SGD” Why does SGD generalize? Ben Recht presents joint work with Moritz Hardt and Yoram Singer on the stability of SGD from a dynamical systems point of view, including in the nonconvex case.