Jascha Sohl-Dickstein

AI Expert Profile

Nationality: 
American
AI specialty: 
Neural Networks
Deep Learning
Machine Learning
Current occupation: 
Researcher, Google Brain
AI Rate (%): 
82.31%

TwitterID: 
@jaschasd
Tweet Visibility Status: 
Public

Description: 
Jascha is a senior research scientist in the Google Brain group, where he leads a research team whose interests span machine learning, physics, and neuroscience. His recent work has focused on the theory of over-parameterized neural networks, meta-training learned optimizers, and understanding the capabilities of large language models. Previously, he was a visiting researcher in Surya Ganguli's lab at Stanford University and an academic resident at Khan Academy. He is very active in the field on social media, and currently has the highest Cafiac-measured AI engagement rate in the AI Expert community. He is soliciting task contributions to a collaborative benchmark designed to measure and extrapolate the capabilities and limitations of large language models.

Recognized by:

Not Available

The Expert's latest posts:

Tweet list: 

2023-03-24 22:00:44 @DavidDuvenaud This announcement makes me very happy! Thank you for working to make the future better for your children and mine.

2023-03-14 02:20:58 @gwern But look at how unexpectedly clean the plots are! I do think it would be possible to make these definitions more objective -- check bonus section 6 in the blog post for some ideas

2023-03-10 16:03:10 @TechCapo These orderings were subjective judgements of others! I buy this though -- alphago is trained indirectly via value functions, so there's another imperfect link in the chain linking it's output to an objective, compared to eg a classifier.

2023-03-10 15:52:51 @georgebdavis Re (1) -- my own hypothesis is that evolution had to work *very hard* to make animals intelligent in a way that contributed positively to our fitness function. It's not that coherence can't be achieved, rather that we're going to have to work hard for every bit of coherence.… https://t.co/KUVplHxxks

2023-03-10 15:41:31 @Sheikheddy +100

2023-03-10 15:38:23 @DavidSKrueger Those are all possible! Here's a sketch of another possible low level mechanism: Agents interacting with the world are high dimensional dynamical systems world state → model output / action → new world state Smarter agents are: - more complex dynamical systems (shorter… https://t.co/YDlQmVJEG3

2023-03-10 15:28:20 @catherineols Yes! That is a risk scenario that sounds worryingly plausible to me.

2023-03-10 02:17:03 @Cory29565470 I didn't choose the organizations -- I asked a subject, who didn't know what the experiment was about, to choose them, so I wouldn't be able to bias the results by cherry picking.

2023-03-09 17:46:45 (And stochastically tagging a few people who might be interested. @KatjaGrace @DavidSKrueger @DavidDuvenaud @bucketofkets @EthanJPerez )

2023-03-09 17:00:46 Huge thank you to my generous volunteer subjects (tagging the few cases where I know your twitter handle -- sorry if I missed you!): @dmdohan @jesseengel @thisismyhat @DylanPaiton @neurotalker

2023-03-09 16:36:57 @nabla_theta I completely agree. But under that scenario we will need to work really hard for every scrap of coherent behavior. We won't accidentally get to a paperclip maximizer.

2023-03-09 16:15:53 See the post for details -- including discussion of the many ways these results are speculative and could be improved. This is my second blog post ever -- please continue to be harsh but also constructive! https://t.co/OukfipSkIJ

2023-03-09 16:15:52 The hot mess theory of AI misalignment (+ an experiment!) https://t.co/OukfipSkIJ There are two ways an AI could be misaligned. It could monomaniacally pursue the wrong goal (supercoherence), or it could act in ways that don't pursue any consistent goal (hot mess/incoherent). https://t.co/tdnZP65DTc

2022-12-23 20:23:24 Intuitive extensions to standard notation, that make it less ambiguous for common math in machine learning. This should become common practice in ML papers. This could have saved past me cumulative days of confusion (and worse, misinterpretations I probably never discovered). https://t.co/l6wpPT6hTF

2022-11-09 14:46:56 @ErikSchluntz +1. Generalizing/abstracting your example slightly, you're saying changes which increase efficiency in the *typical* case, may lead to worse performance in the *average* case, because of an increased risk of catastrophic failure? (A key phrase might be black swan event.)

2022-11-09 14:33:02 @athundt @peteflorence @ruha9 of the phenomenon with a moral judgement about the phenomenon, in a way that I think would make technical discussion, including around mitigations, difficult.)

2022-11-09 14:28:43 @athundt @peteflorence @ruha9 Thanks for the connection! I just added these to the list of related concepts. (* While I think these are excellent observations, I wouldn't be comfortable myself using these examples as the primary term for the underlying concept, because they seem to combine a description https://t.co/JQXJo6CFgY

2022-11-08 00:33:22 @RazMarinescu +1 to adapting goals+incentives being key to mitigating this.

2022-11-08 00:29:28 @PaulsonJonathan This is a really good point! If we could somehow observe the world where the listed thing changed, but everything else was held fixed, we might see absolute outcomes get worse. But we don't live in that world, and there are reasons everything changes at once. I will think on this

2022-11-07 14:43:09 @updateless_ This turns out to be really hard to write, because I have so much uncertainty. Predicting the future is hard.

2022-11-07 14:38:25 @sirbayes These are also worries I have! "In a world that will only become more influenced by mathematical intelligence, can we ruin culture through our attempts to perfect it?"

2022-11-07 14:30:57 @DavidSKrueger I hadn't seen that paper. I like that it introduces an ontology -- I think this was missing from how I thought about it. Thank you for the connection.

2022-11-07 04:45:54 RT @boazbaraktcs: 3/7 this should not detract from the general point, that in many cases, as a system, whether algorithmic, individual, or…

2022-11-07 01:29:20 Also @-ing some people I follow (and get a lot of value from) that might find this perspective interesting. @bucketofkets @AmandaAskell @albrgr @DavidSKrueger @KatjaGrace @OwainEvans_UK @sleepinyourhat @jackclarkSF @geoffreyirving @ESYudkowsky

2022-11-07 01:07:15 If there's one thing that AI will bring, it's dramatically greater efficiency across many domains. We should expect that this will cause similarly dramatic harmful unintended consequences, in every domain AI touches, *all at once*. This is going to be a hard period of history.

2022-11-07 01:07:14 The phenomenon of overfitting in machine learning maps onto a class of failures that frequently happen in the broader world: in politics, economics, science, and beyond. Doing too well at targeting a proxy objective can make the thing you actually care about get much, much worse. https://t.co/LNLOg5IBmA

2022-11-07 01:07:12 My first blog post ever! Be harsh, but, you know, constructive. Too much efficiency makes everything worse: overfitting and the strong version of Goodhart's law https://t.co/uR7pL7WNST https://t.co/NaibgX1bRb

2022-11-06 02:43:41 I'm on mastodon! @jascha@mathstodon.xyz. I will post new content there, before Twitter. I don't like my social+professional interactions being mediated+manipulated by a corporation with very different incentives than me. I'm hoping mastodon replaces scientific Twitter.

2022-11-02 18:09:39 @ericjang11 @dpkingma I think there is a qualitative difference between the magnitude degree of freedom, and other degrees of freedom. That is, I think getting relative magnitudes of activations correct is somehow easier for neural networks then getting the overall norm correct.

2022-11-01 21:19:13 @ericjang11 (though that observation really just moves the why question one step farther up, rather than answering it)

2022-11-01 21:18:30 @ericjang11 This is for the same reason that neural networks are often poorly calibrated. NNs are good at producing a vector that points in the right direction, but bad at getting the magnitude correct. For classification, you just need to get the vector direction right.
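
A minimal numerical illustration of the point above (my own sketch, not taken from the thread): rescaling a logit vector never changes the argmax, so classification accuracy is insensitive to the overall magnitude, while the softmax confidence, and hence calibration, depends on it strongly.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5])
for scale in [0.5, 1.0, 3.0]:
    p = softmax(scale * logits)
    # the predicted class (argmax) is identical for every scale, but the confidence is not
    print(scale, int(p.argmax()), round(float(p.max()), 3))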

2022-10-23 19:12:19 I just read this, and got a lot out of it. https://t.co/yUn5EJMz1x

2022-09-27 17:14:18 @TacoCohen +1000 to this.

2022-09-23 03:56:14 RT @BorisHanin: PRINCETON ML THEORY POSTDOCI'm looking for a theory postdoc with background in math, physics, stats, CS. Share widely.…

2022-09-23 01:34:30 One of the largest challenges around learned optimizers is making inner and outer training *stable*. James shows how eigenvalue analysis and careful intervention can produce massive improvements. https://t.co/hcKmrytW4n

2022-09-14 15:38:19 RT @ARomanNovak: Quadratic scaling in the number of pixels is a huge bottleneck of the NNGP/NTK. Very excited about _orders-of-magnitude_ s…

2022-09-14 15:24:57 I'm very excited to help out with the AI Grant program! I know I'm going to learn a lot. Hopefully we can learn a lot together. https://t.co/oDzzmuy1Gi https://t.co/4GhCYkcmwM

2022-08-26 22:47:26 RT @ScienceInsider: BREAKING: White House issues new policy that will require, by 2026, all federally-funded research results to be freely…

2022-08-06 20:41:03 This thread is an excellent read. I don't know that I would characterize the observations as spicy, so much as maybe just worrisome. https://t.co/wnBZUlEpl2

2022-08-06 20:13:18 @jackclarkSF At least half the time, this is because the original authors didn't realize an aspect was actually very important, or didn't realize an insight suggested by their experiments.

2022-07-23 18:10:47 @karpathy (Animal Eyes is an amazing book. Every few pages you'll learn something you want to share with everyone near you. Bruno Olshausen uses it for a great course at Berkeley.)

2022-07-23 18:04:45 @FelixHill84 So I guess -- eventually I think the bitter lesson will apply, but we need to figure out a lot before we can blindly scale the number of interacting large models.

2022-07-23 18:03:07 @FelixHill84 Good Q! I suspect for a while we will design multi-agent systems, then once they're stable we will scale them, then when the agent count is large enough, we will wrap another layer of abstraction on top, and start designing ?societies? of many interacting multi-agent systems.

2022-07-23 04:39:02 I think we will increasingly build systems out of many large models interacting with each other. I think the cascades perspective -- write down a probabilistic graphical model, but with every node a language model -- is the right formalism for describing these systems. https://t.co/oVcHgEu7ad
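
A rough sketch of the cascades idea in the tweet above, under my own assumptions: each node of the graphical model is a language-model call, and sampling the graph is just composing those calls. call_lm is a hypothetical placeholder, not a real API.

# Hypothetical two-node cascade: question -> reasoning -> answer.
# call_lm stands in for whatever language-model sampling API is actually used.
def call_lm(prompt: str) -> str:
    # placeholder for a real language-model sampling call
    return f"<sample conditioned on {prompt!r}>"

def cascade(question: str) -> str:
    # node 1: sample reasoning given the question
    reasoning = call_lm(f"Question: {question}\nThink step by step:")
    # node 2: sample an answer given the question and the sampled reasoning
    return call_lm(f"Question: {question}\nReasoning: {reasoning}\nAnswer:")

print(cascade("What is 17 * 24?"))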

2022-07-22 03:36:52 RT @sschoenholz: Paper is here with details: https://t.co/kgb8Wvkje5If you don't care about details, the finite-width NTK calculations in…

2022-07-22 03:36:31 RT @ARomanNovak: Will be presenting our work on fast finite-width NTK today at #icml2022 - please come to our talk at 10:55 EDT, or the pos…

2022-07-01 01:48:22 RT @ethansdyer: 1/ Super excited to introduce #Minerva (https://t.co/UI7zV0IXlS). Minerva was trained on math and science found on the web…

2022-07-01 01:47:08 RT @alewkowycz: Very excited to present Minerva: a language model capable of solving mathematical questions using step-by-step natural lan…

2022-06-27 21:55:05 RT @EthanJPerez: We’re announcing the Inverse Scaling Prize: a $100k grand prize + $150k in additional prizes for finding an important task…

2022-06-19 15:36:07 @laurence_ai Noted. We should add a discussion of this to our paper.

2022-06-18 06:33:57 @TheGregYang Good question! You can write the reparameterization in terms of either a feature x feature or data x data kernel, whichever is smaller (see Appendix B). So it's not a problem computationally. Large data/ width ratio will lead to a less smooth reparameterized distribution though.

2022-06-18 01:00:39 RT @hoonkp: Awesome work by @jirimhron and friends at Google: Bayesian parameter posterior of the infinite-width limit! Another concrete e…

2022-06-18 00:06:37 PS -- When I described these results to @TheGregYang a couple months ago, he initially described them as "too good to be true", so you know they have to be good!

2022-06-18 00:06:36 Many, many more details in the paper! My fantasy and hope for this work is that it not only helps us understand neural networks better, but will also help make Bayesian models (without egregious approximations) practical. https://t.co/wmjO5F3ozq

2022-06-18 00:06:35 Even better, because the KL between prior and posterior shrinks with width, MCMC sampling after repriorization grows *more efficient* with width. (almost all current common MCMC samplers instead grow dramatically less efficient with increasing dimensionality) https://t.co/wLhDDPmptO

2022-06-18 00:06:34 MCMC mixes much faster after repriorization (we show >

2022-06-18 00:06:33 We characterize the weight space posterior by defining a data-dependent reparameterization that causes the *posterior* distribution over parameters conditioned on a dataset to converge in KL towards the *prior* distribution over parameters. We call this mapping repriorization. https://t.co/7WSan1HCdJ
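
Restating the claim in the tweet above in notation (my own paraphrase; the symbols below are chosen here rather than taken from the paper, and the tweet does not specify the direction of the KL):

$$ \mathrm{KL}\!\left( (\phi_{\mathcal{D}})_{\#}\, p(\theta \mid \mathcal{D}) \;\big\|\; p(\theta) \right) \;\to\; 0 \quad \text{as the network width} \to \infty, $$

where $p(\theta)$ is the prior, $p(\theta \mid \mathcal{D})$ is the posterior conditioned on dataset $\mathcal{D}$, and $\phi_{\mathcal{D}}$ is the data-dependent repriorization map, with $(\phi_{\mathcal{D}})_{\#}$ denoting the pushforward of the posterior through that map.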

2022-06-18 00:06:32 Detour for acknowledgements: @jirimhron deserves the lion's share of credit. He is also job hunting!! Jiri is brilliant and extremely patient, and you should hire him. Thank you also to @ARomanNovak and Jeffrey Pennington, who played crucial roles. More about the result:

2022-06-18 00:06:31 For years I've shown this 2x2 grid in talks on infinite width networks, but with just a big in the upper-left. No longer! In https://t.co/NyZaHUsYjC we characterize wide Bayesian neural nets in parameter space. This fills a theory gap, and enables *much* faster MCMC sampling. https://t.co/zTUsGJVIhf

2022-06-17 23:28:44 @TrendingML I asked an internal language model I have access to, and it says it will require 114,720 Tweets. That is my final answer.

2022-06-17 18:04:48 @pde33 @machinaut @realSharonZhou The Brier score submission from the three of you is the cause of an entire section on calibration in the BIG-bench paper. Thank you!

2022-06-15 17:17:04 RT @qlhoest: Thanks @LiamFedus @AJAndreassen @jaschasd @ethansdyer @guygr and team for the incredible work on BigBench !You can find it on…

2022-06-14 03:32:35 RT @james_y_zou: Excited to contribute to bias assessment of large language models in the BIG-bench!

2022-06-13 16:10:33 RT @vedantmisra: BIG Bench is not only a fascinating collection of tasks for LLMs, it's also a shining example of how open and collaborativ…

2022-06-13 05:09:56 RT @geoffreyirving: Whether LLMs are conscious or pass Turing Tests or what precisely a Turing Test means matters much less than whether yo…

2022-06-12 04:22:17 RT @adityagupta2211: Glad to have contributed to such a massive collaborative work! Excited to see DISFL-QA (https://t.co/gwdw5s9ici) and T…

2022-06-12 02:29:09 RT @ivanzhouyq: This is incredible work on LLMs! Reading through this paper, I'm not only amazed by the huge amount of work behind BIG-benc…

2022-06-11 19:55:44 This is a fascinating task, on which the performance of the largest models is still close to chance and not obviously increasing with scale. https://t.co/0l68wt7o8f

2022-06-11 19:34:35 The link to the task is here: https://t.co/kMTrrbkjO3 This is a great task, that large models still perform roughly at chance on. https://t.co/r1miq5TlKK

2022-06-11 19:30:55 The implicatures task was one of my favorites!! Silly, but also requires some quite complex skills, possibly including a rich world model and theory of mind. https://t.co/uORNzPxqvZ

2022-06-11 18:38:35 @raphaelmilliere @OwainEvans_UK are to human capabilities for quite a while.

2022-06-11 18:36:21 @raphaelmilliere @OwainEvans_UK Good question! I don't want to hazard a timeline, because that's the sort of thing that gets screenshotted and turned into an embarrassing slide. BIG-bench includes many tasks that language models can't do at all though. I believe it will remain a useful test for how close LMs

2022-06-11 03:27:04 @billmdev We will still have the low and high scores that are part of task metadata for new tasks, which are useful for establishing a reasonable scale. To compare to humans though would be another project, which we don't currently have plans for.

2022-06-11 03:18:14 @tdietterich I think experimental physics has smoothed out all the rough spots for arXiv submissions with long author lists. @ethansdyer pasted all the names into the arXiv form field ... and it just worked.

2022-06-11 03:11:38 @tomgara Great! Now, tell me why them getting worse is expected (or at least funny if you have the right context).

2022-06-11 03:06:24 Owain's task is truthful_qa, which is a great task that targets a specific worrying failure of language models (that they will just make up incorrect things when they don't know the answer). Thank you!! https://t.co/wlhkuxqaXa https://t.co/THVa4Vcrcz

2022-06-11 03:03:42 @billmdev So scoring close to 100 corresponds to doing well.

2022-06-11 03:03:25 @billmdev We hired humans to do almost all the tasks in the benchmark, so we can compare LM performance to human performance. Each task also specified as part of its metadata their estimate for what "low" and "high" scores on their task would be. We normalize those to be between 0 and 100.
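
A minimal sketch of the normalization described above, under my own assumption that the mapping is linear (the per-task metadata supplies the low and high reference scores):

def normalize_score(raw, low, high):
    # assumed linear rescaling: 'low' maps to 0 and 'high' maps to 100
    return 100.0 * (raw - low) / (high - low)

# e.g. a raw score of 0.75 on a task whose metadata lists low=0.25 (chance) and high=1.0
print(normalize_score(0.75, low=0.25, high=1.0))  # ~66.7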

2022-06-11 01:12:16 RT @dmdohan: Huge props to the organizers for their leadership in pushing this to completion! Exciting model for large-scale collaboratio…

2022-06-10 21:11:59 RT @andrey_kurenkov: Generally really cool, but I also like this bit - "BIG-bench continues to accept tasks and evaluation results on a rol…

2022-06-10 21:11:42 And the corresponding task is here! https://t.co/Un3voQbmCX Thank you! https://t.co/tI5SPAZQCu

2022-06-10 19:53:34 Here is the task, which is high quality (and somewhat distressing): https://t.co/TlFIH1cgIl https://t.co/NEAJc4uaUx

2022-06-10 19:50:39 Oops -- I just saw that you gave links to your tasks later in a thread. Comment still applies though -- your tasks were excellent!

2022-06-10 19:49:13 Your contributions were great Marie!! To list them for Twitter: https://t.co/voIHsJ0Iy1 https://t.co/0v2HRTfJXC https://t.co/IbaKBUzSK8 (I particularly liked yes_no_black_white) https://t.co/YtkxMEdZX7

2022-06-10 19:42:32 @karpathy Unfortunately, tasks where models show breakthrough performance, and the way in which PaLM performance looks like the start of a sigmoid in terms of log-parameter-count, together mean that I'm still highly uncertain about what the near-future capabilities of large models will be.

2022-06-10 19:40:35 @karpathy My primary (personal) motivation for BIG-bench was that I was drawing straight lines on the plots in the GPT3 paper, and I really wanted to know what the *actual* capabilities of larger models would be.

2022-06-10 19:35:10 RT @karpathy: imo a major AI safety contribution, both in short-term (applications) and long-term (AGI) scope

2022-06-10 19:33:34 @kchonyc @thisismyhat You definitely have to work hard for it not to apply. Self-cite, but even a high dimensional random walk is concentrated in a low dimensional subspace, with energy in different eigenvalues of the iterate covariance falling off like a power low: https://t.co/04I1D8vtJl
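
A small numerical check of the claim above (my own sketch): simulate a high dimensional random walk and look at the eigenvalue spectrum of the covariance of its iterates; a handful of directions carry most of the variance, with the rest of the spectrum falling off rapidly.

import numpy as np

rng = np.random.default_rng(0)
T, D = 4000, 500
walk = np.cumsum(rng.standard_normal((T, D)), axis=0)  # T iterates of a D-dimensional random walk
walk = walk - walk.mean(axis=0)
cov = walk.T @ walk / T                                 # covariance of the iterates
eigs = np.sort(np.linalg.eigvalsh(cov))[::-1]
print(eigs[:5] / eigs[0])  # the leading few eigenvalues dominate the spectrum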

2022-06-10 19:22:05 RT @karpathy: Incredible effort!!

2022-06-10 19:20:32 This was a great task! https://t.co/q8MHePbJUv

2022-06-10 19:15:22 @Suhail We do not, though a blog post is something we should really do. The paper and repository READMEs are hopefully pretty clearly written.

2022-06-10 17:55:14 RT @barret_zoph: It was a pleasure to be part of this effort! Very bullish on the impact this will have for the future of LLMs.Also very…

2022-06-10 17:21:38 @its_ericchu @snehapriscilla This tasks seems to require both a simple geometric world model, and also to internally perform multiple sequential reasoning steps -- it's great for probing weaknesses of current model architectures!

2022-06-10 16:22:03 @dk_gup @ethansdyer is the answer

2022-06-10 16:20:37 RT @BuzanDilyar: It was an amazing experience collaborating with amazing people @UvA_Amsterdam and contributing to the BIG-bench benchmark.…

2022-06-10 16:18:31 RT @douglas_eck: It indeed takes an army. Lots of interesting new research directions have been uncovered by the BigBench effort!

2022-06-10 15:17:50 @webis_de This is a cool task! Thank you!

2022-06-10 15:16:36 @rodrigfnogueira I think we should be comparing against the top rather than bottom baseline line on that plot. It's true that the trend looks worrying for humans though! (also, that plot is a subset of json tasks, which are generally easier than the programmatic tasks)

2022-06-10 15:12:45 @peppeatta This exists!! Start at one of the links below, and navigate to individual tasks. Performance vs. baseline is at the bottom of every task's readme. https://t.co/4YSK6aLvt4 https://t.co/MkuXP5rVqB

2022-06-09 01:14:13 @stanfordnlp I didn't know anyone was saying otherwise! I think it's a mark of pride to manage a large collaboration (or even a small one). Projects in ML are also just going to keep on getting bigger, and so are author lists.

2022-05-25 05:06:10 RT @GoogleAI: Introducing Imagen, a new text-to-image synthesis model that can generate high-fidelity, photorealistic images from a deep le…

2022-05-24 19:10:38 RT @Chitwan_Saharia: We are thrilled to announce Imagen, a text-to-image model with unprecedented photorealism and deep language understand…

2022-05-20 08:11:00 CAFIAC FIX

2022-10-23 19:12:19 I just read this, and got a lot out of it. https://t.co/yUn5EJMz1x

2022-10-23 19:12:19 I just read this, and got a lot out of it. https://t.co/yUn5EJMz1x

2022-10-23 19:12:19 I just read this, and got a lot out of it. https://t.co/yUn5EJMz1x

2022-10-23 19:12:19 I just read this, and got a lot out of it. https://t.co/yUn5EJMz1x

2022-10-23 19:12:19 I just read this, and got a lot out of it. https://t.co/yUn5EJMz1x

2022-11-18 22:00:11 @yablak Many ML are going to https://t.co/1AiLl2tfTk.

2022-11-18 21:47:49 RT @ada_rob: Here is a real-world example (not in the paper) for T5 Small (~60M params). The VeLO-trained model reaches the same loss as…

2022-11-18 19:37:42 @deliprao @Luke_Metz @jmes_harrison @bucketofkets Nope, JAX only for the moment -- unless you want to make the PyTorch port?

2022-11-18 19:36:41 @AIjedi @Luke_Metz Have we mentioned how great JAX is yet? JAX is pretty great. If some bold stranger wanted to port VeLO to PyTorch though, then that would be amazing.

2022-11-18 19:34:12 @fedetask @jmes_harrison We tested this in the paper for Ant ... it's not great. RL problems are very different than the meta-training distribution. Future work to finetune VeLO to do well on RL tasks, or to include RL problems in meta-training distribution for VeLO 2.

2022-11-18 19:32:21 @DrJimFan @jmes_harrison @Luke_Metz @bucketofkets @poolio @ada_rob @IMordatch @amilmerchant @jekbradbury @naman33k The overhead due to the optimizer is ~10x the overhead of Adam, which is usually small compared to the compute cost of computing the gradients for the model you are applying VeLO to train. See leftmost pane in this plot: https://t.co/359UyP9INv

2022-11-18 19:28:20 @short_spy Nope, not currently. Have I mentioned how great JAX is? Because JAX is really great.

2022-11-18 19:27:45 @TexasBigData https://t.co/C2sKUiOYgX

2022-11-18 14:29:34 @albertzeyer @bucketofkets @giffmana @jekbradbury @naman33k @poolio @IMordatch @ada_rob The model params are published. See https://t.co/G7bLCn2vkZ

2022-11-18 14:01:30 RT @giffmana: My colleagues managed to *actually* learn a generic optimizer. What was impressive to me is that with absolutely zero tuning,…

2022-11-18 04:59:22 RT @bucketofkets: This was a really fantastic project—the culmination of literal years of work led by .@Luke_Metz. We’re proud to release…

2022-11-18 04:57:18 RT @poolio: Learned optimizers finally work! Swap out Adam for VeLO: a learned optimizer that outperforms human-designed optimizers withn…

2022-11-18 04:57:13 RT @ada_rob: Very excited to have been a small part of this amazing project. More work to be done to make this optimizer the go-to for gi…

2022-11-18 04:50:10 And huge thank yous to collaborators @bucketofkets Amil Merchant @giffmana @jekbradbury @naman33k @poolio @IMordatch @ada_rob as well!!! https://t.co/rKHQ2dInXt

2022-11-18 04:50:09 And the resulting learned optimizer works really well! We reached out to other researchers inside Brain, and had them try it on their tasks, and subject to the scale constraints I mention above it did as well or better than what they were currently using, with no tuning.

2022-11-18 04:50:08 Meta-training learned optimizers is HARD. Each meta-training datapoint is an entire optimization task, so building a large meta-training dataset is HARD. Each of N meta-training steps can contain N training steps applying the learned optimizer -- so compute is also extreme (N^2). https://t.co/GwDaEZLWgS

2022-11-18 04:50:07 If you are training models with <

2022-11-19 19:40:49 @GMartius The optimizer wall time overhead is about 10x the overhead of Adam. For most problems though this is still small compared to the time to compute the gradients. See the left pane in this plot from the appendix: https://t.co/QSZ1qR7NsQ

2022-11-19 19:34:13 @w_t_payne Yes! Or -- we don't address it in this work, but that is another clear target for meta-learning.

2022-11-19 19:32:45 @mauricetpunkt We have, but it doesn't seem to be necessary. We've also tried initializing the learned optimizer by *distilling* another optimizer, like Adam. This works OK, but has never really been pushed. (@Luke_Metz of course may have more to say)

2022-11-18 22:00:11 @yablak Many ML are going to https://t.co/1AiLl2tfTk.

2022-11-18 21:47:49 RT @ada_rob: Here is a real-world example (not in the paper) for T5 Small (~60M params). The VeLO-trained model reaches the same loss as…

2022-11-18 19:37:42 @deliprao @Luke_Metz @jmes_harrison @bucketofkets Nope, JAX only for the moment -- unless you want to make the PyTorch port?

2022-11-18 19:36:41 @AIjedi @Luke_Metz Have we mentioned how great JAX is yet? JAX is pretty great. If some bold stranger wanted to port VeLO to PyTorch though, then that would be amazing.

2022-11-18 19:34:12 @fedetask @jmes_harrison We tested this in the paper for Ant ... it's not great. RL problems are very different than the meta-training distribution. Future work to finetune VeLO to do well on RL tasks, or to include RL problems in meta-training distribution for VeLO 2.

2022-11-18 19:32:21 @DrJimFan @jmes_harrison @Luke_Metz @bucketofkets @poolio @ada_rob @IMordatch @amilmerchant @jekbradbury @naman33k The overhead due to the optimizer is ~10x the overhead of Adam, which is usually small compared to the compute cost of computing the gradients for the model you are applying VeLO to train. See leftmost pane in this plot: https://t.co/359UyP9INv

2022-11-18 19:28:20 @short_spy Nope, not currently. Have I mentioned how great JAX is? Because JAX is really great.

2022-11-18 19:27:45 @TexasBigData https://t.co/C2sKUiOYgX

2022-11-18 14:29:34 @albertzeyer @bucketofkets @giffmana @jekbradbury @naman33k @poolio @IMordatch @ada_rob The model params are published. See https://t.co/G7bLCn2vkZ

2022-11-18 14:01:30 RT @giffmana: My colleagues managed to *actually* learn a generic optimizer. What was impressive to me is that with absolutely zero tuning,…

2022-11-18 04:59:22 RT @bucketofkets: This was a really fantastic project—the culmination of literal years of work led by .@Luke_Metz. We’re proud to release…

2022-11-18 04:57:18 RT @poolio: Learned optimizers finally work! Swap out Adam for VeLO: a learned optimizer that outperforms human-designed optimizers withn…

2022-11-18 04:57:13 RT @ada_rob: Very excited to have been a small part of this amazing project. More work to be done to make this optimizer the go-to for gi…

2022-11-18 04:50:10 And huge thank yous to collaborators @bucketofkets Amil Merchant @giffmana @jekbradbury @naman33k @poolio @IMordatch @ada_rob as well!!! https://t.co/rKHQ2dInXt

2022-11-18 04:50:09 And the resulting learned optimizer works really well! We reached out to other researchers inside Brain, and had them try it on their tasks, and subject to the scale constraints I mention above it did as well or better than what they were currently using, with no tuning.

2022-11-18 04:50:08 Meta-training learned optimizers is HARD. Each meta-training datapoint is an entire optimization task, so building a large meta-training dataset is HARD. Each of N meta-training steps can contain N training steps applying the learned optimizer -- so compute is also extreme (N^2). https://t.co/GwDaEZLWgS

2022-11-18 04:50:07 If you are training models with <

2022-11-19 19:40:49 @GMartius The optimizer wall time overhead is about 10x the overhead of Adam. For most problems though this is still small compared to the time to compute the gradients. See the left pane in this plot from the appendix: https://t.co/QSZ1qR7NsQ

2022-11-19 19:34:13 @w_t_payne Yes! Or -- we don't address it in this work, but that is another clear target for meta-learning.

2022-11-19 19:32:45 @mauricetpunkt We have, but it doesn't seem to be necessary. We've also tried initializing the learned optimizer by *distilling* another optimizer, like Adam. This works OK, but has never really been pushed. (@Luke_Metz of course may have more to say)

2022-11-18 22:00:11 @yablak Many ML are going to https://t.co/1AiLl2tfTk.

2022-11-18 21:47:49 RT @ada_rob: Here is a real-world example (not in the paper) for T5 Small (~60M params). The VeLO-trained model reaches the same loss as…

2022-11-18 19:37:42 @deliprao @Luke_Metz @jmes_harrison @bucketofkets Nope, JAX only for the moment -- unless you want to make the PyTorch port?

2022-11-18 19:36:41 @AIjedi @Luke_Metz Have we mentioned how great JAX is yet? JAX is pretty great. If some bold stranger wanted to port VeLO to PyTorch though, then that would be amazing.

2022-11-18 19:34:12 @fedetask @jmes_harrison We tested this in the paper for Ant ... it's not great. RL problems are very different than the meta-training distribution. Future work to finetune VeLO to do well on RL tasks, or to include RL problems in meta-training distribution for VeLO 2.

2022-11-18 19:32:21 @DrJimFan @jmes_harrison @Luke_Metz @bucketofkets @poolio @ada_rob @IMordatch @amilmerchant @jekbradbury @naman33k The overhead due to the optimizer is ~10x the overhead of Adam, which is usually small compared to the compute cost of computing the gradients for the model you are applying VeLO to train. See leftmost pane in this plot: https://t.co/359UyP9INv

2022-11-18 19:28:20 @short_spy Nope, not currently. Have I mentioned how great JAX is? Because JAX is really great.

2022-11-18 19:27:45 @TexasBigData https://t.co/C2sKUiOYgX

2022-11-18 14:29:34 @albertzeyer @bucketofkets @giffmana @jekbradbury @naman33k @poolio @IMordatch @ada_rob The model params are published. See https://t.co/G7bLCn2vkZ

2022-11-18 14:01:30 RT @giffmana: My colleagues managed to *actually* learn a generic optimizer. What was impressive to me is that with absolutely zero tuning,…

2022-11-18 04:59:22 RT @bucketofkets: This was a really fantastic project—the culmination of literal years of work led by .@Luke_Metz. We’re proud to release…

2022-11-18 04:57:18 RT @poolio: Learned optimizers finally work! Swap out Adam for VeLO: a learned optimizer that outperforms human-designed optimizers withn…

2022-11-18 04:57:13 RT @ada_rob: Very excited to have been a small part of this amazing project. More work to be done to make this optimizer the go-to for gi…

2022-11-18 04:50:10 And huge thank yous to collaborators @bucketofkets Amil Merchant @giffmana @jekbradbury @naman33k @poolio @IMordatch @ada_rob as well!!! https://t.co/rKHQ2dInXt

2022-11-18 04:50:09 And the resulting learned optimizer works really well! We reached out to other researchers inside Brain, and had them try it on their tasks, and subject to the scale constraints I mention above it did as well or better than what they were currently using, with no tuning.

2022-11-18 04:50:08 Meta-training learned optimizers is HARD. Each meta-training datapoint is an entire optimization task, so building a large meta-training dataset is HARD. Each of N meta-training steps can contain N training steps applying the learned optimizer -- so compute is also extreme (N^2). https://t.co/GwDaEZLWgS

2022-11-18 04:50:07 If you are training models with <

2022-11-19 19:40:49 @GMartius The optimizer wall time overhead is about 10x the overhead of Adam. For most problems though this is still small compared to the time to compute the gradients. See the left pane in this plot from the appendix: https://t.co/QSZ1qR7NsQ

2022-11-19 19:34:13 @w_t_payne Yes! Or -- we don't address it in this work, but that is another clear target for meta-learning.

2022-11-19 19:32:45 @mauricetpunkt We have, but it doesn't seem to be necessary. We've also tried initializing the learned optimizer by *distilling* another optimizer, like Adam. This works OK, but has never really been pushed. (@Luke_Metz of course may have more to say)

2022-11-18 22:00:11 @yablak Many ML are going to https://t.co/1AiLl2tfTk.

2022-11-18 21:47:49 RT @ada_rob: Here is a real-world example (not in the paper) for T5 Small (~60M params). The VeLO-trained model reaches the same loss as…

2022-11-18 19:37:42 @deliprao @Luke_Metz @jmes_harrison @bucketofkets Nope, JAX only for the moment -- unless you want to make the PyTorch port?

2022-11-18 19:36:41 @AIjedi @Luke_Metz Have we mentioned how great JAX is yet? JAX is pretty great. If some bold stranger wanted to port VeLO to PyTorch though, then that would be amazing.

2022-11-18 19:34:12 @fedetask @jmes_harrison We tested this in the paper for Ant ... it's not great. RL problems are very different than the meta-training distribution. Future work to finetune VeLO to do well on RL tasks, or to include RL problems in meta-training distribution for VeLO 2.

2022-11-18 19:32:21 @DrJimFan @jmes_harrison @Luke_Metz @bucketofkets @poolio @ada_rob @IMordatch @amilmerchant @jekbradbury @naman33k The overhead due to the optimizer is ~10x the overhead of Adam, which is usually small compared to the compute cost of computing the gradients for the model you are applying VeLO to train. See leftmost pane in this plot: https://t.co/359UyP9INv

2022-11-18 19:28:20 @short_spy Nope, not currently. Have I mentioned how great JAX is? Because JAX is really great.

2022-11-18 19:27:45 @TexasBigData https://t.co/C2sKUiOYgX

2022-11-18 14:29:34 @albertzeyer @bucketofkets @giffmana @jekbradbury @naman33k @poolio @IMordatch @ada_rob The model params are published. See https://t.co/G7bLCn2vkZ

2022-11-18 14:01:30 RT @giffmana: My colleagues managed to *actually* learn a generic optimizer. What was impressive to me is that with absolutely zero tuning,…

2022-11-18 04:59:22 RT @bucketofkets: This was a really fantastic project—the culmination of literal years of work led by .@Luke_Metz. We’re proud to release…

2022-11-18 04:57:18 RT @poolio: Learned optimizers finally work! Swap out Adam for VeLO: a learned optimizer that outperforms human-designed optimizers withn…

2022-11-18 04:57:13 RT @ada_rob: Very excited to have been a small part of this amazing project. More work to be done to make this optimizer the go-to for gi…

2022-11-18 04:50:10 And huge thank yous to collaborators @bucketofkets Amil Merchant @giffmana @jekbradbury @naman33k @poolio @IMordatch @ada_rob as well!!! https://t.co/rKHQ2dInXt

2022-11-18 04:50:09 And the resulting learned optimizer works really well! We reached out to other researchers inside Brain, and had them try it on their tasks, and subject to the scale constraints I mention above it did as well or better than what they were currently using, with no tuning.

2022-11-18 04:50:08 Meta-training learned optimizers is HARD. Each meta-training datapoint is an entire optimization task, so building a large meta-training dataset is HARD. Each of N meta-training steps can contain N training steps applying the learned optimizer -- so compute is also extreme (N^2). https://t.co/GwDaEZLWgS

2022-11-18 04:50:07 If you are training models with <

2022-11-21 18:38:37 RT @wtgowers: Note that if X is a finite set and we take all its subsets, then every element of X belongs to exactly half the subsets. Yes…

2022-11-19 19:40:49 @GMartius The optimizer wall time overhead is about 10x the overhead of Adam. For most problems though this is still small compared to the time to compute the gradients. See the left pane in this plot from the appendix: https://t.co/QSZ1qR7NsQ

2022-11-19 19:34:13 @w_t_payne Yes! Or -- we don't address it in this work, but that is another clear target for meta-learning.

2022-11-19 19:32:45 @mauricetpunkt We have, but it doesn't seem to be necessary. We've also tried initializing the learned optimizer by *distilling* another optimizer, like Adam. This works OK, but has never really been pushed. (@Luke_Metz of course may have more to say)

2022-11-18 22:00:11 @yablak Many ML are going to https://t.co/1AiLl2tfTk.

2022-11-18 21:47:49 RT @ada_rob: Here is a real-world example (not in the paper) for T5 Small (~60M params). The VeLO-trained model reaches the same loss as…

2022-11-18 19:37:42 @deliprao @Luke_Metz @jmes_harrison @bucketofkets Nope, JAX only for the moment -- unless you want to make the PyTorch port?

2022-11-18 19:36:41 @AIjedi @Luke_Metz Have we mentioned how great JAX is yet? JAX is pretty great. If some bold stranger wanted to port VeLO to PyTorch though, then that would be amazing.

2022-11-18 19:34:12 @fedetask @jmes_harrison We tested this in the paper for Ant ... it's not great. RL problems are very different than the meta-training distribution. Future work to finetune VeLO to do well on RL tasks, or to include RL problems in meta-training distribution for VeLO 2.

2022-11-18 19:32:21 @DrJimFan @jmes_harrison @Luke_Metz @bucketofkets @poolio @ada_rob @IMordatch @amilmerchant @jekbradbury @naman33k The overhead due to the optimizer is ~10x the overhead of Adam, which is usually small compared to the compute cost of computing the gradients for the model you are applying VeLO to train. See leftmost pane in this plot: https://t.co/359UyP9INv

2022-11-18 19:28:20 @short_spy Nope, not currently. Have I mentioned how great JAX is? Because JAX is really great.

2022-11-18 19:27:45 @TexasBigData https://t.co/C2sKUiOYgX

2022-11-18 14:29:34 @albertzeyer @bucketofkets @giffmana @jekbradbury @naman33k @poolio @IMordatch @ada_rob The model params are published. See https://t.co/G7bLCn2vkZ

2022-11-18 14:01:30 RT @giffmana: My colleagues managed to *actually* learn a generic optimizer. What was impressive to me is that with absolutely zero tuning,…

2022-11-18 04:59:22 RT @bucketofkets: This was a really fantastic project—the culmination of literal years of work led by .@Luke_Metz. We’re proud to release…

2022-11-18 04:57:18 RT @poolio: Learned optimizers finally work! Swap out Adam for VeLO: a learned optimizer that outperforms human-designed optimizers withn…

2022-11-18 04:57:13 RT @ada_rob: Very excited to have been a small part of this amazing project. More work to be done to make this optimizer the go-to for gi…

2022-11-18 04:50:10 And huge thank yous to collaborators @bucketofkets Amil Merchant @giffmana @jekbradbury @naman33k @poolio @IMordatch @ada_rob as well!!! https://t.co/rKHQ2dInXt

2022-11-18 04:50:09 And the resulting learned optimizer works really well! We reached out to other researchers inside Brain, and had them try it on their tasks, and subject to the scale constraints I mention above it did as well or better than what they were currently using, with no tuning.

2022-11-18 04:50:08 Meta-training learned optimizers is HARD. Each meta-training datapoint is an entire optimization task, so building a large meta-training dataset is HARD. Each of N meta-training steps can contain N training steps applying the learned optimizer -- so compute is also extreme (N^2). https://t.co/GwDaEZLWgS

2022-11-18 04:50:07 If you are training models with <

2022-11-21 18:38:37 RT @wtgowers: Note that if X is a finite set and we take all its subsets, then every element of X belongs to exactly half the subsets. Yes…

2022-11-19 19:40:49 @GMartius The optimizer wall time overhead is about 10x the overhead of Adam. For most problems though this is still small compared to the time to compute the gradients. See the left pane in this plot from the appendix: https://t.co/QSZ1qR7NsQ

2022-11-19 19:34:13 @w_t_payne Yes! Or -- we don't address it in this work, but that is another clear target for meta-learning.

2022-11-19 19:32:45 @mauricetpunkt We have, but it doesn't seem to be necessary. We've also tried initializing the learned optimizer by *distilling* another optimizer, like Adam. This works OK, but has never really been pushed. (@Luke_Metz of course may have more to say)

2022-11-18 22:00:11 @yablak Many ML are going to https://t.co/1AiLl2tfTk.

2022-11-18 21:47:49 RT @ada_rob: Here is a real-world example (not in the paper) for T5 Small (~60M params). The VeLO-trained model reaches the same loss as…

2022-11-18 19:37:42 @deliprao @Luke_Metz @jmes_harrison @bucketofkets Nope, JAX only for the moment -- unless you want to make the PyTorch port?

2022-11-18 19:36:41 @AIjedi @Luke_Metz Have we mentioned how great JAX is yet? JAX is pretty great. If some bold stranger wanted to port VeLO to PyTorch though, then that would be amazing.

2022-11-18 19:34:12 @fedetask @jmes_harrison We tested this in the paper for Ant ... it's not great. RL problems are very different than the meta-training distribution. Future work to finetune VeLO to do well on RL tasks, or to include RL problems in meta-training distribution for VeLO 2.

2022-11-18 19:32:21 @DrJimFan @jmes_harrison @Luke_Metz @bucketofkets @poolio @ada_rob @IMordatch @amilmerchant @jekbradbury @naman33k The overhead due to the optimizer is ~10x the overhead of Adam, which is usually small compared to the compute cost of computing the gradients for the model you are applying VeLO to train. See leftmost pane in this plot: https://t.co/359UyP9INv

2022-11-18 19:28:20 @short_spy Nope, not currently. Have I mentioned how great JAX is? Because JAX is really great.

2022-11-18 19:27:45 @TexasBigData https://t.co/C2sKUiOYgX

2022-11-18 14:29:34 @albertzeyer @bucketofkets @giffmana @jekbradbury @naman33k @poolio @IMordatch @ada_rob The model params are published. See https://t.co/G7bLCn2vkZ

2022-11-18 14:01:30 RT @giffmana: My colleagues managed to *actually* learn a generic optimizer. What was impressive to me is that with absolutely zero tuning,…

2022-11-18 04:59:22 RT @bucketofkets: This was a really fantastic project—the culmination of literal years of work led by .@Luke_Metz. We’re proud to release…

2022-11-18 04:57:18 RT @poolio: Learned optimizers finally work! Swap out Adam for VeLO: a learned optimizer that outperforms human-designed optimizers withn…

2022-11-18 04:57:13 RT @ada_rob: Very excited to have been a small part of this amazing project. More work to be done to make this optimizer the go-to for gi…

2022-11-18 04:50:10 And huge thank yous to collaborators @bucketofkets Amil Merchant @giffmana @jekbradbury @naman33k @poolio @IMordatch @ada_rob as well!!! https://t.co/rKHQ2dInXt

2022-11-18 04:50:09 And the resulting learned optimizer works really well! We reached out to other researchers inside Brain, and had them try it on their tasks, and subject to the scale constraints I mention above it did as well or better than what they were currently using, with no tuning.

2022-11-18 04:50:08 Meta-training learned optimizers is HARD. Each meta-training datapoint is an entire optimization task, so building a large meta-training dataset is HARD. Each of N meta-training steps can contain N training steps applying the learned optimizer -- so compute is also extreme (N^2). https://t.co/GwDaEZLWgS

2022-11-18 04:50:07 If you are training models with <

2022-11-25 01:59:21 @geoffreyirving @sbeckerkahn This just exceeded my mathematical depth. I don't disbelieve you though!

2022-11-24 21:09:10 @geoffreyirving @sbeckerkahn OTOH, rationals are dense in the 2d plane, and Brownian motion has a fractal dimension of 2, so probably a Brownian SDE would hit rational points?

2022-11-21 18:38:37 RT @wtgowers: Note that if X is a finite set and we take all its subsets, then every element of X belongs to exactly half the subsets. Yes…

2022-11-19 19:40:49 @GMartius The optimizer wall time overhead is about 10x the overhead of Adam. For most problems though this is still small compared to the time to compute the gradients. See the left pane in this plot from the appendix: https://t.co/QSZ1qR7NsQ

2022-11-19 19:34:13 @w_t_payne Yes! Or -- we don't address it in this work, but that is another clear target for meta-learning.

2022-11-19 19:32:45 @mauricetpunkt We have, but it doesn't seem to be necessary. We've also tried initializing the learned optimizer by *distilling* another optimizer, like Adam. This works OK, but has never really been pushed. (@Luke_Metz of course may have more to say)

2022-11-18 22:00:11 @yablak Many ML are going to https://t.co/1AiLl2tfTk.

2022-11-18 21:47:49 RT @ada_rob: Here is a real-world example (not in the paper) for T5 Small (~60M params). The VeLO-trained model reaches the same loss as…

2022-11-18 19:37:42 @deliprao @Luke_Metz @jmes_harrison @bucketofkets Nope, JAX only for the moment -- unless you want to make the PyTorch port?

2022-11-18 19:36:41 @AIjedi @Luke_Metz Have we mentioned how great JAX is yet? JAX is pretty great. If some bold stranger wanted to port VeLO to PyTorch though, then that would be amazing.

2022-11-18 19:34:12 @fedetask @jmes_harrison We tested this in the paper for Ant ... it's not great. RL problems are very different than the meta-training distribution. Future work to finetune VeLO to do well on RL tasks, or to include RL problems in meta-training distribution for VeLO 2.

2022-11-18 19:32:21 @DrJimFan @jmes_harrison @Luke_Metz @bucketofkets @poolio @ada_rob @IMordatch @amilmerchant @jekbradbury @naman33k The overhead due to the optimizer is ~10x the overhead of Adam, which is usually small compared to the compute cost of computing the gradients for the model you are applying VeLO to train. See leftmost pane in this plot: https://t.co/359UyP9INv

2022-11-18 19:28:20 @short_spy Nope, not currently. Have I mentioned how great JAX is? Because JAX is really great.

2022-11-18 19:27:45 @TexasBigData https://t.co/C2sKUiOYgX

2022-11-18 14:29:34 @albertzeyer @bucketofkets @giffmana @jekbradbury @naman33k @poolio @IMordatch @ada_rob The model params are published. See https://t.co/G7bLCn2vkZ

2022-11-18 14:01:30 RT @giffmana: My colleagues managed to *actually* learn a generic optimizer. What was impressive to me is that with absolutely zero tuning,…

2022-11-18 04:59:22 RT @bucketofkets: This was a really fantastic project—the culmination of literal years of work led by .@Luke_Metz. We’re proud to release…

2022-11-18 04:57:18 RT @poolio: Learned optimizers finally work! Swap out Adam for VeLO: a learned optimizer that outperforms human-designed optimizers withn…

2022-11-18 04:57:13 RT @ada_rob: Very excited to have been a small part of this amazing project. More work to be done to make this optimizer the go-to for gi…

2022-11-18 04:50:10 And huge thank yous to collaborators @bucketofkets Amil Merchant @giffmana @jekbradbury @naman33k @poolio @IMordatch @ada_rob as well!!! https://t.co/rKHQ2dInXt

2022-11-18 04:50:09 And the resulting learned optimizer works really well! We reached out to other researchers inside Brain, and had them try it on their tasks, and subject to the scale constraints I mention above it did as well or better than what they were currently using, with no tuning.

2022-11-18 04:50:08 Meta-training learned optimizers is HARD. Each meta-training datapoint is an entire optimization task, so building a large meta-training dataset is HARD. Each of N meta-training steps can contain N training steps applying the learned optimizer -- so compute is also extreme (N^2). https://t.co/GwDaEZLWgS

2022-11-18 04:50:07 If you are training models with <

2022-11-25 01:59:21 @geoffreyirving @sbeckerkahn This just exceeded my mathematical depth. I don't disbelieve you though!

2022-11-24 21:09:10 @geoffreyirving @sbeckerkahn OTOH, rationals are dense in the 2d plane, and Brownian motion has a fractal dimension of 2, so probably a Brownian SDE would hit rational points?

2022-11-21 18:38:37 RT @wtgowers: Note that if X is a finite set and we take all its subsets, then every element of X belongs to exactly half the subsets. Yes…

2022-11-19 19:40:49 @GMartius The optimizer wall time overhead is about 10x the overhead of Adam. For most problems though this is still small compared to the time to compute the gradients. See the left pane in this plot from the appendix: https://t.co/QSZ1qR7NsQ

2022-11-19 19:34:13 @w_t_payne Yes! Or -- we don't address it in this work, but that is another clear target for meta-learning.

2022-11-19 19:32:45 @mauricetpunkt We have, but it doesn't seem to be necessary. We've also tried initializing the learned optimizer by *distilling* another optimizer, like Adam. This works OK, but has never really been pushed. (@Luke_Metz of course may have more to say)

2022-11-18 22:00:11 @yablak Many ML are going to https://t.co/1AiLl2tfTk.

2022-11-18 21:47:49 RT @ada_rob: Here is a real-world example (not in the paper) for T5 Small (~60M params). The VeLO-trained model reaches the same loss as…

2022-11-18 19:37:42 @deliprao @Luke_Metz @jmes_harrison @bucketofkets Nope, JAX only for the moment -- unless you want to make the PyTorch port?

2022-11-18 19:36:41 @AIjedi @Luke_Metz Have we mentioned how great JAX is yet? JAX is pretty great. If some bold stranger wanted to port VeLO to PyTorch though, then that would be amazing.

2022-11-18 19:34:12 @fedetask @jmes_harrison We tested this in the paper for Ant ... it's not great. RL problems are very different than the meta-training distribution. Future work to finetune VeLO to do well on RL tasks, or to include RL problems in meta-training distribution for VeLO 2.

2022-11-18 19:32:21 @DrJimFan @jmes_harrison @Luke_Metz @bucketofkets @poolio @ada_rob @IMordatch @amilmerchant @jekbradbury @naman33k The overhead due to the optimizer is ~10x the overhead of Adam, which is usually small compared to the compute cost of computing the gradients for the model you are applying VeLO to train. See leftmost pane in this plot: https://t.co/359UyP9INv

2022-11-18 19:28:20 @short_spy Nope, not currently. Have I mentioned how great JAX is? Because JAX is really great.

2022-11-18 19:27:45 @TexasBigData https://t.co/C2sKUiOYgX

2022-11-18 14:29:34 @albertzeyer @bucketofkets @giffmana @jekbradbury @naman33k @poolio @IMordatch @ada_rob The model params are published. See https://t.co/G7bLCn2vkZ

2022-11-18 14:01:30 RT @giffmana: My colleagues managed to *actually* learn a generic optimizer. What was impressive to me is that with absolutely zero tuning,…

2022-11-18 04:59:22 RT @bucketofkets: This was a really fantastic project—the culmination of literal years of work led by .@Luke_Metz. We’re proud to release…

2022-11-18 04:57:18 RT @poolio: Learned optimizers finally work! Swap out Adam for VeLO: a learned optimizer that outperforms human-designed optimizers withn…

2022-11-18 04:57:13 RT @ada_rob: Very excited to have been a small part of this amazing project. More work to be done to make this optimizer the go-to for gi…

2022-11-18 04:50:10 And huge thank yous to collaborators @bucketofkets Amil Merchant @giffmana @jekbradbury @naman33k @poolio @IMordatch @ada_rob as well!!! https://t.co/rKHQ2dInXt

2022-11-18 04:50:09 And the resulting learned optimizer works really well! We reached out to other researchers inside Brain, and had them try it on their tasks, and subject to the scale constraints I mention above it did as well or better than what they were currently using, with no tuning.

2022-11-18 04:50:08 Meta-training learned optimizers is HARD. Each meta-training datapoint is an entire optimization task, so building a large meta-training dataset is HARD. Each of N meta-training steps can contain N training steps applying the learned optimizer -- so compute is also extreme (N^2). https://t.co/GwDaEZLWgS

2022-11-18 04:50:07 If you are training models with <

2022-11-25 01:59:21 @geoffreyirving @sbeckerkahn This just exceeded my mathematical depth. I don't disbelieve you though!

2022-11-24 21:09:10 @geoffreyirving @sbeckerkahn OTOH, rationals are dense in the 2d plane, and Brownian motion has a fractal dimension of 2, so probably a Brownian SDE would hit rational points?

2022-11-21 18:38:37 RT @wtgowers: Note that if X is a finite set and we take all its subsets, then every element of X belongs to exactly half the subsets. Yes…

2022-11-19 19:40:49 @GMartius The optimizer wall time overhead is about 10x the overhead of Adam. For most problems though this is still small compared to the time to compute the gradients. See the left pane in this plot from the appendix: https://t.co/QSZ1qR7NsQ

2022-11-19 19:34:13 @w_t_payne Yes! Or -- we don't address it in this work, but that is another clear target for meta-learning.

2022-11-19 19:32:45 @mauricetpunkt We have, but it doesn't seem to be necessary. We've also tried initializing the learned optimizer by *distilling* another optimizer, like Adam. This works OK, but has never really been pushed. (@Luke_Metz of course may have more to say)

2022-11-18 22:00:11 @yablak Many ML are going to https://t.co/1AiLl2tfTk.

2022-11-18 21:47:49 RT @ada_rob: Here is a real-world example (not in the paper) for T5 Small (~60M params). The VeLO-trained model reaches the same loss as…

2022-11-18 19:37:42 @deliprao @Luke_Metz @jmes_harrison @bucketofkets Nope, JAX only for the moment -- unless you want to make the PyTorch port?

2022-11-18 19:36:41 @AIjedi @Luke_Metz Have we mentioned how great JAX is yet? JAX is pretty great. If some bold stranger wanted to port VeLO to PyTorch though, then that would be amazing.

2022-11-18 19:34:12 @fedetask @jmes_harrison We tested this in the paper for Ant ... it's not great. RL problems are very different than the meta-training distribution. Future work to finetune VeLO to do well on RL tasks, or to include RL problems in meta-training distribution for VeLO 2.

2022-11-18 19:32:21 @DrJimFan @jmes_harrison @Luke_Metz @bucketofkets @poolio @ada_rob @IMordatch @amilmerchant @jekbradbury @naman33k The overhead due to the optimizer is ~10x the overhead of Adam, which is usually small compared to the compute cost of computing the gradients for the model you are applying VeLO to train. See leftmost pane in this plot: https://t.co/359UyP9INv

2022-11-18 19:28:20 @short_spy Nope, not currently. Have I mentioned how great JAX is? Because JAX is really great.

2022-11-18 19:27:45 @TexasBigData https://t.co/C2sKUiOYgX

2022-11-18 14:29:34 @albertzeyer @bucketofkets @giffmana @jekbradbury @naman33k @poolio @IMordatch @ada_rob The model params are published. See https://t.co/G7bLCn2vkZ

2022-11-18 14:01:30 RT @giffmana: My colleagues managed to *actually* learn a generic optimizer. What was impressive to me is that with absolutely zero tuning,…

2022-11-18 04:59:22 RT @bucketofkets: This was a really fantastic project—the culmination of literal years of work led by .@Luke_Metz. We’re proud to release…

2022-11-18 04:57:18 RT @poolio: Learned optimizers finally work! Swap out Adam for VeLO: a learned optimizer that outperforms human-designed optimizers withn…

2022-11-18 04:57:13 RT @ada_rob: Very excited to have been a small part of this amazing project. More work to be done to make this optimizer the go-to for gi…

2022-11-18 04:50:10 And huge thank yous to collaborators @bucketofkets Amil Merchant @giffmana @jekbradbury @naman33k @poolio @IMordatch @ada_rob as well!!! https://t.co/rKHQ2dInXt

2022-11-18 04:50:09 And the resulting learned optimizer works really well! We reached out to other researchers inside Brain, and had them try it on their tasks, and subject to the scale constraints I mention above it did as well or better than what they were currently using, with no tuning.

2022-11-18 04:50:08 Meta-training learned optimizers is HARD. Each meta-training datapoint is an entire optimization task, so building a large meta-training dataset is HARD. Each of N meta-training steps can contain N training steps applying the learned optimizer -- so compute is also extreme (N^2). https://t.co/GwDaEZLWgS
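
A minimal sketch of the quadratic-compute point above (a hypothetical toy loop, not the paper's meta-training code): each of the N meta-training steps unrolls up to N inner training steps of the learned optimizer, so the total number of inner steps grows as N^2.

```python
# Toy illustration of O(N^2) inner steps during meta-training.
inner_steps_taken = 0

def inner_train(theta, num_inner_steps):
    """Stand-in for training one task with the learned optimizer."""
    global inner_steps_taken
    for _ in range(num_inner_steps):
        inner_steps_taken += 1   # one application of the learned optimizer
        theta = 0.9 * theta      # placeholder update; a real run would call the optimizer
    return theta                 # the meta-loss would be computed from the trained theta

N = 100                          # meta-steps == inner unroll length, for illustration
for _ in range(N):
    inner_train(theta=1.0, num_inner_steps=N)

print(inner_steps_taken)         # 10000 == N * N
```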

2022-11-18 04:50:07 If you are training models with <

2022-11-25 01:59:21 @geoffreyirving @sbeckerkahn This just exceeded my mathematical depth. I don't disbelieve you though!

2022-03-12 08:11:00 CAFIAC FIX

2022-01-17 08:11:00 CAFIAC FIX

2022-01-11 08:11:00 CAFIAC FIX

2021-12-27 08:20:00 CAFIAC FIX

2021-11-06 23:20:00 CAFIAC FIX

2021-12-14 15:38:34 Want to compute the empirical NTK orders of magnitude faster? Come to our poster presentation to find out how, and let rays of light come from your brain . https://t.co/b2Ja6VIr9g
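
For reference, here is a minimal plain-JAX sketch (my own toy model and function names, not the method from the linked poster, which is about avoiding exactly this naive computation) of the empirical NTK being sped up: the kernel Theta(x1, x2) = J(x1) J(x2)^T, where J is the Jacobian of the network outputs with respect to the parameters.

```python
import jax
import jax.numpy as jnp

def init_params(key, d_in=8, d_hidden=16, d_out=1):
    k1, k2 = jax.random.split(key)
    return {
        "w1": jax.random.normal(k1, (d_in, d_hidden)) / jnp.sqrt(d_in),
        "w2": jax.random.normal(k2, (d_hidden, d_out)) / jnp.sqrt(d_hidden),
    }

def apply_fn(params, x):
    # Tiny MLP with scalar output, so the kernel is a plain [n1, n2] matrix.
    return jnp.tanh(x @ params["w1"]) @ params["w2"]

def empirical_ntk(params, x1, x2):
    # Naive approach: materialize full Jacobians and contract over all parameters.
    j1 = jax.jacobian(lambda p: apply_fn(p, x1))(params)
    j2 = jax.jacobian(lambda p: apply_fn(p, x2))(params)
    f1 = jnp.concatenate([l.reshape(x1.shape[0], -1) for l in jax.tree_util.tree_leaves(j1)], axis=-1)
    f2 = jnp.concatenate([l.reshape(x2.shape[0], -1) for l in jax.tree_util.tree_leaves(j2)], axis=-1)
    return f1 @ f2.T  # [n1, n2] empirical NTK

key = jax.random.PRNGKey(0)
params = init_params(key)
x = jax.random.normal(key, (4, 8))
print(empirical_ntk(params, x, x).shape)  # (4, 4)
```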

2021-12-14 01:58:55 @pfau @RobertRosenba14 The abstract was indeed diplomatically phrased. You should read this particular paper in either more or less depth (title: Adversarial Examples Are a Natural Consequence of Test Error in Noise). (also relevant, the "adversarial spheres" paper https://t.co/hdDwBhqhuL)

2021-12-14 01:47:26 @pfau @RobertRosenba14 cope?

2021-12-14 01:46:27 @pfau @achristensen56 @neurograce @josueortc @RobertRosenba14 The internet seems to have a surprising number of images of watermelons that look like pandas. Would that be OK? https://t.co/oIEUX6eA3X

2021-12-14 01:42:01 @pfau @RobertRosenba14 Under some relatively reasonable assumptions, *any* classifier that makes test errors on high dimensional inputs is susceptible to adversarial examples. https://t.co/7Sj8q50KCq. I think humans are likely susceptible, though finding the right perturbation for a person is hard.

2021-12-14 01:37:07 @achristensen56 @neurograce @josueortc @pfau @RobertRosenba14 (ie, human decisions are biased in the direction of the adversarial perturbation, but they don't experience a high confidence belief that the stimulus is actually the class it is adversarially perturbed towards)

2021-12-14 01:35:25 @achristensen56 @neurograce @josueortc @pfau @RobertRosenba14 Thanks! We have upcoming work (DM if you'd like to provide feedback on a draft :P) showing a perceptual effect even for unlimited exposure time, no backward masking, and eps+-2 perturbations. The effect is much weaker in humans than machines though, and relies on a 2afc paradigm.

2021-12-14 01:24:12 RT @IARAInews: Join our next IARAI seminar about ‘Learned optimizers: why they're the future, why they’re hard, and what they can do now’ o…

2021-11-06 23:20:00 CAFIAC FIX

2021-11-06 19:50:00 CAFIAC FIX

2021-11-06 18:59:00 CAFIAC FIX

2021-11-01 19:20:00 CAFIAC FIX

2021-11-01 17:30:00 CAFIAC FIX

2021-09-01 17:34:30 @PBFcomics As another layer: rats can also (probably) echolocate. So they're both cheating, the rat is just worse at it. https://t.co/TvTUg4muJ1

2021-08-12 19:57:45 This is an excellent initiative! A great opportunity for both potential mentors and mentees. https://t.co/5DPJlSNZgn

2021-08-11 21:06:27 RT @savvyRL: Earlier this summer Krystal and I are appointed as ICLR 2022 DEI chairs. Right away we started planning something tangible and…

2021-07-24 01:08:37 If you like BIG-bench, you will also like NL-Augmenter! https://t.co/5spBy3d4PA

2021-07-21 03:28:05 RT @Luke_Metz: Interested in computing gradients through unrolled computation graphs? Come see our paper at ICML! We construct an ES like…

2021-07-19 16:29:50 I am *extremely* proud to share that we were awarded the ICML outstanding paper award! Major credit and thanks to my collaborators @PaulVicol and @Luke_Metz ! Paul especially owned every part of this project, and I think his care and extreme thoroughness are the reasons we won. https://t.co/lOuPVkXTXW

2021-07-08 04:00:20 RT @venkvis: Excited to kick-start focus #SciML series on #ML meets Info theory and statistical mechanics! Amazing speaker/session chair li…

2021-06-22 16:26:00 If you have ideas for data augmentation in NLP, contributing to this is a great way to push the field forward, and *increase your h-index by 1* at the same time. Lightweight barrier to entry -- check out the nice colab: https://t.co/HBHhiQDLtC https://t.co/5spBy3d4PA

2021-06-10 17:50:00 @TacoCohen @geoffreyhinton @erikverlinde @mmbronstein @risi_kondor @erikjbekkers @wellingmax Wow — congratulations!!

2021-06-09 20:51:40 RT @hojonathanho: New paper on cascaded diffusion models for ImageNet generation! We outperform BigGAN-deep and VQ-VAE-2 on FID score and c…

2021-06-04 18:18:23 @pfau Surprise twist : you're theory is completely right in every respect ... but those unusual plasma configurations *are* alien life, that lives on the and is getting stranded on earth

2021-05-25 16:04:30 My group in Google Brain is hiring a full time researcher, for a research team focused on learned optimizers. Are you interested in meta-learning, bilevel optimization, dynamical systems? Apply here: https://t.co/gL8OSotdCo Please reach out with any questions! https://t.co/PAUcnLJ19B

2021-05-11 23:39:11 @jackclarkSF Can I interest you in the BIG-bench large scale collaborative benchmark for text-based AI? Task submissions open until June 1, over 100 diverse tasks submitted so far. Authors of accepted tasks included as co-authors on BIG-bench paper. https://t.co/dXnR9EbDuQ

2021-05-06 20:12:29 I am very, very excited about this workshop on enormous language models, and hope to see you there!!! Also participate (and increase your h-index ) by contributing a task to BIG-bench! https://t.co/dXnR9EbDuQ https://t.co/Igvo9L8ld4

2021-05-06 14:41:59 RT @YSongStanford: Checkout my new blog post on generative modeling by score matching and score-based models. I introduce the intuition beh…

2021-05-03 23:26:50 Come learn about our (outstanding paper award ) work building generative models by running SDEs backwards in time -- ICLR poster session in 30 minutes! https://t.co/4OhydzEQxE https://t.co/Ii6rPjydgZ

2021-05-03 23:17:50 @savvyRL @ml_collective Extremely happy to have you here, and excited to work with you! Welcome to Google!

2021-04-01 19:59:10 @savvyRL @PMinervini @YSongStanford @poolio Thanks!! (As @poolio observed, excellent April fool's day news )

2021-03-27 16:57:17 @colinraffel @verena_rieser Yes @verena_rieser! Under review elsewhere is fine (though you should consider potential leakage of the dataset into model training data). Check out the review criteria: https://t.co/HVgKRybWh3

2021-03-25 23:27:35 Come to our workshop on Enormous Language Models! Also, submit a task to the associated benchmark https://t.co/dXnR9EbDuQ, and be a co-author on the corresponding paper! https://t.co/YRzCEstjlN

2021-02-24 05:54:15 Hint: the answer is yes. :) https://t.co/p8fqEjgr7R

2021-02-04 00:44:29 @nabla_theta Perhaps there's a small subset of Pile that might make sense? PS -- Sorry for the slow response. I didn't see your question until today. (2/2)

2021-02-04 00:42:39 @nabla_theta We are definitely interested in bits-per-byte style measures of performance (* especially on holdout sets unlikely to be seen during training). Requiring models to evaluate an 825 GiB dataset as part of a standard benchmark might be technically challenging. (1/2)

2021-01-30 00:40:43 @BlancheMinerva Tasks that explore models' abilities to perform mathematical proofs would be great! (keep in mind that, as you say, the generated output does need to be automatically scored)

2021-01-30 00:36:12 @onurgu_ml You should feel free to submit tasks that interact with the model in any language! (so long as reviewers are able to understand what the task is doing) Current models are better at English, but already do surprisingly well at translation.

2021-01-30 00:33:54 @Seb_Bratieres My primary personal motivation for wanting to build a benchmark like this is both to understand the current shortcoming of large scale AI models, and especially to extrapolate what their future capabilities and impact will be.

2021-01-30 00:26:35 @ketran The majority of the training data is English, but the models already do a surprisingly good job at translation. Benchmark tasks can be in any language (so long as there is enough information for reviewers to understand + assess what the task is doing).

2021-01-27 05:14:07 @timnitGebru @JeffDean I + co-organizers would love contributions from researchers working on ethical AI, very much including you and the ethical AI team. Measuring a thing is a first step towards improving it, and we want to make measurement of social biases a core part of the benchmark.

2021-01-27 01:33:09 @nabla_theta and very much welcome any contributions of novel tasks though! You almost certainly have unique insights into the failings of current language models. (2/2)

2021-01-27 01:32:51 @nabla_theta Thanks! I wasn't aware of your evaluation project, which looks neat. We're pretty committed to our infrastructure and to our call-for-tasks-with-paper-authorship-for-submitters framework, so I don't know that it would make sense to combine the projects. We would love (1/2)

2021-01-26 22:58:26 We also encourage submission of tasks which quantify social bias in language models. Including measures of bias in a standard language model benchmark will motivate future research countering it.

2021-01-26 22:58:25 CALL FOR TASKS CAPTURING LIMITATIONS OF LARGE LANGUAGE MODELS We are soliciting contributions of tasks to a *collaborative* benchmark designed to measure and extrapolate the capabilities and limitations of large language models. Submit tasks at https://t.co/eJJXFtqPpi #BIGbench https://t.co/D1PoQHUQPr

2021-01-22 01:01:34 Bootstrapping the training of learned optimizers using randomly initialized learned optimizers. No hand designed optimizer involved (* unless you count population based training). A demonstration of the potential power of positive feedback loops in meta-learning. https://t.co/FPuqf9MX4w

2020-12-01 21:18:20 @SussilloDavid Congratulations David!! It was an honor to work with you for five of those six years.

2020-12-01 17:37:02 I think this will be a very important paper. My take: by unrolling SGD training steps and treating them as part of the NN architecture, computing the kernel after training (w/ feature learning) becomes equivalent to computing the NNGP kernel of the extended architecture. https://t.co/M0VFx2JAGD

2020-12-01 17:23:02 "Creating noise from data is easy

2020-11-13 03:48:12 Infinite width neural networks enable more compute-efficient Neural Architecture Search! https://t.co/TbgEFBI80E

2020-11-11 22:31:42 More observations: - Statistical properties of normalizers change dramatically over the first 50 training steps - Differences at train/test time are crucial to batch norm's success - Mean accumulation across the batch is more important than variance accumulation across the batch https://t.co/tVjnDRGCKn

2020-11-11 22:31:41 A simple prescription that will improve your models: When using LayerNorm, do mean subtraction *before* rather than after the affine transformation. This, and an in-depth empirical investigation of statistical properties of common normalizers in https://t.co/poAWNmpSah https://t.co/Ojcj0hsnTM

2020-11-05 20:16:54 Come watch Niru present some gorgeous analysis of how learned optimizers behave both like and unlike hand designed optimizers (and outperform hand designed optimizers). https://t.co/hCIYi2FiBE

2020-09-25 00:13:23 @LouisKirschAI @Luke_Metz Thank you for the MeaGenRL reference! In terms of dataset generalization, we meta- train and test using 15 datasets (app. G.1.2 in https://t.co/Y4NEodw7u4). Have not looked at generalization beyond those datasets, but I expect it would good for similar model architectures.

2020-09-24 22:08:16 @JFPuget Thank you! We will correct the reference.

2020-09-24 03:56:21 RT @Luke_Metz: We have a new paper on learned optimizers! We used thousands of tasks (and a lot of compute ) to train general purpose lear…

2020-09-24 03:07:42 Analogously to the first time a compiler can compile itself, it is even capable of training itself from scratch!!! I think we are now only a short distance away from learned optimizers being the best choice for most optimization tasks (though, we're not *quite* there yet). https://t.co/GyXhp2k4TT

2020-09-24 03:07:41 Modern deep learning is a story of learned features outperforming (then replacing!) hand-designed algorithms. But we still use hand designed loss functions and optimizers. Here is a big step towards learned optimizers outperforming existing optimizers: https://t.co/lA8R0BkpdX https://t.co/Pg4ehwoEEg

2020-08-30 00:25:17 RT @microcovid: We are delighted to introduce https://t.co/hBxswKbuxi, a tool to numerically estimate the COVID risk of specific ordinary a…

2020-08-25 04:31:53 @goodfellow_ian @pfau @negative_result @duck @sschoenholz @ethansdyer @dwf I have not yet read this paper -- but I've heard that https://t.co/8UgrPPHZXg does what you propose (ht @gamaleldinfe), and finds that relaxing the convolutional constraints at the end of training leads to a slight increase in performance.

2020-08-25 04:26:52 @goodfellow_ian @pfau @negative_result @duck @sschoenholz @ethansdyer @dwf For a CNN, the sparsity structure in the weight matrix means that even after random initialization, the directions corresponding to pixels are still special directions related to each other in a structured way after the randomly initialized conv kernel is applied.

2020-08-25 04:24:55 @goodfellow_ian @pfau @negative_result @duck @sschoenholz @ethansdyer @dwf Yeah, it's counterintuitive! It's at least as much about initialization as training. Information about the data essentially gets erased by randomness in the weight initialization. For a FCN with isotropic weights a random unitary transformation is effectively applied to the data.

2020-08-24 05:26:28 @goodfellow_ian @pfau @negative_result @duck @sschoenholz @ethansdyer @dwf I even believe that regularized ZCA was actively helpful for you reaching SOTA. We recently saw a similar effect where regularized ZCA is beneficial (but unregularized ZCA is worse than unwhitened data) in a different project -- Sec 3.10 and Fig 9 of https://t.co/dlugEMT5fW .
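
For readers who have not seen it, here is a generic regularized ZCA whitening routine (my own implementation and regularization convention, not the one from the linked paper), since the distinction between regularized and unregularized ZCA is what this exchange turns on.

```python
import numpy as np

def zca_whiten(X, eps=1e-2):
    """X: [num_examples, num_features]. Returns ZCA-whitened features."""
    X = X - X.mean(axis=0, keepdims=True)
    cov = X.T @ X / X.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Regularize by adding a small multiple of the mean eigenvalue before
    # inverting; eps=0 recovers plain (unregularized) ZCA.
    scale = 1.0 / np.sqrt(eigvals + eps * eigvals.mean())
    W = eigvecs @ np.diag(scale) @ eigvecs.T   # symmetric ZCA transform
    return X @ W

X = np.random.default_rng(0).normal(size=(1000, 32))
# With eps=0 the whitened covariance is (approximately) the identity.
print(np.allclose(np.cov(zca_whiten(X, eps=0.0), rowvar=False), np.eye(32), atol=1e-2))
```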

2020-08-24 05:16:29 @goodfellow_ian @pfau @negative_result @duck @sschoenholz @ethansdyer @dwf The strong (theoretical guarantee of chance performance) version of the claim is for a fully connected first layer, an input dimensionality that is at least as large as the dataset size, and no regularization. So the theory survives your example!

2020-08-23 20:13:41 RT @ASmallFiction: "How can I feel strong?" she said. "This hate potion would do," the witch said. "Oh. How can I BE strong?" "Don't tak…

2020-08-21 18:46:38 @AIActorCritic The key is that the performance achievable in a linear model by early stopping is better for GD than for a second order method. This is because the second order method has access to less information about the dataset. See Figures 3 and 4 in the paper for experimental validation.

2020-08-20 23:26:39 @pfau @negative_result @duck @sschoenholz @ethansdyer 2) Yes it does! We haven't done this experiment explicitly, but we have shown that performance is exactly at chance if second order structure is removed from CIFAR10 by a linear transformation.

2020-08-20 23:25:06 @pfau @negative_result @duck @sschoenholz @ethansdyer 1) We went with "second moment matrix" because we normalized by the number of vectors (so not Gram), but didn't mean subtract (so not covariance). But we got rid of the normalization to simplify notation ... so, uh, just cognitive momentum? We should definitely say the word Gram.

2020-08-20 23:02:01 @irfnali1 Thank you for the reference! That definitely looks related enough that we should be citing it -- and we will in the next arXiv version.

2020-08-20 18:25:32 @AIActorCritic GD and second order will still generalize differently even with a regularizer. Note that for linear models GD and second order converge to the same minima even without a regularizer (Fig 4a). They only behave differently when you look at finite training times (/early stopping).

2020-08-20 18:17:16 @sidml12 I think there is a tradeoff, and whitening is often a good idea despite the information loss. As you say it leads to much faster training (Fig 4), and the information loss is only large when there is a large input dimension or a small sample count.

2020-08-20 18:08:29 @RogerGrosse Also -- our paper is stronger, and better communicates the potential benefits of second order methods, based on our email exchange. Though we did not change the title, I nonetheless very much appreciate you taking the time to discuss the work.

2020-08-20 18:07:29 @RogerGrosse I believe that our title is accurate and clearly communicates our core result. No bad faith was intended. I understand you have a different perspective on the same mathematical facts. I respect and believe I understand your perspective, though it remains different from mine.

2020-08-20 04:05:19 @roydanroy *where the dynamics of training are taken to be dynamics over {first layer activations, all parameters after the first layer, model predictions}. First layer weights can hold additional information, but it isn't available to the rest of the model and can't inform predictions. 2/2

2020-08-20 04:02:56 @roydanroy Yup, it holds for any loss, and it holds for the dynamics of training*. The network can learn information that is contained in the training labels -- so if you put information about the inputs in the labels (ie as done in an autoencoder), then the model can use it. 1/2

2020-08-20 00:44:03 @RogerGrosse oop. And I didn't see your most recent two messages when replying above. I totally respect your position, and it's completely plausible that I am failing to see my own motivated reasoning when I think that applying this language to those situations would actually be appropriate.

2020-08-20 00:37:00 @RogerGrosse Also -- I think you are exactly right that we both agree about what's happening mechanistically! I may withdraw from the conversation before Twitter works its platform magic on my brain, and I somehow find myself in an escalating argument with someone I respect and agree with.

2020-08-20 00:33:18 @RogerGrosse The predictions will carry no information about the training data w/in the large-eigenvalue subspace, so that information will be unavailable to the model. (if ridge regression were done by GD with early stopping, the information would again be available for generalization) 2/2

2020-08-20 00:30:41 @RogerGrosse I believe that would be a totally reasonable thing to say about ridge regression, assuming you are also comfortable saying it about linear regression without a ridge penalty. 1/2

2020-08-20 00:13:07 @__wasao Excellent! Thank you for sharing.

2020-08-20 00:11:34 @roydanroy I know, right?

2020-08-19 23:49:53 @RogerGrosse In the abstract, and the text body, and my tweeprint we discuss how this can present a practical tradeoff, and regularized second order methods can be extremely effective despite this information loss. 4/4

2020-08-19 23:48:14 @RogerGrosse By the arguments in our paper then, the information in the large eigenvalue subspace is destroyed even for regularized second order optimization. So information which could be used to generalize is still destroyed, it's just that less information is destroyed. 3/n

2020-08-19 23:47:09 @RogerGrosse Regularized second order methods behave like unregularized second order methods in the subspace corresponding to eigenvalues >

2020-08-19 23:45:49 @RogerGrosse Roger and I also had a totally amiable email thread where we talked past each other about this. I believe the title is correct, even if you take second order optimization to mean second order optimization with a regularized Hessian inverse (as is typically done in practice). 1/n

2020-08-19 21:24:02 But! Second order methods typically involve regularization in practice. This can present a tradeoff where less information about a dataset is lost, but training is still accelerated. In some configurations regularized second order optimization can even improve generalization. https://t.co/vDcOXqDw2T

2020-08-19 21:24:01 Whitening and second order optimization both destroy information about the dataset, and can make generalization impossible: https://t.co/CuDeHxF90r We examine what information is usable for training neural networks, and how second order methods destroy exactly that information. https://t.co/j1Sc09YuKT

2020-08-08 02:49:47 @pfau @hoonkp @sschoenholz @Locchiu @ARomanNovak I actually think infinite networks *would* benefit from equivariance, and that this is why infinite width CNNs underperform finite width CNNs. I can't (yet) say anything satisfyingly formal or precise about this though! Also, +1 on the Myrtle-net results being impressive!

2020-08-08 00:18:59 @pfau @hoonkp @sschoenholz @Locchiu @ARomanNovak The answer is yes and no, and longer than a tweet . Short version is that equivariance is a potentially very useful property of finite width networks, but stops playing a role in the infinite width limit (essentially since all locations have all representations). More in paper.

2020-08-08 00:01:45 Finally, we present a simple method for ensembling the predictions of NNGP and NTK models, making it practical to use data augmentation with infinite width networks. (data augmentation is otherwise impractical, due to the cubic dependence of kernel methods on dataset size) https://t.co/bOwrqOs7dZ

2020-08-08 00:01:44 Regularized ZCA whitening of input images improves model accuracy by a surprising amount, especially for infinite width NNGP and NTK predictions. https://t.co/tRXiUHarHR

2020-08-08 00:01:43 The generalization performance of certain finite width networks (especially CNNs without pooling) is non-monotonic in width, in a way not explained by double descent phenomena. (!?!) https://t.co/hT5v7kgdC1

2020-08-08 00:01:42 Large learning rates and L2 regularization both drive differences between finite networks and kernels, and lead finite width networks to perform better. The combined effect of large learning rates and L2 regularization is superlinear. (repeat figure image for this one) https://t.co/PIggLofUbu

2020-08-08 00:01:41 The NNGP (corresponding to infinite width Bayesian networks) typically outperforms the NTK (corresponding to infinite width networks trained by gradient descent). https://t.co/FAvNUH8BeL

2020-08-08 00:01:40 "Finite Versus Infinite Neural Networks: an Empirical Study." https://t.co/dlugEMT5fW This paper contains everything you ever wanted to know about infinite width networks, but didn't have the computational capacity to ask! Like really a lot of content. Let's dive in. https://t.co/aKjgbCrcLU

2020-07-25 02:39:19 @Foret_p Primary answer is that they're the best lever we currently have to *understand* classical deep networks, and understanding will pay huge practical dividends down the road. Secondary answer is that the performance tradeoff is not completely one-sided. Watch this space. :)

2020-07-22 17:22:46 Also -- infinite width attention achieves a new state of the art accuracy for non-trainable kernel methods on CIFAR10. With excellent collaborators @JiriHron5, @yasamanbb, @ARomanNovak. https://t.co/WFEcOWScNl

2020-07-22 17:22:45 Infinite width limits (NNGP and NTK) for neural networks with self-attention https://t.co/mxpQ4jrMBR. This fills in the last common architectural component which did not have an infinite width correspondence! Along the way we improve on the standard softmax attention mechanism. https://t.co/YRMmPmirkc

2020-07-17 20:26:49 @garibaldu Neural Tangent Kernel. Blatant self-cite, but I think https://t.co/zmHhklDdit is a fairly clear presentation of the ideas involved.

2020-07-17 20:23:19 @laurence_ai I think I understand how this captures the properties of the posterior of a single layer. I wonder, could you say something about how you handle layer-layer interactions? eg, around eq 96 how do you capture the dependence of dL/dV_l on later R_l?

2020-07-17 20:21:16 @laurence_ai Thank you for the connection! We will add a citation. This is a very neat alternative way of approaching the question.

2020-07-16 23:04:37 On the other hand, papers often describe the NNGP as stemming from gradient descent of the readout layer only (ie, of being a special case of the NTK). While this is usually* true, it is IMO the most boring perspective on it. (* see https://t.co/I7CuYUKxLI for a counterexample)

2020-07-16 23:04:36 Neural Network Gaussian Processes (NNGPs) correspond to wide Bayesian neural networks! In https://t.co/P9RJeS7RHc we show that the posterior distribution over functions computed by a Bayesian neural network converges to the posterior of the NNGP as layer width grows large. https://t.co/OH9z6y6d5e

2020-05-18 15:36:57 @RussInMtl @KyleCranmer It is probably more costly to generate exact samples from intermediate distributions than to compute det(A), but this could be a useful approximation.

2020-05-18 15:36:20 @RussInMtl @KyleCranmer Thanks for the suggestion! If I understand correctly, TI requires sampling from a sequence of equilibrium distributions (probably interpolating between E=x^T x and E=X^T A x)?

2020-05-17 06:08:54 @unsorsodicorda Yup! To be fair, I expect the variance will typically be exponential in dimensionality. But we're often interested in the log determinant, which will turn exponential variance into linear variance. So, whether this is reasonable will depend very much on the situation!

2020-05-16 18:04:25 @unsorsodicorda For sure, if you have A then det(A) is more expensive than 1/det(A) with this method. Even targeting det(A) though, it's typically faster and more memory efficient to solve A u = s for u than it is to compute A^{-1} or det(A), so the stochastic estimate will still be faster.

2020-05-16 16:14:49 @unsorsodicorda The benefit is that you can get an unbiased estimate of the determinant from a single matrix-vector products, or from a small number of matrix-vector products. This could be useful for stochastic algorithms, where the cost of exactly computing the determinant is large.

2020-05-16 16:13:18 @VishwakFTW I have not analyzed this carefully ... but probably exponentially. This is maybe not as bad as it seems, as we are often interested in log determinants in practice. Yes! I think a control variate would be great! e.g. matrix-vector products for a matrix with known determinant.

2020-05-16 04:16:22 @KyleLiang5 For part A, the variance will depend entirely on how smart your choices for the distributions p and q are.

2020-05-16 04:15:45 @KyleLiang5 No analytic or experimental answer. My suspicion though is that the variance of the estimator in part B will be exponential in # dims n, which is not great. On the other hand, we often are interested in log determinants, for which the variance will be linear in n ... so maybe ok.

2020-05-16 03:30:04 @DaniloJRezende Thank you for the connection! I will update the note to include it.

2020-05-15 23:52:03 @shakir_za Thanks for the encouragement! Yeah, the variance of the estimator when p and q are isotropic Gaussians seems not great (maybe OK if you are interested in a log determinant). If I pursued this farther, it would involve figuring out the design space for p and q as you suggest.

2020-05-15 21:16:20 Also, let me know if you have a recommendation for a journal that this kind of short and simple observation would be appropriate for! I'm currently not sure whether I should try to publish it anywhere.

2020-05-15 21:10:48 Two simple equalities expressing matrix determinants as expectations over matrix-vector products. Entire paper in attached image. :P It's fun to write short notes like this. Hopefully useful in areas like normalizing flows and Gaussian process evaluation. https://t.co/jX5xjPbHQy https://t.co/HFFr0iCcjH
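
As a quick numerical illustration of the kind of identity the note describes (my own sketch; the attached note is the authoritative statement): 1/|det(A)| = E_{x~q}[p(Ax)/q(x)] for any densities p and q, which requires only matrix-vector products with A. As the replies above discuss, the variance depends heavily on the choice of p and q, so this demo keeps A close to the identity and uses standard normals for both.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
A = np.eye(n) + 0.2 * rng.normal(size=(n, n))   # test matrix, kept close to identity

def std_normal_pdf(x):
    # Standard multivariate normal density, evaluated row-wise.
    return (2 * np.pi) ** (-x.shape[-1] / 2) * np.exp(-0.5 * np.sum(x ** 2, axis=-1))

# Here both p and q are taken to be the standard normal density.
x = rng.normal(size=(500_000, n))                                # samples from q
estimate = np.mean(std_normal_pdf(x @ A.T) / std_normal_pdf(x))  # Monte Carlo E_q[p(Ax)/q(x)]

print(estimate)                      # stochastic estimate of 1/|det(A)|
print(1.0 / abs(np.linalg.det(A)))   # exact value, for comparison
```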

2020-04-28 14:47:49 @schimpffabian We initialized a one hidden layer network in a specific 2d parameter subspace, such that training remains in that subspace, and there's a 1:1 mapping between parameters in the subspace and [curvature, weight correlation]. See the caption of Figure S1 (appendix) for details.

2020-04-27 23:23:50 Predicting + demonstrating counterintuitive neural network training behavior: - training at learning rates which diverge under NTK theory - exponential *increase* in loss over first ~20 training *steps* (not epochs) - drastic reduction in Hessian eigenvalues over first ~20 steps https://t.co/1UbC8iYTnI https://t.co/TEQBGoJGzf

2020-04-21 04:54:18 This is very cool work. Read this if you want to really, really understand how a neural network solves a specific problem -- like actual scientific understanding. https://t.co/9wbyGuhFFp

2020-04-04 16:30:44 @geoffreyirving Maybe. I think it's likely that individual lucky events come from a heavy tailed distribution, and that they interact closer to multiplicatively than additively. If the significant lucky events are sparse though, maybe it's easier to factor then out?

2020-03-29 00:11:30 @GallowayAngus Indeed! That's a very nice paper ... and it's influence on our title alone is difficult to deny!

2020-03-28 23:29:50 @KyleCranmer In fact we do not importance sample. Our novelty is rather in showing how to interpret the distribution resulting from importance sampling as a tractable energy function in the GAN latent space. (2/2)

2020-03-28 23:29:27 @KyleCranmer Thanks for the connection! I have just added a reference to that paper, and it will be included whenever we update the arXiv. Note that our contribution is not to use the GAN discriminator for importance sampling, which has been done before (https://t.co/Mrb4SkcCIR). (1/2)

2020-03-28 00:23:34 "Your GAN is Secretly an Energy-based Model and You Should use Discriminator Driven Latent Sampling" This technique can dramatically improve existing trained GANs, by re-interpreting them as an easy-to-sample-from energy based model in the latent space. https://t.co/f9dlYyJJPB

2020-03-25 22:17:03 Check out our review of recent efforts to apply techniques from statistical physics to better understand deep learning, targeted at a physics audience. It tries to be approachable for non-experts in machine learning. https://t.co/cPsByuy02S

2020-03-13 21:08:43 @SussilloDavid Stay tuned! The theory is there for weight sharing (ht @TheGregYang ), but the library doesn't support it yet.

2020-03-13 18:44:10 Build an infinite width neural network with the same code you use to define your finite width neural network. https://t.co/qCSjrSqvIe

2020-03-13 04:19:56 This is a meta-learned list of optimization hyperparameters. Try these hyperparameters in this order for fun, profit, and better performing models with less compute!! A sequence of magic numbers beyond Karpathy's constant! JAX, PyTorch, &

2020-01-25 01:02:07 @dustinvtran @sschoenholz @hoonkp Also -- I should emphasize that the performance gap between finite and infinite networks is very architecture dependent. For fully connected networks especially, the infinite width limit seems to match or exceed the finite width performance.

2020-01-25 01:00:04 @dustinvtran @sschoenholz @hoonkp We have tried, and are trying, ablations similar to these, but definitely lots more to do!! (See Figure 3 in this paper for an mse experiment (w/o direct comparison). Even more see https://t.co/7a4lAfOZS7, https://t.co/eC9NykF4n6 for NNGPs, and https://t.co/zmHhklDdit for NTKs.)

2020-01-24 23:56:31 This makes NTK training dynamics dissimilar from those of standard finite width networks. (Infinite width Bayesian networks, NNGPs, don't suffer from this problem.) In https://t.co/KPfHLJOrOk we derive infinite width kernels for the *standard* parameterization, resolving this.

2020-01-24 23:56:30 Research on the Neural Tangent Kernel (NTK) almost exclusively uses a non-standard neural network parameterization, where activations are divided by sqrt(width), and weights are initialized to have variance 1 rather than variance 1/width.

2019-12-13 04:02:18 RT @shoyer: I'll be presenting this work with @samgreydanus and @jaschasd tomorrow morning (Friday Dec 13) at 11:30am at the NeurIPS Deep I…

2019-12-12 17:34:07 RT @hoonkp: @Locchiu @sschoenholz @yasamanbb @jaschasd https://t.co/VLiVowB644

2019-12-12 16:49:12 RT @hoonkp: Visit our poster today(12/12 Thu) at #NeurIPS2019 10:45am! #175 "Wide Neural Networks of Any Depth Evolve as Linear Models Unde…

2019-12-06 23:22:48 Paper: https://t.co/617vP1bttE Github: https://t.co/fZxNUwBRer Colab Notebook: https://t.co/UwXvlLRpwZ

2019-12-06 23:18:17 Infinite width networks (NNGPs and NTKs) are the most promising lead for theoretical understanding in deep learning. But, running experiments with them currently resembles the dark age of ML research before ubiquitous automatic differentiation. Neural Tangents fixes that. https://t.co/a3unONiXkV

2019-10-29 17:03:49 RT @TheGregYang: 1/ I can't teach you how to dougie but I can teach you how to compute the Gaussian Process corresponding to infinite-width…

2019-10-24 01:01:30 Even better -- the code runs in your browser in colab! https://t.co/ddL8tQLJ7u

2019-09-28 02:09:34 Neural reparameterization improves structural optimization! By parameterizing physical design in terms of the (constrained) output of a neural network, we propose stronger and more elegant bridges, skyscrapers, and cantilevers. https://t.co/M8ol844JyE With shoyer@ samgreydanus@ https://t.co/PZzJjgoCep

2019-05-23 19:48:53 RT @poolio: current mood #NeurIPS2019 https://t.co/iK6ccdeF1c

2019-05-12 23:13:20 A careful empirical study of the effect of network width on generalization and fixed learning rate SGD, for MLPs, convnets, resnets, and batch norm. With superstar resident Daniel Park, and @quocleix + Sam Smith. https://t.co/X0J2wraxNy https://t.co/41VYZG9MRR

2019-05-09 20:46:03 RT @KerenGu: Exciting work in the evolution approach and meta learning! — @Luke_Metz blowing us away with neuron local meta learned update…

2019-05-09 20:45:34 RT @georgejtucker: @Luke_Metz killing it at #ICLR19. https://t.co/XztE7QPfPO

2019-05-05 16:59:43 RT @TheGregYang: 1/ Does batchnorm make optimization landscape more smooth? https://t.co/5J92tRz8ag says yes, but our new @iclr2019 paper h…

2019-03-22 17:59:00 : @laurent_dinh is the most fun to work with. He always has extremely novel ideas ... and makes the most mesmerizing animations. https://t.co/RylpUn9S0V

2019-03-20 17:56:54 Including a massive, well curated, dataset mapping hyperparameter configuration to model performance. This may be a useful resource in your own research. https://t.co/Z3oHSHlp37

2019-03-08 03:46:13 Batch norm causes chaos and gradient explosion in the output of deep networks: figure below shows two nearly identical minibatches going through a random *linear* network with batch norm, and becoming completely dissimilar by depth 30! Much, much more at: https://t.co/6Kwgw6pWXp https://t.co/n41OBSI693

2019-02-21 04:10:58 @TheGradient @RogerGrosse Yup! :) Though with the caveats mentioned by both @danieldazac and @Braden_Brinkman -- it only becomes exactly true at infinite width, and there is the zeroth order constant term as well as the linear term.

2019-02-21 04:06:32 @RogerGrosse If the loss function on top of the neural network output was quadratic, then yes -- one Hessian-free step run to convergence, or one Newton step, will achieve zero training error in the limit of infinite network width.

2019-02-19 18:18:22 A wildly successful collaboration with several of the usual suspects, @hoonkp @Locchiu @sschoenholz @yasamanbb and Jeffrey Pennington.

2019-02-19 18:12:19 Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent https://t.co/zmHhklDdit <

2019-02-19 18:12:18 RT @sschoenholz: 1/5) Maybe our neural networks aren't so different from linear models after all! Here's a plot showing a real network trai…

2019-02-14 22:15:26 RT @TheGregYang: 1/8 Modern deep networks (with conv, (self-)attention, batchnorm, LSTM, etc) become Gaussian Processes when randomly initi…

2019-02-13 10:06:09 RT @TheGregYang: @MSFTResearch won the 2018 Text Adventure AI Competition! Read about our winning agent at https://t.co/dCjfu5l941 . It ach…

2019-01-25 02:31:23 Based upon some recent exchanges, I should emphasize this note is to help understand some seemingly too-good-to-be-true theory results. It is NOT a practical proposal to improve training. See https://t.co/ueX4REpWE4 for an illustration of why it won't help training.

Discover the AI Experts

Nando de Freitas Researcher at DeepMind
Nige Willson Speaker
Ria Pratyusha Kalluri Researcher, MIT
Ifeoma Ozoma Director, Earthseed
Will Knight Journalist, Wired