Sergey Levine

AI Expert Profile

Nationality: 
American
AI specialty: 
Deep Learning
Robotics
Reinforcement Learning
Current occupation: 
Researcher, Professor at UC Berkeley
AI rate (%): 
75.11%

TwitterID: 
@svlevine
Tweet Visibility Status: 
Public

Description: 
Sergey is a professor in the Department of Electrical Engineering and Computer Sciences at UC Berkeley. His research focuses on the intersection of control and machine learning, with the goal of developing algorithms and techniques that give machines the ability to autonomously acquire the skills needed to perform complex tasks. He is interested in how learning can be used to acquire complex behavioral skills, so as to endow machines with greater autonomy and intelligence. In particular, he believes that perception and control are key concepts for building robots that behave like humans.

Recognized by:

Not available

The Expert's latest posts:

Tweet list: 

2023-05-22 20:53:47 @natolambert Absolutely. Providing feedback to our models is far too important a job to be left to humans, we should get feedback from something smarter

2023-05-22 20:52:28 @generatorman_ai I think that could make sense. Would need to pick some form (eg MSE) of recon loss, but it does make sense as an objective, essentially rewarding for distance to gt image on each prompt. Not certain what exactly that would do… maybe @michaeljanner has some thoughts on that

2023-05-22 17:59:59 Of course this is not without limitations. We asked the model to optimize for rewards that correctly indicate the *number* of animals in the scene, but instead it just learned to write the number on the image :( clever thing... https://t.co/xxjiq34npT

2023-05-22 17:59:58 We optimized for "animals doing activities" and it does some cool stuff. Interestingly, a lot of the pictures start looking like children's book illustrations -- we think this is because "bear washing dishes" is less likely to be a photograph, more likely in a children's book https://t.co/SjTxCZSQLm

2023-05-22 17:59:57 Quantitatively, this leads to very effective optimization of a wide variety of reward functions, both hand-designed rewards and rewards derived automatically from vision-language models (VLMs). https://t.co/TENhaeESQn

2023-05-22 17:59:56 We figured out how to train diffusion models with RL to generate images aligned with user goals! Our RL method gets ants to play chess and dolphins to ride bikes. Reward from powerful vision-language models (i.e., RL from AI feedback): https://t.co/5Mui7Wb8pB A https://t.co/j24K2IQRhh

2023-05-10 01:26:41 RT @mitsuhiko_nm: We've released our code for Cal-QL, along with public @weights_biases logs to make the replication easier! Please try it…

2023-05-04 18:26:32 RT @mitsuhiko_nm: I will be presenting our work Cal-QL, as a Spotlight at Reincarnating RL Workshop at #ICLR2023 in a few hours! (10:50 am…

2023-05-04 02:36:03 Cal-QL is a simple modification to CQL that "calibrates" the value function during offline training making it finetune efficiently online. You can check out the paper here: https://t.co/Mz8fmhqgQm You can also watch a previous talk on YouTube: https://t.co/VluE64Rk64

2023-05-04 02:36:02 If you want to learn about how offline pretraining with CQL can be combined with efficient online finetuning (Cal-QL), check out @mitsuhiko_nm's talk tomorrow at #ICLR2023 (workshops), 1:50 am PT = 10:50 am in Kigali: https://t.co/WRMny8o3Kb Also at RRL workshop. More info https://t.co/9A8fpu4o9J

2023-05-03 17:04:47 I'll be speaking at the Reincarnating RL WS tmrw (4:30 pm CAT = 6:30 am PDT) along w @JosephLim_AI @furongh @annadgoldie @marcfbellemare @DrJimFan @jeffclune Follow along in person &

2023-05-03 03:46:17 ALM (Aligned Latent Models) is an MBRL alg that optimizes a latent space, model, and policy w.r.t. the same variational objective! By @GhugareRaj, @mangahomanga, @ben_eysenbach, @svlevine, @rsalakhu 16:30/7:30 am MH1-2-3-4 #124 https://t.co/k9h1REVlea https://t.co/O5bxe0CCTg https://t.co/Iols6UcmU0

2023-05-03 03:46:16 Neural constraint satisfaction is a method that uses object-based models and enables planning multi-step behaviors By @mmmbchang, @a_li_d, @_kainoa_, @cocosci_lab, @svlevine, @yayitsamyzhang 16:30 local/7:30am PDT at MH1-2-3-4 #78 https://t.co/qZtQREJA4A https://t.co/UwLK35Bmcm https://t.co/wevEIJ1XQv

2023-05-03 03:46:15 This will be presented virtually at 16:30 local time (7:30 am PDT): https://t.co/s2dgxSPLZQ @setlur_amrith, @DonDennis00, @ben_eysenbach, @AdtRaghunathan, @chelseabfinn, @gingsmith, @svlevine paper: https://t.co/CIDxrji6mC

2023-05-03 03:46:14 Next, bitrate DRO trains classifiers robust to distr. shift w/ a simple idea: real-world shifts affect underlying factors of variation (lighting, background, etc), and leveraging this structure by constraining the shifts we are robust to leads to very effective methods. https://t.co/aUyVx7rEXv

2023-05-03 03:46:13 Value-based offline RL has the potential to drastically increase the capabilities of LLMs by allowing them to reason over outcomes in multi-turn dialogues, and I think ILQL is an exciting step in this direction. More commentary in my blog post here: https://t.co/WLn8ugzwgr

2023-05-03 03:46:12 All are in 16:30 session (7:30 am PDT). ILQL is an offline RL method to train LLMs for multi-turn tasks. RL lets us train LMs to do things like ask questions in dialogue: https://t.co/gsgRwEF7oO https://t.co/xrPkESFPYj @sea_snell, @ikostrikov, @mengjiao_yang, @YiSu37328759 https://t.co/oPwlBqhgV2

2023-05-03 03:46:11 Tmrw at #ICLR2023 in Kigali, we'll be presenting: ILQL: a method for training LLMs w/ offline RL! Bitrate-constrained DRO: an information-theoretic DRO method Neural constraint satisfaction for planning with objects Aligned latent models for MBRL w/ latent states Thread https://t.co/IkV4LTej2G

2023-05-01 17:18:18 The idea is to adapt conservative Q-learning (CQL) to learn a Q-function for all confidence levels, representing a bound: Q(s, a, delta) means that the Q-value is bounded by the predicted value with probability 1 - delta. https://t.co/CHD8qrlCpR

2023-05-01 17:18:17 Pessimism is a powerful tool in offline RL: if you don't know what an action does, assume it's bad. But how pessimistic should an agent be? We'll present "Confidence-Conditioned Value Functions" at #ICLR2023 tmrw, arguing that we can learn all pessimism levels at once https://t.co/Tm5ne7SIaz
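
To make the confidence-conditioned idea above concrete, here is a minimal sketch of a Q-network that takes the confidence level delta as an extra input and samples delta randomly during training, as the thread describes. The loss below is a plain regression placeholder, not the paper's delta-dependent conservative objective; all names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConfidenceConditionedQ(nn.Module):
    """Q(s, a, delta): a bound on the value that holds with probability 1 - delta."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act, delta):
        # delta has shape (batch, 1); conditioning on it lets one network
        # represent every pessimism level at once.
        return self.net(torch.cat([obs, act, delta], dim=-1)).squeeze(-1)

def training_step(q_net, batch, td_target):
    # Sample a different confidence level for every transition in the batch,
    # so all values of delta are trained simultaneously.
    delta = torch.rand(batch["obs"].shape[0], 1)
    q = q_net(batch["obs"], batch["act"], delta)
    # Placeholder TD regression; the actual method derives a delta-dependent
    # CQL-style update instead.
    return ((q - td_target) ** 2).mean()
```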

2023-05-01 17:14:56 Interested in how we can train fast with RL, taking as many gradient steps as possible per sim step? This work will be presented at #ICLR2023 in Kigali, Tue May 2, 11:30 am local time (2:30 am PDT), MH1-2-3-4 #96 https://t.co/yChkpwjjHf

2023-05-01 06:39:45 RT @agarwl_: I'll be presenting our work today at @iclr_conf on scaling offline Q-learning to train a generalist agent for Atari games. Sto…

2023-05-01 00:41:03 Talk: 10 am local/1 am PDT Mon May 1 Poster: 11:30 am local/2:30 am PDT at MH1-2-3-4 #105 For more, you can find the paper and more details on the project website: https://t.co/f0UPDa635I

2023-05-01 00:41:02 We'll be presenting our work on large-scale offline RL that pretrains on 40 Atari games at #ICLR2023. Come learn about how offline RL can pretrain general RL models! (well, general-Atari RL...) w/ @aviral_kumar2, @agrwl, @georgejtucker Long talk @ 10:00 am Mon (1 am PDT) info https://t.co/FXIPeFtDJN

2023-04-30 21:02:30 @haarnoja Congratulations Tuomas (&

2023-04-30 18:35:02 It's often a mystery why we can't just take way more gradient steps in off-policy model-free RL and get it to learn faster. There are a variety of (very reasonable) explanations, but it turns out that overfitting is a pretty good explanation, and suggests some simple fixes. https://t.co/yChkpwjjHf

2023-04-24 17:45:18 IDQL is also quite fast computationally. While it's not as blazing-fast as IQL, the decoupling does mean that it is much faster than other diffusion-based RL methods, since the diffusion model is trained completely separately from the critic with supervised learning.

2023-04-24 17:45:17 This is cool, but it also reveals a problem: the implicit policy is really complicated! Yet regular IQL uses a standard unimodal Gaussian policy to approximate it, which is going to be really bad. So we propose to use a much more powerful policy that can accurately capture pi_imp

2023-04-24 17:45:16 First: what is IQL? IQL <

2023-04-24 17:45:15 We released a new version of implicit Q-learning (IQL) with diffusion-based policies to get even better results with less hyperparameter tuning. For paper&
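
As a rough illustration of the decoupling described in this thread (critic trained separately from a diffusion behavior policy), one plausible way to combine the two at evaluation time is to sample candidate actions from the generative policy and re-score them with the critic. This is a sketch under that assumption, not the released IDQL code:

```python
import numpy as np

def select_action(obs, sample_behavior_actions, q_value, num_candidates=32, temperature=0.1):
    """sample_behavior_actions(obs, n): draws n candidate actions from the
    separately trained (e.g. diffusion) behavior policy.
    q_value(obs, a): critic trained independently, e.g. with IQL-style losses."""
    candidates = sample_behavior_actions(obs, num_candidates)
    scores = np.array([q_value(obs, a) for a in candidates])
    # Softmax re-weighting of candidates by the critic; temperature -> 0
    # approaches picking the argmax action.
    probs = np.exp((scores - scores.max()) / temperature)
    probs /= probs.sum()
    return candidates[np.random.choice(num_candidates, p=probs)]
```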

2023-04-23 23:14:31 "Sim" to real transfer from artificial pickles to real pickles. Awesome to see that Bridge Data helps with generalization to real foodstuffs :) https://t.co/gIHGKxxuye

2023-04-23 00:48:25 A talk that I prepared recently that describes a few recent works on RL finetuning with RL pretraining: https://t.co/WKUcORzhfp Offline RL can pretrain models that work great for online RL. We can develop algorithms for this (e.g., Cal-QL) and models (e.g., ICVF).

2023-04-21 21:29:30 Robotics research https://t.co/UEN4W7YUWw

2023-04-21 21:28:48 RT @kevin_zakka: I had the same reaction seeing this the first time!

2023-04-21 21:27:51 RT @smithlaura1028: I'm very excited to share our super simple system for teaching robots complex tasks using existing controllers. Build…

2023-04-21 16:30:02 For full results, see the video here: https://t.co/29uhFS9YTl Website: https://t.co/9K7JWae5xF Arxiv: https://t.co/rmNiowu12S Led by @smithlaura1028, Chase Kew, w/ Tianyu Li, Linda Luu, @xbpeng4, @sehoonha, Jie Tan

2023-04-21 16:30:01 This turns out to work really well as a way to do curriculum learning. E.g., we first train basic jumping with motion imitation, then finetune to maximize jumping performance with buffer initialization from imitation. Similar idea for hind legs walking https://t.co/Ph1KXvKMLI

2023-04-21 16:30:00 The idea in TWiRL is to leverage RLPD (https://t.co/ZdRDLKNhZy) as a transfer learning algorithm, initializing the replay buffer from another task, environment, etc., and then using the super-fast RLPD method to quickly learn a new task leveraging the prior environment/task data. https://t.co/LL7PqY5emi

2023-04-21 16:29:59 We've taught our robot dog new tricks! Our new transfer learning method, TWiRL, makes it possible to train highly agile tasks like jumping and walking on hind legs, and facilitates transfer across tasks and environments: https://t.co/9K7JWae5xF https://t.co/yZYb6H4ovc
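
The transfer mechanism described in this thread, seeding the replay buffer with data from a prior task or environment before running fast online RL, can be sketched as follows (a minimal illustration; the actual system builds on RLPD):

```python
import random

class ReplayBuffer:
    def __init__(self, capacity=1_000_000):
        self.data, self.capacity = [], capacity

    def add(self, transition):
        if len(self.data) >= self.capacity:
            self.data.pop(0)
        self.data.append(transition)

    def sample(self, batch_size):
        return random.sample(self.data, batch_size)

def initialize_from_prior_task(buffer, prior_transitions):
    # Transfer step: pre-fill the buffer with experience from another
    # task/environment, then hand it to the online learner.
    for t in prior_transitions:
        buffer.add(t)
```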

2023-04-21 03:55:03 @jmac_ai We used RLPD for the online phase (which is similar to SAC) and IQL for pretraining. Tradeoffs b/w IQL and CQL are complex. IQL is a great very simple way to learn values/distances, but tricky with policy extraction, CQL is more SAC-like. No clear winner, depends on how it's used

2023-04-20 16:43:01 ...and then finetune on top of that general-purpose initialization with very efficient online RL that can learn at real-time speeds by leveraging an effective initialization. This is likely to make RL way more practical, so that robots can learn in 10-20 minutes

2023-04-20 16:43:00 This was a really fun collaboration with @KyleStachowicz, Arjun Bhorkar, @shahdhruv_ , @ikostrikov The methods this is based on are: IQL: https://t.co/pksjjvwaaf RLPD: https://t.co/ZdRDLKNhZy

2023-04-20 16:37:51 We're also releasing code along with a high quality simulation with nice terrain in MuJoCo, so that you can play with FastRLAP yourself! https://t.co/Z07kcJ96Wl arxiv: https://t.co/jmbQ0LUZX9 video: https://t.co/IK8GN5eG0b web: https://t.co/DXwMZ5oB0m https://t.co/L9nKz3ZGdG

2023-04-20 16:37:50 Here is an example lap time progression for an outdoor course. In 10-20 min it matches the demo, by 35 it's approaching expert (human) level. This is one of the harder tracks we tried, others are also easier (lots more examples on the website above) https://t.co/TzQCnP4bpo

2023-04-20 16:37:49 We then use this backbone to kick off online RL training with RLPD, initialized with one or a few slow demos in the target race course. The offline RL trained encoder is frozen, and RLPD then learns to drive fast in just 10-20 minutes. https://t.co/Mvxg9dmyMC

2023-04-20 16:37:48 FastRLAP is based on offline RL pretraining followed by super fast online RL. We first use an IQL-based offline RL method to pretrain a general-purpose navigation backbone using a large dataset from *other* robots driving around. This gives a general "navigational common sense" https://t.co/9lqL1kuzaJ

2023-04-20 16:37:47 Can we use end-to-end RL to learn to race from images in just 10-20 min? FastRLAP builds on RLPD and offline RL pretraining to learn to race both indoors and outdoors in under an hour, matching a human FPV driver (i.e., the first author...): https://t.co/DXwMZ5oB0m Thread: https://t.co/zUf8Moyvlq

2023-04-18 18:39:43 Paper: https://t.co/3RlP2YFD0M Site &

2023-04-18 18:38:31 ICVFs can learn across morphologies (train on one morphology, finetune to another), and can even pretrain for fast learning of Atari games using YouTube videos of Atari playing (in some cases, video appears to be someone pointing a phone at their screen...) https://t.co/NLkk2JVLlV

2023-04-18 18:37:21 ICVFs (intention-conditioned value functions) learn how effectively a particular intention enables reaching a particular outcome. Roughly this can be thought of as a generalization of goal-conditioned RL, learning not just goals but all tasks in the learned feature space.

2023-04-18 18:35:30 Code, paper, website here: https://t.co/Gax55dfRSE The idea in ICVFs is to do self-supervised RL where we learn a multilinear representation, predicting which state representation will be reached when we attempt a task, for every possible task and every possible outcome. https://t.co/McsUPzgP4H

2023-04-18 18:35:29 If we want to pretrain RL from videos, we might use representation learning methods like VAEs, MAEs, CPC, etc. But can RL *itself* be an effective rep. learning method? Self-supervised RL turns out to be great at this. That's the idea behind ICVFs, by @its_dibya @ChetBhateja https://t.co/C88rANkCp8
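
The multilinear representation mentioned in this thread can be sketched as a value function of the form V(s, outcome, intention) = phi(s)^T T(intention) psi(outcome); the factorization below is illustrative, and the exact architecture and losses in the paper may differ.

```python
import torch
import torch.nn as nn

class MultilinearICVF(nn.Module):
    def __init__(self, obs_dim, feat_dim=64):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, feat_dim))
        self.psi = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, feat_dim))
        self.T = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, feat_dim * feat_dim))

    def forward(self, state, outcome, intention):
        phi = self.phi(state)        # (B, d) current-state features
        psi = self.psi(outcome)      # (B, d) outcome features
        d = phi.shape[-1]
        T = self.T(intention).view(-1, d, d)  # (B, d, d) intention-dependent weighting
        # Scores how well pursuing this intention from this state leads to this outcome.
        return torch.einsum("bi,bij,bj->b", phi, T, psi)
```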

2023-04-18 14:49:45 RT @mitsuhiko_nm: Previously, PTR has demonstrated that combining limited target demos with a large diverse dataset can achieve effective o…

2023-04-18 03:36:55 @engradil123 Yes, we are considering international applications.

2023-04-17 20:53:39 We previously were fortunate to have Kristian Hartikainen <

2023-04-17 20:53:38 Our lab at Berkeley (RAIL) is hiring a research engineer! If you have a BS or MS and are interested, please see the official UC job posting here: https://t.co/UVovWzvk86 If you want to know more, the job posting has the official details, a few remarks:

2023-04-17 17:19:11 Paper website with videos, code, and arxiv link are here: https://t.co/vUFWTc42vH PTR pretrains on the Bridge Dataset: https://t.co/g4hcGEglep Using a variant of CQL, it can acquire representations that enable learning downstream tasks with just 10-20 trials...

2023-04-17 17:19:10 PTR (Pretraining for Robots) now supports online RL finetuning as well as offline RL finetuning! The concept behind PTR is to pretrain with offline RL on a wide range of tasks (from the Bridge Dataset), and then finetune. New results (led by @mitsuhiko_nm) below! Links &

2023-04-17 06:54:29 RT @_akhaliq: Reinforcement Learning from Passive Data via Latent Intentions abs: https://t.co/6yVdYQeyq2 project page: https://t.co/XCwd

2023-04-13 21:19:43 RT @GoogleAI: Today we discuss a large-scale experiment where we deployed a fleet of #ReinforcementLearning-enabled robots in office buildi…

2023-04-13 20:58:27 RT @hausman_k: And here is a blog post talking about RLS sorting trash in the real world by @svlevine and @AlexHerzog001: https://t.co/wgq

2023-04-13 01:19:08 RT @_akhaliq: This is wild Deep RL at Scale: Sorting Waste in Office Buildings with a Fleet of Mobile Manipulators a system for deep rein…

2023-04-13 00:30:21 RT @xiao_ted: Reinforcement learning (!) at scale in the real world (!!) for useful robotics tasks (!!!) in multiple "in-the-wild" offices…

2023-04-12 17:58:25 @mattbeane @JoeRobotJones When the policies generalize effectively across different hardware platforms and tasks.

2023-04-12 17:35:15 RT @julianibarz: This was a 3+ year effort, RL in the real world is hard for sure. Any research that helps stabilize training, make hyper…

2023-04-12 16:57:20 This would not have been possible without an awesome team, and it has been a long journey. But the journey is not over. With our recent efforts integrating language into robotic policies (RT-1, SayCan, etc.), there are big things ahead as we get robots out into our offices. https://t.co/28SxL5dM4Z

2023-04-12 16:57:19 There is of course a lot more to it, so you can check out the paper and more animations and info on the project site: https://t.co/qZkeyLg8xO

2023-04-12 16:57:18 But of course real-world practice is the "real deal" where robots will drive up to waste stations with real trash deposited by real people, try to sort it, and collect more experience to improve on the job https://t.co/R40cfWRtn2

2023-04-12 16:57:17 We settled on sorting recyclables because it builds on things we already knew how to approach, but extends it with new open-world challenges: unexpected objects, novel situations, lots of chances to improve through trial and error (like these hard scenes below!) https://t.co/YqQtDOrycN

2023-04-12 16:57:16 When we developed QT-Opt, we could get even better performance at grasping tasks, and later for multi-task robotic learning, with end-to-end deep RL. https://t.co/sfeFyUQkdi https://t.co/rlLTIOyjYn https://t.co/nYLhIVWwpB

2023-04-12 16:57:15 Our research on large-scale robotic learning goes back to around 2015, when we started with the first "arm farm" system to study collective robot learning from experience, initially for robotic grasping. https://t.co/tQ58x1McRy https://t.co/IplSGfg6Cr

2023-04-12 16:57:14 Can deep RL allow robots to improve continually in the real world, while doing an actual job? Today we're releasing our paper on our large-scale deployment of QT-Opt for sorting recyclables: https://t.co/qZkeyLg8xO This has been a long journey, but here is a short summary https://t.co/glhB5OmnMK

2023-04-11 18:54:02 We have big plans to further expand Bridge Data in the future. @HomerWalke and many collaborators and data collectors have been doing a lot to expand this dataset, and we'll be adding language, more autonomous data, and many more tasks in the future.

2023-04-11 18:54:01 We've updated Bridge Data with 33k robot demos, 8.8k autonomous rollouts, 21 environments, and a huge number of tasks! Check out the new Bridge Data website: https://t.co/g4hcGEglep The largest and most diverse public dataset of robot demos is getting bigger and bigger! https://t.co/gkI7mPngQK

2023-04-11 17:34:59 RT @berkeley_ai: Tomorrow at 12 PM PT, the Berkeley AI lecture series continues. @svlevine will present "Reinforcement Learning with Large…

2023-04-10 18:04:21 @jasondeanlee It's just taking it to the next level https://t.co/UVqKcQYl8c

2023-04-04 16:56:59 The key idea behind Koala is to scrape high-quality data from other LLMs (yeah...). As we discuss in the post, this has some interesting implications for how powerful LLMs can be trained on a budget (in terms of weights and compute).

2023-04-04 16:56:58 We've released the Koala into the wild! The Koala is a chatbot finetuned from LLaMA that is specifically optimized for high-quality chat capabilities, using some tricky data sourcing. Our blog post: https://t.co/7gWNiPOQ0T Web demo: https://t.co/TMWpixTZCK

2023-04-03 17:59:07 The reviewer's dilemma: it's like prisoner's dilemma, where if you accept the invitation to review for @NeurIPSConf, everyone will have fewer papers to review, but if you defect, then everyone's papers get reviewed by Reviewer 2. (so you should all accept your reviewer invites!)

2023-04-02 06:04:02 @archit_sharma97 It's really eye-popping to me that an accomplished professor would engage in such short-sighted behavior -- while the research vision is quite clear-eyed, the optics of this kind of thing are murky at best. Btw, where can I get a toad like that?

2023-03-29 22:27:55 RT @ancadianadragan: Offline RL figures out to block you from reaching the tomatoes so you change to onions if that's better, or put a plat…

2023-03-29 17:32:55 Also, to acknowledge the most relevant work that inspired this: Awesome work from Annie Xie, @chelseabfinn &

2023-03-29 17:28:50 Though at the same time, influence itself is not necessarily bad: e.g., an educational agent might influence students to pay more attention or better retain the material, etc. So we should be thoughtful in approaching these problems!

2023-03-29 17:28:49 ...so we should also be thinking carefully about how to *detect* that someone is interacting with an RL agent that is trying to influence or trick them, and also develop rewards and objectives that prevent deceptive and dishonest behavior.

2023-03-29 17:28:48 Large models (like LLMs) could pick up on subtle patterns in human behavior, and offline RL might enable using these patterns for subtle influence. I also discuss this more here: https://t.co/WLn8ugzwgr Of course, this is ripe for abuse...

2023-03-29 17:28:47 This was a fun collaboration with Joey Hong &

2023-03-29 17:28:46 And here is the RL agent. Notice how it puts the plate on the counter and then refuses to help until the human picks up the plate. Once the human is "forced" in this way, they do the right strategy, and the two players can play together more optimally! https://t.co/ZbHuYuDmw3

2023-03-29 17:28:45 Here is a more subtle example. Here, the optimal strategy is for RL agent (green) to pass plate to the human (blue) so the human can plate the soup and deliver it. Naive BC, shown below, doesn't get this and executes a very suboptimal strategy. https://t.co/9N7wqXEvbC

2023-03-29 17:28:44 And here is the offline RL agent. The agent is the green hat, the blue hat is a human (this is a real user study). Notice how green hat blocks blue hat from picking up on the onions -- after blocking them a few times, the human "gets the idea" and makes tomato soup only. https://t.co/j0cQ72kKWs

2023-03-29 17:28:43 Aside from this, it's basically offline RL (with CQL): analyze the data, and figure out how to do better by influencing humans. Here is one example: we change reward to favor tomato soup (humans don't know this), and the agent influences the human to avoid onions! BC baseline: https://t.co/Tm7PH7at3z

2023-03-29 17:28:42 The algorithm has one change from offline RL: add a "state estimator" to infer the "state" of the human's mind, by predicting their future actions and using the latent state as additional state variables. This allows the agent to reason about how it changed a person's mind. https://t.co/F81sfJEk4P

2023-03-29 17:28:41 The idea: get data of humans playing a game (an "overcooked" clone made by @ancadianadragan's group), plug this into offline RL with a few changes (below). Humans might not play well, but if they influence each other *accidentally* RL can figure out how to do it *intentionally*. https://t.co/spMuZvHmBU

2023-03-29 17:28:40 Offline RL can analyze data of human interaction &
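
A rough sketch of the "state estimator" augmentation described in this thread: a sequence model trained to predict the human's next actions, whose latent state is appended to the agent's observation before offline RL. The architecture details here are assumptions.

```python
import torch
import torch.nn as nn

class HumanStateEstimator(nn.Module):
    def __init__(self, obs_dim, human_act_dim, latent_dim=64):
        super().__init__()
        self.encoder = nn.GRU(obs_dim + human_act_dim, latent_dim, batch_first=True)
        self.predictor = nn.Linear(latent_dim, human_act_dim)  # predicts the human's next action

    def forward(self, history):                # history: (B, T, obs_dim + human_act_dim)
        latents, _ = self.encoder(history)     # (B, T, latent_dim)
        return latents, self.predictor(latents)

def augment_observation(obs, latent):
    # The offline RL agent conditions on [obs, latent] instead of obs alone,
    # which lets it reason about how its behavior changes the human's "mental state".
    # latent: e.g. the estimator's latent at the current timestep, shape (B, latent_dim).
    return torch.cat([obs, latent], dim=-1)
```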

2023-03-29 16:33:08 @RCalandra @tudresden_de @TactileInternet Congratulations Roberto!! Good luck with your new lab, looking forward to seeing what kinds of cool new research your group produces!

2023-03-27 21:25:32 RT @tonyzzhao: Introducing ALOHA: A Low-cost Open-source Hardware System for Bimanual Teleoperation. After 8 months iterating @stanford a…

2023-03-27 21:02:21 To read about ALOHA (@tonyzzhao's system), see: Website &

2023-03-27 21:01:02 This is also why we opted to use low-cost widowx250 robots for much of our recent work on robotic learning with large datasets (e.g., Bridge Data https://t.co/fGG35iD0IA, PTR https://t.co/vUFWTc42vH, ARIEL https://t.co/Za9qhEwnIw)

2023-03-27 20:59:44 ...much of the innovation in the setup is to simplify aggressively, using cheap hardware, no IK (direct joint control), etc. With the right learning method, the hardware really can be simpler, with more focus on cost and robustness vs. extreme precision.

2023-03-27 20:58:54 It's also worth pointing out that, separately from the awesome robot results, the action chunking scheme in @tonyzzhao's paper is actually quite clever, and seems to work very well. But perhaps most interesting is that the robots don't have to be very fancy to make this work...

2023-03-27 20:52:44 Fine-grained bimanual manipulation with low-cost robots &

2023-03-27 00:22:09 @MStebelev This paper studies somewhat related questions https://t.co/Og9UpOpleA That said in general offline RL does suffer from the same covariant shift problems in the worst case, so the best we can do is characterize non worst case settings.

2023-03-26 00:44:44 RT @ben_eysenbach: I really enjoyed this conversation about some of the work we've done on RL algorithms, as well as the many open problems…

2023-03-25 23:25:11 I recently gave a talk on RL from real-world data at GTC, here is a recording: https://t.co/Ox7u4XxEL6 Covers our recent work on offline RL, pre-training large scalable models for robotic RL, offline RL for goal-directed large language models, and RL-based human influence.

2023-03-23 03:45:55 Throwing out unnecessary bits of an image to improve generalization, robustness, and representation learning in RL. When you focus on something, you get "tunnel vision" and the irrelevant surroundings seem to fade from view. Maybe RL agents should do the same? https://t.co/I4Kr5JRwL8

2023-03-19 04:20:33 @kchonyc I extrapolated that philosophy to all future "service" tasks (running a big grant, running a conference, etc.), and it seems like a really reliable rule of thumb. Sort of the academic version of "let him who is without sin..."

2023-03-19 04:18:49 @kchonyc Reminds me of wise words one of my colleagues said when I joined UCB: "if the teaching coordinator asks you to teach a specific class, you can say no

2023-03-11 07:52:23 RT @mitsuhiko_nm: Offline pre-training &

2023-03-10 22:45:42 RT @aviral_kumar2: Interested in offline RL that improves with limited online interaction rapidly? Check out Cal-QL: a method for pre-train…

2023-03-10 15:51:03 @hdonancio We just use Monte Carlo estimates for this (sum up rewards in the observed training trajectories). This is always possible and always unbiased, though it has non zero variance.

2023-03-10 04:45:21 Cal-QL was a great collaboration, led by @mitsuhiko_nm, @simon_zhai, @aviral_kumar2, w/ Anikait Singh, Max Sobol Mark, @YiMaTweets, @chelseabfinn Website: https://t.co/iZ1TKkaAqi Arxiv: https://t.co/Mz8fmhqgQm

2023-03-10 04:45:20 I particularly like this result because it shows (1) the high UTD training ideas in RLPD also transfer to other methods

2023-03-10 04:45:19 As a sidenote, the recently proposed RLPD method (w/ @philipjohnball, @ikostrikov, @smithlaura1028) <

2023-03-10 04:45:18 This effectively fixes the problem! Now the Q-function is on the right scale, and online finetuning makes it directly improve, instead of experiencing the "dip." https://t.co/dL2q6vY6yx

2023-03-10 04:45:17 This is very bad both for safety (really bad performance for a while) and learning speed (lots of time wasted for recovery). Fortunately, we can fix this with a very simple 1-line change to CQL!

2023-03-10 04:45:16 This is not an accident: what is happening is that CQL is underestimating during the offline phase, so once it starts getting online data, it rapidly "recalibrates" to the true Q-value magnitudes, and that "traumatic" recalibration temporarily trashes the nice initialization.

2023-03-10 04:45:15 The concept: it's very appealing to pretrain RL with offline data, and then finetune online. But if we do this with regular conservative Q-learning (CQL), we get a "dip" as soon as we start online finetuning, where performance rapidly drops before getting better. https://t.co/6DFTjlwSEr

2023-03-10 04:45:14 Can conservative Q-learning be used to pretrain followed by online finetuning? Turns out that naive offline RL pretraining leads to a "dip" when finetuning online, but we can fix this with a 1-line change! That's the idea in Cal-QL: https://t.co/iZ1TKkaAqi A thread https://t.co/pXt8jgg43N
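
A sketch of the calibration idea in this thread: inside a CQL-style conservative penalty, the Q-values being pushed down are clamped so they never fall below a Monte-Carlo return reference estimated from the dataset trajectories (as noted in a reply above). Treat this as an illustration of the one-line change, not the released implementation.

```python
import torch

def conservative_penalty(q_policy_actions, q_dataset_actions, mc_return_reference, calibrate=True):
    """q_policy_actions:   Q(s, a ~ pi), which plain CQL pushes down.
    q_dataset_actions:   Q(s, a_data), which CQL pushes up.
    mc_return_reference: Monte-Carlo return of the behavior policy from s."""
    if calibrate:
        # The "one-line change": don't let the pushed-down values sink below the
        # reference returns, keeping the offline Q-function on a sensible scale
        # so online finetuning doesn't start with a "dip".
        q_policy_actions = torch.maximum(q_policy_actions, mc_return_reference)
    return (q_policy_actions - q_dataset_actions).mean()
```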

2023-03-07 04:52:05 It's interesting to compare PaLM-E to PaLM-SayCan: https://t.co/zssoUzy9ep SayCan "stapled" an LLM to a robot, like reading the manual and then trying to drive the robot. PaLM-E is more like learning to drive the robot from illustrations and other media -- obviously way better.

2023-03-07 04:52:04 This allows PaLM-E to do the usual LLM business, answer questions about images, and actually make plans for robots to carry out tasks in response to complex language commands, commanding low-level skills that can do various kitchen tasks. https://t.co/eIRu8bjRyw

2023-03-07 04:52:03 PaLM-E ("embodied" PaLM) is trained on "multimodal sentences" that consist of images and text, in addition to normal LLM language training data. These multimodal sentences can capture visual QA, robotic planning, and a wide range of other visual and embodied tasks. https://t.co/wHozVM1Pwk

2023-03-07 04:52:02 What if we train a language model on images &

2023-02-23 21:34:57 RT @GoogleAI: Presenting Scaled Q-Learning, a pre-training method for scaled offline #ReinforcementLearning that builds on the conservative…

2023-02-23 21:32:58 A really fun collaboration with @aviral_kumar2, @agarwl_, @younggeng, @georgejtucker Arxiv paper here: https://t.co/SmBOijHdSj This will be presented as a long oral presentation at ICLR 2023.

2023-02-23 21:32:57 But of course the real test is finetuning performance. That works from both offline data on new games and online interaction from new game variants! https://t.co/vVWfw2DNDg

2023-02-23 21:32:56 The idea is very simple: pretrain a large ResNet-based Q-function network with conservative Q-learning (CQL) with several design decisions to ensure it learns at scale. Pretrain on ~40 games with highly suboptimal data, then finetune to new games with offline or online data. https://t.co/j2IwWUIYud

2023-02-23 21:32:55 Pretraining on large datasets is powerful, it enables learning new tasks quickly (e.g., from BERT, LLMs, etc.). Can we do the same for RL, pretrain &

2023-02-23 18:41:31 Cassie learns to jump with deep RL! Really fun collaboration with @ZhongyuLi4, @xbpeng4, @pabbeel, @GlenBerseth, Koushil Sreenath. https://t.co/g5KmuANxzq

2023-02-19 20:38:22 @natolambert Uh oh I should watch what I say when I'm being recorded

2023-02-16 23:13:55 RT @philipjohnball: @svlevine Thanks for the comments everyone! We’ve now updated the manuscript to include the great work by @yus167, and…

2023-02-16 16:57:13 If you want to try super-fast RL from prior data, @philipjohnball, @ikostrikov, @smithlaura1028 have now released the RLPD repo: https://t.co/o8369eQPuZ We've been using this in multiple robotics projects lately (stay tuned!), and RLPD works pretty great across the board. https://t.co/Ari63kw9Pu https://t.co/Su7RMClpWe

2023-02-16 16:37:14 @ThomasW423 This is not the most mainstream opinion, but I think MB and MF (value-based) methods are not that diff. Both are about prediction, and for real-world use, they are slowly converging (eg multi-task value functions =>

2023-02-16 03:39:08 Apparently there is a recording of the talk I gave on offline RL for robots, goal-directed dialogue, and human influence https://t.co/fSTjdXRWsc

2023-02-15 17:59:11 In practice, this works *especially* well with label noise, which really hurts methods that don't get known groups, since they often end up up-weighting mislabeled points rather than coherent difficult groups. https://t.co/Gw4MumnLuL

2023-02-15 17:59:10 In theory, this really does work: the "inductive bias" from having a simple function class determine the groups makes it possible to improve robustness of classifiers under group shifts w/ unknown groups w/o knowing the groups in advance. https://t.co/xxGv0dckv6

2023-02-15 17:59:09 We usually don't know groups, we just have a bunch of images/text &

2023-02-15 17:59:08 How can we learn robust classifiers w/o known groups? In Bitrate-Constrained DRO by @setlur_amrith et al., we propose an adversary should use *simple* functions to discriminate groups. This provides theoretically &

2023-02-15 17:44:48 @archit_sharma97 On a more serious note, I think the issue with "ImageNet moment" is that it's a bit of a category error. "Solving robotics" is like finding a cure for all viruses -- it's just too big. Robotics is by its nature integrative, it typically lacks clean problems like ImageNet or Go.

2023-02-15 17:41:33 @archit_sharma97 At least as a roboticist you can sleep soundly knowing that when the AI apocalypse comes the robots will be pretty clumsy. I still remember my favorite headline about the Google arm farm work: "when the robots take over, they'll be able to grab you successfully 84% of the time."

2023-02-15 07:57:35 @QuantumRamses It was a department colloquium talk he gave at Berkeley around 2017. I’m sure he gave it in a few other places too. About how we should stop calling RL RL, among other things.

2023-02-15 03:33:39 @ylecun @IanOsband @_aidan_clark_ @CsabaSzepesvari I would be quite OK with any of those. We could also just call it cybernetics

2023-02-14 17:53:51 @ylecun @IanOsband @_aidan_clark_ @CsabaSzepesvari We could even go with "learning-based optimization" (LBO) to capture the black-box things that are not really control, like chip design and neural architecture search. Perhaps these things deserve to be put under the same umbrella as there are fascinating technical commonalities.

2023-02-14 17:50:41 @ylecun @IanOsband @_aidan_clark_ @CsabaSzepesvari I would be perfectly happy if we can all agree on "learning control" (LC) as a general term and reserve the "reinforcement" bit for some more narrow special case, but everyone got confused when I tried to make that distinction, so I gave up and just call everything RL now.

2023-02-14 17:48:55 @ylecun @IanOsband @_aidan_clark_ @CsabaSzepesvari For better or worse, in modern ML, "RL" is basically a synonym for "learning-based control." I would actually much prefer the latter term (Vladlen Koltun has a great talk about this too!), but language evolves, and people use "RL" to mean a lot more than it originally meant.

2023-02-14 03:10:13 A fun collaboration studying how we can pretrain a variety of features that can then be adapted to distributional shifts. The key is to learn decision boundaries that are all predictive but orthogonal to each other, then use them as “features.” Works in theory and in practice. https://t.co/WwzpGYsmSR

2023-02-13 22:57:41 @hausman_k All my arguments are RL arguments in (thin) disguise

2023-02-13 19:48:44 @hausman_k That would be cool, though lately we're trying hard to make AI as anthropocentric as possible. Clearly we need less imitation learning and more autonomous learning. Our future robot overlords will wonder why we tried so hard to get them to imitate such imperfect creatures.

2023-02-13 17:50:13 @hausman_k Whether we are "generalists" or not, we are clearly way better at some things than others, which determines how we see the world. Missteps in AI research make this starkly apparent, but how deeply do such biases permeate physics, biology, etc.? Maybe just as much...

2023-02-13 17:48:46 @hausman_k One of the most interesting takeaways from the history of AI is just how deeply "anthropocentric" biases permeate our thinking, from Moravec's paradox to the Bitter Lesson -- so much of what we think we know about the world is colored by the nature of our own "intelligence"...

2023-02-13 05:44:18 @danfei_xu I think so. That's part of why I wanted to talk about RL-based training of LMs in there. It may be that with the right training procedure, LMs would be a lot better at rational decision making than they are now.

2023-02-12 22:52:53 I was invited to give a talk about how AI can lead to emergent solutions. Not something I talk about much, so it was a challenge: a reflection on "The Bitter Lesson", ML with Internet-scale data, and RL. I figured I would record and share w/ everyone: https://t.co/sxpfdcwvgS

2023-02-11 17:40:52 @neu_rips @JesseFarebro Interestingly, what the ablation study in our paper shows is that the most crucial choice is *not* the symmetric sampling, it's actually layer norm. The symmetric sampling helps, but not enormously in our experiments in Sec 5.1, we say that it's not enough in Sec 4.1. https://t.co/P2yAnvaeHu

2023-02-11 17:21:11 @neu_rips @JesseFarebro Of course the particular execution, details, results, etc. in each work are novel, and that's kind of the point -- I think our paper is pretty up front about the fact that the interesting new thing is in the details of how parts are combined to get results, not the basic concept.

2023-02-11 17:16:02 @neu_rips @JesseFarebro These papers all do this: https://t.co/lcJW0d8xqN https://t.co/oncdFIcWyC https://t.co/2eT8wx5EaC This paper has the same buffer balancing trick (see App F3): https://t.co/6fDaq3NALN Probably others do too, Wen told me he first saw it in this paper: https://t.co/C80Lfvu5AW

2023-02-11 17:14:23 @neu_rips @JesseFarebro I agree it would be good to expand our related work to add discussion of this paper, as well as several others! Getting feedback like that is one of the reasons to post pre-prints on arxiv. That said, we're pretty clear that this is not even remotely a new idea ->

2023-02-10 21:14:53 @JesseFarebro Ah, good point! You're right, this is indeed jointly training with offline data, just like DDPGfD, DQfD, etc.

2023-02-10 03:15:15 by @seohong_park Web: https://t.co/QsccXWoeYn arxiv: https://t.co/PaROenizIm https://t.co/ihyX6yj4K0

2023-02-10 03:15:14 This works really well for downstream planning: here we show MPC-based control with the PMA model vs models from prior methods and baselines. The PMA model leads to much more effective plans because it makes much more accurate predictions. https://t.co/GtdmYp3NtF

2023-02-10 03:15:13 This has an elegant probabilistic interpretation: the original MDP is turned into an abstracted predictable MDP (hence the name), and a "low level" policy acts as a decoder that decodes predictable MDP actions into grounded actions. https://t.co/9PDWfMHoBa

2023-02-10 03:15:12 The key idea is to learn an *abstraction* of the original MDP that only permits actions whose outcomes are easy to predict. If the agent is only allowed to select among those actions, then learning a model becomes much easier. https://t.co/eDX5wHjt23

2023-02-10 03:15:11 PMA is an unsupervised pretraining method: we first have an unsupervised phase where we interact with the world and learn a predictable abstraction, and then a *zero-shot* model-based RL phase where we are given a reward, and directly use the model from unsupervised interaction. https://t.co/nM9FExxvcX

2023-02-10 03:15:10 Model-based RL is hard, b/c complex physics are hard to model. What if we restrict agent to only do things that are easy to predict? This makes model-based RL much easier. In PMA, @seohong_park shows how to learn such "predictable abstractions" https://t.co/QsccXWoeYn Thread https://t.co/xmXfrOecsT

2023-02-09 16:33:12 @JesseFarebro That paper appears to provide theoretical analysis of offline RL pretraining with online finetuning, something that has been studied in a number of papers (including the IQL algorithm in the comparisons above). That's an old idea, and in my opinion a good one.

2023-02-09 03:22:31 This is work by @philipjohnball, @ikostrikov, @smithlaura1028 You can read the paper here: https://t.co/ZdRDLKNhZy

2023-02-09 03:22:30 These might seem like details, but they lead to a *huge* boost in training, without the more complex approaches in other offline to online work. Here are mujoco Adroit and D4RL results and comparisons https://t.co/ii9bGzM6uw

2023-02-09 03:22:29 Choice 3: use high UTD and a sample-efficient RL method (typically something with stochastic regularization, like ensembles or dropout). This one is known from prior work (DroQ, REDQ, etc.), but it really makes a big difference here in quickly incorporating prior data.

2023-02-09 03:22:28 There are three main design decisions, which turn out to be critical to improve SAC to be a great RL-with-prior data (RLPD) method: Choice 1: each batch is sampled 50/50 from offline and online data This is an obvious one, but it makes sure that prior data makes an impact https://t.co/dTRigTP79T

2023-02-09 03:22:27 RL w/ prior data is great, b/c offline data can speed up online RL and overcome exploration (e.g., solve the big ant maze task below). It turns out that a great method for this is just a really solid SAC implementation, but details really matter https://t.co/ZdRDLKNhZy https://t.co/JrBbtCQsTf
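
A minimal sketch of the 50/50 "symmetric sampling" in Choice 1 above: every training batch is drawn half from the prior (offline) dataset and half from the online replay buffer. The other ingredients (LayerNorm in the critic, a high update-to-data ratio) live in the learner itself and are omitted here.

```python
import numpy as np

def sample_symmetric_batch(offline_data, online_buffer, batch_size=256):
    # Half the batch from prior data, half from online experience.
    half = batch_size // 2
    offline_idx = np.random.randint(0, len(offline_data), size=half)
    online_idx = np.random.randint(0, len(online_buffer), size=batch_size - half)
    batch = [offline_data[i] for i in offline_idx] + [online_buffer[i] for i in online_idx]
    np.random.shuffle(batch)  # mix so mini-batch ordering carries no signal
    return batch
```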

2023-01-21 21:16:54 @shahdhruv_ @natolambert @xiao_ted @mhdempsey @andyzengtweets @eerac @jackyliang42 @hausman_k @vijaycivs I didn't even know we had V100s

2023-01-20 17:34:51 3 of the post authors are RAIL alumni @jendk3r was our RE, who in his brief tenure made everything run very smoothly and got a bunch of research done at the same time @ColinearDevin is now at DeepMind @GlenBerseth is prof at U of Montreal It was awesome working with you all!!

2023-01-20 17:34:50 The idea: run RL (after a bit of pretraining), continually picking up toys and dropping them back down to continue practicing. RL takes a while, but once it's autonomous, it can run for days and keep getting better! Web: https://t.co/DNSnEL7qbX Arxiv: https://t.co/RiPiTF3IJ0 https://t.co/WcEZ1kUUFe

2023-01-20 17:34:49 Some perspectives on autonomous lifelong RL for real-world robotics: https://t.co/tcfIHTOz6W This post, by @jendk3r, Charles Sun, @ColinearDevin &

2023-01-17 00:57:21 RL with language -- see my recent article here: https://t.co/WLn8ugzwgr ILQL (the algorithm I mention in the podcast): https://t.co/ssxXviA5wB Meta result on playing Diplomacy: https://t.co/bhxUJCLRCk

2023-01-17 00:57:20 I had a fun discussion with Sam Charrington for TWIML about machine learning in 2022 and things to watch in 2023. You can check it out below, with the YT link here: https://t.co/zGq3BHJjQE If you would like some references to some of the works I discuss, see links below: https://t.co/9fXBUTiIII

2023-01-12 17:23:33 @hausman_k @icmlconf I thought by "opposite" you were going to say that all papers are *required* to be generated by LLMs. The prompt must be submitted at the time of the abstract deadline. Bonus points for papers that use the same LLM that the generated paper is proposing.

2023-01-12 06:12:05 @r_mohan Should be quite possible to run on other robots (the quadcopter is a few hundred bucks...), it's just easiest to provide a clean wrapper for locobot + ROS as an example, but other robots that can read out images and accept inertial frame waypoints should work.

2023-01-11 19:02:05 We're releasing our code for "driving any robot", so you can also try driving your robot using the general navigation model (GNM): https://t.co/ohs8A3qnAw Code goes with the GNM paper: https://t.co/DWtvOWWd0b Should work for locobot, hopefully convenient to hook up to any robot https://t.co/P1kzajZl4u

2022-12-28 19:51:02 RT @hausman_k: SayCan is available in Roblox now! https://t.co/jaE07jB9vl You can play with an interactive agent supported by GPT-3

2022-12-27 23:01:44 Our new paper analyzes how turning single-task RL into (conditional) multi-task RL can lead to provably efficient learning without explicit exploration bonuses. Nice summary by @simon_zhai (w/ @qiyang_li &

2022-12-21 19:57:06 RT @riadoshi21: Just released our latest work: AVAIL, a method training a robot hand to perform complex, dexterous manipulation on real-wor…

2022-12-21 18:07:52 w/ @imkelvinxu, Zheyuan Hu, Ria Doshi, Aaron Rovinsky, @Vikashplus, @abhishekunique7 https://t.co/NzUtY2vSVx https://t.co/nsxGE6nL3v https://t.co/gSpYuU8cQr

2022-12-21 18:07:51 The experiments cover three tasks. The one shown above is the dish brush task. Another task is to grasp and insert a pipe connector (shown below), and another task involves learning to attach a hook to a fixture. https://t.co/KHVW8T0hXM

2022-12-21 18:07:50 The concept behind our system, AVAIL, is to combine a task graph for autonomous training with VICE for classifier-based rewards and efficient image-based end-to-end RL for control. https://t.co/3iioDf55t1

2022-12-21 18:07:49 We trained a four-finger robot hand to manipulate objects, from images, with learned image-based rewards, entirely in the real world. It can reposition objects, manipulate in-hand, and train autonomously for days. https://t.co/nsxGE6nL3v https://t.co/NzUtY2vSVx A thread: https://t.co/s3jp1JZq0A

2022-12-20 23:05:47 RT @xf1280: If you missed this here is a highlight reel of SayCan presentations and demos at CoRL 2022, enjoy! https://t.co/L8mKVH4Ubv

2022-12-16 23:22:10 The paper is here: https://t.co/KwmujHIZnk Also check out our other recent work on data-driven navigation! GNM: https://t.co/DWtvOWWd0b LM-Nav (also at CoRL!): https://t.co/EVsFOiKGfS

2022-12-16 23:22:09 To learn more, check out the video: https://t.co/DnftBErjd6 @shahdhruv_ will present this work at CoRL, 11 am (NZ time) Sun = 2 pm PST Sat for the oral presentation, 3:45 pm (NZ time) Sun for the poster! w/ @shahdhruv_ @ikostrikov Arjun Bhorkar, Hrish Leen, @nick_rhinehart

2022-12-16 23:22:08 This makes the planner prefer paths that it thinks will satisfy the RL reward function! In practice this method can absorb large amounts of data with post-hoc reward labeling, satisfying user-specified rewards and reaching distant goals, entirely using real data. https://t.co/xT3relwnBR

2022-12-16 23:22:03 Of course, we don't just want to stay on paths/grass/etc., but reach distant goals. It's hard to learn to reach very distant goals end-to-end, so we instead use a topological graph (a "mental map") to plan, with the RL value function as the edge weights. https://t.co/BREA7KgzXP

2022-12-16 23:21:59 The idea: we take a navigation dataset from our prior work, and post-hoc label it with some image classifiers for a few reward functions: staying on paths, staying on grass, and staying in sunlight (in case we need solar power). Then we run IQL offline RL on this. https://t.co/imwvzEwZCS

2022-12-16 23:21:53 Offline RL with large navigation datasets can learn to drive real-world mobile robots while accounting for objectives (staying on grass, on paths, etc.). We'll present ReVIND, our offline RL + graph-based navigational method at CoRL 2022 tomorrow. https://t.co/zA0WVJjHAT Thread: https://t.co/OYEDTXo4wA

2022-12-16 23:12:32 RT @xf1280: Today I learned you can do a live demo during the oral session, and it felt great! Kudos to the entire team!

2022-12-16 23:12:07 RT @shahdhruv_: I'll be presenting LM-Nav at the evening poster session today at @corl_conf: 4pm in the poster lobby outside FPAA. Come fin…

2022-12-16 01:41:24 At CoRL 2022, @xf1280, @hausman_k, and @brian_ichter did a live demo of our SayCan system! Running RT-1 policy with PaLM-based LLM to fetch some chips in the middle of the CoRL oral presentation https://t.co/zssoUzy9ep https://t.co/ipoAEuuJvY https://t.co/0sNPBYVzYI

2022-12-14 20:29:18 RT @shahdhruv_: LangRob workshop happening now at #CoRL2022 in ENG building, room 401! Pheedloop and stream for virtual attendees: https:/…

2022-12-14 16:41:39 @JohnBlackburn75 This is trained entirely in the real world (except one experiment that is not shown here that studies what happens when we include simulation)

2022-12-14 03:04:22 And it's now on arxiv :) https://t.co/GqMEmRsYER

2022-12-13 19:04:06 RT @hausman_k: Introducing RT-1, a robotic model that can execute over 700 instructions in the real world at 97% success rate! Generalize…

2022-12-13 19:03:58 RT @xf1280: We are sharing a new manipulation policy that shows great multitask and generalization performance. When combined with SayCan,…

2022-12-13 19:03:06 Also worth mentioning that (finally) this one has open-source code: https://t.co/2W5dd83zJ8 (The other large-scale projects were very difficult to try to open source because of the complexity of the codebase that goes into these kinds of projects)

2022-12-13 18:59:20 This work is a culmination of a long thread of research on ultra-large-scale robotic learning, going back to 2015: arm farm: https://t.co/tQ58x23fTy QT-Opt: https://t.co/p2Lm3f9tEw MT-Opt: https://t.co/R7hNPwThts BC-Z: https://t.co/RQ5F2Y8HTM SayCan: https://t.co/zssoUzPcgp

2022-12-13 18:59:19 One of the experiments that I think is especially interesting involves incorporating data from our previous experiments (QT-Opt -- https://t.co/CptssIFS9R), which used a different robot, and showed that this can enable the EDR robot to actually generalize better! https://t.co/ZbkxM64FSE

2022-12-13 18:59:18 But perhaps even more important is the design of the overall system, including the dataset, which provides a sufficient diversity and breadth of experience to enable the model to generalize to entirely new skills, operate successfully in new kitchens, and sequence long behaviors. https://t.co/NuUmGZFCJP

2022-12-13 18:59:17 This thread by @hausman_k summarizes the design: https://t.co/XmYehq9fgV The model combines a few careful design decisions to efficiently tokenize short histories and output multi-modal action distributions, allowing it to be both extremely scalable and fast to run. https://t.co/SZcLZowTci

2022-12-13 18:59:16 New large-scale robotic manipulation model from our group at Google can handle hundreds of tasks and generalize to new instructions. Key is the right dataset and a Transformer big enough to absorb diverse data but fast enough for real-time control: https://t.co/iMnmxlnDWm >

2022-12-13 17:53:09 We've developed this line of work in a series of papers, culminating in fully learned systems that can drive for multiple kilometers off road, on roads, etc.: https://t.co/lY3FfMLR35 https://t.co/CljpgnJiOG https://t.co/co8FMAVJsC https://t.co/qqNxL7k3zm

2022-12-13 17:53:08 "Mental maps" can be built using these navigational affordances: take landmarks you've seen in the world, and determine which ones connect to which others using learned affordances. These maps are not geometric, more like "to get to the grocery store, I go past the gas station" https://t.co/L0KRi8gfzk

2022-12-13 17:53:07 Learning can give us both of these things. Navigational affordances should be learned from experience -- try to drive over different surfaces, obstacles, etc., and see what happens, learn how the world works, and continually improve. This is much more powerful. https://t.co/woHXy9U4hX

2022-12-13 17:53:06 The idea: navigation is traditionally approached as a kind of geometry problem -- map the world, then navigate to destinations. But real navigation is much more physical

2022-12-13 17:53:05 First, the entire special issue on 3D vision, our full article, and the author manuscript (if you can't access the paywall) are here (I'll get it on arxiv shortly too!). Full issue: https://t.co/Q06GB7RBuT Article: https://t.co/wRrV2TYJZP Manuscript: https://t.co/cnSnAooGby

2022-12-13 17:53:04 Learning can transform how we approach robotic navigation. In our opinion piece/review in Philosophical Transactions, @shahdhruv_ and I discuss some recent work and present an argument that experiential learning is the way to go for navigation: https://t.co/wRrV2TYJZP Short : https://t.co/pNNTcBn6Iz

2022-12-13 07:33:00 @gklambauer Count (# of times (s, a) has been visited, often approximated with density estimators in practice). C.f., count-based exploration. (but we should probably clarify this in the paper, good catch)

2022-12-13 05:58:55 This method does well offline, and the ability to adaptively adjust delta allows for improved online finetuning! w/ Joey Hong &

2022-12-13 05:58:54 There is a relatively elegant way to derive updates for training such a value function using variable levels of pessimism. We can also flip the sign and train *both* upper and lower bounds for all confidence levels. The CQL-algorithm that results can be summarized as: https://t.co/kAATV4iFB0

2022-12-13 05:58:53 The idea is to train a value function that is conditioned on "delta", a confidence level such that the value function is above the predicted value with that level of confidence. We can then train for *all* values of delta (sampled randomly during training). https://t.co/VrwanDFdez

2022-11-15 17:29:55 New talk discussing some perspectives on RL with real-world data: https://t.co/3KP7bquaRv Discusses how offline RL can be applied to large-scale data-driven learning problems for robots, large language models, and other application domains.

2022-10-31 17:33:16 We evaluate on a wide range of test connectors (using a hold-one-out cross-validation procedure), and find that the method can consistently finetune even to pretty tricky insertion scenarios. https://t.co/F6W7Z8h7zF

2022-10-31 17:33:11 This kind of autonomous online finetuning bootstrapped from offline RL is far more practical than conventional RL, which requires either a huge number of trials or engineering-heavy sim-to-real solutions. I think in the future, all robotic RL systems will use offline pretraining! https://t.co/MIXG7yGFsK

2022-10-31 17:33:08 The finetuning is autonomous, and requires no additional human-provided information: the intuition is that the reward model generalizes better than the policy (it has an easier task to solve), and hence the policy can finetune from offline initialization.

2022-10-31 17:33:07 The method uses the IQL algorithm to train on an offline dataset of 50 connectors, and also learn a reward model and a novel domain-adaptive representation to facilitate generalization. For each new connector, the reward model provides image-based rewards that enable finetuning. https://t.co/X9ocEWkZ4J

2022-10-31 17:33:05 Offline RL initialization on diverse data can make online robotic RL far more practical! In our new paper, we show that this works great for industrial connector insertion, pretraining on dozens of connectors and finetuning autonomously to a new one! A thread: https://t.co/liXSKArTNI

2022-10-24 01:28:40 RT @KuanFang: We will present Planning to Practice (PTP) at IROS 2022 this week. Check out our paper if you're interested in using visual…

2022-10-20 19:16:52 @chrodan Yup, I think in many cases just reward shaping without bonuses is enough, hence bonuses are often not used (to be clear, the "bonuses" in our paper are just reward shaping multiplied by a count-based discount, so that the shaping goes away over time to recover unbiased sol'n)

2022-10-20 02:11:15 RT @abhishekunique7: Excited about our work on understanding the benefits of reward shaping! Reward shaping is critical in a large portion…

2022-10-19 17:24:01 I think this is quite an important and understudied area: so much RL theory is concerned with complexity w/ exploration bonuses, and so much RL practice eschews bonuses in favor of shaped rewards and other "MDP engineering." This paper aims to bridge theory and practice.

2022-10-19 17:24:00 We show formally that reward shaping improves sample efficiency in two ways: (1) with appropriate (freq-based) weighting, reward shaping terms can act as "informed exploration bonuses", biasing toward novel regions that have high shaping rewards while being unbiased in the limit

2022-10-19 17:23:59 In theory RL is intractable w/o exploration bonuses. In practice, we rarely use them. What's up with that? Critical to practical RL is reward shaping, but there is little theory about it. Our new paper analyzes sample complexity w/ shaped rewards: https://t.co/27wWFe9B8m Thread: https://t.co/iQ3dA5rXQi
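
A sketch of the shaping-as-informed-exploration-bonus view in this thread: the shaping term is multiplied by a count-based discount so it fades as a state-action pair is visited more, recovering the unbiased objective in the limit (a reply above notes that counts may be approximated with density estimators; the decay schedule here is illustrative).

```python
from collections import defaultdict

visit_counts = defaultdict(int)

def shaped_reward(s, a, env_reward, shaping_fn):
    # Assumes (s, a) is hashable; in practice counts are often replaced by
    # learned density estimates.
    visit_counts[(s, a)] += 1
    count_discount = 1.0 / (visit_counts[(s, a)] ** 0.5)  # decays with visitation
    # Novel (s, a) with high shaping value get a large bonus; frequently visited
    # ones contribute almost nothing, so the bias vanishes asymptotically.
    return env_reward + count_discount * shaping_fn(s, a)
```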

2022-10-19 03:36:44 Single life reinforcement learning -- when you have to complete the task in one (long) episode at all costs. Fun new paper with @_anniechen_, @archit_sharma97, @chelseabfinn https://t.co/m1p24MM4k0 https://t.co/uFcuYuo6z5

2022-10-18 16:34:57 Large pre-trained models (ViT, BERT, GPT, etc.) are a starting point for many downstream tasks. What would large pre-trained models look like in robotics, and how do we build them? I discuss this in my new article: https://t.co/d8lNPaQoxs Short summary:

2022-10-17 17:50:36 w/ @HiroseNoriaki, @shahdhruv_, Ajay Sridhar. arxiv: https://t.co/LNJhXR9BjM website: https://t.co/EWdOhUQeu0 video: https://t.co/KbO9QFJi0S

2022-10-17 17:50:25 This lets us drive a *wider* range of robots than what was seen in the data, resulting in policies that generalize over camera placement, robot size, and other parameters, while avoiding obstacles. Uses similar multi-robot dataset as our recent GNM paper: https://t.co/DWtvOXdg2b https://t.co/eAYm73HB9k

2022-10-17 17:50:12 Experience augmentation (ExAug) uses 3D transformations to augment data from different robots to imagine what other robots would do in similar situations. This allows training policies that generalize across robot configs (size, camera placement): https://t.co/EWdOhUQeu0 Thread: https://t.co/llRj2tY7Ef

2022-10-14 23:58:20 RT @KuanFang: Check out our CoRL 2022 paper on leveraging broad prior data to fine-tune visuomotor policies for multi-stage robotic manipul…

2022-10-14 21:54:20 Besides FLAP, our recent work on large-scale robotic learning includes: PTR (offline RL pretraining to learn from a few demos): https://t.co/vUFWTcl5xH GNM (learning to drive any robot from diverse multi-robot data): https://t.co/Y58VidiSbs

2022-10-14 21:54:12 w/ @KuanFang, Patrick Yin, @ashvinair, @HomerWalke, Gengchen Yan. arxiv: https://t.co/6tMygFkm3i website: https://t.co/6tVoynrihW This caps off our week of releasing this latest round of large-scale robotic learning papers! https://t.co/vS8QUGF1O6

2022-10-14 21:53:52 FLAP can enable the robot to perform complex tasks, like moving the pot to the stove and putting a bunny in the pot (mmm...). The method is pretrained on the bridge dataset, which we collected last year for our large scale robotic learning research: https://t.co/fGG35ikRus https://t.co/8nWnp64YX9

2022-10-14 21:53:29 That's where finetuning comes in: the subgoals give the robot a "trail of breadcrumbs" to follow to practice the task end-to-end with online RL, getting better and better. Thus, if the representation &

2022-10-14 21:53:08 The idea: learn a goal-conditioned policy, which in the process recovers a lossy latent space for planning. A multi-step predictive model in this latent space can plan subgoals in a new scene to do a task. Crucially, the robot might not succeed at following those subgoals! https://t.co/BvvIZszlvs

2022-10-14 21:52:46 Pretraining on diverse robot data with offline RL allows planning multi-stage tasks, and then finetuning those tasks with end-to-end exploration. Our new paper, FLAP, describes a method for doing this with real robots! https://t.co/6tMygFkm3i https://t.co/6tVoynrihW Thread: https://t.co/ZiKmO95yJL

2022-10-12 20:00:50 Check out the website: Get the code here to try it yourself: https://t.co/DqDKdkdpi7 (supports multi-GPU &

2022-10-12 20:00:41 PTR also gets better as we add more parameters, with larger networks reaching significantly better performance. We might expect here that with even more data and even more parameters, we might get even better performance! https://t.co/i9LgLn4aYV

2022-10-12 20:00:29 PTR works for a variety of tasks in similar environments as Bridge Data (opening doors, placing objects into bowls). It makes sense -- if you want to learn skills with RL, you should pretrain with (multi-task) RL, learning representations aware of dynamics and causal structure. https://t.co/eYXECCx564

2022-10-12 20:00:09 Critically, the same exact offline RL method is used for pretraining as for finetuning! In experiments, this leads to better performance, learning tasks with as few as 10 demonstrations (and no other data!). Better pretraining than BC, etc.: https://t.co/tuHa0Lrgvt

2022-10-12 19:59:57 PTR (pre-training for robots) is a multi-task adaptation of CQL for multi-task robotic pretraining. We pretrain a policy conditioned on the task index on a multi-task dataset (we use the Bridge Dataset: https://t.co/fGG35ikRus). Then we finetune on a new task.

2022-10-12 19:59:47 How should we pretrain for robotic RL? Turns out the same offline RL methods that learn the skills serve as excellent pretraining. Our latest experiments show that offline RL learns better representations w/ real robots: https://t.co/vUFWTcl5xH https://t.co/9Q9oUdXefT Thread >

2022-10-10 18:30:38 To learn more about GNMs, check out our paper here: https://t.co/7bHqgh0yZ7 Website: https://t.co/Y58Vid1P9s Video: https://t.co/HJHXYRKKeu w/ @shahdhruv_, Ajay Sridhar, Arjun Bhorkar, @HiroseNoriaki

2022-10-04 04:41:30 Great opportunity for anyone who wants to tackle deep and fundamental problems in RL https://t.co/GIFGutS7KD

2022-10-04 03:10:47 @DhruvBatraDB Asking "why not also use X" kind of misses the point, doesn't it? And it's a phrase that anyone who worked on deep learning in the early days probably heard numerous times. There is value in fundamental, general, and broadly applicable learning principles.

2022-10-03 22:33:08 RT @sea_snell: Great to see ILQL making it into the TRLX repo! You can now train ILQL on very large scale language models. See

2022-10-03 16:52:34 Or to summarize more briefly: Perhaps the motto should be: “use data as though the robot is a self driving car, and collect data as though the robot is a child”

2022-10-03 16:16:49 Yet people can learn without simulators, and even without the Internet. Even what we call "imitation" in robotics is different from how people imitate. As for data, driving looks different from robotics b/c we don't yet have lots of robots, but we will: https://t.co/VSp5otL6Ml https://t.co/YslnxBwqCS

2022-09-20 16:41:01 A model-based RL method that learns policies under which models are more accurate (in a learned representation space). This provides a single unified objective for training the model, policy, and representation, such that they work together to maximize reward. https://t.co/HMIR9sYGQV

2022-09-19 17:43:20 This blog post summarizes our ICML 2022 paper. arxiv: https://t.co/STGj2CD3no website: https://t.co/tx1RlPjukm my talk on this: https://t.co/J7qlep6gfm Katie's talk at ICML: https://t.co/2cZzAcPoaT

2022-09-19 17:40:43 Interested in what density models (e.g., EBMs) and Lyapunov functions have in common, and how they can help provide for safe(r) reinforcement learning? @katie_kang_'s new BAIR blog post provides an approachable intro to Lyapunov density models (LDMs): https://t.co/MiHkzw2us0

2022-09-14 01:03:59 RT @ZhongyuLi4: I think the coolest thing of GenLoco is that, if there is a new quadrupedal robot developed, we can just download the pre-t…

2022-09-13 18:09:53 The concept: train on randomized morphologies with a controller conditioned on a temporal window. With enough morphologies, it generalizes to new real-world robots! w/ Gilbert Feng, H. Zhang, @ZhongyuLi4, @xbpeng4, B. Basireddy, L. Yue, Z. Song, L. Yang, Y. Liu, K. Sreenath https://t.co/NO6cS1FdGo

2022-09-13 18:06:12 Can we train a *single* policy that can control many different robots to walk? The idea behind GenLoco is to learn to control many different quadrupeds, including new ones not seen in training. https://t.co/XDYODDXykR Code: https://t.co/Q4cEs5OcQ3 Video: https://t.co/xyHbeD52Ve https://t.co/gYCvVCDsLY

2022-08-31 17:59:34 Article in the New Scientist about @smithlaura1028 &

2022-08-23 16:49:29 But do make sure to keep an eye on the robot and be ready to stop it... we had an attempted robot uprising once that took out a window. I guess now we have the dubious distinction of saying that our building has been damaged by out-of-control autonomous robots.

2022-08-23 16:48:25 And now you can grab the code and train your own A1 to walk! If you have an A1 robot, this should run without any special stuff, except a bit of determination. https://t.co/9Z8nnFkiEH

2022-08-23 14:55:08 @pcastr On the heels of NeurIPS rebuttals, AC’ing, etc. — shows how jaded I’ve become when I thought this was going to be a joke with the punchline “strong reject”

2022-08-22 16:41:48 I wrote an article about how robotics can help us figure out tough problems in machine learning: Self-Improving Robots and the Importance of Data: https://t.co/VSp5otL6Ml

2022-08-19 02:24:19 @dav_ell @ikostrikov We'll get the code released shortly so you can see for yourself :) but there is a bit of discussion in the arxiv paper

2022-08-18 03:58:36 BTW, much as I want to say we had some brilliant idea that made this possible, truth is that the key is really just good implementation, so the takeaway is "RL done right works pretty well". Though I am *very* impressed how well @ikostrikov &

2022-08-18 01:14:11 By Laura Smith &

2022-08-18 01:13:53 Here are some examples of training (more videos on the website). Note that the policy on each terrain is different, dealing with dense mulch, soft surfaces, etc. With these training speeds, the robot adapts in real time. https://t.co/b99PDlXFpK

2022-08-18 01:13:34 Careful implementation of actor-critic methods can train very fast if we set up the task properly. We trained the robot entirely in the real world in both indoor and outdoor locations, each time learning policies in ~20 min. https://t.co/tC1nR664hf

2022-08-18 01:13:13 RL is supposed to be slow and inefficient, right? Turns out that carefully implemented model-free RL can learn to walk from scratch in the real world in under 20 minutes! We took our robot for a "walk in the park" to test out our implementation. https://t.co/unBt7nouia Thread ->

2022-08-17 05:25:37 RT @hausman_k: We have some exciting updates to SayCan! Together with the updated paper, we're adding new resources to learn more about thi…

2022-08-17 05:01:10 Fun article on Reuters about SayCan: https://t.co/SpSaEM0Iar (that's @xf1280 in the picture). But it's really not *just* a soda-fetching robot, and my other thread about the difficulty of adding skills notwithstanding, it does get new skills added regularly.

2022-08-17 04:33:49 But these are big open problems. What I think is great about SayCan is that it clearly shows that if we can just get enough robot skills, composing them into complex behaviors is something that state-of-the-art LMs can really help us to do. So the work is cut out for us :)

2022-08-17 04:33:39 At Berkeley, we also looked at... learning from automatically proposed goals: https://t.co/Jq03a0M0FP learning by playing two-player games: https://t.co/EtNpMR2dy8 Even learning by translating videos of humans! https://t.co/2nP5zpjQ8O

2022-08-17 04:33:28 There are *a lot* of ways that people have thought about this, including us. For ex, just at Google we studied: learning from human demos: https://t.co/rV1BsZIK0n learning with all possible goals: https://t.co/daYtlhy6sf learning from reward classifiers: https://t.co/R7hNPwThts

2022-08-17 04:33:17 But for all of this to work, we need the skills on which to build this, and getting these at scale is an open problem. SayCan used imitation learning, but automated skill discovery, real-world RL, and mining skills from humans are all important directions to take this further...

2022-08-17 04:33:06 In a sense, the robot represents the "world" to the LM in terms of the value functions of its skills, following a scheme we developed in the VFS paper: https://t.co/yJqJwwCT6r https://t.co/sNneWnCj9J

2022-08-17 04:32:54 Language models provide us with a powerful tool to access a kind of "smart knowledge base," but the challenge here is to parse this knowledge base correctly. SayCan does this with a joint decoding strategy to find skills that are likely under the LM and feasible for the robot. https://t.co/GWTrjN4fzk
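
To make the joint decoding idea concrete, here is a hedged toy sketch: the skill names, LM probabilities, and affordance values below are all made up, and the real system scores skills with a language model and learned value functions rather than hard-coded dictionaries.

```python
# Hedged sketch of SayCan-style joint decoding: pick the skill that is both
# likely under the language model and feasible according to the robot's
# value functions. All numbers and names below are illustrative.

def select_skill(lm_prob, affordance_value):
    """Score each candidate skill by LM likelihood x feasibility value."""
    scores = {skill: lm_prob[skill] * affordance_value[skill] for skill in lm_prob}
    return max(scores, key=scores.get), scores

# Instruction: "bring me a soda" -- hypothetical candidate skills.
lm_prob = {"pick up soda can": 0.6, "pick up sponge": 0.1, "go to kitchen": 0.3}
affordance_value = {"pick up soda can": 0.2, "pick up sponge": 0.9, "go to kitchen": 0.8}

best, scores = select_skill(lm_prob, affordance_value)
print(best, scores)  # "go to kitchen" wins: likely under the LM *and* feasible here
```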

2022-08-17 04:32:39 SayCan was a fun project: turns out connecting natural language to a robot works well once the robot has a broad repertoire of skills. Though we still have a lot to do in terms of making it scalable and automatic to equip robots with even broader skill repertoires. A thread: https://t.co/FkNNCP6tZN

2022-08-10 04:10:20 Successor features can enable a really simple and effective "meta-IRL" style method that can infer reward functions from just a few trials by using prior experience of solving RL tasks! by @marwaabdulhai &

2022-08-09 15:59:41 RT @xbpeng4: We've released the code for ASE, along with pre-trained models and awesome gladiator motion data from @reallusion!https://t.c…

2022-08-03 17:55:04 A fun project on training quadrupedal robots to score a goal: https://t.co/gYcQxpY7za w/ Yandong Ji, @ZhongyuLi4, Yinan Sun, @xbpeng4, @GlenBerseth, Koushil Sreenath https://t.co/HW7ExqTfvd

2022-07-19 03:56:37 Many of us will also be at the workshops! Hope to see you at #ICML2022

2022-07-19 03:56:13 To enable goal-conditioned policies to make analogies, @phanseneecs, @yayitsamyzhang, @ashvinair will present "Bisimulation Makes Analogies": Thu 21 Jul 4:45 p.m. - 4:50 p.m. EDT, Room 309. https://t.co/kNftFEfL8O https://t.co/68Kvb7ejGs https://t.co/FeB56WRzAa https://t.co/q7w5T2miAa

2022-07-19 03:55:49 We'll describe how unlabeled data can be included in offline RL just by setting rewards to zero (!) (@TianheYu, @aviralkumar2): Thu 21 Jul 4:40 pm - 4:45 pm ET, Room 309. https://t.co/UJAvKEJ9k0 https://t.co/23sEday3ae https://t.co/aMRqaNzq6r

2022-07-19 03:55:28 We'll present a suite of benchmarks for data-driven design (model-based optimization), called Design Bench (@brandontrabucco, @younggeng, @aviral_kumar2): Thu 21 Jul 1:50 p.m. - 1:55 p.m. ET. https://t.co/GBsQUTjMoO https://t.co/JQs4K8jn9z https://t.co/6rlzrO3kLe https://t.co/CfeswO2x9G

2022-07-19 03:55:13 To enable rapid learning of new tasks, @Vitchyr will present SMAC, an algorithm that meta-trains from offline data and unlabeled online finetuning: Thu 21 Jul 10:50 a.m. - 10:55 a.m. PDT. https://t.co/bwzIQG7Tcy https://t.co/XS9Hn6jF5U https://t.co/mtgZTivlVs https://t.co/XZTQaJsxeI

2022-07-19 03:54:57 We'll present Diffuser (@michaeljanner, @du_yilun), which uses diffusion models for long-horizon model-based RL, in a full-length oral presentation: Wed 7/20 1:45-2:05 pm ET, Room 307, RL track. https://t.co/HdRTMNmtme https://t.co/ViPDTvCWfh https://t.co/BKs9EuoP5G https://t.co/ktgIpoBXJh

2022-07-19 03:54:44 Katie Kang (@katie_kang_) will present Lyapunov Density Models, which enable safety by mitigating distributional shift: Tue Jul 19 02:35 PM -- 02:40 PM (EDT) @ Room 309. https://t.co/4sWYHloK6B https://t.co/STGj2CU6po https://t.co/tx1RlPAxmm https://t.co/PMIDb2syXx

2022-07-19 03:54:30 We'll discuss how offline RL policies must be adaptive to be optimal (@its_dibya, Ajay, @pulkitology), and introduce a Bayes-adaptive offline RL method, APE-V: Tue 19 July 2:15 - 2:35 PM EDT @ Room 309. https://t.co/x14skPtkJY https://t.co/3B0MvqY0jp https://t.co/4NfMrxxcf1 https://t.co/y6nRmglpfh

2022-07-19 03:54:02 Students &

2022-07-16 02:07:40 RT @lgraesser3: So excited to share i-S2R in which we train robots to play table tennis cooperatively with humans for up to 340 hit rallie…

2022-07-14 19:35:09 We have several large-scale offline RL robotics papers coming soon, and some already out this year: https://t.co/8Is9AoyTif A number of other groups are also making great progress on this topic, e.g.: https://t.co/mZo782IBp8 https://t.co/P9myggpBLz https://t.co/UCueqbH4sn

2022-07-14 19:34:50 This work is a step toward general-purpose pre-trained robot models that can finetune to new tasks, just like big vision &

2022-07-14 19:34:39 by @HomerWalke, Jonathan Yang, @albertyu101, @aviral_kumar2, Jedrzej Orbik, @avisingh599. Web: https://t.co/Za9qhENqKw Paper: https://t.co/i2AGx7Llaf Video: https://t.co/J45Dzk3lRQ

2022-07-14 19:34:25 Here is an example of some reset-free training. Of course, the new tasks have to have some structural similarity with the prior tasks (in this case, pick and place tasks, ring on peg, open/close drawer etc.). https://t.co/T7yj9YQT7r

2022-07-14 19:33:53 The idea in ARIEL is to use a large prior dataset with different tasks (in this case with scripted collection) to initialize a forward and backward policy that learns to both perform a task and reset it, as shown below. The task can be learned from scratch, or with a few demos. https://t.co/YLHJFaThMf

2022-07-14 19:33:42 Don't Start From Scratch: good advice for ML with big models! Also good advice for robots with reset-free training: https://t.co/Za9qhENqKw ARIEL allows robots to learn a new task with offline RL pretraining + online RL w/ forward and backward policy to automate resets. Thread: https://t.co/3yAkLhfZc4

2022-07-14 01:22:04 See thread by @hausman_k (and website above): https://t.co/hpdHN5Mqrh w/ @wenlong_huang, @xf1280, @xiao_ted, @SirrahChan, @jackyliang42J, @peteflorence, @andyzengtweets, @JonathanTompson, @IMordatch, @YevgenChebotar, @psermanet, Brown, Jackson, Luu, @brian_ichter, @hausman_k

2022-07-13 18:48:02 See thread by @hausman_k (and of course the website above): https://t.co/hpdHN5uP2H w/ @wenlong_huang, @xf1280, @SirrahChan, @jackyliang42J, @peteflorence, @andyzengtweets, @JonathanTompson, @IMordatch, @YevgenChebotar, @psermanet, Brown, Jackson, Luu, @brian_ichter, @hausman_k

2022-07-13 18:46:22 Robots can figure out complex tasks by talking... to themselves. In "Inner Monologue" we show how LLMs guide decision making and perception through a "monologue" with themselves, perception modules, and even dialogue with humans: https://t.co/snpSj8fPeS https://t.co/7Q8ztC9jO7 https://t.co/4Sk4G6xN7m

2022-07-13 07:30:08 @SOURADIPCHAKR18 I only found I’m speaking two days ago, a bit short notice :) I’m just the substitute...

2022-07-13 05:03:36 I'll be delivering this talk tomorrow at the "New Models in Online Decision Making" workshop: https://t.co/7QJ3BkcShy If you're attending the workshop, come watch it live (well, live over Zoom)! 11:00 am CT/9:00 am PT

2022-07-13 05:03:29 Happy to share a talk on Lyapunov density models and epistemic POMDPs for offline RL. Describes how we can use offline data to get safety and rapid online learning. Covers LDM: https://t.co/tx1RlPAxmm And APE-V: https://t.co/4NfMrxxcf1 See the talk here: https://t.co/hf9kZJInxN

2022-07-12 02:32:52 If you want to play with LM-Nav in the browser, check out the colab demo here: https://t.co/GKrOr24hU6 Of course, our colab won't make a robot materialize and drive around, but you can play with the graph and the LLM+LVM components

2022-07-12 02:31:43 LM-Nav uses the ViKiNG Vision-Navigation Model (VNM): https://t.co/oR1bplIvMt This enables image-based navigation from raw images. w/ @shahdhruv_, @blazejosinski, @brian_ichter. Paper: https://t.co/ymU9kALQ9G Web: https://t.co/EVsFOj1JhS Video: https://t.co/QRhQDAZydM

2022-07-12 02:30:24 The result is fully end-to-end image-based autonomous navigation directly from user language instructions. All components are large pretrained models, without hand-engineered localization or mapping systems. See example paths below. https://t.co/phhVOgvvSP

2022-07-12 02:30:11 It uses a vision-language model (VLM, CLIP in our case) to figure out which landmarks extracted from the directions by the LLM correspond to which images in the graph, and then queries the VNM to determine the robot controls to navigate to these landmarks. https://t.co/3gOOeFUaUW

2022-07-12 02:29:57 LM-Nav first uses a pretrained language model (LLM) to extract navigational landmarks from the directions. It uses a large pretrained navigation model (VNM) to build a graph from previously seen landmarks, describing what can be reached from what. Then... https://t.co/lCnf7Cqru4
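
A hedged toy sketch of how the three pieces in the two tweets above compose; the LLM, VLM, and VNM are replaced by a hard-coded landmark list, random embeddings, and a print statement, just to show the structure of the pipeline rather than any real model.

```python
import numpy as np

# Toy sketch of the LM-Nav composition: (1) an "LLM" extracts landmarks from
# the instruction, (2) a "VLM" matches each landmark to a node image in the
# topological graph, (3) a "VNM" drives between matched nodes. All three
# components are random/hard-coded stand-ins here.

rng = np.random.default_rng(0)
nodes = ["node_a", "node_b", "node_c", "node_d"]
node_embeddings = {n: rng.normal(size=16) for n in nodes}   # stand-in image features

def llm_extract_landmarks(instruction):
    return ["stop sign", "blue building"]                   # stand-in LLM output

def vlm_match(landmark):
    text_emb = rng.normal(size=16)                          # stand-in text embedding
    sims = {n: float(text_emb @ e) / (np.linalg.norm(text_emb) * np.linalg.norm(e))
            for n, e in node_embeddings.items()}
    return max(sims, key=sims.get)

def vnm_drive(src, dst):
    print(f"VNM: driving {src} -> {dst}")                   # stand-in low-level navigation

current = "node_a"
for landmark in llm_extract_landmarks("go past the stop sign to the blue building"):
    target = vlm_match(landmark)
    vnm_drive(current, target)
    current = target
```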

2022-07-12 02:29:48 Can we get robots to follow language directions without any data that has both nav trajectories and language? In LM-Nav, we use large pretrained language models, language-vision models, and (non-lang) navigation models to enable this in zero shot! https://t.co/EVsFOj1JhS Thread: https://t.co/lL76BlNVst

2022-07-09 21:24:20 A (non-technical) talk about Moravec's paradox, the state of AI, and why robot butlers are hard to build: https://t.co/EBcoEbivMj

2022-07-07 20:24:33 @michal_pandy You can think of this (roughly) as an analysis of Bayesian RL in the offline setting, with a particular algorithmic instantiation based on ensembles. The trick is that the ensemble still needs to be aware of the belief state (which is updated incrementally).

2022-07-07 17:09:51 Paper: https://t.co/3B0MvqY0jp by @its_dibya + A. Ajay, @pulkitology, me. This builds on the ideas we previously introduced in our work on the epistemic POMDP: https://t.co/u5HaeJPQ29

2022-07-07 17:09:39 We can instantiate this by using an ensemble to track uncertainty about Q-functions, and conditioning the policy on the ensemble weights (i.e., the belief state). Adapting these weights via Bayes filtering updates leads to improved performance! https://t.co/kZl1WbFQUs
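
A hedged numpy sketch of that general recipe, not the paper's exact update rule: the Gaussian-on-TD-error likelihood and the belief-averaged action selection below are simplifications chosen for illustration.

```python
import numpy as np

# Hedged sketch: belief over K ensemble Q-functions, updated like a Bayes
# filter from each observed transition. The likelihood model below is an
# illustrative simplification, not the paper's exact rule.

rng = np.random.default_rng(0)
K, n_states, n_actions, gamma = 4, 5, 3, 0.99
q_ensemble = rng.normal(size=(K, n_states, n_actions))  # stand-in for trained Q's
belief = np.ones(K) / K

def update_belief(belief, q_ensemble, s, a, r, s_next, sigma=1.0):
    """Reweight ensemble members by how well they explain the observed transition."""
    td_error = r + gamma * q_ensemble[:, s_next].max(axis=1) - q_ensemble[:, s, a]
    likelihood = np.exp(-0.5 * (td_error / sigma) ** 2)
    new_belief = belief * likelihood
    return new_belief / new_belief.sum()

def act(belief, q_ensemble, s):
    """Policy conditioned on the belief: here, act under the belief-averaged Q."""
    q_mean = np.tensordot(belief, q_ensemble[:, s], axes=1)
    return int(q_mean.argmax())

s, a, r, s_next = 0, act(belief, q_ensemble, 0), 1.0, 2
belief = update_belief(belief, q_ensemble, s, a, r, s_next)
print(belief)  # posterior over which ensemble member best matches reality so far
```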

2022-07-07 17:09:24 Formally, the finite sample size in offline RL makes any MDP actually a POMDP, where partial observability is due to epistemic uncertainty! The state of this POMDP includes the belief over what the MDP really is, which we can update as we run the policy. https://t.co/UxEkCUvy1C

2022-07-07 17:09:10 Another example: we have sparse data on narrow roads, dense data on big roads. We should try the narrow roads to see if they provide a "shortcut," but if they don't, then we update our policy and go with the safer option. Again, adaptive strategies win over any conservative one. https://t.co/vbyLIPrwGr

2022-07-07 17:08:56 The right answer is to adapt the policy after trying the first door, thus updating the agent's posterior about what kind of MDP it is in -- basically, seeing that the door didn't open gives more information beyond what the agent had from the training set.

2022-07-07 17:08:48 This is *not* the same as regular classification, since it gets multiple guesses. The optimal policy (try doors in order) is *not* optimal in the training data, where it's easy to memorize the classes of each image. Optimal strategy is only apparent if accounting for uncertainty.

2022-07-07 17:08:41 Here is an example: there are 4 doors, the agent gets a (CIFAR) image (lower-left), and can only open the door corresponding to the class of the image. If it knows the label, that's easy. But if it's uncertain? It should "try" doors in order of how likely they are to be right. https://t.co/yGqzvfjzDu

2022-07-07 17:08:26 Offline RL methods have to deal with uncertainty due to fixed-sized datasets. Turns out that it is provably more optimal to recover *adaptive* policies rather than a static conservative policy. We study this setting in our new paper: https://t.co/3B0MvqY0jp Thread w/ intuition: https://t.co/HHhkmjoJwv

2022-07-07 16:55:12 RT @its_dibya: What does an offline RL policy need to do to be maximally performant? In our new paper, we discovered that being optimal…

2022-07-07 03:29:53 Also very appropriate that @Mvandepanne is receiving the Computer Graphics Achievement Award the same year that his former student gets the Dissertation Award. Very well deserved, Michiel's papers were a big inspiration for me in my own PhD as well.

2022-07-07 03:28:06 Congratulations @xbpeng4 on receiving the SIGGRAPH Outstanding Dissertation Award! Awesome recognition for making virtual characters robustly recover from being pelted with big boxes, small boxes, and medium boxes, and sometimes making them hit back. https://t.co/MIIYJ1cpJs https://t.co/NmXlxHkdgO

2022-07-06 16:45:59 If you are applying for a PhD and want to create the most awesome animation methods and the most lifelike robots with one of the leading researchers in the field, Jason is starting a new lab at SFU!! https://t.co/IXcQqy39Jr

2022-07-06 03:39:31 A simple fixed point differentiation algorithm that can allow significantly more effective training of models that contain iterative inference procedures (e.g., models for object representations). https://t.co/W8O3qoNLUs

2022-07-01 01:45:17 If you want to check out the talk, you can find it here (along with a recording of the entire workshop!): https://t.co/E7HE1MtgxY Covers our recent work on offline RL + robotics, including a few soon-to-be-released papers! https://t.co/IbdgptCHYv

2022-06-29 18:15:21 If you're at #RSS2022, check out Yanlai Yang's talk on the Bridge Dataset, a dataset of 7k+ demos with 71 tasks in many kitchens + an evaluation of how such a large dataset can boost generalization for new tasks! 11:05 am ET / Thu / Poster 12. More here: https://t.co/JbIbC9X1I1 https://t.co/VmPT9e27nX

2022-06-28 21:00:01 @pcastr It may be that the similarity is more than just superficial, as similar math can explain both to some degree (though I suppose that's true for any two things if interpreted broadly enough): https://t.co/qx4ArX5ZTu

2022-06-26 21:19:38 @truman8hickok Let me see, hopefully it's recorded by the organizers, but if not, I think I can post my practice talk &

2022-06-26 19:32:29 I'll be speaking at #LDOD at #RSS2022 tmrw (Mon 6/27), 9:00am ET. I'll cover a few brand new large-scale robotic learning works! Here is a little preview: New Behaviors From Old Data: How to Enable Robots to Learn New Skills from Diverse, Suboptimal Datasets https://t.co/59wkTUojOv https://t.co/uKcctIGiEs

2022-06-26 18:01:13 RT @shahdhruv_: On Tuesday, I’m stoked to be presenting ViKiNG — which has been nominated for the Best Systems Paper award — at the Long Ta…

2022-06-24 18:19:11 The arxiv posting is finally out! https://t.co/xrPkESnGKb

2022-06-22 19:00:23 @jekbradbury So they're tackling somewhat orthogonal problems, and conceivably could be combined effectively.

2022-06-22 19:00:05 @jekbradbury RLHF deals more with the question of what kind of human feedback to get (preferences, feedback, etc.). In RL, feedback is somehow turned into a reward (see, e.g., Paul Christiano's preferences paper), which is used with some kind of RL method. ILQL is one choice for RL method.

2022-06-22 17:40:46 By @katie_kang_, Paula Gradu, Jason Choi, @michaeljanner, Claire Tomlin, and myself. web: https://t.co/tx1RlPAxmm paper: https://t.co/STGj2CU6po This will appear at #ICML2022!

2022-06-22 17:40:37 Some results -- here, we use model-based RL (MPC, like in PETS) to control the hopper to hop to different locations, with the LDM providing a safety constraint. As we vary the threshold, the hopper stops falling, and if we tighten the constraint too much it stands in place. https://t.co/xxjeuWqna7

2022-06-22 17:40:20 Intuitively, the LDM learns to represent the worst-case future log density we will see if we take a particular action, and then obey the LDM constraint thereafter. Even with approximation error, we can prove that this keeps the system in-distribution, minimizing errors! https://t.co/JH9XP2PjIs

2022-06-22 17:40:01 LDM can be thought of as a "value function" with a funny backup, w/ (log) density as "reward" at the terminal states and using a "min" backup at other states (see equation below, E = -log P is the energy). In special cases, LDM can be Lyapunov, density model, and value function! https://t.co/ZAro9Jex2q
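
Reading the description above literally, one hedged reconstruction of that backup (my notation, with energy E(s,a) = -log p(s,a); the paper's exact objective may differ in details) is:

```latex
% Hedged reconstruction (my notation): worst-case future energy, assuming we
% act to keep energy low thereafter.
\[
  G(s,a) \;=\; \max\!\Big( E(s,a),\; \min_{a'} G(s',a') \Big), \qquad s' \sim p(\cdot \mid s, a),
\]
\[
  \text{constraint set: } \ \{(s,a) : G(s,a) \le -\log c\} .
\]
```

This matches the intuition in the tweets above: G tracks the worst-case future (negative) log-density, and thresholding it gives a constraint that keeps future states at density at least c.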

2022-06-22 17:39:37 We can take actions that are high-density now, but lead to (inevitable) low density later, so just like the Lyapunov function needs to take the future into account, so does the Lyapunov dynamics model, integrating future outcomes via a Bellman equation just like in Q-learning. https://t.co/igWYdk8Zkd

2022-06-22 17:38:15 By analogy (which we can make precise!) Lyapunov functions tell us how to stabilize around a point in space (i.e., x=0). What if what we want is to stabilize in high density regions (i.e., p(s) >

2022-06-22 17:38:00 Basic question: if I learn a model (e.g., dynamics model for MPC, value function, BC policy) on data, will that model be accurate when I run it (e.g., to control my robot)? It might be wrong if I go out of distribution, LDMs aim to provide a constraint so they don't do this. https://t.co/38AZmVCiMe

2022-06-22 17:36:48 What do Lyapunov functions, offline RL, and energy based models have in common? Together, they can be used to provide long-horizon guarantees by "stabilizing" a system in high density regions! That's the idea behind Lyapunov Density Models: https://t.co/tx1RlPAxmm A thread: https://t.co/yPlPVIRTam

2022-06-21 23:51:10 @KevinKaichuang @KathyYWei1 Nothing to see here, just a super-intelligent AI sending messages to its friends by secretly encoding them into proteins.

2022-06-21 18:54:36 Note: we tried to post our paper on arxiv, but arxiv has put it on hold (going into the third week now). Hopefully arxiv posts it soon, but for now it's just hosted on the website above. I guess arxiv mods are a bit swamped these days...

2022-06-21 18:52:58 by @sea_snell, @ikostrikov, @mengjiao_yang, Yi Su. paper: https://t.co/Pt2oGnBnfK website: https://t.co/ssxXvihWit code: https://t.co/mrbFVX2Gar You can also check out our prior work on offline RL for dialogue systems (NAACL 2022): https://t.co/hZny7dV0dE https://t.co/eT08Q95v6U

2022-06-21 18:52:49 One cool thing is that, just by changing the reward, we can drastically alter generated dialogue. For example, in the example on the right, the bot (Questioner) asks questions that minimize the probability of getting yes/no answers. https://t.co/WD1FZIPni3

2022-06-21 18:52:26 The result is that we can train large transformer Q-functions finetuned from GPT for arbitrary user-specified rewards, including goal-directed dialogue (visual dialogue), generating low-toxicity comments, etc. It's a general tool that adds rewards to LLMs. https://t.co/NpCB7Dl0w6

2022-06-21 18:52:11 Implicit Q-learning (IQL) provides a particularly convenient method for offline RL for NLP, with a training procedure that is very close to supervised learning, but with the addition of rewards in the loss. Our full method slightly modifies IQL w/ a CQL term and smarter decoding. https://t.co/h6YjSq2gv8
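
For readers who want the flavor of the underlying losses, here is a minimal PyTorch sketch of standard IQL (expectile value regression plus TD regression for Q); the ILQL-specific additions mentioned above (a CQL-style term and smarter decoding) are left out.

```python
import torch

# Minimal sketch of the core IQL losses. Note how close this is to supervised
# learning: both losses are plain regressions, with rewards entering the target.

def expectile_loss(diff, tau=0.7):
    """Asymmetric L2: with tau > 0.5, V is pushed toward an upper expectile of Q."""
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff.pow(2)).mean()

def iql_losses(q_net, v_net, target_q_net, batch, gamma=0.99, tau=0.7):
    s, a, r, s_next, done = batch
    # V chases an expectile of Q(s, a) over the dataset's own actions.
    with torch.no_grad():
        q_sa = target_q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    v_loss = expectile_loss(q_sa - v_net(s).squeeze(1), tau)
    # Q regresses onto the one-step TD target built from V.
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * v_net(s_next).squeeze(1)
    q_pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    q_loss = ((q_pred - target) ** 2).mean()
    return v_loss, q_loss

# Smoke test with random nets and data (discrete action space, e.g. tokens).
obs_dim, n_actions = 8, 4
q_net, target_q_net = torch.nn.Linear(obs_dim, n_actions), torch.nn.Linear(obs_dim, n_actions)
v_net = torch.nn.Linear(obs_dim, 1)
batch = (torch.randn(32, obs_dim), torch.randint(0, n_actions, (32,)),
         torch.randn(32), torch.randn(32, obs_dim), torch.zeros(32))
print(iql_losses(q_net, v_net, target_q_net, batch))
```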

2022-06-21 18:51:58 We might want RL in many places in NLP: goal-directed dialogue, synthesize text that fulfills subjective user criteria, solve word puzzles. But online RL is hard if we need to actively interact with a human (takes forever, annoying). Offline RL can learn from only human data! https://t.co/m0EAVwIyar

2022-06-21 18:51:31 NLP and offline RL are a perfect fit, enabling large language models to be trained to maximize rewards for tasks such as dialogue and text generation. We describe how ILQL can make this easy in our new paper: https://t.co/ssxXvihWit Code: https://t.co/mrbFVX2Gar Thread ->

2022-06-19 01:20:59 @b_shrir Thanks for pointing that out! Definitely looks relevant, we'll take a closer look.

2022-06-17 23:42:46 This work led by @setlur_amrith, with @ben_eysenbach &

2022-06-17 23:42:36 Decision boundaries look much better too. Here, vertical axis is spurious, horizontal is relevant. Pink line (RCAD) always gets the right boundary, ERM (black) is often confused by the spurious direction. So RCAD can really "unlearn" bad features. https://t.co/aT4PYjEiSa

2022-06-17 23:42:23 It's also possible to prove that this adversarial entropy maximization "unlearns" spurious features, provably leading to better performance. This unlearning of bad features happens in practice! Here we start w/ ERM (red) and switch to RCAD (green), spurious feature wt goes down! https://t.co/xZdncIj0s9

2022-06-17 23:42:11 This leads to improved generalization performance on the test set, and can be readily combined with other methods for improving performance. It works especially well when training data is more limited. https://t.co/AFziLNhIv7

2022-06-17 23:41:58 The idea is to use *very* aggressive adversarial training, generating junk images for which model predicts wrong label, then train the model to minimize its confidence on them. Since we don't need "true" labels for these images, we make *much* bigger steps than std adv training. https://t.co/Q4Db3VhY66
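
A hedged PyTorch sketch of that recipe; the step size, weighting, and exact uncertainty loss below are illustrative choices, not the paper's settings.

```python
import torch
import torch.nn.functional as F

# Hedged sketch of the RCAD-style idea described above: take a *large*
# adversarial step to produce off-manifold inputs, then train the model to be
# uncertain (high predictive entropy) on them, alongside the usual
# cross-entropy on clean data.

def rcad_step(model, x, y, step_size=1.0, unc_weight=0.5):
    # 1) Generate "junk" inputs by ascending the loss with a big step.
    x_adv = x.clone().requires_grad_(True)
    grad = torch.autograd.grad(F.cross_entropy(model(x_adv), y), x_adv)[0]
    x_adv = (x_adv + step_size * grad.sign()).detach()
    # 2) Standard loss on clean data + entropy maximization on the junk inputs.
    clean_loss = F.cross_entropy(model(x), y)
    probs = F.softmax(model(x_adv), dim=1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1).mean()
    return clean_loss - unc_weight * entropy  # minimizing this maximizes entropy

model = torch.nn.Linear(16, 10)          # stand-in classifier
x, y = torch.randn(32, 16), torch.randint(0, 10, (32,))
loss = rcad_step(model, x, y)
loss.backward()
print(float(loss))
```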

2022-06-17 23:41:49 Deep nets can be overconfident (and wrong) on unfamiliar inputs. What if we directly teach them to be less confident? The idea in RCAD ("Adversarial Unlearning") is to generate images that are hard, and teach it to be uncertain on them: https://t.co/lJf5aVv3JrA thread: https://t.co/pHj76WHnUA

2022-06-17 01:17:51 See also Ben's thread on this work here: https://t.co/x2UzoAq8th

2022-06-17 01:17:35 ...then we don't need to worry about separate representation learning for RL, just use task-agnostic RL objectives (like goal reaching) to learn your representations! w/ @ben_eysenbach, @tianjun_zhang, @rsalakhu. website: https://t.co/mdyQrG2FMC paper: https://t.co/UtLxMjgMGp

2022-06-17 01:17:26 More generally, I think contrastive RL poses a very interesting question: instead of asking how representation learning can help RL, can we instead ask how RL can help representation learning? If representation learning and RL objectives are cast in the same framework...

2022-06-17 01:17:18 This also leads to a very effective offline goal-conditioned RL method (albeit with a couple modifications), and can outperform state-of-the-art methods on the difficult ant maze tasks when provided with the task goal. https://t.co/8t0PQPDQjn

2022-06-17 01:17:03 This is very simple, and works very well as a goal-conditioned RL method, naturally leads to the (obvious) relabeling strategy, and outperforms a wide range of prior methods when doing goal reaching from images. https://t.co/DEZHOEY7h1

2022-06-17 01:16:51 High-level idea is simple: contrastive learning contrasts positives vs negatives (left). We can do RL by contrastive current state-action tuples with future states, sampled from a discounted future distribution (right). Then condition on the future goal, and pick the max action. https://t.co/QsIu8DpSS6
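
A hedged PyTorch sketch of such a contrastive critic, using in-batch negatives and a symmetric InfoNCE-style loss; this simplifies the paper's actual objective and architecture.

```python
import torch
import torch.nn.functional as F

# Hedged sketch of a contrastive RL critic: embed (state, action) and goal,
# score with an inner product, and treat each row's own sampled future state
# as the positive and other rows in the batch as negatives.

obs_dim, act_dim, emb_dim, batch = 10, 3, 32, 64
sa_encoder = torch.nn.Linear(obs_dim + act_dim, emb_dim)
goal_encoder = torch.nn.Linear(obs_dim, emb_dim)

s = torch.randn(batch, obs_dim)
a = torch.randn(batch, act_dim)
future_s = torch.randn(batch, obs_dim)   # sampled from the discounted future of (s, a)

phi = sa_encoder(torch.cat([s, a], dim=1))      # (B, emb)
psi = goal_encoder(future_s)                    # (B, emb)
logits = phi @ psi.t()                          # (B, B): diagonal entries are positives

labels = torch.arange(batch)
loss = 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
loss.backward()
print(float(loss))

# At decision time, condition on the commanded goal g and prefer actions whose
# embedding scores highest against goal_encoder(g), e.g. via an actor trained
# against this critic or by scoring sampled candidate actions.
```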

2022-06-17 01:16:38 Turns out we can do RL directly with contrastive learning, and it leads to good goal-conditioned performance on images without any separate explicit representation learning objective: https://t.co/mdyQrG2FMC A short thread: https://t.co/IboAGJaeHu

2022-06-14 01:53:25 @unsorsodicorda @keirp1 @SheilaMcIlraith Yes, there is also this (though focusing specifically on return conditioned): https://t.co/mE9qZDitQn Seems pretty clear that just return conditioned policies don't generally work. In RvS we argued that goal-conditioning should be better (though still requires alg changes).

2022-06-14 01:22:40 @Caleb_Speak Hmm that’s an interesting thought. We focus specifically on RL tasks, but it would be interesting under which conditions this would cause problems for general supervised systems with feedback — perhaps non trivial because it depends a lot on how they are supervised.

2022-06-13 19:20:28 Read more in Ben's tweet here: https://t.co/TB46OUutxV

2022-06-13 19:20:18 This leads to much better results particularly in stochastic environments, and we hope that this "normalized OCBC" can serve as a foundation for a better and more principled class of supervised learning RL methods! https://t.co/Fnz2r7OkIt

2022-06-13 19:20:06 Fortunately, we can fix this problem with two subtle changes. First, we reweight the data to account for the average policy (i.e., prior prob of actions). Second, we can't relabel tasks for which the policy is too different. With these two changes, we can prove convergence. https://t.co/orzea5yOd5

2022-06-13 19:19:53 In practice, this shows up as a kind of "winner take all" problem, where some tasks (e.g., some goals) get better, while others get worse. That makes sense -- if a goal gets worse, we have only bad data to relabel into that goal, and it will not get much better. https://t.co/TcmZfrRtPe

2022-06-13 19:19:41 ...and in general, the maximization step does *not* improve it beyond the "badness" introduced by the averaging step. This can become a big problem especially in stochastic environments. This is a problem, b/c this is a popular supervised learning alternative to RL

2022-06-13 19:19:30 In our paper (@ben_eysenbach, @0602soumith, @rsalakhu), we show that in general, naive relabeling + supervised learning *doesn't* work. We can interpret it as two steps (analogously to EM): averaging (pooling data from all trials) and training. Averaging makes the policy worse... https://t.co/HtPH5Uloni

2022-06-13 19:19:20 An appealing alternative to classic RL is goal-conditioned BC (GCSL, etc.), or generally outcome-conditioned BC: condition on a future outcome and run supervised learning. Does this work? Turns out the answer, in general, is no: https://t.co/2zJFTt51Cm A thread:

2022-06-13 19:04:07 A talk that I prepared on how reinforcement learning can acquire abstractions for planning. Covers some HRL, trajectory transformer, offline RL: https://t.co/LKl3OgAjst This was made for the "Bridging the Gap Between AI Planning and Reinforcement Learning" workshop at ICAPS.

2022-06-05 04:39:28 RT @avisingh599: A ~35 minute talk I gave at the ICRA 2022 Behavior Priors in Reinforcement Learning for Robotics Workshop https://t.co/aNB

2022-05-26 19:55:24 @yukez @snasiriany @huihan_liu @UTCompSci @texas_robotics Awesome, congratulations @snasiriany, @huihan_liu, @yukez!!

2022-05-26 01:43:52 Finally, a method that will let us communicate with aliens if we don't know their language. Unfortunately, we couldn't find aliens for the study, so we decided to let the algorithm figure out how to control a lunar lander from hand gestures. Video here: https://t.co/PluzJe5vPT https://t.co/7s6r0OlaTh

2022-05-25 23:22:46 @mathtick @NandoDF I think that would be quite neat. The current model is still in discrete time though, so this would be a nontrivial extension — the diffusion here roughly corresponds to steps of plan refinement, rather than time steps.

2022-05-25 02:33:48 w/ @KuanFang, Patrick Yin, @ashvinair. website with videos: https://t.co/EK3eQb0we9 arxiv: https://t.co/8Is9AoQuGP

2022-05-25 02:32:23 PTP builds on a number of prior works on planning with goal-conditioned policies: https://t.co/i6wCVW1dK2 https://t.co/Qr3qUYzC0T https://t.co/DBvTUShGGB ...and vision-based RL: https://t.co/cHpFdS1fIv https://t.co/Rre1xThTdx For some reason a lot of these are from 2019

2022-05-25 02:28:48 PTP can use planning to significantly improve goal-conditioned policies with online finetuning, and then those goal-conditioned policies can solve multi-stage tasks (here, push an object and then close the drawer) in the real world! https://t.co/swZovHEroP

2022-05-25 02:27:10 The subgoals make it much easier to further finetune the goal-conditioned policy online to get it to really master multi-stage tasks. Intuitively, the affordance model tells the robot what it should do to achieve the goal, and the goal-conditioned policy how to do it. https://t.co/pZB9mEjtBl

2022-05-25 02:25:58 PTP trains a goal-conditioned policy via offline RL, as well as an "affordance model" that predicts possible outcomes the robot can achieve. Then it performs multi-resolution planning, breaking down the path to the goal into finer and finer subgoals. https://t.co/lCApBvnAOS
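
A hedged toy sketch of that multi-resolution planning loop; the affordance model and goal-conditioned value below are random stand-ins, included only to show the recursion over subgoals.

```python
import numpy as np

# Hedged sketch of multi-resolution subgoal planning: recursively insert a
# subgoal between the current state and the goal, chosen among affordance
# samples by a goal-conditioned value. The models below are random stand-ins.

rng = np.random.default_rng(0)
DIM = 4

def affordance_sample(state, n=8):
    """Stand-in for the learned affordance model: possible reachable outcomes."""
    return state + rng.normal(scale=0.5, size=(n, DIM))

def value(state, goal):
    """Stand-in for the goal-conditioned value (higher = easier to reach)."""
    return -np.linalg.norm(state - goal)

def plan_subgoals(state, goal, depth=2):
    """Return a chain of subgoals, refined at increasingly fine resolution."""
    if depth == 0:
        return []
    candidates = affordance_sample(state)
    # Prefer a candidate that is reachable from the state and from which the
    # goal still looks reachable.
    scores = [value(state, c) + value(c, goal) for c in candidates]
    mid = candidates[int(np.argmax(scores))]
    return plan_subgoals(state, mid, depth - 1) + [mid] + plan_subgoals(mid, goal, depth - 1)

start, goal = np.zeros(DIM), np.ones(DIM)
print(np.round(plan_subgoals(start, goal), 2))  # coarse-to-fine chain of subgoals
```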

2022-05-25 02:24:40 Planning to practice (PTP) combines goal-based RL with high-level planning over image subgoals to finetune robot skills. Goal-conditioned RL provides low-level skills, and they can be integrated into high-level skills by planning over "affordances": https://t.co/8Is9AoQuGP ->

2022-05-23 19:19:33 w/ @michaeljanner, @du_yilun, Josh Tenenbaum. Paper: https://t.co/ViPDTvUxDR Webpage: https://t.co/UDwLsWBoqa Code: https://t.co/QSLozrOD2Y

2022-05-23 19:18:19 This scales to more complex tasks, like getting a robotic arm to manipulate blocks, and does quite well on offline RL benchmark tasks! https://t.co/10Lavbx3iT

2022-05-23 19:17:51 Why is this good? By modeling the entire trajectory all at once and iteratively refining the whole thing while guiding toward optimality (vs autoregressive generation), we can handle very long horizons, unlike conventional single-step dynamics models that use short horizons https://t.co/FRRkRNtXKt

2022-05-23 19:17:04 The architecture is quite straightforward, with the only trajectory-specific "inductive bias" being temporally local receptive fields at each step (intuition is that each step looks at its neighbors and tries to "straighten out" the trajectory, making it more physical) https://t.co/MAWjMVnTw0

2022-05-23 19:15:42 The model is a diffusion model over trajectories. Generation amounts to producing a physically valid trajectory. Perturbing by adding gradients of Q-values steers the model toward more optimal trajectories. That's basically the method. https://t.co/XiAaYAonwk
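
A hedged toy sketch of that guided sampling loop; the linear "denoiser", noise schedule, and stand-in return estimate below are placeholders, and the point is only the control flow of refining a whole trajectory while following Q-gradients.

```python
import torch

# Hedged toy sketch of value-guided diffusion sampling over a trajectory:
# each reverse step denoises the whole trajectory and nudges it along the
# gradient of a (stand-in) return estimate. Real Diffuser uses a trained
# U-Net denoiser and a proper noise schedule.

horizon, dim, n_steps, guidance = 32, 6, 50, 0.1
denoiser = torch.nn.Linear(dim, dim)           # stand-in for the trained model

def q_value(traj):
    """Stand-in return estimate over the whole trajectory."""
    return -(traj ** 2).sum()                  # prefers trajectories near the origin

traj = torch.randn(horizon, dim)               # start from pure noise
for t in range(n_steps):
    traj = traj.detach().requires_grad_(True)
    grad = torch.autograd.grad(q_value(traj), traj)[0]
    with torch.no_grad():
        denoised = traj - 0.05 * denoiser(traj)           # toy "denoising" update
        noise = torch.randn_like(traj) if t < n_steps - 1 else 0.0
        traj = denoised + guidance * grad + 0.01 * noise  # steer with the Q-gradient

print(traj.shape)  # (horizon, dim): a full plan refined all at once
```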

2022-05-23 19:14:24 Can we do model-based RL just by treating a trajectory like a huge image, and training a diffusion model to generate trajectories? Diffuser does exactly this, guiding generative diffusion models over trajectories with Q-values! https://t.co/UDwLsWBoqa ->

2022-11-22 18:23:38 This ends up working quite well in practice. DASCO gets good empirical results, and the experiments validate that the additional "auxiliary" generator, which is what leads to support constraints, really does lead to significantly improved performance. https://t.co/YJH2OYojoX

2022-11-22 18:23:37 If we use a regular discriminator, that's a distributional constraint. How do we get a support constraint? We discriminate between data and a mix of *two* actors -- a "good" one and a "bad" one! The aux generator captures the bad actions, so the main actor can keep the good ones. https://t.co/MR6tf0ZRig

2022-11-22 18:23:36 In these settings, ReDS significantly outperforms other offline RL methods. We also evaluate it on a fairly complex set of simulated robotic manipulation tasks. https://t.co/B8SDaiWIao

2022-11-22 18:23:35 Turns out that all that is needed is to slightly modify CQL so that the "negative" term (the one that pushes down Q) is a 50/50 mix of the policy and an "anti-advantage" distribution that picks up on *low* advantage actions (rho in the eq above, which optimizes this objective): https://t.co/SfgMB4TcIt
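
Spelled out, one plausible reading of that modification (my notation; the paper's exact objective may differ) is a CQL-style regularizer whose "push down" distribution is the 50/50 mixture described above:

```latex
% Hedged sketch (my notation) of a ReDS-style regularizer added to the Bellman error:
\[
  \mathcal{R}(Q) \;=\; \alpha \Big(
    \mathbb{E}_{s \sim \mathcal{D},\; a \sim \frac{1}{2}\pi(\cdot\mid s) + \frac{1}{2}\rho(\cdot\mid s)}\big[Q(s,a)\big]
    \;-\; \mathbb{E}_{(s,a) \sim \mathcal{D}}\big[Q(s,a)\big]
  \Big),
\]
```

where ρ is the "anti-advantage" distribution that concentrates on low-advantage actions within the data, so Q is pushed down on out-of-support and poor in-support actions rather than on everything the policy likes.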

2022-11-22 18:23:34 The first method, ReDS, modifies conservative Q-learning (CQL). CQL pushes down on Q-values not in the data, which implicitly makes the Q-function conform to the shape of the data distribution. ReDS reweights the terms to only push down on out-of-support actions. https://t.co/akLzTFqjka

2022-11-22 18:23:33 Offline RL requires staying close to data. Distribution constraints (e.g., KL) can be too pessimistic, we need "support" constraints that allow for any policy inside the support of the data. We developed 2 methods for this: https://t.co/u1UMAKutNL https://t.co/9c5Dwewjy6 Thread: https://t.co/ohmZhWxR6v

2022-11-23 18:27:42 w/ Han Qi, Yi Su, @aviral_kumar2 This will appear at #NeurIPS2022 https://t.co/X90UWhZFwq If you want to learn more about offline MBO, check out our blog post that covers our previous MBO method: https://t.co/E0d2ZMURCc

2022-11-23 18:27:41 We evaluate the resulting method on various MBO benchmarks: superconductor design, robot morphology optimization, etc. Averaging over the tasks, our method, IOM (invariant objective models) improves over prior methods, and has appealing offline hparam selection rules. https://t.co/QAwBOO8enU

2022-11-23 18:27:40 So we usually somehow limit mu_opt(x) to be close to mu(x) (e.g., KL constraint, pessimism, etc.). But what if we instead train the *representation* inside f(x) (the model/value function) to be *invariant* to differences between mu(x) and mu_opt(x)? https://t.co/JsYU5q7fOK
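
At a high level (again my notation, not the paper's exact formulation), this amounts to fitting the objective model while penalizing a divergence between the representation distributions under the data and under the optimized distribution:

```latex
% Hedged sketch (my notation): fit f on data while making the representation phi
% invariant between the data distribution mu and the optimized distribution mu_opt.
\[
  \min_{f,\,\phi}\;\; \mathbb{E}_{x \sim \mu}\big[(f(\phi(x)) - y(x))^2\big]
  \;+\; \lambda\, D\!\big( \phi_{\#}\mu \,\|\, \phi_{\#}\mu_{\mathrm{opt}} \big),
\]
```

where φ#μ denotes the distribution of representations φ(x) under x ~ μ, D is some divergence (e.g., estimated adversarially, as in domain adaptation), and λ trades off fit against invariance.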

2022-11-23 18:27:39 In offline [RL/MBO/bandits], we start with a data distribution ("mu(x)"), and find a policy ("mu_opt(x)") that is better (i.e., x ~ mu_opt(x) has higher utility). We estimate f(x) (or V(x) in RL) with a model, but if we change mu(x) too much, the model makes incorrect predictions https://t.co/eM7TkmXjZu

2022-11-23 18:27:38 Data-driven decision making (e.g., offline RL, offline MBO, bandits from logs) has to deal with distribution shift. But perhaps we can approach this as "domain adaptation" between the data and the optimized policy? In our new paper, we explore this unlikely connection. A thread: https://t.co/IfTYeM4AqR

2022-11-23 05:06:29 Paper: https://t.co/muDIHjvpLC Website: https://t.co/62iojrRFwh

2022-11-23 05:05:59 How can vision-language models supervise robots to help them learn a broader range of spatial relationships, tasks, and concepts? DIAL can significantly improve performance of instruction following policies by augmenting human labels with huge numbers of synthetic instructions. https://t.co/ytcd6Mbzj7 https://t.co/EiB2rb65UA

2022-11-27 22:34:00 There are additional talks, posters, and workshop talks from RAIL that I'll post about later! Also check out work by Jakub Grudzien (who is rotating with us this semester) + colleagues on discovering policy update rules: https://t.co/dM8KvMLi39 arxiv: https://t.co/jIYv6VcBZ0

2022-11-27 22:33:59 Also at 4:30, Hall J #333, Han Qi, @YiSu37328759 &

2022-11-27 22:33:57 Then at 4:30 pm CST, Hall J #505, @mmmbchang will present his work on implicit differentiation to train object-centric slot attention models. Talk: https://t.co/zG4sFVk8DF Web: https://t.co/XvsCYJYUUE YT video: https://t.co/VeiYJplJgN https://t.co/NjxdilM2Ih

2022-11-27 22:33:56 Also at 11, Hall J #303, @ben_eysenbach &

2022-11-27 22:33:55 Also at 11, Hall J #928, @ben_eysenbach &

2022-11-27 22:33:53 At 11:00 am CST Hall J #610, @ben_eysenbach &

2022-11-27 22:33:52 We'll be presenting a number of our recent papers on new RL algorithms at #NeurIPS2022 on Tuesday 11/29, including contrastive RL, new model-based RL methods, and more. Here is a short summary and some highlights, with talk &

2022-11-23 18:27:42 w/ Han Qi, Yi Su, @aviral_kumar2 This will appear at #NeurIPS2022 https://t.co/X90UWhZFwq If you want to learn more about offline MBO, check out our blog post that covers our previous MBO method: https://t.co/E0d2ZMURCc

2022-11-23 18:27:41 We evaluate the resulting method on various MBO benchmarks: superconductor design, robot morphology optimization, etc. Averaging over the tasks, our method, IOM (invariant objective models) improves over prior methods, and has appealing offline hparam selection rules. https://t.co/QAwBOO8enU

2022-11-23 18:27:40 So we usually somehow limit mu_opt(x) to be close to mu(x) (e.g., KL constraint, pessimism, etc.). But what if we instead train the *representation* inside f(x) (the model/value function) to be *invariant* to differences between mu(x) and mu_opt(x)? https://t.co/JsYU5q7fOK

2022-11-23 18:27:39 In offline [RL/MBO/bandits], we start with a data distribution ("mu(x)"), and find a policy ("mu_opt(x)") that is better (i.e., x ~ mu_opt(x) has higher utility). We estimate f(x) (or V(x) in RL) with a model, but if we change mu(x) too much, the model makes incorrect predictions https://t.co/eM7TkmXjZu

2022-11-23 18:27:38 Data-driven decision making (e.g., offline RL, offline MBO, bandits from logs) has to deal with distribution shift. But perhaps we can approach this as "domain adaptation" between the data and the optimized policy? In our new paper, we explore this unlikely connection. A thread: https://t.co/IfTYeM4AqR

2022-11-23 05:06:29 Paper: https://t.co/muDIHjvpLC Website: https://t.co/62iojrRFwh

2022-11-23 05:05:59 How can vision-language models supervise robots to help them learn a broader range of spatial relationships, tasks, and concepts? DIAL can significantly improve performance of instruction following policies by augmenting human labels with huge numbers of synthetic instructions. https://t.co/ytcd6Mbzj7 https://t.co/EiB2rb65UA

2022-11-22 18:23:38 This ends up working quite well in practice. DASCO gets good empirical results, and the experiments validate that the additional "auxiliary" generator, which is what leads to support constraints, really does lead to significantly improved performance. https://t.co/YJH2OYojoX

2022-11-22 18:23:37 If we use a regular discriminator, that's a distributional constraint. How do we get a support constraint? We discriminate between data and a mix of *two* actors -- a "good" one and a "bad" one! The aux generator captures the bad actions, so the main actor can keep the good ones. https://t.co/MR6tf0ZRig
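A rough sketch of a discriminator loss with this two-generator structure, with all names assumed for illustration; the full DASCO objective is more involved than this.

    import torch

    def dasco_style_discriminator_loss(disc, states, dataset_actions, main_actor, aux_actor):
        # the discriminator separates dataset actions from a 50/50 mixture of the main actor
        # and an auxiliary actor; the auxiliary actor soaks up the "bad" in-support actions,
        # so fooling the discriminator becomes a support constraint on the main actor
        real = disc(states, dataset_actions)
        half = states.shape[0] // 2
        fake_actions = torch.cat([main_actor(states[:half]), aux_actor(states[half:])], dim=0)
        fake = disc(states, fake_actions)
        bce = torch.nn.functional.binary_cross_entropy_with_logits
        return bce(real, torch.ones_like(real)) + bce(fake, torch.zeros_like(fake))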

2022-03-17 20:14:02 RT @GoogleAI: Presenting PRIME, a data-driven approach for architecting hardware accelerators that trains a #DeepLearning model on existing…

2022-03-17 20:09:15 If you want to learn more about data-driven (offline) model-based optimization and COMs, you can also check out our earlier blog post on model-based optimization algorithms on the BAIR blog: https://t.co/E0d2ZNcstK https://t.co/t2iJmvFhqE

2022-03-17 20:05:58 Neural nets can design microchips from data to speed up neural nets! Check out @ayazdanb's & With offline model-based optimization based on COMs, PRIME can optimize accelerators from data. https://t.co/zXr2IIk9Db

2022-03-17 16:46:42 RT @hausman_k: We're releasing one of the biggest real-world multi-task RL robotic datasets! Our MT-Opt dataset has ~ 1M (!) RL trajectori…

2022-03-12 08:11:00 CAFIAC FIX

2022-01-17 08:11:00 CAFIAC FIX

2022-01-11 08:11:00 CAFIAC FIX

2021-12-27 08:20:00 CAFIAC FIX

2021-11-06 23:20:00 CAFIAC FIX

2021-12-18 23:00:05 @George_Mazz I sometimes think that good AI research really should make people at least *somewhat* uneasy. Obviously we should be thoughtful and responsible, but meaningful, important technological progress is always at least a little threatening.

2021-12-18 22:56:49 @UgandanDev Thanks. That part is also somewhat inspired by this rather interesting writeup by Gerald Loeb ("Optimal isn't good enough"): https://t.co/0HSc6RMNY5

2021-12-18 22:53:13 @ylecun @pmddomingos Thank you Yann! That means a lot to me.

2021-12-15 18:25:24 As usual, let me know if there is some paywall nonsense that medium starts to put up. I think it shouldn't happen, but if it does, I might migrate this over to a personal blog or something...

2021-12-15 18:24:48 I wrote an article about how we should situate our RL agents in realistic environments if we want them to acquire flexible behaviors: An Ecological Perspective on Reinforcement Learning Goes with the real-world RL talk < Article: https://t.co/pbPG9l3QhZ

2021-12-15 04:52:40 I'm posting my talk from the deep RL workshop here for anyone who didn't catch it: https://t.co/dFojc3ixts The Case for Real-World Reinforcement Learning Or: Why Robinson Crusoe didn't need to be good at chess in the hopes we can all keep it real when it comes to RL

2021-12-14 19:16:31 The full paper on Intrinsic Control via Information Capture (IC2) (Information is Power) is now on arxiv: https://t.co/ZWkhsDXOQk Find out how to implement "Maxwell's demon" as an entropy-minimizing self-supervised agent that hoards information & https://t.co/4gN62o3W17 https://t.co/U4SdAlUkQn

2021-12-14 05:06:52 Can we meta-train on one task at a time, and accelerate learning of each new task? Continual meta-policy search (CoMPS) enables online meta-RL, so that each new task learns faster by learning to learn https://t.co/yEeFuiPKhw w @GlenBerseth @WilliamZhang365 G. Zhang @chelseabfinn https://t.co/aQGcDBwKHT

2021-12-14 01:00:07 @ethancaballero @agarwl_ I don't think it fully fixes it. It mitigates it, but I think it's quite possible that with more research one could discover better methods to address this issue!
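Relating to the IC2 tweet above (2021-12-14): a minimal sketch of a belief-entropy intrinsic reward, assuming a learned diagonal-Gaussian belief over a latent state; the actual IC2 variants differ in how that belief model is trained and used.

    import torch

    def belief_entropy_reward(belief_std):
        # intrinsic reward = negative entropy of a (hypothetical) diagonal-Gaussian belief
        # over the latent state, produced by a learned state-estimation model
        normal = torch.distributions.Normal(torch.zeros_like(belief_std), belief_std)
        # the less uncertain the agent is about the world, the larger the reward
        return -normal.entropy().sum(dim=-1)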
2021-12-13 21:31:52 My talk on why we should do RL in the real world is starting now at the deep RL workshop at @NeurIPSConf: https://t.co/18t15Op9ly 2021-12-13 18:57:35 This work was presented as a full-length oral at the NeurIPS Deep RL workshop: https://t.co/Is3skVO2mV Paper: https://t.co/esuL8Xfubf Work led by @aviral_kumar2, with @agarwl_, @tengyuma , @AaronCourville, @georgejtucker 2021-12-13 18:57:25 We can simply add DR3 to standard offline RL algorithms and boost their performance, basically without any other modification. We hope that further research on overparameterization in deep RL can shed more light about why deep RL is unstable and how we can fix it! https://t.co/VQhgh8dFg6 2021-12-13 18:57:13 In practice, we can simply add some *explicit* regularization on the features to counteract this nasty implicit regularizer. We call this DR3. It simply minimizes these feature dot products. Unlike normal regularizers, DR3 actually *increases* model capacity! https://t.co/B7ZBVgypsW 2021-12-13 18:57:00 But if we apply similar logic to deep RL as supervised learning, we can derive what the "implicit regularizer" for deep RL looks like. And it's not pretty. First term looks like the supervised one, but the second one blows up feature dot products, just like we see in practice! https://t.co/95tZrA4g3C 2021-12-13 18:56:46 When training with SGD, deep nets don't find just *any* solution, but a well regularized solution. SGD finds lower-norm solutions that then generalize well (see derived reg below). We might think that the same thing happens when training with RL, and hence deep RL will work well. https://t.co/GkEOO2fe3h 2021-12-13 18:56:31 Well, that's weird. Why do TD feature dot products grow and grow until the method gets unstable, while SARSA stays flat? To understand this, we must understand implicit regularization, which makes overparam models like deep nets avoid overfitting. 2021-12-13 18:56:17 Simple test: compare offline SARSA vs offline TD. TD uses the behavior policy, so same policy is used for the backup, but SARSA uses dataset actions, while TD samples *new* actions (from the same distr.!). Top plot is phi(s,a)*phi(s',a'): dot prod of current &amp 2021-12-13 18:55:51 Deep RL is hard: lots of hparam tuning, instability. Perhaps there is a reason for this? Turns out the same thing that makes supervised deep learning work well makes deep RL work poorly, leading to feature vectors that grow out of control: https://t.co/esuL8Xfubf Let me explain: https://t.co/f0EfM4wjGI 2021-12-13 02:22:40 WILDS, but now with unlabeled data! Representation learning from unlabeled data has the potential to be a powerful tool for improving model robustness, and WILDS v2.0 will hopefully enable further research on this. https://t.co/tpu0riVEWz 2021-12-09 19:42:17 @burgalon @NeurIPSConf @michaeljanner It's already out: https://t.co/98D21UVSMC 2021-12-09 04:34:33 Find out about how the principles behind CQL and MOPO can be combined into COMBO, a conservative model-based offline RL method! @TianheYu will present this tmrw 12/9 Thu at @NeurIPSConf 4:30 pm PT: https://t.co/xHf1vXA96q https://t.co/vU9JaarCuy 2021-12-09 04:18:03 Rewarding a policy w/ the accuracy of an underlying (latent space) model allows us to relate compression to reinforcement learning! 
@ben_eysenbach will present robust predictable control (RPC) @NeurIPSConf Thu 12/9 4:30 pm PT: https://t.co/QwuoMLRygK Spot: https://t.co/psTdS1HtTq https://t.co/JXbjNlXohP 2021-12-09 03:02:54 If you want to find out what kinds of mutual information maximization objectives are sufficient for RL, check out Kate Rakelly's @NeurIPSConf presentation tomorrow Thu 12/9: https://t.co/8vGfJ3Cnl9 Thu 12/9 4:30 pm PT! The paper is here: https://t.co/m2jCP1V6dL https://t.co/EkcezTPFUx 2021-12-08 21:09:28 Variational inference can be used to show that shaped goal-reaching objectives provide principled lower bounds on goal-reaching sparse rewards. Find out more at @timrudner and @Vitchyr's poster at @NeurIPSConf 8:30 am PT 12/9 Thu: https://t.co/JzF6Pth5oA Paper link https://t.co/hkCYH7D3fy 2021-12-08 21:06:46 RT @_prieuredesion: RECON was featured on Computer Vision News @RSIPvision: https://t.co/EFLMT9uzt2 I also presented RECON at @corl_conf i… 2021-12-08 05:25:42 Come find out how we can "compress" images so that user reactions (actions) in response to those images are preserved! Human-in-the-loop RL for compression with GANs. @sidgreddy will present PICO at @NeurIPSConf tomorrow at 4:30 pm PT: https://t.co/5YRx0ViEGf https://t.co/R9pNQx4jQM 2021-12-08 02:27:56 This makes RCE quite simple, and very effective at learning tasks without user-programmed reward functions, using only successful outcome examples. 2021-12-08 02:27:30 This problem statement is similar to VICE, which we've been using for many of our robotics experiments to learn rewards in the real world: https://t.co/0rLibIGN0H However, unlike VICE, there is only one model, which is both the classifier and the value function! 2021-12-08 02:26:43 The basic idea: instead of coding up a reward function by hand, provide a few example outcomes (states) that denote "success". RCE trains a classifier, which predicts whether an action will lead to "success" *in the future* https://t.co/f2cSfpLpYe 2021-12-08 02:25:32 Poster link: https://t.co/oCdbHOJgqr Oral link: https://t.co/UEGIXFjtA5 Blog post: https://t.co/H8Ri4uS0QI Paper: https://t.co/Djiv7RJzqY 2021-12-08 02:24:01 Classifiers can act as value functions, when a task is specified with a set of example outcomes. We call this method RCE. It's simple and works well! Find out more at @ben_eysenbach's poster Wed 12/8 4:30 pm at @NeurIPSConf, and full-length talk Fri 4:20 pm PT (oral sess 5)&gt 2021-12-08 02:19:33 RT @nick_rhinehart: Excited to present IC2, an agent that learns to discover and stabilize sources of uncertainty in partially-observed env… 2021-12-07 19:42:33 Find out how to "train on the test data" and get reliable likelihoods under distributional shift with BACS. Aurick Zhou will be presenting this work at @NeurIPSConf Wed 12/8 at 4:30 pm PT, poster session 4 poster C3! And you can check out the paper here: https://t.co/juFpiE6mb3 https://t.co/XucA8sKzCl 2021-12-07 19:11:33 Find out how Transformers can enable state-of-the-art offline RL results at @NeurIPSConf tomorrow! @michaeljanner will be presenting Trajectory Transformers 4:30 - 6:00 pm PT 12/8 (Wed): https://t.co/TZYIqRSwf4 Preview of the poster below. Paper is here: https://t.co/EiKXQGOPfx https://t.co/uxwfTmNx3Q https://t.co/0cJxKsHTWd 2021-12-07 04:46:54 IC2 will be presented at @NeurIPSConf by @nick_rhinehart tomorrow, Tue 12/7, at 4:30 pm PT in poster session 2, poster C0. 
You can check out the paper here: https://t.co/xZm2b4gDqO 2021-12-07 04:45:42 In the vizDoom video game environment, IC2 will look around to find enemies, and then shoot them, so that unpredictable enemies aren't there anymore (OK, this one is a bit violent... and maybe cause for some concern, but we'll find a way to apply it to more peaceful ends). https://t.co/vkN721AnXc 2021-12-07 04:44:27 For example, in a simple gridworld domain with moving objects that stop when the agent "tags" them, IC2 causes the agent to track down every object and tag it to stop its motion -- thus the agent always knows where everything is! https://t.co/57q02LUF4R 2021-12-07 04:43:27 Minimizing belief entropy forces the agent to do two things: (1) figure out where everything is (find &amp 2021-12-07 04:42:08 The idea behind IC2 (intrinsic control via information capture) is to instantiate this "belief entropy minimization" intuition into a practical unsupervised RL algorithm! There are a few variants of this principle, but they all train a latent belief model &amp 2021-12-07 04:40:49 This seems to violate the second law of thermodynamics. The explanation for why it does not is that information about the particles itself is exchangeable with potential energy (that's a gross oversimplifications, but this is just a tweet...). 2021-12-07 04:39:28 The "Maxwell's demon" thought exercise describes how information translates into energy. In one version, the "demon" opens a gate when a particle approaches from one side, but not the other, sorting them into one chamber (against the diffusion gradient). This lowers entropy. https://t.co/j8SKtceKAc 2021-12-07 04:37:56 Intrinsic motivation allows RL to find complex behaviors without hand-designed reward. What makes for a good objective? Information about the world can be translated into energy (or rather, work), so can an intrinsic objective accumulate information? That's the idea in IC2. A : https://t.co/H3xPmltZAI 2021-12-07 01:48:18 Come find out how the principles behind conservative Q-learning can facilitate selecting which data to share between tasks in multi-task RL! @TianheYu &amp 2021-12-06 22:47:19 If you want to find out why generalization in RL is hard, check out @its_dibya and @jrahme0's presentation at @NeurIPSConf tomorrow 12/7, 8:30 - 10:00 am PT poster session 1, poster C3! Meanwhile, check out @its_dibya's blog post on the epistemic POMDP: https://t.co/FBwQUrosQO https://t.co/Ieoc16DwLS 2021-12-06 22:42:00 In the real world, robots must train without someone resetting the world. Turns out that this can be an opportunity for a curriculum that can actually make learning easier! Check out Archit's presentation on VaPRL at @NeurIPSConf 8:30-10:00 am PT 12/7 poster sess 1, poster F1 https://t.co/P6yae6adwE 2021-12-06 22:38:46 Paper: https://t.co/u1FZBiUJZ1 Detailed blog post: https://t.co/3GReH3Z6vA 2021-12-06 22:38:23 The idea: construct unlabeled batches with different shifts at training time, and train the model to find the pattern (correlations) so as to do well on a labeled val set. Then at meta-test time, adapt to a new (unlabeled) batch. E.g., figure out the handwriting for a new user. https://t.co/E6cZ3ay6yt 2021-12-06 22:37:07 Can meta-learning be used to learn how to adapt to distributional shift *without* labels? 
On Tue 12/7 8:30-10:00 am PT, Marvin Zhang &amp poster session 1, poster A1 a short summary below: https://t.co/Aaaq2qtJPZ 2021-11-24 18:48:53 This leads to much better generalization for minigrid maze navigation and a complex image-based robotic manipulation environment. w/ @_prieuredesion, Peng Xu, Yao Lu, @xiao_ted, @alexttoshev, @brian_ichter https://t.co/Uh7A9trb5E 2021-11-24 18:46:31 This state can then be used with a high-level model based planner, or a high-level model-free RL method. Insofar as the low-level skills can generalize, the high-level policy has a much easier problem to solve, because VFS abstracts away perception. https://t.co/6fuVVqJZuR 2021-11-24 18:45:45 The idea is simple: if we have a bunch of low-level skills, each skill contributes its value function as one entry in a high-level state. So if we have skills for opening a door and opening a drawer, higher level sees the state as "can open door, can't open drawer." 2021-11-24 18:44:28 Value function spaces (VFS) uses low-level primitives to form a state representation in terms of their "affordances" - the value functions of the primitives serve as the state. This turns out to really improve generalization in hierarchical RL! https://t.co/yJqJwwCT6r Short &gt 2021-11-23 18:34:24 Which MI rep learning objectives are sufficient for RL? Kate Rakelly analyzes this in her new blog post https://t.co/L8PteaDk7y If you use an obj that is insufficient, you may get a representation that can't solve your task. Based on our NeurIPS paper https://t.co/Wg1vUe8Ouh https://t.co/oWI9xLSjbO 2021-11-22 20:52:27 An awesome opportunity for students who want to work on RL, robotics, and other cool topics at MILA :) https://t.co/gg8tTgX5vz 2021-11-22 18:52:07 The full bridge dataset that we are releasing uses a cheap widowX arm to collect 7,200 demonstrations for 71 tasks across 10 different domains (themed around household tasks). Find dataset, code, and paper link on the project website: https://t.co/fGG35ikRus 2021-11-22 18:50:46 However, if it has enough *different* tasks and domains, this turns out to actually work! In fact, including bridge data when training new tasks in new domains (in this case with imitation learning) leads to about a 2x improvement in performance on average! https://t.co/n6cmw22Uwb 2021-11-22 18:50:03 So if I collect a little bit of data for my task in my new domain, can I use a reusable dataset to boost generalization of this task? This is not a trivial question, since the "bridge data" does not contain either the new domain or the new task. https://t.co/876qCH8aKf 2021-11-22 18:49:22 The reason reusing data in robotics is hard is that everyone has a different lab environment, different tasks, etc. So to be reusable, the dataset needs to be both multi-domain *and* multi-task, so that it enables generalization across domains. 2021-11-22 18:48:23 Reusable datasets, such as ImageNet, are a driving force in ML. But how can we reuse data in robotics? 
In his new blog post, Frederik Ebert talks about "bridge data": multi-domain and multi-task datasets that boost generalization of new tasks: https://t.co/JbIbC9X1I1 A thread: 2021-11-19 20:01:51 Also, if you want to read the paper that we "borrowed" the Q-function from for Ant Maze, it's here: https://t.co/pksjjvvCkH @ikostrikov makes some really nice Q-functions 2021-11-19 19:58:46 This is joint work with Michael Janner &amp https://t.co/98D21Vdubc https://t.co/EiKXQGxenZ Code: https://t.co/gG30OKwjgJ 2021-11-19 19:58:34 This is significant because only dynamic programming methods perform well on Ant Maze (e.g., Decision Transformer is on par with simple behavioral cloning) -- to our knowledge Trajectory Transformer + IQL is the first model-based approach that improves over pure DP on these tasks 2021-11-19 19:58:13 But if we *combine* Trajectory Transformer with a good Q-function (e.g., from IQL), we can solve the much more challenging Ant Maze tasks with state-of-the-art results, much better than all prior methods. Ant Maze is much harder, because it requires temporal compositionality. https://t.co/OzNjmbM37X 2021-11-19 19:57:24 For control, we can simply run beam search, using reward instead of likelihood as the score. Of course, we could use other planners too. On the (comparatively easy) D4RL locomotion tasks, Trajectory Transformer is on par with the best prior method (CQL). https://t.co/JkWJx6oDWN 2021-11-19 19:57:07 It also makes *very* long-horizon rollouts successfully, far longer than standard autoregressive models p(s'|s,a). So something about a big "dumb" model works very well for modeling complex dynamics, suggesting it might work very well for model-based RL. https://t.co/b1zj5O4HYZ 2021-11-19 19:56:10 Although the Transformer is "monolithic", it *discovers* things like near-Markovian attention patterns (left) and a kind of "action smoothing" (right), where sequential actions are correlated to each other. So the Transformer learns about the structure in RL, to a degree. https://t.co/WPIpajbmFs 2021-11-19 19:55:53 Trajectory transformer is a "one big dumb model" approach to model-based RL: every single dimension of every state and action is a (discrete) token in a huge sequence. The model doesn't distinguish between states vs actions vs rewards, they're all just tokens. https://t.co/d1Vpz1YWgl 2021-11-19 19:55:39 We've updated Trajectory Transformer (Transformers + model-based RL) for NeurIPS camera-ready, now with more complete results that include Ant Maze, including value functions, and also a blog post summarizing the method! https://t.co/EiKXQGxenZ https://t.co/xoOMQznKLW A thread: 2021-11-17 04:56:44 @CsabaSzepesvari @daibond_alpha @nanjiang_cs Good motivation for another reunion workshop next year? 2021-11-12 02:10:52 AW-Opt combines AWAC and QT-Opt into a scalable algorithm for real-world offline RL with online finetuning, particularly for combining imitation learning and RL. Evaluated on multiple complex tasks from our work on large-scale robotic learning at Google! https://t.co/CmPDowzQNH 2021-11-10 20:01:31 I'll be presenting (and discussing) this position paper at @corl_conf tomorrow (Thu 11/11) at 6:30 am PT (sorry...), which is 14:30 GMT! See you there. Meanwhile, here are a couple slides from my talk to give you some idea of what I'm going for. https://t.co/yRDYdeMAVY https://t.co/wdrLOTordg 2021-11-10 19:56:00 Check out Natasha's tutorial tomorrow at @corl_conf, which will cover cool work on social learning, emergent complexity, and other topics! 
https://t.co/twAVYEc9d1 2021-11-10 18:50:05 On Thu 11/11, @katie_kang_ will present this work at @corl_conf! Poster session at 5 pm GMT (9 am PT). Come find out how different robots (and even humans!) can share data to learn a single unified navigational system with different per-robot dynamics. https://t.co/HLTTjJD11S https://t.co/iSzgtVTQIb 2021-11-10 18:47:04 Want to know how to get a robot to learn to clean up your room? Thu 11/11, we'll be presenting ReLMM at @corl_conf, poster 11:30 am GMT. Find out about autonomous real-world robotic RL that combines mobility + manipulation! https://t.co/RiPiTFkLL0 https://t.co/qYQJJbWVni 2021-11-09 18:12:26 How do we tune offline RL methods? In our full-length @corl_conf talk, @aviral_kumar2 &amp 2021-11-09 17:02:35 @pcastr @its_dibya The principle actually applies to any generalization across anything, but it’s easiest to describe across contexts in a contextual MDP, because this makes it simplest to define a train/test split. But the same thing happens if we have unseen (but in dist) states at test time 2021-11-08 18:44:17 MT-Opt will be presented at @corl_conf Tue 11/9 in the 11:30 am GMT poster session! Meanwhile, you can find out more about how large-scale robotic deep RL can learn diverse tasks here: https://t.co/yPQ7wYFjM5 Paper: https://t.co/gweqaiT7oC Video: https://t.co/R3Wg6TYnaP https://t.co/jH6qPBdxbT 2021-11-08 18:42:25 We'll present RECON in the 10 am GMT oral session on Tue 11/9 at @corl_conf (poster 11:30 am). If you want to get an early preview, you can find a recorded talk here https://t.co/tvCAXZvhcs Come find out how robots can learn to search for new goals! Paper https://t.co/WsEYRIb6P9 https://t.co/PP9csCOijU 2021-11-07 22:30:51 @minakhan01 That's very kind of you to say, thank you This also reminded me that I really need to prep for my class discussion tomorrow. 2021-11-07 20:53:06 Also, sorry about the slightly awkward format with the giant face taking up half the screen, that was the format I recorded it in that was requested by the organizers But I guess it's nice if you prefer to see the speaker in a talk... 2021-11-07 20:52:13 I'm reposting my BayLearn 2021 keynote talk on offline RL here: https://t.co/YgCyeFknzz Thank you to the BayLearn organizing committee for the invitation, it was a lot of fun. This talk covers CQL, IQL, and other algorithms &amp 2021-11-07 17:40:25 @sanjeevanahilan Conceptually quite related! We discuss this in the related work section of our NeurIPS paper (just wanted to keep the blog post brief and accessible). 2021-11-06 23:20:00 CAFIAC FIX 2021-11-06 19:50:00 CAFIAC FIX 2021-11-06 18:59:00 CAFIAC FIX 2021-11-01 19:20:00 CAFIAC FIX 2021-11-01 17:30:00 CAFIAC FIX 2021-10-13 19:14:28 RT @tonyzzhao: "Insert anything into anything!" New paper applying offline RL to industrial insertion. Test it with 12 new tasks, 100/100… 2021-10-13 18:41:06 RT @ikostrikov: Excited to present our work with @ashvinair and @svlevine, Offline RL with Implicit Q-Learning (IQL), a simple method that… 2021-10-13 18:30:10 Now we'll see if we can use IQL for our robotics applications. So far seems like a great choice, as it's very fast, performs great, and finetunes really well. 2021-10-13 18:29:33 This leads to state-of-the-art results on D4RL offline RL benchmarks, with particularly large gains on the hardest task (e.g., large ant mazes). Also excellent online finetuning results. And really fast runtime. 
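Relating to the Implicit Q-learning (IQL) tweets nearby (2021-10-13): the published IQL objective replaces the value regression with an expectile (asymmetric least-squares) loss over dataset actions. A minimal sketch, with the expectile value chosen for illustration:

    import torch

    def expectile_value_loss(q_values, v_values, expectile=0.8):
        # asymmetric least squares: errors where Q(s,a) > V(s) are up-weighted, so V(s)
        # bends toward an upper expectile of Q over *dataset* actions only
        diff = q_values - v_values
        weight = torch.where(diff > 0, torch.full_like(diff, expectile), torch.full_like(diff, 1.0 - expectile))
        return (weight * diff.pow(2)).mean()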
https://t.co/6vwifK12Qv 2021-10-13 18:29:18 Just by changing the loss function for V(s) like this, we can get V(s) to converge to the maximum *in-support* value function, resulting in honest-to-goodness dynamic programming! This allows learning the optimal policy in the data support. https://t.co/SPgwP4Z6T4 2021-10-13 18:29:06 Now the kicker: change the loss for V(s) 2021-10-13 18:28:47 In general, we could train another network to be the value function V(s), and then use a loss like the one below. If V(s) is trained by regressing onto Q(s,a) using dataset actions (remember, we *only* use dataset actions), it's *still* SARSA, no improvement. https://t.co/CXzS8Ev9cH 2021-10-13 18:28:30 Here is the idea: if we want to prevent *all* OOD action issues in offline RL, we could use *only* actions in the dataset. That leads to a SARSA update, which is very stable. But it learns the *behavior policy* value function, not the optimal value function: https://t.co/6omLFHuHWb 2021-10-13 18:28:17 Implicit Q-learning, or "I can't believe it's not SARSA": state-of-the-art offline RL results, fast and easy to implement w/ @ikostrikov, @ashvinair -&gt 2021-10-13 03:58:33 If you want to learn more about the algorithms that went into this, check out: PEARL (meta-learning with probabilistic encoder): https://t.co/Ebiwsdg4nH AWAC (actor-critic offline RL): https://t.co/913iRX0ppL SMAC (offline meta-RL algorithm): https://t.co/McmMFHkzJ8 2021-10-13 03:57:27 Even works on some super-hard challenge tasks, like RAM insertion. Find out more on the project website: https://t.co/XIjDBtfhwx arxiv: https://t.co/LOqgMH5Z48 video: https://t.co/zUbHMWgHu9 https://t.co/GGs73yVmTS 2021-10-13 03:57:04 For the connectors that don't work well, run additional online finetuning, which gets their performance up to 100% in about 5-10 minutes! https://t.co/0iezSp2Ugu 2021-10-13 03:56:09 The method: use 11 meta-training tasks to meta-train a PEARL-style encoder using AWAC. Use this to then adapt to new connectors: some work well, some don't. https://t.co/L1vJpNHeYY 2021-10-13 03:55:47 Offline RL + meta-learning enables industrial robots to learn new insertion tasks with near-perfect success rates with AWAC + PEARL + finetuning! w/ @tonyzzhao, @jianlanluo, @DeepMind, Intrinsic https://t.co/XIjDBtfhwx Short summary below: https://t.co/TjYuftKpwj 2021-10-12 17:45:46 For more, check out: Paper: https://t.co/66ztHfpms3 Website: https://t.co/TkJw1eZ4dJ Video: https://t.co/bly0VuzQAa w/ Laura Smith, Chase Kew, @xbpeng4, Sehoon Ha, Jie Tan. An awesome collaboration between @berkeley_ai and @GoogleAI! 2021-10-12 17:43:46 Why is this important? We can never pretrain robots so that they never fail. But with RL, the robot can just keep on learning, and adjust to recover from failure even in unexpected conditions. This is going to be crucial for real-world robustness. https://t.co/ufdtkAT5CL 2021-10-12 17:42:25 The main idea in our work is to use RL to finetune walking policies to the particular environment where they are deployed. We learn policies for going forward and backward with motion imitation and a highly agile reset controller, which enable automated real-world finetuning. https://t.co/wbWubTtEql 2021-10-12 17:42:05 Can walking robots be made robust enough to finetune with RL directly in the real world? 
In "Legged Robots that Keep on Learning", we explore this question, using learned highly agile reset controllers and sample-efficient RL: https://t.co/TkJw1eZ4dJ https://t.co/66ztHfpms3 -&gt 2021-10-12 04:00:53 RT @chelseabfinn: Robot learning is bottlenecked on good, reusable datasets We introduce: * a new dataset with demos of 71 tasks over 10 e… 2021-10-11 20:51:34 @maxhkw We tried ImageNet in prior projects. It can help a little, but not a lot. My sense is that visual features are only a modest part of the issue. We need to analyze what is transferred from bridge data more, but intuitively it seems like reaching toward objects is a part of it... 2021-10-11 17:34:10 Here is another bonus animation of more cool robot behaviors https://t.co/SSOPhAjw4a 2021-10-11 17:33:05 The dataset we collected consists of 7,200 demonstrations, over 71 tasks, and 10 different domains. This project was led by @febert8888 with many great collaborators! Paper: https://t.co/25guMobOVS Website w/ videos, data, code: https://t.co/fGG35ikRus 2021-10-11 17:32:55 Experimentally, this works great: blue is with bridge data, red is without. All tasks are unseen in the bridge data, and the environment is *also* unseen. Just including bridge data (containing other tasks and other environments) boosts generalization by 2x! https://t.co/athpGqQLan 2021-10-11 17:32:37 The point of "Bridge Data" is to serve as a large task-agnostic dataset that can be included (as pretraining or joint training) to boost generalization for a new task in a new domain. So everyone uses the same bridge data, and small in-domain datasets in their lab, for their task https://t.co/qCHYGoHqOl 2021-10-11 17:32:18 In robotic learning, we start from scratch for every experiment, with custom per-task data. Generalization is poor, because the dataset is narrow. In supervised learning domains, there are large reusable datasets (e.g., ImageNet, MS-COCO). What would be "ImageNet" in robotics? https://t.co/qyC9vfac5C 2021-10-11 17:31:54 When training your robot, don't start from scratch! In "Bridge Data", we study how big datasets robotic datasets (71 tasks, 7.2k demos!) bridge the generalization gap. For new tasks, use bridge data along with task data to boost generalization: https://t.co/fGG35ikRus A thread: https://t.co/ed8m1hS95s 2021-10-08 04:09:50 This leads to better accuracy *and* better calibration on unlabeled test points. Inspired by some classics on entropy minimization: Y. Grandvalet, Y. Bengio. Semi-supervised Learning by Entropy Minimization M. Seeger. Input-dependent Regularization of Conditional Density Models 2021-10-08 04:09:06 To avoid needing to store all training data, we can learn posterior q(theta) using any BNN approach, and then incorporate this as a regularizer when minimizing entropy at test time on unlabeled data. https://t.co/cHlCQLDQRu 2021-10-08 04:08:52 This naturally leads to an entropy minimization procedure at test time: get some unlabeled points, and then update the parameter posterior to get lower entropy on test points, but don't stray too far from parameter distribution on training set! https://t.co/TdgRlHmEZQ 2021-10-08 04:07:50 We need a better graphical model. What if we assume new datapoints are not arbitrary: if we are asked to classify a new OOD, it likely belongs to *one of* the classes, we just don't know which one! Now there is a relationship between theta and phi for each distribution! 
https://t.co/2BPaQCR0yh 2021-10-08 04:07:35 First: what information do we gain from observing unlabeled datapoints from a new distribution? We can draw a graphical model for this: x is input, y is label, theta is classifier params, phi parameterizes x distro. Unfortunately, if y is unobserved, x tells nothing about theta! https://t.co/E2w5hxaYyV 2021-10-08 04:06:49 Is there a principled way to adapt a model to distributional shift without labels? In our new paper, "Training on Test Data", we propose a Bayesian adaptation strategy based on BNNs and entropy minimization. w/ Aurick Zhou: https://t.co/juFpiE6mb3 A thread: 2021-10-07 20:52:02 RT @ben_eysenbach: Unsupervised skill learning methods based on mutual information (e.g., DIAYN) learn a wide range of skills, which are of… 2021-10-07 20:35:03 This could be viewed as a kind of good initialization for downstream policy finetuning, formally it optimizes this objective. https://t.co/sB93xNC0Oz 2021-10-07 20:34:46 Turns out they don't. It means no matter how many skills you try to learn, you will not capture all possible optimal policies! But there is some good news -- what you do learn is a group of skills whose average minimizes distance to the "hardest" (furthest) policy. https://t.co/3vYP2cPAuH 2021-10-07 20:33:39 RL objective is linear in state marginal space, which means optimal policies must be vertices of the orange polytope. Do unsupervised RL methods get all vertices (all optimal policies)? 2021-10-07 20:33:28 We can use a geometric interpretation of RL policies: the triangle represents a polytope over state probabilities (3 states here). Orange region is states you can reach. https://t.co/lf9QP2uZ0s 2021-10-07 20:33:14 For a while, @ben_eysenbach, @rsalakhu, and I have been trying to understand unsupervised RL, such as DIAYN https://t.co/ZH51PsSkqj Which skills do they learn? Can we develop a theory of unsupervised RL? Our paper https://t.co/zst5S8K3CC Thread on what we found (and that gif): https://t.co/dz2u1tnUC1 2021-10-07 18:29:46 Generative models can be used for compression. But if we combine them with an "RL-like" GAN, we can make it so that, when a user sees the compressed image, they do the same thing as if they saw the original. This is the idea behind pragmatic compression: https://t.co/N2wFFyV0uw 2021-10-07 04:23:13 The model is still tilted toward optimistic outcomes, but this actually aids in exploration, leading to faster learning. See more analysis in the paper: https://t.co/GII1kNcNiN Code for a toy example here: https://t.co/bflWYjhrG4 w/ @ben_eysenbach, @AlexKhazatsky, @rsalakhu 2021-10-07 04:20:51 The model optimizes the *same* objective: maximize reward (value function) and also fool the discriminator, thus producing trajectories that are realistic and have high reward. The discriminator punishes the model for cheating (going to high reward regions unrealistically). https://t.co/vr0FEfP6W4 2021-10-07 04:19:48 The policy objective is to maximize reward and also minimize difference between learned and true dynamics *on the states the policy visits*. This can be quantified w/ a discriminator -- like the generator in a GAN, the policy seeks out "realistic" trajectories (with high reward). https://t.co/AHWp01IiiK 2021-10-07 04:18:03 MnM instead optimizes a *lower bound* on the policy reward with respect to both the model and the policy, using a GAN-like model formulation. While this might seem a bit counterintuitive, it leads to an algorithm that provably improves a bound on the reward. 
https://t.co/AfpVq2ZfVe 2021-10-07 04:16:36 Regular model-based RL methods train the model with MLE, but better models don't necessarily translate to better policies, hence understanding convergence is difficult. Can we devise a MBRL method where the model optimizes an objective that makes it better *for the policy*? https://t.co/0EfpOa0bmD 2021-10-07 04:14:53 Can we devise model-based RL methods that use the *same* objective for the model and the policy? The idea in Mismatched no More (MnM) is to devise a single objective that can be optimized by both model and policy that provably lower bounds reward. A thread: https://t.co/zxFy7nfpii 2021-09-27 17:08:07 How do we get robots to learn what they *can* do from prior data, so that they can learn what they *should* do via RL? Visual affordance learning (VAL) does this via goal-conditioned RL and generative models. New blog post by @ashvinair &amp https://t.co/S1H0UnKy1X https://t.co/OkgBoq73FD 2021-09-23 17:09:48 This paper will be presented at CoRL 2021, with @aviral_kumar2, Anikait Singh, Stephen Tian, @chelseabfinn https://t.co/Yg9YDMrIA0 https://t.co/WqeCOQeGQ5 2021-09-23 17:08:03 A few things that I think are interesting: (1) we can do capacity/arch/hyperparam tuning *without* full OPE (which is very hard) 2021-09-23 17:06:29 We evaluate these guidelines on a simulated robotic task, and two different real-world robots, and find that it works well across the board, using the same alpha=1.0 CQL parameter and fully offline selection of model capacity, regularization, etc. https://t.co/zeCp1Gdhzt 2021-09-23 17:05:19 Of course, the true return is unknown during offline training, but we can still use our understanding of the trends of estimated Q-values to provide guidelines for how to adjust model capacity. These guidelines are not guaranteed to work, but seem to work well in practice. https://t.co/TTaI5x2iwP 2021-09-23 17:02:17 In supervised learning, we have training and validation sets, and this works great for tuning. There is no equivalent in RL, making tuning hard. However, with CQL, when there is too much or too little model capacity, we do get very characteristic estimated vs true Q-value curves https://t.co/s5YE6O7Ger 2021-09-23 17:00:42 Offline RL lets us run RL without active online interaction, but tuning hyperparameters, model capacity etc. still requires rollouts, or validation tasks. In our new paper, we propose guidelines for *fully offline* tuning for algorithms like CQL https://t.co/Yg9YDMrIA0 A thread: https://t.co/ibqI3X7FDG 2021-09-17 20:57:18 RT @hausman_k: When working on MT-Opt, we saw that sharing data in multi-task RL can make or break the algorithm. Conservative data shari… 2021-09-17 16:54:21 CDS was developed by @TianheYu &amp 2021-09-17 16:52:14 In practice, CDS is quite easy to implement, using the Q-function from CQL to estimate what to share, and leads to significant improvements across a range of multi-task offline RL problems. https://t.co/rRREPShlrE 2021-09-17 16:51:05 At the same time, sharing as much as possible is good, because it gives you more samples. This tradeoff between sharing more to get samples and sharing less to get distributional shift is captured by our main theoretical result, which formalizes this tradeoff. https://t.co/nMY5oqJmwr 2021-09-17 16:50:06 Now it's clear we should choose what to share so that pi_Beff is close to what we think a good policy pi should be, to minimize the second term (the divergence). 
This intuitively means: share from tasks that are similar to how you *think* the new current task should be done! 2021-09-17 16:49:22 The induced pi_Beff is the behavior policy you constrain too, and you can't go too far from pi_Beff (else the Q-function diverges). But choosing what to share *changes* pi_Beff. Imagine that offline RL optimizes an objective of this form (turns out CQL does this more or less): https://t.co/cFOG90HNqM 2021-09-17 16:47:35 More technically, imagine "pi_Beff" is the "effective" policy induced by data sharing for task B, which is a mixture of the behavioral policy for task A and task B. For offline RL methods that are conservative (necessary to avoid divergence!), pi_Beff is very important. 2021-09-17 16:45:08 We first hypothesize that the reason sharing data across tasks (i.e., using data for task A to help task B) makes offline RL hard is that it increases distributional shift: if task A is irrelevant, we're adding data from a very different distribution. 2021-09-17 16:44:13 Multi-task RL is hard. Multi-task offline RL is also hard. Weirdly, sharing data for all tasks (and relabeling) can actually make it harder. In conservative data sharing (CDS), we use conservative Q-learning principles to address this. arXiv: https://t.co/4hhl639bgh A thread: https://t.co/wo3ZLrj52A 2021-09-17 16:40:37 RT @chelseabfinn: In offline RL, adding data from other tasks can boost generalization but can surprisingly *hurt* performance We analyze… 2021-09-17 15:30:40 RT @gradientpub: Check out our interview with @berkeley_ai professor @svlevine! We talk about his start in research, DiscoRL, sequence mod… 2021-09-16 17:29:26 A fun discussion with Andrey on robotic learning, RL, and where the arm farm came from. https://t.co/Q2oZsMW2az 2021-09-14 05:07:10 RT @_prieuredesion: RECON accepted as an Oral Talk at @corl_conf 2021!! What are the odds we actually get a live audience in London? 2021-09-12 17:23:24 @Shixo I suspect that is right, but it’s not intended to improve exploration but rather final robustness and generalization (via compression). I suspect that, like SMiRL, it can be combined with novelty seeking exploration in the obvious way (it might seem contradictory but it’s not) 2021-09-10 16:47:00 @TongzhouWang It's not a generative model of states 2021-09-10 16:45:56 @pcastr The policies trade off information and reward, so fully open-loop policies are definitely not going to be optimal in general, but the goal is just to get them to minimize unnecessary input (that's of course an old idea, the difference is in the approach) 2021-09-10 03:14:51 @Julius_Frost The policy still gets a standard task reward -- all this stuff is just added on top of regular RL. 2021-09-10 02:13:50 This work is by Ben Eysenbach, w/ @rsalakhu Paper: https://t.co/xIgZPiKiKr Website with short video: https://t.co/NmR8Aa8EaU Code: https://t.co/MnNpeSXiTB 2021-09-10 02:13:39 Empirically, this leads to much better policies if we want low bitrates, more robust behavior to sensor dropout and perturbations, and more compressed representations that help generalization! https://t.co/A9LKoXz9dz 2021-09-10 02:13:27 Theoretically, one nice consequence of this is that, if we use a big enough multiplier on the accuracy term, we train a policy that tries to be as good as possible in open loop (see theorem). https://t.co/VafSpzDZcY 2021-09-10 02:13:10 This means the policy, encoder, and model "collude" to find a z that is minimal (easiest to predict) but still enough to take good actions. 
This is exactly the same as compressing the state to the smallest number of bits needed to get high reward! 2021-09-10 02:12:57 This means we train an encoder phi(z | s), a policy pi(a | z), and a model m(z[t+1]|z[t],a[t]). The only thing that makes z non-trivial is that it must decode to actions for the policy, so it only contains the info necessary to take actions. https://t.co/vFN00M9keT 2021-09-10 02:12:42 The idea is simple: learn a model m(z[t+1]|z[t],a[t]) that predicts a future latent state, and reward the policy for taking transitions for which the predicted latent state matches what we get by encoding the *actual* next latent state. https://t.co/vNJnxHkZuV 2021-09-10 02:12:26 What happens if, when we learn a model in RL, we reward the policy for doing things where the model is correct? Turns out that this is (surprisingly) the same as minimizing the number of bits of state that the policy needs access to. Paper: https://t.co/KIDQOM3iaW Thread below: https://t.co/9HBbJ9GrnM 2021-08-17 16:05:01 RT @SeungmoonS: "Deep reinforcement learning for modeling human locomotion control in neuromechanical simulation" is now published at JNER:… 2021-08-12 07:57:06 RT @xbpeng4: We will be presenting our work on adversarial motion priors tomorrow 4pm PST at #SIGGRAPH2021. Drop by the session if you want… 2021-08-09 19:29:21 @pmddomingos But can one get a Nobel Prize for 280 characters? 2021-08-08 22:51:49 @KuanFang Congratulations Kuan! Looking forward to working with you soon 2021-08-06 17:17:43 RT @ColinearDevin: So excited to release our work on real world autonomous learning! https://t.co/CpgZrp5Zhk 2021-08-06 03:26:05 RT @GlenBerseth: New work on fully autonomous real-world robotic learning with mobile manipulators. One of the exciting parts of this work… 2021-08-05 03:30:18 RT @abhishekunique7: New work on learning how to grasp and navigate with mobile robots using RL. What I find very exciting is the ability o… 2021-08-05 03:24:59 Arxiv: https://t.co/RiPiTFkLL0 Website with video: https://t.co/DNSnELotdX w/ Charles Sun, @jendk3r, @ColinearDevin, @abhishekunique7, @GlenBerseth 2021-08-04 20:30:22 Arxiv: https://t.co/RiPiTFkLL0 Website with video: https://t.co/DNSnELotdX w/ Charles Sun, Jedrzej Orbik, Coline Devin, Brian Yang, Abhishek Gupta, Glen Berseth 2021-08-04 20:30:12 The more ReLMM trains, the better it gets. It can train for 60+ hours autonomously (with just battery changes), and performance just keeps getting better and better, as shown below. https://t.co/RacLGhjPFB 2021-08-04 20:29:58 The cool thing is that we can run it autonomously in many different rooms, with the only human effort being to change batteries. https://t.co/0j99Jqtxdo 2021-08-04 20:29:46 ReLMM makes it possible for robots to learn continually and without resets. In our evaluation, the task is to clean up a room, picking up every object and depositing it in a basket. ReLMM uses a grasping and navigation policy, with uncertainty from grasping providing exploration. https://t.co/ZuvB4Ly9Ro 2021-08-04 20:29:30 Robots with RL can learn autonomously in the real world, and this allows them to keep getting better and better. 
We explore this in our recent paper, introducing ReLMM, a system for fully autonomous real-world robotic with mobile manipulators: https://t.co/DNSnELotdX A thread: https://t.co/PCGdXStDUp 2021-08-04 03:23:17 RT @HaoSuLabUCSD: Excited to announce the SAPIEN Manipulation Skill challenge, a learning-from-demonstrations challenge from 3D visual inpu… 2021-07-30 04:23:54 @Brugzstopher One way to think of it is that it's just like regular meta-RL during this step, only the rewards for the tasks are not real rewards, but produced by the model. This still works, because those tasks just need to be varied and representative of downstream tasks for meta-training. 2021-07-29 20:25:14 When RL learns in the real world, it can't push a "reset button" to try again, but must do lifelong reset-free learning. Maybe harder, but also easier, because it gives an opportunity to automatically build a curriculum! We leverage this in our new paper: https://t.co/TdsAmp8XPw https://t.co/P6yae6adwE 2021-07-29 18:57:11 @pfau I prefer "intellectrician". Feels more apt sometimes... 2021-07-29 04:17:30 RT @hausman_k: Resets are one of the most limiting, often under-emphasized requirements of current robotic RL methods. They are hard to aut… 2021-07-28 17:19:47 And of course I should add that theoretical RL research *has* been examining environment properties through assumptions (e.g., linear MDP), but the empirical RL research community has given less attention to this, preferring standardized benchmarks. 2021-07-28 17:17:30 This is almost obvious: we know in supervised learning data is just as critical as algorithm we know that what animals learn is extremely dependent on their environment Likely impossible to get "general-purpose" intelligent agents unless they are situated in a proper environment 2021-07-28 17:15:44 I'm excited to co-organize this WS on a unique topic. The "ecological" perspective on RL is one that examines the role of environment, task, data, etc. rather than just the algorithm, and I think this is really important. We had an initial paper on this: https://t.co/35ibxthqyY https://t.co/SII4rcDyFH 2021-07-27 17:16:27 We evaluated it on some fun robotic manipulation tasks, where we test generalization to entirely new tasks with similar objects. These are based on tasks from the VAL paper: https://t.co/pf7qqtM8i1 w/ @Vitchyr @ashvinair Laura Smith, Catherine Huang https://t.co/WqjZfbMv8s 2021-07-27 17:13:59 This works very well. SMAC outperforms prior offline meta-RL methods by a huge margin, and more or less matches the performance of standard online meta-RL (which are not limited in how many episodes w/ rewards they get), even on relatively difficult suboptimal datasets. https://t.co/UHO8R3nvdH 2021-07-27 17:12:26 This is much easier: it requires online data collection, but doesn't require real tasks, hence the online phase is self-supervised. The setup: get *offline* data labeled with rewards, then some self-supervised online data, and then we're done. https://t.co/iRLOBaNHnS 2021-07-27 17:11:46 Fortunately, fixing this problem just requires "reconciling" the adaptation (i.e., learned alg) with the exploration (i.e., the trajs meta-trained policy produces) -- it does *not* require real tasks! So SMAC just invents fake tasks, rolls them out, and uses this to "reconcile" 2021-07-27 17:10:31 The reason is subtle: meta-RL learns to adapt *and* explore, but when it explores for a new task, those exploration trajs look different from the offline data. 
So any meta-learned alg that is used to adapting to the offline trajs will get messed up when exposed to the online ones https://t.co/Omh5pu9lrK 2021-07-27 17:08:51 Offline RL suffers from distribution shift &lt 2021-07-27 17:07:37 Offline RL+meta-learning is a great combo: take data from prior tasks, use it to meta-train a good RL method, then quickly adapt to new tasks. But it's really hard. With SMAC, we use online self-supervised finetuning to make offline RL work: https://t.co/mtgZTivlVs A thread: https://t.co/mx7caQSrlx 2021-07-25 22:30:22 Website: https://t.co/gMEncxzxzA Paper: https://t.co/URRPDUl6tL Talk video: https://t.co/EbHk7GpHMQ (the paper is currently held up for moderation on arxiv, we'll post the arxiv link as soon as that goes through... its been stuck for several weeks) 2021-07-25 22:29:13 While this sort of "pragmatic compression" is in its infancy, I think it could be tremendously valuable in the future: it's easy to do the A/B testing necessary to train the discriminator, and PICO does not need knowledge of the downstream task, just which action users took. 2021-07-25 22:28:17 This is obviously a proof of concept, and we crank up the compression factor way too high. See below for example: if the user's downstream goal is to check if a person is wearing glasses, eventually PICO scrambles *everything* else, including their gender, but keep the glasses! https://t.co/k3i6TCDHRV 2021-07-25 22:27:03 The compression itself is done with a generative latent variable model (we use styleGAN, but VAEs would work great too, as well as flows). PICO basically decides to throw out those bits that it determines (via its GAN loss) won't change the user's downstream decision. 2021-07-25 22:25:50 The idea is pretty simple: we use a GAN-style loss to classify whether the user would have taken the same downstream action upon seeing the compressed image or not. Action could mean button press when playing a video game, or a click/decision for a website. https://t.co/ldVqu4yf8o 2021-07-25 22:24:55 An "RL" take on compression: "super-lossy" compression that changes the image, but preserves its downstream effect (i.e., the user should take the same action seeing the "compressed" image as when they saw original) https://t.co/gMEncxzxzA w @sidgreddy &amp &gt 2021-07-25 22:21:42 RT @sidgreddy: PICO is a “pragmatic” image compression system that preserves *downstream user behavior* instead of appearance. Idea is to c… 2021-07-24 20:55:59 An extended version of my talk on self-supervised offline RL from today's ws. Discusses how offline RL could provide a general self-supervised pretraining objective. Covers CQL https://t.co/TYL7RTNbO7 AM https://t.co/aLDnuH9an7 VAL https://t.co/Gf5LBr1ZV2 https://t.co/UABGbZOmDg 2021-07-24 16:47:35 I’ll be giving a talk about how reinforcement learning might provide an effective self supervised learning approach at the Self-Supervised Learning Workshop at #ICML2021 at 1:20 pm PT today (Sat): https://t.co/VYuxTXLYj3 Hope to see you there! 2021-07-24 01:54:58 @abhishekunique7 @uwcse @berkeley_ai @pabbeel Congratulations Abhishek! It was great to work together these past years, looking forward to many more collaborations in the future! 2021-07-23 21:32:44 We'll also present conservative data sharing (CDS), a new algorithm that provides a principled way to select which data to share between tasks in offline RL to minimize distributional shift! 
CDS will also be presented 6 pm PT Sat in the RL theory WS: https://t.co/4mrKiQVQ3r https://t.co/HePVTEVFmR 2021-07-23 21:31:33 We'll present CoMPS, an algorithm for online continual meta-learning, where an agent meta-learns tasks one by one, with each task accelerating future tasks. By @GlenBerseth, WilliamZhang365, @chelseabfinn https://t.co/Fo9an7NJW7 2021-07-23 21:30:11 We'll show how RL can control robots that learn to clean up a room, entirely in the real world. By Charles Sun, @ColinearDevin, @abhishekunique7, @jendk3r, @GlenBerseth. https://t.co/sKjUlGnuhd 2021-07-23 21:29:11 RAIL will be presenting a number of exciting late breaking poster results at the RL4RealLife WS #ICML2021 (8 pm PT today!): https://t.co/dgpL57T6ll Algorithms for real-world RL w/ mobile manipulators, lifelong meta-learning methods, principled multi-task data sharing. A thread: https://t.co/v5r4s5oCHu 2021-07-23 21:24:53 @simsamsom @michaeljanner @qiyang_li You don't have to outrun the mountain lion, you just have to outrun the humanoid that is not using the trajectory transformer 2021-07-23 17:08:17 We'll be presenting how robotic hands can learn complex tasks entirely in the real world at #ICML2021 RL 4 Real Life WS: https://t.co/dgpL57T6ll Fri 7/23 (today), 4pm PT spotlights, 8pm PT poster Full talk here: https://t.co/82u7QxaK3v Paper: https://t.co/iUBV1Cid7K https://t.co/OFEB8f5GdE 2021-07-23 05:05:43 Want to learn about trajectory transformers? @michaeljanner & Unsupervised RL 9:30am PT poster https://t.co/70bUUxvF6i A6 RL for Real Life WS 8pm PT poster https://t.co/2OAZWg9XP2 A1 https://t.co/jiMFsZ5sYK 2021-07-23 05:02:19 RT @agarwl_: A thread by @svlevine about this work: https://t.co/MylcoFbTIR Joint work with @aviral_kumar2, @AaronCourville, @tengyuma and… 2021-07-23 03:43:09 @ramav_matsci My guess is probably not, because limited expressivity causes some major issues (see eg https://t.co/utlkdqTcLn). In some sense our result shows that this issue causes big networks to lose expressive power when used in TD updates. 2021-07-23 03:21:42 The workshop paper is here: https://t.co/VpF7tvbXcJ We'll be releasing a full-length paper on arxiv after a few finishing touches and revisions after the workshop. 2021-07-23 03:19:26 This helps across the board for offline RL, for different offline RL algorithms, improves the stability of these methods (making it easier to pick the # of gradient steps), and mitigates the implicit under-parameterization effects that we analyzed in our prior work. https://t.co/xMvpLWmO7F 2021-07-23 03:18:19 This is actually very bad, because time-correlated features alias good actions to bad actions, and lead to horrible solutions. Fortunately, once we recognize this, we can "undo" the bad part of implicit regularization with good *explicit* regularization, which we call DR3. 2021-07-23 03:17:38 We provide both empirical and theoretical analysis that suggests the possibility that the same implicit regularization that makes supervised learning work actually *harms* RL with TD backups, due to bootstrap updates producing time-correlated features. 2021-07-23 03:16:54 One might surmise the same will be true in deep RL: deep RL will work well because it enjoys the same implicit regularization as supervised learning, whatever that might be -- seems reasonable, right? But deep RL seems really finicky, unstable, and often diverges... 2021-07-23 03:16:07 Deep networks are overparameterized, meaning there are many parameter vectors that fit the training set. So why do they not overfit?
While there are many possibilities, they all revolve around some kind of "implicit regularization" that leads to solutions that generalize well. 2021-07-23 03:14:50 You can watch the talk in advance here: https://t.co/NItsnNcI9M And then come discuss the work with Aviral at the poster sessions! This work is not released yet, but it will be out shortly. We're quite excited about this result, and I'll try to explain why. 2021-07-23 03:13:44 In RL, "implicit regularization" that helps deep learning find good solutions can actually lead to huge instability. See @aviral_kumar2 talk on DR3: 7/23 4pm PT RL for real: https://t.co/4mrKiQVQ3r 7/24 5:45pm PT Overparameterization WS talk https://t.co/1lN1IIwbVs #ICML2021 &gt 2021-07-22 00:02:32 @CsabaSzepesvari @lawrennd @tdietterich @rodneyabrooks Worth mentioning that the specific strategy matters: they focused on a *very* simulation heavy approach, which doesn't benefit from real-world data and does not have any notion of "flywheel" -- unsurprising that this would be incompatible with the rest of what they do. 2021-07-22 00:00:29 @CsabaSzepesvari @lawrennd @tdietterich @rodneyabrooks I'm also uneasy about this take (though perhaps there is a phrasing ambiguity). I think there has been lots of progress on motor ctrl and robotics since 2015, including in RL. OpenAI picked a strategy, which didn't pan out for them, that doesn't mean no progress is happening. 2021-07-21 18:45:55 (small typo, "algorithmic causality" in the first post should have been "algorithmic independence," which is a property used to *show* causality) 2021-07-21 18:44:53 When actions compete over who gets to go, the right type of auction will cause them to bid their real Q-value! And this preserves modularity, leading to better transfer. You can also read about our previous work on CVS here: https://t.co/7tTgAsChV0 2021-07-21 18:44:13 It turns out that societal decision making (CVS) from our previous paper has this property. In CVS, each action is a different "agent," and the actions "compete" to decide who goes: each action bids some utility, the winner goes and collects reward (positive or negative). https://t.co/Gjlfr16tEy 2021-07-21 18:43:11 Q-learning is better than policy gradient, and tabular Q-learning *is* modular (because values at different states are independent), but with function approximation, we need some kind of independence between the mechanisms that choose different actions. 2021-07-21 18:42:07 The basic idea is very simple: if decisions are made by independent mechanisms, then transfer is easier if only a few of those mechanisms need to change. What is interesting is which existing algorithms have this kind of independent mechanisms property. 2021-07-21 18:41:23 Want to learn how algorithmic causality leads modular RL methods to transfer better? 
Check out @mmmbchang's long talk at #ICML2021 tomorrow 7/22 at 5:00 am PT, or watch it right here: https://t.co/fz69J6KMWg https://t.co/aNM5T37e1E https://t.co/lmyhZTgBs3 Short thread: https://t.co/Aqdh1hjktU https://t.co/N4dO1baLOi 2021-07-21 04:11:57 Meta-learning can do some pretty useful things when applied properly :) Congrats to @mike_h_wu, @chrispiech &amp 2021-07-20 19:31:16 This work is philosophically related to some of our prior work on surprise minimization (i.e., maintaining homeostasis) as a way to get intrinsic motivation: https://t.co/H2aL3n7it0 2021-07-20 19:30:24 Adversarial surprise: an "intrinsic motivation arms race" where Agent A tries to surprise Agent B, and Agent B wants to avoid getting surprised. Both a way to get emergent complexity, and provide some appealing coverage guarantees for exploration. Check out the paper below: https://t.co/USsuRlmvGH 2021-07-20 16:41:16 Model-based RL with images and actual trajectory optimization methods, that are more sophisticated than random whatever (sorry, CEM craze is partly my fault... but we really can do better!). Cool thing here is that collocation methods can be applied directly to latent states. https://t.co/aEuj03ivAn 2021-07-20 16:37:54 This work on learning by observing other agents presented at #ICML2021, 5:25 pm PT today (7/20) https://t.co/cMf5ZASnuC Can RL exploration &amp 2021-07-20 01:05:17 Spotlight is Tue Jul 20 05:35pm PT here: https://t.co/rm4imSshjt Arxiv: https://t.co/aLDnuH9an7 Video: https://t.co/7z1x9ppmoH My favorite thing about this paper is how it gives a glimpse into how goal-conditioned RL can provide "universal offline self-supervised pretraining" 2021-07-20 01:03:35 Goal-conditioned offline RL, with real data, can enable robots to perform diverse tasks and provide general self-supervised objectives for RL that help with dwnstream tasks. See @YevgenChebotar's talk on this (Actionable Models) Tue 7/20 5:35pm #ICML2021: https://t.co/GMsIUMZlPM https://t.co/wyp8YbZ7fW 2021-07-19 23:49:19 RT @shiorisagawa: Just in time for ICML, we’re announcing WILDS v1.2! We've updated our paper and added two new datasets with real-world di… 2021-07-19 19:00:06 This makes calibration much better on OOD inputs, and scales well even to very large conv nets! You can find out more in Aurick's blog post: https://t.co/ES4l6oP6vJ The ICML spotlight (6:25 am PT Tue 7/20) is here: https://t.co/zv2uUAs9Ph 2021-07-19 18:58:36 This is not new: NML is a well known concept. In our paper, we show how to make it tractable for deep nets. Instead of retraining the model for each test point, we pretrain a Bayesian posterior over parameters, and finetune on *only* the new test point with this regularizer https://t.co/wcCygKwQfX 2021-07-19 18:57:16 If we visualize the normalized max likelihood (ratio of probabilities of models assigning different labels), we see that points far from the data are highly uncertain (unlike the original model, which is very certain but very incorrect). Here shown for different reg levels lambda https://t.co/P11NAbmXBF 2021-07-19 18:55:43 It's easiest to get an intuition for this in a 2D logistic regression example. When we label the pink point (lower right) with either class, the decision boundary changes a lot -- the model class can explain *either* label with high confidence, so the label is unknown. https://t.co/wyGFHT9QFA 2021-07-19 18:54:43 Let's say we have an OOD input (below), and we don't know what class it is. Standard neural nets might give an overconfident response. 
The idea in (C)NML is to label it with each label, finetune the model, and then use the ratio of resulting likelihoods. https://t.co/K0bt3qdDWW 2021-07-19 18:53:28 Training on the test set is usually bad, but with amortized conditional normalized max likelihood (ACNML), it can improve calibration on OOD inputs (without "cheating") -- Aurick Zhou will present ACNML at #ICML2021 tmrw Tue 7/20 6:25 am PT https://t.co/piwcU5wMs4 Short summary: 2021-07-19 17:36:09 RT @brandontrabucco: Exciting collaboration with @aviral_kumar2, @YoungGeng, and @svlevine in BAIR! Want to design drug molecules, protein… 2021-07-18 19:25:48 @EliSennesh Kevin Li and @abhishekunique7 made them, my contribution was to annoy the heck out of them by telling them to make the figures better (with minimal info how to do so) There are definitely better tools, but #1 advice (as with any art/presentation thing) is to iterate a lot 2021-07-18 19:21:33 We'll present MURAL at #ICML2021. Tue Jul 20 07:35 PM -- 07:40 PM (PDT): https://t.co/VOj1A2PtVz w/ Kevin Li, @abhishekunique, Ashwin Reddy, Vitchyr Pong, Aurick Zhou, Justin Yu 2021-07-18 19:19:50 This ends up working very well across a wide range of manipulation, dexterous hand, and navigation tasks. To learn more about NML in deep learning, you can also check out Aurick Zhou's excellent blog post on this topic here: https://t.co/ES4l6oP6vJ https://t.co/gum3xgd5R7 2021-07-18 19:18:37 Doing this tractably is hard, because we need two new classifiers for *every* state the agent visits, so to make this efficient, we use meta-learning (MAML) to meta-train one classifier to adapt to every label for every state very quickly, which we call meta-NML. https://t.co/ehBLef43QB 2021-07-18 19:17:50 This provides for exploration, since novel states will have higher uncertainty (hence reward closer to 50/50), while still shaping the reward to be larger closer to the example success states. This turns out to be a great way to do "directed" exploration. 2021-07-18 19:16:35 This is where the key idea in MURAL comes in: use normalized max likelihood (NML) to train a classifier that is aware of uncertainty. Label each state as either positive (success) or negative (failure), and use the ratio of likelihoods from these classifiers as reward! https://t.co/CGKTddhpEJ 2021-07-18 19:14:59 The website has a summary: https://t.co/wE2ybwVYBn If the agent gets some examples of high reward states, we can train a classifier to automatically provide shaped rewards (this is similar to methods like VICE). A standard classifier is not necessarily well shaped. https://t.co/kX5ULbXiUy 2021-07-18 19:13:25 Can we devise a more tractable RL problem if we give the agent examples of successful outcomes (states, not demos)? In MURAL, we show that uncertainty-aware classifiers trained with (meta) NML make RL much easier. At #ICML2021 https://t.co/dsD7nLKggG A (short) thread: https://t.co/KzqXZpvt7D 2021-07-18 02:04:54 NEMO was published in ICLR 2021, and you can find the ICLR 2021 presentation here: https://t.co/F4kMV2fZai If you want to learn more about NML and how it can be used in supervised learning, also check out Aurick Zhou's blog post: https://t.co/ES4l6oP6vJ 2021-07-18 02:04:41 This corresponds to the normalized maximum likelihood (NML) distribution, which has appealing regret guarantees, which we extend in NEMO to provide regret guarantees on offline MBO as well! This is more complex than COMs, but potentially more powerful as we get a full posterior. 
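Below is a minimal sketch of the conditional NML (CNML) idea referenced in the ACNML, MURAL, and NEMO tweets around here: for a query point, refit the model with each candidate label attached to that point, then normalize the resulting likelihoods, so that points far from the data come out uncertain. It uses a plain scikit-learn logistic regression rather than the amortized (ACNML) or meta-learned (meta-NML) versions described in the tweets; the function name, toy data, and hyperparameters are illustrative assumptions, not anything from the papers.

```python
# Sketch of conditional NML (CNML) for a binary classifier: for a query point x,
# refit the model with each candidate label appended to the training set, then
# normalize the resulting likelihoods. Far-from-data points end up near 50/50.
# Illustrative only; the papers discussed above amortize this with a Bayesian
# posterior (ACNML) or meta-learning (meta-NML) instead of refitting from scratch.
import numpy as np
from sklearn.linear_model import LogisticRegression

def cnml_probs(X_train, y_train, x_query, labels=(0, 1), reg_C=1.0):
    likelihoods = []
    for y in labels:
        # Augment the training set with the query point labeled as y, then refit.
        X_aug = np.vstack([X_train, x_query[None]])
        y_aug = np.append(y_train, y)
        model = LogisticRegression(C=reg_C).fit(X_aug, y_aug)
        # Likelihood the refit model assigns to label y at the query point.
        col = int(np.where(model.classes_ == y)[0][0])
        likelihoods.append(model.predict_proba(x_query[None])[0, col])
    likelihoods = np.array(likelihoods)
    return likelihoods / likelihoods.sum()  # normalized likelihoods over labels

# Toy usage: a query point far from the data gets an uncertain (near-uniform) label.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (50, 2)), rng.normal(2, 0.5, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(cnml_probs(X, y, np.array([0.0, 10.0])))
```

In the MURAL setting described above, the analogous normalized success-vs-failure likelihood is used directly as a shaped reward, so that novel states get a reward near 0.5 while states close to the example successes get a reward near 1.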
2021-07-18 02:04:20 The basic idea, unlike COMs (which learn pessimistic models) is to get a posterior over values for a new design x. Justin's method (NEMO) trains a separate model *for every possible value y* for the design x (discretized), and uses the likelihood from these to get the posterior. https://t.co/qvY7ps0cO5 2021-07-18 01:57:57 Since many people were interested in our recent offline MBO work, I'll also write about a recent paper on MBO by Justin Fu, which trains forward models for each possible objective value and uses them to compute a posterior via NML: https://t.co/I9I8SRNLXs A thread: 2021-07-18 01:54:36 @bodonoghue85 We didn't try generating images with this approach, that could be interesting to try -- a number of recent approaches to image generation with EBMs employ similar style training objectives. We did study image generation as MBO in our prior work MINs: https://t.co/z3la4MGvUw 2021-07-16 21:18:47 @alfcnz If you have any ideas, let us know :) 2021-07-16 21:18:23 @alfcnz Yes, absolutely. The same energy model connection also holds for conservative Q-learning (CQL): https://t.co/TYL7RTNbO7 The form of the gradient is the EBM gradient. We're not sure what to do with this connection, and our theory doesn't really use it, but it is intriguing. 2021-07-16 21:01:18 We'll be presenting COMs at #ICML2021: Thu Jul 22, 6:40am-6:45am PT https://t.co/Bgff1gLseh Poster: Thu Jul 22, 6am - 8 am PT Or just watch the talk in advance here: https://t.co/JxIMy2hlaB 2021-07-16 20:59:54 Great collaboration w/ Brandon Trabucco, @aviral_kumar2, @YoungGeng. For more offline MBO papers, also check out these from our lab: https://t.co/z3la4MGvUw https://t.co/I9I8SRNLXs And these great papers from @jlistgarten's group: https://t.co/5oVYjknwD1 https://t.co/ZsEK3TC5jR 2021-07-16 20:56:44 Just like CQL, we can guarantee that for a suitable regularizer weight, COMs gets us a lower bound on the true value of a design, providing a principled solution. It's also simple to use and implement. Website: https://t.co/v14SkeEIHD ICML talk: https://t.co/JxIMy2hlaB 2021-07-16 20:55:41 Our method, conservative objective models (COMs), is very similar to adversarial training: just find "adversarial" designs that the model thinks are good, and make them look bad, while making the designs in the data look good to counterbalance this. https://t.co/jjiU3hVtYG 2021-07-16 20:54:48 Just like in offline RL, the problem is distributional shift: the new designs we want to evaluate might be out of distribution for our training data. So we can borrow some ideas from conservative Q-learning (CQL) to "robustify" our models, thus making them suitable for design. 2021-07-16 20:54:07 This is very important: lots of recent work shows how to train really good predictive models in biology, chemistry, etc. (e.g., AlphaFold), but using these for design runs into this adversarial example problem. This is actually very similar to problems we see in offline RL! 2021-07-16 20:53:15 The basic setup: say you have prior experimental data D={(x,y)} (e.g., drugs you've tested). How to use it to get the best drug? Well, you could train a neural net f(x) = y, then pick the best x. This is a *very* bad idea, because you'll just get an adversarial example! 2021-07-16 20:52:17 Data-driven design is a lot like offline RL. Want to design a drug molecule, protein, or robot? 
Offline model-based optimization (MBO) tackles this, and our new algorithm, conservative objective models (COMs) provides a simple approach: https://t.co/ivfCvN147Z A thread: https://t.co/1olEYCE74y 2021-07-16 03:19:52 Check out sigma-VAE at #ICML2021 next week, for some simple and effective methods for training calibrated VAE decoders. Poster: Thu 22 July 17:35 - 17:40 PT https://t.co/L3rJFPWttI Paper: https://t.co/4ED8dsKr9A Web: https://t.co/Dx55xb1Jv3 Talk: https://t.co/5WRvaPbnae https://t.co/YtCeX7C7WP 2021-07-15 17:48:28 Bringing together IRL and multi-agent: look at other agents &amp https://t.co/2aCBKW2Rng https://t.co/5XcPN19BkB https://t.co/DEoyfKT9TC 2021-07-14 18:19:03 @qcjefftqc That is a really good question! I do think it might provide part of the explanation, though such methods do not (in principle) capture the right "type" of uncertainty. But it seems likely they "accidentally" get some epistemic uncertainty, esp. with regularization/early stopping. 2021-07-14 18:03:23 @gershbrain @its_dibya Yup, fair enough, thanks! I didn't actually know about the '60s Martin &amp 2021-07-14 17:58:53 @gershbrain @its_dibya Oh, I see -- the POMDP formulation of Bayesian RL is indeed well known (this is in the paper too). We did not see prior work that studies this specifically in the context of generalization (vs explore/exploit tradeoffs), though if you know of any, def. that should be referenced. 2021-07-14 17:44:59 This was a really fun collaboration with @its_dibya, @jrahme0, @aviral_kumar2, @yayitsamyzhang, @ryan_p_adams -- a really fun group to work with on generalization 2021-07-14 17:43:32 Unfortunately, solving (or even estimating) the epistemic POMDP is very hard, and LEEP makes some very crude approximations. Lots more research is needed to utilize the epistemic POMDP, in which case I think we can all make lots of progress on generalization! 2021-07-14 17:42:34 Based on this idea, we developed a new algorithm, LEEP, that utilizes epistemic POMDP ideas to get better generalization. LEEP actually does *worse* on the training environments, but much better on test environments, as we would expect. https://t.co/l5AxpWI1AQ 2021-07-14 17:41:00 What is happening here is that generalization in RL requires taking epistemic (information-gathering) actions at test time, just like we would in a POMDP, but this is never optimal to do at training-time. Hence, MDP methods will not generalize as well as POMDP methods. 2021-07-14 17:40:16 This leads to some counterintuitive things. Look at the "zookeeper example" below: the optimal MDP strategy is to look at the map (which is fully observed) and go to the otters, but peeking through windows generalizes much better (is never optimal in training). https://t.co/2lLv65hTee 2021-07-14 17:38:52 The guessing game is a MDP, but learning to guess from finite data becomes (implicitly) a POMDP -- what we call the epistemic POMDP, because it emerges from epistemic uncertainty. This is not unique to guessing, the same holds eg for mazes in ProcGen, robotic grasping, etc. 2021-07-14 17:38:08 Of course, this is a bad idea -- if it guesses wrong on the first try, it should not guess the same label again. But this task *is* fully observed -- there is a unique mapping from image pixels to labels, the problem is that we just don't know what it is from training data! 2021-07-14 17:37:25 Take a look at this example: the agent has a multi-step "guessing game" to label an image (not a bandit -- you get multiple guesses until you get it right!). 
We know in MDPs there is an optimal deterministic policy, so RL will learn a deterministic policy here. https://t.co/zih0Jp5h6l 2021-07-14 17:36:33 Empirical studies observed that generalization in RL is hard. Why? In a new paper, we provide a partial answer: generalization in RL induces partial observability, even for fully observed MDPs! This makes standard RL methods suboptimal. https://t.co/u5HaeJPQ29 A thread: 2021-07-14 17:36:29 RT @its_dibya: Fresh on ArXiv! https://t.co/KY9tnogl6Z TL 2021-07-14 17:32:37 @gershbrain @its_dibya Bayes-adaptive POMDPs are formulated explicitly as POMDPs. The epistemic POMDP emerges from solving a fully observed MDP with finite samples. This requires new terminology: the partial observability emerges from epistemic uncertainty, even though the MDP is not partially observed 2021-07-12 21:27:50 Also, it really hammers home the notion that I might be living in some sort of science fiction fantasy when our work literally involves enabling robots to play with stuff. Though having to write a grant proposal tonight somewhat interferes with the illusion. 2021-07-12 21:15:23 The ideal robotics lab is one where there are lots of robots playing, and a few humans working (with the largest ratio of the former to the latter) 2021-07-12 21:13:13 For those of you wondering what project this is, it's the next stage of our reset-free dexterous manipulation work -- the first phase publication can be found here: https://t.co/iUBV1Cid7K 2021-07-12 21:12:30 Always makes my day to come in to lab and see a robot playing with an object in the corner, without adult supervision. Though I think this one needs a friend, it's been going at it for some time on its own... https://t.co/cA9bCHFQE8 2021-06-30 17:02:15 To learn more about HInt, check out the: website: https://t.co/srPyLD7jql paper: https://t.co/HLTTjJD11S video: https://t.co/a525gRXKrG 2021-06-30 17:01:31 In experiments, we can combine data from three different robots (a TurtleBot, Clearpath Jackal, and a drone), as well as data collected by a person walking with a camera! This helps each robot generalize, even to environments that it hasn't seen before (but others have). https://t.co/2XMTkTJgVB 2021-06-30 17:00:24 This is a model-based method, but it doesn't predict future observations, rather future positions and rewards (similarly to GCG, CAPS, BADGR, and DIMs): https://t.co/QjLiykn3gr https://t.co/qoxAa0oIaK https://t.co/1iQmbzqWPT https://t.co/XGiEcvA78h 2021-06-30 16:58:29 Then, we can compose a robot-specific dynamics model with the shared visual model at test-time to get a single model that predicts future outcomes (e.g., rewards) given the robot's specific motor commands, and still share the visual part across many robots! 2021-06-30 16:57:55 The basic idea is that we learn two models: a model that maps images & 2021-06-30 16:56:31 Can we combine data from many robots and learn a single model-based controller? In HInt (Hierarchically Integrated models), Katie Kang & https://t.co/HLTTjJD11S -> 2021-06-30 04:23:12 The intuition behind the result is that, if each virtual "agent" has a different "job," and these jobs are independent, then the whole society can adapt more quickly when transfer requires changing just a sparse subset of these jobs.
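A minimal sketch of the second-price ("Vickrey") action auction alluded to in the cloned-Vickrey-society tweets above and below: each action has its own bidder that bids its estimated utility, the highest bidder gets to act, and it pays the second-highest bid, which is what makes truthful bidding (bidding the real value estimate) a dominant strategy. This shows only the auction step; the cloning and credit-assignment machinery of the actual CVS method is omitted, and the function and variable names are illustrative.

```python
# Second-price sealed-bid auction over per-action bidders. Because the price a
# winner pays does not depend on its own bid, each bidder's best strategy is to
# bid its true value estimate -- the property the thread refers to. This is only
# a sketch of the auction step, not the full cloned Vickrey society algorithm.
import numpy as np

def run_auction(bids):
    """Return (winning_action, price) for a second-price sealed-bid auction."""
    order = np.argsort(bids)[::-1]
    winner = int(order[0])
    price = float(bids[order[1]]) if len(bids) > 1 else 0.0
    return winner, price

# Toy usage: per-action bidders bid their (hypothetical) value estimates at a state.
value_estimates = np.array([1.3, 0.7, 2.1, 1.9])
action, price = run_auction(value_estimates)
print(f"action {action} wins and pays the second price {price}")
# The winner's net utility is (reward it collects) - price; overbidding or
# underbidding cannot increase that utility, so honest bids are incentivized.
```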
2021-06-30 04:22:24 In this new paper, we further show that such decentralized and modular reinforcement learners satisfy an algorithmic independence property that allows them to adapt much more quickly in transfer learning scenarios, by utilizing tools for studying independent causal mechanisms. 2021-06-30 04:21:20 In our prior work, we showed that we can design populations ("societies") of virtual agents that compete to get the largest share of extrinsic reward, "bidding" over which action to take at each time step. We called this the "cloned Vickrey society": https://t.co/PKOBuatwYK https://t.co/6Jf4nRK2i5 2021-06-30 04:19:01 Algorithmic independence can explain why modularity can result in better transfer -- in this paper, we study this for the particular case of modular algorithms, building on our prior work on a "societal competition" model of RL (which is inherently modular). -> 2021-06-25 05:58:27 Frustratingly simple video prediction -- first time I saw overfitting as the main problem for video prediction (though data augmentation can address that). Great results on video pred many frames into the future, and also works for real-world robot control! (and yes there is code) https://t.co/jxulSzweSG 2021-06-25 04:45:07 Visual model-based RL with a real trajectory optimizer -- collocation + visual model-based RL can "imagine" moving objects to the goal, and then figure out how to actually accomplish it. See below for @_oleh's excellent summary, check out the paper here: https://t.co/9AGIzCY9jz https://t.co/Q7vecze7K3 2021-06-22 04:06:55 This is perhaps a bit sad, because I(Z[t+1] 2021-06-22 04:04:54 What *is* sufficient? Turns out that under some mild assumptions, I(Z[t+1] 2021-06-22 04:03:45 Another common choice is MI between past states and future states: I(Z[t+1] 2021-06-22 04:02:00 A popular choice is MI between actions and future states: I(A[t] 2021-06-22 03:59:39 Mutual info maximization is often used for representation learning. Which MI objectives lead to representations that are sufficient to learn optimal policies in RL? In recent work by Kate Rakelly et al., we show some popular choices are not sufficient: https://t.co/Wg1vUe8Ouh > 2021-06-17 03:17:47 The updated meta-world has significantly improved visuals and many bug fixes, improved reward functions, and other changes to make it easier to work with. If you're looking for 50 robotic manipulation tasks for meta-learning & 2021-06-16 22:42:11 A kind of GAN for system ID: Train simulator parameters so that the simulated *trajectories* are indistinguishable from real trajectories, treating the simulator as a "policy" that tries to replicate long-horizon behavior seen in the real data. https://t.co/6LcPukaVY7 https://t.co/F0BHJspW8u 2021-06-15 03:15:33 Prior to this, I didn't think that goal-conditioned RL could handle such diverse real-world scenes. But what's neat here is not just the vis. goal results, but that it provides pretraining & 2021-06-14 20:28:11 RT @minerl_official: We are so excited to announce that the first round of the 2021 @NeurIPSConf MineRL Diamond Competition has officially… 2021-06-09 03:24:14 What I find most interesting about this work is the notion that information-theoretic skill learning (e.g., DIAYN) can be thought of as a kind of goal-conditioned RL with a learned representation. This opens up a number of interesting design choices.
https://t.co/bdhzbMw0kM 2021-06-08 19:13:27 RT @siggraph: This #SIGGRAPH2021 Technical Papers selection will really get you moving with new #research on character control in "AMP: Adv… 2021-06-04 20:31:00 @rsalakhu That is a really beautifully illustrated blog post! ...is that a pig walking across the field in the Unreal Engine/AirSim screenshot? 2021-06-04 18:00:04 @FeryalMP @AlexKhazatsky @ashvinair @khimya Thank you for pointing out, definitely an oversight on our part that we should rectify. But aside from the title (which I agree is an unfortunate collision, though unintentional on our part), the two papers are trying to do very different things. Which is why we missed it. 2021-06-04 06:56:52 I'll give a talk tomorrow at ICRA Safe Robot Control workshop https://t.co/puXIQyDVHj (Zoom link on website) 9:30 am PT Fri 6/4! "Safety in Numbers: How Large Prior Datasets Can Put RL into the Real World" (it's a workshop talk, so this title is perhaps a bit aspirational). 2021-06-03 20:49:40 @AlbertBuchard I suspect it depends a lot on the data, but the analysis in Figure 2 suggests that at least for prediction accuracy, it's about having better state representations rather than being non-Markovian (the story may be different if we look at actions, depending on the behavior policy) 2021-06-03 20:32:40 @ethancaballero The geographic extent of BAIR while everyone is working from home (in various locations) has a radius of roughly 12,000 miles. 2021-06-03 20:22:41 4/ Concurrently (just yesterday!) Decision Transformer validated the same hypothesis: that big sequence models can do offline RL effectively. There are some differences (roughly, trajectory transformer is more model-based), but the conclusion is the same: https://t.co/cnfyxh85ZE 2021-06-03 20:22:00 3/ Interestingly, the model is significantly better than prior models in model-based RL for forward prediction too, able to predict humanoid trajectories with physical accuracy for 100 steps into the future (typically, model-based RL rollouts are much shorter than this) https://t.co/ZqZr83uRoh 2021-06-03 20:20:49 2/ For control, we can just decode with beam search, but using reward instead of likelihood, which is competitive with or better than prior offline RL methods on standard benchmarks, can solve goal-conditioned tasks, etc. https://t.co/t0dgl8uf6O 2021-06-03 20:20:29 1/ The basic idea is simple: take all dimensions of states, actions, and rewards, discretize them, and model them as a long sequence of discrete tokens. Then use a language model. This can be used to predict, and also to control. https://t.co/5RKQrOgitf 2021-06-03 20:20:10 Can we replace RL with one big sequence model? Trajectory transformer models state/action/reward sequences one token (dimension) at a time -- this allows long-horizon prediction and offline RL w/o separate actors, critics, constraints, etc.: https://t.co/98D21UVSMC A thread: https://t.co/YwJeVL02tZ 2021-06-03 05:48:28 This will be in the session "Reinforcement Learning for Robotics I" at 19:30 GMT+1/11:30 am PT Thu 6/3. You can find the paper here: https://t.co/SDEtxYcxgd Code: https://t.co/uPVACeN9cI Video: https://t.co/I1iyXQYSXt 2021-06-03 05:47:24 We do a lot of work on manipulation. But what about aerial manipulation? Can we use meta-learning to get drones to pick up cargo? 
At #ICRA2021 19:30 GMT+1/11:30 am PT Thu 6/3, Suneel will present meta-learning w/ suspended payloads: https://t.co/0DgaNGWgSJ https://t.co/uQdhgZJHog 2021-06-03 04:18:46 From driving on sidewalks and paved paths, we can also learn to reach visually indicated goals with a mobile ground robot. Check out @_prieuredesion's ViNG talk, #ICRA2021 tmrw 1900 GMT+1/11am PT, "Autonomous Vehicle Navigation II" https://t.co/XPrUpNSHhz https://t.co/hyTUbxavG1 https://t.co/Ib0E6IXIph 2021-06-03 04:15:55 RT @ashvinair: Super excited to release latest work on visuomotor affordance learning (VAL)!! Highlights: 1. VAL utilizes a large prior dat… 2021-06-03 03:26:53 Check @nick_rhinehart's talk at ICRA tmrw 18:00 GMT+1/10 am PT (Thu 6/3)! Find out how robots can plan for contingencies, handling autonomous driving scenarios where humans can react in unpredictable ways. If you can't make the talk, recording is here: https://t.co/ol3WmIziGy https://t.co/xhAdLSaR2E 2021-06-02 03:23:32 Paper: https://t.co/Gf5LBr1ZV2 Website: https://t.co/pf7qqtM8i1 Video: https://t.co/XthcnLGsii VAL will be presented at #ICRA2021 on Thu, 19:15 GMT+1/11:15 am PT, in the Vision-Based Manipulation session. 2021-06-02 03:21:35 Here are a few animations of new tasks that VAL masters in unseen test environments: for each task, the robot was first allowed to interact without any supervision, using its offline-trained policy + generative model to propose &amp 2021-06-02 03:20:31 We train VAL on a large dataset of diverse environments with many different objects, using offline RL (based on AWAC https://t.co/JYyprRInhR), with a conditional VQ-VAE model for candidate goal generation. After offline training, it autonomously learns in new settings. https://t.co/UWT2xES0Dl 2021-06-02 03:18:38 Our method, visuomotor affordance learning (VAL), trains a generative model that predicts what the robot *could* do, and a goal-conditioned policy to do it. In a new environment, it then imagines what might be doable, and practices doing it. https://t.co/yMJFqN8DvO 2021-06-02 03:17:14 When a robot is put in a new scene, can it "imagine" what tasks could be possible, and then practice them? In "What Can I Do Here?", @AlexKhazatsky, @ashvinair, &amp 2021-06-02 01:04:23 For those who want to hear the whole thing: Spotify: https://t.co/f9XX1btw9c Soundcloud:https://t.co/3kTgd6KEsZ YouTube: https://t.co/Ha1nWCS7M7 2021-06-02 01:03:06 A fun podcast, thank you @MarwaEldiwiny for inviting me! https://t.co/7uDkQtR5cj 2021-06-01 18:56:49 @SmallpixelCar @OpenRoboticsOrg Not yet, we were not brave enough to try to train it to do that -- we only have one of these. There is nothing technical that in principle would prevent this though, just a bit beyond the production and QA capabilities of a university lab. 2021-06-01 05:45:51 At #ICRA2021, @abhishekunique7 will present present our work on dexterous hands, showing that by practicing many skills at the same time, such robots can learn without human intervention! 21:15 GMT+1/1:15 pm PT in Manipulation: Reinforcement Learning II https://t.co/iUBV1Cid7K https://t.co/2XmgmXzDBV 2021-06-01 05:13:29 Also at #ICRA2021 tmrw, S. Nasiriany will present Disco RL, which extends goal-conditioned RL to train policies conditioned on arbitrary goal distributions! 20:15 (GMT+1)/12:15 pm PT Tue, in Manipulation: Reinforcement Learning I https://t.co/LVchQkoIGC https://t.co/KzbaNiFR1V https://t.co/4GXPo17vRT 2021-06-01 04:33:24 Can robots to learn to drive on Berkeley sidewalks, entirely from images? 
Also at #ICRA2021, Tue 18:30 GMT+1/10:30 am PT, Greg Kahn will present LaND, which can learn to navigate kms of Berkeley sidewalks! In the Field Robotics: Control session. https://t.co/ZujiRF6PGt https://t.co/bmx6ky924G 2021-06-01 04:02:02 Can we learn image-based open-world navigation from diverse data in the real world? At #ICRA2021, Tue 10:45 GMT+1/2:45 am PT, Greg Kahn will present BADGR, our system for real-world navigation, in the Service Robotics Award Session! https://t.co/zHleTYhO1C https://t.co/gdZNWHlkHe 2021-06-01 03:11:01 RT @HelgeRhodin: Will reinforcement learning solve computer vision? With or w/o simulation? @svlevine showcased his recent work and shared… 2021-05-29 17:58:20 RT @siggraph: Quench your #research thirst with an interview with Jason Peng + Edward Ma about "AMP: Adversarial Motion Priors for Stylized… 2021-05-27 07:08:27 @_sam_sinha_ Might have solved AGI, might have trained on the test set, stand by... 2021-05-27 07:01:46 Really stoked to share some of the results in the papers we're submitting on Friday, once we get this nicely written up. We've figured out some things. ... ... ...unless we realize before that that we were wrong, that's always a possibility... 2021-05-26 17:10:17 @F_cossio @abhishekunique7 Each task's policy is only rewarded for that task, not for the performance of future tasks. 2021-05-25 17:40:04 Want to know how robots can learn to give you a hand with your NeurIPS submissions? So do I. In the meantime, you can check out @abhishekunique7's ICRA 2021 talk, how to train robotic hands to do lots of other stufffrom scratch, in the real world https://t.co/vDxDnvNgJr 2021-05-12 17:52:12 Congratulations Prof. Berseth!! For students: working with Glen as he starts his lab at @Mila_Quebec will be a lot of fun. Check out some of his work here: https://t.co/mrSdXe4e39 From control of walking robots to reversing the second law of thermodynamics https://t.co/y3vMyGmJbt 2021-05-12 03:24:29 Of course, these experiments in MT-Opt take that to a whole new level: once the robot is trained on a huge dataset, it actually starts to generalize in meaningful ways, so that it actually reacts *intelligently* when we mess with it (whereas Brett just did funny random stuff). 2021-05-12 03:23:47 What I enjoyed most when I ran robot experiments (before I got too busy to do it myself) was messing with a robot that had learned a task. It's so much fun to see what it tries to do in different situations. E.g., these demos (w/ @chelseabfinn &amp 2021-05-08 18:46:40 RT @hausman_k: Actionable Models got accepted to #ICML2021! Excited to talk about this work at the conference! https://t.co/I1Mg40AupR 2021-05-07 22:02:00 @davidgordian I suspect it could work. Karl Zipser at Berkeley had a cheap little RC car that he was using for a while for research along similar lines, which seemed to work pretty well -- I'm not sure if he ever open-sourced the design, but I think it was a few hundred $ in cost. 2021-05-07 19:25:06 This is starting in 36 min (1 pm PT) here: https://t.co/It721zZYoM Meanwhile, check out Abhinav Gupta's talk, which starts shortly (12:30 pm PT)! #ICLR2021 https://t.co/cwDQ92ShP5 2021-05-07 05:43:36 @kolari My recommendation to the students to tackle this particular problem for the evaluation had absolutely nothing to do with the fact that I have a toddler at home. Completely. not. related. 2021-05-07 04:56:50 I'll be giving a talk on how robots can learn autonomously from diverse datasets tomorrow (Fri) at 1 pm PT at the Embodied Multimodal Learning workshop! 
There will be many robots, learning many things. ICLR page: https://t.co/It721zZYoM https://t.co/JFOt1vtJXD 2021-05-07 03:29:15 @AleksandraFaust @iandanforth @JDCoReyes Oh wow good catch, I didn't even know he had one 2021-05-07 02:36:08 In "RL for Autonomous Mobile Manipulation with Applications to Room Cleaning" we show how a hierarchical RL approach can enable continual, unattended real-world learning: https://t.co/20fxjDtfD6 2021-05-07 02:34:57 Fri 6 am PT &amp https://t.co/VkNGqn73gP https://t.co/51Az37MsXx 2021-05-07 01:51:14 @iandanforth He's too cool for us (or maybe just too busy getting his papers invited to workshops and getting ICLR oral presentations... while I'm just here posting on twitter) 2021-05-07 01:22:03 To improve RL, we try to build better algorithms. But what if we ask how to set up environments to make RL (esp. lifelong RL w/ sparse rewards) easier? Check out Ecological RL, which was invited to the never-ending RL ws: https://t.co/VkNGqn73gP #ICLR2021 https://t.co/35ibxthqyY https://t.co/b3GBHA35g8 2021-05-06 23:44:32 Robots can learn on their own, if they reset their own task. Multi-task learning is a great way to do this -- check out the poster by @abhishekunique7 et al tmrw 6am PT &amp https://t.co/VkNGqn73gP Web: https://t.co/iUBV1Cid7K Video: https://t.co/IfXEhwAybN https://t.co/sCqpoysa1u 2021-05-06 21:46:53 Can causality and algorithmic independence help RL transfer better? Tmrw, @mmmbchang will present "Modularity in RL via algorithmic independence" in #ICLR2021 ws: Generalization beyond... 1 pm PT: https://t.co/lraDFJO45r Learning to learn 8:40 am PT: https://t.co/T8cNaMFUft https://t.co/qra9AbhJh3 2021-05-06 20:21:52 One of the things that I find particularly exciting about this is that it demonstrates the power of offline RL: we used the same dataset collected by Greg Kahn in spring 2020 to build this navigation system, even though that data was collected for a completely different project. 2021-05-06 20:17:37 How can robots learn to search for goals in new environments? Tmrw, Fri 10:50am PT, @_prieuredesion will present a full-length talk at the NERL ws at #ICLR2021 on RECON, which addresses this problem with real-world data! WS: https://t.co/N6AMQvMlpE paper: https://t.co/Es7YKyv8KX https://t.co/sgHPoT0SmE 2021-05-06 04:50:37 Want to know how to train goal-reaching policies w/o reward, with just state-prediction density? See Ben Eysenbach's poster on C-Learning Thu 9 am PDT at #ICLR2021! https://t.co/fAL5Tea6kk Web: https://t.co/GUhYjmAIXC Code: https://t.co/5tYQKx5kGw ICLR: https://t.co/6DbcScG7bL https://t.co/mVSSqwveSb 2021-05-05 17:37:09 How can RL agents explore *safely*? Conservative safety critics aim to provide this via a pessimistic safety critic that believes that anything you haven't done before is dangerous. At #ICLR2021 today, 5 pm PT, rm B5! arxiv: https://t.co/qsYhXBeipZ ICLR: https://t.co/2Oqg8mDF20 https://t.co/ejTBMHCkT7 2021-05-05 17:33:06 Today (Wed) at 11:45 am PT, JD Co-Reyes et al. will present Evolving RL Algorithms -- how we can search over the space of Q-learning methods to discover more effective objective functions! Poster at 5 pm PT https://t.co/pmMBI1gAmz https://t.co/3MzyubeqdP https://t.co/5RAOHC6YNt https://t.co/Mo8mgT5XcP 2021-05-05 03:49:58 And by "Tue 9 am PT" I actually mean "Wed 9 am PT", oops 2021-05-05 03:07:47 Can hierarchical latent variable models help solve offline RL problems? Tue at 9 am PT, Anurag Ajay et al. 
will present OPAL, a method that combines learned latent action models with higher-level offline RL. Paper: https://t.co/8phIcOMkyC Poster: https://t.co/zA4oGud51H https://t.co/tYekSrTL8S 2021-05-04 21:15:58 Deep RL has an _implicit_ under-parameterization problem: TD backups cause aliasing in the features, even for large nets! Today 5 pm PT, @aviral_kumar2 and @agarwl_ present analysis of IUP and some solutions! #ICLR2021 ICLR: https://t.co/2sg6WRLjT2 web: https://t.co/NogUM92z1o https://t.co/kvoj6um3C4 2021-05-04 19:54:32 Interested in offline model-based RL with images and goals? Check out MBOLD, presented at #ICLR2021 this evening (Tue 7:15 pm PT): https://t.co/OfHRyV4M9M S. Tian will describe how models + temporal distance values can allow diverse visual goal-reaching. https://t.co/8XuwTI4oaN https://t.co/D7ZNR55mhP 2021-05-04 17:47:02 The great thing about this is that the algorithm is _extremely_ simple. Paper: https://t.co/fn8sEiLOsT Blog: https://t.co/8wZp0pEiOx Poster: https://t.co/Jf0gqpsMu3 Talk recording (if you want a sneak-peek!): https://t.co/XqRcNINlz5 2021-05-04 17:46:38 The basic idea in GCSL is that we can imitate our own prior experience with suitable relabeling, and this actually corresponds to an RL algorithm that we can analyze, show per-iteration improvement, etc.! And then we can use it to learn all sorts of RL tasks, like lunar lander. https://t.co/UeGnNrPB3l 2021-05-04 17:46:20 Can regular *supervised* learning solve RL problems from scratch (no demos, no prior data)? Does that even make sense? Well, yes! Check out the GCSL presentation at #ICLR2021, poster at 5 pm PT Tue (today), oral presentation 11 am PT Wed: https://t.co/Jf0gqpsMu3 short thread-&gt 2021-05-04 05:08:55 You can also find out more about this (admittedly somewhat peculiar) unsupervised RL method here: Paper: https://t.co/H2aL3n7it0 Project website: https://t.co/UykHFkRJQl BAIR Blog: https://t.co/DJWhkZMONz 2021-05-04 05:08:24 If you missed @GlenBerseth's talk on SMiRL, catch the poster tomorrow (Tue) at 9 am PT at #ICLR2021. SMiRL is an unsupervised RL method based on *minimizing* surprise -- by aiming to reduce state entropy, SMiRL can produce meaningful behavior in "chaotic" environments. https://t.co/if5aKjBHrS 2021-05-04 02:55:51 I would highly recommend attending Avi's Q&amp 2021-05-04 02:54:20 RT @avisingh599: I'll be answering questions 8.05pm - 8.18pm PT https://t.co/txEQAjrTKX 2021-05-04 00:30:18 RT @avisingh599: I'm presenting our work "Parrot: Data-Driven Behavioral Priors for Reinforcement Learning" at #ICLR2021 right now! If you'… 2021-05-01 22:56:19 @ylecun My brain pattern-matched it to "self-supervised decision transformer" and I got really excited. I don't know what that is, but now I want to write a paper about it. 2021-04-30 19:24:55 That is a great way of putting it -- humans are indeed the rate limiting step. As anyone who has run a robot learning experiment knows, current robot learning is 90% human learning and human labor, and 10% robot effort. We need to change this. Make the robots work harder. https://t.co/nbKVWoi7St 2021-04-29 20:06:15 @antoniloq I think that is a really important goal in the future, because ultimately if we want large datasets, we'll have to pool from many robots. We have a little bit of work in this direction: https://t.co/RGlNrsMWBi But there is a long way to go in this direction. 
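A minimal sketch of the GCSL recipe described in the tweets above ("imitate our own prior experience with suitable relabeling"): relabel each state-action pair with a goal the agent actually reached later in the same trajectory, then fit a goal-conditioned policy by ordinary supervised learning. The toy random-walk environment, discrete actions, and scikit-learn classifier here are illustrative assumptions, not the paper's setup.

```python
# Sketch of GCSL-style hindsight relabeling + supervised imitation: goals are
# drawn from states reached later in the same trajectory, and pi(a | s, g) is fit
# with a standard classifier. The paper analyzes when iterating this improves
# the policy; this toy version just shows the data construction and the fit.
import numpy as np
from sklearn.neural_network import MLPClassifier

def hindsight_relabel(trajectories, rng):
    """trajectories: list of (states [T+1, d], actions [T]) pairs."""
    inputs, targets = [], []
    for states, actions in trajectories:
        T = len(actions)
        for t in range(T):
            # Pick a state the agent actually reached later and call it the goal.
            g = states[rng.integers(t + 1, T + 1)]
            inputs.append(np.concatenate([states[t], g]))
            targets.append(actions[t])
    return np.array(inputs), np.array(targets)

# Toy data: random-walk trajectories on a line with actions {0: left, 1: right}.
rng = np.random.default_rng(0)
trajs = []
for _ in range(200):
    actions = rng.integers(0, 2, size=20)
    steps = np.where(actions == 1, 1.0, -1.0)
    states = np.concatenate([[0.0], np.cumsum(steps)])[:, None]  # [T+1, 1]
    trajs.append((states, actions))

X, y = hindsight_relabel(trajs, rng)
policy = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500).fit(X, y)
# The cloned policy should move toward the relabeled goal: state 0, goal +3 -> right.
print(policy.predict([[0.0, 3.0]]))
```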
2021-04-29 17:06:03 I wrote a short, non-technical article summarizing some recent progress in robotic learning, particularly how robots can learn (1) from large offline datasets Robots like the one below: https://t.co/1RwkkydBxC 2021-04-28 16:52:51 New talk covering recent progress on robotic learning from Berkeley and Google -- how can we build data-driven robotic systems with offline RL that learn directly in the real world with minimal human intervention? https://t.co/0J6QfBpfMH 2021-04-28 03:38:03 This generalization of goal-conditioned RL Paper: https://t.co/KzbaNiFR1V Project Website: https://t.co/LVchQkoIGC Poster: https://t.co/MUaM4MeQET YouTube video: https://t.co/lvEnVPy2qX w/ @snasiriany, @Vitchyr, @ashvinair, @AlexKhazatsky, @GlenBerseth 2021-04-28 03:35:34 DisCo RL (distribution conditioned RL) explores this: condition a policy on a distribution, so the policy can capture a wide range of behaviors. E.g., a distribution that has high correlation between positions of two objects induces the policy to put those objects together. 2021-04-28 03:34:13 We can train policies to reach goal states, and then use them to accomplish a range of goals. What if we want policies that can accomplish any task? Distributions over goals can, in theory, represent any task, so we could condition on the parameters of a distribution! a thread&gt 2021-04-28 02:54:32 RT @abhishekunique7: Some cool new updated results for offline pre-training followed by online fine-tuning with AWAC (advantage-weighted ac… 2021-04-27 17:31:47 Pretraining with offline data makes it much easier to get complex real-world robotics tasks, like the dexterous manipulation task above, to work reliably and efficiently. Arxiv: https://t.co/JYyprRInhR Videos: https://t.co/913iRX0ppL w/ @ashvinair, @abhishekunique7, M. Dalal 2021-04-27 17:30:59 AWAC works much like SAC, but with a modified actor update that implicitly enforces a policy constraint, mitigating the OOD action problems that plague offline RL. This works well for *both* offline and online RL, making it great for offline pretraining+finetuning! 2021-04-27 17:30:24 How can we get robots to solve complex tasks with RL? Pretrain with *offline* RL using prior data, and then finetune with *online* RL! In our updated paper on AWAC (advantage-weighted actor-critic), we describe a new set of robot experiments: https://t.co/913iRX0ppL thread -&gt 2021-04-26 16:42:26 w/ @timrudner, @Vitchyr, @rowantmc, @yaringal paper: https://t.co/alF8sGVqU0 2021-04-26 16:42:15 Discount factors and iteratively adjusted shaping over time emerge naturally from our framework, such that the method actually adjusts (automatically) how shaped the reward is while optimizing a valid lower bound. 2021-04-26 16:42:04 This is really nice, because log-probabilities of learned dynamics (like conditional Gaussian models) are much more well-shaped than commonly used goal-conditioned RL rewards, like indicators or epsilon-ball rewards. https://t.co/ToMQDPhiuE 2021-04-26 16:41:47 The main idea: if we treat states and actions as variables in a graphical model, we can derive a variational lower bound on the probability of reaching a state at some point in the future (without knowing when!) and get an RL-like objective where the reward is the *dynamics* https://t.co/rsQhK1fYmt 2021-04-26 16:41:33 Can control as inference give us RL algs that don't need manual rewards? 
In Outcome-Driven Actor-Critic (ODAC), we show how VI provides principled reward shaping, derive RL as a bound on an inference problem, and show great goal-conditioned results: https://t.co/alF8sGVqU0 &gt 2021-04-23 22:38:42 @TheOneRavenous I just can't make a gif that shows the whole thing (it would be too big) -- the YouTube video on the project website shows the whole thing. 2021-04-23 04:43:07 (this is not actually an exaggeration, we actually had a different commercially available hand, which shall not be named, start smoking after some heavy-duty RL training a few years back) 2021-04-23 04:40:24 Learning algorithms are cool and all, but Vikash's D'Hand design was essential to make this work -- that thing is built like a tank, a lesser robotic hand would be a smoking wreck after this kind of training process https://t.co/Xjea4A9Tb5 2021-04-23 03:59:20 RT @abhishekunique7: We've been working on getting robots to learn in the real world with many hours of autonomous reset free RL! Key idea… 2021-04-23 03:44:18 Paper: https://t.co/sTT9588hC0 Web: https://t.co/iUBV1Cid7K Video: https://t.co/IfXEhwAybN w/ @abhishekunique7, Justin Yu, @tonyzzhao, @Vikashplus, A. Rovinsky, @imkelvinxu, T. Devlin 2021-04-23 03:41:18 This allows a robotic hand+arm to learn a variety of dexterous manipulation skills completely autonomously, practicing by trial and error without any human intervention or oversight. For example, here it is learning a connector task. https://t.co/dNafvYY1re 2021-04-23 03:40:46 The idea in MTRF learning (multi-task reset-free learning) is that, in order to learn complex tasks in the real world, robots should learn many tasks together, where the tasks form a graph such that each task can be "reset" with other tasks. https://t.co/m4QGQYRwiX 2021-04-23 03:40:29 After over a year of development, we're finally releasing our work on real-world dexterous manipulation: MTRF. MTRF learns complex dexterous manipulation skills *directly in the real world* via continuous and fully autonomous trial-and-error learning. Thread below -&gt 2021-04-22 22:11:40 RT @GoogleAI: Today we present a new approach for automated discovery of generalizable #ReinforcementLearning algorithms that evolves a pop… 2021-04-22 05:08:34 I summarized this paper with the following TL It's PRECOG+planning. A cool property that PRECOG has is that it encodes closed-loop behaviors as open-loop sequences of latent codes, and hence can allow for planning of closed loop behaviors easily, which is neat. 2021-04-22 03:17:57 w/ @nick_rhinehart, Jeff He, Charles Packer, Matthew Wright, @mejoeyg, @rowantmc Web: https://t.co/aZTobrV5h5 Arxiv: https://t.co/W7k5yegTIM 2021-04-22 03:16:30 The main idea is to train a latent variable model that represents the intentions of the ego-agent and other agents with latent variables, and instead of planning over actions, plan over the ego-agent's future intentions, while averaging out intentions of other agents. 2021-04-22 03:15:46 Planning in multi-agent settings requires considering uncertain future behaviors of other agents. So we need contingent plans. 
In our recent work, we tackle this by learning models where open-loop latent space plans lead to closed loop control https://t.co/W7k5yegTIM &gt 2021-04-20 00:25:22 RT @hausman_k: In addition to MT-Opt, we are releasing Actionable Models, which addresses the problem of defining tasks (which becomes quit… 2021-04-19 21:00:23 RT @YevgenChebotar: Excited to present our new work on Actionable Models, an approach for learning functional understanding of the world vi… 2021-04-19 20:14:43 RT @GoogleAI: Presenting two new approaches to robotic #ReinforcementLearning at scale — MT-Opt, an RL system for automated multi-task data… 2021-04-19 20:14:32 RT @hausman_k: Happy to share a project we've been working on for past 2+ years: MT-Opt https://t.co/SyHab7YBkp Multi-task RL at scale on r… 2021-04-19 20:14:24 RT @julianibarz: Our team has reached another milestone towards general robotics: a robot that can learn many tasks at once and be fine-tun… 2021-04-19 20:14:11 RT @hausman_k: Website: https://t.co/OJune1M2Tw Blog: https://t.co/W55VHRpvBx Arxiv: https://t.co/KTVD9cfWzV Done together with a great tea… 2021-04-19 19:34:19 Blog post: https://t.co/rlLTIOgaKf MT-Opt: https://t.co/gweqaiT7oC AMs: https://t.co/aLDnuH9an7 This research was carried out by Google Research, and led by @YevgenChebotar @hausman_k Dmitry Kalashnikov Jake Varley, with contributions from many others! 2021-04-19 19:32:50 In both cases, the resulting multi-task policies can be efficiently finetuned to solve new downstream tasks. In a sense, this provides a kind of general pre-training for robotic policies, learning representations that are effective for a breadth of downstream behaviors. 2021-04-19 19:32:04 Actionable models takes an automated approach, using goal-reaching policies for multi-task training w/o user-specified rewards. By running offline RL, via an algorithm based on conservative Q-learning (CQL), AMs can pretrain goal-reaching policies on huge multi-task datasets. https://t.co/jXf50xbp0P 2021-04-19 19:30:43 Instantiating this requires a few choices -- e.g., how do we define tasks? In MT-Opt, this is done with a set of task reward classifiers, and relabeling that intelligently shares data across different tasks, enabling under-represented tasks to benefit from over-represented ones. https://t.co/W1kZhBWuoI 2021-04-19 19:29:14 The basic principle is simple: gather data for many different tasks across multiple robots, with months of real-world experience, and then use this data to train multi-task policies with offline reinforcement learning. https://t.co/Rd1KSmm01q 2021-04-19 19:28:29 How can we learn diverse multi-task policies via offline RL from huge robotic datasets? In 2 papers (actionable models and MT-Opt), offline RL enables both multi-task pretraining and efficient finetuning to new tasks in real-world settings. https://t.co/rlLTIOgaKf thread -&gt 2021-04-18 01:46:17 Final set of CS182 Deep Learning lectures now added to the course playlist: https://t.co/21GvfFRcHo GANs (Lec 19), adv. examples (20), and meta-learning (21)! 
2021-04-15 20:11:43 RT @ancadianadragan: assistive typing: map neural activity(ECoG)/gaze to text by learning from the user "pressing" backspace to undo

2021-04-14 22:57:22 RT @_prieuredesion: We're making strides towards truly "in-the-wild" robotic learning systems that can operate with no human intervention.…

2021-04-14 22:57:18 RT @_prieuredesion: Some great demos of exploring unseen cafeterias and fire stations under different seasons and lighting on the project p…

2021-04-14 04:15:08 Check out the full video below Paper: https://t.co/WsEYRIb6P9 Website: https://t.co/Es7YKyv8KX w/ @_prieuredesion @ben_eysenbach @nick_rhinehart https://t.co/tvCAXZvhcs

2021-04-14 04:13:54 The RECON model is trained entirely with offline data. We used data collected over the course of 2020 for several different projects! Offline training across diverse environments allows generalization to lighting, weather, and different settings. https://t.co/uHqK50ZFuM

2021-04-14 04:12:31 As the robot explores a new environment, it builds a "mental map" -- a graph, with edges connected by distances predicted by the learned model. Diverse prior data is used to train the model across many environments, while the graph represents a mental map of the new environment. https://t.co/DXUGKmAf2c

2021-04-14 04:11:30 The idea in RECON is to train a latent goal model, which bottlenecks the goal through a latent variable. By sampling this latent variable from the prior, the robot can explore an entirely new environment (essentially by proposing random goals) much more effectively. https://t.co/ugNYvdmEqo

2021-04-14 04:09:26 Can robots navigate new open-world environments entirely with learned models? RECON does this with latent goal models. "Run 1": search a never-before-seen environment, and build a "mental map." "Run 2": use this mental map to quickly reach goals https://t.co/Es7YKyv8KX >

2021-04-09 16:02:53 @zhaomingxie @Mvandepanne For what it's worth -- I agree with you that this article should have done a better job of putting the research in the context of prior work.

2021-04-09 15:57:19 A few folks pointed out to me that there are some factual errors in the MIT TR article. Certainly we are *not* claiming that our paper is the first to show RL for bipedal locomotion! Our prior work section covers lots of prior papers on this: https://t.co/4518eydPzH

2021-04-09 01:48:38 Nice article in MIT Tech Review about our recent paper (@ZhongyuLi4 &…

2021-04-06 17:07:23 The task reward encodes _what_ we want the character to do, the AMP encodes _how_ we want it to do it. This recipe is general, and I think it will provide a simple and powerful approach to character animation that is much easier to use than state machines, motion graphs, etc.

2021-04-06 17:05:35 This image captures the basic (and simple) idea: take a dataset that describes the desired style of the behavior you want (walks, turns, stylized "zombie walks," whatever), use it to train the motion prior (a discriminator), and then combine it with a task reward. https://t.co/hGpX6OzB2C

2021-04-06 17:03:54 Adversarial motion priors (AMP) provides a simple and scalable data-driven method for physics-based animation: RL is used to train the character to accomplish the task, while data is used to train a "motion prior" that encodes a notion of naturalness. ->

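The "task reward + motion prior" recipe in the AMP tweets above can be sketched as a reward that adds a discriminator-based "style" term to the task term. Everything below (the toy discriminator, the feature tensors, the particular reward transform) is a simplified stand-in for illustration, not the AMP training code.

```python
# Illustrative sketch: combine a task reward with a "style" reward from a
# discriminator trained on reference-motion data (hypothetical components).
import torch
import torch.nn as nn

disc = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))  # toy discriminator

def style_reward(transition_features):
    """Map discriminator logits to a non-negative style reward.
    transition_features: (batch, 16) features of (state, next_state) pairs."""
    with torch.no_grad():
        logits = disc(transition_features).squeeze(-1)
        # One common choice: reward is higher when the discriminator thinks
        # the transition looks like the reference motions.
        return -torch.log(1.0 - torch.sigmoid(logits) + 1e-6)

def total_reward(task_reward, transition_features, style_weight=0.5):
    return task_reward + style_weight * style_reward(transition_features)

feats = torch.randn(8, 16)
print(total_reward(torch.zeros(8), feats))
```
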
2021-04-06 05:23:05 @DhuriHardik Generally I recommend undergraduate level statistics (e.g., a good theory of probability course), or equivalent to Berkeley's CS70, and college-level linear algebra. However, the class goes fast through basic machine learning, so it's better to have taken an intro ML course.

2021-04-06 04:44:43 If you would like to evaluate your off-policy evaluation (OPE) methods, check out our OPE benchmarks. Deep off-policy evaluation (DOPE) -- though you can use it with non-deep methods too if you prefer! https://t.co/31TjOjctdu https://t.co/FCejmT50uw https://t.co/KDXHNw3HgU https://t.co/RDEpJ7z9f5

2021-04-05 04:08:33 Of course, if you want a more in-depth grad-level course on deep RL, you can also check out the full-length CS285 course here: https://t.co/tAGzTDfvUc CS182 covers a broad range of deep learning materials (not just RL), whereas CS285 goes deep (heh) on RL especially.

2021-04-05 04:07:45 I've updated CS182: Deep Learning with new lectures on RL (policy gradients, actor critic, Q-learning), autoencoders, latent variable models, and VAEs! Website: https://t.co/0PaBOJElo9 Playlist (lectures 15-18): https://t.co/21GvfFRcHo GANs and adv examples coming next!

2021-04-02 23:53:19 @SiddKotwal Yup, we checked this. The github has all the homework materials, and the slides should all be accessible on the website!

2021-04-01 20:39:35 RT @shaneguML: Policy Information Capacity (aka PIC, reward empowerment) is accepted as 1 of 3 contributed talks @iclr_conf Never-ending RL…

2021-03-30 04:44:10 This was a really fun collaboration to get motion imitation + RL for bipedal locomotion with the Cassie robot! Some really exciting experimental results, and a few unintended "emergent" recoveries https://t.co/J4RGtI2OtC

2021-03-29 17:34:18 Paper: https://t.co/uRwMubdlsj Website: https://t.co/e5QrrGRylv @sidgreddy's talk for ICLR: https://t.co/M8jM4oMv1N An awesome collaboration: Jensen Gao & @GlenBerseth & Nicholas Hardy, @NikhileshNatraj, @KaruneshGanguly (UCSF)

2021-03-29 17:30:42 Our colleagues at UCSF conducted experiments with X2T with a patient who has an ECoG electrode array implanted into their brain, and showed that X2T could allow the patient to type words more effectively using the BCI interface than a non-adaptive baseline learning system! https://t.co/IFvsSX9IiM

2021-03-29 17:29:29 The main idea is that each time the user presses "backspace" to reverse the previous word or character, the system receives negative reinforcement. Over time, it tailors its behavior to the particular user, allowing for human-machine "co-adaptation". Here is a writing example: https://t.co/uQq5nw9gr8

2021-03-29 17:28:37 X2T is a reinforcement learning system that helps users control computers with: gaze, handwriting, and brain-computer interfaces (!). The user performs a task (e.g., typing), while rewarding the interface for correctly parsing their signals. -> https://t.co/e5QrrGRylv https://t.co/Ez4zqdC0bC

2021-03-25 23:14:53 @jmac_ai @abhishekunique7 All the same tricks for Q apply as far as I know, though we didn't study in great detail precisely which of these tricks are most important in this case; maybe @ashvinair has other thoughts on this?

2021-03-25 03:12:31 RCE enables RL to learn behaviors from data -- RCE does not require a reward function, and does not attempt to learn one, but directly learns policies that match a distribution of user-provided success states. It generalizes C-Learning to arbitrary tasks https://t.co/ieEmqI8aFq https://t.co/HPjVaYexks

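The example-based setting in the RCE tweet above (success-state examples instead of a reward function) can be illustrated with a deliberately naive baseline: fit a classifier separating user-provided success states from ordinary dataset states and use its log-odds as a surrogate reward for standard RL. RCE itself does something different and more principled; the sketch below, with made-up data, only illustrates the problem setup.

```python
# Naive illustration of example-based control (NOT the RCE algorithm):
# turn a set of success-state examples into a surrogate reward via a classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
success_states = rng.normal(loc=1.0, size=(200, 4))   # user-provided success examples
dataset_states = rng.normal(loc=0.0, size=(2000, 4))  # ordinary experience

X = np.vstack([success_states, dataset_states])
y = np.concatenate([np.ones(len(success_states)), np.zeros(len(dataset_states))])
clf = LogisticRegression(max_iter=1000).fit(X, y)

def surrogate_reward(state):
    """Log-odds of 'looks like a success state' -- usable as a dense reward."""
    p = clf.predict_proba(state.reshape(1, -1))[0, 1]
    return float(np.log(p / (1.0 - p + 1e-9)))

print(surrogate_reward(np.ones(4)), surrogate_reward(np.zeros(4)))
```
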
2021-03-20 21:18:13 @SiddKotwal Can you let me know what you were trying to access that requires a UCB ID? The homework github repos appear to be accessible to everyone.

2021-03-15 03:24:18 @ChrisDeCstVerde Yes, CS285 homeworks are linked here: https://t.co/tAGzTDfvUc For CS182, we currently have HW1 on the website: https://t.co/0PaBOJElo9 We'll make the rest of the CS182 homeworks available on the above website shortly.

2021-03-13 23:23:17 Note that this is separate from my deep RL course -- CS182 covers deep learning broadly (only a few lectures about RL, contrary to the typo I had in the description...). If you want to go deep (heh) on RL in particular, check out my other course: https://t.co/SzCxe71WVB

2021-03-13 23:20:05 Big thanks to the following courses and faculty that I borrowed material and curriculum ideas from: Andrej Karpathy et al. (CS231n), Chris Manning (CS224n), John Canny (prior offerings of CS182). I did my best to credit whenever possible, but I probably missed many, sorry!

2021-03-13 23:18:13 Topics include: - basic machine learning - convolutional neural networks - recurrent neural networks - transformers - NLP - reinforcement learning - generative models &… (so far we've covered everything up to RL, which will be posted in a few weeks)

2021-03-13 23:17:22 I'm releasing all the lectures (so far) for my deep learning class, CS182! This is an introductory deep learning course (advanced undergraduate + graduate) covering a broad range of deep learning topics. Website: https://t.co/0PaBOJElo9 Playlist: https://t.co/21GvfFRcHo ->

2021-03-13 04:14:16 @Vikashplus @chelseabfinn Yeah, the model is actually a badly butchered version of the PR2 model that I believe you converted from the WG URDF.

2021-03-12 21:01:14 As @chelseabfinn just pointed out to me, there is even proof (complete with awkward comment from me) https://t.co/XnikaLRIzS

2021-03-12 18:50:48 Also, here is a cool animation of a robust (stochastic) policy pushing an object around a barrier. Ben did all the work here, but I can claim a small contribution: in 2014 I made the robot model, and in 2015, on suggestion of Marvin Zhang and @chelseabfinn, I added the eyes... https://t.co/A3v3UCTKMh

2021-03-12 18:41:31 But precisely *how* MaxEnt RL is robust (i.e., to what set of perturbations) has been a mystery. This work expands our earlier work on how MaxEnt RL is robust to *reward* perturbations: https://t.co/xOdhcvO19J But we now show dynamics robustness, which is arguably much more useful.

2021-03-12 18:40:03 Of course, relationships between maximum entropy models and robustness in general are widely known, and the robustness perspective is in fact the origin of maxent modeling in the literature. Hence people suspected for a long time that MaxEnt *RL* must be robust in some way.

2021-03-12 18:39:04 The particular stochasticity induced by MaxEnt RL results in an objective that lower-bounds a min-max robust control objective, meaning that policies learned by MaxEnt RL should be provably robust to certain perturbations -- this is not the case for regular RL.

2021-03-12 18:37:47 We often hear that MaxEnt RL is "robust" -- but what does that mean? In his new paper &… -> Paper: https://t.co/LJs50mwdLb Post: https://t.co/j5X2MW6GcK

2021-03-11 05:08:12 RT @shiorisagawa: We’ve released v1.1 of WILDS, our benchmark of in-the-wild distribution shifts! This adds the Py150 dataset for code comp…

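For readers less familiar with the term, the MaxEnt RL objective referred to in the robustness tweets above is usually written as the expected return augmented with a policy-entropy bonus (the robustness bound itself is in the linked paper, not reproduced here):

$$ J_{\text{MaxEnt}}(\pi) \;=\; \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t} r(s_t, a_t) \;+\; \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big)\right] $$
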
2021-03-09 19:15:08 What is offline reinforcement learning? I made a talk giving a *non-technical* overview that explains offline RL. Since these days we pre-record our talks, I figured I would also share it with all of you! If in a hurry, watch at 2x speed... https://t.co/jJxanVSV5X

2021-03-05 23:50:41 New blog post from @natashajaques &… https://t.co/7vTTsA9RKS https://t.co/pjeZgNJjUz

2021-03-01 20:47:04 Great article by @TiernanRayTech on robot learning, including our work at Berkeley &… Hopefully will encourage even more researchers to tackle the difficult problems in robotic RL!

2021-02-17 06:12:19 RT @chelseabfinn: A new algo for RL from offline data: COMBO pushes down the est. value of states &…

2021-02-17 06:06:37 Also, I couldn't be more proud of such a wonderful group of collaborators for this project: @TianheYu, @aviral_kumar2, @rmrafailov, @aravindr93, @chelseabfinn

2021-02-17 05:33:25 The reason I am really excited about this paper: the empirical results are very promising, for both state and image-based tasks, and the theory provides us with some insight about why this works. This gives us good confidence that the lessons will generalize to other tasks too.

2021-02-17 05:20:17 A principled offline model-based RL method: COMBO combines conservative Q-learning (CQL) with model-based learning, providing state-of-the-art offline RL results and formal guarantees! https://t.co/qi3u9erxOV w/ @TianheYu, Aviral Kumar, R. Rafailov, @aravindr93, @chelseabfinn https://t.co/IFGbStG2Oc

2021-02-08 21:52:05 RT @julianibarz: Our latest paper summarizing what we learned for the last 5+ years when applying RL to robotics. https://t.co/SbXppEXH9j

2021-02-08 19:12:17 This project was led first and foremost by @julianibarz, with a wonderful group of colleagues who contributed portions of the paper: Jie Tan, @chelseabfinn, Mrinal Kalakrishnan, Peter Pastor.

2021-02-08 19:11:25 It is important in robotic RL to think about not just the math in the algorithms, but the practicalities of getting learning systems to work in the real world: ops (can the robot keep training for a long time), safety, resets, scalability, etc.

2021-02-08 19:10:49 It's also a little bit out of date at this point (it's a journal paper, which took nearly a year to clear review, despite having very few revisions... but that's life I suppose). But we hope it will be pretty valuable to the community.

2021-02-08 19:09:37 This is somewhat different from the usual survey/technical paper: we are not so much trying to provide the technical foundations of robotic deep RL, but rather describe the practical lessons -- the stuff one doesn't usually put in papers.

2021-02-08 19:09:01 What did we learn from 5 years of robotic deep RL? My colleagues at Google and I tried to distill our experience into a review-style journal paper, covering some of the practical aspects of real-world robotic deep RL: https://t.co/fYGQfFYlKu ->

2021-02-03 20:20:45 RT @doomie: Blog post (with @babaeizadeh) describing our work and the library that can create some of the magic! https://t.co/ClqOtrUUuz

2021-02-03 20:20:38 RT @babaeizadeh: Blog post on our latest experiments on visual model based reinforcement learning https://t.co/XlS1aME66z One of the most…

2021-01-18 19:00:30 @pyoudeyer @FlowersINRIA It's inspiring how simple, cheap robots can be used to do cutting-edge research like this. For intrinsic motivation, lifelong learning, and other "persistent exploration" things, it's a great formula. Also safer and more accessible than bulky, expensive industrial robots :)

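The COMBO tweets above build on the conservative Q-learning idea; the gist can be sketched as a regularizer that pushes Q-values down on synthetic (e.g., model-generated) state-action samples and up on dataset samples, added to the usual TD loss. The network and tensors below are hypothetical stand-ins, and this is a schematic flavor of the idea, not the published objective.

```python
# Schematic sketch of a conservative regularizer (CQL/COMBO flavor), with
# hypothetical stand-in tensors. Not the published objective.
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(6, 64), nn.ReLU(), nn.Linear(64, 1))  # Q on concat[s, a]

def conservative_loss(data_sa, synthetic_sa, td_loss, beta=1.0):
    """data_sa: (batch, 6) state-action pairs from the offline dataset.
    synthetic_sa: (batch, 6) state-action pairs from model rollouts / policy samples.
    td_loss: the usual Bellman error term (computed elsewhere)."""
    q_synthetic = q_net(synthetic_sa).mean()  # pushed down
    q_data = q_net(data_sa).mean()            # pushed up
    return td_loss + beta * (q_synthetic - q_data)

data_sa = torch.randn(32, 6)
synthetic_sa = torch.randn(32, 6)
print(conservative_loss(data_sa, synthetic_sa, td_loss=torch.tensor(0.0)))
```
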
2021-01-10 22:25:36 Covers these papers: AWAC: https://t.co/JYyprRInhR MOPO: https://t.co/53VtOZKbcx CQL: https://t.co/TYL7RTNbO7 COG: https://t.co/i4rVXZoQbF

2021-01-10 22:24:25 This was a fun talk! An extended version of my offline RL + common sense talk, also covering model-based offline RL. https://t.co/6I2kADwdTD

2021-01-10 01:24:37 On a personal note, this was a really enjoyable collaboration with an amazing team: Stephen Tian, @SurajNair_1, Frederik Ebert, Sudeep Dasari, Ben Eysenbach, @chelseabfinn (Berkeley, Stanford, CMU). Couldn't have wished for a better group of collaborators for this :)

2021-01-10 01:22:35 This outperforms a range of prior methods in simulation, and can even control a real-world robot to open/close a drawer! (robot experiments were done by Suraj Nair, in @chelseabfinn's lab at Stanford) https://t.co/yV3ggQ5ai4

2021-01-10 01:21:28 The sampling-based (CEM) planner then rolls out sampled action sequences, and selects the one that ends up with the shortest (predicted) dynamical distance to the goal. Learning the distance function turns out to be just an offline Q-learning problem. https://t.co/LP9x22VEzU

2021-01-10 01:20:42 The idea: we learn a predictive model (predict what will happen next given image+action), but this is only good for short(ish) horizon planning. For longer horizons, we also learn a "dynamical distance" that predicts how many steps to the goal, which we use as a planning cost https://t.co/PpOZXAxRSU

2021-01-10 01:19:52 Offline model-based RL for goal reaching: learn a distance "Q-like" function from offline data, and a video prediction model, then use them to accomplish visually indicated goals. w/ Stephen Tian et al. https://t.co/pmXL8fGHXv https://t.co/x9XXI7PN06 >

2021-01-07 03:05:55 My talk from the Knowledge Based RL Workshop at IJCAI can be watched here: https://t.co/xB3csuo9zw The talk covers COG and CQL, with a perspective on how offline RL can implicitly acquire knowledge and "common sense" from prior data!

2021-01-07 02:28:28 At IJCAI? I'm giving a talk in the Knowledge Based RL Workshop at 6:30 pm PT (in 5 min!), on how offline RL can distill knowledge &…

2021-01-06 20:01:18 How can we train generative models to predict possible future outcomes, without regard for time steps, and then use them for control? @michaeljanner's BAIR blog post on gamma-models explores this question, with some neat visualizations! https://t.co/KnzYVERXyq

2020-12-18 02:27:41 I think this method is promising for a few reasons: - offline training using large prior datasets - fully autonomous - flexible, new goals specified just using images. In experiments, data was collected ~6 mo prior (for another project), so it's even robust to changing seasons :)

2020-12-18 02:26:14 Or you could command the robot to patrol an area by giving it visually indicated waypoints, simply by snapping photographs of each waypoint. 5/n https://t.co/s4AmMafOSK

2020-12-18 02:25:30 This system can then be used in a few interesting ways: we can define "contactless delivery" targets just by having someone take a photo of their front door, and the robot then navigates to their front door to deliver their package. 4/n https://t.co/r8gIKBk1qR

2020-12-18 02:24:44 Once we have a distance function, policy, and graph, we search the graph to find a path for new visually indicated goals (images), and then execute the policy for the nearest node. A few careful design decisions (in the paper) make this work much better than prior work. 3/n https://t.co/45BR3xCsCi

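The "distance function + graph + policy" recipe in the navigation tweets above can be sketched with plain Dijkstra over a topological graph whose nodes are stored observations and whose edge weights come from a learned distance predictor; the robot would then invoke its goal-conditioned policy toward the first waypoint on the path. The distance predictor, node features, and reachability threshold below are stand-ins for illustration.

```python
# Illustrative sketch: shortest-path search over a topological graph whose edge
# costs come from a learned distance predictor (replaced by a stand-in here).
import heapq
import itertools
import numpy as np

rng = np.random.default_rng(0)
nodes = rng.normal(size=(20, 3))  # stored observations (stand-ins for images)

def predicted_distance(a, b):
    """Stand-in for a learned 'how many steps from a to b' predictor."""
    return float(np.linalg.norm(a - b))

# Connect node pairs the distance model considers reachable.
edges = {i: [] for i in range(len(nodes))}
for i, j in itertools.permutations(range(len(nodes)), 2):
    d = predicted_distance(nodes[i], nodes[j])
    if d < 2.5:  # reachability threshold (assumed)
        edges[i].append((j, d))

def shortest_path(start, goal):
    """Plain Dijkstra; returns the node sequence to follow with the policy."""
    dist, prev, pq = {start: 0.0}, {}, [(0.0, start)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == goal:
            break
        if d > dist.get(u, float("inf")):
            continue
        for v, w in edges[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    if goal != start and goal not in prev:
        return None  # no path found under the reachability threshold
    path, node = [goal], goal
    while node != start:
        node = prev[node]
        path.append(node)
    return path[::-1]

print(shortest_path(0, 7))
```
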
2020-12-18 02:23:25 The idea: use RL + graph search to learn to reach visually indicated goals, using offline data. Starting with data in an environment (which in our case was previously collected for another project, BADGR), train a distance function and policy for visually indicated goals. 2/n https://t.co/2HtNr6N894

2020-12-18 02:21:31 RL enables robots to navigate real-world environments, with diverse visually indicated goals: https://t.co/r6m5yJYrQW w/ @_prieuredesion, B. Eysenbach, G. Kahn, @nick_rhinehart paper: https://t.co/MRKmGStx6Y video: https://t.co/RZVVD2pku7 Thread below ->

2020-12-17 20:15:53 Aviral Kumar and I have posted our NeurIPS offline reinforcement learning tutorial on YouTube for your enjoyment :) Slides, colab exercise, etc.: https://t.co/S639WkAroh Part 1: https://t.co/OozPaXLVhF Part 2: https://t.co/MPLhyipS1K

2020-12-17 20:14:03 All of the main lectures for UC Berkeley's fall 2020 deep RL course are now posted: https://t.co/SzCxe71WVB Newly posted lectures: 21 (transfer learning), 22 (meta-learning), 23 (open problems)!

2020-12-16 02:34:45 Distribution shift is a fact of life for models in the real world: the test distribution will never really match the training distribution. But evaluating methods that account for distribution shift is difficult. WILDS aims to address that, across a range of realistic domains. https://t.co/WNP1nRUXvb

2020-12-15 18:48:50 @rowantmc @ToyotaResearch @adnothing Congrats Rowan on your new job, we'll miss you!

2020-12-15 05:21:36 This online setting is, I think, an especially good fit for meta-learning: the final model after seeing many tasks can generalize zero-shot, but long before it gets there, meta-learning can accelerate how few tasks/images are needed to get to that point!

2020-12-15 05:20:50 One neat thing that this method can do is open-domain classification: a new class is added with each task, and the network adapts by gradient descent, using a Siamese architecture. Learning is online; the classifier for each class must improve after every single new datapoint. https://t.co/ZDWO1YR5ps

2020-12-15 05:19:17 Online meta-learning methods learn to learn continually, becoming better at learning new tasks with each task. In new work, we demonstrate incremental online meta-learning, which learns one datapoint and task at a time. @TianheYu, X. Geng, @chelseabfinn https://t.co/LWbH2hUHnB https://t.co/PtwlTNR5Pp

2020-12-12 18:53:00 @unsorsodicorda So far it does seem to work quite well, though the tasks are a bit different (e.g., we didn’t test it on images). We are still finalizing the paper, so this is a “fresh off the press” kind of thing (not on arxiv yet, just at the workshop)

2020-12-12 00:39:06 At the WS on challenges of real-world RL, B. Eysenbach will present DARC. Domain adaptation for RL: how to train RL agents in one domain, but have them pretend they are in another. Pres (15:20 PT Sat 12/12): https://t.co/8eczV4hkM8 Paper: https://t.co/FqoaM5BxmS https://t.co/Sa9jbozns4 https://t.co/0UiUZlw073

2020-12-11 21:18:14 At ML4Molecules (https://t.co/yyLUzGsRW6), Justin Fu will present Offline Model-Based Optimization via Normalized Maximum Likelihood (NEMO), for optimizing designs from data w/ NML! 8:30am ML4Molecules poster session. Paper: https://t.co/N9jWfIZhOS Poster: https://t.co/ZaYycpuhAF https://t.co/gwo1iteH0Q

2020-12-11 20:16:37 RT @agarwl_: Today, we will be presenting our work on how deep neural networks interact with Q-learning (w/t Aviral, @its_dibya, @svlevine)…

2020-12-11 18:43:25 At the deep RL WS and robot learning WS, @_oleh and @chuning_zhu will present collocation-based planning for image-based model-based RL! By relaxing the dynamics, the robot imagines the object "flying" to the goal before figuring out how to move it. Video: https://t.co/4ePoBMUvkT paper: https://t.co/g8hp7bLas2 https://t.co/ksEfhdJ6V8

2020-12-11 18:36:38 Also at the deep RL WS, Abhishek Gupta &… Room D, C7, Deep RL Workshop, 12:30-1:30 and 6-7 PST. Paper: https://t.co/gA6HhPNVqy Slideslive: https://t.co/Q5hE42riUa https://t.co/2lu935FeYT

2020-12-11 18:35:38 At the deep RL WS, Abhishek Gupta &… Room B, B5, 12:30-13:30 &… Paper: https://t.co/fn8sEiLOsT Pres: https://t.co/FO7UwlZ53k Blog: https://t.co/8wZp0pEiOx https://t.co/QUlaG5lTZf

2020-12-11 05:32:47 Enjoy all the NeurIPS workshops!!

2020-12-11 05:32:23 At the meta-learning workshop, Marvin Zhang will present Adaptive Risk Minimization: how models can learn to adapt to distributional shifts at test time via meta-learning. Paper: https://t.co/u1FZBiUJZ1 Pres: https://t.co/T8bRr8N0UC https://t.co/KvfQubyjEz

2020-12-11 05:31:35 At the robot learning WS (8:45am PT poster) and real-world RL WS (12/12 11:20am poster), @avisingh599 will present PARROT: pre-training models that explore for diverse robotic skills. Arxiv: https://t.co/COMGTmCInG Video: https://t.co/2MU2V8VNGa Website: https://t.co/1o1T6rsiHb https://t.co/sttRDBgGp8

2020-12-11 05:31:11 At the deep RL WS, and as a long oral presentation at the offline RL WS, @avisingh599 will present COG: how offline RL can chain skills and acquire a kind of “common sense”. Vid: https://t.co/v0yt1j8Lh2 Web: https://t.co/6A9INvrB5G Blog: https://t.co/cAhxunkbAN Offline RL talk 12/12 9:50am https://t.co/EY6AitWuF9

2020-12-11 05:30:42 At the deep RL WS, @mmmbchang will present “Modularity in Reinforcement Learning: An Algorithmic Causality Perspective on Credit Assignment”: how causal models help us understand transfer in RL! Poster: https://t.co/85qJafdpAg Paper: https://t.co/RIuQchAnjI Vid: https://t.co/ELHZEtn0Eg https://t.co/xspMNqNPyc

2020-12-11 05:29:53 Also at the deep RL WS, @snasiriany and co-authors will present “DisCo RL”: RL conditioned on distributions, which provides much more expressivity than conditioning on goals. Paper: https://t.co/SYJqtfHfoT Presentation: https://t.co/c2zGwg8q8u Poster: https://t.co/MUaM4MeQET

2020-12-11 05:27:58 Also at the deep RL WS, Aviral Kumar will present “Implicit Under-Parameterization” – our work on how TD learning can result in excessive aliasing due to rank collapse. Paper: https://t.co/haeE1YX4Ue Video: https://t.co/xoEEq2t0Gh https://t.co/Z7nddVBV3z

2020-12-11 05:27:35 Also at the deep RL WS posters, @timrudner and @vitchyr will present “Outcome-Driven Reinforcement Learning,” describing how goal-conditioned RL can be derived in a principled way via variational inference. Paper: https://t.co/aaG6wm2Kvp Talk: https://t.co/cWS0dH1fIN https://t.co/n0iRAmtUxG

2020-12-11 05:26:24 Ashvin Nair will present AWAC, offline RL with online finetuning, also at the deep RL WS poster session. pres: https://t.co/Ns9kDNsG90 paper: https://t.co/JYyprRInhR blog: https://t.co/4MAkic9mT4 https://t.co/NFhsjmtOSz

2020-12-11 05:26:00 Also at the deep RL WS posters, Jensen Gao &… Paper: https://t.co/ygkCJH8pmJ Talk: https://t.co/60T8YKh3TO https://t.co/nyHZ0dhpee

2020-12-11 05:24:43 Ben will also present C-Learning: a new algorithm for goal-conditioned learning that combines RL with principled training of predictive models. Deep RL poster session, 12:30 pm PT. Paper: https://t.co/Ms0e9wvxaA Website: https://t.co/GUhYjmAIXC Talk: https://t.co/5G30WgNZcy https://t.co/lOSoxSf97V

2020-12-11 05:24:25 At the deep RL workshop, Ben Eysenbach will talk about how MaxEnt RL is provably robust to certain types of perturbations. Contributed talk at 2:00 pm PT 12/11. Paper: https://t.co/ZwaGqzwdtb Talk: https://t.co/BvjCFW257j

2020-12-11 05:24:13 At the robot learning workshop, @katie_kang_ will present the best-paper-winning (congrats!!) “Multi-Robot Deep Reinforcement Learning via Hierarchically Integrated Models”: how to share modules between multiple real robots

2020-12-11 05:23:28 My favorite part of @NeurIPSConf is the workshops, a chance to see new ideas and late-breaking work. Our lab will present a number of papers &… thread below -> meanwhile here is a teaser image :) https://t.co/OCiU537VPH

2020-12-10 21:14:53 In-depth analysis of image-based model-based RL by @babaeizadeh and colleagues is finally out! Lots of analysis studying why model-based RL with images works, and which design decisions are more or less important. Some of the conclusions may surprise you! (or they might not...) https://t.co/dhHbLOM1Ec

2020-12-10 21:13:20 @kargarisaac Yes, or get labels somewhere (e.g., human labelers). Nice thing with offline data (whether MBO or offline RL) is that the labels only need to be obtained once, just like with any supervised learning application.

2020-12-10 21:12:04 @unsorsodicorda We do have experiments with real-world datasets (images, proteins, etc.). Curious what kind of applications you would have in mind -- we are actively trying to find suitable domains in which to evaluate these methods.

2020-12-10 19:36:20 In contrast, many alternative methods rely on active sampling of the data -- when we are talking about real-world data (e.g., biology experiments, aircraft designs, etc.), each datapoint can be expensive and time-consuming, while offline MBO can reuse the same data.

2020-12-10 19:35:36 Why does this problem matter? In many cases we *already* have offline data (e.g., previously synthesized drugs and their efficacies, previously tested aircraft wings and their performance, prior microchips and their speed), so offline MBO uses this data to produce new designs.

2020-12-10 19:31:43 MINs address this issue with a simple approach: instead of learning f(x) = y, learn f^{-1}(y) = x -- here, the input y is very low dimensional (1D!), making it much easier to handle OOD inputs. This ends up working very well in practice.

2020-12-10 19:31:06 This can be mitigated by active sampling (i.e., collecting more data), but this is often not possible in practice (e.g., requires running costly experiments). Or by using Bayesian models like GPs, but these are difficult to scale to high dimensions.

2020-12-10 19:30:30 Classically, model-based optimization methods would learn some proxy function (acquisition function) fhat(x) = y, and then solve x* = argmax_x fhat(x), but this can result in OOD inputs to fhat(x) when x is very high dimensional.

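To make the failure mode described in the MBO tweets above concrete, here is a toy sketch of the classical recipe being criticized: fit a proxy fhat(x) ≈ y on offline (x, y) pairs, then ascend fhat by gradient steps, which happily walks x far outside the data distribution. The data, proxy, and step sizes are all made up for illustration; the inverse-map alternative (learn x = f^{-1}(y) and query it at a desirable y) is only noted in a comment and is not the MINs model itself.

```python
# Toy illustration of why "fit a proxy fhat(x)=y, then argmax_x fhat(x)" can
# leave the data distribution (the failure mode described above).
import numpy as np

rng = np.random.default_rng(0)

def true_score(x):           # unknown ground-truth objective
    return -(x - 1.0) ** 2

x_data = rng.uniform(-2.0, 0.0, size=200)          # offline designs, all in [-2, 0]
y_data = true_score(x_data) + 0.05 * rng.normal(size=200)

coeffs = np.polyfit(x_data, y_data, deg=1)          # proxy fhat(x)
fhat_grad = lambda x: coeffs[0]                     # d/dx of a linear proxy

x = 0.0
for _ in range(200):                                # naive gradient ascent on the proxy
    x += 0.1 * fhat_grad(x)

print(f"optimized x = {x:.2f} (training data lived in [-2, 0]; true optimum is 1.0)")
print(f"proxy thinks score = {np.polyval(coeffs, x):.2f}, true score = {true_score(x):.2f}")
# An inverse map f^{-1}(y) -> x trained on the same data, queried at a high y,
# keeps the *input* to the learned model one-dimensional and in-distribution.
```
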
2020-12-10 19:29:52 The problem setting: given samples (x,y) where x represents some input (e.g., protein sequence, image of a face, controller parameters) and y is some metric (e.g., how well x does at some task), find a new x* with the best y *without access to the true function*.

2020-12-10 19:29:08 Tonight 12/10 9pm PT, Aviral Kumar will present Model Inversion Networks (MINs) at @NeurIPSConf. Offline model-based optimization (MBO) that uses data to optimize images, controllers and even protein sequences! paper: https://t.co/ePP7yjAgj0 pres: https://t.co/lsJ1qreCmj more ->

2020-12-10 02:49:58 Tmrw (12/10), @imkelvinxu will present "Continual Learning of Control Primitives: Skill Discovery via Reset-Games": lifelong learning without resets, while acquiring a repertoire of skills. Check it out at the 9am PT poster session at @NeurIPSConf! Paper: https://t.co/EtNpMR2dy8 https://t.co/xK8uxQ3Tq8

2020-12-09 22:27:59 Come check out Tianhe Yu's PCGrad presentation at @NeurIPSConf, 9pm PT! Links below. PCGrad mitigates conflicting gradients in multi-task learning, with a simple projection trick that cancels off conflicting portions of gradients, greatly improving performance in multi-task settings https://t.co/UbB6SmJaao

2020-12-09 02:42:24 CQL will be presented in the 9 pm PT poster session at @NeurIPSConf on 12/8 (Tuesday, today). Paper: https://t.co/TYL7RTNbO7 Website: https://t.co/u0eVrncvAF Poster session: https://t.co/mH7c5qi7vC

2020-12-09 02:39:49 Today at 9pm PT (in 2.5 hrs), Aviral Kumar will present Conservative Q-Learning (CQL). CQL learns Q-functions that lower bound the true Q-function, providing some interesting theoretical guarantees, as well as excellent empirical performance. Come by Aviral's poster! more below >

2020-12-08 18:09:51 Paper: https://t.co/g0VQv7GlBS Presentation (6 pm PT today 12/8): https://t.co/U11qNs6C2H w/ Ben Eysenbach, Young Geng, @rsalakhu

2020-12-08 18:09:15 The idea: for all collected experience, infer the posterior over tasks for which that behavior is near-optimal. This is the same as the inverse RL problem. Then use this posterior to relabel this experience. This can work for any reward class: goals, linear rewards, non-linear rewards, etc.

2020-12-08 18:08:25 Today (Tue 12/8) check out Ben Eysenbach's long talk on HIPI: Rewriting History with Inverse RL at @NeurIPSConf, at 6 pm PT: https://t.co/U11qNs6C2H Find out how inverse RL allows us to formalize relabeling of rewards, improving learning of contextual policies. more ->

2020-12-08 17:28:04 RT @chelseabfinn: Interested in robustness in reinforcement learning? @SaurabhKRL is presenting this paper at @NeurIPSConf right now (9-11…

2020-12-08 05:52:54 And one more for tomorrow: Alex Lee and Anusha Nagabandi will present SLAC: Stochastic Latent Actor-Critic. Come learn how representation learning can drastically speed up RL! 9 am PT Tue 12/8 Pres: https://t.co/lwHkiXnLrC Web &…

2020-12-08 05:00:12 Tmrw @NeurIPSConf, @michaeljanner will present gamma-models at 9am PT (12/8): models that predict densities over potential future states over a long horizon, rather than one step at a time. Pres: https://t.co/CwOWRKiCFr Paper: https://t.co/j4FKAQmrWK Web: https://t.co/QYga0n3Vy4 https://t.co/pSwDoa2F6n

2020-12-08 04:54:37 HEDGE trains "tree-structured" models that can generate intermediate subgoals between pairs of states. Paper: https://t.co/GlSlAEuvR0 Website: https://t.co/Rre1xTzu55 Video presentation: https://t.co/XGAFhjJVfw

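A minimal sketch of the projection trick described in the PCGrad tweet above: when two task gradients conflict (negative inner product), the conflicting component of one is removed by projecting it onto the normal plane of the other. This is an illustrative NumPy version, not the released implementation.

```python
# Illustrative PCGrad-style projection: drop the component of g_i that
# conflicts with g_j whenever their inner product is negative.
import numpy as np

def project_conflicting(grads):
    """grads: list of per-task gradient vectors (1-D arrays of equal length)."""
    projected = [g.copy() for g in grads]
    for i, g_i in enumerate(projected):
        for j, g_j in enumerate(grads):
            if i == j:
                continue
            dot = g_i @ g_j
            if dot < 0.0:  # conflicting gradients
                g_i -= dot / (np.linalg.norm(g_j) ** 2 + 1e-12) * g_j
    return np.mean(projected, axis=0)  # combined update direction

g1 = np.array([1.0, 0.0])
g2 = np.array([-0.5, 1.0])   # conflicts with g1
print(project_conflicting([g1, g2]))
```
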
2020-12-08 04:53:42 Tmrw (Tue 12/8 at 9 am PT) check out HEDGE at @NeurIPSConf: hierarchical planning with learned tree-structured models enables planning complex behaviors one subgoal at a time. w/ @KarlPertsch, @_oleh, @febert8888, @chelseabfinn, @dineshjayaraman https://t.co/nODdDelyKP more ->

2020-12-08 02:06:05 This is going to be presented tonight (in 3 hrs) at @NeurIPSConf, 9 pm PT poster session! Come find out about offline model-based RL. https://t.co/OGf7J27O1b

2020-12-07 22:14:03 RT @berkeley_ai: Check out Aviral Kumar and Avi Singh's @avisingh599 new blog post, "Offline RL: How Conservative Algorithms Can Enable New…

2020-12-07 19:06:05 The work in this blog post will also be presented at @NeurIPSConf! CQL will be presented Tue 9 pm PT in the poster session. COG will be presented at the offline RL workshop in the 12:40 pm PT contributed talks session.

2020-12-07 19:03:03 Want to learn even more offline RL? Aviral &… Offline Reinforcement Learning: How Conservative Algorithms Can Enable New Applications. Overview of CQL &…

2020-12-07 16:20:14 Our @NeurIPSConf tutorial on offline RL has started! Aviral and I are answering questions live via chat during the presentation, with a Q&A… Website with colab: https://t.co/S639WkAroh

2020-12-07 04:40:04 @SketcherRami @NeurIPSConf I posted the wrong time, but I believe I deleted the previous one, so this should be the correct one

2020-12-07 02:20:23 Paper: https://t.co/2LxmUc6b6J Presentation: https://t.co/6eVsfbVCtm Poster: https://t.co/jMoghV3WdU Blog Post: https://t.co/aU6EGeG3zs Poster session: 9-11pm PT Monday. Spotlight talk + Q&A… w/ Aviral Kumar, @abhishekunique7

2020-12-07 02:20:14 Can we stabilize deep RL by only backing up target values that are more likely correct? DisCor aims to achieve this, providing for "corrective feedback" during learning. See Aviral's @NeurIPSConf talk tmrw (Mon), 7:20pm PT: https://t.co/6eVsfbVCtm (poster at 9pm) more below ->

2020-12-06 22:52:24 This is joint work with: @MichaelD1729, @natashajaques, @EugeneVinitsky, @alexandrebayen, Stuart Russell, Andrew Critch paper: https://t.co/JCOUElGUTv code: https://t.co/jUWFGYZDsc

2020-12-06 22:51:33 Adversarial environment design: PAIRED learns to generate complex environments via minimax regret. Check out our presentation at @NeurIPSConf: Oral: 6:30 pm PT Mon 12/7: https://t.co/aLK2dbjzP2 Poster: 9 pm PT Mon: https://t.co/aLK2dbjzP2 more below: https://t.co/yDfoj5gI3k https://t.co/3IUMfQ1556

2020-12-05 23:06:10 There are also some more materials on the tutorial website: https://t.co/S639WkAroh Including a colab demo/exercise authored by Aviral for you to play with offline RL in gridworld domains, including an illustrative CQL implementation: https://t.co/QssLGJjVBc

2020-12-05 22:32:49 If you prefer to learn by reading rather than watching, you can also read our tutorial and survey paper on offline RL here: https://t.co/kv4mtVMScG If you want to try out some offline RL problems, try D4RL, the most popular offline RL benchmark suite: https://t.co/dPMs20puRW

2020-12-05 22:31:47 Want to learn about offline RL? Aviral Kumar and I are giving a tutorial on offline RL at @NeurIPSConf. Materials now online! https://t.co/ka741RCymv Main talk: 8 - 10:30 am PT 12/7 Mon. Q&A… Recording here: https://t.co/24HFYHiAE2 more ->

2020-12-03 04:43:08 @JeffDean @doomie @eigenhector Mine was about animating octopus hand gestures.

2020-11-21 00:34:21 What do pigeons have to do with robots? Both can learn with offline RL and combine skills from prior experience! Find out more about pigeons &… https://t.co/v0yt1j8Lh2

2020-11-20 17:24:52 I posted the paper link a few days ago for this, here it is again: https://t.co/piwcU5wMs4 High-level idea: train on the test point with each label, and see which model gives the highest likelihood. Making this tractable with amortization (ACNML) makes for a useful algorithm.

2020-11-20 17:23:57 Training on test points is a no-no. But training on test points with fake labels turns out to be a good idea. Aurick's blog post on amortized conditional normalized maximum likelihood (ACNML) shows how NML provides a principled view of "training on test points": https://t.co/iasW5xhaQs https://t.co/NkU7Ntl9fi

2020-11-20 15:43:21 I'm excited about this work because the exploration enabled by Parrot really looks much more intelligent than what we get with random movements, and evaluating on diverse tasks tests generalization. w/ @avisingh599, Huihan Liu, Gaoyue Zhou, Albert Yu, @nick_rhinehart

2020-11-20 15:42:28 We train a multimodal density model (based on normalizing flows) on past data, obtained from a variety of tasks, which allows a robot to learn basic principles like reaching for objects and picking things up. When faced with a new scene, robots will reach for random objects. https://t.co/06agbo7Onv

2020-11-20 15:40:58 RL agents explore randomly. Humans explore by trying potential good behaviors, because we have a prior on what might be useful. Can robots get such behavioral priors? That's the idea in Parrot. arxiv https://t.co/COMGTmCInG web https://t.co/1o1T6rsiHb vid https://t.co/2MU2V8VNGa https://t.co/cZSxqTHafl

2020-11-19 20:02:19 What I find exciting about this work is that it provides principles for a variety of "hacks" that are used for goal-conditioned policies, explaining "Q-value-like" quantities as densities, explaining relabeling, and what relabeling ratio to use. Website: https://t.co/GUhYjmAIXC

2020-11-19 20:01:28 C-Learning learns goal-conditioned policies using classifiers, without any hand-designed rewards. The theoretical framework in C-Learning helps to explain connections between prediction and Q-functions, relabeling, and others. https://t.co/tGEGEF52Ga w/ Eysenbach, @rsalakhu https://t.co/z3co465aBc

2020-11-18 19:45:47 @CsabaSzepesvari @neu_rips @jacobmbuckman Though I guess these days all of us are sitting at home by ourselves, so maybe the "maverick scientist" mythos feels more true now than it did before

2020-11-18 19:43:20 Today at 11:50 am (in 10 min!) @ryancjulian will present (at @corl_conf) "Never Stop Learning" -- finetuning for robotic RL. Come learn how finetuning enables large-scale robotic RL to adapt. talk https://t.co/o5uU7MMzTS CoRL https://t.co/5bp9Nptcww paper https://t.co/8DlOHcDrGA https://t.co/BRSQcE5H1B

2020-11-18 01:02:24 And here is a bonus animation visualizing the exploration process :) Presentation Link: https://t.co/X5LCIwynpi Paper: https://t.co/tJkg3RM9sE https://t.co/bWvKEtKUJW

2020-11-18 01:02:05 On Wed 11:10 - 11:40 AM at @corl_conf, we will present MELD, meta-RL from images for real-world robotic manipulation! Come find out how to get robots to figure out where to insert ethernet cables from images. w/ Tony Zhao, K. Rakelly, A. Nagabandi, @chelseabfinn https://t.co/UHxgNlHFqJ

2020-11-17 23:03:54 @jacobmbuckman @neu_rips I do think there are several huge problems, I just don't think I'm the only one who is aware of them. Science is a social endeavor; important discoveries from maverick scientists who work in a dark basement are a myth.

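The "train on the test point with each fake label" idea from the ACNML tweets above is the conditional normalized maximum likelihood principle; a brute-force toy version is sketched below using scikit-learn with made-up data. The amortization that makes this tractable is the paper's contribution and is not shown here.

```python
# Brute-force sketch of conditional NML for a binary classifier: refit the model
# on the training set plus the test point with each candidate label, then
# normalize the resulting likelihoods. (ACNML amortizes this; not shown here.)
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 2)) + np.array([[1.0, 1.0]]) * rng.integers(0, 2, size=(100, 1))
y_train = (X_train.sum(axis=1) > 1.0).astype(int)

def cnml_probs(x_test, labels=(0, 1)):
    scores = []
    for y in labels:
        X_aug = np.vstack([X_train, x_test[None]])
        y_aug = np.append(y_train, y)
        model = LogisticRegression(max_iter=1000).fit(X_aug, y_aug)
        # Likelihood the refit model assigns to the label it was trained with.
        scores.append(model.predict_proba(x_test[None])[0, y])
    scores = np.array(scores)
    return scores / scores.sum()   # normalization step of NML

print(cnml_probs(np.array([2.0, 2.0])))
print(cnml_probs(np.array([-5.0, 5.0])))  # far from the training data -> tends to be less confident
```
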

Discover the AI Experts

Nando de Freitas Researcher at DeepMind
Nige Willson Speaker
Ria Pratyusha Kalluri Researcher, MIT
Ifeoma Ozoma Director, Earthseed
Will Knight Journalist, Wired