• Marius Miron, David Robinson, Milad Alizadeh, Ellen Gilsenan-McMahon, Gagan Narula, Emmanuel Chemla, Maddie Cusimano, Felix Effenberger, Masato Hagiwara, Benjamin Hoffman, Sara Keen, Diane Kim, Jane K. Lawton, Jen-Yu Liu, Aza Raskin, Olivier Pietquin, and Matthieu Geist. "AVEX: What Matters for Animal Vocalization Encoding." In International Conference on Learning Representations, 2026.
    @inproceedings{miron2026avex,
      title={AVEX: What Matters for Animal Vocalization Encoding},
      author={Marius Miron and David Robinson and Milad Alizadeh and Ellen Gilsenan-McMahon and Gagan Narula and Emmanuel Chemla and Maddie Cusimano and Felix Effenberger and Masato Hagiwara and Benjamin Hoffman and Sara Keen and Diane Kim and Jane K. Lawton and Jen-Yu Liu and Aza Raskin and Olivier Pietquin and Matthieu Geist},
      booktitle={International Conference on Learning Representations},
      year={2026},
      url={https://openreview.net/forum?id=MFuM9KAEYc}
    }
    Bioacoustics, the study of sounds produced by living organisms, plays a vital role in conservation, biodiversity monitoring, and behavioral studies. Many tasks in this field, such as species, individual, and behavior classification and detection, are well-suited to machine learning. However, they often suffer from limited annotated data, highlighting the need for a general-purpose bioacoustic encoder capable of extracting useful representations for diverse downstream tasks. Such encoders have been proposed before, but are often limited in scope due to a focus on a narrow range of species (typically birds), and a reliance on a single model architecture or training paradigm. Moreover, they are usually evaluated on a small set of tasks and datasets. In this work, we present a large-scale empirical study that covers aspects of bioacoustics that are relevant to research but have previously been scarcely considered: training data diversity and scale, model architectures and training recipes, and the breadth of evaluation tasks and datasets. We obtain encoders that are state-of-the-art on the existing and newly proposed benchmarks. We also identify what matters for training these encoders, such that this work can be extended when more data are available or better architectures are proposed. Specifically, across 26 datasets with tasks including species classification, detection, individual ID, and vocal repertoire discovery, we find that self-supervised pre-training followed by supervised post-training on a mixed bioacoustics + general-audio corpus yields the strongest in- and out-of-distribution performance. We show the importance of data diversity in both stages. To support ongoing research and applications, we release the model checkpoints as well as the Animal Vocalization Encoder library AVEX (an API for model loading and inference, and a Python-based system for training and evaluating bioacoustics representation learning models).
    ICLR 2026
  • Team Cohere: Aakanksha, Arash Ahmadian, Marwan Ahmed, Jay Alammar, Milad Alizadeh, Yazeed Alnumay, Sophia Althammer, et al. "Command A: An Enterprise-Ready Large Language Model." arXiv preprint, 2025.
    @misc{cohere2025commanda,
      title={Command A: An Enterprise-Ready Large Language Model},
      author={{Team Cohere} and Aakanksha and Arash Ahmadian and Marwan Ahmed and Jay Alammar and Milad Alizadeh and Yazeed Alnumay and Sophia Althammer and others},
      year={2025},
      eprint={2504.00698},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
    }
    In this report we describe the development of Command A, a powerful large language model purpose-built to excel at real-world enterprise use cases. Command A is an agent-optimised and multilingual-capable model, with support for 23 languages of global business, and a novel hybrid architecture balancing efficiency with top of the range performance. It offers best-in-class Retrieval Augmented Generation (RAG) capabilities with grounding and tool use to automate sophisticated business processes. These abilities are achieved through a decentralised training approach, including self-refinement algorithms and model merging techniques. We also include results for Command R7B which shares capability and architectural similarities to Command A. Weights for both models have been released for research purposes. This technical report details our original training pipeline and presents an extensive evaluation of our models across a suite of enterprise-relevant tasks and public benchmarks, demonstrating excellent performance and efficiency.
  • David Robinson, Marius Miron, Masato Hagiwara, Benno Weck, Sara Keen, Milad Alizadeh, Gagan Narula, Matthieu Geist, and Olivier Pietquin. "NatureLM-audio: an Audio-Language Foundation Model for Bioacoustics." arXiv preprint, 2025.
    @misc{robinson2025naturelmaudio,
      title={NatureLM-audio: an Audio-Language Foundation Model for Bioacoustics},
      author={David Robinson and Marius Miron and Masato Hagiwara and Benno Weck and Sara Keen and Milad Alizadeh and Gagan Narula and Matthieu Geist and Olivier Pietquin},
      year={2025},
      eprint={2411.07186},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
    }
    Large language models (LLMs) prompted with text and audio have achieved state-of-the-art performance across various auditory tasks, including speech, music, and general audio, showing emergent abilities on unseen tasks. However, their potential has yet to be fully demonstrated in bioacoustics tasks, such as detecting animal vocalizations in large recordings, classifying rare and endangered species, and labeling context and behavior -- tasks that are crucial for conservation, biodiversity monitoring, and animal behavior studies. In this work, we present NatureLM-audio, the first audio-language foundation model specifically designed for bioacoustics. Our training dataset consists of carefully curated text-audio pairs spanning bioacoustics, speech, and music, designed to address the field's limited availability of annotated data. We demonstrate successful transfer of learned representations from music and speech to bioacoustics, and our model shows promising generalization to unseen taxa and tasks. We evaluate NatureLM-audio on a novel benchmark (BEANS-Zero) and it sets a new state of the art on several bioacoustics tasks, including zero-shot classification of unseen species. To advance bioacoustics research, we release our model weights and benchmark data, and open-source the code for generating the training and benchmark data and for training the model.
  • Alexandre Matton, Tom Sherborne, Dennis Aumiller, Elena Tommasone, Milad Alizadeh, Jingyi He, Raymond Ma, Maxime Voisin, Ellen Gilsenan-McMahon, and Matthias Gallé. "On Leakage of Code Generation Evaluation Datasets." In EMNLP Findings, 2024.
    @inproceedings{matton2024leakage,
      title={On Leakage of Code Generation Evaluation Datasets},
      author={Alexandre Matton and Tom Sherborne and Dennis Aumiller and Elena Tommasone and Milad Alizadeh and Jingyi He and Raymond Ma and Maxime Voisin and Ellen Gilsenan-McMahon and Matthias Gallé},
      year={2024},
      booktitle={EMNLP Findings}
    }
    In this paper we consider contamination by code generation test sets, in particular in their use in modern large language models. We discuss three possible sources of such contamination and show findings supporting each of them: (i) direct data leakage, (ii) indirect data leakage through the use of synthetic data and (iii) overfitting to evaluation sets during model selection. Key to our findings is a new dataset of 161 prompts with their associated Python solutions, which is released at https://huggingface.co/datasets/CohereForAI/lbpp.
    EMNLP Findings 2024
  • Emilien Dupont, Hrushikesh Loya, Milad Alizadeh, Adam Golinski, Yee Whye Teh, and Arnaud Doucet. "COIN++: Neural Compression Across Modalities." Transactions on Machine Learning Research, 2022.
    @article{dupont2022coin,
      title={COIN++: Neural Compression Across Modalities},
      author={Emilien Dupont and Hrushikesh Loya and Milad Alizadeh and Adam Golinski and Yee Whye Teh and Arnaud Doucet},
      journal={Transactions on Machine Learning Research},
      year={2022}
    }
    Neural compression algorithms are typically based on autoencoders that require specialized encoder and decoder architectures for different data modalities. In this paper, we propose COIN++, a neural compression framework that seamlessly handles a wide range of data modalities. Our approach is based on converting data to implicit neural representations, i.e. neural functions that map coordinates (such as pixel locations) to features (such as RGB values). Then, instead of storing the weights of the implicit neural representation directly, we store modulations applied to a meta-learned base network as a compressed code for the data. We further quantize and entropy code these modulations, leading to large compression gains while reducing encoding time by two orders of magnitude compared to baselines. We empirically demonstrate the feasibility of our method by compressing various data modalities, from images and audio to medical and climate data.
    TMLR 2022
  • Milad Alizadeh, Shyam A. Tailor, Luisa M Zintgraf, Joost van Amersfoort, Sebastian Farquhar, Nicholas Donald Lane, and Yarin Gal. "Prospect Pruning: Finding Trainable Weights at Initialization using Meta-Gradients." In International Conference on Learning Representations, 2022.
    @inproceedings{alizadeh2022prospect,
      title={Prospect Pruning: Finding Trainable Weights at Initialization using Meta-Gradients},
      author={Milad Alizadeh and Shyam A. Tailor and Luisa M Zintgraf and Joost van Amersfoort and Sebastian Farquhar and Nicholas Donald Lane and Yarin Gal},
      booktitle={International Conference on Learning Representations},
      year={2022}
    }
    Pruning neural networks at initialization would enable us to find sparse models that retain the accuracy of the original network while consuming fewer computational resources for training and inference. However, current methods are insufficient to enable this optimization and lead to a large degradation in model performance. In this paper, we identify a fundamental limitation in the formulation of current methods, namely that their saliency criteria look at a single step at the start of training without taking into account the trainability of the network. While pruning iteratively and gradually has been shown to improve pruning performance, explicit consideration of the training stage that will immediately follow pruning has so far been absent from the computation of the saliency criterion. To overcome the short-sightedness of existing methods, we propose Prospect Pruning (ProsPr), which uses meta-gradients through the first few steps of optimization to determine which weights to prune. ProsPr combines an estimate of the higher-order effects of pruning on the loss and the optimization trajectory to identify the trainable sub-network. Our method achieves state-of-the-art pruning performance on a variety of vision classification tasks, with less data and in a single shot compared to existing pruning-at-initialization methods.
    ICLR 2022
  • Emilien Dupont, Adam Goliński, Milad Alizadeh, Yee Whye Teh, and Arnaud Doucet. "COIN: COmpression with Implicit Neural representations." In Neural Compression Workshop at ICLR 2021, 2021. (Spotlight)
    @inproceedings{dupont2021coin,
      title={COIN: COmpression with Implicit Neural representations},
      author={Emilien Dupont and Adam Goliński and Milad Alizadeh and Yee Whye Teh and Arnaud Doucet},
      year={2021},
      eprint={2103.03123},
      archivePrefix={arXiv},
      primaryClass={eess.IV},
      booktitle={Neural Compression Workshop at ICLR 2021}
    }
    We propose a new simple approach for image compression: instead of storing the RGB values for each pixel of an image, we store the weights of a neural network overfitted to the image. Specifically, to encode an image, we fit it with an MLP which maps pixel locations to RGB values. We then quantize and store the weights of this MLP as a code for the image. To decode the image, we simply evaluate the MLP at every pixel location. We found that this simple approach outperforms JPEG at low bit-rates, even without entropy coding or learning a distribution over weights. While our framework is not yet competitive with state-of-the-art compression methods, we show that it has various attractive properties which could make it a viable alternative to other neural data compression approaches.
  • Milad Alizadeh, Arash Behboodi, Mart van Baalen, Christos Louizos, Tijmen Blankevoort, and Max Welling. "Gradient ℓ₁ Regularization for Quantization Robustness." In International Conference on Learning Representations, 2020.
    @inproceedings{alizadeh2020gradient,
      title={Gradient ℓ₁ Regularization for Quantization Robustness},
      author={Milad Alizadeh and Arash Behboodi and Mart van Baalen and Christos Louizos and Tijmen Blankevoort and Max Welling},
      booktitle={International Conference on Learning Representations},
      year={2020}
    }
    We analyze the effect of quantizing weights and activations of neural networks on their loss and derive a simple regularization scheme that improves robustness against post-training quantization. By training quantization-ready networks, our approach enables storing a single set of weights that can be quantized on-demand to different bit-widths as energy and memory requirements of the application change. Unlike quantization-aware training using the straight-through estimator that only targets a specific bit-width and requires access to training data and pipeline, our regularization-based method paves the way for "on the fly" post-training quantization to various bit-widths. We show that by modeling quantization as an ℓ∞-bounded perturbation, the first-order term in the loss expansion can be regularized using the ℓ₁-norm of gradients. We experimentally validate our method on different architectures on CIFAR-10 and ImageNet datasets and show that the regularization of a neural network using our method improves robustness against quantization noise.
    ICLR 2020
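    The link between an ℓ∞-bounded perturbation and the ℓ₁-norm of the gradient mentioned in the abstract above follows from Hölder's inequality: if every weight moves by at most eps, the first-order change in the loss is at most eps times the ℓ₁-norm of the gradient. The following is a minimal numerical sketch of that bound only, not the paper's training code; the gradient values, the bound `eps`, and all function names are hypothetical.

```python
# Illustrative sketch only: all names and numbers are assumptions, not the paper's code.

def l1_norm(grad):
    """Sum of absolute values of the gradient entries."""
    return sum(abs(g) for g in grad)

def first_order_change(grad, delta):
    """First-order Taylor term of the loss change: the inner product <grad, delta>."""
    return sum(g * d for g, d in zip(grad, delta))

def worst_case_perturbation(grad, eps):
    """A perturbation with |delta_i| <= eps that maximizes <grad, delta>."""
    return [eps if g >= 0 else -eps for g in grad]

grad = [0.5, -2.0, 0.0, 1.5]  # hypothetical loss gradient w.r.t. four weights
eps = 0.25                    # hypothetical bound on the per-weight quantization error

# Hoelder bound on the first-order term: eps * ||grad||_1
bound = eps * l1_norm(grad)

# The sign-aligned perturbation attains the bound exactly.
worst = first_order_change(grad, worst_case_perturbation(grad, eps))

# Any other admissible perturbation stays within the bound.
other = first_order_change(grad, [0.25, 0.1, -0.2, -0.05])
```

    Because the bound scales with the ℓ₁-norm of the gradient, penalizing that norm during training shrinks the worst-case first-order loss change for any perturbation of the same size, which is consistent with the abstract's claim about robustness across bit-widths.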
  • Milad Alizadeh, Javier Fernández-Marqués, Nicholas D Lane, and Yarin Gal. "An Empirical study of Binary Neural Networks' Optimisation." In International Conference on Learning Representations, 2019.
    @inproceedings{alizadeh2018empirical,
      title={An Empirical study of Binary Neural Networks' Optimisation},
      author={Milad Alizadeh and Fernández-Marqués, Javier and Lane, Nicholas D and Gal, Yarin},
      booktitle={International Conference on Learning Representations},
      year={2019}
    }
    Binary neural networks using the Straight-Through-Estimator (STE) have been shown to achieve state-of-the-art results, but their training process is not well-founded. This is due to the discrepancy between the evaluated function in the forward path, and the weight updates in the back-propagation, updates which do not correspond to gradients of the forward path. Efficient convergence and accuracy of binary models often rely on careful fine-tuning and various ad-hoc techniques. In this work, we empirically identify and study the effectiveness of the various ad-hoc techniques commonly used in the literature, providing best-practices for efficient training of binary models. We show that adapting learning rates using second moment methods is crucial for the successful use of the STE, and that other optimisers can easily get stuck in local minima. We also find that many of the commonly employed tricks are only effective towards the end of the training, with these methods making early stages of the training considerably slower. Our analysis disambiguates necessary from unnecessary ad-hoc techniques for training of binary neural networks, paving the way for future development of solid theoretical foundations for these. Our newly-found insights further lead to new procedures which make training of existing binary neural networks notably faster.
    ICLR 2019
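    The mismatch between the forward path and the weight updates described in the abstract above can be made concrete with a small sketch of a straight-through estimator. This is a generic illustration under stated assumptions, not the paper's implementation: the clipping threshold is a commonly used variant, plain gradient descent is used only for brevity (the abstract reports that adaptive second-moment optimisers matter in practice), and all names and values are hypothetical.

```python
# Illustrative sketch of a straight-through estimator (STE) for one weight vector.
# All names and numbers are assumptions made for this example.

def binarize(weights):
    """Forward path: replace each real-valued latent weight by its sign (+1 or -1)."""
    return [1.0 if w >= 0 else -1.0 for w in weights]

def ste_backward(weights, grad_wrt_binary, clip=1.0):
    """Backward path: treat sign() as the identity and pass the gradient through,
    zeroing it where |w| exceeds `clip` (a commonly used variant)."""
    return [g if abs(w) <= clip else 0.0 for w, g in zip(weights, grad_wrt_binary)]

def gradient_step(weights, grads, lr):
    """Plain gradient descent update on the latent weights."""
    return [w - lr * g for w, g in zip(weights, grads)]

latent = [0.3, -0.7, 1.4]   # hypothetical real-valued latent weights
grad_b = [0.5, -0.2, 0.9]   # hypothetical gradient w.r.t. the binarized weights

binary = binarize(latent)                     # what the forward path actually uses
grad_latent = ste_backward(latent, grad_b)    # surrogate gradient for the latent weights
updated = gradient_step(latent, grad_latent, lr=0.5)
```

    The true derivative of sign() is zero almost everywhere, so `grad_latent` is a surrogate rather than a gradient of the forward function, which is the discrepancy the abstract refers to.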
  • Vincent WS Tseng, Sourav Bhattacharya, Javier Fernández-Marqués, Milad Alizadeh, Catherine Tong, and Nicholas D Lane. "Deterministic binary filters for convolutional neural networks." In International Joint Conference on Artificial Intelligence (IJCAI), 2018.
    @inproceedings{tseng2018deterministic,
      title={Deterministic binary filters for convolutional neural networks},
      author={Tseng, Vincent WS and Bhattacharya, Sourav and Fernández-Marqués, Javier and Alizadeh, Milad and Tong, Catherine and Lane, Nicholas D},
      booktitle={International Joint Conference on Artificial Intelligence (IJCAI)},
      year={2018},
      organization={International Joint Conferences on Artificial Intelligence Organization}
    }
    We propose Deterministic Binary Filters, an approach to Convolutional Neural Networks that learns weighting coefficients of a predefined orthogonal binary basis instead of directly learning the convolutional filters, as is conventional. This approach results in model architectures with significantly fewer parameters (4x to 16x) and smaller model sizes (32x due to the use of binary rather than floating point precision). We show our deterministic filter design can be integrated into well-known network architectures (such as ResNet and SqueezeNet) with as little as 2% loss of accuracy (on datasets like CIFAR-10). On ImageNet, they result in 3x model size reduction compared to sub-megabyte binary networks while reaching comparable accuracy levels.
    IJCAI 2018
  • Rudzidatul Akmam Dziyauddin, Dritan Kaleshi, Angela Doufexi, and Milad Alizadeh. "Performance evaluation of quality of service for joint packet dropping and scheduling." Wireless Personal Communications 83, no. 2 (2015): 1549–1566.
    @article{dziyauddin2015performance,
      title={Performance evaluation of quality of service for joint packet dropping and scheduling},
      author={Dziyauddin, Rudzidatul Akmam and Kaleshi, Dritan and Doufexi, Angela and Alizadeh, Milad},
      journal={Wireless Personal Communications},
      volume={83},
      number={2},
      pages={1549--1566},
      year={2015},
      publisher={Springer}
    }
    Quality of Service is particularly necessary to serve delay-sensitive applications in heavily loaded wireless networks. In this paper we evaluate a strategy of combining packet dropping and scheduling policies at the Medium Access Control layer to guarantee maximum packet latency for real-time applications. The purpose of this work is to evaluate how well the combined schemes meet the required latency, and what system throughput is achievable. For the case study, a real-time Polling Service class in the Worldwide Interoperability for Microwave Access System for downlink transmission is assumed. The main analysis is undertaken for User Datagram Protocol (UDP) traffic in stationary and mobile user scenarios under heavy load conditions, and the impact of mixed Transmission Control Protocol and UDP traffic is also investigated. Results show that the introduction of a packet dropping policy ensures that the latency is kept well within the required maximum, regardless of the type of scheduler used. However, the packet drop percentage (or packet loss) depends strongly on the type of scheduler. All schedulers show similar goodput performance for low load conditions, and the results can only be distinguished under heavy load or overload conditions.
  • Milad Alizadeh, Rudzidatul Akmam Dziyauddin, Dritan Kaleshi, and Angela Doufexi. "A comparative study of mixed traffic scenarios for different scheduling algorithms in WiMAX." In 2012 IEEE 75th Vehicular Technology Conference (VTC Spring), 1–6. IEEE, 2012.
    @inproceedings{alizadeh2012comparative,
      title={A comparative study of mixed traffic scenarios for different scheduling algorithms in WiMAX},
      author={Alizadeh, Milad and Dziyauddin, Rudzidatul Akmam and Kaleshi, Dritan and Doufexi, Angela},
      booktitle={2012 IEEE 75th Vehicular Technology Conference (VTC Spring)},
      pages={1--6},
      year={2012},
      organization={IEEE}
    }
    WiMAX promises an advanced framework to support the Quality-of-Service (QoS) requirements of different types of applications, and scheduling is a key part of its QoS provisioning. The scheduling algorithms used in this paper are based on our proposed Greedy-Latency scheduler, a modified form of the Greedy algorithm which can guarantee the delay requirements of real-time applications while optimising the system throughput. Our study of TCP performance in WiMAX shows that, unlike UDP throughput, TCP throughput fluctuates even at low traffic loads. Employing Automatic Repeat reQuest (ARQ) and setting the right TCP window size are seen to be crucial for stable, optimal TCP performance. The WiMAX QoS mechanism can successfully maintain the inter-class priority between TCP traffic in the Best Effort (BE) class and UDP traffic in the higher-priority Real-Time Polling Service (rtPS) class. For intra-class scenarios, it is observed that TCP flows in general need a protection mechanism, as the UDP traffic tends to seize the channel. The proposed Greedy-Latency scheduler can provide better intra-class protection for TCP flows due to its packet dropping policy.