Shanahan, M., McDonell, K. & Reynolds, L. Role play with large language models. Nature 623, 493–498 (2023).
Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023).
Shastri, B. J. et al. Photonics for artificial intelligence and neuromorphic computing. Nat. Photonics 15, 102–114 (2021).
Bernstein, L. et al. Single-shot optical neural network. Sci. Adv. 9, eadg7904 (2023).
Zheng, H. et al. Multichannel meta-imagers for accelerating machine vision. Nat. Nanotechnol. 19, 471–478 (2024).
Zheng, H. et al. Meta-optic accelerators for object classifiers. Sci. Adv. 8, eabo6410 (2022).
Luo, M. et al. Meta-optics based parallel convolutional processing for neural network accelerator. Laser Photonics Rev. 18, 2300984 (2024).
Liu, C. et al. A programmable diffractive deep neural network based on a digital-coding metasurface array. Nat. Electron. 5, 113–122 (2022).
Shen, Y. et al. Deep learning with coherent nanophotonic circuits. Nat. Photon. 11, 441–446 (2017).
Ashtiani, F., Geers, A. J. & Aflatouni, F. An on-chip photonic deep neural network for image classification. Nature 606, 501–506 (2022).
Feldmann, J. et al. Parallel convolutional processing using an integrated photonic tensor core. Nature 589, 52–58 (2021).
Lin, X. et al. All-optical machine learning using diffractive deep neural networks. Science 361, 1004–1008 (2018).
Zhou, T. et al. Large-scale neuromorphic optoelectronic computing with a reconfigurable diffractive processing unit. Nat. Photonics 15, 367–373 (2021).
Antonik, P., Marsal, N., Brunner, D. & Rontani, D. Human action recognition with a large-scale brain-inspired photonic computer. Nat. Mach. Intell. 1, 530–537 (2019).
Wang, T. et al. Image sensing with multilayer nonlinear optical neural networks. Nat. Photon. 17, 408–415 (2023).
Xia, F. et al. Nonlinear optical encoding enabled by recurrent linear scattering. Nat. Photon. 18, 1067–1075 (2024).
Luo, X. et al. Metasurface-enabled on-chip multiplexed diffractive neural networks in the visible. Light Sci. Appl. 11, 158 (2022).
Huang, C. et al. A silicon photonic–electronic neural network for fibre nonlinearity compensation. Nat. Electron. 4, 837–844 (2021).
Fu, T. et al. Photonic machine learning with on-chip diffractive optics. Nat. Commun. 14, 70 (2023).
Dong, B. et al. Partial coherence enhances parallelized photonic computing. Nature 632, 55–62 (2024).
Xu, Z. et al. Large-scale photonic chiplet Taichi empowers 160-TOPS/W artificial general intelligence. Science 384, 202–209 (2024).
McMahon, P. L. The physics of optical computing. Nat. Rev. Phys. 5, 717–734 (2023).
Yildirim, M., Dinc, N. U., Oguz, I., Psaltis, D. & Moser, C. Nonlinear processing with linear optics. Nat. Photon. 18, 1076–1082 (2024).
Goi, E. et al. Nanoprinted high-neuron-density optical linear perceptrons performing near-infrared inference on a CMOS chip. Light Sci. Appl. 10, 40 (2021).
Chen, Y. et al. All-analog photoelectronic chip for high-speed vision tasks. Nature 623, 48–57 (2023).
Wetzstein, G. et al. Inference in artificial intelligence with deep optics and photonics. Nature 588, 39–47 (2020).
Feng, H. et al. Integrated lithium niobate microwave photonic processing engine. Nature 627, 80–87 (2024).
Xu, X. et al. 11 TOPS photonic convolutional accelerator for optical neural networks. Nature 589, 44–51 (2021).
Liu, Z. et al. Swin Transformer: hierarchical vision transformer using shifted windows. In Proc. 2021 IEEE/CVF International Conference on Computer Vision (ICCV) 10012–10022 (IEEE, 2021).
Cui, K. et al. Spectral convolutional neural network chip for in-sensor edge computing of incoherent natural light. Nat. Commun. 16, 81 (2025).
Wei, K. et al. Spatially varying nanophotonic neural networks. Sci. Adv. 10, eadp0391 (2024).
Qu, G. et al. All-dielectric metasurface empowered optical-electronic hybrid neural networks. Laser Photonics Rev. 16, 2100732 (2022).
Rahimi, A. & Recht, B. Random features for large-scale kernel machines. In Proc. 21st International Conference on Neural Information Processing Systems (NIPS’07) 1177–1184 (Curran Associates, 2007).
Choromanski, K. M. et al. Rethinking attention with performers. In Proc. International Conference on Learning Representations (ICLR 2021) (ICLR, 2021).
Zhang, Y. et al. Image super-resolution using very deep residual channel attention networks. In Proc. European Conference on Computer Vision (ECCV) 286–301 (CVF, 2018).
Wang, Q. et al. ECA-net: efficient channel attention for deep convolutional neural networks. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 11534–11542 (CVF, 2020).
Vaswani, A. et al. Attention is all you need. In Proc. 31st International Conference on Neural Information Processing Systems (NIPS’17) 6000–6010 (Curran Associates, 2017).
Dosovitskiy, A. et al. An image is worth 16×16 words: transformers for image recognition at scale. In Proc. International Conference on Learning Representations (ICLR 2021) (ICLR, 2021).
Cordts, M. et al. The Cityscapes dataset for semantic urban scene understanding. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 3213–3223 (CVF, 2016).
Perazzi, F. et al. A benchmark dataset and evaluation methodology for video object segmentation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 724–732 (CVF, 2016).
Jocher, G. Ultralytics YOLOv5. https://github.com/ultralytics/yolov5 (2020).
Zhu, X. et al. Deformable DETR: deformable transformers for end-to-end object detection. In Proc. International Conference on Learning Representations (ICLR 2021) (ICLR, 2021).
Cheng, B., Misra, I., Schwing, A. G., Kirillov, A. & Girdhar, R. Masked-attention Mask Transformer for universal image segmentation. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 1290–1299 (CVF, 2022).
Pan, H., Hong, Y., Sun, W. & Jia, Y. Deep dual-resolution networks for real-time and accurate semantic segmentation of traffic scenes. IEEE Trans. Intell. Transp. Syst. 24, 3448–3460 (2022).
Xie, E. et al. SegFormer: simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 34, 12077–12090 (2021).
Ranftl, R., Bochkovskiy, A. & Koltun, V. Vision transformers for dense prediction. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV) 12179–12188 (CVF, 2021).
Bhat, S. F., Alhashim, I. & Wonka, P. AdaBins: depth estimation using adaptive bins. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 4009–4018 (CVF, 2021).
Yang, L. et al. Depth anything: unleashing the power of large-scale unlabeled data. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 10371–10381 (CVF, 2024).
Ranftl, R., Lasinger, K., Hafner, D., Schindler, K. & Koltun, V. Towards robust monocular depth estimation: mixing datasets for zero-shot cross-dataset transfer. IEEE Trans. Pattern Anal. Mach. Intell. 44, 1623–1637 (2020).
Zitova, B. & Flusser, J. Image registration methods: a survey. Image Vis. Comput. 21, 977–1000 (2003).
Bergevin, R., Soucy, M., Gagnon, H. & Laurendeau, D. Towards a general multi-view registration technique. IEEE Trans. Pattern Anal. Mach. Intell. 18, 540–547 (1996).
Ravi, N. et al. Sam 2: Segment anything in images and videos. In Proc. International Conference on Learning Representations (ICLR 2025) (ICLR, 2025).
LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324 (1998).
Xiao, H., Rasul, K. & Vollgraf, R. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. Preprint at https://arxiv.org/abs/1708.07747 (2017).
Schüldt, C., Laptev, I. & Caputo, B. Recognizing human actions: a local SVM approach. In Proc. 17th International Conference on Pattern Recognition (ICPR 2004) Vol. 3, 32–36 (IEEE, 2004).
Zheng, Z., Wei, Y. & Yang, Y. University-1652: a multi-view multi-source benchmark for drone-based geo-localization. In Proc. 28th ACM International Conference on Multimedia 1395–1403 (ACM, 2020).
Berman, M., Triki, A. R. & Blaschko, M. B. The Lovász-softmax loss: a tractable surrogate for the optimization of the intersection-over-union measure in neural networks. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 4413–4421 (CVF, 2018).
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A. & Chen, L.-C. MobileNetV2: inverted residuals and linear bottlenecks. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 4510–4520 (CVF, 2018).
Han, K. et al. GhostNet: more features from cheap operations. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 1580–1589 (CVF, 2020).
Han, K. et al. Model Rubik’s cube: twisting resolution, depth and width for tinynets. Adv. Neural Inf. Process. Syst. 33, 19353–19364 (2020).
Tan, M. & Le, Q. EfficientNet: rethinking model scaling for convolutional neural networks. In Proc. 36th International Conference on Machine Learning 6105–6114 (PMLR, 2019).
Ren, S., He, K., Girshick, R. & Sun, J. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39, 1137–1149 (2016).
He, K., Gkioxari, G., Dollár, P. & Girshick, R. B. Mask R-CNN. In Proc. IEEE International Conference on Computer Vision (ICCV) 2961–2969 (CVF, 2017).
Lin, T.-Y., Goyal, P., Girshick, R. B., He, K. & Dollár, P. Focal loss for dense object detection. In Proc. IEEE International Conference on Computer Vision (ICCV) 2980–2988 (CVF, 2017).
Tan, M., Pang, R. & Le, Q. V. EfficientDet: scalable and efficient object detection. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 10781–10790 (2020).
Liu, S. et al. Grounding DINO: marrying DINO with grounded pre-training for open-set object detection. In Proc. European Conference on Computer Vision (ECCV 2024) 38–55 (Springer, 2025).
Ronneberger, O., Fischer, P. & Brox, T. U-net: convolutional networks for biomedical image segmentation. In Proc. Medical Image Computing and Computer-Assisted Intervention (MICCAI 2015) 234–241 (Springer, 2015).
Zhao, H., Shi, J., Qi, X., Wang, X. & Jia, J. Pyramid scene parsing network. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2881–2890 (CVF, 2017).
Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F. & Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proc. European Conference on Computer Vision (ECCV) 801–818 (CVF, 2018).
Eigen, D., Puhrsch, C. & Fergus, R. Depth map prediction from a single image using a multi-scale deep network. In Proc. 28th International Conference on Neural Information Processing Systems (NIPS’14) 2366–2374 (MIT Press, 2014).
Wofk, D., Ma, F., Yang, T.-J., Karaman, S. & Sze, V. FastDepth: fast monocular depth estimation on embedded systems. In Proc. 2019 International Conference on Robotics and Automation (ICRA) 6101–6108 (IEEE, 2019).
Hazirbas, C., Ma, L., Domokos, C. & Cremers, D. FuseNet: incorporating depth into semantic segmentation via fusion-based CNN architecture. In Proc. Asian Conference on Computer Vision (ACCV 2016) 213–228 (Springer, 2017).
Peng, J. Code for optical metasurfaces for general vision processing on the edge. Zenodo https://doi.org/10.5281/zenodo.19382032 (2026).

