Caselle M., Perez L.E.A., Balzer M., Dritschler T., Kopmann A., Mohr H., Rota L., Vogelgesang M., Weber M.

in Journal of Instrumentation, 12 (2017), C03015. DOI:10.1088/1748-0221/12/03/C03015

Abstract

© 2017 IOP Publishing Ltd and Sissa Medialab srl. Modern data acquisition and trigger systems require a throughput of several GB/s and latencies of the order of microseconds. To satisfy such requirements, a heterogeneous readout system based on FPGA readout cards and GPU-based computing nodes coupled by InfiniBand has been developed. The incoming data from the back-end electronics is delivered directly into the internal memory of GPUs through a dedicated peer-to-peer PCIe communication. High performance DMA engines have been developed for direct communication between FPGAs and GPUs using “DirectGMA (AMD)” and “GPUDirect (NVIDIA)” technologies. The proposed infrastructure is a candidate for future generations of event building clusters, high-level trigger filter farms and low-level trigger system. In this paper the heterogeneous FPGA-GPU architecture will be presented and its performance be discussed.

Caselle M., Perez L.E.A., Balzer M., Kopmann A., Rota L., Weber M., Brosi M., Steinmann J., Brundermann E., Muller A.-S.

in Journal of Instrumentation, 12 (2017), C01040. DOI:10.1088/1748-0221/12/01/C01040

Abstract

© 2017 IOP Publishing Ltd and Sissa Medialab srl. This paper presents a novel data acquisition system for continuous sampling of ultra-short pulses generated by terahertz (THz) detectors. Karlsruhe Pulse Taking Ultra-fast Readout Electronics (KAPTURE) is able to digitize pulse shapes with a sampling time down to 3 ps and pulse repetition rates up to 500 MHz. KAPTURE has been integrated as a permanent diagnostic device at ANKA and is used for investigating the emitted coherent synchrotron radiation in the THz range. A second version of KAPTURE has been developed to improve the performance and flexibility. The new version offers a better sampling accuracy for a pulse repetition rate up to 2 GHz. The higher data rate produced by the sampling system is processed in real-time by a heterogeneous FPGA and GPU architecture operating up to 6.5 GB/s continuously. Results in accelerator physics will be reported and the new design of KAPTURE be discussed.

Bergmann T., Balzer M., Bormann D., Chilingaryan S.A., Eitel K., Kleifges M., Kopmann A., Kozlov V., Menshikov A., Siebenborn B., Tcherniakhovski D., Vogelgesang M., Weber M.

in 2015 IEEE Nuclear Science Symposium and Medical Imaging Conference, NSS/MIC 2015 (2016), 7581841. DOI:10.1109/NSSMIC.2015.7581841

Abstract

© 2015 IEEE. The EDELWEISS experiment, located in the underground laboratory LSM (France), is one of the leading experiments using cryogenic germanium (Ge) detectors for a direct search for dark matter. For the EDELWEISS-III phase, a new scalable data acquisition (DAQ) system was designed and built, based on the ‘IPE4 DAQ system’, which has already been used for several experiments in astroparticle physics.

Rota L., Vogelgesang M., Perez L.E.A., Caselle M., Chilingaryan S., Dritschler T., Zilio N., Kopmann A., Balzer M., Weber M.

in Journal of Instrumentation, 11 (2016), P02007. DOI:10.1088/1748-0221/11/02/P02007

Abstract

© 2016 IOP Publishing Ltd and Sissa Medialab srl.Modern physics experiments produce multi-GB/s data rates. Fast data links and high performance computing stages are required for continuous data acquisition and processing. Because of their intrinsic parallelism and computational power, GPUs emerged as an ideal solution to process this data in high performance computing applications. In this paper we present a high-throughput platform based on direct FPGA-GPU communication. The architecture consists of a Direct Memory Access (DMA) engine compatible with the Xilinx PCI-Express core, a Linux driver for register access, and high- level software to manage direct memory transfers using AMD’s DirectGMA technology. Measurements with a Gen3 x8 link show a throughput of 6.4 GB/s for transfers to GPU memory and 6.6 GB/s to system memory. We also assess the possibility of using the architecture in low latency systems: preliminary measurements show a round-trip latency as low as 1 μs for data transfers to system memory, while the additional latency introduced by OpenCL scheduling is the current limitation for GPU based systems. Our implementation is suitable for real-time DAQ system applications ranging from photon science and medical imaging to High Energy Physics (HEP) systems.

Vogelgesang M., Rota L., Perez L.E.A., Caselle M., Chilingaryan S., Kopmann A.

in Proceedings of SPIE – The International Society for Optical Engineering, 9967 (2016), 996715. DOI:10.1117/12.2237611

Abstract

© Copyright 2016 SPIE. With ever-increasing data rates due to stronger light sources and better detectors, X-ray imaging experiments conducted at synchrotron beamlines face bandwidth and processing limitations that inhibit efficient workflows and prevent real-time operations. We propose an experiment platform comprised of programmable hardware and optimized software to lift these limitations and make beamline setups future-proof. The hardware consists of an FPGA-based data acquisition system with custom logic for data pre-processing and a PCIe data connection for transmission of currently up to 6.6 GB/s. Moreover, the accompanying firmware supports pushing data directly into GPU memory using AMD’s DirectGMA technology without crossing system memory first. The GPUs are used to pre-process projection data and reconstruct final volumetric data with OpenCL faster than possible with CPUs alone. Besides, more efficient use of resources this enables a real-time preview of a reconstruction for early quality assessment of both experiment setup and the investigated sample. The entire system is designed in a modular way and allows swapping all components, e.g. replacing our custom FPGA camera with a commercial system but keep reconstructing data with GPUs. Moreover, every component is accessible using a low-level C library or using a high-level Python interface in order to integrate these components in any legacy environment.

Vogelgesang M., Farago T., Morgeneyer T.F., Helfen L., Dos Santos Rolo T., Myagotin A., Baumbach T.

in Journal of Synchrotron Radiation, 23 (2016) 1254-1263. DOI:10.1107/S1600577516010195

Abstract

© 2016 International Union of Crystallography.Real-time processing of X-ray image data acquired at synchrotron radiation facilities allows for smart high-speed experiments. This includes workflows covering parameterized and image-based feedback-driven control up to the final storage of raw and processed data. Nevertheless, there is presently no system that supports an efficient construction of such experiment workflows in a scalable way. Thus, here an architecture based on a high-level control system that manages low-level data acquisition, data processing and device changes is described. This system is suitable for routine as well as prototypical experiments, and provides specialized building blocks to conduct four-dimensional in situ, in vivo and operando tomography and laminography.

Rota L., Caselle M., Chilingaryan S., Kopmann A., Weber M.

in IEEE Transactions on Nuclear Science, 62 (2015) 972-976, 7111377. DOI:10.1109/TNS.2015.2426877

Abstract

© 2014 IEEE. We developed a direct memory access (DMA) engine compatible with the Xilinx PCI Express (PCIe) core to provide a high-performance and low-occupancy alternative to commercial solutions. In order to maximize the PCIe throughput while minimizing the FPGA resources utilization, the DMA engine adopts a novel strategy where the DMA address list is stored inside the FPGA and not in the central memory of the host CPU. The FPGA design package is complemented with simple register access to control the DMA engine by a Linux driver. The design is compatible with Xilinx FPGA Families 6 and 7, and operates with the Xilinx PCIe endpoint Generation 1 and 2 with all lane configurations (x1, x2, x4, x8). A multi-engine architecture is also presented, where two x8 lanes cores are used in parallel together with a PCIe bridge, to exploit fully the capabilities of a PCIe Gen2 x16 lanes link. A data throughput of 3461 MBytes/s has been achieved with a single PCIe Gen2 x8 lanes endpoint. If the dual-engine architecture is used, the throughput is increased up to 6920 MBytes/s. The presented DMA is currently used in several experiments at the ANKA synchrotron light source.

Stevanovic U., Caselle M., Cecilia A., Chilingaryan S., Farago T., Gasilov S., Herth A., Kopmann A., Vogelgesang M., Balzer M., Baumbach T., Weber M.

in IEEE Transactions on Nuclear Science, 62 (2015) 911-918, 7111386. DOI:10.1109/TNS.2015.2425911

Abstract

© 1963-2012 IEEE. High-speed X-ray imaging applications play a crucial role for non-destructive investigations of the dynamics in material science and biology. On-line data analysis is necessary for quality assurance and data-driven feedback, leading to a more efficent use of a beam time and increased data quality. In this article we present a smart camera platform with embedded Field Programmable Gate Array (FPGA) processing that is able to stream and process data continuously in real-time. The setup consists of a Complementary Metal-Oxide-Semiconductor (CMOS) sensor, an FPGA readout card, and a readout computer. It is seamlessly integrated in a new custom experiment control system called Concert that provides a more efficient way of operating a beamline by integrating device control, experiment process control, and data analysis. The potential of the embedded processing is demonstrated by implementing an image-based trigger. It records the temporal evolution of physical events with increased speed while maintaining the full field of view. The complete data acquisition system, with Concert and the smart camera platform was successfully integrated and used for fast X-ray imaging experiments at KIT’s synchrotron radiation facility ANKA.

Rota L., Caselle M., Chilingaryan S., Kopmann A., Weber M.

in IEEE Transactions on Nuclear Science (2015). DOI:10.1109/TNS.2015.2426877

Abstract

We developed a direct memory access (DMA) engine compatible with the Xilinx PCI Express (PCIe) core to provide a high-performance and low-occupancy alternative to commercial solutions. In order to maximize the PCIe throughput while minimizing the FPGA resources utilization, the DMA engine adopts a novel strategy where the DMA address list is stored inside the FPGA and not in the central memory of the host CPU. The FPGA design package is complemented with simple register access to control the DMA engine by a Linux driver. The design is compatible with Xilinx FPGA Families 6 and 7, and operates with the Xilinx PCIe endpoint Generation 1 and 2 with all lane configurations (x1, x2, x4, x8). A multi-engine architecture is also presented, where two x8 lanes cores are used in parallel together with a PCIe bridge, to exploit fully the capabilities of a PCIe Gen2 x16 lanes link. A data throughput of 3461 MBytes/s has been achieved with a single PCIe Gen2 x8 lanes endpoint. If the dual-engine architecture is used, the throughput is increased up to 6920 MBytes/s. The presented DMA is currently used in several experiments at the ANKA synchrotron light source.