UROP Proceedings 2022-23

School of Engineering
Department of Computer Science and Engineering

Large-Scale Spatiotemporal Data Analytics and Learning
Supervisor: ZHOU, Xiaofang / CSE
Student: CHEN, Sihan / SENG
Course: UROP1100, Summer

This report discusses research work on data preparation for a Chinese Large Language Model (LLM). An LLM is a type of artificial intelligence (AI) algorithm that uses deep learning techniques and very large data sets to understand, summarize, generate, and predict content. As a member of the data preparation and evaluation team, my responsibilities included preparing high-quality document data for pre-training and preparing a Chinese evaluation protocol. Specifically, the completed tasks were constructing a sensitive-word list, downloading book data, deduplicating with heuristic rules, automatically extracting text from raw data, and improving the Chinese translation of commonsense data. Tools including Baidu Netdisk, VS Code, and Terminal were used to extract data from various sources and, on the data preparation side, to improve the LLM's performance in Chinese.

Large-Scale Spatiotemporal Data Analytics and Learning
Supervisor: ZHOU, Xiaofang / CSE
Student: GAO, Zhimeng / COMP
Course: UROP2100, Fall

This project concerns the implementation of an open-street map of a city. The map is designed to show information including the shortest path between two points, the estimated travel time along that path, current and estimated congestion, and the simulated paths of one or more cars with start/pause/stop controls. My work focuses mainly on displaying multiple paths and simulating movement along them simultaneously, and on reducing the latency caused by a large number of paths.
Our final implementation shows about 1000 independent simulated paths at the same time, each moving at a speed proportional to the real speed on each road segment and colored according to that speed, with very low latency even on an ordinary laptop.

Large-Scale Spatiotemporal Data Analytics and Learning
Supervisor: ZHOU, Xiaofang / CSE
Student: HOU, Jingcheng / COMP
Course: UROP1100, Spring

With the rise of devices and applications that generate large-scale spatio-temporal (S-T) data, the efficient processing of such data has drawn much attention. In 2017, Alarabi and Mokbel proposed ST-Hadoop, a MapReduce framework for S-T data. This report presents and analyses their demonstration of ST-Hadoop. As one of the first mature open-source MapReduce frameworks for S-T data, it adds S-T awareness to Hadoop by indexing S-T data in HDFS, which allows it to outperform Hadoop and SpatialHadoop on S-T data processing. The report first introduces the principle of ST-Hadoop and then analyses the three layers inside the framework: the language layer, the indexing layer, and the operation layer. Finally, it presents a performance comparison of the three frameworks.
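
The benefit of indexing S-T data, namely that a query only has to read the partitions that can contain matching records, can be illustrated with a minimal sketch. This is not ST-Hadoop's code or API. It assumes a simple two-level scheme (one time slice per day, then a uniform spatial grid inside each slice), and the names partition_key, build_index, range_query, CELL_DEG, and the sample records are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime

CELL_DEG = 0.1  # assumed grid cell size in degrees (illustrative only)


def partition_key(ts, lon, lat):
    """Map a record to (daily time slice, grid column, grid row)."""
    return (ts.date(), int(lon // CELL_DEG), int(lat // CELL_DEG))


def build_index(records):
    """Group (timestamp, lon, lat, payload) records by partition key."""
    index = defaultdict(list)
    for ts, lon, lat, payload in records:
        index[partition_key(ts, lon, lat)].append((ts, lon, lat, payload))
    return index


def range_query(index, t_min, t_max, lon_min, lon_max, lat_min, lat_max):
    """Return payloads inside the S-T box, reading only overlapping partitions."""
    hits = []
    for (day, col, row), bucket in index.items():
        # Skip partitions whose time slice or grid cell cannot overlap the box.
        if not (t_min.date() <= day <= t_max.date()):
            continue
        if not (lon_min // CELL_DEG <= col <= lon_max // CELL_DEG):
            continue
        if not (lat_min // CELL_DEG <= row <= lat_max // CELL_DEG):
            continue
        # Refine: check each record in the surviving partition exactly.
        for ts, lon, lat, payload in bucket:
            if (t_min <= ts <= t_max
                    and lon_min <= lon <= lon_max
                    and lat_min <= lat <= lat_max):
                hits.append(payload)
    return hits


# Hypothetical sample data: three points on two different days.
records = [
    (datetime(2023, 1, 1, 8, 0), 114.17, 22.30, "a"),
    (datetime(2023, 1, 1, 9, 0), 114.26, 22.34, "b"),
    (datetime(2023, 1, 2, 8, 0), 114.17, 22.30, "c"),
]
index = build_index(records)
result = range_query(
    index,
    datetime(2023, 1, 1, 0, 0), datetime(2023, 1, 1, 23, 59),
    114.0, 114.2, 22.0, 22.5,
)
print(result)
```

In this example only record "a" falls inside the query box: "b" lies east of the longitude range and "c" belongs to the next day's slice, so its partition is skipped without reading its contents. A plain scan would instead test every record, which is the cost that an S-T-aware index is meant to avoid.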