
Data Pipeline Architecture with Near Real-Time Streaming Multiple Source Indonesian Online News Data Lake
Author(s) - Angelina Pramana Thenata
Publication year - 2020
Publication title - Jurnal Informatika dan Sains (JISA)
Language(s) - English
Resource type - Journals
eISSN - 2776-3234
pISSN - 2614-8404
DOI - 10.31326/jisa.v3i1.657
Subject(s) - computer science , variety (cybernetics) , architecture , pipeline (software) , big data , volume (thermodynamics) , data science , world wide web , data mining , geography , physics , archaeology , quantum mechanics , artificial intelligence , programming language
The rapid growth of information has increased the demand for online news. Online news attracts readers by presenting news from many fields conveniently and quickly. However, the large amount of online news (volume) that spreads in a short time (velocity), together with the public's need to consume news from many different sources (variety), can affect people's lives. The government, as the regulator, and news agencies therefore need to monitor the online news in circulation. To address this problem, the researcher proposes a data lake architecture that is suited to online news and can run in real time. Data lakes can address the main problems of Big Data (volume, velocity, variety). To develop the proposed architecture, the researcher conducted a literature study and analyzed how data should flow through a data lake for online news. The researcher will then use this architecture to combine and standardize the data structures of several online news channels and stream the result in real time to fill the data lake. The output of the architecture will be stored in MongoDB, which serves as the database for all data in both the short and the long term. Finally, this data lake will provide a means to accommodate, explore, and analyze the online news data in circulation.
Keywords – Data Lake, Online News, Real-Time.
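
The step of combining and standardizing news from several channels before streaming it into the data lake could be sketched roughly as follows. This is only an illustration under stated assumptions, not the architecture from the paper: the two channel formats, the field names, the `NewsRecord` schema, and the `InMemorySink` class are all hypothetical, and the sink merely stands in for a MongoDB collection so that the sketch runs without a database.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


@dataclass
class NewsRecord:
    """Hypothetical unified schema shared by all news channels."""
    source: str
    title: str
    body: str
    url: str
    published_at: str  # ISO 8601 timestamp in UTC


def from_channel_a(raw: dict) -> NewsRecord:
    # Assumption: channel A reports the publication time as epoch seconds.
    published = datetime.fromtimestamp(raw["ts"], tz=timezone.utc)
    return NewsRecord("channel_a", raw["headline"].strip(), raw["content"],
                      raw["link"], published.isoformat())


def from_channel_b(raw: dict) -> NewsRecord:
    # Assumption: channel B uses "YYYY-MM-DD HH:MM", treated here as UTC.
    published = datetime.strptime(raw["date"], "%Y-%m-%d %H:%M")
    published = published.replace(tzinfo=timezone.utc)
    return NewsRecord("channel_b", raw["title"].strip(), raw["text"],
                      raw["url"], published.isoformat())


# One normalizer per channel; supporting a new channel means adding an entry.
NORMALIZERS: Dict[str, Callable[[dict], NewsRecord]] = {
    "channel_a": from_channel_a,
    "channel_b": from_channel_b,
}


def micro_batches(events: Iterable[Tuple[str, dict]],
                  batch_size: int) -> Iterator[List[dict]]:
    """Normalize (source, raw) events and yield them in small batches."""
    batch: List[dict] = []
    for source, raw in events:
        batch.append(asdict(NORMALIZERS[source](raw)))
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, partially filled batch
        yield batch


class InMemorySink:
    """Stand-in for a MongoDB collection; keeps documents in a list."""

    def __init__(self) -> None:
        self.documents: List[dict] = []

    def insert_many(self, docs: List[dict]) -> None:
        self.documents.extend(docs)


def run_demo() -> InMemorySink:
    events = [
        ("channel_a", {"headline": " Flood update ", "content": "...",
                       "link": "https://example.com/a/1", "ts": 0}),
        ("channel_b", {"title": "Market news", "text": "...",
                       "url": "https://example.com/b/1",
                       "date": "2020-06-01 08:30"}),
        ("channel_a", {"headline": "Election recap", "content": "...",
                       "link": "https://example.com/a/2", "ts": 60}),
    ]
    sink = InMemorySink()
    for batch in micro_batches(events, batch_size=2):
        sink.insert_many(batch)
    return sink
```

With a real deployment, the sink could be replaced by a pymongo collection, whose `insert_many` method likewise accepts a list of documents. Batching in small groups is one common way to approximate near real-time loading; the batch size used here is arbitrary.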