With the advancement of modern technologies, data are produced at every moment and at great speed through the internet, business organizations, web applications, networking, artificial intelligence, scientific research, and social media. Such data, characterized by vast volume and a wide range of properties and categories, are termed Bigdata. The significant challenges are managing, organizing, processing, presenting, storing, and analyzing Bigdata to support decisions about future actions (Rahman and Hasibul, 2019).
Extracting data from multiple sources is challenging with traditional analysis approaches because of the cost, infrastructure, high-performance processing units, sophisticated analysis tools, storage, and robust algorithms required. In this case, the cloud platform is a remarkable solution for managing and handling Bigdata quickly and efficiently (Khedekar and Tian, 2020).
The research's main goal is to analyze a large volume of structured or unstructured data, defined as Bigdata, on the Google Cloud Platform (GCP) using BigQuery with SQL commands. BigQuery is a Google product: a fully managed, petabyte-scale, cost-effective data warehouse for analyzing data with short execution times.
2. Related Works
Researchers have conducted multiple studies on Bigdata analysis in the field of cloud computing. Trajanov et al. (2016) conducted a study using BigQuery on a cloud computing platform for research and educational purposes, and recommend only how the platform can be utilized in data analysis. Tomar and Tomar (2018) present an overview of Bigdata and cloud computing integration from two sources, i.e., RedBus and Twitter. That paper only discusses the data analysis framework and some methods but does not present the analysis process in detail.
Kotecha and Joshiyara (2018) present a method of managing and handling non-relational data on BigQuery and calculating the processing time. That paper only covers the analysis time relative to dataset size using the Google SDK, rather than extracting the necessary values from the datasets. Harsha (2017) discusses some important roles and analysis tools of Bigdata regarding cloud computing technology. However, these papers do not clearly present any methods for analyzing Bigdata. Bathla (2018) presents a theoretical discussion of Bigdata management tools in the cloud platform rather than analyzing unstructured data within a short time.
Riahi and Riahi (2018) discuss the fundamentals of Bigdata, its challenges, and its applications in data analytics. The paper covers the use of Hadoop, an open-source framework for managing and analyzing data from different sources. Our paper highlights cloud-based Bigdata analysis techniques using Google's BigQuery service, without infrastructure development or database administrators. BigQuery offers SQL commands as an easy way of extracting useful information from Bigdata in real time.
3. Bigdata
Nowadays, a large volume of data is produced from several offline and online sources every second. These data are referred to as Bigdata. It is difficult to store, process, and analyze Bigdata through traditional database technologies. Bigdata is indistinct and requires substantial classification and conversion processes to turn knowledge into new insights. Gartner defined Bigdata as "high volume, high velocity, and a wide variety of information assets that require new forms of processing to enable enhanced decision making, insight discovery, and process optimization" (Harjinder, 2019).
Several researchers define Bigdata from differing points of view. Bigdata may be characterized by the 5Vs, i.e., Volume, Variety, Velocity, Veracity, and Value.
1) Volume: This refers to the enormous data sets generated at high-frequency rates.
2) Variety: This deals with the various categories of data, such as structured, semi-structured, and unstructured.
3) Velocity: This means the speed at which an application might create data and at which the data are processed.
4) Veracity: This means the accuracy and truthfulness of the data, as well as its authenticity.
5) Value: This is the remarkable attribute of Bigdata and means the ways of finding hidden values in the datasets instantly (Juneja and Das, 2016).
Some vital statistics of Bigdata:
1) Every day, more than 1 billion Google queries are made and 294 billion emails are sent.
2) Every minute, 65,972 Instagram photos are posted, 448,800 tweets are composed, and 500 hours' worth of YouTube videos are uploaded.
3) By 2020, the number of smartphone users could reach 6.1 billion. Moreover, taking the Internet of Things into account, there could be 26 billion connected devices by then (THORNTECH, 2020).
3.1 Categories of Bigdata
We can classify Bigdata according to these five aspects: (a) data sources, (b) content format, (c) data stores, (d) data staging, and (e) data processing (Hashem et al., 2015).
Fig 1: Categories of Bigdata.
3.2 Bigdata Analytics
Bigdata analytics is a methodology for examining large data sets containing a variety of data types, i.e., Bigdata, to uncover hidden patterns, unknown correlations, market trends, customer preferences, and other useful business information. These analytical results can lead to new revenue opportunities, improved customer service, enhanced operational efficiency, competitive advantages over rival organizations, and other business benefits. Analytics may be categorized into the following types:
1. Descriptive analytics: The simplest class of analytics, one that allows you to condense big data into smaller, more useful pieces of information.
2. Predictive analytics: The next step up in data reduction; it uses a variety of statistical, modeling, data mining, and machine learning techniques to study recent and historical data, allowing analysts to make predictions about the future.
3. Prescriptive analytics: Generally, it can be explained as a form of predictive analytics that also recommends a course of action so that the business decision-maker can take this information and act (Memon et al., 2017).
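The difference between the first two analytics types can be illustrated with a minimal sketch. The numbers below are hypothetical and are not drawn from the datasets used in this paper; the least-squares line merely stands in for the many possible predictive techniques.

```python
from statistics import mean, median

# Hypothetical daily view counts (illustrative only, not from the paper's data).
daily_views = [120, 135, 150, 170, 160, 185, 200]


def describe(values):
    """Descriptive analytics: condense raw values into a few summary figures."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
    }


def fit_trend(values):
    """Fit an ordinary least-squares line over the positions 0..n-1."""
    xs = list(range(len(values)))
    x_bar, y_bar = mean(xs), mean(values)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict(values, steps_ahead=1):
    """Predictive analytics: extrapolate the fitted line beyond the last value."""
    slope, intercept = fit_trend(values)
    return intercept + slope * (len(values) - 1 + steps_ahead)


print(describe(daily_views))
print(predict(daily_views))
```

A prescriptive step would go further and attach a recommended action to the forecast, which is outside the scope of this sketch.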
3.3 Bigdata Applications
Bigdata has many vital applications in the field of technology. Major applications are listed below (Memon et al., 2017).
a) The Third Eye-Data idea
b) In Banking
c) In Agriculture
d) In Finance
e) In Economy
f) Manufacturing
g) Bioinformatics, etc.
4. Cloud Computing
It is a technology that can provide enormous resources online, at large scale, according to users' requirements (Harjinder, 2019). Cloud computing is a paradigm for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, software, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. This cloud model is composed of five essential characteristics, service models, and four deployment models (Pritzker, 2011). Clouds are classified into (Rashid & Chaturvedi, 2019):
1) Private cloud
2) Public cloud
3) Community cloud and
4) Hybrid cloud.
Cloud service models are grouped into:
a. Data as a Service (DaaS)
b. Software as a Service (SaaS)
c. Platform as a Service (PaaS) and
d. Infrastructure as a Service (IaaS)
Data as a Service (DaaS) is a paradigm in which data are made readily available via a cloud-based platform (Balachandran and Prasad, 2017). Cloud computing offers high-performance processing units, storage, and applications that do not depend on the performance of users' devices. Many users can access resources and requested services remotely from the cloud on a pay-as-you-go basis. That is why users do not need to buy and install costly resources locally.
In the digital world, the amount of Bigdata is increasing very rapidly. Managing, processing, storing, and analyzing Bigdata with the hardware and applications of local machines is a significant issue. We can resolve the issue by using a cloud computing platform. Cloud service providers such as Google, AWS, IBM, and Microsoft have their own robust, cost-efficient systems and products for analyzing Bigdata (Islam and Reza, 2019). Among them, Google Cloud Platform (GCP) is a completely managed platform that provides excellent services to business users.
4.1 Google Cloud Platform (GCP)
GCP has powerful tools for managing, storing, and efficiently analyzing Bigdata while reducing cost and time. BigQuery is Google's data warehouse product, used for analyzing and presenting numerous types of data sets in real time to support the right decisions for industry or business purposes (Saif and Wazir, 2018). BigQuery, Cloud Dataflow, Cloud Dataproc, Cloud Datalab, Cloud Pub/Sub, and Google Genomics are the main Google data analysis services. Among these services, BigQuery is a serverless, user-friendly, low-cost data warehouse for analytics (Kumar, 2016).
Fig 2: Bigdata Services in Google Cloud Platform.
4.2 BigQuery
BigQuery is a fully managed data warehouse in the cloud. It allows economical queries over substantial amounts of data at the speeds one would expect from Google. Taking advantage of low pricing and Google's world-class scalability and security infrastructure, it provides powerful business insights (Kotecha and Joshiyara, 2018). BigQuery is a petabyte-scale service and one of the fastest data warehouse solutions for Bigdata analysis. Without infrastructure or a database administrator, one can easily query, present, and analyze Bigdata using SQL-like commands in BigQuery. Hence, it is used by institutions and business organizations ranging from startups to Fortune 500 companies (Kumar, 2016). Fig 3 shows the data sources that are integrated into BigQuery (Bussiness2Community, 2020).
Fig 3: Bigdata from multiple sources into Google BigQuery.
4.3 Bigdata Integration into Cloud
Bigdata and cloud computing are very closely interrelated. Analyzing Bigdata on a local machine is impractical given the processing time and environment setup involved. Besides, because the data come in different forms and from different sources, it is not easy to extract useful information for decision-making. Fig 4 shows the basic architecture for integrating Bigdata from multiple sources into cloud computing (Tomar & Tomar, 2018).
Fig 4: Integration of Bigdata into Cloud.
5. Case Study
We focus the case study on Bigdata in Google Cloud. We frame it through a problem statement for dataset 01 (ted_main.csv) and dataset 02 (appstore_games.csv). Datasets in any of the following formats can be loaded into Google Cloud:
CSV
JSON (newline delimited only)
Avro
ORC
Parquet
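As a minimal sketch of how the format list above might be applied before an upload, the helper below maps a file name to a source format identifier. The identifier strings follow the names BigQuery uses for load jobs; the mapping from file extensions and the function name source_format_for are assumptions made for illustration.

```python
from pathlib import Path

# Values are BigQuery source format identifiers; keying them by file
# extension is an assumption of this sketch, not a rule of the service.
SOURCE_FORMATS = {
    ".csv": "CSV",
    ".json": "NEWLINE_DELIMITED_JSON",  # JSON must be newline delimited
    ".avro": "AVRO",
    ".orc": "ORC",
    ".parquet": "PARQUET",
}


def source_format_for(filename):
    """Return the source format for a file, or raise if it is not loadable."""
    suffix = Path(filename).suffix.lower()
    try:
        return SOURCE_FORMATS[suffix]
    except KeyError:
        raise ValueError(f"unsupported file type: {suffix or filename!r}") from None


print(source_format_for("ted_main.csv"))
print(source_format_for("appstore_games.csv"))
```

Both case-study files are CSV, so they fall under the first entry of the table.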
Explanation of the steps:
1. Setting up the environment for BigQuery in Google Cloud:
a) Log in to the Console
b) Log in to BigQuery
c) Browse the publicly available sample tables
2. Real-life case study:
a) Download the publicly available data set
b) Upload it to Google BigQuery
c) Result set/ Query E
3. Results of the case study:
a) For the case of a public dataset
b) For the case of a real-time dataset
We create a project on Google Cloud Platform and use the BigQuery service. Publicly available datasets can be accessed and queried through Structured Query Language (SQL) to see various outputs and the data processing speed of BigQuery's data warehouse.
1. Accessing publicly available sample data sets in the BigQuery data warehouse:
a) Click on Products and services
b) Click on the BigQuery product category
c) Click on bigquery-public-data-sets
2. Browsing publicly available datasets and running some queries with the query editor.
3. After clicking on a table, for example, Wikipedia or natality, one can see metadata about the table. Metadata is information about the data. Fig 5 and Fig 6 show column details for the Wikipedia table and the natality table. The tables can be queried by clicking the Query Table button at the top right of the web console.
Fig 5: Sample dataset of Wikipedia on BigQuery.
Fig 6: Sample dataset of natality on BigQuery.
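A query against the public natality sample could look like the sketch below. It only builds the SQL text, so it runs without a Google Cloud account. The table path bigquery-public-data.samples.natality and the column name year are assumptions about the public sample and should be checked against the metadata shown in the console; the function name is hypothetical.

```python
def births_per_year_query(table="bigquery-public-data.samples.natality", limit=10):
    """Build a query that counts natality records per year.

    The table path and the `year` column are assumptions about the public
    sample dataset; verify them in the table's schema view before running.
    """
    limit = int(limit)  # keep the LIMIT clause numeric
    return (
        "SELECT year, COUNT(*) AS births\n"
        f"FROM `{table}`\n"
        "GROUP BY year\n"
        "ORDER BY year DESC\n"
        f"LIMIT {limit}"
    )


sql = births_per_year_query()
print(sql)

# Optional: with the google-cloud-bigquery package installed and credentials
# configured, the same text can be submitted from Python:
#   from google.cloud import bigquery
#   client = bigquery.Client()
#   for row in client.query(sql).result():
#       print(row["year"], row["births"])
```

The printed text can also be pasted directly into the query editor of the web console.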
The Wikipedia and natality data sets used above are publicly available. Data sets are uploaded to the BigQuery data warehouse, and queries are then executed to display results such as those above. This section aims to find a publicly available dataset, upload it into the BigQuery data warehouse, and then run a query to obtain the result. We take a sample TED Talks dataset collected from www.kaggle.com in CSV format. This dataset provides metadata on all TED Talk audio-video recordings posted to the official TED.com website until March 2020 (TED, 2020). The downloaded dataset has information about all the recordings, which were uploaded to YouTube at different times. TED stands for "Technology, Entertainment, and Design"; it is a media company that publishes talks online for free distribution under the slogan "ideas worth spreading" (WIKIPEDIA, 2020).
Problem Statement: The main target is to find the top 2000 topics from TED Talks on YouTube with the maximum all-time views in dataset 01 (ted_main.csv), and to find the top 1500 games on the App Store with the maximum user rating in dataset 02 (appstore_games.csv). To address the problem statement, we used datasets from external and internal sources.
Dataset 01, termed ted_main.csv, is from www.kaggle.com, and Dataset 02, termed appstore_games.csv, is from Google Drive.
URL: https://drive.google.com/file/d/1p2FjJTGMkMmCluhVu0xDchfKUVLIPgnb/view?usp=sharing
Using the BigQuery data warehouse, the following steps were performed to achieve the desired result for data set 01 (ted_main.csv).
1. Finding the Datasets
After some research on Google, a website named www.kaggle.com was found to have multiple publicly available datasets. Two steps are needed to download a dataset: log in to www.kaggle.com with an email ID and password, then follow the URL below to download a CSV file with all TED main dataset records to the local computer. https://www.kaggle.com/rounakbanik/ted-talks
2. Uploading the data set to the BigQuery data warehouse
a) Log in to BigQuery at the URL below: https://bigquery.cloud.google.com/dataset/bigquery-256708:BigData_on_cloud
b) Create a new dataset in BigQuery
After logging into BigQuery, click on the new project (Fig 7). The drop-down menu of the created project shows a few options, the first of which is to create a new dataset. Creating a dataset is the first step in uploading data to the BigQuery data warehouse.
Fig 7: Process to create a new dataset.
For Dataset 01: ted_main.csv
In Fig 8, more details are added for table creation based on the available source data, i.e., the CSV file termed ted_main.csv is uploaded from the local computer. In the next row, after the table name is set, the table button at the top of the page is clicked to create the table in the BigQuery data warehouse. After these steps, the data set exists in BigQuery. The next step is to upload the data source to the BigQuery data warehouse.
In Fig 8, the file path is given for the file that was downloaded from www.kaggle.com.
Fig 8: Uploading file to BigQuery Data ware house.
In Fig 9, the table name is added, which is used for querying the data.
Fig 9: Adding the name of the table.
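The upload shown in Fig 8 and Fig 9 is done through the web console. As a hedged alternative, the same load could be expressed with the bq command-line tool; the sketch below only assembles the command. The dataset name BigData_on_cloud is taken from the URL given earlier, while the table name ted_main and the local file path are assumptions.

```python
import shlex


def bq_load_command(dataset, table, csv_path):
    """Assemble a `bq load` invocation for a local CSV file with a header row."""
    return [
        "bq",
        "load",
        "--source_format=CSV",
        "--autodetect",            # let BigQuery infer the column types
        "--skip_leading_rows=1",   # do not load the header line as data
        f"{dataset}.{table}",
        csv_path,
    ]


# Table name and file path are hypothetical; adjust them to the actual setup.
command = bq_load_command("BigData_on_cloud", "ted_main", "ted_main.csv")
print(shlex.join(command))
```

Running the printed command requires the Google Cloud SDK to be installed and authenticated, which this sketch does not check.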
3. Querying table in the editor
The table is ready for querying to find the top 2000 topics by maximum view count. This is achieved with the query shown in Fig 11. Fig 10 shows a preview of the dataset.
Fig 10: Preview of table details for data set 01.
The final result for Dataset 01: Clicking Query Table as per Fig 10 and writing the SQL query shown in Fig 11 produced the needed output: the top 2000 topics in the TED event in dataset 01 (ted_main.csv) with the maximum views.
Fig 11: Displays the results after writing SQL query in query editor on Dataset 01.
Fig 12: Name vs. views of the data on uploaded data set 01.
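The query itself appears only as a screenshot in Fig 11. A minimal sketch of a query of this kind is given below; it runs against an in-memory SQLite table so that it can be tried locally. The column names name and views follow the axes of Fig 12, the table name ted_main is an assumption, and the three sample rows are hypothetical.

```python
import sqlite3

# Top-N talks by view count; in BigQuery the table would be referenced by
# its full project.dataset.table path and the limit written as 2000.
TOP_TALKS_SQL = """
SELECT name, views
FROM ted_main
ORDER BY views DESC
LIMIT ?
"""


def top_talks(conn, limit=2000):
    """Return (name, views) rows ordered from most to least viewed."""
    return conn.execute(TOP_TALKS_SQL, (limit,)).fetchall()


# Hypothetical sample rows standing in for ted_main.csv.
ted_conn = sqlite3.connect(":memory:")
ted_conn.execute("CREATE TABLE ted_main (name TEXT, views INTEGER)")
ted_conn.executemany(
    "INSERT INTO ted_main VALUES (?, ?)",
    [("Talk A", 1200), ("Talk B", 4500), ("Talk C", 300)],
)

print(top_talks(ted_conn, limit=2))
```

SQLite is used here only as a local stand-in; the ordering and limiting clauses are the part that carries over to the BigQuery editor.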
For Dataset 02: appstore_games.csv
This dataset, termed appstore_games.csv, is taken from Google Drive via a shared link because of uploading issues. For the case study, the dataset was first placed on Google Drive. The table was created by linking to the URL, as shown in Fig 13. The details of the created table are presented in Fig 13 and Fig 14. Finally, the query table is ready, as in Fig 14.
Fig 13: Uploading data set 02 from Google drive.
The final result for Dataset 02: Clicking Query Table as per Fig 14 and writing the SQL query shown there produced the needed output: the top 1500 games on the App Store in data set 02 (appstore_games.csv) with the maximum user rating.
Fig 14: Displays the results after writing SQL query in Query editor on Dataset 02.
Fig 15: Name vs. Views of the data on Created Data set 02.
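As with dataset 01, the query for dataset 02 is shown only as a screenshot. The sketch below illustrates one way to rank games by user rating, again on an in-memory SQLite table. The column names average_user_rating and user_rating_count, the tie-breaking rule, and the sample rows are all assumptions; the actual column names in appstore_games.csv may differ.

```python
import sqlite3

# Rank games by average rating, breaking ties by the number of ratings.
# Unrated games (NULL rating) are excluded from the ranking.
TOP_GAMES_SQL = """
SELECT name, average_user_rating, user_rating_count
FROM appstore_games
WHERE average_user_rating IS NOT NULL
ORDER BY average_user_rating DESC, user_rating_count DESC
LIMIT ?
"""


def top_games(conn, limit=1500):
    """Return the highest-rated games, best first."""
    return conn.execute(TOP_GAMES_SQL, (limit,)).fetchall()


# Hypothetical sample rows standing in for appstore_games.csv.
games_conn = sqlite3.connect(":memory:")
games_conn.execute(
    "CREATE TABLE appstore_games ("
    "name TEXT, average_user_rating REAL, user_rating_count INTEGER)"
)
games_conn.executemany(
    "INSERT INTO appstore_games VALUES (?, ?, ?)",
    [
        ("Game A", 4.5, 100),
        ("Game B", 4.5, 2500),
        ("Game C", None, None),
        ("Game D", 3.0, 90000),
    ],
)

print(top_games(games_conn, limit=2))
```

Whether ties should be broken by rating count, or whether the ranking should use the count alone, depends on how "maximum user rating" is interpreted in the problem statement.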