BIG DATA 2025

Big Data requires new tools and technologies that can handle the complexity of unstructured and continuously expanding data. Traditional relational database technologies (RDBMS) are not adequate for this. In addition, advanced analysis and visualization applications are needed in order to extract the full potential of the data and exploit it for business objectives. Let's look at some of the main tools below:

Hadoop: it is an open source framework that allows us to store, manage, analyze and process large volumes of data. Hadoop implements MapReduce, a programming model that supports parallel computing over large collections of data.
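As a simple illustration of the model (not of Hadoop itself), the following minimal Python sketch simulates the map, shuffle and reduce phases of a word count in a single process; on a real cluster, Hadoop would run many mappers and reducers in parallel:

    # Minimal single-process sketch of the MapReduce model (word count).
    # On Hadoop, the map and reduce phases would run in parallel across a cluster.
    from collections import defaultdict

    def map_phase(document):
        # Map: emit (key, value) pairs -- here, (word, 1) for every word.
        for word in document.split():
            yield word.lower(), 1

    def reduce_phase(word, counts):
        # Reduce: combine all the values emitted for the same key.
        return word, sum(counts)

    documents = ["big data needs new tools", "data tools for big data"]

    # Shuffle: group the intermediate pairs by key before reducing.
    grouped = defaultdict(list)
    for doc in documents:
        for word, one in map_phase(doc):
            grouped[word].append(one)

    print(dict(reduce_phase(w, c) for w, c in grouped.items()))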

NoSQL: these are systems that do not use SQL as their query language. Although they relax the ACID guarantees of traditional databases (atomicity, consistency, isolation and durability), this allows them to obtain significant gains in scalability and performance when working with Big Data. One of the most popular NoSQL databases is MongoDB.
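As a brief sketch of the document model, the following Python snippet stores and queries JSON-like documents through the pymongo driver; the connection string, database and collection names are assumptions made for the example:

    # Minimal sketch: storing and querying documents with MongoDB via pymongo.
    # Assumes a local MongoDB instance; database and collection names are examples.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    db = client["shop"]  # databases and collections are created lazily

    # Documents are flexible JSON-like structures, not fixed relational rows.
    db.products.insert_one({"name": "sensor", "price": 19.9, "tags": ["iot", "hw"]})

    # Query by field value; no SQL or JOINs involved.
    for doc in db.products.find({"tags": "iot"}):
        print(doc["name"], doc["price"])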

Spark: is an open source cluster computing framework for processing data quickly. It allows you to write applications in Java, Scala, Python, R and SQL and runs on Hadoop, Apache Mesos and Kubernetes, as well as standalone or in the cloud, and it can access hundreds of data sources.
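As an illustration, a minimal PySpark job might look like the sketch below; the file path and column names are assumptions for the example:

    # Minimal sketch of a PySpark job: count events per user in a CSV file.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("event-counts").getOrCreate()
    events = spark.read.csv("events.csv", header=True, inferSchema=True)

    # Transformations are lazy; Spark builds an execution plan and runs it in
    # parallel across the cluster when an action (show/collect/write) is called.
    counts = events.groupBy("user_id").agg(F.count("*").alias("n_events"))
    counts.orderBy(F.desc("n_events")).show(10)

    spark.stop()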

Storm: is an open source distributed real-time computation system. Storm makes it simple to process unbounded streams of data in real time and can be used with any programming language.

Hive: is a data warehouse infrastructure built on top of Hadoop. It facilitates reading, writing and managing large datasets that reside in distributed storage using SQL.
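As an illustration, HiveQL queries can also be issued from Python, for example through the PyHive client; the host, port, credentials and table name below are assumptions for the example:

    # Minimal sketch: querying a Hive table with HiveQL through the PyHive client.
    # Connection details and the table name are assumptions; authentication may
    # need to be configured differently depending on the cluster.
    from pyhive import hive

    conn = hive.Connection(host="localhost", port=10000, username="analyst")
    cursor = conn.cursor()

    # Hive translates this SQL into jobs over data kept in distributed storage.
    cursor.execute(
        "SELECT product_id, COUNT(*) AS sales "
        "FROM sales_2025 GROUP BY product_id ORDER BY sales DESC LIMIT 10"
    )
    for product_id, sales in cursor.fetchall():
        print(product_id, sales)

    conn.close()
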
R: it is one of the programming languages most used in statistical analysis and data mining. It can be integrated with different databases and allows high-quality graphics to be generated.

D3.js: is a JavaScript library for producing dynamic, interactive data visualizations in web browsers, using HTML, SVG and CSS.
4 key steps to get into Big Data
In order to start enjoying the benefits of Big Data, any organization needs to have four key assets:

First, the data. In an environment where data is exploding, availability does not seem to be the problem. What should concern us instead is maintaining its quality and knowing how to handle and exploit it correctly.
Second, adequate analytical tools are needed. This is not a barrier for companies today either, given the wide availability on the market of both proprietary and open source tools and platforms.

This brings us to the third fundamental asset: the human factor. Having the right professionals in our organization, such as data scientists, but also experts in the legal implications of data management and privacy, is emerging as the most important challenge.
However, equipping ourselves with these three assets and putting them to work will not guarantee our success with Big Data either. The fourth key asset is cultural: to be truly data-driven companies, we will need to carry out a radical transformation of our processes and business culture, placing data at the center of the company and ensuring that all departments, from IT to senior management, adopt this new focus.

The challenges of Big Data

Nowadays no company can ignore Big Data and the implications it has for its business. However, it is a relatively new and constantly evolving concept, and there are many challenges that organizations face when dealing with Big Data. Among them:

Technology: Big Data tools like Hadoop are not easy to administer and require specialized data professionals as well as significant resources for maintenance.
Scalability: a Big Data project can grow very quickly, so a company has to take this into account when allocating resources, so that the project does not suffer interruptions and analysis remains continuous.

Talent: the necessary profiles for Big Data are scarce and companies are faced with the challenge of finding the right professionals and, at the same time, of training their employees on this new paradigm.

Actionable insights: faced with such an amount of data, the challenge for a company is to identify clear business objectives and analyze the appropriate data to achieve them.
Data quality: as we have seen before, it is necessary to keep data clean so that decision making is based on quality data.

Costs: the data will continue to grow, and with it the cost of storing and processing it.

BIG DATA

Big data is an evolving term that describes any voluminous amount of structured, semi-structured and unstructured data that has the potential to be mined for information.

Big data is often characterized by three Vs: the extreme volume of data, the wide variety of data types, and the velocity at which the data must be processed. Although big data does not equate to any specific volume of data, the term is often used to describe terabytes, petabytes, and even exabytes of data captured over time.

Breaking down the 3 Vs of big data
Such voluminous data can come from countless different sources, such as commercial sales records, the results collected from scientific experiments, or real-time sensors used in the Internet of Things (IoT). The data can be raw, or preprocessed using separate software tools before analytics are applied.

Data can also exist in a wide variety of file types, including structured data, such as SQL database stores; unstructured data, such as document files; or streaming data from sensors. In addition, big data can involve multiple, simultaneous data sources that might otherwise not be integrated. For example, a big data analytics project may attempt to measure the success of a product and forecast future sales by correlating past sales data, return data, and online buyer review data for that product.
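For instance, that kind of correlation might be sketched as follows once the three sources have been loaded into tabular form; the column names and figures are invented for the example:

    # Minimal sketch: correlating three data sources about one product line.
    # In a real project each source might arrive in a different format
    # (SQL extract, CSV of returns, JSON reviews) and at a different rate.
    import pandas as pd

    sales = pd.DataFrame({"product_id": [1, 2], "units_sold": [500, 120]})
    returns = pd.DataFrame({"product_id": [1, 2], "units_returned": [25, 30]})
    reviews = pd.DataFrame({"product_id": [1, 2], "avg_rating": [4.4, 2.9]})

    # Integrate the sources on a common key before trying to measure "success".
    merged = sales.merge(returns, on="product_id").merge(reviews, on="product_id")
    merged["return_rate"] = merged["units_returned"] / merged["units_sold"]

    print(merged[["product_id", "return_rate", "avg_rating"]])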

Finally, velocity refers to the window of time in which large volumes of data must be analyzed. Every big data analytics project will ingest, correlate and analyze the data sources, and then provide an answer or result based on an overarching query. This means that human analysts must have a detailed understanding of the available data and some sense of the answer they are looking for.

Velocity is also significant as data analysis expands into fields such as machine learning and artificial intelligence, where analytical processes mimic perception by finding and using patterns in the collected data.

Big data infrastructure demands
The need to handle big data velocity imposes unique demands on the underlying computing infrastructure. The computing power required to quickly process huge volumes and varieties of data can overwhelm a single server or server cluster. Organizations must apply adequate compute power to big data tasks to achieve the desired velocity. This can potentially demand hundreds or thousands of servers that can distribute the work and operate collaboratively.

Achieving that velocity in a cost-effective way is also a headache. Many business leaders are reluctant to invest in an extensive server and storage infrastructure that may only be used occasionally to complete big data tasks. As a result, public cloud computing has emerged as a primary vehicle for hosting big data analytics projects. A public cloud provider can store petabytes of data and scale up thousands of servers just long enough to complete the big data project. The business only pays for the storage and compute time actually used, and the cloud instances can be turned off until they are needed again.

To further improve service levels, some public cloud providers offer big data capabilities, such as highly distributed Hadoop compute instances, data warehouses, databases, and other related cloud services. Amazon Elastic MapReduce (Amazon EMR) on Amazon Web Services is one example of big data services in a public cloud.
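A minimal sketch of that pay-per-use pattern with boto3, assuming default EMR roles and an illustrative release label and instance types, might look like this:

    # Minimal sketch: launching a transient EMR cluster that shuts itself down
    # when its work is finished. Names, region, release label and instance
    # types are assumptions for the example.
    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    response = emr.run_job_flow(
        Name="nightly-bigdata-job",
        ReleaseLabel="emr-6.15.0",
        Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}],
        Instances={
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": False,  # terminate when steps finish
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )

    print("Cluster id:", response["JobFlowId"])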

The human side of big data analytics
Ultimately, the value and effectiveness of big data depends on the human operators charged with understanding the data and formulating the appropriate queries to steer big data projects. Some big data tools fill specialized niches and allow less technical users to make various predictions from everyday business data. Still, other tools are emerging, such as Hadoop appliances, to help businesses implement a suitable computing infrastructure to tackle big data projects while minimizing the need for hardware and distributed-computing software expertise.

But these tools only address limited use cases. Many other big data tasks, such as determining the effectiveness of a new drug, can require substantial scientific and computational expertise.

IEEE 802

IEEE 802 is a project of the Institute of Electrical and Electronics Engineers (better known by its initials, IEEE). It is also identified with the acronym LMSC (LAN / MAN Standards Committee). Its mission is to develop standards for local area networks (LAN) and metropolitan area networks (MAN), mainly in the lower two layers of the OSI model. [1]

IEEE 802 was a project created in February 1980 in parallel with the design of the OSI Model. It was developed in order to create standards so that different types of technologies could be integrated and work together. Project 802 defines aspects related to physical cabling and data transmission.

IEEE 802 is also the name of the IEEE committee that works on computer networks; specifically, according to its own definition, on local area networks (LAN) and metropolitan area networks (MAN). The IEEE 802 name is also used to refer to the standards it proposes, some of which are well known: Ethernet (IEEE 802.3) or Wi-Fi (IEEE 802.11). It has even attempted to standardize Bluetooth as IEEE 802.15.

It focuses on defining the lowest layers (according to the OSI reference model or any other model). Specifically, it subdivides the second layer, the data link layer, into two sublayers: Logical Link Control (LLC), specified in 802.2, and Medium Access Control (MAC). The remaining standards act both at the physical layer and at the MAC sublayer.

In February 1980, a local area network committee was formed within the IEEE with the intention of standardizing a 1 or 2 Mbps system that was essentially the Ethernet of the time. They decided to standardize the physical, data link and higher layers. They divided the data link layer into two sublayers: the logical link sublayer, in charge of retransmission logic, flow control and error checking, and the medium access sublayer, in charge of arbitrating conflicts when stations access the network simultaneously.

By the end of the year, the standard had already been extended to include IBM's Token Ring (a token-passing ring network), and a year later, under pressure from industrial groups, Token Bus (a token-passing bus network) was added; it included real-time and redundancy options and was assumed to be suitable for factory environments.

Each of these three “standards” had a different physical layer and a different medium access sublayer, albeit with some common features (address space and error checking), and a single logical link layer shared by all of them.

Later, the fields of work were expanded to include metropolitan area networks (around ten kilometers), personal area networks (a few meters) and regional networks (around a hundred kilometers), as well as wireless networks (WLAN), security methods, comfort, etc.

Parallel, Distributed, and Network-Based Processing

The growing number of interesting and significant research papers submitted to PDP demonstrates that the conference is becoming an ever more important international event in the field of parallel and distributed computing research. In particular, the Program Committee of this edition received 239 submissions from 54 countries, which is a record number.
On average, each paper received 3.5 reviews. The result was the selection of 68 regular papers for publication in these proceedings: the overall acceptance rate of full papers at PDP 2015 is thus 28% (31% including the special sessions).

Parallel, Distributed, and Network-Based Processing has undergone impressive change over recent years. New architectures and applications have rapidly become the central focus of the discipline. These changes are often a result of cross-fertilisation of parallel and distributed technologies with other rapidly evolving technologies. It is of paramount importance to review and assess these new developments in comparison with recent research achievements in the well-established areas of parallel and distributed computing, from industry and the scientific community. PDP 2015 will provide a forum for the presentation of these and other issues through original research presentations and will facilitate the exchange of knowledge and new ideas at the highest technical level.

Topics of interest include, but are not restricted to:

  • Parallel Computing: massively parallel machines; embedded parallel and distributed systems; multi- and many-core systems; GPU and FPGA based parallel systems; parallel I/O; memory organisation.
  • Distributed and Network-based Computing: Cluster, Grid, Web and Cloud computing; mobile computing; interconnection networks.
  • Big Data: large scale data processing; distributed databases and archives; large scale data management; metadata; data intensive applications.
  • Models and Tools: programming languages and environments; runtime support systems; performance prediction and analysis; simulation of parallel and distributed systems.
  • Systems and Architectures: novel system architectures; high data throughput architectures; service-oriented architectures; heterogeneous systems; shared-memory and message-passing systems; middleware and distributed operating systems; dependability and survivability; resource management.
  • Advanced Algorithms and Applications: distributed algorithms; multi-disciplinary applications; computations over irregular domains; numerical applications with multi-level parallelism; real-time distributed applications.