Welcome
IPN-Dharma IA Lab
It is an initiative of the Artificial Intelligence Laboratory of the CIC at IPN, in collaboration with DHARMA, to encourage researchers, professors, and students to take advantage of the courses, resources, and tools offered by the industry's leading technology platforms in the areas of Machine Learning, Data Science, Cloud Computing, Artificial Intelligence, and the Internet of Things, with the aim of providing hands-on experience through a peer-to-peer, goal-oriented learning model.
Related Programs
Level 3: Building Solutions
Spark Fundamentals
Ever waited overnight for a report to run, only to come back to your computer in the morning and find it still running? When the heat is on and you have a deadline, that is a sign something is not working. As data sets grow larger, you need to be fluent in the right tools to meet your commitments. This learning path is your opportunity to learn about Spark from industry leaders. It provides hands-on exercises and projects to build your confidence with this tool set.
A solid understanding of, and experience with, the core tools of any field promotes excellence and innovation. Apache Spark, a general engine for large-scale data processing, is such a tool in the big data realm. This learning path covers the fundamentals of Spark's design and its everyday application.
Courses in this program
1) Spark Fundamentals I
Ignite your interest in Spark with an introduction to the core concepts that make this general-purpose processing engine an essential tool for working with Big Data.
Learn the fundamentals of Spark, the technology that is revolutionizing the analytics and big data world! Spark is an open source processing engine built around speed, ease of use, and analytics. If you have large amounts of data that require low-latency processing that a typical MapReduce program cannot provide, Spark is the way to go.
- Learn how it performs at speeds up to 100 times faster than MapReduce for iterative algorithms or interactive data mining.
- Learn how it provides in-memory cluster computing for lightning-fast speed and supports Java, Python, R, and Scala APIs for ease of development.
- Learn how it can handle a wide range of data processing scenarios by combining SQL, streaming and complex analytics together seamlessly in the same application.
- Learn how it runs on top of Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources such as HDFS, Cassandra, HBase, or S3.
2) Spark MLlib
Spark provides a machine learning library known as MLlib. Spark MLlib provides various machine learning algorithms for tasks such as classification, regression, clustering, and collaborative filtering. It also provides tools for featurization, pipelines, and persistence, along with utilities for linear algebra, statistics, and data handling.
This course will start you off on your journey and walk you through some of the machine learning libraries and how to use them.
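To make the idea of featurization and pipelines concrete, here is a minimal sketch in plain Python. It does not use Spark or MLlib; the class names `StandardScaler` and `NearestCentroid`, the `fit_pipeline` helper, and the sample data are all hypothetical and chosen only for illustration. It shows the general pattern of chaining a featurization stage and a classification stage so that new data passes through both in the same order.

```python
from statistics import mean, pstdev


class StandardScaler:
    """Featurization stage (hypothetical): rescale each column to zero mean and unit spread."""

    def fit(self, rows):
        cols = list(zip(*rows))
        self.means = [mean(c) for c in cols]
        # Fall back to 1.0 for constant columns to avoid dividing by zero.
        self.stds = [pstdev(c) or 1.0 for c in cols]
        return self

    def transform(self, rows):
        return [
            [(x - m) / s for x, m, s in zip(row, self.means, self.stds)]
            for row in rows
        ]


class NearestCentroid:
    """Classification stage (hypothetical): label a point by its closest class centroid."""

    def fit(self, rows, labels):
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(label, []).append(row)
        self.centroids = {
            label: [mean(c) for c in zip(*members)]
            for label, members in groups.items()
        }
        return self

    def predict(self, rows):
        def dist2(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))

        return [
            min(self.centroids, key=lambda label: dist2(row, self.centroids[label]))
            for row in rows
        ]


def fit_pipeline(rows, labels):
    """Fit the stages in order and return a function that applies them to new rows."""
    scaler = StandardScaler().fit(rows)
    model = NearestCentroid().fit(scaler.transform(rows), labels)
    return lambda new_rows: model.predict(scaler.transform(new_rows))


# Made-up training data: two features on very different scales.
train = [[1.0, 200.0], [1.2, 220.0], [3.0, 900.0], [3.2, 950.0]]
labels = ["small", "small", "large", "large"]

predict = fit_pipeline(train, labels)
predictions = predict([[1.1, 210.0], [3.1, 920.0]])
print(predictions)  # ['small', 'large']
```

Bundling the stages means the scaling learned from the training data is reused unchanged at prediction time, which is the main reason to express a workflow as a pipeline rather than as separate steps.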
3) Exploring Spark's GraphX
Spark provides a graph-parallel computation library in GraphX. Graph-parallel computation is a paradigm in which your data is represented as vertices and edges. Spark's GraphX provides a set of fundamental operators in addition to a growing collection of algorithms and builders to simplify graph analytics tasks.
In this course, you will learn about Spark GraphX components and the background of graph-parallel operations. You will see how Spark implements this with RDDs and how it compares with data-parallel operations. You will get to explore how to visualize your data using various graph operators.
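The vertices-and-edges representation can be sketched in a few lines of plain Python. This is not GraphX code (GraphX is a Scala library); the `aggregate_messages` helper below is a hypothetical toy, loosely inspired by the idea of sending a message along each edge and merging the messages that arrive at each vertex. The sample graph is made up.

```python
# Vertices: id -> property. Edges: (source id, destination id, property).
vertices = {1: "alice", 2: "bob", 3: "carol"}
edges = [(1, 2, "follows"), (3, 2, "follows"), (2, 1, "follows")]


def aggregate_messages(edges, send, merge):
    """Send one message per edge to its destination vertex and merge messages per vertex."""
    inbox = {}
    for src, dst, attr in edges:
        msg = send(src, dst, attr)
        inbox[dst] = merge(inbox[dst], msg) if dst in inbox else msg
    return inbox


# In-degree: every edge sends a 1, and messages are summed at the destination.
in_degrees = aggregate_messages(
    edges, send=lambda src, dst, attr: 1, merge=lambda a, b: a + b
)

# Follower names: every edge sends the source's name, and lists are concatenated.
followers = aggregate_messages(
    edges, send=lambda src, dst, attr: [vertices[src]], merge=lambda a, b: a + b
)

print(in_degrees)  # {2: 2, 1: 1}
print(followers)   # {2: ['alice', 'carol'], 1: ['bob']}
```

Note that vertex 3 does not appear in either result because no edge points to it. In a data-parallel view the same edges would simply be rows in a table; the graph-parallel view adds the notion of computing per vertex from its neighbors.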
4) Analyzing Big Data in R using Apache Spark
Apache Spark is a popular cluster computing framework used for performing large-scale data analysis. SparkR provides a distributed data frame API that enables structured data processing with a syntax familiar to R users.
- Learn why R is a popular statistical programming language with a number of extensions that support data processing and machine learning tasks.
- Learn how SparkR, an R package, provides a lightweight frontend for using Apache Spark from R.
5) Spark Fundamentals II
Building on your foundational knowledge of Spark, take this opportunity to move your skills to the next level. With a focus on Resilient Distributed Dataset (RDD) operations, this course exposes you to concepts that are critical to your success in this field.
Expand your knowledge of the concepts discussed in Spark Fundamentals I with a focus on RDDs (Resilient Distributed Datasets). RDDs are the main abstraction Spark provides to enable parallel processing across the nodes of a Spark cluster.
- Get in-depth knowledge of Spark’s architecture and how data is distributed and tasks are parallelized.
- Learn how to optimize your data for joins using Spark’s memory caching.
- Learn how to use the more advanced operations available in the API.
- The lab exercises for this course are performed exclusively in the cloud using a notebook interface.
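The role of partitioning and memory caching can be illustrated with a small toy model in plain Python. This is a sketch under stated assumptions, not Spark's API or implementation: `ToyRDD` and all of its methods are hypothetical stand-ins that run in a single process. It shows three ideas: data split into partitions, transformations that are recorded but not run until a result is requested, and a cache that prevents recomputation when the same dataset is reused.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: partitioned data plus a lazy recipe to compute it."""

    def __init__(self, compute):
        self._compute = compute  # function that returns a list of partitions
        self._keep = False       # whether to keep the result after first computation
        self._cached = None

    @classmethod
    def parallelize(cls, data, num_partitions):
        data = list(data)
        # Round-robin split; a real cluster would place partitions on different nodes.
        parts = [data[i::num_partitions] for i in range(num_partitions)]
        return cls(lambda: parts)

    def _partitions(self):
        if self._cached is not None:
            return self._cached
        parts = self._compute()
        if self._keep:
            self._cached = parts
        return parts

    def map(self, fn):
        # Lazy: only records how to derive the new partitions from this dataset.
        return ToyRDD(lambda: [[fn(x) for x in part] for part in self._partitions()])

    def filter(self, pred):
        return ToyRDD(lambda: [[x for x in part if pred(x)] for part in self._partitions()])

    def cache(self):
        self._keep = True
        return self

    def collect(self):
        # Action: triggers the computation and gathers all partitions.
        return [x for part in self._partitions() for x in part]


calls = {"n": 0}


def expensive_square(x):
    calls["n"] += 1  # count how often the costly step really runs
    return x * x


squares = ToyRDD.parallelize(range(6), num_partitions=2).map(expensive_square).cache()

evens = squares.filter(lambda x: x % 2 == 0).collect()  # computes and caches squares
total = sum(squares.collect())                          # served from the cache

print(evens, total, calls["n"])  # [0, 4, 16] 55 6
```

Without the `cache()` call, the second `collect()` would re-run `expensive_square` on every element and the counter would reach 12 instead of 6. The same reasoning explains why keeping a dataset in memory helps when it is reused across several operations, such as repeated joins.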