© Cloudera, Inc. All rights reserved.

A [Incomplete] Data Tools Landscape [for Hackers] in 2015

Wes McKinney @wesmckinn
Data^3 Meeting, Minneapolis, MN
  
This talk

•  A partial look at different languages and tools
•  Limiting scope to either:
  • Permissively licensed open source software, e.g. Apache-licensed (OSS)
  • Non-dual-licensed copyleft OSS (e.g. GPL)
  • i.e. “do you [the community] have any incentive to create patches?”
•  Some trends (that I see, anyway)
•  Challenges and opportunities
  
Who am I?

•  Python data firestarter
•  Financial analytics in R / Python starting 2007
•  pandas project born of frustration in 2008
•  2010-2012
  • Hiatus from gainful employment
  • Make pandas ready for primetime
  • Write "Python for Data Analysis"
  
Who am I? (cont’d)

•  2013-2014: Co-founder/CEO of DataPad (analytics startup, with early pandas collaborator Chang She)
•  Late 2014: DataPad team joins Cloudera
•  Now: backend systems and all-things-Python @ Cloudera
  
SQL: Still a lingua franca

•  “SQL: the Fortran of Analytics”
•  Often a concise, declarative way to express data transforms, analytics, etc.
•  Relatively easy to parse, analyze
•  SQL recently has seen resurgence with focus on interactive-speed SQL engines, especially on top of HDFS/Hadoop
•  Relevant and impactful features (e.g. JSON support) still arriving in established RDBMS like PostgreSQL
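
To make the "concise, declarative" point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `sales` table, its columns, the sample rows, and the `revenue_by_region` helper are all invented for illustration and do not come from the talk.

```python
import sqlite3


def revenue_by_region(rows):
    """Aggregate (region, amount) rows with a single declarative SQL query."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    try:
        conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
        conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
        # The query states what result is wanted; the engine picks the plan.
        cursor = conn.execute(
            "SELECT region, SUM(amount) AS total "
            "FROM sales GROUP BY region ORDER BY total DESC"
        )
        return cursor.fetchall()
    finally:
        conn.close()


# Hypothetical sample data
SAMPLE = [("north", 10.0), ("south", 4.0), ("north", 2.5), ("east", 7.0)]

if __name__ == "__main__":
    for region, total in revenue_by_region(SAMPLE):
        print(region, total)
```

The grouping, summing, and sorting are expressed in one statement, with no explicit loops or intermediate dictionaries in the calling code.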
  
Historical Python Context

•  Scientific / HPC computing focus in 1990s, 2000s
  • Python web community developed in parallel, matured faster!
•  NumPy became community standard in 2005, born from Numeric + Numarray
•  Pyrex, later Cython, easier C / C++ wrapping
•  f2py: easy Fortran wrapping
•  Anaconda distribution
  • Finally solving Python deployment for all
  
Essential Python stack

•  NumPy: low-level array processing
•  SciPy: essential computational algos
•  pandas: data wrangling
•  scikit-learn: machine learning
•  matplotlib (+ add-ons, like seaborn): visualization
•  numba: numeric hotspot LLVM compiler
•  Domain-specific toolkits: nltk, scikit-image, statsmodels, Theano, PyCUDA/PyOpenCL and many others
  
pandas

•  A Pythonic take on the classic R “data frame” data structure
•  Critical piece to make the Python stack useful in everyday work
•  Added axis metadata / labeling for representing multidimensional data
•  Focus on easy data wrangling, IO, plotting, and basic analytics
  
Jeff Reback’s “pandas as PyData middleware” diagram
  
Newer / Up-and-coming Python projects

•  Bokeh: interactive / reactive visualization for the web
•  Blaze: uniform data expression API
•  Odo: easy data migration
  
R Project

•  Trusted base of statistics libraries
  • Latest and greatest stats research often hits R first
•  RStudio
•  The “Hadley stack”
  • Visualization: ggplot2 (static) and ggvis (interactive)
  • Data Wrangling: dplyr
  • legacy: plyr / reshape2
  
dplyr

•  Started late 2012 by Hadley Wickham, supported by RStudio
•  Composable / chainable analytics and data wrangling expressions
•  In-memory and SQL backends
•  Has attracted folks back to R from Python in a lot of cases
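
The "composable / chainable" idea can be sketched in Python. This is a toy illustration and not dplyr's API: the `Table` class and its `filter`, `mutate`, and `summarise_by` methods are hypothetical names loosely modeled on dplyr-style verbs, and the flight data is made up.

```python
class Table:
    """A toy table: a list of dict rows with chainable verbs."""

    def __init__(self, rows):
        self.rows = [dict(r) for r in rows]

    def filter(self, predicate):
        # Keep rows for which predicate(row) is true; return a new Table.
        return Table(r for r in self.rows if predicate(r))

    def mutate(self, **new_columns):
        # Add columns; each keyword value is a function of the row.
        return Table(
            {**r, **{name: fn(r) for name, fn in new_columns.items()}}
            for r in self.rows
        )

    def summarise_by(self, key, column, agg):
        # Group rows by `key`, then reduce `column` in each group with `agg`.
        groups = {}
        for r in self.rows:
            groups.setdefault(r[key], []).append(r[column])
        return Table({key: k, column: agg(v)} for k, v in groups.items())


# Hypothetical sample data
flights = Table([
    {"carrier": "AA", "distance": 100, "minutes": 60},
    {"carrier": "AA", "distance": 300, "minutes": 120},
    {"carrier": "UA", "distance": 200, "minutes": 60},
    {"carrier": "UA", "distance": 50, "minutes": 0},
])

# Each verb returns a new Table, so steps compose into one readable chain.
result = (
    flights
    .filter(lambda r: r["minutes"] > 0)
    .mutate(mph=lambda r: r["distance"] / (r["minutes"] / 60))
    .summarise_by("carrier", "mph", max)
)

if __name__ == "__main__":
    print(result.rows)
```

Because every step takes a table and returns a table, the same chain could in principle be handed to a different backend, which is the property the in-memory and SQL backends bullet points at.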
  
Some other great R stuff

•  shiny: interactive web apps in R
•  Rcpp
•  data.table
•  xts
  
IPython

•  IPython started out as a better interactive Python
•  Grew to include web-based computational notebook, GUI console, and other components
  • (Google even integrated into Google Drive!)
•  IPython Notebook architecture enabled “kernel” processes to be written in nearly any language (even bash!)
•  How to build community beyond Python?
  
Enter Jupyter

•  http://jupyter.org
•  Breaking out notebook machinery into a standalone non-Python-specific project
•  Enable project components to evolve at own pace, without large monolithic releases
•  JupyterHub: upcoming multi-user notebook server
  
A few words about Hadoop + Big Data
  
Apache Spark

•  Originated from Berkeley AMPLab
•  General purpose distributed memory-centric data processing framework
•  Official APIs: Scala, Java, Python

Source: databricks.com
  
Spark 1.3: DataFrames!

•  R/pandas-inspired API for tabular data manipulation in Scala, Python, etc.
•  Logical operation graphs rewritten internally in more efficient form
•  Good interop with Spark SQL
•  Some interoperability with pandas
•  Will help close the semantic gap between Spark and R/Python
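
The idea of recording operations as a logical graph and rewriting it before execution can be sketched in plain Python. This is a toy under stated assumptions, not Spark's optimizer or API: `LazyFrame`, `Filter`, `Select`, and `optimize` are hypothetical names, and the only rewrite rule shown is fusing adjacent filters into a single pass.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Filter:
    predicate: Callable[[dict], bool]


@dataclass(frozen=True)
class Select:
    columns: Tuple[str, ...]


def optimize(plan):
    """Rewrite rule: fuse adjacent Filter nodes into one Filter."""
    out = []
    for op in plan:
        if isinstance(op, Filter) and out and isinstance(out[-1], Filter):
            prev, cur = out[-1].predicate, op.predicate
            # Bind prev/cur as defaults so each fused lambda keeps its own pair.
            out[-1] = Filter(lambda r, p=prev, c=cur: p(r) and c(r))
        else:
            out.append(op)
    return out


class LazyFrame:
    """Records operations as a logical plan; nothing runs until collect()."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self.plan = tuple(plan)

    def filter(self, predicate):
        return LazyFrame(self._rows, self.plan + (Filter(predicate),))

    def select(self, *columns):
        return LazyFrame(self._rows, self.plan + (Select(columns),))

    def collect(self):
        rows = self._rows
        # Execute the rewritten plan, not the plan as the user wrote it.
        for op in optimize(self.plan):
            if isinstance(op, Filter):
                rows = [r for r in rows if op.predicate(r)]
            else:
                rows = [{c: r[c] for c in op.columns} for r in rows]
        return rows


# Hypothetical sample data
people = [
    {"name": "Ann", "age": 34, "city": "Minneapolis"},
    {"name": "Bob", "age": 19, "city": "Minneapolis"},
    {"name": "Cy", "age": 41, "city": "Austin"},
]

query = (
    LazyFrame(people)
    .filter(lambda r: r["age"] > 21)
    .filter(lambda r: r["city"] == "Minneapolis")
    .select("name")
)

if __name__ == "__main__":
    print(len(query.plan), len(optimize(query.plan)))
    print(query.collect())
```

The user writes three steps, the executed plan has two, and the result is unchanged. A real engine applies many more rules of this kind, but the separation between what was written and what is run is the same.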
  
Some problems in need of solving

•  A Shiny-like quick-and-dirty data app development framework for Python
•  IPython/Jupyter notebook collaboration
•  A community-standard, Apache-licensed C/C++ data frame library with best-in-class performance
•  Ubiquitous support for emerging analytical on-disk storage standards like Parquet
  
Other interesting stuff to look at

•  Torch7 / LuaJIT: high performance ML / deep learning on GPUs
  • Facebook AI group open sourced several ML modules
•  Apache Flink
  • Up-and-coming Scala-based data processing framework
  • Some overlap with Spark use cases
  
Some other interesting industry trends

•  Microsoft
  • Acquired Revolution Analytics, leading commercial R vendor
  • Launched Azure ML: R, Python, and more on Azure cloud
•  Dato (fka GraphLab)
  • faster, more scalable machine learning, with Python interface (Paid commercial product, free for non-commercial/academic use)
  • Largest-ever VC investment in a data tools company betting big on Python
•  Databricks
  • Offering cloud Spark-notebook-as-a-service
  
Thank you

@wesmckinn
  

More Related Content

What's hot

Apache Arrow and Python: The latest
Apache Arrow and Python: The latestApache Arrow and Python: The latest
Apache Arrow and Python: The latestWes McKinney
 
Next-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache ArrowNext-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache ArrowWes McKinney
 
High Performance Python on Apache Spark
High Performance Python on Apache SparkHigh Performance Python on Apache Spark
High Performance Python on Apache SparkWes McKinney
 
DataFrames: The Good, Bad, and Ugly
DataFrames: The Good, Bad, and UglyDataFrames: The Good, Bad, and Ugly
DataFrames: The Good, Bad, and UglyWes McKinney
 
How Apache Arrow and Parquet boost cross-language interoperability
How Apache Arrow and Parquet boost cross-language interoperabilityHow Apache Arrow and Parquet boost cross-language interoperability
How Apache Arrow and Parquet boost cross-language interoperabilityUwe Korn
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataWes McKinney
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningWes McKinney
 
Enabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenWes McKinney
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Wes McKinney
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataWes McKinney
 
Python Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the FuturePython Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the FutureWes McKinney
 
Improving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityImproving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityWes McKinney
 
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Wes McKinney
 
Ibis: Scaling Python Analytics on Hadoop and Impala
Ibis: Scaling Python Analytics on Hadoop and ImpalaIbis: Scaling Python Analytics on Hadoop and Impala
Ibis: Scaling Python Analytics on Hadoop and ImpalaWes McKinney
 
Large Scale Graph Analytics with JanusGraph
Large Scale Graph Analytics with JanusGraphLarge Scale Graph Analytics with JanusGraph
Large Scale Graph Analytics with JanusGraphP. Taylor Goetz
 
PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015Cloudera, Inc.
 
Improving Python and Spark Performance and Interoperability with Apache Arrow
Improving Python and Spark Performance and Interoperability with Apache ArrowImproving Python and Spark Performance and Interoperability with Apache Arrow
Improving Python and Spark Performance and Interoperability with Apache ArrowJulien Le Dem
 
Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Using SparkR to Scale Data Science Applications in Production. Lessons from t...Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Using SparkR to Scale Data Science Applications in Production. Lessons from t...DataWorks Summit/Hadoop Summit
 

What's hot (20)

Apache Arrow and Python: The latest
Apache Arrow and Python: The latestApache Arrow and Python: The latest
Apache Arrow and Python: The latest
 
Next-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache ArrowNext-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache Arrow
 
High Performance Python on Apache Spark
High Performance Python on Apache SparkHigh Performance Python on Apache Spark
High Performance Python on Apache Spark
 
DataFrames: The Good, Bad, and Ugly
DataFrames: The Good, Bad, and UglyDataFrames: The Good, Bad, and Ugly
DataFrames: The Good, Bad, and Ugly
 
How Apache Arrow and Parquet boost cross-language interoperability
How Apache Arrow and Parquet boost cross-language interoperabilityHow Apache Arrow and Parquet boost cross-language interoperability
How Apache Arrow and Parquet boost cross-language interoperability
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory data
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine Learning
 
Enabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data Citizen
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory Data
 
Python Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the FuturePython Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the Future
 
Apache Arrow - An Overview
Apache Arrow - An OverviewApache Arrow - An Overview
Apache Arrow - An Overview
 
Improving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityImproving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and Interoperability
 
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
 
Ibis: Scaling Python Analytics on Hadoop and Impala
Ibis: Scaling Python Analytics on Hadoop and ImpalaIbis: Scaling Python Analytics on Hadoop and Impala
Ibis: Scaling Python Analytics on Hadoop and Impala
 
Apache Spark Briefing
Apache Spark BriefingApache Spark Briefing
Apache Spark Briefing
 
Large Scale Graph Analytics with JanusGraph
Large Scale Graph Analytics with JanusGraphLarge Scale Graph Analytics with JanusGraph
Large Scale Graph Analytics with JanusGraph
 
PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015
 
Improving Python and Spark Performance and Interoperability with Apache Arrow
Improving Python and Spark Performance and Interoperability with Apache ArrowImproving Python and Spark Performance and Interoperability with Apache Arrow
Improving Python and Spark Performance and Interoperability with Apache Arrow
 
Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Using SparkR to Scale Data Science Applications in Production. Lessons from t...Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Using SparkR to Scale Data Science Applications in Production. Lessons from t...
 

Viewers also liked

User Experience for Business Analysts
User Experience for Business AnalystsUser Experience for Business Analysts
User Experience for Business AnalystsCarol Smith
 
Salesforce DX Pilot Product Overview
Salesforce DX Pilot Product OverviewSalesforce DX Pilot Product Overview
Salesforce DX Pilot Product OverviewSalesforce Partners
 
Productive Data Tools for Quants
Productive Data Tools for QuantsProductive Data Tools for Quants
Productive Data Tools for QuantsWes McKinney
 
How To Be A Hacker
How To Be A HackerHow To Be A Hacker
How To Be A HackerPaul Tarjan
 
Success Community Wizard Overview
Success Community Wizard OverviewSuccess Community Wizard Overview
Success Community Wizard OverviewDavid Giller
 
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...DataWorks Summit/Hadoop Summit
 
Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Julien Le Dem
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Cloudera, Inc.
 
pandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Pythonpandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for PythonWes McKinney
 
Python for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasPython for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasWes McKinney
 
AI for IA's: Machine Learning Demystified at IA Summit 2017 - IAS17
AI for IA's: Machine Learning Demystified at IA Summit 2017 - IAS17AI for IA's: Machine Learning Demystified at IA Summit 2017 - IAS17
AI for IA's: Machine Learning Demystified at IA Summit 2017 - IAS17Carol Smith
 
Understanding deep learning requires rethinking generalization (2017) 2 2(2)
Understanding deep learning requires rethinking generalization (2017)    2 2(2)Understanding deep learning requires rethinking generalization (2017)    2 2(2)
Understanding deep learning requires rethinking generalization (2017) 2 2(2)정훈 서
 

Viewers also liked (16)

Hacking
HackingHacking
Hacking
 
User Experience for Business Analysts
User Experience for Business AnalystsUser Experience for Business Analysts
User Experience for Business Analysts
 
Riding the Enterprise Integration train
Riding the Enterprise Integration trainRiding the Enterprise Integration train
Riding the Enterprise Integration train
 
Salesforce DX Pilot Product Overview
Salesforce DX Pilot Product OverviewSalesforce DX Pilot Product Overview
Salesforce DX Pilot Product Overview
 
Productive Data Tools for Quants
Productive Data Tools for QuantsProductive Data Tools for Quants
Productive Data Tools for Quants
 
How To Be A Hacker
How To Be A HackerHow To Be A Hacker
How To Be A Hacker
 
Hacking For Innovation Delhi
Hacking For Innovation DelhiHacking For Innovation Delhi
Hacking For Innovation Delhi
 
Success Community Wizard Overview
Success Community Wizard OverviewSuccess Community Wizard Overview
Success Community Wizard Overview
 
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
 
Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
 
pandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Pythonpandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Python
 
Python for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasPython for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandas
 
AI for IA's: Machine Learning Demystified at IA Summit 2017 - IAS17
AI for IA's: Machine Learning Demystified at IA Summit 2017 - IAS17AI for IA's: Machine Learning Demystified at IA Summit 2017 - IAS17
AI for IA's: Machine Learning Demystified at IA Summit 2017 - IAS17
 
Understanding deep learning requires rethinking generalization (2017) 2 2(2)
Understanding deep learning requires rethinking generalization (2017)    2 2(2)Understanding deep learning requires rethinking generalization (2017)    2 2(2)
Understanding deep learning requires rethinking generalization (2017) 2 2(2)
 
Startup Pitch Decks
Startup Pitch DecksStartup Pitch Decks
Startup Pitch Decks
 

Similar to An Incomplete Data Tools Landscape for Hackers in 2015

Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinneyIbis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinneyHakka Labs
 
Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)Uri Laserson
 
Improving Data Interoperability for Python and R
Improving Data Interoperability for Python and RImproving Data Interoperability for Python and R
Improving Data Interoperability for Python and RWork-Bench
 
High-Performance Python On Spark
High-Performance Python On SparkHigh-Performance Python On Spark
High-Performance Python On SparkJen Aman
 
Data Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache HadoopData Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache HadoopCloudera, Inc.
 
Pandas & Cloudera: Scaling the Python Data Experience
Pandas & Cloudera: Scaling the Python Data ExperiencePandas & Cloudera: Scaling the Python Data Experience
Pandas & Cloudera: Scaling the Python Data ExperienceTuri, Inc.
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impalahuguk
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaSwiss Big Data User Group
 
2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines
2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines
2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI PipelinesTimothy Spann
 
A brave new world in mutable big data relational storage (Strata NYC 2017)
A brave new world in mutable big data  relational storage (Strata NYC 2017)A brave new world in mutable big data  relational storage (Strata NYC 2017)
A brave new world in mutable big data relational storage (Strata NYC 2017)Todd Lipcon
 
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlySarah Guido
 
28March2024-Codeless-Generative-AI-Pipelines
28March2024-Codeless-Generative-AI-Pipelines28March2024-Codeless-Generative-AI-Pipelines
28March2024-Codeless-Generative-AI-PipelinesTimothy Spann
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopAmanda Casari
 
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesDeep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesJen Aman
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDeep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDatabricks
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDeep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesJen Aman
 
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Data Con LA
 
Keynote at Converge 2019
Keynote at Converge 2019Keynote at Converge 2019
Keynote at Converge 2019Travis Oliphant
 
Luciano Resende - Scaling Big Data Interactive Workloads across Kubernetes Cl...
Luciano Resende - Scaling Big Data Interactive Workloads across Kubernetes Cl...Luciano Resende - Scaling Big Data Interactive Workloads across Kubernetes Cl...
Luciano Resende - Scaling Big Data Interactive Workloads across Kubernetes Cl...Codemotion
 

Similar to An Incomplete Data Tools Landscape for Hackers in 2015 (20)

Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinneyIbis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
 
Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)
 
PyData Boston 2013
PyData Boston 2013PyData Boston 2013
PyData Boston 2013
 
Improving Data Interoperability for Python and R
Improving Data Interoperability for Python and RImproving Data Interoperability for Python and R
Improving Data Interoperability for Python and R
 
High-Performance Python On Spark
High-Performance Python On SparkHigh-Performance Python On Spark
High-Performance Python On Spark
 
Data Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache HadoopData Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache Hadoop
 
Pandas & Cloudera: Scaling the Python Data Experience
Pandas & Cloudera: Scaling the Python Data ExperiencePandas & Cloudera: Scaling the Python Data Experience
Pandas & Cloudera: Scaling the Python Data Experience
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines
2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines
2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines
 




An Incomplete Data Tools Landscape for Hackers in 2015

  • 1. A [Incomplete] Data Tools Landscape [for Hackers] in 2015
      Wes McKinney @wesmckinn
      Data^3 Meeting, Minneapolis, MN
      © Cloudera, Inc. All rights reserved.
  • 2. This talk
      • A partial look at different languages and tools
      • Limiting scope to either:
          • Permissively licensed open source software (OSS), e.g. Apache-licensed
          • Non-dual-licensed copyleft OSS (e.g. GPL)
          • i.e. "do you [the community] have any incentive to create patches?"
      • Some trends (that I see, anyway)
      • Challenges and opportunities
  • 3. Who am I?
      • Python data firestarter
      • Financial analytics in R / Python starting 2007
      • pandas project born of frustration in 2008
      • 2010-2012
          • Hiatus from gainful employment
          • Make pandas ready for primetime
          • Write "Python for Data Analysis"
  • 4. Who am I? (cont'd)
      • 2013-2014: Co-founder/CEO of DataPad (analytics startup, with early pandas collaborator Chang She)
      • Late 2014: DataPad team joins Cloudera
      • Now: backend systems and all-things-Python @ Cloudera
  • 5. SQL: Still a lingua franca
      • "SQL: the Fortran of Analytics"
      • Often a concise, declarative way to express data transforms, analytics, etc.
      • Relatively easy to parse, analyze
      • SQL has recently seen a resurgence, with a focus on interactive-speed SQL engines, especially on top of HDFS/Hadoop
      • Relevant and impactful features (e.g. JSON support) still arriving in established RDBMS like PostgreSQL
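To make the "concise, declarative" point concrete, here is a minimal sketch using Python's standard-library `sqlite3` module. The `trades` table, its columns, and its values are made up for illustration and do not come from the talk.

```python
import sqlite3

# Hypothetical in-memory table of trades; names and values are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (symbol TEXT, qty INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO trades VALUES (?, ?, ?)",
    [("AAPL", 10, 100.0), ("AAPL", 5, 102.0), ("MSFT", 20, 40.0)],
)

# One declarative statement: group, aggregate, and sort.
query = """
    SELECT symbol, SUM(qty * price) AS notional
    FROM trades
    GROUP BY symbol
    ORDER BY symbol
"""
rows = conn.execute(query).fetchall()
conn.close()
print(rows)
```

The query states what result is wanted (notional value per symbol) and leaves the iteration and grouping strategy to the engine.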
  • 6. Historical Python Context
      • Scientific / HPC computing focus in 1990s, 2000s
          • Python web community developed in parallel, matured faster!
      • NumPy became community standard in 2005, born from Numeric + Numarray
      • Pyrex, later Cython, easier C / C++ wrapping
      • f2py: easy Fortran wrapping
      • Anaconda distribution
          • Finally solving Python deployment for all
  • 7. Essential Python stack
      • NumPy: low-level array processing
      • SciPy: essential computational algos
      • pandas: data wrangling
      • scikit-learn: machine learning
      • matplotlib (+ add-ons, like seaborn): visualization
      • numba: numeric hotspot LLVM compiler
      • Domain-specific toolkits: nltk, scikit-image, statsmodels, Theano, PyCUDA/PyOpenCL and many others
  • 8. pandas
      • A Pythonic take on the classic R "data frame" data structure
      • Critical piece to make the Python stack useful in everyday work
      • Added axis metadata / labeling for representing multidimensional data
      • Focus on easy data wrangling, IO, plotting, and basic analytics
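As a small illustration of axis labeling and basic analytics in pandas, the sketch below builds a data frame, labels its row axis with two keys, and aggregates. The column names and values are hypothetical and chosen only for this example.

```python
import pandas as pd

# Hypothetical data; column names and values are made up for illustration.
df = pd.DataFrame(
    {
        "city": ["Minneapolis", "Minneapolis", "St. Paul"],
        "year": [2014, 2015, 2015],
        "sales": [10.0, 12.0, 7.0],
    }
)

# Label the row axis with (city, year) instead of integer positions.
indexed = df.set_index(["city", "year"])

# Label-based selection on the two-level index.
mpls_2015 = indexed.loc[("Minneapolis", 2015), "sales"]

# Basic analytics: total sales per city.
totals = df.groupby("city")["sales"].sum()
print(mpls_2015)
print(totals)
```

The two-level index is one way to represent data with more than two dimensions in a tabular structure, which is the role the slide assigns to axis metadata.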
  • 9. Jeff Reback's "pandas as PyData middleware" diagram
  • 10. Newer / Up-and-coming Python projects
      • Bokeh: interactive / reactive visualization for the web
      • Blaze: uniform data expression API
      • Odo: easy data migration
  • 11. R Project
      • Trusted base of statistics libraries
          • Latest and greatest stats research often hits R first
      • RStudio
      • The "Hadley stack"
          • Visualization: ggplot2 (static) and ggvis (interactive)
          • Data Wrangling: dplyr
          • legacy: plyr / reshape2
  • 12. dplyr
      • Started late 2012 by Hadley Wickham, supported by RStudio
      • Composable / chainable analytics and data wrangling expressions
      • In-memory and SQL backends
      • Has attracted folks back to R from Python in a lot of cases
  • 13. Some other great R stuff
      • shiny: interactive web apps in R
      • Rcpp
      • data.table
      • xts
  • 14. IPython
      • IPython started out as a better interactive Python
      • Grew to include web-based computational notebook, GUI console, and other components
          • (Google even integrated into Google Drive!)
      • IPython Notebook architecture enabled "kernel" processes to be written in nearly any language (even bash!)
      • How to build community beyond Python?
  • 15. Enter Jupyter
      • http://jupyter.org
      • Breaking out notebook machinery into a standalone non-Python-specific project
      • Enable project components to evolve at own pace, without large monolithic releases
      • JupyterHub: upcoming multi-user notebook server
  • 16. A few words about Hadoop + Big Data
  • 17. Apache Spark
      • Originated from Berkeley AMPLab
      • General purpose distributed memory-centric data processing framework
      • Official APIs: Scala, Java, Python
      Source: databricks.com
  • 18. Spark 1.3: DataFrames!
      • R/pandas-inspired API for tabular data manipulation in Scala, Python, etc.
      • Logical operation graphs rewritten internally in more efficient form
      • Good interop with Spark SQL
      • Some interoperability with pandas
      • Will help close the semantic gap between Spark and R/Python
  • 19. Some problems in need of solving
      • A Shiny-like quick-and-dirty data app development framework for Python
      • IPython/Jupyter notebook collaboration
      • A community-standard, Apache-licensed C/C++ data frame library with best-in-class performance
      • Ubiquitous support for emerging analytical on-disk storage standards like Parquet
  • 20. Other interesting stuff to look at
      • Torch7 / LuaJIT: high performance ML / deep learning on GPUs
          • Facebook AI group open sourced several ML modules
      • Apache Flink
          • Up-and-coming Scala-based data processing framework
          • Some overlap with Spark use cases
  • 21. Some other interesting industry trends
      • Microsoft
          • Acquired Revolution Analytics, leading commercial R vendor
          • Launched Azure ML: R, Python, and more on Azure cloud
      • Dato (fka GraphLab)
          • Faster, more scalable machine learning, with Python interface (paid commercial product, free for non-commercial/academic use)
          • Largest-ever VC investment in a data tools company betting big on Python
      • Databricks
          • Offering cloud Spark-notebook-as-a-service
  • 22. Thank you
      @wesmckinn