Data analytics life cycle at the Global Innovation Network and Analysis (GINA)

Jayanth Jadhav
4 min read · Dec 6, 2023

--

Team Overview:

  • EMC’s Global Innovation Network and Analytics (GINA) team consists of senior technologists located in centers of excellence (COEs) worldwide.
  • The team’s primary focus is to drive innovation, research, and university partnerships by engaging employees across global COEs.

Initiative in 2012:

  • In 2012, a newly hired director within the GINA team aimed to enhance innovation-related activities.
  • The director sought to establish mechanisms for tracking and analyzing information related to innovation, research, and university partnerships.

Enhancing Knowledge Capture:

  • The GINA team aimed to improve the capture of both formal and informal information.
  • Special emphasis was placed on capturing insights from informal conversations with thought leaders within EMC, academia, and other organizations.

Global Knowledge Sharing:

  • The team envisioned a mechanism to facilitate global knowledge sharing among GINA members, even when geographically separated.

Data Repository Objectives:

Planned creation of a data repository to achieve three main goals:

  • Store both structured and unstructured data.
  • Track research activities conducted by global technologists.
  • Mine the accumulated data for patterns and insights to enhance the team’s operations and strategy.

Strategic Impact:

  • The anticipated impact of this approach was to foster a global platform for sharing ideas, increasing collaboration, and improving knowledge sharing within the GINA team.

Discovery Phase:

- Began identifying data sources
- Consulted experts (Tom Davenport, Peter Gloor), whose input helped the team decide to crowdsource the work by recruiting volunteers within EMC

Roles filled:
- Business User/Sponsor/Manager: VP from Office of CTO
- Business Intelligence Analyst: IT Representatives
- Data Engineer/DBA: IT Representatives
- Data Scientist: Distinguished Engineer

Data fell into two categories:
- 5 years of EMC’s Innovation Roadmap idea submissions (mix of structured and unstructured data)
- Innovation/research activity minutes and notes (mix of structured and unstructured data)

Key Initial Hypotheses
- 10 main initial hypotheses (referred to below by labels such as IH8 and IH9) around mapping innovation, evaluating ideas, measuring knowledge transfer, identifying research boundary spanners, etc.

Grouping of Hypotheses
- Descriptive analytics: analyze current activities to spark creativity, collaboration, asset generation
- Predictive analytics: advise management on where to invest in the future

Data Preparation phase:

  • Set up a new analytics sandbox to store and experiment with the data
  • Data scientists and engineers noticed during exploration that some data needed conditioning and normalization
  • Realized some critical missing datasets were needed to test analytic hypotheses
  • Recognized that without sufficient data quality and accessibility, subsequent lifecycle steps wouldn’t be possible
  • Had to determine what level of data quality was sufficient for the GINA project aims
  • Discovered issues like misspelled researcher names and extra spaces around names in the datastore
  • Needed to address these small data problems to enable better analysis and aggregation in later phases
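The name problems described above (extra spaces, misspellings) can be sketched in a few lines. This is a minimal illustration, not the GINA team's actual cleaning code: the function names, the sample records, and the 0.85 similarity cutoff are all assumptions made for the example.

```python
import difflib
import re
import unicodedata


def normalize_name(raw: str) -> str:
    """Trim, collapse internal whitespace, and title-case a researcher name."""
    text = unicodedata.normalize("NFKC", raw)
    text = re.sub(r"\s+", " ", text).strip()
    return text.title()


def canonicalize(names, cutoff=0.85):
    """Map each raw name to a canonical spelling.

    The first spelling seen becomes canonical; later names whose
    similarity to it is at least `cutoff` are folded into it.
    """
    canonical = []
    mapping = {}
    for raw in names:
        clean = normalize_name(raw)
        match = difflib.get_close_matches(clean, canonical, n=1, cutoff=cutoff)
        if match:
            mapping[raw] = match[0]
        else:
            canonical.append(clean)
            mapping[raw] = clean
    return mapping


# Hypothetical records, invented for illustration only.
raw_names = ["  Ana Silva ", "Ana  Silva", "Anna Silva", "Ben Okafor", "ben okafor "]
name_map = canonicalize(raw_names)
for raw, clean in name_map.items():
    print(repr(raw), "->", clean)
```

Fuzzy matching can merge two different people with similar names, and `str.title()` mangles names such as "McDonald", so in practice the proposed merges would need a manual review before aggregation.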

Model Planning phase:

  • Social network analysis techniques seemed feasible for much of the dataset to analyze innovator networks
  • In some cases, lacked data to appropriately test hypotheses
  • For IH9, decided to initiate a longitudinal study to start tracking data over time on people developing new intellectual property

This future data collection would allow testing of:

  • IH8: Whether frequent knowledge sharing reduces time to generate a corporate asset from an idea
  • IH9: Whether lineage maps show when knowledge sharing did or didn’t result in an asset

Needed to establish goals and parameters for the longitudinal study:

  • Identify the right milestones to meet the end goal of an idea becoming a successful corporate asset
  • Trace how people move ideas between milestones towards the goal
  • Trace ideas that fail and those that succeed
  • Compare the journeys of successful and unsuccessful ideas
  • Compare the times and outcomes using statistical tests like t-tests or classification algorithms
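As a rough sketch of the t-test comparison mentioned above, the following computes Welch's t statistic for two groups of idea-to-asset times. The numbers are invented for illustration (the longitudinal data did not exist yet at this stage), and the grouping into "frequent" and "infrequent" knowledge sharing is an assumed way of framing IH8.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and approximate degrees of freedom."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    n_a, n_b = len(sample_a), len(sample_b)
    se_a, se_b = var_a / n_a, var_b / n_b  # squared standard errors
    t_stat = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    dof = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t_stat, dof


# Hypothetical days from idea submission to corporate asset.
frequent_sharing = [120, 135, 110, 140, 125]
infrequent_sharing = [180, 160, 200, 175, 190]

t_stat, dof = welch_t(frequent_sharing, infrequent_sharing)
print(f"t = {t_stat:.2f}, degrees of freedom = {dof:.1f}")
```

A strongly negative t would be consistent with frequent sharing shortening the time to an asset. Turning it into a p-value requires a t-distribution table or a statistics library.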

Model Building phase:

Employed several analytical methods:

  • Natural Language Processing (NLP) on textual idea descriptions
  • Social network analysis using R and RStudio
  • Developed social graphs and visualizations of innovation networks using ggplot2
  • The resulting social graphs depict relationships between idea submitters across countries
  • Identified “hubs”: people with high connectivity and high “betweenness” scores
  • A cluster in one graph had geographic variety, supporting the hypothesis about geographic boundary spanners
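To make the “betweenness” idea concrete, here is a small sketch in Python. The team used R for this step, so this is an illustration rather than their code, and the graph, node labels, and function names are hypothetical. It implements Brandes' algorithm for an undirected, unweighted graph.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency map from a list of (u, v) pairs."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def betweenness(graph):
    """Unnormalized betweenness centrality via Brandes' algorithm."""
    score = {node: 0.0 for node in graph}
    for source in graph:
        stack = []
        preds = {node: [] for node in graph}
        sigma = {node: 0 for node in graph}  # number of shortest paths
        dist = {node: -1 for node in graph}
        sigma[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:  # breadth-first search from source
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {node: 0.0 for node in graph}
        while stack:  # accumulate dependencies, farthest nodes first
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each undirected pair is visited from both endpoints, so halve.
    return {node: value / 2 for node, value in score.items()}


# Hypothetical network: two tight groups joined by one "bridge" person.
edges = [
    ("A1", "A2"), ("A1", "A3"), ("A2", "A3"),
    ("B1", "B2"), ("B1", "B3"), ("B2", "B3"),
    ("bridge", "A1"), ("bridge", "B1"),
]
scores = betweenness(build_graph(edges))
top = max(scores, key=scores.get)
print(top, scores[top])
```

In this toy graph the bridge node has only two connections, fewer than A1 or B1, yet it has the highest betweenness because every shortest path between the two groups passes through it. That is the property that makes the measure useful for spotting boundary spanners.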

One person stood out with an unusually high score, so the team queried the data to learn about his influence:

  • Attended top conferences and visited teams globally to share insights
  • Presented at widely attended virtual sessions with global attendees
  • Introduced researchers to dozens of corporate innovators
  • This suggests the hypothesis about identifying influencers spanning geographies/units is correct

Tools used:

  • Used Tableau for visualization and exploration
  • Used Pivotal Greenplum database as the repository and analytics engine

Communicating Results:

  • Identified most impactful and relevant findings
  • Project successful in identifying boundary spanners and hidden innovators
  • CTO office launched longitudinal studies to track innovation over time
  • Promoted knowledge sharing about innovation and researchers, both across the company and externally
  • Enabled cultivating additional intellectual property and new research topics
  • Forged university relationships for joint academic research
  • Accomplished on a limited budget using a volunteer force of skilled engineers and data scientists

Key Finding:

  • Disproportionately high density of innovators in Cork, Ireland office
  • 15% of innovation contest finalists and winners were from Cork despite its small size
  • Learned that Cork had received focused innovation training from a consultant, which increased its contributions
  • Would have been hard to identify this innovator cluster through traditional methods
  • Social network analysis revealed a highly contributing pocket of people

Communication:

  • Shared findings internally through presentations and conferences
  • Promoted externally through social media and blogs

Operationalizing Results:

  • Running analytics against innovation activity data yielded great insights into innovation culture
  • Key findings:
  • More data is needed in the future, which calls for a marketing initiative to convince people to share their innovation/research activities
  • Some data is sensitive, so security and privacy must be considered when deciding who can run models and see results
  • A parallel initiative is needed to improve Business Intelligence dashboards, reporting, and queries on global research
  • Mechanism needed to continually reevaluate model after deployment
  • Assessing benefits and defining retraining process are main goals
  • Showed how analytics can drive new insights into traditionally hard to measure areas
  • Informed investment decisions in university research and identified high-value innovators
  • Developed recommender systems using topic modeling to help idea submitters refine proposals for new intellectual property
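The recommender idea in the last bullet can be illustrated with a much simpler stand-in. The sketch below does not use topic modeling. It swaps in plain bag-of-words cosine similarity to find the existing idea closest to a new draft. The idea texts, identifiers, and function names are invented for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(draft, ideas):
    """Return the id of the stored idea most similar to the draft."""
    draft_vec = Counter(tokenize(draft))
    scored = [
        (cosine(draft_vec, Counter(tokenize(text))), idea_id)
        for idea_id, text in ideas.items()
    ]
    return max(scored)[1]


# Hypothetical idea submissions, invented for illustration only.
ideas = {
    "idea-1": "cache flash storage tiering for faster reads",
    "idea-2": "employee mentoring program for new researchers",
}
closest = most_similar("tiering data between flash and disk storage", ideas)
print(closest)
```

A topic model would group ideas by latent themes rather than exact word overlap, so it would also match drafts that use different vocabulary for the same subject. The word-overlap version only shows the general shape of "suggest related prior ideas to a submitter".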
