Data analytics life cycle at the Global Innovation Network and Analytics (GINA)

Jayanth Jadhav

4 min read · Dec 6, 2023

Team Overview:

EMC’s Global Innovation Network and Analytics (GINA) team comprises senior technologists located in centers of excellence (COEs) worldwide.
The team’s primary focus is to drive innovation, research, and university partnerships by engaging employees across global COEs.

Initiative in 2012:

In 2012, a newly hired director within the GINA team aimed to enhance innovation-related activities.
The director sought to establish mechanisms for tracking and analyzing information related to innovation, research, and university partnerships.

Enhancing Knowledge Capture:

The GINA team aimed to improve the capture of both formal and informal information.
Special emphasis was placed on capturing insights from informal conversations with thought leaders within EMC, academia, and other organizations.

Global Knowledge Sharing:

The team envisioned a mechanism to facilitate global knowledge sharing among GINA members, even when geographically separated.

Data Repository Objectives:

The team planned to create a data repository with three main goals:

Store both structured and unstructured data.
Track research activities conducted by global technologists.
Mine the accumulated data for patterns and insights to enhance the team’s operations and strategy.

Strategic Impact:

The anticipated impact of this approach was to foster a global platform for sharing ideas, increasing collaboration, and improving knowledge sharing within the GINA team.

Discovery Phase:

- Began identifying data sources
- Consulted experts, including Tom Davenport and Peter Gloor, and then decided to crowdsource the work by seeking volunteers within EMC

Roles filled:
- Business User/Sponsor/Manager: VP from the Office of the CTO
- Business Intelligence Analyst: IT Representatives
- Data Engineer/DBA: IT Representatives
- Data Scientist: Distinguished Engineer

Data fell into two categories:
- 5 years of EMC’s Innovation Roadmap idea submissions (mix of structured and unstructured data)
- Innovation/research activity minutes and notes (mix of structured and unstructured data)

Key Initial Hypotheses:
- 10 main initial hypotheses (labeled IH1 through IH10) around mapping innovation, evaluating ideas, measuring knowledge transfer, identifying research boundary spanners, etc.

Grouping of Hypotheses:
- Descriptive analytics: analyze current activities to spark creativity, collaboration, asset generation
- Predictive analytics: advise management on where to invest in the future

Data Preparation phase:

Set up a new analytics sandbox to store and experiment with the data
Data scientists and engineers noticed during exploration that some data needed conditioning and normalization
Realized that some datasets critical to testing the analytic hypotheses were missing
Recognized that without sufficient data quality and accessibility, subsequent lifecycle steps wouldn’t be possible
Had to determine what level of data quality was sufficient for the GINA project aims
Discovered issues like misspelled researcher names and extra spaces around names in the datastore
Needed to address these small data problems to enable better analysis and aggregation in later phases
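
The name problems described above (extra spaces, misspellings) can be illustrated with a minimal sketch. It assumes a list of known-good researcher names is available to match against; the names, the `normalize_name` and `canonicalize` helpers, and the 0.85 similarity cutoff are all hypothetical choices for illustration, not details from the GINA project.

```python
import difflib
import re


def normalize_name(raw: str) -> str:
    """Collapse runs of whitespace, trim the ends, and apply title case.

    Title-casing is a simplification: it mangles names such as "McDonald".
    """
    return re.sub(r"\s+", " ", raw).strip().title()


def canonicalize(names, known_names, cutoff=0.85):
    """Map each raw name to its closest known name, else keep the cleaned form."""
    cleaned = []
    for raw in names:
        name = normalize_name(raw)
        # get_close_matches returns the best candidates above the cutoff.
        match = difflib.get_close_matches(name, known_names, n=1, cutoff=cutoff)
        cleaned.append(match[0] if match else name)
    return cleaned


# Hypothetical records, invented for illustration only.
known = ["Alice Murphy", "Rahul Mehta"]
raw = ["  alice  murphy ", "Alice Murhpy", "Rahul Mehta  ", "Chen Wei"]
print(canonicalize(raw, known))
```

Once names are canonical, records for the same person aggregate correctly. A fuzzy cutoff that is too loose can merge two distinct people, so matches near the threshold are worth reviewing by hand.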

Model Planning phase:

Social network analysis techniques seemed feasible for much of the dataset to analyze innovator networks
In some cases, the team lacked the data needed to test a hypothesis properly
For IH9, decided to initiate a longitudinal study to start tracking data over time on people developing new intellectual property

This future data collection would allow testing of:

IH8: Whether frequent knowledge sharing reduces time to generate a corporate asset from an idea
IH9: Whether lineage maps show when knowledge sharing did or didn’t result in an asset

Needed to establish goals and parameters for the longitudinal study:

Identify the right milestones to meet the end goal of an idea becoming a successful corporate asset
Trace how people move ideas between milestones towards the goal
Trace ideas that fail and those that succeed
Compare the journeys of successful and unsuccessful ideas
Compare the times and outcomes using methods that range from simple t-tests to classification algorithms
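
As a rough illustration of the t-test option for IH8, the sketch below compares time-to-asset for two groups of ideas using Welch's t-test, which does not assume equal variances. The day counts and group names are invented, and the choice of Welch's variant is an assumption; the source does not say which test the team had in mind.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each group mean (sample variance / n).
    sq_se_a = statistics.variance(sample_a) / n_a
    sq_se_b = statistics.variance(sample_b) / n_b
    t_stat = (mean_a - mean_b) / math.sqrt(sq_se_a + sq_se_b)
    dof = (sq_se_a + sq_se_b) ** 2 / (
        sq_se_a ** 2 / (n_a - 1) + sq_se_b ** 2 / (n_b - 1)
    )
    return t_stat, dof


# Hypothetical days from idea submission to corporate asset.
frequent_sharing = [120, 135, 110, 128, 117]
infrequent_sharing = [160, 150, 172, 145, 168]

t_stat, dof = welch_t(frequent_sharing, infrequent_sharing)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

A negative t statistic here would mean the frequent-sharing group reached an asset faster on average. Turning the statistic into a p-value requires the t distribution, which a statistics library would supply.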

Model Building phase:

Employed several analytical methods:

Natural Language Processing (NLP) on textual idea descriptions
Social network analysis using R and RStudio
Developed social graphs and visualizations of innovation networks using ggplot2
Figures show social graphs depicting relationships between idea submitters across countries
Identified “hubs” — people with high connectivity and “betweenness” scores
A cluster in one graph had geographic variety, supporting the hypothesis about geographic boundary spanners
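
To make the "betweenness" score concrete, here is a minimal sketch of betweenness centrality on a tiny, made-up network, using Brandes' algorithm for unweighted, undirected graphs. The team worked in R; Python is used here only for illustration, and the node names and the `betweenness` helper are hypothetical.

```python
from collections import deque


def betweenness(graph):
    """Unnormalised betweenness centrality via Brandes' algorithm.

    graph: dict mapping each node to the set of its neighbours (undirected).
    """
    scores = {node: 0.0 for node in graph}
    for source in graph:
        # Breadth-first search that counts shortest paths from the source.
        order = []
        preds = {node: [] for node in graph}
        paths = {node: 0 for node in graph}
        dist = {node: -1 for node in graph}
        paths[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        # Accumulate dependencies, starting from the farthest nodes.
        delta = {node: 0.0 for node in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += paths[v] / paths[w] * (1 + delta[w])
            if w != source:
                scores[w] += delta[w]
    # Each undirected pair was counted once from each endpoint.
    return {node: score / 2 for node, score in scores.items()}


# Hypothetical network: two site clusters joined by one "bridge" person.
network = {
    "site1_a": {"site1_b", "bridge"},
    "site1_b": {"site1_a", "bridge"},
    "bridge": {"site1_a", "site1_b", "site2_a", "site2_b"},
    "site2_a": {"bridge", "site2_b"},
    "site2_b": {"bridge", "site2_a"},
}
scores = betweenness(network)
print(max(scores, key=scores.get))
```

In this toy network every shortest path between the two sites runs through the bridging person, so that node gets the highest score. That is the pattern a boundary spanner would show in a real graph.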

One person stood out with an unusually high score, so the team queried the data to learn about his influence:

Attended top conferences and visited teams globally to share insights
Presented at widely attended virtual sessions with global attendees
Introduced researchers to dozens of corporate innovators

This profile supports the hypothesis that influencers who span geographies and business units can be identified

The team also used Tableau for visualization and exploration, and the Pivotal Greenplum database as the repository and analytics engine

Communicating Results:

Identified most impactful and relevant findings
The project succeeded in identifying boundary spanners and hidden innovators
CTO office launched longitudinal studies to track innovation over time
Promoted knowledge sharing about innovation among researchers across the company and outside it
Enabled cultivating additional intellectual property and new research topics
Forged university relationships for joint academic research
Accomplished on a limited budget with a volunteer force of skilled engineers and data scientists

Key Finding:

Disproportionately high density of innovators in the Cork, Ireland office
15% of innovation contest finalists and winners were from Cork despite the office's relatively small size
The team learned that Cork had received focused innovation training from a consultant, which increased contributions
This innovator cluster would have been hard to identify through traditional methods
Social network analysis revealed a highly contributing pocket of people

Communication:

Shared findings internally through presentations and conferences
Promoted externally through social media and blogs

Operationalizing Results:

Running analytics against innovation activity data yielded great insights into innovation culture
Key findings:
More data is needed in the future, along with a marketing initiative to convince people to share their innovation and research activities
Some of the data is sensitive, so security and privacy must be considered, including who can run the models and see the results
A parallel initiative is needed to improve Business Intelligence: dashboards, reporting, and queries on global research
A mechanism is needed to continually reevaluate the model after deployment
Assessing the model's benefits and defining a retraining process are the main goals of this stage
Showed how analytics can drive new insights into traditionally hard-to-measure areas
Informed investment decisions in university research and identified high-value innovators
Developed recommender systems using topic modeling to help idea submitters refine proposals for new intellectual property
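
One way to picture an idea recommender is a tool that shows a submitter the past ideas most similar to their draft. The sketch below swaps in plain term-frequency cosine similarity, not topic modeling, because it fits in a few lines of standard-library Python. The idea titles, texts, and the `most_similar` helper are invented for illustration and do not reflect how the GINA system was built.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lower-case the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(draft, past_ideas, top_n=2):
    """Return the titles of the past ideas most similar to the draft text."""
    draft_vec = term_counts(draft)
    ranked = sorted(
        past_ideas,
        key=lambda title: cosine(draft_vec, term_counts(past_ideas[title])),
        reverse=True,
    )
    return ranked[:top_n]


# Hypothetical past submissions, invented for illustration only.
past = {
    "Idea A": "deduplication for backup storage arrays",
    "Idea B": "mobile app for conference scheduling",
    "Idea C": "compression and deduplication in cloud storage",
}
print(most_similar("faster deduplication for cloud storage", past))
```

Raw term counts let common words such as "for" influence the ranking. A topic model or TF-IDF weighting would reduce that effect, at the cost of needing a larger corpus.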