Data analytics life cycle at the Global Innovation Network and Analysis (GINA)
Team Overview:
- EMC’s Global Innovation Network and Analytics (GINA) team comprises senior technologists located in centers of excellence (COEs) worldwide.
- The team’s primary focus is to drive innovation, research, and university partnerships by engaging employees across global COEs.
Initiative in 2012:
- In 2012, a newly hired director within the GINA team aimed to enhance innovation-related activities.
- The director sought to establish mechanisms for tracking and analyzing information related to innovation, research, and university partnerships.
Enhancing Knowledge Capture:
- The GINA team aimed to improve the capture of both formal and informal information.
- Special emphasis was placed on capturing insights from informal conversations with thought leaders within EMC, academia, and other organizations.
Global Knowledge Sharing:
- The team envisioned a mechanism to facilitate global knowledge sharing among GINA members, even when geographically separated.
Data Repository Objectives:
Planned creation of a data repository to achieve three main goals:
- Store both structured and unstructured data.
- Track research activities conducted by global technologists.
- Mine the accumulated data for patterns and insights to enhance the team’s operations and strategy.
Strategic Impact:
- The anticipated impact of this approach was to foster a global platform for sharing ideas, increasing collaboration, and improving knowledge sharing within the GINA team.
Discovery Phase:
- Began identifying data sources
- Consulted experts (Tom Davenport, Peter Gloor), then decided to crowdsource the work by seeking volunteers within EMC
Roles filled:
- Business User/Sponsor/Manager: VP from Office of CTO
- Business Intelligence Analyst: IT Representatives
- Data Engineer/DBA: IT Representatives
- Data Scientist: Distinguished Engineer
Data fell into two categories:
- 5 years of EMC’s Innovation Roadmap idea submissions (mix of structured and unstructured data)
- Innovation/research activity minutes and notes (mix of structured and unstructured data)
Key Initial Hypotheses
- 10 main initial hypotheses around mapping innovation, evaluating ideas, measuring knowledge transfer, identifying research boundary spanners, etc.
Grouping of Hypotheses
- Descriptive analytics: analyze current activities to spark creativity, collaboration, asset generation
- Predictive analytics: advise management on where to invest in the future
Data Preparation phase:
- Set up a new analytics sandbox to store and experiment with the data
- Data scientists and engineers noticed during exploration that some data needed conditioning and normalization
- Realized some critical missing datasets were needed to test analytic hypotheses
- Recognized that without sufficient data quality and accessibility, subsequent lifecycle steps wouldn’t be possible
- Had to determine what level of data quality was sufficient for the GINA project aims
- Discovered issues like misspelled researcher names and extra spaces around names in the datastore
- Needed to address these small data problems to enable better analysis and aggregation in later phases
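The name cleanup described above could be sketched as follows. This is a minimal illustration in Python using only the standard library, not the GINA team's actual code; the function names, the sample researcher names, and the 0.85 similarity cutoff are all assumptions made for the example.

```python
import difflib
import re


def normalize_name(raw):
    """Strip leading/trailing spaces and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", raw).strip()


def canonicalize(names, known, cutoff=0.85):
    """Map each raw name to the closest known spelling, if one is close enough.

    `known` is a hypothetical list of trusted spellings; `cutoff` is an
    assumed similarity threshold that would need tuning on real data.
    """
    cleaned = []
    for raw in names:
        name = normalize_name(raw)
        match = difflib.get_close_matches(name, known, n=1, cutoff=cutoff)
        cleaned.append(match[0] if match else name)
    return cleaned


# Hypothetical sample values, not taken from the GINA datastore
known_researchers = ["Jane Smith", "Ravi Kumar"]
raw_names = ["  Jane  Smith ", "Jane Smtih", "Ravi Kumar "]
print(canonicalize(raw_names, known_researchers))
```

Fuzzy matching can merge two distinct people with similar names, so in practice the matches would likely need manual review before aggregation.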
Model Planning phase:
- Social network analysis techniques seemed feasible for much of the dataset to analyze innovator networks
- In some cases, lacked data to appropriately test hypotheses
- For IH9, decided to initiate a longitudinal study to start tracking data over time on people developing new intellectual property
This future data collection would allow testing of:
- IH8: Whether frequent knowledge sharing reduces time to generate a corporate asset from an idea
- IH9: Whether lineage maps show when knowledge sharing did or didn’t result in an asset
Needed to establish goals and parameters for the longitudinal study:
- Identify the right milestones to meet the end goal of an idea becoming a successful corporate asset
- Trace how people move ideas between milestones towards the goal
- Trace ideas that fail and those that succeed
- Compare the journeys of successful and unsuccessful ideas
- Compare the times and outcomes using methods ranging from simple t-tests to classification algorithms
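As one possible shape for the t-test comparison mentioned above, the sketch below computes Welch's t statistic (a t-test variant that does not assume equal variances, chosen here as an assumption) for the time from idea to corporate asset in two groups. The function name, the grouping, and the day counts are hypothetical illustration values, not results from the GINA study.

```python
import math
import statistics


def welch_t(a, b):
    """Return (t statistic, degrees of freedom) for Welch's two-sample t-test."""
    va = statistics.variance(a) / len(a)  # squared standard error of mean(a)
    vb = statistics.variance(b) / len(b)  # squared standard error of mean(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Made-up days from idea submission to asset, for illustration only
frequent_sharers = [120, 95, 130, 110, 105]
infrequent_sharers = [150, 160, 140, 170, 155]

t_stat, dof = welch_t(frequent_sharers, infrequent_sharers)
print(f"t = {t_stat:.3f}, df = {dof:.1f}")
```

Turning the statistic into a p-value requires the t distribution, which the Python standard library does not provide; a statistics package such as SciPy (`scipy.stats.ttest_ind` with `equal_var=False`) would be one way to get it.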
Model Building phase:
Employed several analytical methods:
- Natural Language Processing (NLP) on textual idea descriptions
- Social network analysis using R and RStudio
- Developed social graphs and visualizations of innovation networks using ggplot2
- Figures show social graphs depicting relationships between idea submitters across countries
- Identified “hubs”: people with high connectivity and high “betweenness” scores
- A cluster in one graph had geographic variety, supporting the hypothesis about geographic boundary spanners
One person stood out with an unusually high score, so the team queried the data to learn about his influence:
- Attended top conferences and visited teams globally to share insights
- Presented at widely attended virtual sessions with global attendees
- Introduced researchers to dozens of corporate innovators
- This suggests the hypothesis about identifying influencers spanning geographies/units is correct
- Used Tableau for visualization and exploration
- Used Pivotal Greenplum database for repository and analytics engine
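To make the “betweenness” score concrete: it counts how often a person sits on the shortest path between two other people, which is why boundary spanners score high. The team did this work in R; the sketch below is a plain Python stand-in using Brandes' algorithm on an unweighted, undirected graph. The node names and edges are hypothetical, with two tightly connected groups joined by one person.

```python
from collections import deque


def build_adjacency(edges):
    """Turn a list of undirected (u, v) pairs into a node -> neighbours map."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def betweenness(adj):
    """Unnormalized betweenness centrality (Brandes' algorithm, unweighted)."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma)
        dist = {s: 0}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating path dependencies
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected pair was visited from both endpoints, so halve the totals
    return {v: total / 2 for v, total in score.items()}


# Hypothetical network: two site clusters linked by a single "bridge" person
edges = [
    ("a1", "a2"), ("a1", "a3"), ("a2", "a3"),
    ("b1", "b2"), ("b1", "b3"), ("b2", "b3"),
    ("bridge", "a1"), ("bridge", "b1"),
]
scores = betweenness(build_adjacency(edges))
top = max(scores, key=scores.get)
print(top, scores[top])
```

In this toy graph every path between the two clusters runs through the bridge node, so it receives the highest score even though it has only two direct connections.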
Communicating Results:
- Identified most impactful and relevant findings
- Project successful in identifying boundary spanners and hidden innovators
- CTO office launched longitudinal studies to track innovation over time
- Promoted knowledge sharing about innovation and researchers across multiple areas within the company and outside it
- Enabled cultivating additional intellectual property and new research topics
- Forged university relationships for joint academic research
- Accomplished with limited budget using volunteer force of skilled engineers and data scientists
Key Finding:
- Disproportionately high density of innovators in Cork, Ireland office
- 15% of innovation contest finalists and winners were from Cork despite its small size
- Learned Cork received focused innovation training from consultant, increasing contributions
- Would have been hard to identify this innovator cluster through traditional methods
- Social network analysis revealed a highly contributing pocket of people
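The share-by-site figure above is a simple descriptive aggregate. A minimal sketch of that calculation is shown below; the record list is toy data constructed to reproduce the 15% Cork share from the notes, and the other site names and counts are placeholders, not EMC data.

```python
from collections import Counter


def share_by_site(sites):
    """Return each site's percentage share of the given records."""
    counts = Counter(sites)
    total = len(sites)
    return {site: 100 * n / total for site, n in counts.items()}


# Toy data: one entry per finalist or winner, tagged with a site (hypothetical)
finalist_sites = ["Cork"] * 3 + ["Site A"] * 9 + ["Site B"] * 8

for site, pct in sorted(share_by_site(finalist_sites).items()):
    print(f"{site}: {pct:.1f}%")
```

Comparing each site's share of finalists against its share of total headcount would show whether a site is over-represented, which is the comparison behind the Cork observation.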
Communication:
- Shared findings internally through presentations and conferences
- Promoted externally through social media and blogs
Operationalizing Results:
- Running analytics against innovation activity data yielded great insights into innovation culture
- Key findings:
- More data is needed in the future, including a marketing initiative to convince people to share their innovation/research activities
- Some of the data is sensitive, so security and privacy must be considered, such as who can run the models and see the results
- A parallel initiative is needed to improve basic Business Intelligence, such as dashboards, reporting, and queries on global research activities
- A mechanism is needed to continually reevaluate the model after deployment
- The main goals at this stage are assessing the benefits and defining a process to retrain the model
- Showed how analytics can drive new insights into traditionally hard to measure areas
- Informed investment decisions in university research and identified high-value innovators
- Developed recommender systems using topic modeling to help idea submitters refine proposals for new intellectual property