Inherited from cassandra, with a Python API

Progress so far

  • Experience with cloud and big data technologies
    1. Docker containers
    2. Virtual cloud (composed of docker swarm over hybrid cloud)
    3. Text analysis: doc2vec, LSI
    4. Spark
    5. Shell (e.g., options for sorting very large files, keeping data compressed, ...)
    6. Hashtables for lookup
    7. Processing large graphs (see the sketch after this list)
      • Divide and conquer (subdivide into connected subgraphs)
      • Collapse bipartite graphs
    8. Identifying noise in graphs
    9. Such noise causes problems for identity matching
    10. Digital signatures and false positives, based on:
      • Content of a file
      • File path
      • Commit hash
    11. True negatives: content that is copied, but only as
      • a license
      • an IDE template
      • a formatting template (.css)
      • a programming template (internationalization: .po)
  • Progress towards application
    1. Developer communities for expertise finding
    2. Statistics on individual developers created
    3. Project-to-developer mapping created
    4. Individual identity matching
      1. Based on commit messages (w2v works, evaluation via LSI
        pending)
      2. Based on Reddit comments (LSI does not work, evaluation
        via w2v pending)
      3. Based on shared files: all.idx.XX.CF.gz contains file
        equivalence classes, replacing hash by developer name is
        pending
    5. Risks in the supply chain
      1. left-pad: tiny but widely deployed. Why so connected?
      2. Heartbleed, CodeRed, Blaster, ...
      3. Truck factor: what if a developer leaves?
      4. Orphaned projects
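As a sketch of the bipartite-graph collapse mentioned in item 7 above: group authors by the files they touched, then link authors pairwise. This is only an illustration; the edge data and names below are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (file, author) edges of a bipartite authorship graph.
edges = [
    ("f1", "alice"), ("f1", "bob"),
    ("f2", "bob"), ("f2", "carol"),
    ("f3", "alice"),
]

# Group authors by the file they touched.
authors_by_file = defaultdict(set)
for f, a in edges:
    authors_by_file[f].add(a)

# Collapse: link two authors whenever they touched the same file;
# the weight counts how many files they share.
weight = defaultdict(int)
for authors in authors_by_file.values():
    for a, b in combinations(sorted(authors), 2):
        weight[(a, b)] += 1

for (a, b), w in sorted(weight.items()):
    print(a, b, w)
```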

Proposed Plans and Work Breakdown

Potentially useful tools

The community detection code

Aim: Activity Profile[^1]-based Identification and Classification/Rating

Construct Activity Profile

  • Using data from different sources
  • Potentially useful in identifying individuals or their properties
  • Score each individual in some pre-defined socio-technical categories.
    • e.g., role, skill,
      Big Five personality traits in psychology
    • e.g., project recruiting context in OSS
      • domain knowledge
      • Language preference
      • Coding style, coding efficiency
      • Communication style/expertise, work culture, available time, etc.
    • Do better than the number of GitHub followers or StackOverflow reputation
  • Extend profiles to Projects or even particular pieces of code.

Top level Tasks:

  • Determining what characteristics to consider for profiling
  • Determining socio-technical (and other) categories
  • Determining scoring criteria

Granular Level Tasks:

  • Integrating the concept of profiling with the current Identity matching project
  • Using text mining on GitHub commit messages, Reddit messages, and tweets to extract relevant social information

Aim: Intelligent inventory management and risk mitigation based on truck-factor analysis (risk assessment), primarily in an open source ecosystem

Idea:

  • Identifying bottlenecks in a supply chain based on risk assessment for the whole chain; keeping an inventory of possible choices selected to fit the design and the expertise of the developers
  • A working prototype for a selected number of choices in the high-risk areas of the project, to test what works and to be prepared for any sudden breakdown of that link in the chain -> proposal

Tasks:

  • Risk assessment across projects for a whole supply chain
  • Extending the concept of the OSSFinder project to accommodate risk
  • Coming up with more sophisticated criteria for inventory design (profile + network)
  • Adding provisions for periodic reassessment to update and manage the inventory

Initial two-week projects

Objective 1: Investigate the possibility of using written text to profile (and match) individuals

Experiment 1:

Use commit messages from approximately 40B commits:

Data Set 1:

da3:/data/delta/delta.idx.*.gz

ID: 256501;
length: 39;
commit hash: e2fe85c236736c866481de288f636ab06ef49787;
name: Dmitry Kasatkin;
email: dmitry.kasatkin@intel.com;
timestamp: 1327598002;
file: yank555-lu_slimlp_5.1.x_kernel_motorola_shamu;
source: gitBBnew.2.deltaall.gz

da3:/data/delta/delta.id2content

id: 256501
message: lib/mpi: checks for zero divisor length

Method 1: Doc2Vec:

da3

https://github.com/piskvorky/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb

  • Selected 41 authors with 1K+ messages each
  • Tagged by message content id
  • Result so far: the most similar messages are typically not from the same author
  • Expanding to the full set of authors
  • Combining messages of each author as paragraph tags (see the sketch below)
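A minimal sketch of the Doc2Vec setup described above, following the linked gensim notebook. The toy messages are made up, tagging by author name (rather than by message content id) is one of the two variants mentioned, and the gensim 4.x API is assumed.

```python
import gensim
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy stand-in for (author, commit message) pairs from delta.id2content.
corpus = [
    ("Dmitry Kasatkin", "lib/mpi: checks for zero divisor length"),
    ("Dmitry Kasatkin", "lib/mpi: fix boundary check"),
    ("Jane Doe", "add unit tests for parser"),
]

# Tag each message with its author so inferred vectors can be compared
# per author (the alternative is tagging by message content id).
docs = [TaggedDocument(words=gensim.utils.simple_preprocess(msg), tags=[author])
        for author, msg in corpus]

model = Doc2Vec(vector_size=50, min_count=1, epochs=40)
model.build_vocab(docs)
model.train(docs, total_examples=model.corpus_count, epochs=model.epochs)

# Infer a vector for an unseen message and find the closest author tags.
vec = model.infer_vector(gensim.utils.simple_preprocess("fix divisor length check"))
print(model.dv.most_similar([vec], topn=2))
```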
Method 2:

LSI: https://radimrehurek.com/gensim/tut3.html
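A corresponding LSI sketch along the lines of the linked gensim tutorial, again with a made-up toy corpus (one document per author):

```python
from gensim import corpora, models, similarities

# Toy corpus: one document per author (concatenated messages).
texts = [
    "fix mpi divisor length check".split(),
    "add parser unit tests coverage".split(),
]

dictionary = corpora.Dictionary(texts)
bow = [dictionary.doc2bow(t) for t in texts]

# Project the bag-of-words corpus into a low-dimensional latent space.
lsi = models.LsiModel(bow, id2word=dictionary, num_topics=2)
index = similarities.MatrixSimilarity(lsi[bow])

# Query: similarity of a new message against each author's document.
query = lsi[dictionary.doc2bow("fix divisor check".split())]
print(list(enumerate(index[query])))
```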

Experiment 2:

da0

Do the above on 844,291,111 Reddit comments:

  • Tried 4M Reddit comments, but it crashes
  • Find a sample of users (with at least one post of 200+ characters)
  • Find all comments associated with each user and compare the resulting models

Data Set 2:

MongoDB: da1.eecs.utk.edu

Database.Collection: foreseer-reddit.comments
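A minimal sketch of querying that collection with pymongo. The field names ("author", "body") follow the usual Reddit dump schema and are assumptions here, as is MongoDB 3.6+ for the $expr operator.

```python
from pymongo import MongoClient

# Connect to the MongoDB instance holding the Reddit comments.
client = MongoClient("da1.eecs.utk.edu")
comments = client["foreseer-reddit"]["comments"]

# Sample comments of 200+ characters (assumed fields: "author", "body").
query = {"$expr": {"$gte": [{"$strLenCP": "$body"}, 200]}}
for doc in comments.find(query).limit(10):
    print(doc.get("author"), doc.get("body", "")[:80])
```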

Objective 2: Investigate the possibility of using files modified to identify individuals

Use Data Set 1

da1

Investigate developer profile similarity function defined in, for
example:

Mockus, A. (2009, May). Succession: Measuring transfer of code and
developer productivity. In Proceedings of the 31st International
Conference on Software Engineering (pp. 67-77). IEEE Computer Society.

  • Individual to project name on a subset of two idx files
    /home/yli118/identifying/java - reads delta.idx.*.gz -> top ten IDs for any given ID
    Plan: distance based on individual to file (see the sketch below)
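As a stand-in for the similarity function to be investigated, one plausible baseline is cosine similarity between developers' file-change count vectors. This is only an illustrative sketch with hypothetical data, not the measure from the Mockus (2009) paper cited above.

```python
import math
from collections import Counter

def profile_similarity(changes_a, changes_b):
    """Cosine similarity between two developers' file-change count vectors."""
    dot = sum(changes_a[f] * changes_b[f] for f in changes_a.keys() & changes_b.keys())
    norm_a = math.sqrt(sum(c * c for c in changes_a.values()))
    norm_b = math.sqrt(sum(c * c for c in changes_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical file-change counts per developer.
dev_a = Counter({"lib/mpi/mpi-bit.c": 5, "lib/mpi/mpi-pow.c": 2})
dev_b = Counter({"lib/mpi/mpi-bit.c": 3, "drivers/net/e1000.c": 7})
print(profile_similarity(dev_a, dev_b))
```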

Objective 3: Summarize individuals in a scorecard

Use Data Set 1

da2

  • Progress:
    1. What statistics are collected: duration, #changes, all commit time stamps
    2. What portion of the data is processed: the entire delta.idx.*
    3. How/where the results are stored:
      /home/lwan1/{intervals,changes}.out
      scorecard.py
    4. Still working on sentiment
  • Duration of activity (time from first to last change)
  • Number of changes, files, other people changing the same files,
    and number of projects
  • Skill: changes to files in different languages
  • Productivity growth
  • Uniformity of activity over time
  • Tone in text messages
  • LDA of text messages
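A minimal sketch of computing the first few scorecard statistics (duration, number of changes, distinct files) per author. The semicolon-separated record layout mirrors the Data Set 1 example above, but the actual on-disk format is an assumption.

```python
import gzip
from collections import defaultdict

# Per-author accumulators for scorecard statistics.
timestamps = defaultdict(list)
files = defaultdict(set)

# Assumed: semicolon-separated records, as in the Data Set 1 example
# (id; length; hash; name; email; timestamp; file; source).
with gzip.open("delta.idx.0.gz", "rt", errors="replace") as fh:
    for line in fh:
        parts = line.rstrip("\n").split(";")
        if len(parts) < 8:
            continue
        try:
            name, ts, fname = parts[3], int(parts[5]), parts[6]
        except ValueError:
            continue
        timestamps[name].append(ts)
        files[name].add(fname)

for name, ts in timestamps.items():
    duration = max(ts) - min(ts)  # time from first to last change
    print(name, duration, len(ts), len(files[name]))
```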

Objective 4: Giant graph: get connected components

Experiment 1: get connected components using version history and
content id

Use Data Set 1

Also

Data Set 3: da3:/data/bkp/All.new.idx.*.gz

content id: 283056503

Size: 18147

File/version:
NewNewNew34.0/github.com_Velek_k-9.git/AndroidManifest.xml/4b9f21897ddb057fa94ee7b57d85985f2f2dad5f

Match to Data Set 1 using File/version

da3:

  • Takes content id and project and removes content ids associated
    with a single project
    Progress: of the 16 files, 11 are done
    Output: /data/bkp/*.filtered
    Once done: create connected components (see the sketch below)
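Once the filtered files are done, the connected components can be built with a disjoint-set (union-find) structure over content-id/file-version edges; a minimal sketch with hypothetical edges:

```python
from collections import defaultdict

# Disjoint-set (union-find) with path halving.
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

# Hypothetical edges: content id <-> file/version.
edges = [
    ("cid:283056503", "k-9.git/AndroidManifest.xml/4b9f218"),
    ("cid:283056503", "other.git/AndroidManifest.xml/4b9f218"),
    ("cid:42", "some.git/README.md/abc123"),
]
for a, b in edges:
    union(a, b)

# Group nodes by their component representative.
components = defaultdict(set)
for node in list(parent):
    components[find(node)].add(node)
print(list(components.values()))
```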

Experiment 2: get connected components using version history, content
id, and authorship

da[0-2]: cluster

hash+author: da3:/data/play/Graphic_authorship/hash.author.gz

content id + hash + author
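A sketch of the join implied here: combine the (hash, author) pairs with (content id, hash) pairs to obtain (content id, author) edges. The semicolon-separated format of hash.author.gz is an assumption.

```python
import gzip

# Assumed format: "commit_hash;author" per line in hash.author.gz.
author_by_hash = {}
with gzip.open("hash.author.gz", "rt", errors="replace") as fh:
    for line in fh:
        if ";" not in line:
            continue
        commit_hash, author = line.rstrip("\n").split(";", 1)
        author_by_hash[commit_hash] = author

# Join (content id, hash) records against the hash -> author map.
def content_author_edges(pairs):
    for content_id, commit_hash in pairs:
        if commit_hash in author_by_hash:
            yield content_id, author_by_hash[commit_hash]

pairs = [("283056503", "e2fe85c236736c866481de288f636ab06ef49787")]
print(list(content_author_edges(pairs)))
```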

[^1]: the use of personal characteristics or behaviour patterns to make
generalizations
